Introduction
Synthetic datasets, artificially generated to mimic real-world data while maintaining anonymization (Jordon et al., 2022; Nikolenko, 2021), have emerged as a promising technology in the financial sector. While artificial data has long been used in financial markets for a variety of applications, in recent years advances in generative artificial intelligence (AI) have allowed for the creation of artificial datasets that are indistinguishable from data generated by real-world processes. Because modern synthetic data promises to match the broad distributional properties of real-world data without reproducing them, it has attracted interest from a diverse range of financial market stakeholders who see it as a potential solution to a broad range of technical and regulatory problems, including data privacy and sharing issues, problems related to data bias, and model overfitting (Assefa et al., 2020; Balch et al., 2024; Potluru et al., 2024). Support for increased use of synthetic data has even come from regulators and policymakers, such as the Financial Conduct Authority (FCA) and the European Commission (FCA, 2024; Di Girolamo, Hledik, and Pagano, 2024). This is somewhat surprising: in recent years, regulators have approached innovations in financial modeling with caution, given the central role that novel modeling practices played in the 2008 financial crisis (MacKenzie, 2011) and earlier episodes, such as the 1987 stock market crash (MacKenzie, 2004). Indeed, even as regulators have embraced synthetic data, they are showing a deep uneasiness about the potential implications of AI for financial markets and institutions (Bank of England, 2025).
In response to this enthusiasm for synthetic data – bordering on hype in some cases (Ravn, 2025) – an emerging academic literature in Critical Data Studies (CDS) and related fields has cast an increasingly skeptical eye on the use of synthetic data generation technologies (e.g., Steinhoff, 2024; Jacobsen, 2023; Offenhuber, 2024; Whitney and Norman, 2024). Yet there has been relatively little work exploring the potential effects of synthetic data on financial markets, despite significant interest in these techniques among market participants and regulators.
This article serves as a call for more attention to the infrastructural dimension of synthetic data (Bowker and Star, 1999; Bernards and Campbell-Verduyn, 2019; Westermeier, Campbell-Verduyn, and Brandl, 2025). Rather than focusing on the properties of synthetic data itself, this article advocates for closer attention to the way that synthetic data generation techniques are being embedded into the broader ‘stack’ of technologies, standards, and infrastructures that constitute machine learning (ML) systems (Straube, 2016; Hansen, 2024). As work in infrastructure studies has emphasized, embedding new technologies into existing financial market infrastructures can generate unexpected effects by reconfiguring relationships between different layers of the stack and among infrastructure owners and users (Jensen and Morita, 2017; Paraná, 2025). This infrastructural perspective, we argue, is critical for anticipating synthetic data’s potential impact on markets: whether synthetic data generation technologies prove beneficial or harmful depends on the infrastructures and practices into which they are embedded. To this end, this article identifies three broad sets of concerns around synthetic data which, depending on how synthetic data generators are integrated into market infrastructures, will create either beneficial or harmful effects. These are tensions between (i) synthetic data’s capacity to facilitate data sharing versus the tendency of synthetic data generators to create new forms of model opacity, (ii) synthetic data’s capacity to diversify data-driven decision-making by generating data corresponding to alternative futures and pasts versus its potential to induce new forms of model-induced isomorphism among financial market participants, and (iii) its potential political-economic effects on the market concentration of incumbent data platforms and owners.
Framing synthetic data
Synthetic data is typically defined in terms of its use, rather than the techniques used to generate it, which are diverse. Jordon et al. (2022), writing in a widely cited report sponsored by the Royal Society, define it as ‘data that has been generated using a purpose-built mathematical model or algorithm, with the aim of solving a (set of) data science task(s)’ (Jordon et al., 2022: 5). The concept of using artificially generated data for statistical purposes is much older than modern data science and ML. Some of the earliest references to ‘synthetic data’ come from papers published in the applied economics literature in the 1960s and 1970s, in which researchers associated with national statistical agencies constructed ‘synthetic microdata’ sets by merging datasets collected from different samples and matching demographic information in order to study questions requiring data from multiple datasets (cf. Ruggles and Ruggles, 1968; Sisson, 1979). In the financial markets, artificial data produced via Monte Carlo simulation has long been used for a variety of problems, including derivatives pricing and the estimation of value-at-risk (VaR) for risk management purposes (Jäckel, 2002).
However, recent years have seen the emergence of synthetic data as a distinct research and practice area in the fields of data science and ML. There are several reasons for this. First, recent years have seen the emergence of new generative AI techniques – such as generative adversarial networks (GANs) and large language models (LLMs) – capable of generating synthetic data that is virtually indistinguishable from ‘real-world’ data (Goodfellow et al., 2020; Nadas et al., 2025). These ML techniques produce data that exhibit a level of realism that cannot be easily matched using more traditional statistical techniques, such as Monte Carlo simulation applied to stochastic differential equations, the workhorses of much of quantitative finance. GANs, for instance, are the ML technology that underpins the production of many ‘deep fakes’: images of humans that are nearly indistinguishable from photographs of real people. Partly as a consequence of these modeling developments, modern synthetic data-generating techniques are extremely diverse (Lu et al., 2023), ranging from 3D modeling techniques in the field of computer vision, to traditional simulation-based methods, such as agent-based models (ABMs) (Axtell and Farmer, 2022), to advanced generative AI techniques like variational auto-encoders (VAEs) (Kingma and Welling, 2019), as well as GANs and LLMs. While simulation-based methods generate data from scratch by mimicking real-world processes, ML-driven techniques instead use ‘real’ data to train a neural network to produce synthetic copies of that data that match its broad distributional properties, but which do not match the original data on a case-by-case basis.
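To make the ML-driven approach concrete, the sketch below shows, in heavily simplified form, how a GAN couples a generator and a discriminator so that the generator learns to produce returns matching the training data’s distribution without copying individual observations. It is a minimal illustration under assumed architectures and hyperparameters, not a production system; the placeholder `real_returns` stands in for historical data.

```python
# Minimal GAN sketch for synthetic daily returns. Architecture, learning
# rates, and the placeholder data are illustrative assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 16

generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 64), nn.ReLU(),
    nn.Linear(64, 1),                      # outputs one synthetic return
)
discriminator = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),        # probability the input is 'real'
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCELoss()

real_returns = torch.randn(10_000, 1) * 0.01   # placeholder for real market data

for step in range(5_000):
    real = real_returns[torch.randint(0, len(real_returns), (128,))]
    fake = generator(torch.randn(128, LATENT_DIM))

    # Discriminator step: learn to distinguish real from synthetic samples.
    loss_d = bce(discriminator(real), torch.ones(128, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(128, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: learn to fool the discriminator into labeling fakes 'real'.
    loss_g = bce(discriminator(fake), torch.ones(128, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Once trained, the generator can be sampled indefinitely for synthetic
# returns decoupled from any individual historical observation.
synthetic = generator(torch.randn(1_000, LATENT_DIM)).detach()
```

The same adversarial logic, scaled up from single returns to whole price paths, underlies the ‘market generators’ discussed below.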
Second, apart from these modeling developments, interest in synthetic data has grown because it provides a technological ‘fix’ for several barriers that constrain the adoption of ML. This is particularly true in domains where there are significant concerns around data privacy, or in situations where there are inadequate quantities of ‘real-world’ data to train ML models (Hansen and Spears, 2025). In the case of privacy-sensitive applications of ML, synthetic data is sought because it can, in principle, closely match the distributional properties of real-world data – for example, the share of borrowers of a particular background who default on a loan – without revealing any personally identifiable information about the specific individuals contained in the original ‘real-world’ dataset. In activities such as credit risk modeling, synthetic data is extremely attractive to ML practitioners because the General Data Protection Regulation (GDPR) and other regulations make it increasingly cumbersome for financial services firms to share credit default data across, and in some cases within, firms. Likewise, where privacy is not an issue but data is scarce, synthetic data generators are attractive because they allow ML practitioners to generate large quantities of data that are statistically indistinguishable from real-world data to train their models. For example, one of the most prominent uses of synthetic data for this purpose is associated with an ML-driven system for automated options hedging developed jointly by quants at J.P. Morgan and ETH Zurich known as Deep Hedging (Buehler et al., 2019). Because real-world options data is too sparse to train a deep learning system, the Deep Hedging system is trained on data produced by what quants call a market generator – a separate ML system that can produce large quantities of realistic synthetic market data from a smaller amount of real-world price data (cf. Buehler et al., 2023; Kondratyev and Schwartz, 2019).
In the context of finance, certain global systemically important banks (G-SIBs), including J.P. Morgan and Goldman Sachs, have embraced synthetic data and actively promote its use in different contexts (Assefa et al., 2020; Bansel and Stefani, 2024; Coletta et al., 2021). A report by the International Monetary Fund notes the acceleration in the use of synthetic data in finance and highlights cost-effectiveness, privacy protection, and debiasing as features that make synthetic data attractive to the financial industry (Shabsigh and Boukherouaa, 2023). While financial regulators have expressed serious concerns about the risks of ML and AI for financial markets, they have nonetheless shown considerable enthusiasm for the use of synthetic data. This is because synthetic data generation technology could help to meet regulatory objectives, including the promotion of competition in the financial markets and the enforcement of data protection laws. For example, the UK FCA has produced several reports on possible use cases of synthetic data in the financial industry; in these reports, it emphasizes that the shareability of synthetic data could contribute to ‘democratizing data access’ by lowering barriers to entry for challenger firms seeking to bring new financial products to market (FCA, 2023). The European Union has adopted a similar attitude with the development of the Data Hub on its Digital Finance Platform. The Data Hub is meant to facilitate data exchange between financial firms and supervisory authorities through the provision of synthetic supervisory data for the purpose of testing and training AI and ML models (Di Girolamo et al., 2024). However, some regulators and institutions have also pointed to potential risks associated with the use of synthetic data in finance. For example, the US Commodity Futures Trading Commission (CFTC) has raised concerns about data quality in financial risk management if ‘data gaps’ are filled with synthetic data that could lead to inaccurate information; the CFTC’s main concern is the risk of so-called ‘AI hallucinations’ (Romero et al., 2024). Similarly, the UK FCA (2024) has highlighted the importance of careful judgment in the creation of synthetic data, emphasizing that the selection of appropriate technologies and models depends on specific use cases and evaluation metrics. For instance, the acceptable levels of privacy, utility, and fidelity of synthetic data are likely to be influenced by its intended application, which involves trade-offs and tensions.
Although the use of synthetic data in the financial industry has yet to receive attention in economic sociology and in Social Studies of Finance (SSF), a budding literature on synthetic data is emerging in CDS. This emerging body of CDS research covers a range of themes including privacy (Munkholm and Weihn, 2025), governance and regulation (Beduschi, 2024; Gal and Lynskey, 2023), surveillance (Ravn, 2024a; Ridgway and Malevé, 2024), capital and labor (Steinhoff, 2024), synthetic media such as deepfakes or other deceitful media (de Vries, 2020; Ferrari and McKelvey, 2023; Fitzgerald, 2024; Martin and Newell, 2024), ethico-politics (Helm, Lipp, and Pujadas, 2024; Jacobsen, 2024; Ravn, 2024b), the representational logics of data types (Offenhuber, 2024; Susser and Seeman, 2024), data pollution (Wiehn, 2024), and the construction of synthetic data promises in science and industry (Ravn, 2025). Particularly relevant to this article are discussions of synthetic data as a technology that offsets risk associated with the use of ML for process and decision automation. Jacobsen (2023) conceptualizes synthetic data as a ‘technology of risk’ that ‘de-risks’ ML models by attributing risk solely to the real-world data on which such models are trained: ‘By shifting the domain of risk to the “real” dataset’, Jacobsen (2023: 8) argues, ‘synthetic data promise to be the means by which algorithms can be rendered free from their own manufactured uncertainties’. Understood as a de-risking technology, synthetic data generation becomes a technology that enables the development and use of ML and AI models.
We build on this recent work by shifting analytical attention from the properties of synthetic data itself to the ways that synthetic data generation techniques are being embedded within existing stacks in finance, particularly those that underpin the operation of automated ML systems. In employing the term ‘stack’, we build on a growing body of work in sociology, political economy, and science and technology studies (STS) that employs stack-based theorizing to understand developments at the intersection of computing and finance (Straube, 2016; Caliskan, 2020; Hansen, 2025; MacKenzie, Caliskan, and Rommerskirchen, 2023). This work has in common an attentiveness to the way that modern computing and financial products are typically created by linking or ‘stacking’ together multiple technologies and infrastructures. The term ‘stack’ has its origins in computer networking; it originally referred to the ‘stacked’ nature of the protocols that underpin modern Internet communications, particularly in the Open Systems Interconnection (OSI) model. Stack diagrams, like that of the OSI model, capture the logical dependency of protocols ‘higher’ in the stack on those ‘lower’ in the stack. At the bottom of the stack is the physical transmission of bits of information across the material infrastructure of the network: cables, switches, and servers. On top of this are layered basic Internet protocols – for example, TCP/IP, the DNS system, and application-level protocols, like HTTP for information displayed on the web. At the top of the protocol ‘stack’ are user applications that depend on these lower ‘layers’ to function. The stack diagram is thus not a description of layering in Euclidean space, but one of logical dependency: changes ‘lower’ in the stack propagate upward, which can lead to failure or unexpected effects on layers higher in the stack. Stack-based theorizing in the social sciences has its origin in the writing of Bratton (2016), who uses the term to describe an emergent political order consisting of interlinked computing infrastructure, protocols, and applications, which intersects with and challenges more traditional forms of state-centered power.
In the case of ML systems, it has become increasingly common for practitioners to refer to a similar layered stack that organizes the hardware, data, and software making up modern automated ML systems. We provide a simple four-layer depiction of an ML stack in Figure 1. At the bottom is what we call the device layer, which includes the physical computing hardware underpinning modern ML systems, such as graphics and tensor processing units (GPUs and TPUs), networking infrastructure, and the software used for training neural networks (e.g., TensorFlow). Above this is the cloud layer, consisting of cloud platforms and services provided by Amazon, Microsoft, and Google, through which ML practitioners can access computing infrastructure at the device layer via standardized interfaces. Stacked above this is the data layer, which captures the data platforms and systems that produce and store the data used to train financial ML systems; this layer encompasses both proprietary and public datasets. Finally, on top is the application layer, which corresponds to the ML systems that are developed and trained by practitioners for specific purposes (e.g., trading, risk management, fraud detection). With the emergence of generative AI techniques, particularly LLMs, some have argued that a new, fifth layer is emerging that sits on top of the cloud and data layers: that of foundation model providers like OpenAI, Anthropic, and Mistral (Gambacorta and Shreeti, 2025). Both proprietary foundation models, such as GPT-4, and open-source synthetic data generators making direct use of the data layer are likely to be increasingly used to generate synthetic data in financial markets (Nadas et al., 2025).

Figure 1. Illustration of the machine learning stack.
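The dependency ordering that Figure 1 depicts can be summarized schematically. In the toy sketch below, the layer names are the article’s own, while the example systems listed for each layer are illustrative assumptions.

```python
# Toy representation of the four-layer ML stack in Figure 1. The layer
# names follow the article; the example systems are illustrative only.
ML_STACK = [
    ("application", ["trading systems", "risk management", "fraud detection"]),
    ("data",        ["proprietary datasets", "public datasets", "synthetic data generators"]),
    ("cloud",       ["AWS", "Microsoft Azure", "Google Cloud"]),
    ("device",      ["GPUs/TPUs", "networking", "training software (e.g., TensorFlow)"]),
]

# Dependency runs downward: each layer presumes those beneath it, so a
# change at the data layer (e.g., synthetic data) propagates up to applications.
for layer, examples in ML_STACK:
    print(f"{layer:<12} -> {', '.join(examples)}")
```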
In this article, we argue that synthetic data generation techniques are likely to significantly alter the data layer of the stack, and that these alterations are likely to propagate uncertain effects ‘up the stack’ into the application layer as the use of synthetic data generation technologies shapes financial market participant behavior and reconfigures power relations between different layers of the financial ML stack. In making this argument, we follow a line of thinking in infrastructure studies which has broadly emphasized the way that changes to the design or operation of market infrastructures can induce significant changes in market behavior (Paraná, 2025; Arnoldi, 2016; Spears, 2019; Pinzur, 2016). In what follows, we outline three tensions related to the use of synthetic data generation as a technological enabler of ML/AI in finance. What the discussion of these tensions suggests is that governance issues loom as the use of synthetic data in finance picks up.
Synthetic data generation technology: Three tensions
First tension: Data circulability versus opacity
First, synthetic data generation techniques have a remarkable capacity to decouple data from the conditions and local context of their production: data collected for one purpose can be regenerated to serve a different predictive purpose, enabling applications beyond the original intent. This, indeed, is key to the promise of enhanced data sharing among firms that is proving attractive to financial market participants and regulators alike. Yet, as a long tradition of work in STS has emphasized, the production of data is deeply shaped by social, historical, and discipline-specific practices (Bowker and Star, 1999; Bowker et al., 2019). Far from being an abstract point about the ineluctably socially constructed nature of data, this context can be critically important for consumers of synthetic data – namely, ML system developers and engineers – using it to train ML algorithms in contexts distant from that in which the original data was created. By decoupling data from the context of its production, consumers of synthetic data may overlook its limitations and biases, including its inherent temporal structures (Preda, 2008). Financial data related to past transactions differ significantly from data concerning ongoing transactions. The temporal structures embedded within synthetic data generation influence how users respond to this data, as the timing of data availability and release may critically shape their contextual reactions in decision-making processes.
This tension is particularly acute in cases where synthetic data is used to address a paucity of real-world data that otherwise prevents the adoption of ML systems. In the options markets, for instance, historical data on options transactions is relatively limited compared to the vast number of quoted options prices offered by dealers and exchanges daily; historically, this has limited the development of automated systems for options market making (Buehler et al., 2019). Likewise, in the equities markets, historical market data – with the exception of high-frequency price data – are not available at the scope required to properly train ML models for systematic investment strategies operating over longer timescales (Arnott et al., 2019). In these and other ‘data-limited regimes’ – domains in which data are scarce and/or expensive to obtain (Hoffmann et al., 2019) – synthetic data generators are being used to generate large quantities of historical data from relatively small real-world historical datasets (cf. Heaton and Witte, 2019; Buehler et al., 2020; Limmer and Horvath, 2023). However, these synthetic data generators run the risk of amplifying noise in the original, rather limited, training datasets.
In addition to the original context and conditions underpinning the production of the ‘real-world’ data used to produce it, synthetic data are also shaped by design choices made by the developer of the synthetic data generator itself. Synthetic data generation techniques are numerous; they span ML techniques, non-ML techniques such as ABMs, and even more traditional statistical techniques such as Gaussian copulas. Even in the case of ‘data-driven’ ML models, there are several key design decisions that must be made by the model builder and cannot be ‘learned’ from the data itself. Among other things, these include how many layers the neural network will have, what type of activation functions to use between layers, and which regularization technique to use (Mullainathan and Spiess, 2017). All these decisions involve a mix of experimentation and subjective judgment on the part of the model developer. These choices, which can shape the synthetic data produced by the generator, ultimately must be made through a combination of domain-specific, theoretical, and experimental knowledge, along with knowledge of how the synthetic data will be used in practice. Synthetic data are thus not only a byproduct of the conditions underpinning the production of the ‘real-world’ data from which they are derived; they are also fundamentally ‘model-laden’ in a way that is more difficult to detect than in data produced via a classical statistical model using Monte Carlo simulation (Bokulich, 2020; Offenhuber, 2024). Although no data are ‘raw’ and all are somewhat ‘cooked’ (Bowker, 2005; Gitelman, 2013) or even ‘model-filtered’ (Edwards, 1999), synthetic datasets are ultra-processed in the sense that they are derivatives transposed from already ‘cooked’ data.
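The sketch below illustrates how such design decisions are encoded directly into a generator before any data are seen. The two configurations are hypothetical, and the comments about their likely output characteristics are heuristic rather than guaranteed.

```python
# Two generator configurations for the same task. Every choice below
# (depth, activation, regularization) is a developer judgment that the
# data itself cannot supply. Hypothetical illustration only.
import torch.nn as nn

def make_generator(latent_dim: int, depth: int, activation: nn.Module,
                   dropout: float) -> nn.Sequential:
    layers, width, in_features = [], 64, latent_dim
    for _ in range(depth):
        layers += [nn.Linear(in_features, width), activation, nn.Dropout(dropout)]
        in_features = width
    layers.append(nn.Linear(in_features, 1))
    return nn.Sequential(*layers)

# Shallow, ReLU, heavy regularization: one plausible tendency is smoother,
# thinner-tailed synthetic output.
gen_a = make_generator(latent_dim=8, depth=2, activation=nn.ReLU(), dropout=0.5)

# Deep, Tanh, no regularization: more capacity to reproduce extreme values,
# and more capacity to memorize idiosyncrasies of the training data.
gen_b = make_generator(latent_dim=32, depth=6, activation=nn.Tanh(), dropout=0.0)
```

Trained on identical ‘real’ data, these two generators would produce different synthetic datasets; which is preferable depends on knowledge of how the data will be used downstream.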
For this reason, when synthetic data is used to train other automated systems, it can potentially induce hidden forms of model overfitting (i.e., the situation in which a model learns patterns that are specific to the training data but do not reflect the patterns it is likely to encounter when put into production (Arnott et al., 2019)). Drawing on classic work on the opacity of ML systems (e.g., Burrell, 2016), we refer to this as the ‘double opacity’ of synthetic data, which arises from the way that synthetic data generators effectively ‘stack’ multiple forms of black-boxed ML on top of each other.
Second tension: Model-induced herding versus scattering
A second set of tensions relates to how the embedding of synthetic data generator technology into the data layer of the ML stack may shape the behavior of ML systems and that of the financial market participants using these systems to make decisions. On the one hand, we know from classic work in SSF that when financial models are embedded into market participants’ cognitive or decision-making processes, they can exert isomorphic pressures on their behavior, which can have a destabilizing effect on markets (e.g., MacKenzie and Millo, 2003; Beunza and Stark, 2012; Svetlova, 2012). Strategy isomorphism is not a new phenomenon in the hedge fund sector, nor is it specific to quant funds. Studies have shown that inter-personal and inter-institutional social ties and communication practices between competing funds can be conducive to strategy conformism and increase herding risk in the sector (Kellard et al., 2016; Millo, Spence, and Valentine, 2023). However, herding can also be intentional: a reflexive, calculated decision to imitate others, which is not the same as succumbing to irrationality (Beunza and Garud, 2007; cf. Lange, 2016). Along similar lines, Beunza and Stark (2012) have shown how derivatives traders use models to get social cues about what their competitors think. By taking competitors’ social cues into consideration, traders perform ‘reflexive modeling’ whereby they adjust and fine-tune their trading strategies to create dissonance vis-à-vis the competition. While reflexive modeling can improve trading through the creation of dissonance, errors can, on the other hand, accumulate if many funds use the same models, which creates resonance (Beunza and Stark, 2012). The possibility of doing reflexive modeling is, however, limited in contemporary automated trading, where the human trader has become more of an appendage to an automated system than an executor of a trading strategy. In areas like high-frequency trading, resonance in the form of cognitive interdependence is largely replaced by infrastructural and algorithmic interdependence (Borch, 2016).
Given the potentially widespread application of synthetic data generators in the near future, such algorithmic interdependence is a realistic possibility that needs to be considered. In general, regulators have expressed concerns about the potential systemic risks associated with deep learning, although not specifically with deep learning-based synthetic data generation techniques, such as GANs or LLMs. Not long before his term at the helm of the US Securities and Exchange Commission (SEC) came to an end in 2024, Gary Gensler shared concerns about the systemic risk threat he believes AI poses to financial market stability, an issue he had previously raised with specific emphasis on the role of deep learning (Gensler and Bailey, 2020). In a short YouTube video in the series ‘Office Hours with Gary Gensler’, Gensler argued that a rather small number of leading generative AI companies and cloud providers are dominating the market. These companies’ dominant position has, he stated, implications for the economy writ large, but also for the financial industry more specifically. The core issue is that when financial firms build downstream AI applications, as they do at large scale, they rely on only a few base models or ‘data aggregators’. Data aggregators are foundation models or AI model development infrastructures on and through which AI applications are built. Gensler’s concern is that the widespread proliferation of AI applications built on a few base models could create network interconnectedness, monocultures of model design, limited model explainability, and uniformity of data (Gensler and Bailey, 2020: 4-5). This cocktail of problems would, Gensler believes, increase systemic risks, including herding risks, among financial market participants. Because model risk management tends to work at the micro-prudential (individual firm) level, these types of interconnection and interoperability risks tend to evade those guardrails (Gensler and Bailey, 2020: 4-5).
While Gensler articulates a very real concern, synthetic data can potentially allow for the generation of more diverse training sets than would otherwise be available to market participants, so its widespread use may instead diversify market participant behavior, a phenomenon we call ‘model-induced scattering’, in contrast to herding. This is particularly the case where synthetic data is used to augment historical data to prevent what is known as ‘backtest overfitting’, a well-known problem whereby ML algorithms over-index on past events, thereby ignoring possible events that could have happened but did not and which might happen in the future.
Backtest overfitting is arguably the biggest headache for developers of ML solutions for trading and investment management (Arnott et al., 2019). A backtest consists of two phases: an in-sample training phase and an out-of-sample test and validation phase. The training and test datasets come from the same dataset, but the former tends to be larger than the latter. Moreover, the training set is labeled so that it is known what the model is being taught to predict, whereas the test set contains labels only for evaluation purposes. A model is overfitting if it performs impeccably on training data but poorly on test data (Hansen, 2020; Mullainathan and Spiess, 2017). If a model performs poorly both in- and out-of-sample, it is too simple to learn the underlying patterns in the data: the model is underfit. Models that overfit in backtests are likely to perform poorly if put into production. While data quantity is the primary culprit when it comes to backtest overfitting risk, data quality is another potential source of overfitting. Sometimes the root cause of the overfitting problem is exactly a combination of limited historical data and poor data quality. As asset management quant researcher and quant finance scholar Marcos Lopez de Prado (2018) points out, there is a dual issue with historical market data: rarely is there enough of it to properly train ML models and, second, history might not be the best proxy for what the future will look like.
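The in-sample/out-of-sample logic can be seen in a toy example. The sketch below fits an unconstrained model to pure noise (simulated stand-in data), producing a near-perfect training score and a poor test score: the signature of overfitting.

```python
# Toy illustration of overfitting: a model that memorizes noise scores
# well in-sample and poorly out-of-sample. All data are simulated.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10))   # placeholder features
y = rng.normal(size=1_000)         # pure noise: there is nothing to learn

X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

model = DecisionTreeRegressor(max_depth=None)   # unconstrained: free to memorize
model.fit(X_train, y_train)

print("in-sample R^2:     ", model.score(X_train, y_train))  # ~1.0 (memorized)
print("out-of-sample R^2: ", model.score(X_test, y_test))    # near or below 0
```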
Synthetic data generation promises to alleviate this quantity-and-quality problem that causes financial ML models to overfit in the backtest. Framing the issue, Lopez de Prado (2018: 170) notes that training on historical data can cause the trading strategy to become ‘so attached to the past that it becomes unfit for the future’. This data quality issue can be resolved, Lopez de Prado (2018: 170) continues, if the parameters for the trading rules are derived ‘directly from the stochastic process that generates the data, rather than engaging in historical simulations’. The problem of data quantity, on the other hand, simply concerns the size, or lack thereof, of the training sets used for model development. What synthetic data generation promises in terms of addressing the data quantity issue associated with backtest overfitting is to produce multiple virtual training sets. As Lopez de Prado (2020: 9) frames it, ‘while it is easy to overfit a model to one test set, it is hard to overfit a model to thousands of test sets for each security’. Collapsing the problem of data quantity and that of data quality as it pertains to backtest overfitting risk, he summarizes:
We can use historical series to estimate the underlying data-generating process, and sample synthetic data sets that match the statistical properties observed in history. Monte Carlo methods are particularly powerful at producing synthetic data sets that match the statistical properties of a historical series. […] The main advantage of this approach is that those conclusions are not connected to a particular (observed) realization of the data-generating process but to an entire distribution of random realizations. (Lopez de Prado, 2020: 9)
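A minimal sketch of this idea follows, under simplifying assumptions of our own (an AR(1) process and a toy momentum rule, neither drawn from Lopez de Prado’s text): estimate the data-generating process from history, sample many synthetic realizations, and evaluate the strategy across the whole distribution rather than the single observed path.

```python
# Sketch: fit a data-generating process to history, then backtest a rule
# on many synthetic realizations. AR(1) and the momentum rule are
# illustrative assumptions, not a specific published procedure.
import numpy as np

rng = np.random.default_rng(1)
history = rng.normal(0, 0.01, size=2_000)   # placeholder historical returns

# Fit AR(1): r_t = phi * r_{t-1} + eps_t, eps_t ~ N(0, sigma^2)
phi = np.dot(history[:-1], history[1:]) / np.dot(history[:-1], history[:-1])
sigma = np.std(history[1:] - phi * history[:-1])

def simulate(n_steps: int) -> np.ndarray:
    r = np.zeros(n_steps)
    for t in range(1, n_steps):
        r[t] = phi * r[t - 1] + rng.normal(0, sigma)
    return r

def strategy_pnl(returns: np.ndarray) -> float:
    # Toy momentum rule: hold tomorrow whatever sign today's return had.
    positions = np.sign(returns[:-1])
    return float(np.sum(positions * returns[1:]))

# Evaluate on 1,000 synthetic realizations instead of the one observed path.
pnls = [strategy_pnl(simulate(2_000)) for _ in range(1_000)]
print(f"mean PnL {np.mean(pnls):.4f}, 5th percentile {np.percentile(pnls, 5):.4f}")
```

A strategy that looks profitable only on the single observed history, but not across the simulated distribution, is a candidate for backtest overfitting.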
The judgment here lies in balancing these two tendencies: understanding when to prioritize herding and its potential for stronger collective performance, and when to prefer diverse, scattered approaches that can avoid overfitting and better account for variability and uncertainty.
Third tension: Flattening versus deepening of data platform power
A final set of uncertainties relates to the potential political-economic effects of synthetic data generation technologies on financial markets. On the one hand, with the growth of platform-based business models in finance, competitive advantage in markets is increasingly secured through control and ownership of the data layer of the financial ML stack, which can be used to train ML models (Birch, Cochrane, and Ward, 2021). It is for this reason that the Economist famously quipped in 2017 that ‘the most valuable resource is no longer oil, but data’ (Economist, 2017). In the financial markets, this trend is manifest in the growing power of data-driven fintech platforms, the entry of so-called ‘TechFins’ into credit issuance, and the growing centrality of data infrastructure provision as a core component of incumbent financial institutions’ business (cf. Cornelli et al., 2023; Hansen and Borch, 2022; Petry, 2021). Because synthetic data generation technologies can decouple the information content of data from the data platforms that own them, they may have a levelling effect on financial institutions insofar as ownership of data assets may become a weaker source of a platform’s competitive advantage. Indeed, it is in part for this reason that regulators such as the FCA have supported the use of synthetic data technology, due to its promise to make datasets available that could promote competition in markets by allowing new fintechs to challenge the market power of incumbent financial institutions, such as banks (FCA, 2021; 2022). However, it is also possible that synthetic data generation technology may allow for the further monetization of the data layer by platform owners, thereby increasing – rather than challenging – their power.
LLMs represent one synthetic data generation technology in which this tension between the flattening and deepening of platform power is likely to play out. Unlike traditional ML techniques, or even generative AI techniques like GANs, training new LLMs is an extremely capital-intensive process, one that only a relatively small number of firms possess the resources to carry out. In March 2023, Bloomberg L.P. announced BloombergGPT, a specialized LLM trained on Bloomberg’s significant proprietary data assets (Wu et al., 2023). To the extent that LLMs become a critical technology for synthetic data production, tools like BloombergGPT will allow firms such as Bloomberg to reinforce their platform dominance. At the same time, foundation model providers like Anthropic are developing specialized LLM pipelines for financial services applications, which allow financial institutions to use a customized LLM platform on their own proprietary data without training such a model from scratch. By taking advantage of general-purpose LLMs’ capacity to generate realistic synthetic data from a small number of example cases (what is known as ‘few-shot’ learning), these product offerings by major foundation model providers may instead erode financial data owners’ own platform advantages (Meng et al., 2023; Ren et al., 2025).
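As a rough illustration of the few-shot pattern, the sketch below assembles a prompt from a handful of seed records and delegates generation to an LLM. The record schema is invented for illustration, and `call_llm` is a hypothetical stand-in for whichever chat-completion client a firm actually uses; no specific vendor API is implied.

```python
# Hypothetical few-shot synthetic data generation with an LLM. `call_llm`
# is a stand-in for a real chat-completion client; records are invented.
import json

SEED_RECORDS = [
    {"age": 34, "income": 52_000, "loan_amount": 9_000, "defaulted": False},
    {"age": 51, "income": 31_000, "loan_amount": 14_000, "defaulted": True},
    {"age": 27, "income": 44_000, "loan_amount": 6_500, "defaulted": False},
]

def build_prompt(seed: list[dict], n_new: int) -> str:
    examples = "\n".join(json.dumps(r) for r in seed)
    return (
        "Below are example loan records, one JSON object per line.\n"
        f"{examples}\n"
        f"Generate {n_new} additional records that are statistically similar "
        "but do not copy any example. Output JSON objects only, one per line."
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a vendor chat-completion client")

# Hypothetical usage:
# synthetic = [json.loads(line)
#              for line in call_llm(build_prompt(SEED_RECORDS, 100)).splitlines()]
```

Note how little proprietary infrastructure this pattern requires on the user’s side: the heavy lifting sits with the foundation model provider, which is precisely why it bears on the platform-power tension described above.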
Examining the interplay of the three tensions: The case of financial audit
Our discussion so far has focused on the use of synthetic data in a financial markets context. However, the growth of ML in financial services has largely taken place in areas peripheral to the trading floor, such as compliance, risk management, and customer management (Spears and Hansen, 2025; Bank of England and FCA, 2022; Bank of England, 2025). Therefore, to examine how these tensions may play out in the future, it is helpful to consider the application of synthetic data in financial audit, another non-trading domain where there has been significant investment in ML. In this domain, the three tensions that we discussed above are likely to be particularly acute. In contrast to ML systems in financial markets, where data sources are typically evaluated pragmatically by their capacity to improve a model’s predictive power, accounting data is evaluated according to the norms of generally accepted accounting practices, which emphasize the importance of accounting information’s representational faithfulness (IFRS Foundation, 2018). Auditing, as an established practice, thus relies on a clear and verifiable trail of real data sources, enabling auditors to sample, trace, verify, and draw conclusions about an entity’s financial position. This creates inherent tensions between the uses of synthetic data and the norms of accounting. One recent episode that exemplifies those tensions is the case of Frank, a student loan fintech that J.P. Morgan Chase acquired in 2021. Not long after the acquisition, it emerged that Charlie Javice, Frank’s CEO prior to its acquisition, had inflated its reported customer metrics during the due-diligence process with J.P. Morgan by using synthetic data generation techniques to produce fraudulent customer data (Staiger, 2025).
Yet, in the domain of audit, there is growing interest in the use of ML to improve the efficiency of financial audits and reduce auditor workload, which is substantial (Brown-Liburd et al., 2015). According to an April 2024 survey by KPMG, sixty-three percent of corporate board members surveyed by the firm across 1,800 companies believed that auditors should prioritize the use of AI for identifying risks and anomalies (KPMG, 2025: 14). In this domain, the use of synthetic data is likely to be prioritized given the constraints that auditors face in using and aggregating company-specific data. Consider the problem of building an ML classifier to detect anomalous financial accounting data, which might be indicative of fraud or reporting error (Aftabi et al., 2023; Wang et al., 2024). Here one would encounter several problems that synthetic data generators promise to address. The first is the issue of privacy and data protection. Granular financial data often involves personally identifiable information; because data protection regulations such as the EU GDPR constrain the circulation of such information, a model developer would likely encounter difficulty assembling enough ‘real’ granular financial accounting data to build a robust model. However, even if a large enough dataset of ‘real’ financial data could be assembled, an auditor would encounter a second problem: anomalous data are, by definition, rare. If one were to build an ML classifier on a dataset consisting of a comparatively large number of typical transaction cases and a small number of anomalous cases, the classifier would likely exhibit a high false negative rate (i.e., incorrectly classifying anomalous cases as ‘normal’). ML practitioners refer to this as the ‘class imbalance’ problem (Fernandez et al., 2018; Singh, Ranjan, and Tiwari, 2022).
To address these two issues, a model developer could use a synthetic data generator, trained on a relatively small number of ‘real’ transactions, to build a balanced synthetic dataset of anomalous and typical data, as sketched below. This synthetic dataset could then be circulated to another firm, where it is used to train an ML classifier to detect anomalous data, which could then be embedded into an auditor’s workflow. In principle, the use of synthetic data here could significantly improve an auditor’s capacity to detect potential misstatements of fact. However, the potential impact of synthetic data usage in this case will depend on how the three tensions we identify above play out in practice. First, consider the ‘double opacity’ of the synthetic data used to train the classifier model. Because these synthetic data are already labeled as ‘anomalous’ or ‘typical’, they already embody subjective judgments regarding the materiality of the misstatement. Yet a crucial professional responsibility of auditors is to determine whether a given misstatement is ‘material’ in the sense that it ‘could reasonably be expected to influence the economic decisions of users taken on the basis of the financial statements’ (FRC, 2016; Carpenter et al., 1994). In this context, the double opacity of synthetic data may have the unintended effect of undermining auditors’ autonomy in exercising professional skepticism and oversight regarding traceability. Second, we might consider the potential effects of this technology on the behavior of financial statement users, namely investors. Consistent with Gensler’s concerns about the impact of data aggregators on financial markets, one can imagine how widespread adoption of a particular anomaly detection model by auditors could shape investor behavior in systematic ways. For instance, if the original data used to train the synthetic data generator upon which the fraud detection model depends does not generalize well to the financial data encountered by auditors using the model, it could fail to detect certain types of material omissions in financial statements. On the other hand, this problem could be ameliorated if synthetic data generators are employed by model developers who are ‘close’ to model end-users (e.g., auditors) and who have a high level of awareness of the intended use of the synthetic data.
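The following is a minimal sketch of this pipeline on simulated data. A per-class Gaussian model plays the role of the synthetic data generator; this is a deliberately crude stand-in for the GAN- or VAE-based generators a practitioner might actually deploy.

```python
# Sketch of synthetic rebalancing for anomaly detection. All data are
# simulated; the per-class Gaussian generator is a crude stand-in for a
# real synthetic data generator (e.g., a GAN or VAE).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# 'Real' data: 5,000 typical transactions, only 50 anomalous ones.
typical = rng.normal(0.0, 1.0, size=(5_000, 4))
anomalous = rng.normal(3.0, 1.5, size=(50, 4))

def fit_gaussian_generator(samples: np.ndarray):
    mean, cov = samples.mean(axis=0), np.cov(samples, rowvar=False)
    return lambda n: rng.multivariate_normal(mean, cov, size=n)

# Build a balanced synthetic dataset: 5,000 cases of each class.
gen_typical, gen_anomalous = map(fit_gaussian_generator, (typical, anomalous))
X_syn = np.vstack([gen_typical(5_000), gen_anomalous(5_000)])
y_syn = np.array([0] * 5_000 + [1] * 5_000)

# The balanced synthetic set, not the raw data, trains the classifier,
# and could in principle be shared across firms without the originals.
clf = LogisticRegression(max_iter=1_000).fit(X_syn, y_syn)
```

Note that the class labels, and hence the judgments about what counts as ‘anomalous’, are fixed before the synthetic data ever leaves the originating firm: this is the double opacity discussed above in miniature.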
Finally, one could consider the potential political-economic effects of the widespread use of synthetic data on the market for auditing services. At present, this market is highly concentrated, with the Big Four accounting firms (EY, Deloitte, KPMG, and PwC) earning 98% of audit fees from the FTSE 350 in 2022 (Financial Reporting Council, 2023). The Big Four are also seeking to develop AI agents (PwC, 2025) that will be trained on extensive proprietary data to enhance user experience. The development and adoption of synthetic data could play a role in the market for audit services similar to the one financial regulators expect it to play in the financial markets: the development of publicly available synthetic datasets for detecting financial misstatements could help to lower barriers to entry for challenger firms. On the other hand, the Big Four may also be able to use synthetic data generation technology to find new ways to monetize their existing data assets.
Conclusion
This article has examined the use of synthetic data generators in the financial sector and discussed the regulatory and governance issues that arise from the increasing use of such data for the purpose of developing and training ML and AI models.
Our analysis makes two key contributions to the emerging literature on synthetic data in finance. First, we argue for moving beyond assessments of synthetic data that focus on its intrinsic properties. Instead, we focus on how synthetic data generation technologies will likely be integrated into existing financial practices and infrastructures, and on their resulting effects on financial markets and institutions. To this end, we employ the metaphor of the financial ML ‘stack’, the layered technology infrastructure underlying ML systems, to explore how synthetic data may shape financial market participant behavior and the power of existing financial data platforms. Second, our article demonstrates that the adoption of synthetic data generation technologies by financial market participants presents a ‘double-edged sword’ for financial regulators: whether these technologies are likely, on net, to amplify or ameliorate model-related risks depends on the specific ways that they are layered into the existing practices and infrastructures underpinning ML systems. In particular, we identify three important tensions between the promises and potential risks of generating and using synthetic data in the sector. First, we noted a tension between synthetic data’s capacity to increase the circulability of proprietary datasets and the attendant risk of increased model opacity that such sharing would induce. Second, we discussed the tension between synthetic data’s capacity to diversify market participant behavior by facilitating the generation of diverse training sets for ML algorithms and its potential capacity to induce new forms of isomorphic behavior among participants. Finally, we examined the tension between synthetic data’s ability to promote competition in the financial sector by lowering entry barriers in data-intensive markets and its potential to reinforce the market power of data platform owners by creating new opportunities for data monetization.
It is important to note that the three tensions we identify in this article broadly correspond to three existing priority areas for financial regulators, as we illustrate in Figure 2. The first tension, between model circulability and opacity, broadly aligns with regulators’ concerns around model risk management. Our second tension, between model-induced scattering and isomorphism, aligns closely with regulators’ macroprudential concerns around systemic risk. Finally, our third tension – concerning the flattening and deepening of data platforms’ power – speaks to emerging concerns around competition policy in financial services and the role of critical third parties. We propose that regulators as well as firms involved in the generation and use of synthetic data in the financial sector reflect on these tensions to better navigate the necessary trade-offs of wielding synthetic data.

Figure 2. Correspondence between synthetic data generation tensions and key regulatory priorities.
Take as an example the issue of double opacity. Current model risk regulations (SR 11-7 in the US, SS1/23 in the UK) establish comprehensive governance, validation, and testing requirements for financial models, applying these standards to both internally developed models and externally sourced models (e.g., from software vendors). These regulations also contain validation requirements for assessing data quality. SR 11-7, the model risk regulation for US financial institutions, requires that data be assessed for accuracy and relevance, while SS1/23 provides more extensive requirements to ensure that data be suitable for intended use, representative, and free from potential bias. However, a weakness of these regulations in the case of synthetic data is that they place the validation burden on the data user without ensuring that synthetic data providers disclose relevant contextual information that data users might need to satisfy these requirements. Crucially, synthetic data is itself a model output, and the increased circulation of synthetic data – even to address legitimate regulatory and compliance issues like data protection – is likely to induce new model linkages and dependencies across institutions that may go unnoticed by current model risk regulations. Ensuring that metadata capturing the context and limitations of synthetic data is included when such data is shared could help to ameliorate these risks by allowing synthetic data users to better understand its limitations.
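To indicate what such metadata might look like in practice, the sketch below gives one possible shape for a synthetic data label; the fields and values are our illustrative assumptions rather than a proposed standard.

```python
# One possible synthetic data label, attached at the point of sharing.
# Field names and example values are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class SyntheticDataLabel:
    generator_type: str            # e.g., "GAN", "VAE", "agent-based model"
    source_data_description: str   # provenance and collection context
    source_data_period: str        # temporal coverage of the training data
    intended_use: str              # the task the data was generated for
    known_limitations: list[str] = field(default_factory=list)
    privacy_mechanism: str = "unspecified"   # e.g., a differential privacy budget

label = SyntheticDataLabel(
    generator_type="GAN",
    source_data_description="retail loan book, single EU lender",
    source_data_period="2015-2022",
    intended_use="training credit default classifiers",
    known_limitations=["sparse coverage of crisis periods",
                       "minority-class records heavily augmented"],
)
```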
For this reason, we suggest that regulators and policymakers consider developing a synthetic data labeling regime to ensure that the context underpinning the production of synthetic datasets is maintained at the point they are transmitted across organizational boundaries and between market participants. At the same time, we recognize that each tension represents a trade-off requiring careful judgment. For instance, there may be a tension between data privacy and data utility, where maximizing one may compromise the other. This trade-off necessitates that stakeholders weigh the benefits and risks associated with each option to make informed decisions.
Acknowledgments
We would like to thank the reviewers for providing us with valuable comments and suggestions that we have used to improve this article. Moreover, we want to thank Finance and Society and, in particular, our editor Nathan Coombs for a productive and smooth editorial process. Finally, we are grateful to Julius Ambrosius for assistance with formatting and double-checking our references.