1. Introduction
The various constraints that describe a physical phenomenon are often reflections of underlying symmetry principles, summarising regularities that exist independent of specific dynamics (Gross Reference Gross1996). A notable example is the constraint of momentum conservation, which reflects a translational symmetry of the Euler–Lagrange equations (Landau & Lifshitz Reference Landau and Lifshitz1976). Dimensional analysis, historically linked to the discovery of scaling laws and non-dimensional numbers (Barenblatt Reference Barenblatt1996; Cantwell Reference Cantwell2002), expresses the principle of covariance, i.e. the symmetry of any physical law under a dilational transformation of its units of measurement. More generally, when one considers a specific problem, i.e. the governing physical laws and the corresponding boundary and/or initial conditions, multiple symmetries may be simultaneously present (e.g. spiral, reflectional, rotational) (Pakdemirli & Yurusoy Reference Pakdemirli and Yurusoy1998; Cantwell Reference Cantwell2002).
A type of symmetry which is of fundamental importance across various branches of physics is that of self-similarity (Tennekes & Lumley Reference Tennekes and Lumley1972; Townsend Reference Townsend1976; Barenblatt Reference Barenblatt1996; Cantwell Reference Cantwell2002). Following Townsend (Reference Townsend1976), Barenblatt (Reference Barenblatt1996), Pope (Reference Pope2000), we refer to self-similar (or self-preserving) phenomena as those whose evolution remains invariant under the group transformations of dilation (e.g. heat diffusion), translation (e.g. travelling waves) or a combination of both (e.g. turbulent wakes). Self-similarity allows for the reduction of an
$n$
-independent variable partial differential equation system into a system with
$n-1$
similarity variables (Birkhoff Reference Birkhoff1960; Pakdemirli & Yurusoy Reference Pakdemirli and Yurusoy1998; Cantwell Reference Cantwell2002). The transformation from original to similarity variables is known as a similarity transformation. Similarity transformations are particularly useful in problems involving
$n=2$
independent variables because they transform a partial differential equation into an ordinary differential equation, which is more amenable to analytical solution (Pakdemirli & Yurusoy Reference Pakdemirli and Yurusoy1998; Cantwell Reference Cantwell2002). Even if the governing equations are unknown, similarity considerations allow for the grouping of the problem parameters into similarity variables, reducing the effort required by the experimentalist when performing parametric characterisations of a problem by orders of magnitude (Barenblatt Reference Barenblatt1996). Additionally, the identification of similarity variables, subject to the constraints of the governing equations, can lead to the derivation of scaling laws, which describe the asymptotic evolution of the problem variables (Tennekes & Lumley Reference Tennekes and Lumley1972; Townsend Reference Townsend1976; Beaumard et al. Reference Beaumard, Bragança, Cuvier, Steiros and Vassilicos2024). The identification of similarity transformations and variables has played a prominent role in fluid mechanics research. Indeed, self-similarity is at the heart of theoretical efforts for the modelling of laminar and turbulent boundary layers (Prandtl Reference Prandtl1904; Townsend Reference Townsend1976), free shear flows (Tennekes & Lumley Reference Tennekes and Lumley1972; Townsend Reference Townsend1976; George Reference George1989; Pope Reference Pope2000), cascade dynamics (Kolmogorov Reference Kolmogorov1941b
; Pope Reference Pope2000; Vassilicos Reference Vassilicos2015; Steiros Reference Steiros2022a
), linear and nonlinear waves (Taylor Reference Taylor1950a
,
Reference Taylorb
; Zel’Dovich & Raizer Reference Zel’Dovich and Raizer1967; Whitham Reference Whitham2011), singularities (Eggers & Fontelos Reference Eggers and Fontelos2015) and high Mach number aerofoil design (Cantwell Reference Cantwell2002), among many others.
There are three possible strategies to identify self-similarity. First, if the differential equations describing the problem are known and the boundary conditions are simple, one can formally extract the flow symmetries using the theory of Lie groups (Birkhoff Reference Birkhoff1960; Cantwell Reference Cantwell1978; Cantwell Reference Cantwell2002). Second, if the equations are unknown one may still invoke dimensional analysis to uncover dilational symmetries (Barenblatt Reference Barenblatt1996). Finally, on many occasions, intuition has allowed the uncovering of self-similarity directly, via visual inspection of the problem solution (as for example in the case of turbulent shear flows Townsend Reference Townsend1976). However, this is not always sufficient to reveal the underlying self-similarity of a phenomenon. Intuition depends on the skills of the practitioner and is generally confined to simple problems. Dimensional analysis can only treat problems with dimensional variables, while even then it may uncover only the ‘tip of the iceberg’ of possible self-similar solutions, i.e. some (but far from all) of the self-similarities connected to dilational transformations (Barenblatt Reference Barenblatt1996). Lie group theory may become unfeasible in cases where the boundary conditions are overly complicated. Even if this is not the case, it may reveal some but not all symmetries of the problem. A coordinate transformation may be necessary for additional, ‘hidden’ symmetries to be revealed (Cantwell Reference Cantwell2002; Liu & Tegmark Reference Liu and Tegmark2022). In other cases the governing equations may be inadequate. An example of the above occurs in turbulent flows, where it is customary to consider the ensemble-averaged flow equations of motion i.e. the Reynolds-averaged Navier–Stokes (RANS) equations. Given the chaotic turbulent dynamics and complex boundary conditions, certain original symmetries of the Navier–Stokes equations are broken, but are still hypothesised to be recovered in a statistical sense (Frisch Reference Frisch1995). However, due to the underdetermined nature of the RANS equations (turbulence closure problem), extraction of symmetries is challenging (Oberlack Reference Oberlack1999, Reference Oberlack2001; Oberlack et al. Reference Oberlack, Cabot, Reif and Weller2006; Oberlack & Rosteck Reference Oberlack and Rosteck2010; Oberlack et al. Reference Oberlack, Hoyas, Kraheberger, Alcántara-Ávila and Laux2022).
The recent emergence of machine learning methodologies has provided a significant boost to our ability to uncover symmetries and self-similarities (Desai, Nachman & Thaler Reference Desai, Nachman and Thaler2022; Yang et al. Reference Yang, Walters, Dehmamy and Yu2023; Otto et al. Reference Otto, Zolman, Kutz and Brunton2023). If self-similarity exists in a problem, it must appear in the observables (data), and can be, in principle, discovered by data-driven methods. For instance, a known challenge in conventional dimensional analysis (Buckingham Pi theorem) is that it yields a non-unique set of non-dimensional variables that governs the evolution of a physical problem (Barenblatt Reference Barenblatt1996; Cantwell Reference Cantwell2002). Information from data can in that case provide a constraint to Buckingham Pi, and has been recently used to identify the most appropriate set of non-dimensional variables (Mendez & Ordóñez Reference Mendez and Ordóñez2004; Constantine, del Rosario & Iaccarino Reference Constantine, del Rosario and Iaccarino2017; Jofre, del Rosario & Iaccarino Reference Jofre, del Rosario and Iaccarino2020; Saha et al. Reference Saha, Gan, Cheng, Gao, Kafka, Xie, Li, Tajdari, Kim and Liu2021; Xie et al. Reference Xie, Samaei, Guo, Liu and Gan2022; Bakarji et al. Reference Bakarji, Callaham, Brunton and Kutz2022; Yuan & Lozano-Durán Reference Yuan and Lozano-Durán2025). Other examples are cases in which conservation laws and symmetries of ordinary differential equations (ODE) cannot be readily extracted because they are hidden, i.e. they require a coordinate transformation before they become manifest. Machine learning can be used to provide the necessary coordinate transformation for hidden symmetries to manifest (Liu & Tegmark Reference Liu and Tegmark2022; Mototake Reference Mototake2023). Of high relevance are also efforts to leverage the information contained in data to reveal scale-invariant flow structures (Fukami, Goto & Taira Reference Fukami, Goto and Taira2024) or close underdetermined governing equations (Duraisamy, Iaccarino & Xiao Reference Duraisamy, Iaccarino and Xiao2019), which can support the training and testing of nonlinear machine-learning techniques (e.g. Fukami & Taira (Reference Fukami and Taira2024), Duraisamy, Brunton & Taira (Reference Duraisamy, Brunton and Taira2025)).
In this work, we present a generalised methodology that can identify if similarity variables exist and, if they do, determine their mathematical expressions directly from data, without prior knowledge of the governing equations or boundary conditions. Our work differs from previous efforts (e.g. Bakarji et al. Reference Bakarji, Callaham, Brunton and Kutz2022; Xie et al. Reference Xie, Samaei, Guo, Liu and Gan2022; Yuan & Lozano-Durán Reference Yuan and Lozano-Durán2025) in that it is not based on dimensional analysis (although it is dimensionally consistent) and can thus identify self-similarities beyond dilational transformations connected to dimensional problems. This is achieved by formulating a minimisation problem to identify the similarity variables, which are then interpreted analytically via symbolic regression. In particular, our method simultaneously searches for the optimal nonlinear transformations of both independent variables (spatio-temporal coordinates) and dependent variables (observables) that yield the similarity variables of the problem and is thus different to methods aiming to identify coordinate transformations that render hidden symmetries manifest (e.g. Liu & Tegmark Reference Liu and Tegmark2022). The paper is structured as follows: in § 2 and Appendix A we describe the algorithm that extracts the similarity variables of a self-similar physical process from data. Section 3 demonstrates the potential utility of the proposed method by applying it to five problems which have been influential in fluid dynamics research and are exactly or approximately self-similar under different transformation types, with data derived from both laboratory and numerical experiments. Finally, § 4 discusses potential applications and limitations of the method and provides concluding remarks.
2. Data-driven identification of self-similarity
Let
$q(s,t)$
be a quantity of interest governed by a set of nonlinear partial differential equations, with
$s$
and
$t$
being the independent variables, typically associated with a spatial coordinate and time, respectively. In some problems (e.g. a laminar boundary layer),
$t$
may also correspond to a spatial coordinate. Consider the set of variables
$\xi$
and
$\tilde {q}$
, which lead to the transformation
$q(s,t) \rightarrow \tilde {q} (\xi ,t)$
, with
$\xi = \xi (s,t)$
, where
$\xi$
may also depend on the parameters (constants) of the problem. If
$\tilde {q}(\xi ,t)$
is independent of
$t$
, i.e. there is a function
$\hat {q}(\xi )$
such that
$\tilde {q}(\xi ,t) \equiv \hat {q}(\xi )$
, then
$q(s,t)$
is said to be self-similar (Pope Reference Pope2000). In this case, the variables and the transformation are referred to as similarity variables and similarity transformation, respectively. Assuming that we have measurements of
$q(s,t)$
at distinct instants (stations)
$t_i$
, where
$i=1,\ldots ,n_t$
and
$n_t$
is the number of available instants (stations), we propose a two-stage workflow for extracting the similarity transformation directly from data, without previous knowledge of the governing equations. Because self-similarity can be inexact or the data may include errors, the equivalence
$\tilde {q} (\xi ,t) \equiv \hat {q}(\xi )$
can be at most expected to hold approximately. In higher dimensions,
$s$
and
$\xi$
are vectors.
2.1. Step 1. Search for similarity variables
We express the similarity variables as a superposition of elementary dilation and translation groups

Decomposing the similarity variables in the form of distinct transformations is important for facilitating the success of the ensuing optimisation and regression tasks, as well as for enhancing the interpretability of the method. The search for similarity variables is formulated as a minimisation problem

where
$\boldsymbol{w}$
is the design variable matrix containing the discrete values of the transformation functions
$\alpha$
,
$\beta$
,
$\gamma$
,
$\delta$
. The
$l_2$
-norms,
$\| \boldsymbol{\cdot }\|_2^2$
, are computed following interpolation on the transformed coordinates (
$\xi$
) grid. The resolution of the
$\xi$
grid, i.e. the number of points where the
$l_2$
norms are evaluated, can be freely chosen by the user and can be set equal to the number of points in the input (non-transformed) data (see also Appendix B). In some problems, solving (2.2) can lead to degenerate solutions (i.e.
$\tilde {q} = 0$
), in which case (2.2) is replaced by a mean-regularised cost functional, which prevents trivial optima

This step provides the discrete values of the functions
$\alpha$
,
$\beta$
,
$\gamma$
,
$\delta$
and, by extension through (2.1), the discrete values of the similarity variables
$\xi$
and
$\tilde {q}$
, at all
$n_t$
instants. In practice, in order to further promote the success of the optimisation task whilst also preserving the automated character of the developed method, we propose the implementation of step 1 in successive iterations of increased decomposition complexity (i.e. building gradually from simple dilation (
$\beta =\delta =0$
) to the general decomposition (
$\xi = \alpha (t) s + \beta (s,t),\; \tilde {q} = \gamma (t) q + \delta (s,t)$
)). Depending on the problem (e.g. its boundary conditions), certain candidate transformations might be irrelevant or inadmissible (for example, translation of radial coordinates in an axisymmetric problem). Appendix A provides the pseudocode describing an example implementation of the method with clarifying remarks.
2.2. Step 2. Analytic form of the transformations
Given the knowledge of
$\boldsymbol{w}$
, i.e. the values of the functions
$\alpha , \beta , \gamma , \delta$
at all instants
$n_t$
, we employ symbolic regression to extract the analytic form of the transformations. Symbolic regression is a machine-learning technique that combines mathematical operators, functions, constants and state variables to construct a mathematical expression
$\psi$
that best represents a given dataset
$\mathcal{D}$
. In this work,
$\mathcal{D}$
refers to the discrete values of each function (
$\alpha$
,
$\beta$
,
$\gamma$
or
$\delta$
) that were found in the first step of the workflow. Symbolic regression can identify arbitrary expressions as it does not make any assumptions about the underlying function. The set of variables (library)
$\boldsymbol{A}$
which
$\psi$
can depend on,
$\psi =\psi (\boldsymbol{A})$
, is provided by the user, and is typically composed of the state variables and parameters of the problem.
To reduce the complexity of the regression problem, it is beneficial to take advantage of any prior knowledge regarding the characteristics of the dataset. We enforce two properties. The first property that is critical for discovering physical laws is that of dimensional homogeneity: the units of the identified expression
$\psi (\boldsymbol{A})$
should match those of the given dataset
$\mathcal{D}$
. The second property that is relevant to this work is that scale invariance is related to power laws (Barenblatt Reference Barenblatt1996), allowing us in certain cases to restrict the search for expressions in the form of monomials.
This step of the algorithm can be carried out with any symbolic regression method, giving users the flexibility to choose their preferred tool. Here, we demonstrate the workflow with two different symbolic regression methods: the open-source library PySR (https://astroautomata.com/PySR/), which offers high performance, flexibility, configurability and generality (Cranmer Reference Cranmer2023), and a custom, simple regression algorithm for the identification of monomials. In both cases, we use an
$l_2$
loss function, with an optional penalisation term enforcing dimensional homogeneity

where
$w_D$
is a non-negative regularisation factor and the brackets
$ [ \; ]$
denote the units of a quantity in the form of a dimension vector (see Appendix A for an example). Equation (2.4) is individually applied to each transformation function (
$\alpha$
,
$\beta$
,
$\gamma$
or
$\delta$
). This step provides the analytic form of the similarity transformations
$\alpha$
,
$\beta$
,
$\gamma$
,
$\delta$
, and thereby of the similarity variables
$\xi$
and
$\tilde {q}$
, through (2.1).
3. Results
3.1. Laminar boundary layer
The> first validation example considers the well-known case of the two-dimensional, laminar and incompressible boundary layer, where Prandtl (Reference Prandtl1904) showed that some terms of the Navier–Stokes equations are negligible, and reduced the latter to the ‘boundary layer equations’ (BLE). By utilising the streamfunction
$\varPsi$
, the BLE can be expressed in a single equation. Blasius (Reference Blasius1908) noted that, for semi-infinite flat plate boundary conditions, the BLE is symmetrical under a dilational transformation, which reduces the problem to an ordinary differential equation for the non-dimensional streamfunction
$f(\tilde {y})=\varPsi /\sqrt {\nu U_\infty x}$
, with
$\tilde {y} = y \sqrt {U_\infty /(\nu x)}$
. In the above,
$U_\infty$
is the free-stream velocity away from the flat plate,
$x$
and
$y$
are the streamwise and normal-to-the-wall distances, respectively, and
$\nu$
is the kinematic viscosity of the fluid. The BLE are thus reduced to the boundary value problem



Figure 1. Data-driven identification of self-similarity in the Blasius boundary layer. (a) Streamwise velocity field (Blasius solution). The dashed vertical lines denote the nine velocity profile sampling stations. (b) Sampled velocity profiles. (c) Algorithmically collapsed velocity profiles. (d) Algorithmically identified scaling
$\alpha$
of the wall-normal coordinate
$y$
. (e) Algorithmically identified scaling
$\beta$
of the streamwise velocity
$u$
.
Knowledge of
$\tilde {u} = f_{\tilde {y}}$
(e.g. via numerical integration of the Blasius equation) allows for the calculation of the streamwise velocity distribution
$u = U_\infty \tilde {u}$
at any location in the boundary layer. Here, we derive the Blasius similarity directly from data, without prior knowledge of the Navier–Stokes equations, Prandtl’s scaling analysis or Blasius’ similarity arguments. Our dataset consists of
$u(x,y)$
data from a flat plate laminar boundary layer flow (figure 1
a), acquired by solving the Blasius ODE with a shooting method (Bakarji et al. Reference Bakarji, Callaham, Brunton and Kutz2022). In particular, we extract velocity profiles
$u(y)$
from nine stations at different streamwise coordinates
$x$
(figure 1
b). The algorithm commences by seeking similarity under the dilational transformations
$\tilde {y} = \alpha (x) y$
,
$\tilde {u} = \beta (x) u$
. Figure 1(c) shows that the transformed profiles have collapsed onto a single curve. The identified transformations for the wall-normal coordinate
$y$
and the streamwise velocity
$u$
are shown in figures 1(d) and 1(e), respectively.
The second step consists of expressing the identified transformations as functions of the governing parameters and independent variables, i.e. a symbolic regression task. The library of candidate variables is composed of the free-stream velocity
$U_\infty$
, the viscosity
$\nu$
and the streamwise coordinate of the extracted profiles
$x$
, i.e.
$\alpha = \alpha (U_\infty ,\nu ,x)$
and
$\beta =\beta (U_\infty ,\nu ,x)$
. Owing to the nature of the assumed dilational transformations, we look for expressions in the form of monomials, i.e.
$\alpha =c_1 U_\infty ^{c_2} \nu ^{c_3} x^{c_4}$
and
$\beta =c_5 U_\infty ^{c_6} \nu ^{c_7} x^{c_8}$
, where
$c_1$
and
$c_5$
are multiplicative constants that reflect the arbitrariness of the dilation factors between the original and similarity variables. Dimensional homogeneity is enforced by assuming
$\tilde {y}$
and
$\tilde {u}$
to be dimensionless and a non-zero weighting constant
$w_D$
in the objective function (2.4). The interpreted expressions are


which are in close agreement with the theoretically derived similarity variables of Blasius (
$\tilde {y} = y \sqrt {U_\infty /(\nu x)}$
,
$\tilde {u} = u / U_\infty$
). As previously noted, the identified and theoretical scalings can vary by an arbitrary multiplicative constant (i.e. the offset in figure 1
d), depending as to whether and how the user normalises the collapsed profiles. Appendix B shows results from a sensitivity and robustness analysis with respect to the input data, parameters and noise.
3.2. Burgers’ equation
The second validation example considers Burgers’ equation, which is a partial differential equation that finds wide application in fluid dynamics, nonlinear acoustics and traffic flow, among others. It acts as a prototype for a variety of phenomena including shock wave formation, rarefaction waves and turbulence (Zel’Dovich & Raizer Reference Zel’Dovich and Raizer1967; Whitham Reference Whitham2011). In the absence of diffusion, it is the simplest model for gas dynamics. Its mathematical expression in one dimension, along with its initial condition is

where
$u(x,t)$
is the fluid velocity and
$x$
and
$t$
the spatial and temporal coordinates, respectively. The subscript denotes partial differentiation. Using the method of characteristics, the solution can be expressed as
$u(x,t) = f(\xi ) = f(x-ut) = f(x+\alpha )$
, i.e. the problem described by (3.3) is invariant under a non-uniform translational transformation. In this work, we consider a sinusoidal initial condition,
$u(x,0) = \sin {(2 \pi x)}$
, with
$x \in [0,1 ]$
. Figures 2(a) and 2(b) show the spatio-temporal evolution of velocity and profiles extracted at seven different time instants, respectively, illustrating the steepening of the solution as time evolves. The times shown are non-dimensionalised with the time of shock formation
$t_c$
,
$t^*=t/t_c=2 \pi t$
. The algorithm is provided with the data shown with markers in figure 2(b).

Figure 2. Data-driven identification of self-similarity in Burgers’ equation. (a) Spatio-temporal evolution of the velocity. (b) Extracted profiles at different time instants. Markers show the data given to the algorithm. (c) Algorithmically collapsed profiles.
Figure 2(c) shows the algorithmically collapsed velocity profiles on the transformed coordinates
$ (\tilde {x},t )$
, i.e. the output of step 1 of the algorithm, following a transformation of the form
$\tilde {x} = x+ \alpha (x,t)$
(simpler transformations do not succeed at collapsing the data). Since this is an initial value problem, we only consider the distance of the profiles with respect to the initial condition, i.e. the objective function in (2.2) is simplified to
$\sum _{i=1,\ldots ,n_t} \| u(\tilde {x},t_i) - u(x,0) \|_2^2$
.
We use PySR to interpret the algorithmically computed transformation
$\alpha$
. The library of candidate variables is composed of the problem’s independent and dependent variables, i.e.
$\alpha = \alpha (u,x,t)$
, and the library of operators includes the four basic mathematical operations (
$+,-,\times ,\div$
).
$\texttt {PySR}$
interprets the algorithmically computed transformation as
$\alpha = - u t$
, which is identical to the analytical similarity transformation. Close inspection of figure 2(c) shows that the profiles are not perfectly collapsed (for example, at
$t^*=0.936$
). The range-normalised mean absolute error between the algorithmically computed discrete values of the transformation and the analytical ones is
$2.2\,\%$
. However, in step 2 of the workflow (symbolic regression), the algorithm attempts to fit expressions that balance complexity and accuracy, allowing it to recover the exact analytical transformation
$\alpha = -ut$
.
3.3. Free turbulent flow
As a third example, we consider the statistically stationary flow past a slender bluff body at high Reynolds numbers, which is a characteristic example of a free (i.e. unconfined) turbulent shear flow. The term slender signifies an effectively infinite body length in one cross-flow direction, along which the flow statistics can be considered homogeneous. Such flows can be modelled using a boundary layer approximation, similar to the one employed by Prandtl for laminar boundary layers, applied to the time-averaged flow (Townsend Reference Townsend1976), i.e.


where the overbar denotes time averaging and
$\nu$
is the kinematic viscosity of the fluid. Here,
$u$
and
$v$
are the flow velocities along the streamwise
$x$
and cross-wise
$y$
directions, which are decomposed into time-averaged
$(\bar {u},\bar {v})$
and fluctuating
$(u',v')$
components. The formulation leading to (3.4) introduces the Reynolds stress term
$\overline {u'v'}$
into the flow governing equations, i.e. an additional unknown which cannot be calculated implicitly (this is the well-known closure problem of turbulence). The identification of similarity in this case is, therefore, not possible from the equations themselves, unless an assumption is made regarding the relation of the unknown Reynolds stress terms to the mean velocity distribution (i.e. turbulence modelling). However, it is customary to hypothesise that the turbulent BLE accept self-similar solutions far from initial conditions, a fact also supported by experimental observations (Townsend Reference Townsend1976; Cantwell Reference Cantwell2002). This approximate self-similarity is the origin behind various scaling laws for the evolution of free shear flows, widely used in many applications of the energy and transportation sectors, such as wind farm planning (Bempedelis & Steiros Reference Bempedelis and Steiros2022) and jet engine noise prediction (Tam Reference Tam2019). Returning to the particular example of the slender bluff body wake, we may consider the time-averaged velocity deficit
$\zeta (x,y) = U_\infty - \bar {u}(x,y)$
that the body produces far (i.e. tens of characteristic body lengths) downstream. In the above,
$U_\infty$
is the constant free-stream velocity. The flow statistics are homogeneous along the spanwise direction
$z$
. Wake self-similarity is then known to assume the form (Townsend Reference Townsend1976; Pope Reference Pope2000)


Figure 3. Data-driven identification of self-similarity in the wake of a porous plate. (a) Schematic of the experimental apparatus showing the flume, porous plate and PIV configuration. (b) Mean streamwise velocity profiles at different locations downstream of the plate. (c) Algorithmically collapsed velocity profiles.
where
$\bar {u}_{\textit{cntr}}=\bar {u}(x,0)$
denotes the centreline velocity and
$y_{1/2}(x)$
is the wake half-width defined such that
$\bar {u}(x,\pm y_{1/2}) = 0.5 (U_\infty +u_{\textit{cntr}} )$
. We attempt to extract the self-similar relation (3.5) directly from experimental data. To this end, we measured the turbulent wake of a plate of 53 % porosity immersed in a water flume normal to the flow (see figure 3
a) at a Reynolds number based on the free-stream velocity and plate width
$Re \approx 6000$
. The velocity fields at various positions downstream of the plate were measured using Particle Image Velocimetry (PIV) (Phantom 4MP camera at 50 Hz acquisition frequency) as shown in figure 3(a). Each mean streamwise velocity profile was then calculated by averaging 3000 vector fields. More information regarding the experimental procedure can be found in Bekoglu et al. (Reference Bekoglu, Bempedelis and Steiros2024, Reference Bekoglu, Bempedelis and Steiros2025). We consider the mean velocities at five stations in the wake of the plate (figure 3
b). Figure 3(c) shows the collapse that is obtained via the proposed method, assuming transformations of the form
$\tilde {y} = \alpha (x) y$
and
$\tilde {u} = \beta (x) \bar {u} + \gamma (x)$
(simpler transformations cannot collapse the data). The identified transformations are regressed as functions of the characteristic scales and variables of the problem using
$\texttt {PySR}$
, with
$\alpha = \alpha (x, y_{w}, y_{1/2})$
(with
$y_w(x)$
defined as
$\bar {u}(x,\pm y_{w}) = 0.99 U_\infty$
),
$\beta =\beta (U_\infty , \bar {u}_{\textit{cntr}}, x, \nu )$
and
$\gamma =\gamma (U_\infty , \bar {u}_{\textit{cntr}}, x, \nu )$
. The library of operators consists of the four basic mathematical operations (
$+,-,\times ,\div$
). The interpreted similarity transformations are


which match (ignoring the arbitrary multiplicative constant) the self-similar expressions for turbulent wakes found in the literature (Tennekes & Lumley Reference Tennekes and Lumley1972; Townsend Reference Townsend1976; Pope Reference Pope2000). Besides our own experiments, we also test our method by using experimental data available in the literature (Cimbala, Nagib & Roshko Reference Cimbala, Nagib and Roshko1988), and again retrieve the self-similarity relations for turbulent wakes (see Appendix C).
3.4. Cavity collapse
The bursting of bubbles at the sea’s surface plays a crucial role in the exchanges between oceans and the atmosphere, thereby being important for climate and weather (Deike Reference Deike2022). When a bubble bursts at the sea’s surface, aerosol is produced via two mechanisms: the rupture of the bubble’s cap film, and the formation of a vertical jet that breaks into droplets following cavity collapse (Deike Reference Deike2022). The jet and aerosol properties are a consequence of the nonlinear fluid dynamics near the points where the topology changes (Eggers Reference Eggers1997; Eggers & Fontelos Reference Eggers and Fontelos2015). In the case of a collapsing cavity, several studies have found that, near the collapse of the travelling capillary waves at the axis of symmetry, the liquid–gas interface evolves in a self-similar manner, enabling the derivation of scaling laws for the bubble and jet dynamics (Keller & Miksis Reference Keller and Miksis1983; Zeff et al. Reference Zeff, Kleber, Fineberg and Lathrop2000; Duchemin et al. Reference Duchemin, Popinet, Josserand and Zaleski2002; Ghabache et al. Reference Ghabache, Antkowiak, Josserand and Séon2014; Gañán-Calvo Reference Gañán-Calvo2017; Deike et al. Reference Deike, Ghabache, Liger-Belair, Das, Zaleski, Popinet and Séon2018).
We simulate the collapse of a cavity (
${Bo}=\rho g R_0^2/\gamma = 10^{-3}$
,
${{La}}=\rho \gamma R_0/ \mu ^2 = 2500$
) by solving the two-phase incompressible axisymmetric Navier–Stokes equations using Basilisk (www.basilisk.fr), an open-source flow solver that has been previously used in computational studies of bursting bubbles (Lai, Eggers & Deike Reference Lai, Eggers and Deike2018; Berny et al. Reference Berny, Deike, Séon and Popinet2020; Sanjay, Lohse & Jalaal Reference Sanjay, Lohse and Jalaal2021). In the above,
$\rho$
and
$\mu$
are the liquid density and viscosity, respectively,
$\gamma$
is the interfacial tension,
$R_0$
is the initial bubble radius and
$g$
is the gravitational acceleration. The evolution of the liquid–gas interface is shown in figures 4(a)–4(d). We extract profiles of the interface
$h(r,t)$
as it approaches the axis of symmetry, at
$t_* = (t_0 - t )/t_c = [10, 9, 8, 7, 6 ]$
, where
$t_c$
is the characteristic time of the horizontal capillary wave (Lai et al. Reference Lai, Eggers and Deike2018), and
$t_0$
is the moment where the capillary waves meet at the axis of symmetry,
$r=0$
. The extracted profiles are shown in figure 4(e). Duchemin et al. (Reference Duchemin, Popinet, Josserand and Zaleski2002) and Lai et al. (Reference Lai, Eggers and Deike2018) observed that, in this time window, the interface approximately follows the
$ (t_0-t )^{2/3}$
scaling of inviscid theory (Keller & Miksis Reference Keller and Miksis1983). The algorithm successfully collapses the extracted profiles into a single curve (figure 4
f) for transformations of the form
$\tilde {r} = \alpha (t) r$
for the radial coordinate and
$\tilde {h} = \beta (t) h + \gamma (t)$
for the liquid–gas interface. By regressing the identified transformations
$\alpha$
,
$\beta$
and
$\gamma$
as power-law functions of the non-dimensional time
$t_*$
(figure 4
g–4i), we retrieve the expressions


which are close (ignoring the arbitrary multiplicative constants
$c_1$
,
$c_2$
and
$c_3$
) to the theoretically derived ones (Keller & Miksis Reference Keller and Miksis1983; Zeff et al. Reference Zeff, Kleber, Fineberg and Lathrop2000; Duchemin et al. Reference Duchemin, Popinet, Josserand and Zaleski2002; Ghabache et al. Reference Ghabache, Antkowiak, Josserand and Séon2014; Lai et al. Reference Lai, Eggers and Deike2018).

Figure 4. Data-driven identification of self-similarity in a collapsing cavity. Three-dimensional visualisation of the liquid–gas interface at (a)
$t/\tau =0$
, (b)
$t/\tau =0.25$
and (c)
$t/\tau =0.5$
, where
$\tau =\sqrt {\rho R_0^3 / \gamma }$
is the inertio-capillary time scale. (d) Liquid–gas interface time evolution,
$t/\tau = (0, 0.05, \ldots , 0.5 )$
. (e) Interface profiles near cavity collapse. ( f) Algorithmically collapsed interface profiles near cavity collapse. (g–i) Identified transformations
$\alpha$
,
$\beta$
and
$\gamma$
. Comparison with theoretical scaling laws.
3.5. Decaying turbulence
We consider the case of homogeneous decaying turbulence, experimentally realised by passing a stream of fluid through a uniformly spaced grid inside a wind tunnel (figure 5 a). We analyse measurements of velocity time series, obtained via hot-wire anemometry at seven positions downstream of the grid (the details of the experimental campaign can be found in Steiros Reference Steiros2022a ). The measured turbulence is fully developed, approximately homogeneous and yields the −5/3 law for the energy spectrum at intermediately sized eddies, as predicted by Kolmogorov’s K41 framework (Kolmogorov Reference Kolmogorov1941a , Reference Kolmogorovb , Reference Kolmogorovc ).

Figure 5. Data-driven identification of self-similarity in decaying turbulence. (a) Schematic of the experimental set-up. (b) Experimentally measured power spectral densities. The dashed vertical line delineates the range of the spectrum that is used. (c) Measured spectrum normalised by the inertial scales. (d) Measured spectrum normalised by the Kolmogorov scales. (e) Measured spectrum normalised by the algorithmically identified expression (3.9).
The theoretical derivation of the −5/3 law assumes that, at sufficiently high Reynolds numbers, an intermediate self-similar region forms in the cascade, which is independent of large- and small-scale effects (Batchelor Reference Batchelor1953; Frisch Reference Frisch1995; Pope Reference Pope2000). However, the above description is part of a more complex flow picture, as the widely studied paradigm of grid turbulence sufficiently far from initial conditions shows. The turbulence cascade in that case is generally accepted to be characterised by two self-similarities, both present at the same time and at different eddy sizes: One at large scales where viscosity is negligible (Steiros Reference Steiros2022a ; Lundgren Reference Lundgren2003), and one at small scales where non-equilibrium effects are negligible (Pope Reference Pope2000; Lundgren Reference Lundgren2003). The energy spectrum thus accepts the following general expression (Pope Reference Pope2000):

where
$x$
,
$\kappa$
,
$E_{11}(\kappa ,x)$
and
$\epsilon (x)$
represent the distance from the grid, wavenumber, energy spectrum and dissipation rate, respectively. Here,
$L$
and
$\eta$
are the integral and Kolmogorov scales, characteristic of the large- and small-scale self-similarities, respectively. To retrieve the −5/3 law,
$E_{11}(k,x) = \epsilon (x)^{2/3} \kappa ^{-5/3}$
, one needs to be asymptotically far from large scales (i.e.
$\kappa L \to \infty$
) and far from small scales (
$\kappa \eta \to 0$
) at the same time, as in that case K41 assures that both
$f$
and
$g$
tend to unity. It is of interest to see how our algorithm fares in this two-scale problem.
Figure 5(b) plots the measured power spectral densities versus the wavenumbers at different measurement stations. Figure 5(c,d) shows the collapse of the power spectral densities using large- and small-scale similarity variables, respectively. Using a library that consists of
$(L, \eta )$
for the
$x$
axis and
$(\epsilon , L, \eta , k)$
for the
$y$
axis, symbolic regression of the algorithmically identified dilational transformations (
$\tilde {\kappa } = \alpha (x) \kappa$
,
$\tilde {E_{11}} = \beta (x) E_{11}$
) yields the following expressions, plotted in figure 5(e):

As expected, the algorithm cannot derive the two-scale similarity expression (3.8), as the assumed dilational transformations effectively reduce it to a single-scale algorithm. Further insight may be obtained by using the dissipation scaling
$\epsilon \propto k^{3/2}/L$
, where
$k$
is the turbulence kinetic energy, which is known to characterise homogeneous decaying turbulence far from initial conditions (Steiros Reference Steiros2022a
,
Reference Steirosb
; Vassilicos Reference Vassilicos2015). Expression (3.9) then becomes

Inspection of the above expression reveals that our algorithm has, seemingly, collapsed the totality of the cascade dynamics based on a single empirical length scale,
$l \approx L^{0.38} \eta ^{0.62}$
, which is a combination of
$L$
and
$\eta$
. There have been several attempts to develop single length scale theories of turbulence, derived by assuming self-similarity of the turbulence statistics, i.e. see von Karman & Howarth (Reference von Karman and Howarth1938) and Sedov (Reference Sedov2018), and the later theories of Barenblatt & Gavrilov (Reference Barenblatt and Gavrilov1974) and George (Reference George1992). The last two studies predict the Taylor microscale as the appropriate similarity variable, in close agreement with our data-driven methodology. In homogeneous decaying turbulence the Taylor microscale becomes
$\lambda = L^{1/3} \eta ^{2/3}$
(Pope Reference Pope2000), which is close to our empirically derived length scale.
It might be tempting to interpret our empirical results as being in support of a single-scale picture for homogeneous turbulence, contrary to the commonly accepted view of turbulence as an intrinsically multi-length scale phenomenon (Batchelor Reference Batchelor1953; Tennekes & Lumley Reference Tennekes and Lumley1972; Townsend Reference Townsend1976; Pope Reference Pope2000). However, the results shown in figure 5 (and in most studies of homogeneous decaying turbulence) consider the far region of a turbulence grid where the integral length Reynolds number evolution is relatively slow, leading to a similarly slow evolution of
$L/\eta$
(Vassilicos Reference Vassilicos2015). As a result, our data show that all length scales in question (integral, Kolmogorov, Taylor) produce a (visually) ‘good-enough’ collapse, as shown in figure 5(c–e), and inspection of any of these graphs might give the impression of a single length scale self-similar evolution.
To better appreciate the single or multi-length scale nature of the turbulence cascade, we need data for which the integral length scale Reynolds number varies drastically. To achieve this, we analyse the periodic box Direct Numerical Simulation (DNS) data of decaying turbulence taken from Goto & Vassilicos (Reference Goto and Vassilicos2016) (the details of the simulations can be found in the reference). Similar to the grid case, our goal is to collapse energy spectra taken at different decay times
$\hat {t}$
(see figure 6
a). Here, however, we consider datasets from two simulations which are identical in every aspect, apart from their size (
$N=1024^3$
and
$N=2048^3$
) and the kinematic viscosity of the fluid, i.e. the Reynolds number of the flow varies drastically between the two cases, making the collapse of the curves less arbitrary.
Figure 6(a) presents the dimensional energy spectra from the two simulations, whereas figures 6(b) and 6(c) show the same data with the axes rescaled according to inertial and Kolmogorov similarity variables, respectively. The collapse of the large (small) scales is adequate only when the axes are normalised using the integral (Kolmogorov) scales, in agreement with a multi-scale view of the cascade. However, data from the same simulation size retain a ‘good-enough’ collapse independent of the choice of normalisation length, similar to our grid measurements, a fact which, if viewed in isolation, might give a false impression of a single length scale process. Figure 6(d) shows the algorithmically identified collapse of the data, that is, using the empirical transformations


The algorithm again identifies, approximately, the Taylor microscale (
$\lambda = L^{1/3}\eta ^{2/3})$
as the appropriate length for the collapse of the turbulence dynamics, but this time it cannot be claimed that the Taylor scale collapses both the large- and small-scale dynamics – only that it is an (algorithmically) optimal compromise when a single length scale collapse is enforced. We conjecture that the reason behind this could be explained from Lundgren’s two-scale turbulence theory which is based on the technique of matched asymptotic expansions (Lundgren Reference Lundgren2002, Reference Lundgren2003). The repercussions of this theory are discussed in Obligado & Vassilicos (Reference Obligado and Vassilicos2019), Meldi & Vassilicos (Reference Meldi and Vassilicos2021): in particular, it is suggested that the Taylor microscale is the length scale at which the cascade dynamics is optimally distanced from both non-equilibrium effects (large scales) and dissipative effects (small scales) at the same time (in a matched asymptotic manner). The algorithmically identified collapse (figure 6
d) produces best results in the intermediate, inertial range of the cascade where both above effects tend to become negligible. Based on this argument, the algorithm identifies the Taylor microscale scale because it is the optimal compromise of the two governing scales (inertial, Kolmogorov) of the cascade.

Figure 6. Data-driven identification of self-similarity in decaying turbulence, using the DNS data of Goto & Vassilicos (Reference Goto and Vassilicos2016). (a) Power spectral densities. (b) Spectrum normalised by the inertial scales. (c) Spectrum normalised by the Kolmogorov scales. (d) Spectrum normalised by the algorithmically identified expression (3.11).
4. Discussion and conclusion
We have presented a methodology for identifying similarity variables from data, in the absence of governing equations. Given measurements of a quantity of interest, the method computes the discrete values of the similarity transformations that best collapse the data and expresses their analytic form via symbolic regression. The self-similarity that we identify is the optimal fit of the data, and is not derived from the governing partial differential equations and the boundary conditions of the specific problem.
The proposed method differs from dimensionless learning approaches that aim to reduce the dimensionality of parameter spaces via identification of dimensionless groups (Xie et al. Reference Xie, Samaei, Guo, Liu and Gan2022; Bakarji et al. Reference Bakarji, Callaham, Brunton and Kutz2022; Yuan & Lozano-Durán Reference Yuan and Lozano-Durán2025) because our focus is on discovering the similarity transformations that enable the collapse of entire solution profiles. Our approach is not restricted to dilational similarity transformations, and is capable of identifying more general, non-uniform relationships between variables, as demonstrated in the Burgers equation example. This flexibility broadens the range of problems for which similarity solutions can be uncovered. In practical terms, this approach also reduces the degree of prior knowledge required to reveal self-similar behaviour. For example, in the turbulent wake case, methods based on dimensional analysis would require the characteristic velocity difference to be provided as input in order to reveal the self-similarity. On the other hand, our method can uncover it autonomously. More generally, step 1 of the proposed framework can be used to detect self-similar behaviour whilst being effectively agnostic to the underlying physics, i.e. without requiring prior specification of the relevant variables or parameters. In the symbolic regression step (step 2), the user must supply the candidate parameters to obtain explicit analytic expressions, but the ability to first reveal the existence of self-similarity without specifying inputs may be particularly appealing for exploratory studies or complex systems. On the other hand, for cases of pure dilational self-similarity, methods based on dimensional analysis are expected to be more efficient. A note of caution: the computation of the
$l_2$
norms in (2.2) is performed over the common support of the transformed profiles. If the supports differ significantly, the optimisation may become challenging. Although this issue did not arise in this work, we acknowledge it as a point that might need to be addressed in future work.
The capabilities of the method were demonstrated in five fluid mechanical problems, which are known or assumed to accept similarity solutions, based on data from both numerical and laboratory experiments. In four problems, the transformations identified by the method were in agreement to the analytically or empirically derived similarity transformations, whilst circumventing the need for scaling analysis or similarity arguments. In the fifth problem (turbulence cascade) our method identified an empirical length scale as the optimal similarity length of the cascade, which was close to the Taylor microscale. Further investigation revealed that this result cannot be treated as being unequivocally in support of single length scale theories of turbulence (von Karman & Howarth Reference von Karman and Howarth1938; Barenblatt & Gavrilov Reference Barenblatt and Gavrilov1974; George Reference George1992; Sedov Reference Sedov2018), but rather expresses the role of the Taylor microscale as an intermediate scale between the integral and Kolmogorov scales, at which non-equilibrium and dissipative effects are optimally spaced (Meldi & Vassilicos Reference Meldi and Vassilicos2021).
The results illustrate ways that the proposed algorithm can be used for the identification of similarity and scaling laws in situations where rigorous mathematical analysis is challenging. This includes a wide range of applications in fluid physics but also in other processes such as quasicrystal shape and growth (Kamiya et al. Reference Kamiya, Takeuchi, Kabeya, Wada, Ishimasa, Ochiai, Deguchi, Imura and Sato2018), stellar collapse (Yahil Reference Yahil1983), single protein dynamics (Hu et al. Reference Hu, Hong, Dean Smith, Neusius, Cheng and Smith2016) and others. Other directions for future work include using the method to detect symmetry breaking, distinguish similarity across multiple scales and identify the onset or breakdown of self-similarity.
Algorithm 1: Data-driven similarity inference: implementation example

Acknowledgements
The authors would like to thank E. Bekoğlu for providing the flume data and the schematic of the experimental set-up. K.S. is grateful to Professor S. Goto for providing the periodic box DNS data.
Funding
Parts of this work were conducted while N.B. was at Imperial College London, supported by EPSRC, Grant No. EP/W026686/1. L.M. acknowledges support from the ERC Starting Grant PhyCo No. 949388 and the EU-PNRR YoungResearcher TWIN ERC-PI_0000005. K.S. acknowledges support from the ERC Starting Grant ONSET No. 101163321.
Declaration of interests
The authors report no conflict of interest.
Data availability statement
The code and input data required to reproduce the work reported in the manuscript are available in the GitHub repository https://github.com/nbeb/extracting_self-similarity_from_data. The grid turbulence data are available upon request.
Author contributions
N.B., L.M. and K.S. designed research; N.B. and K.S. performed research; N.B. analysed data; and N.B. and K.S. wrote the paper.
Appendix A. Data-driven similarity inference: implementation example
A pseudocode exemplifying the implementation of the proposed method for the case of a quantity of interest
$q(s,t)$
is provided in Algorithm1. Clarifications or examples are shown in smaller font size below each pseudocode line. More implementation examples can be found in the code and input data that reproduce the problems considered in the paper. These are available in the GitHub repository https://github.com/nbeb/extracting_self-similarity_from_data.
Appendix B. Sensitivity and robustness
In this section, we present results from a sensitivity analysis with respect to the (i) number of provided profiles (stations), (ii) number of points at each station, (iii) number of points where the
$l_2$
norms of (2.2) are evaluated (i.e. discretisation of the transformed coordinates grid) and (iv) noise in the input data, for the Blasius boundary layer problem.
-
(i) Number of provided stations: in the example detailed in § 3, the algorithm is provided with velocity profiles at 9 different streamwise stations. Table 1 shows the identified scalings when fewer stations at regular or irregular spacings are considered. The algorithm shows robust performance across all tested cases.
-
(ii) Number of points at each station (
$N_{\textit{meas}}$ ): in the example detailed in § 3, the velocity profiles at each streamwise station consist of 100 points. Table 2 shows the identified scalings when fewer points are available (50 and 20, respectively). The algorithm shows robust performance across all tested cases. A slight increase in the
$y$ scaling error is observed as the number of available points decreases.
-
(iii) Number of points in the transformed coordinates grid (
$N_{\xi -\textit{points}}$ ): as discussed in § 2, the
$l_2$ -norms in (2.2) are computed following interpolation on the transformed coordinates (
$\xi$ ) grid. In the example detailed in § 3,
$N_{\xi -\textit{points}}$ is set equal to the number of points in the input data
$N_{\textit{meas}}$ . Table 3 shows the identified scalings when this ratio changes. The algorithm shows robust performance across all tested cases.
-
(iv) Lastly, we consider the robustness of the algorithm to noise. To this end, we add zero-mean Gaussian noise to the data, with standard deviation proportional to the local velocity magnitude, scaled by a relative noise level
$\epsilon$ . The identified scalings for three different noise levels are given in table 4. The input and algorithmically scaled data are shown in figure 7. While the algorithm’s accuracy decreases with increasing noise, the higher levels tested exceed typical experimental uncertainties, and the method can be expected to perform reliably under realistic noise conditions, as demonstrated in the examples using experimental datasets.
Appendix C. Application to data from literature: turbulent wake
We attempt to recover the self-similar expressions for turbulent wakes using our method and experimental data from the literature. We consider the mean velocities at five stations in the wake of a
$47\,\%$
solid slender plate, as measured by Cimbala et al. (Reference Cimbala, Nagib and Roshko1988), in turbulent conditions (
$\textit{Re}=5000$
) (see figures 8
a and 8
b). Figure 8(c) shows the collapse that is obtained via the proposed method, assuming transformations of the form
$\tilde {y} = \alpha (x) y$
and
$\tilde {u} = \beta (x) \bar {u} + \gamma (x)$
. The identified transformations are regressed using
$\texttt {PySR}$
, with
$\alpha = \alpha (x, y_{w}, y_{1/2})$
(with
$y_w(x)$
defined as
$\bar {u}(x,\pm y_{w}) = 0.99 U_\infty$
),
$\beta =\beta (U_\infty , \bar {u}_{\textit{cntr}}, x, \nu )$
and
$\gamma =\gamma (U_\infty , \bar {u}_{\textit{cntr}}, x, \nu )$
. The library of operators consists of the four basic mathematical operations (
$+,-,\times ,\div$
). The interpreted similarity transformations are


which match (ignoring the arbitrary multiplicative constant) the self-similar expressions for turbulent wakes found in the literature (Tennekes & Lumley Reference Tennekes and Lumley1972; Townsend Reference Townsend1976; Pope Reference Pope2000).
Table 1. Data-driven identification of self-similarity in the Blasius boundary layer. Identified transformations for different number of available profiles (stations).

Table 2. Data-driven identification of self-similarity in the Blasius boundary layer. Identified transformations for different number of measurements available at each station.

Table 3. Data-driven identification of self-similarity in the Blasius boundary layer. Identified transformations for different discretisations of the transformed coordinates grid.

Table 4. Data-driven identification of self-similarity in the Blasius boundary layer. Identified transformations for different levels of added noise.


Figure 7. Data-driven identification of self-similarity in the Blasius boundary layer with added noise. Top row: input data, bottom row: input data collapsed with the algorithmically identified scalings (table 4). Panels show (a,d)
$\epsilon =0.001$
, (b,e)
$\epsilon =0.01$
, (c,f)
$\epsilon =0.1$
.

Figure 8. Data-driven identification of self-similarity in the wake of a porous plate. (a) Smoke-wire visualisation of a porous plate wake. Figure adapted from Cimbala et al. (Reference Cimbala, Nagib and Roshko1988). (b) Mean streamwise velocity profiles at different locations downstream of the plate. Experimental data extracted from Cimbala et al. (Reference Cimbala, Nagib and Roshko1988). (c) Algorithmically collapsed velocity profiles.