Impact Statement
Given partial information on the solution of an equation, can we infer the full solution? Current data-driven methods that perform this task are limited to steady problems, or they require access to the full solution for training. We propose the physics-constrained convolutional neural network (PC-CNN) to tackle spatiotemporal nonlinear partial differential equations, with applications to fluid mechanics and transport phenomena.
1. Introduction
Physical and engineering phenomena can be modeled by nonlinear partial differential equations, which may vary both in space and time. In partial differential equations, on the one hand, the goal of the forward problem is to predict the output (e.g., the evolution of the system) given inputs on the initial conditions, system parameters, and boundary conditions. On the other hand, the goal of the inverse problem is to compute the input that corresponds to a given set of outputs. Whereas forward problems are typically well-posed, inverse problems are not. The overarching objective of this paper is to propose a method—the physics-constrained convolutional neural networks (PC-CNN)—to solve two types of inverse problems of engineering interest: (i) uncovering the solution to the partial differential equation that models the system under investigation from biased data; and (ii) the reconstruction of the solution on a high-resolution grid from sparse information.
The first inverse problem concerns error removal, which, in the literature, typically considers only the case of aleatoric noise-removal, which is a term attributed to methods that remove small, unbiased stochastic variations from the underlying solution. This has been achieved through various methods, including filtering methods (Kumar and Nachamai, Reference Kumar and Nachamai2017); proper orthogonal decomposition (Mendez et al., Reference Mendez, Raiola, Masullo, Discetti, Ianiro, Theunissen and Buchlin2017; Raiola et al., Reference Raiola, Discetti and Ianiro2015); and the use of autoencoders (Vincent et al., Reference Vincent, Larochelle, Lajoie, Bengio and Manzagol2010), to name only a few. Our first demonstration of the proposed physics-constrained convolutional neural network aims, instead, to formalize and solve an inverse problem to remove large-amplitude bias (systematic error) from data; ultimately, uncovering the true solution of the partial differential equation.
The second inverse problem concerns the reconstruction of solutions to partial differential equations from sparse information, which describes only a small portion of the solution and is a challenge for system identification [e.g., (Brunton et al., Reference Brunton, Proctor and Kutz2016)]. Current methods primarily make use of convolutional neural networks due to their ability to exploit spatial correlations. The current data-driven approaches need access to the full high-resolution samples in the training, which are needed to produce a parametric mapping for the reconstruction task [e.g., (Dong et al., Reference Dong, Loy, He and Tang2014; Shi et al., Reference Shi, Caballero, Huszár, Totz, Aitken, Bishop, Rueckert and Wang2016; Yang et al., Reference Yang, Zhang, Tian, Wang, Xue and Liao2019; Liu et al., Reference Liu, Tang, Huang and Lu2020)]. In many cases, access to the high-resolution samples is limited, on which the method proposed in this paper does not rely.
Common to both inverse problems that we tackle in this paper is the goal of finding a mapping from some known quantity to the corresponding solution of the governing partial differential equation. By exploiting the universal function approximation property of neural networks, it is possible to obtain numerical surrogates for function mappings or operators (Hornik et al., Reference Hornik, Stinchcombe and White1989; Zhou, Reference Zhou2020). For a network parameterized by weights, there exists an optimal set of weights that results in the desired mapping; the challenge is to realize these through an optimization process. This optimization is formulated as the minimization of a loss function, the evaluation of which offers a distance metric to the desired solution. When considering forms of structured data, convolutional neural networks excel due to their ability to exploit spatial correlations in the data, which arise from the use of localized kernel operations.
With physical systems governed by partial differential equations, the problems encountered in the absence of ground-truth high-resolution observations can be mitigated by employing a physics-constrained approach by imposing prior knowledge of the governing equations [e.g., (Lagaris et al., Reference Lagaris, Likas and Fotiadis1998; Raissi et al., Reference Raissi, Perdikaris and Karniadakis2019; Doan et al., Reference Doan, Polifke and Magri2021)]. One such tool is that of Physics-informed neural networks (PINNs) (Raissi et al., Reference Raissi, Perdikaris and Karniadakis2019), which provide a tool for physically-motivated problems, exploiting automatic differentiation to constrain the governing equations. Although the use of PINNs for inverse problems shows promising results for simple systems (Eivazi and Vinuesa, Reference Eivazi and Vinuesa2022), they remain challenging to train and are not designed to exploit spatial correlations (Grossmann et al., Reference Grossmann, Komorowska, Latz and Schönlieb2023; Krishnapriyan et al., Reference Krishnapriyan, Gholami, Zhe, Kirby and Mahoney2021). On the other hand, convolutional neural networks are designed to exploit spatial correlations, but they cannot naturally leverage automatic differentiation to evaluate the physical loss, as PINNs do, because they provide a mapping between states, rather than a mapping from the spatiotemporal coordinates to states as in PINNs. As such, the design of physics-informed convolutional neural networks is more challenging and relies on finite-different approximations or differentiable solvers (Gao et al., Reference Gao, Sun and Wang2021; Kelshaw and Magri, Reference Kelshaw and Magri2023b). For example, the authors of (Gao et al., Reference Gao, Sun and Wang2021) show results for a steady flow field (with no temporal dynamics), which produces a mapping for stationary solutions of the Navier–Stokes equations.
In this work, we propose a physics-constrained convolutional neural network with time-windowing suitable for handling a range of physically motivated inverse problems. Imposing prior knowledge of physics allows us to restrict the function space and use limited collocation points to find a solution. We emphasize that this is not a PINN approach (Raissi et al., Reference Raissi, Perdikaris and Karniadakis2019) because we do not leverage automatic differentiation to obtain gradients of outputs with respect to inputs to constrain the physics. This network employs a time-windowing scheme to effectively compute the time-dependent residuals without resorting to recurrent-based network architectures. Realizations of the network that do not conform to the given governing equations are penalised, which ensures that the network learns to produce physical predictions.
We first provide the necessary background in Section 2, and introduce the two physically motivated inverse problems. A brief background on convolutional neural networks is provided in Section 3, before introducing the proposed physics-constrained convolutional neural networks in Section 4. Here, we outline the generic approach before discussing specifics for each of the inverse problems discussed in the paper. In Section 5 we discuss data generation, and how employing a pseudospectral solver can be used to further constrain physical properties of the system. Finally, we showcase the results for each method. Results for the bias-removal task are shown in Section 6, where the method is demonstrated on three partial differential equations of increasing complexity: linear convection-diffusion, nonlinear convection-diffusion, and the Navier–Stokes equations. Results for the reconstruction task are shown in Section 7, in which results are analyzed on the turbulent chaotic Navier–Stokes equations. Conclusions in Section 8 mark the end of the paper.
2. Background
We consider dynamical systems, which are governed by partial differential equations (PDEs) of the form
 $$ \mathcal{R}\left(\overset{\sim }{u};p\right)\equiv {\partial}_t\overset{\sim }{u}\left(x,t\right)-\mathcal{N}\left(\overset{\sim }{u};p\right), $$
$$ \mathcal{R}\left(\overset{\sim }{u};p\right)\equiv {\partial}_t\overset{\sim }{u}\left(x,t\right)-\mathcal{N}\left(\overset{\sim }{u};p\right), $$
where 
 $ x\in \varOmega \subset {\mathbb{R}}^n $
 is the spatial location;
$ x\in \varOmega \subset {\mathbb{R}}^n $
 is the spatial location; 
 $ n $
 is the space dimension; and
$ n $
 is the space dimension; and 
 $ \varOmega $
 is the spatial domain. Time is denoted by
$ \varOmega $
 is the spatial domain. Time is denoted by 
 $ t\in \left[0,T\right]\subset {\mathcal{R}}_{\ge 0} $
;
$ t\in \left[0,T\right]\subset {\mathcal{R}}_{\ge 0} $
; 
 $ \overset{\sim }{u}\left(x,t\right):\varOmega \times \left[0,T\right]\to {\mathbb{R}}^m $
 is the state, where
$ \overset{\sim }{u}\left(x,t\right):\varOmega \times \left[0,T\right]\to {\mathbb{R}}^m $
 is the state, where 
 $ m $
 is the number of components; and
$ m $
 is the number of components; and 
 $ \boldsymbol{p} $
 is a vector that encapsulates the physical parameters of the system. A sufficiently smooth differential operator is denoted by
$ \boldsymbol{p} $
 is a vector that encapsulates the physical parameters of the system. A sufficiently smooth differential operator is denoted by 
 $ \mathbb{N} $
, which represents the nonlinear part of the system of
$ \mathbb{N} $
, which represents the nonlinear part of the system of 
 $ m $
 partial differential equations;
$ m $
 partial differential equations; 
 $ \mathbb{R} $
 is the residual; and
$ \mathbb{R} $
 is the residual; and 
 $ {\partial}_t $
 is the partial derivative with respect to time. A solution
$ {\partial}_t $
 is the partial derivative with respect to time. A solution 
 $ \boldsymbol{u} $
 of the PDE Eqn. (1) is the state that makes the residual vanish, i.e.
$ \boldsymbol{u} $
 of the PDE Eqn. (1) is the state that makes the residual vanish, i.e. 
 $ \boldsymbol{u} $
 is such that
$ \boldsymbol{u} $
 is such that 
 $ \mathcal{R}\left(u;p\right)=0 $
, with Eq. (1) being subject to prescribed boundary conditions on the domain boundary,
$ \mathcal{R}\left(u;p\right)=0 $
, with Eq. (1) being subject to prescribed boundary conditions on the domain boundary, 
 $ \partial \varOmega $
, and initial conditions at
$ \partial \varOmega $
, and initial conditions at 
 $ t=0 $
.
$ t=0 $
.
2.1. Uncovering solutions from biased data
We first consider the inverse problem of uncovering the solution of a PDE (Eq. 1) from biased data. We assume that we have biased information on the system’s state
 $$ \boldsymbol{\zeta} \left(\boldsymbol{x},t\right)=\boldsymbol{u}\left(\boldsymbol{x},t\right)+\boldsymbol{\phi} \left(\boldsymbol{x}\right), $$
$$ \boldsymbol{\zeta} \left(\boldsymbol{x},t\right)=\boldsymbol{u}\left(\boldsymbol{x},t\right)+\boldsymbol{\phi} \left(\boldsymbol{x}\right), $$
where 
 $ \boldsymbol{\phi} $
 is a bias, which in practice, is spatially varying (systematic error) that can be caused by miscalibrated sensors (Sciacchitano et al., Reference Sciacchitano, Neal, Smith, Warner, Vlachos, Wieneke and Scarano2015), or modelling assumptions (Nóvoa and Magri, Reference Nóvoa and Magri2022). The quantity
$ \boldsymbol{\phi} $
 is a bias, which in practice, is spatially varying (systematic error) that can be caused by miscalibrated sensors (Sciacchitano et al., Reference Sciacchitano, Neal, Smith, Warner, Vlachos, Wieneke and Scarano2015), or modelling assumptions (Nóvoa and Magri, Reference Nóvoa and Magri2022). The quantity 
 $ \boldsymbol{\zeta} \left(\boldsymbol{x},t\right) $
 in Eq. (2) constitutes the biased dataset, which is not a solution of the PDE Eq. (1), i.e.
$ \boldsymbol{\zeta} \left(\boldsymbol{x},t\right) $
 in Eq. (2) constitutes the biased dataset, which is not a solution of the PDE Eq. (1), i.e.
 $$ \mathcal{R}\left(\boldsymbol{\zeta} \left(\boldsymbol{x},t\right);\boldsymbol{p}\right)\ne 0. $$
$$ \mathcal{R}\left(\boldsymbol{\zeta} \left(\boldsymbol{x},t\right);\boldsymbol{p}\right)\ne 0. $$
Given the biased information, 
 $ \boldsymbol{\zeta} $
, our goal is to uncover the correct solution to the governing equations,
$ \boldsymbol{\zeta} $
, our goal is to uncover the correct solution to the governing equations, 
 $ \boldsymbol{u} $
, which is referred to as the true state. Computationally, we seek a mapping,
$ \boldsymbol{u} $
, which is referred to as the true state. Computationally, we seek a mapping, 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
, such that
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
, such that
 $$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}:\boldsymbol{\zeta} \left(\varOmega, t\right)\mapsto \boldsymbol{u}\left(\varOmega, t\right), $$
$$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}:\boldsymbol{\zeta} \left(\varOmega, t\right)\mapsto \boldsymbol{u}\left(\varOmega, t\right), $$
where the domain 
 $ \Omega $
 is discretized on a uniform, structured grid
$ \Omega $
 is discretized on a uniform, structured grid 
 $ \varOmega \subset {\mathrm{\mathbb{R}}}^{N^n} $
. We choose the mapping
$ \varOmega \subset {\mathrm{\mathbb{R}}}^{N^n} $
. We choose the mapping 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 to be a convolutional neural network, which depends on trainable parameters
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 to be a convolutional neural network, which depends on trainable parameters 
 $ \boldsymbol{\theta} $
 (Section 3). The nature of the partial differential equations has a computational implication. For a linear partial differential equation, the residual is an explicit function of the bias, i.e.,
$ \boldsymbol{\theta} $
 (Section 3). The nature of the partial differential equations has a computational implication. For a linear partial differential equation, the residual is an explicit function of the bias, i.e., 
 $ \mathcal{R}\left(\zeta \left(x,t\right);p\right)=\mathcal{R}\left(\phi \left(x,t\right);p\right) $
. This does not hold true for nonlinear systems, which makes the problem of removing the bias more challenging. This is discussed in Section 7. Figure 1(a) provides an overview of the inverse problem of uncovering solutions from biased data.
$ \mathcal{R}\left(\zeta \left(x,t\right);p\right)=\mathcal{R}\left(\phi \left(x,t\right);p\right) $
. This does not hold true for nonlinear systems, which makes the problem of removing the bias more challenging. This is discussed in Section 7. Figure 1(a) provides an overview of the inverse problem of uncovering solutions from biased data.

Figure 1. Inverse problems investigated in this paper. (a) Uncovering solutions from biased data. The model 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 is responsible for recovering the solution (true state),
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 is responsible for recovering the solution (true state), 
 $ \boldsymbol{u}\left(\varOmega, t\right) $
, from the biased data,
$ \boldsymbol{u}\left(\varOmega, t\right) $
, from the biased data, 
 $ \boldsymbol{\zeta} \left(\varOmega, t\right) $
. The bias (systematic error),
$ \boldsymbol{\zeta} \left(\varOmega, t\right) $
. The bias (systematic error), 
 $ \boldsymbol{\phi} \left(\boldsymbol{x}\right) $
, is the difference between the biased data and the solution. (b) Reconstructing a solution from sparse information. The model
$ \boldsymbol{\phi} \left(\boldsymbol{x}\right) $
, is the difference between the biased data and the solution. (b) Reconstructing a solution from sparse information. The model 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 is responsible for mapping the sparse field
$ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 is responsible for mapping the sparse field 
 $ \boldsymbol{u}\left({\varOmega}_L,t\right) $
 to the high-resolution field
$ \boldsymbol{u}\left({\varOmega}_L,t\right) $
 to the high-resolution field 
 $ \boldsymbol{u}\left({\varOmega}_H,t\right) $
. The term
$ \boldsymbol{u}\left({\varOmega}_H,t\right) $
. The term 
 $ \tau $
 in both cases denotes the number of contiguous time-steps passed to the network, required for computing temporal derivatives. An explanation of the proposed physics-constrained convolutional neural network (PC-CNN), which is the ansatz for both mappings
$ \tau $
 in both cases denotes the number of contiguous time-steps passed to the network, required for computing temporal derivatives. An explanation of the proposed physics-constrained convolutional neural network (PC-CNN), which is the ansatz for both mappings 
 $ {\boldsymbol{\eta}}_{\theta } $
 and
$ {\boldsymbol{\eta}}_{\theta } $
 and 
 $ {\boldsymbol{f}}_{\theta } $
, is provided in Section 4.
$ {\boldsymbol{f}}_{\theta } $
, is provided in Section 4.
2.2. Reconstruction from sparse information
 We formalize the inverse problem of reconstructing the solution from sparse information. Given sparse information, 
 $ \boldsymbol{u}\left({\varOmega}_L,t\right) $
, we aim to reconstruct the solution to the partial differential equation on a fine grid,
$ \boldsymbol{u}\left({\varOmega}_L,t\right) $
, we aim to reconstruct the solution to the partial differential equation on a fine grid, 
 $ \boldsymbol{u}\left({\varOmega}_H,t\right) $
, where the domain
$ \boldsymbol{u}\left({\varOmega}_H,t\right) $
, where the domain 
 $ \varOmega $
 is discretised on low- and high-resolution uniform grids,
$ \varOmega $
 is discretised on low- and high-resolution uniform grids, 
 $ {\varOmega}_L\subset {\mathrm{\mathbb{R}}}^{N^n} $
 and
$ {\varOmega}_L\subset {\mathrm{\mathbb{R}}}^{N^n} $
 and 
 $ {\varOmega}_H\subset {\mathrm{\mathbb{R}}}^{M^n} $
, respectively, such that
$ {\varOmega}_H\subset {\mathrm{\mathbb{R}}}^{M^n} $
, respectively, such that 
 $ {\varOmega}_L\cap {\varOmega}_H={\varOmega}_L $
;
$ {\varOmega}_L\cap {\varOmega}_H={\varOmega}_L $
; 
 $ M=\kappa N $
; and
$ M=\kappa N $
; and 
 $ \kappa \in {\mathrm{\mathbb{N}}}^{+} $
 is the scale factor. Our objective is to find
$ \kappa \in {\mathrm{\mathbb{N}}}^{+} $
 is the scale factor. Our objective is to find 
 $ \boldsymbol{u}\left({\varOmega}_H,t\right) $
 given the sparse information
$ \boldsymbol{u}\left({\varOmega}_H,t\right) $
 given the sparse information 
 $ \boldsymbol{u}\left({\varOmega}_L,t\right) $
. We achieve this by seeking a parameterized mapping
$ \boldsymbol{u}\left({\varOmega}_L,t\right) $
. We achieve this by seeking a parameterized mapping 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 such that
$ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 such that
 $$ {\boldsymbol{f}}_{\boldsymbol{\theta}}:\boldsymbol{u}\left({\varOmega}_L,t\right)\to \boldsymbol{u}\left({\varOmega}_H,t\right), $$
$$ {\boldsymbol{f}}_{\boldsymbol{\theta}}:\boldsymbol{u}\left({\varOmega}_L,t\right)\to \boldsymbol{u}\left({\varOmega}_H,t\right), $$
where the mapping 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 is a convolutional neural network parameterised by
$ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 is a convolutional neural network parameterised by 
 $ \boldsymbol{\theta} $
Footnote 
1. We consider the solution
$ \boldsymbol{\theta} $
Footnote 
1. We consider the solution 
 $ \boldsymbol{u} $
 to be discretised with
$ \boldsymbol{u} $
 to be discretised with 
 $ {N}_t $
 times steps. Figure 1(b) provides an overview of the reconstruction task.
$ {N}_t $
 times steps. Figure 1(b) provides an overview of the reconstruction task.
3. Convolutional neural networks
 The mappings 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 and
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 and 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 (Eqs. (4, 5) respectively) are design choices. For data from spatiotemporal partial differential equations, we look for a tool that: (i) can, in principle, learn any continuous functions, i.e., is a universal approximator; (ii) learn spatial correlations, as partial differential equations are defined by local operators; (iii) do not violate the conservation laws imposed by the PDE (1); and (iv) can handle sequential data for the temporal dynamics. In this paper, we choose convolutional neural networks (LeCun and Bengio, Reference LeCun and Bengio1998), which are suitable tools for learning datasets of partial differential equations, from fluid mechanics (Gao et al., Reference Gao, Sun and Wang2021; Kelshaw and Magri, Reference Kelshaw and Magri2023a; Murata et al., Reference Murata, Fukami and Fukagata2020) to heat transfer (Edalatifar et al., Reference Edalatifar, Tavakoli, Ghalambaz and Setoudeh2021; Kim and Lee, Reference Kim and Lee2020). Convolutional neural networks naturally fulfill requirements (i, ii), but do not fulfill (iii, iv). In this paper, we propose the physics-constrained convolutional neural network (PC-CNN) to fulfill all requirements (i–iv). A convolutional neural network,
$ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 (Eqs. (4, 5) respectively) are design choices. For data from spatiotemporal partial differential equations, we look for a tool that: (i) can, in principle, learn any continuous functions, i.e., is a universal approximator; (ii) learn spatial correlations, as partial differential equations are defined by local operators; (iii) do not violate the conservation laws imposed by the PDE (1); and (iv) can handle sequential data for the temporal dynamics. In this paper, we choose convolutional neural networks (LeCun and Bengio, Reference LeCun and Bengio1998), which are suitable tools for learning datasets of partial differential equations, from fluid mechanics (Gao et al., Reference Gao, Sun and Wang2021; Kelshaw and Magri, Reference Kelshaw and Magri2023a; Murata et al., Reference Murata, Fukami and Fukagata2020) to heat transfer (Edalatifar et al., Reference Edalatifar, Tavakoli, Ghalambaz and Setoudeh2021; Kim and Lee, Reference Kim and Lee2020). Convolutional neural networks naturally fulfill requirements (i, ii), but do not fulfill (iii, iv). In this paper, we propose the physics-constrained convolutional neural network (PC-CNN) to fulfill all requirements (i–iv). A convolutional neural network, 
 $ {\boldsymbol{k}}_{\boldsymbol{\theta}} $
, is a composition of functions
$ {\boldsymbol{k}}_{\boldsymbol{\theta}} $
, is a composition of functions
 $$ {\boldsymbol{k}}_{\boldsymbol{\theta}}={\boldsymbol{k}}_{{\boldsymbol{\theta}}_Q}^Q\circ \cdots \circ h\left({\boldsymbol{k}}_{{\boldsymbol{\theta}}_2}^2\right)\circ h\left({\boldsymbol{k}}_{{\boldsymbol{\theta}}_1}^1\right), $$
$$ {\boldsymbol{k}}_{\boldsymbol{\theta}}={\boldsymbol{k}}_{{\boldsymbol{\theta}}_Q}^Q\circ \cdots \circ h\left({\boldsymbol{k}}_{{\boldsymbol{\theta}}_2}^2\right)\circ h\left({\boldsymbol{k}}_{{\boldsymbol{\theta}}_1}^1\right), $$
where 
 $ {\boldsymbol{k}}_{{\boldsymbol{\theta}}_i}^i $
 denote discrete convolutional layers;
$ {\boldsymbol{k}}_{{\boldsymbol{\theta}}_i}^i $
 denote discrete convolutional layers; 
 $ h $
 is an element-wise nonlinear activation function, which increases the expressive power of the network; and
$ h $
 is an element-wise nonlinear activation function, which increases the expressive power of the network; and 
 $ Q $
 is the number of layers. The convolutional layersFootnote 
2 are responsible for the discrete operation
$ Q $
 is the number of layers. The convolutional layersFootnote 
2 are responsible for the discrete operation 
 $ {\boldsymbol{k}}_{\boldsymbol{\theta}}:\left(\boldsymbol{x};\boldsymbol{w},\boldsymbol{b}\right)\mapsto \boldsymbol{w}\ast \boldsymbol{d}+\boldsymbol{b} $
, where
$ {\boldsymbol{k}}_{\boldsymbol{\theta}}:\left(\boldsymbol{x};\boldsymbol{w},\boldsymbol{b}\right)\mapsto \boldsymbol{w}\ast \boldsymbol{d}+\boldsymbol{b} $
, where 
 $ \boldsymbol{\theta} =\left(\boldsymbol{w},\boldsymbol{b}\right) $
,
$ \boldsymbol{\theta} =\left(\boldsymbol{w},\boldsymbol{b}\right) $
, 
 $ \ast $
 is the convolution operation,
$ \ast $
 is the convolution operation, 
 $ \boldsymbol{x} $
 is the data,
$ \boldsymbol{x} $
 is the data, 
 $ \boldsymbol{w} $
 are the trainable weights of the kernel, and
$ \boldsymbol{w} $
 are the trainable weights of the kernel, and 
 $ \boldsymbol{b} $
 are the trainable biases. As the kernel operates locally around each pixel, information is leveraged from the surrounding grid cells. For a pedagogical explanation about CNNs the reader is referred to (Magri, Reference Magri2023).
$ \boldsymbol{b} $
 are the trainable biases. As the kernel operates locally around each pixel, information is leveraged from the surrounding grid cells. For a pedagogical explanation about CNNs the reader is referred to (Magri, Reference Magri2023).
4. Physics-constrained convolutional neural networks (PC-CNN)
We propose a physics-constrained convolutional neural network (PC-CNN) with time-windowing to allow for processing of sequential data. In this section, we describe the loss functions for the tasks of uncovering solutions from biased data and reconstruction from sparse information (Sections 2.1, 2.2, respectively), and the time windowing for computing the time derivatives required in the physics-constraint.
4.1. Losses for training
 The mappings 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 and
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 and 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 are parameterised as convolutional neural networks with the ansatz of equation (6). Training these networks is an optimization problem, which seeks an optimal set of parameters
$ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 are parameterised as convolutional neural networks with the ansatz of equation (6). Training these networks is an optimization problem, which seeks an optimal set of parameters 
 $ {\boldsymbol{\theta}}^{\ast } $
 such that
$ {\boldsymbol{\theta}}^{\ast } $
 such that
 $$ {\theta}^{\ast }=\underset{\theta }{\mathrm{argmin}}{\mathcal{L}}_{\theta }\ \mathrm{where}\hskip1em {\mathcal{L}}_{\theta }={\lambda}_D{\mathcal{L}}_{\mathcal{D}}+{\lambda}_P{\mathcal{L}}_{\mathcal{P}}+{\lambda}_C{\mathcal{L}}_C, $$
$$ {\theta}^{\ast }=\underset{\theta }{\mathrm{argmin}}{\mathcal{L}}_{\theta }\ \mathrm{where}\hskip1em {\mathcal{L}}_{\theta }={\lambda}_D{\mathcal{L}}_{\mathcal{D}}+{\lambda}_P{\mathcal{L}}_{\mathcal{P}}+{\lambda}_C{\mathcal{L}}_C, $$
where the loss function 
 $ {\mathcal{L}}_{\theta } $
 consists of:
$ {\mathcal{L}}_{\theta } $
 consists of: 
 $ {\unicode{x1D543}}_D $
, which quantifies the error between the data and the prediction (“data loss”);
$ {\unicode{x1D543}}_D $
, which quantifies the error between the data and the prediction (“data loss”); 
 $ {\unicode{x1D543}}_P $
, which is responsible for quantifying how unphysical the predictions are (“physics loss”); and
$ {\unicode{x1D543}}_P $
, which is responsible for quantifying how unphysical the predictions are (“physics loss”); and 
 $ {\unicode{x1D543}}_C $
, which embeds any further required constraints (“constraint loss”). Each loss is weighted by a non-negative regularization factor,
$ {\unicode{x1D543}}_C $
, which embeds any further required constraints (“constraint loss”). Each loss is weighted by a non-negative regularization factor, 
 $ {\lambda}_{\left(\cdot \right)} $
, which is an empirical hyperparameter. The definitions of the losses and regularization factors are task dependent; that is, we must design a suitable loss function for each inverse problem we solve.
$ {\lambda}_{\left(\cdot \right)} $
, which is an empirical hyperparameter. The definitions of the losses and regularization factors are task dependent; that is, we must design a suitable loss function for each inverse problem we solve.
4.1.1. Losses for uncovering solutions from biased data
 Given biased data 
 $ \boldsymbol{\zeta} \left(\varOmega, t\right) $
, we define the loss terms for the task of uncovering solutions as
$ \boldsymbol{\zeta} \left(\varOmega, t\right) $
, we define the loss terms for the task of uncovering solutions as
 $$ {\mathrm{\mathcal{L}}}_D=\frac{1}{N_t}\sum \limits_{i=0}^{N_t-1}\parallel {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\partial \varOmega, {t}_i\right)\right)-\boldsymbol{u}\left(\partial \varOmega, {t}_i\right){\parallel}_{\partial \varOmega}^2, $$
$$ {\mathrm{\mathcal{L}}}_D=\frac{1}{N_t}\sum \limits_{i=0}^{N_t-1}\parallel {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\partial \varOmega, {t}_i\right)\right)-\boldsymbol{u}\left(\partial \varOmega, {t}_i\right){\parallel}_{\partial \varOmega}^2, $$
 $$ {\mathrm{\mathcal{L}}}_P=\frac{1}{N_t}\sum \limits_{i=0}^{N_t-1}\parallel \mathcal{R}\left({\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\zeta \left(\varOmega, {t}_i\right)\right);\boldsymbol{p}\right){\parallel}_{\varOmega}^2, $$
$$ {\mathrm{\mathcal{L}}}_P=\frac{1}{N_t}\sum \limits_{i=0}^{N_t-1}\parallel \mathcal{R}\left({\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\zeta \left(\varOmega, {t}_i\right)\right);\boldsymbol{p}\right){\parallel}_{\varOmega}^2, $$
 $$ {\mathcal{L}}_C=\frac{1}{N_t}{\sum}_{i=0}^{N_t-1}\parallel {\partial}_t\left[\zeta \left(\varOmega, {t}_i\right)-{\eta}_{\theta}\left(\zeta \left(\varOmega, {t}_i\right)\right)\right]{\parallel}_{\varOmega}^2, $$
$$ {\mathcal{L}}_C=\frac{1}{N_t}{\sum}_{i=0}^{N_t-1}\parallel {\partial}_t\left[\zeta \left(\varOmega, {t}_i\right)-{\eta}_{\theta}\left(\zeta \left(\varOmega, {t}_i\right)\right)\right]{\parallel}_{\varOmega}^2, $$
 where 
 $ \partial \varOmega $
 denotes boundary points of the grid, and
$ \partial \varOmega $
 denotes boundary points of the grid, and 
 $ \parallel \hskip0.33em \cdot \hskip0.33em {\parallel}_{\varOmega } $
 is the
$ \parallel \hskip0.33em \cdot \hskip0.33em {\parallel}_{\varOmega } $
 is the 
 $ {\mathrm{\ell}}^2 $
-norm over the discretized domain. We impose the prior knowledge that we have on the dynamical system by defining the physics loss,
$ {\mathrm{\ell}}^2 $
-norm over the discretized domain. We impose the prior knowledge that we have on the dynamical system by defining the physics loss, 
 $ {\unicode{x1D543}}_P $
, which penalises network parameters that yield predictions that violate the governing equations (1)
Footnote 
3. Mathematically, this means that in the absence of observations (i.e., data), the physics loss
$ {\unicode{x1D543}}_P $
, which penalises network parameters that yield predictions that violate the governing equations (1)
Footnote 
3. Mathematically, this means that in the absence of observations (i.e., data), the physics loss 
 $ {\unicode{x1D543}}_P $
 alone does not provide a unique solution. By augmenting the physics loss Eqn. (9) with the data loss,
$ {\unicode{x1D543}}_P $
 alone does not provide a unique solution. By augmenting the physics loss Eqn. (9) with the data loss, 
 $ {\unicode{x1D543}}_D $
 (see equations (7), (8)), we ensure that network realizations conform to the observations (data) on the boundaries whilst fulfilling the governing equations, e.g., conservation laws. In the case of this inverse problem, the constraint loss
$ {\unicode{x1D543}}_D $
 (see equations (7), (8)), we ensure that network realizations conform to the observations (data) on the boundaries whilst fulfilling the governing equations, e.g., conservation laws. In the case of this inverse problem, the constraint loss 
 $ {\mathrm{\mathcal{L}}}_{\mathbf{C}} $
 Eqn. (10) prevents the bias
$ {\mathrm{\mathcal{L}}}_{\mathbf{C}} $
 Eqn. (10) prevents the bias 
 $ \boldsymbol{\phi} =\boldsymbol{\phi} (x) $
 from temporally changing. Penalizing network realizations that do not provide a unique bias helps stabilize training and drive predictions away from trivial solutions, such as the stationary solution
$ \boldsymbol{\phi} =\boldsymbol{\phi} (x) $
 from temporally changing. Penalizing network realizations that do not provide a unique bias helps stabilize training and drive predictions away from trivial solutions, such as the stationary solution 
 $ \boldsymbol{u}\left(\varOmega, t\right)=0 $
. Although we could use collocation points within the domain (as well as on the boundary), we show that our method can uncover the solution of the PDE with samples on the boundary only. Spatial sparsity could be increased by only using a subset of samples on the boundary, but for this task, we use information from all the boundary nodes.
$ \boldsymbol{u}\left(\varOmega, t\right)=0 $
. Although we could use collocation points within the domain (as well as on the boundary), we show that our method can uncover the solution of the PDE with samples on the boundary only. Spatial sparsity could be increased by only using a subset of samples on the boundary, but for this task, we use information from all the boundary nodes.
4.1.2. Losses for reconstruction from sparse information
 Given low-resolution observations 
 $ \boldsymbol{u}\left({\varOmega}_L,t\right) $
, we define the loss terms for the reconstruction from sparse information as
$ \boldsymbol{u}\left({\varOmega}_L,t\right) $
, we define the loss terms for the reconstruction from sparse information as
 $$ {\displaystyle \begin{array}{l}{\mathrm{\mathcal{L}}}_D=\frac{1}{N_t}\sum \limits_{i=0}^{N_t-1}\parallel {\left.{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_i\right)\right)\right|}^{\varOmega_L}-\boldsymbol{u}\left({\varOmega}_L,{t}_i\right){\parallel}_{\varOmega_L}^2,\\ {}{\mathrm{\mathcal{L}}}_P=\frac{1}{N_t}\sum \limits_{i=0}^{N_t-1}\parallel \mathcal{R}\left({\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_i\right)\right);\boldsymbol{p}\right){\parallel}_{\varOmega_H}^2,\end{array}} $$
$$ {\displaystyle \begin{array}{l}{\mathrm{\mathcal{L}}}_D=\frac{1}{N_t}\sum \limits_{i=0}^{N_t-1}\parallel {\left.{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_i\right)\right)\right|}^{\varOmega_L}-\boldsymbol{u}\left({\varOmega}_L,{t}_i\right){\parallel}_{\varOmega_L}^2,\\ {}{\mathrm{\mathcal{L}}}_P=\frac{1}{N_t}\sum \limits_{i=0}^{N_t-1}\parallel \mathcal{R}\left({\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_i\right)\right);\boldsymbol{p}\right){\parallel}_{\varOmega_H}^2,\end{array}} $$
where 
 $ {\left.{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\cdot \right)\right|}^{\varOmega_L} $
 denotes the corestriction of
$ {\left.{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\cdot \right)\right|}^{\varOmega_L} $
 denotes the corestriction of 
 $ {\varOmega}_H $
 on
$ {\varOmega}_H $
 on 
 $ {\varOmega}_L $
. In order to find an optimal set of parameters
$ {\varOmega}_L $
. In order to find an optimal set of parameters 
 $ {\boldsymbol{\theta}}^{\ast } $
, the loss is designed to regularize predictions that do not conform to the desired output. Given sparse information
$ {\boldsymbol{\theta}}^{\ast } $
, the loss is designed to regularize predictions that do not conform to the desired output. Given sparse information 
 $ \boldsymbol{u}\left({\varOmega}_L,t\right) $
, the data loss,
$ \boldsymbol{u}\left({\varOmega}_L,t\right) $
, the data loss, 
 $ {\unicode{x1D543}}_D $
, is defined to minimise the distance between known sparse data,
$ {\unicode{x1D543}}_D $
, is defined to minimise the distance between known sparse data, 
 $ \boldsymbol{u}\left({\varOmega}_L,t\right) $
, and their corresponding predictions on the high-resolution grid,
$ \boldsymbol{u}\left({\varOmega}_L,t\right) $
, and their corresponding predictions on the high-resolution grid, 
 $ {\left.{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right)\right|}^{\varOmega_L} $
. Crucially, as a consequence of the proposed training objective, we do not need the high-resolution field as a labelled dataset, which is required with conventional super-resolution methods (Dong et al., Reference Dong, Loy, He and Tang2014; Freeman et al., Reference Freeman, Jones and Pasztor2002; Vincent et al., Reference Vincent, Larochelle, Lajoie, Bengio and Manzagol2010).
$ {\left.{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right)\right|}^{\varOmega_L} $
. Crucially, as a consequence of the proposed training objective, we do not need the high-resolution field as a labelled dataset, which is required with conventional super-resolution methods (Dong et al., Reference Dong, Loy, He and Tang2014; Freeman et al., Reference Freeman, Jones and Pasztor2002; Vincent et al., Reference Vincent, Larochelle, Lajoie, Bengio and Manzagol2010).
4.2. Time-windowing of the physics loss
 Computing the residual of a partial differential equation is a temporal task, as shown in Eq. (1). We propose a simple time-windowing approach to allow the network to account for the temporal nature of the data. This windowing approach provides a means to compute the time derivative 
 $ {\partial}_t\boldsymbol{u} $
 required for the physics loss
$ {\partial}_t\boldsymbol{u} $
 required for the physics loss 
 $ {\unicode{x1D543}}_P $
. The network takes time-windowed samples as inputs, each sample consisting of
$ {\unicode{x1D543}}_P $
. The network takes time-windowed samples as inputs, each sample consisting of 
 $ \tau $
 sequential time steps. Precisely, we first group all time steps into non-overlapping sets of
$ \tau $
 sequential time steps. Precisely, we first group all time steps into non-overlapping sets of 
 $ \tau $
 sequential time steps to form the set
$ \tau $
 sequential time steps to form the set 
 $ \unicode{x1D54B}={\left\{\left[{t}_{i\tau},\dots, {t}_{\left(i+1\right)\tau -1}\right]\right\}}_{i=0}^{N_t\tau } $
. Elements of this set are treated as minibatches and are passed through the network to evaluate their output on which the physics loss is computed. The time derivative
$ \unicode{x1D54B}={\left\{\left[{t}_{i\tau},\dots, {t}_{\left(i+1\right)\tau -1}\right]\right\}}_{i=0}^{N_t\tau } $
. Elements of this set are treated as minibatches and are passed through the network to evaluate their output on which the physics loss is computed. The time derivative 
 $ {\partial}_t\boldsymbol{u} $
 is computed by applying a forward-Euler approximation across successive time steps. Figure 2 provides an overview of the time-batching scheme and how elements of each batch are used to compute the temporal derivative.
$ {\partial}_t\boldsymbol{u} $
 is computed by applying a forward-Euler approximation across successive time steps. Figure 2 provides an overview of the time-batching scheme and how elements of each batch are used to compute the temporal derivative.

Figure 2. Time-windowing scheme. Time-steps are first grouped into non-overlapping subsets of successive elements of length 
 $ \tau $
. Each of these subsets can be taken for either training or validation. Subsets are treated as minibatches and passed through the network to evaluate their output. The temporal derivative is then approximated using a forward-Euler approximation across adjacent time steps.
$ \tau $
. Each of these subsets can be taken for either training or validation. Subsets are treated as minibatches and passed through the network to evaluate their output. The temporal derivative is then approximated using a forward-Euler approximation across adjacent time steps.
Employing this time-windowing scheme, for the task of reconstructing from sparse information, the physics loss is
 $$ {\mathrm{\mathcal{L}}}_P=\frac{1}{\left(\tau -1\right)\mid \mathcal{T}\mid}\sum \limits_{t\in \mathcal{T}}\sum \limits_{i=0}^{\tau -1}\parallel {\partial}_t{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_{i+1}\right)\right)-\mathcal{N}\left({\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_{i+1}\right)\right);\boldsymbol{p}\right){\parallel}_{\varOmega_H}^2, $$
$$ {\mathrm{\mathcal{L}}}_P=\frac{1}{\left(\tau -1\right)\mid \mathcal{T}\mid}\sum \limits_{t\in \mathcal{T}}\sum \limits_{i=0}^{\tau -1}\parallel {\partial}_t{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_{i+1}\right)\right)-\mathcal{N}\left({\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_{i+1}\right)\right);\boldsymbol{p}\right){\parallel}_{\varOmega_H}^2, $$
whereas for the task of uncovering solutions from biased data, the physics loss is
 $$ {\mathrm{\mathcal{L}}}_P=\frac{1}{\left(\tau -1\right){N}_t}\sum \limits_{t\in \mathcal{T}}\sum \limits_{i=0}^{\tau -1}\parallel {\partial}_t{\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\varOmega, {t}_{i+1}\right)\Big)-\mathcal{N}\left({\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, {t}_{i+1}\right)\right);\boldsymbol{p}\right){\parallel}_{\varOmega}^2, $$
$$ {\mathrm{\mathcal{L}}}_P=\frac{1}{\left(\tau -1\right){N}_t}\sum \limits_{t\in \mathcal{T}}\sum \limits_{i=0}^{\tau -1}\parallel {\partial}_t{\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\varOmega, {t}_{i+1}\right)\Big)-\mathcal{N}\left({\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, {t}_{i+1}\right)\right);\boldsymbol{p}\right){\parallel}_{\varOmega}^2, $$
where 
 $ \mid \mathcal{T}\mid $
 denotes the cardinality of the set, or the number of minibatchesFootnote 
4. Predicting the high-resolution field for
$ \mid \mathcal{T}\mid $
 denotes the cardinality of the set, or the number of minibatchesFootnote 
4. Predicting the high-resolution field for 
 $ \tau $
 sequential time steps allows us to obtain the residual for the predictions in a temporally local sense, i.e., we compute the derivatives across discrete time windows rather than the entire simulation domain. The networks
$ \tau $
 sequential time steps allows us to obtain the residual for the predictions in a temporally local sense, i.e., we compute the derivatives across discrete time windows rather than the entire simulation domain. The networks 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 and
$ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 and 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 are augmented to operate on these time windows, which vectorizes the operations over the time window. For a fixed quantity of training data, the choice of
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 are augmented to operate on these time windows, which vectorizes the operations over the time window. For a fixed quantity of training data, the choice of 
 $ \tau $
 introduces a trade-off between the number of input samples
$ \tau $
 introduces a trade-off between the number of input samples 
 $ {N}_t $
, and the size of each time-window
$ {N}_t $
, and the size of each time-window 
 $ \tau $
. A larger value of
$ \tau $
. A larger value of 
 $ \tau $
 corresponds to fewer time windows. Limiting the number of time windows used for training has an adverse effect on the ability of the model to generalize; the information content of contiguous timesteps is scarcer than taking timesteps uniformly across the time domain. Although evaluating the residual across large time windows promotes numerical stability, this smooths the gradients computed in backpropagation and makes the model more difficult to train, especially in the case of chaotic systems where the dynamics might vary erratically. Upon conducting a hyperparameter study, we take a value
$ \tau $
 corresponds to fewer time windows. Limiting the number of time windows used for training has an adverse effect on the ability of the model to generalize; the information content of contiguous timesteps is scarcer than taking timesteps uniformly across the time domain. Although evaluating the residual across large time windows promotes numerical stability, this smooths the gradients computed in backpropagation and makes the model more difficult to train, especially in the case of chaotic systems where the dynamics might vary erratically. Upon conducting a hyperparameter study, we take a value 
 $ \tau =2 $
, which is also the minimum window size for computing the temporal derivative of the physics loss by finite difference. We find that this is sufficient for training the network whilst simultaneously maximizing the number of independent samples used for training. To avoid duplication of the data in the training set, we ensure that all samples are at least
$ \tau =2 $
, which is also the minimum window size for computing the temporal derivative of the physics loss by finite difference. We find that this is sufficient for training the network whilst simultaneously maximizing the number of independent samples used for training. To avoid duplication of the data in the training set, we ensure that all samples are at least 
 $ \tau $
 time-steps apart so that independent time windows are guaranteed to not contain any overlaps. The remaining loss terms operate in the same manner, i.e., over the time window as well as the conventional batch dimension.
$ \tau $
 time-steps apart so that independent time windows are guaranteed to not contain any overlaps. The remaining loss terms operate in the same manner, i.e., over the time window as well as the conventional batch dimension.
5. Synthetic data generation
 We utilize a differentiable pseudospectral spatial discretization to solve the partial differential equations of this paper (Kelshaw, Reference Kelshaw2022). The solution is computed on the spectral grid 
 $ {\hat{\varOmega}}_k\in {\mathrm{\mathbb{Z}}}^K $
, where
$ {\hat{\varOmega}}_k\in {\mathrm{\mathbb{Z}}}^K $
, where 
 $ K $
 is the number of wavenumbers. This spectral discretization enforces periodic boundary conditions on the domain boundary,
$ K $
 is the number of wavenumbers. This spectral discretization enforces periodic boundary conditions on the domain boundary, 
 $ \mathrm{\partial \Omega } $
. A solution is produced by time-integration of the dynamical system with the explicit forward-Euler scheme, in which we choose the timestep
$ \mathrm{\partial \Omega } $
. A solution is produced by time-integration of the dynamical system with the explicit forward-Euler scheme, in which we choose the timestep 
 $ \Delta t $
 to satisfy the Courant–Friedrichs–Lewy (CFL) condition (Canuto et al., Reference Canuto, Hussaini, Quarteroni and Zang1988). The initial conditions in each case are generated using the equation
$ \Delta t $
 to satisfy the Courant–Friedrichs–Lewy (CFL) condition (Canuto et al., Reference Canuto, Hussaini, Quarteroni and Zang1988). The initial conditions in each case are generated using the equation
 $$ {\displaystyle \begin{array}{l}\boldsymbol{u}\left({\hat{\varOmega}}_k,0\right)=\frac{\iota {e}^{2\pi i\boldsymbol{\unicode{x025B}}}}{\sigma \sqrt{2\pi }}{e}^{-\frac{1}{2}{\left(\frac{\mid {\hat{\varOmega}}_k\mid }{\sigma}\right)}^2}\\ {}\hskip3em \mathrm{with}\hskip1em {\boldsymbol{\unicode{x025B}}}_i\sim N\left(0,1\right),\end{array}} $$
$$ {\displaystyle \begin{array}{l}\boldsymbol{u}\left({\hat{\varOmega}}_k,0\right)=\frac{\iota {e}^{2\pi i\boldsymbol{\unicode{x025B}}}}{\sigma \sqrt{2\pi }}{e}^{-\frac{1}{2}{\left(\frac{\mid {\hat{\varOmega}}_k\mid }{\sigma}\right)}^2}\\ {}\hskip3em \mathrm{with}\hskip1em {\boldsymbol{\unicode{x025B}}}_i\sim N\left(0,1\right),\end{array}} $$
where 
 $ \iota $
 denotes the magnitude;
$ \iota $
 denotes the magnitude; 
 $ \sigma $
 denotes the standard deviation; and
$ \sigma $
 denotes the standard deviation; and 
 $ {\boldsymbol{\unicode{x025B}}}_i $
 is a sample from a unit normal distribution. Equation (14) produces a pseudo-random field scaled by the magnitude of the wavenumber
$ {\boldsymbol{\unicode{x025B}}}_i $
 is a sample from a unit normal distribution. Equation (14) produces a pseudo-random field scaled by the magnitude of the wavenumber 
 $ \mid {\hat{\varOmega}}_k\mid $
, where magnitude is computed pointwise for each wavenumber in the spectral grid. This scaling ensures that the resultant field has spatial structures of varying lengthscale. We take
$ \mid {\hat{\varOmega}}_k\mid $
, where magnitude is computed pointwise for each wavenumber in the spectral grid. This scaling ensures that the resultant field has spatial structures of varying lengthscale. We take 
 $ \iota =10,\sigma =1.2 $
 for all simulations in order to provide an initial condition, which provides numerical stability.
$ \iota =10,\sigma =1.2 $
 for all simulations in order to provide an initial condition, which provides numerical stability.
 As a consequence of the Nyquist–Shannon sampling criterion (Canuto et al., Reference Canuto, Hussaini, Quarteroni and Zang1988), the resolution of the spectral grid 
 $ {\hat{\varOmega}}_k $
 places a lower bound on the spatial resolution. For a signal containing a maximal frequency
$ {\hat{\varOmega}}_k $
 places a lower bound on the spatial resolution. For a signal containing a maximal frequency 
 $ {\omega}_{\mathrm{max}} $
, the sampling frequency
$ {\omega}_{\mathrm{max}} $
, the sampling frequency 
 $ {\omega}_s $
 must satisfy the condition
$ {\omega}_s $
 must satisfy the condition 
 $ {\omega}_{\mathrm{max}}<{\omega}_s/2 $
, therefore, we ensure that the spectral resolution satisfies
$ {\omega}_{\mathrm{max}}<{\omega}_s/2 $
, therefore, we ensure that the spectral resolution satisfies 
 $ K\le N/2 $
. Violation of this condition induces spectral aliasing, in which energy content from frequencies exceeding the Nyquist limit
$ K\le N/2 $
. Violation of this condition induces spectral aliasing, in which energy content from frequencies exceeding the Nyquist limit 
 $ {\omega}_s/2 $
 is fed back to the low frequencies, which amplifies energy content unphysically (Canuto et al., Reference Canuto, Hussaini, Quarteroni and Zang1988). To prevent aliasing, in the task of uncovering solutions from biased data, we employ a spectral grid
$ {\omega}_s/2 $
 is fed back to the low frequencies, which amplifies energy content unphysically (Canuto et al., Reference Canuto, Hussaini, Quarteroni and Zang1988). To prevent aliasing, in the task of uncovering solutions from biased data, we employ a spectral grid 
 $ {\hat{\varOmega}}_k\in {\mathrm{\mathbb{Z}}}^{32\times 32} $
, which corresponds to the physical grid
$ {\hat{\varOmega}}_k\in {\mathrm{\mathbb{Z}}}^{32\times 32} $
, which corresponds to the physical grid 
 $ \varOmega \in {\mathrm{\mathbb{R}}}^{64\times 64} $
. For the task of reconstructing the solution from sparse data, we employ a high-resolution spectral grid
$ \varOmega \in {\mathrm{\mathbb{R}}}^{64\times 64} $
. For the task of reconstructing the solution from sparse data, we employ a high-resolution spectral grid 
 $ {\hat{\varOmega}}_k\in {\mathrm{\mathbb{Z}}}^{35\times 35} $
, which corresponds to the physical grid
$ {\hat{\varOmega}}_k\in {\mathrm{\mathbb{Z}}}^{35\times 35} $
, which corresponds to the physical grid 
 $ \varOmega \in {\mathrm{\mathbb{R}}}^{70\times 70} $
. Approaching the Nyquist limit allows us to resolve the smallest structures possible without introducing spectral aliasing.
$ \varOmega \in {\mathrm{\mathbb{R}}}^{70\times 70} $
. Approaching the Nyquist limit allows us to resolve the smallest structures possible without introducing spectral aliasing.
5.1. Physics loss in the Fourier domain
 The pseudospectral discretization provides an efficient means to compute the differential operator 
 $ \mathbb{N} $
, which allows us to evaluate the physics loss
$ \mathbb{N} $
, which allows us to evaluate the physics loss 
 $ {\unicode{x1D543}}_P $
 Eqn. (11) in the Fourier domain. Computationally, we now evaluate the physics loss for the task of uncovering solutions from biased data Eqn. (12) as
$ {\unicode{x1D543}}_P $
 Eqn. (11) in the Fourier domain. Computationally, we now evaluate the physics loss for the task of uncovering solutions from biased data Eqn. (12) as
 $$ {\displaystyle \begin{array}{l}{\mathrm{\mathcal{L}}}_P=\frac{1}{\left(\tau -1\right)\mid \mathcal{T}\mid}\sum \limits_{t\in \mathcal{T}}\sum \limits_{i=0}^{\tau -1}\parallel {\partial}_t{\hat{\boldsymbol{\eta}}}_{\theta}\left(\boldsymbol{\zeta} \left(\varOmega, {t}_{i+1}\right)\right)-\hat{\mathcal{N}}\left({\hat{\boldsymbol{\eta}}}_{\theta}\left(\boldsymbol{\zeta} \left(\varOmega, {t}_{i+1}\right)\right)\right){\parallel}_{{\hat{\varOmega}}_k}^2,\end{array}} $$
$$ {\displaystyle \begin{array}{l}{\mathrm{\mathcal{L}}}_P=\frac{1}{\left(\tau -1\right)\mid \mathcal{T}\mid}\sum \limits_{t\in \mathcal{T}}\sum \limits_{i=0}^{\tau -1}\parallel {\partial}_t{\hat{\boldsymbol{\eta}}}_{\theta}\left(\boldsymbol{\zeta} \left(\varOmega, {t}_{i+1}\right)\right)-\hat{\mathcal{N}}\left({\hat{\boldsymbol{\eta}}}_{\theta}\left(\boldsymbol{\zeta} \left(\varOmega, {t}_{i+1}\right)\right)\right){\parallel}_{{\hat{\varOmega}}_k}^2,\end{array}} $$
and for the task of reconstruction from sparse data Eqn. (13) as
 $$ {\displaystyle \begin{array}{l}{\mathrm{\mathcal{L}}}_P=\frac{1}{\left(\tau -1\right)\mid \mathcal{T}\mid}\sum \limits_{t\in \mathcal{T}}\sum \limits_{i=0}^{\tau -1}\parallel {\partial}_t{\hat{\boldsymbol{f}}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_{i+1}\right)\right)-\hat{\mathcal{N}}\left({\hat{\boldsymbol{f}}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_{i+1}\right)\right)\right){\parallel}_{{\hat{\varOmega}}_{\mathbf{k}}}^2,\end{array}} $$
$$ {\displaystyle \begin{array}{l}{\mathrm{\mathcal{L}}}_P=\frac{1}{\left(\tau -1\right)\mid \mathcal{T}\mid}\sum \limits_{t\in \mathcal{T}}\sum \limits_{i=0}^{\tau -1}\parallel {\partial}_t{\hat{\boldsymbol{f}}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_{i+1}\right)\right)-\hat{\mathcal{N}}\left({\hat{\boldsymbol{f}}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,{t}_{i+1}\right)\right)\right){\parallel}_{{\hat{\varOmega}}_{\mathbf{k}}}^2,\end{array}} $$
where 
 $ \hat{\cdot}=\mathcal{F}\left\{\cdot \right\} $
 is the Fourier-transformed variable, and
$ \hat{\cdot}=\mathcal{F}\left\{\cdot \right\} $
 is the Fourier-transformed variable, and 
 $ \hat{\mathcal{N}} $
 denotes the Fourier-transformed differential operator. The pseudospectral discretization is fully differentiable, which allows us to numerically compute gradients with respect to the parameters of the network
$ \hat{\mathcal{N}} $
 denotes the Fourier-transformed differential operator. The pseudospectral discretization is fully differentiable, which allows us to numerically compute gradients with respect to the parameters of the network 
 $ \boldsymbol{\theta} $
. Computing the loss
$ \boldsymbol{\theta} $
. Computing the loss 
 $ {\mathrm{\mathcal{L}}}_P $
 in the Fourier domain provides two advantages: (i) periodic boundary conditions are naturally enforced, which enforces the prior knowledge in the loss calculations; and (ii) gradient calculations yield spectral accuracy, as shown in equations (15, 16). In contrast, a conventional finite differences approach requires a computational stencil, the spatial extent of which places an error bound on the gradient computation. This error bound is a function of the spatial resolution of the field.
$ {\mathrm{\mathcal{L}}}_P $
 in the Fourier domain provides two advantages: (i) periodic boundary conditions are naturally enforced, which enforces the prior knowledge in the loss calculations; and (ii) gradient calculations yield spectral accuracy, as shown in equations (15, 16). In contrast, a conventional finite differences approach requires a computational stencil, the spatial extent of which places an error bound on the gradient computation. This error bound is a function of the spatial resolution of the field.
6. Results on uncovering solutions from biased data
We first introduce the mathematical parameterization of the bias (i.e., systematic error). Second, we discuss numerical details and performance metrics used for the analysis. Third, we showcase results for three partial differential equations of increasing complexity: the linear convection-diffusion, nonlinear convection-diffusion (Burgers’ equation (Bateman, Reference Bateman1915; Burgers, Reference Burgers1948)), and two-dimensional turbulent Navier–Stokes equations (the Kolmogorov flow (Fylladitakis, Reference Fylladitakis2018)). The Navier–Stokes, which is a spatiotemporal chaotic system, will be used for the task of reconstruction from sparse information in Section 7. Finally, we analyze the physical consistency of physics-constrained convolutional neural network (PC-CNN) predictions for the two-dimensional turbulent flow.
6.1. Parameterization of the bias
Errors in predictions can broadly fall into two categories: (i) aleatoric errors, which are typically referred to as noise due to environmental and sensors’ stochastic behavior; and (ii) epistemic uncertainties, which are typically referred to as bias. We parameterize the bias with a non-convex spatially varying function to model an epistemic error. We employ a modified Rastrigin parameterization (Rastrigin, Reference Rastrigin1974), which is commonly used in non-convex optimization benchmarks
 $$ \phi \left(x;\mathcal{M},{k}_{\phi },{u}_{\mathrm{max}}\right)=\frac{\mathcal{M}{u}_{\mathrm{max}}}{2{\pi}^2+40}\left(20+{\sum}_{i=1}^2\left[{\left({x}_i-\pi \right)}^2-10\mathit{\cos}\left({k}_{\phi}\left({x}_i-\pi \right)\right)\right]\right) $$
$$ \phi \left(x;\mathcal{M},{k}_{\phi },{u}_{\mathrm{max}}\right)=\frac{\mathcal{M}{u}_{\mathrm{max}}}{2{\pi}^2+40}\left(20+{\sum}_{i=1}^2\left[{\left({x}_i-\pi \right)}^2-10\mathit{\cos}\left({k}_{\phi}\left({x}_i-\pi \right)\right)\right]\right) $$
where 
 $ {u}_{\mathrm{max}} $
 is the maximum absolute velocity in the flow;
$ {u}_{\mathrm{max}} $
 is the maximum absolute velocity in the flow; 
 $ \mathcal{M} $
 is the relative magnitude of the bias;
$ \mathcal{M} $
 is the relative magnitude of the bias; 
 $ {k}_{\phi } $
 is the wavenumber of the bias, and
$ {k}_{\phi } $
 is the wavenumber of the bias, and 
 $ {\boldsymbol{x}}_1 $
 and
$ {\boldsymbol{x}}_1 $
 and 
 $ {\boldsymbol{x}}_2 $
 are the streamwise and transversal coordinates respectively. Parameterising the bias with
$ {\boldsymbol{x}}_2 $
 are the streamwise and transversal coordinates respectively. Parameterising the bias with 
 $ \mathcal{M} $
 and
$ \mathcal{M} $
 and 
 $ {k}_{\phi } $
 allows us to evaluate the performance of the methodology with respect to different degrees of non-convexity of the bias. Aleatoric noise removal is a fairly established field and is not the focus of this task.
$ {k}_{\phi } $
 allows us to evaluate the performance of the methodology with respect to different degrees of non-convexity of the bias. Aleatoric noise removal is a fairly established field and is not the focus of this task.
6.2. Numerical details, data, and performance metrics
 We uncover the solutions of partial differential equations from biased data for three physical systems of increasing complexity. Each system is solved via the pseudospectral discretisation as described in Section 5, producing a solution by time-integration using the Euler-forward method with a timestep of 
 $ \Delta t=5\times {10}^{-3} $
, which is chosen to satisfy of the Courant–Friedrichs–Lewy (CFL) condition to promote numerical stability. Model training is performed with 1024 training samples and 256 validation samples, which are pseudo-randomly selected from the time domain of the solution. Each sample contains
$ \Delta t=5\times {10}^{-3} $
, which is chosen to satisfy of the Courant–Friedrichs–Lewy (CFL) condition to promote numerical stability. Model training is performed with 1024 training samples and 256 validation samples, which are pseudo-randomly selected from the time domain of the solution. Each sample contains 
 $ \tau =2 $
 sequential time steps. Following a hyperparameter study, the Adam optimizer is employed (Kingma & Ba, Reference Kingma, Ba, Bengio and LeCun2015) with a learning rate of
$ \tau =2 $
 sequential time steps. Following a hyperparameter study, the Adam optimizer is employed (Kingma & Ba, Reference Kingma, Ba, Bengio and LeCun2015) with a learning rate of 
 $ 3\times {10}^{-4} $
. The weighting factors for the loss, as described in equation 7, are taken as
$ 3\times {10}^{-4} $
. The weighting factors for the loss, as described in equation 7, are taken as 
 $ {\lambda}_P={\lambda}_C={10}^3 $
, which are empirically determined (through a hyperparameter search) to provide stable training. Details on the network architecture can be found in A.1. Each model is trained for a total of
$ {\lambda}_P={\lambda}_C={10}^3 $
, which are empirically determined (through a hyperparameter search) to provide stable training. Details on the network architecture can be found in A.1. Each model is trained for a total of 
 $ {10}^4 $
 epochs, which is chosen to provide sufficient convergence. The accuracy of prediction is quantified by the relative error on the validation dataset
$ {10}^4 $
 epochs, which is chosen to provide sufficient convergence. The accuracy of prediction is quantified by the relative error on the validation dataset
 $$ e=\sqrt{\frac{\sum_{t\in \mathcal{T}}\parallel \boldsymbol{u}\left(\varOmega, t\right)-{\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right){\parallel}_{\varOmega}^2}{\sum_{t\in \mathcal{T}}\parallel \boldsymbol{u}\left(\varOmega, t\right){\parallel}_{\varOmega}^2}}. $$
$$ e=\sqrt{\frac{\sum_{t\in \mathcal{T}}\parallel \boldsymbol{u}\left(\varOmega, t\right)-{\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right){\parallel}_{\varOmega}^2}{\sum_{t\in \mathcal{T}}\parallel \boldsymbol{u}\left(\varOmega, t\right){\parallel}_{\varOmega}^2}}. $$
This metric takes the magnitude of the solution into account, which allows the results from different systems to be compared. All experiments in this paper are run on a single Quadro RTX 8000 GPU. Numerical tests are run with different random initializations (seeds) to assess the robustness of the network.
6.3. The linear convection-diffusion equation
The linear convection-diffusion equation is used to describe transport phenomena (Majda and Kramer, Reference Majda and Kramer1999).
 $$ {\partial}_t\boldsymbol{u}+\boldsymbol{c}\cdot \nabla \boldsymbol{u}=\nu \Delta \boldsymbol{u}, $$
$$ {\partial}_t\boldsymbol{u}+\boldsymbol{c}\cdot \nabla \boldsymbol{u}=\nu \Delta \boldsymbol{u}, $$
where 
 $ \nabla $
 is the nabla operator;
$ \nabla $
 is the nabla operator; 
 $ \Delta $
 is the Laplacian operator;
$ \Delta $
 is the Laplacian operator; 
 $ \boldsymbol{c}\equiv \left(c,c\right) $
, where
$ \boldsymbol{c}\equiv \left(c,c\right) $
, where 
 $ c $
 is the convective coefficient; and
$ c $
 is the convective coefficient; and 
 $ \nu $
 is the diffusion coefficient. The flow energy is subject to rapid decay because the solution quickly converges towards a fixed-point solution at
$ \nu $
 is the diffusion coefficient. The flow energy is subject to rapid decay because the solution quickly converges towards a fixed-point solution at 
 $ \boldsymbol{u}\left(\varOmega, t\right)=0 $
, notably in the case where
$ \boldsymbol{u}\left(\varOmega, t\right)=0 $
, notably in the case where 
 $ \nu /\boldsymbol{c} $
 is large. In the presence of the fixed-point solution, we observe
$ \nu /\boldsymbol{c} $
 is large. In the presence of the fixed-point solution, we observe 
 $ \boldsymbol{\zeta} \left(\varOmega, t\right)=\boldsymbol{\phi} \left(\varOmega \right) $
, which is a trivial case for identification and removal of the bias. In order to avoid rapid convergence to the fixed-point solution, we take
$ \boldsymbol{\zeta} \left(\varOmega, t\right)=\boldsymbol{\phi} \left(\varOmega \right) $
, which is a trivial case for identification and removal of the bias. In order to avoid rapid convergence to the fixed-point solution, we take 
 $ c=1.0,\nu =1/500 $
 as the coefficients, producing a convective-dominant solution. A snapshot of the results for
$ c=1.0,\nu =1/500 $
 as the coefficients, producing a convective-dominant solution. A snapshot of the results for 
 $ {k}_{\phi }=3,\mathcal{M}=0.5 $
 is shown in Figure 3(a). There is a marked difference between the biased data
$ {k}_{\phi }=3,\mathcal{M}=0.5 $
 is shown in Figure 3(a). There is a marked difference between the biased data 
 $ \zeta \left(\varOmega, t\right) $
 in panel (i) and the true state
$ \zeta \left(\varOmega, t\right) $
 in panel (i) and the true state 
 $ \boldsymbol{u}\left(\varOmega, t\right) $
 in panel (ii), most notably in the magnitude of the field. Network predictions
$ \boldsymbol{u}\left(\varOmega, t\right) $
 in panel (ii), most notably in the magnitude of the field. Network predictions 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\zeta \left(\varOmega, t\right)\right) $
 in panel (iii) uncover the solution to the partial differential equation with a relative error of
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\zeta \left(\varOmega, t\right)\right) $
 in panel (iii) uncover the solution to the partial differential equation with a relative error of 
 $ e=6.612\times {10}^{-2} $
 on the validation set.
$ e=6.612\times {10}^{-2} $
 on the validation set.

Figure 3. Uncovering solutions from biased data. (a) Linear convection-diffusion case 
 $ {k}_{\phi }=3 $
 and
$ {k}_{\phi }=3 $
 and 
 $ \mathcal{M}=0.5 $
. (b) Nonlinear convection-diffusion case with
$ \mathcal{M}=0.5 $
. (b) Nonlinear convection-diffusion case with 
 $ {k}_{\phi }=5 $
 and
$ {k}_{\phi }=5 $
 and 
 $ \mathcal{M}=0.5 $
. (c) Two-dimensional turbulent flow case with
$ \mathcal{M}=0.5 $
. (c) Two-dimensional turbulent flow case with 
 $ {k}_{\phi }=7 $
 and
$ {k}_{\phi }=7 $
 and 
 $ \mathcal{M}=0.5 $
. Panel (i) shows the biased data,
$ \mathcal{M}=0.5 $
. Panel (i) shows the biased data, 
 $ \boldsymbol{\zeta} $
; (ii) shows the true state,
$ \boldsymbol{\zeta} $
; (ii) shows the true state, 
 $ \boldsymbol{u} $
, which we wish to uncover from the biased data; (iii) shows the bias, which represents the bias (i.e., systematic error),
$ \boldsymbol{u} $
, which we wish to uncover from the biased data; (iii) shows the bias, which represents the bias (i.e., systematic error), 
 $ \boldsymbol{\phi} $
; (iv) shows the network predictions,
$ \boldsymbol{\phi} $
; (iv) shows the network predictions, 
 $ {\boldsymbol{\eta}}_{\theta } $
; and (v) shows the predicted bias,
$ {\boldsymbol{\eta}}_{\theta } $
; and (v) shows the predicted bias, 
 $ \boldsymbol{\zeta} $
 -
$ \boldsymbol{\zeta} $
 - 
 $ {\boldsymbol{\eta}}_{\theta } $
.
$ {\boldsymbol{\eta}}_{\theta } $
.
6.4. Nonlinear convection-diffusion equation
The nonlinear convection-diffusion equation, also known as Burgers’ equation (Burgers, Reference Burgers1948), is
 $$ {\partial}_t\boldsymbol{u}+\left(\boldsymbol{u}\cdot \nabla \right)\boldsymbol{u}=\nu \Delta \boldsymbol{u}, $$
$$ {\partial}_t\boldsymbol{u}+\left(\boldsymbol{u}\cdot \nabla \right)\boldsymbol{u}=\nu \Delta \boldsymbol{u}, $$
where the nonlinearity lies in the convective term 
 $ \left(\boldsymbol{u}\cdot \nabla \right)\boldsymbol{u} $
. The nonlinear convective term provides a further challenge: below a certain threshold, the velocity interactions lead to further energy decay. The kinematic viscosity is set to
$ \left(\boldsymbol{u}\cdot \nabla \right)\boldsymbol{u} $
. The nonlinear convective term provides a further challenge: below a certain threshold, the velocity interactions lead to further energy decay. The kinematic viscosity is set to 
 $ \nu =1/500 $
, which produces a convective-dominant solution by time integration. In contrast to the linear convection-diffusion system (Section 6.3), the relationship between the dynamics of the biased state and the true state is more challenging. The introduction of nonlinearities in the differential operator revokes the linear relationship between the bias and data, i.e.,
$ \nu =1/500 $
, which produces a convective-dominant solution by time integration. In contrast to the linear convection-diffusion system (Section 6.3), the relationship between the dynamics of the biased state and the true state is more challenging. The introduction of nonlinearities in the differential operator revokes the linear relationship between the bias and data, i.e., 
 $ \mathcal{R}\left(\boldsymbol{\zeta} \left(\boldsymbol{x},t\right);\boldsymbol{p}\right)\ne \mathcal{R}\left(\boldsymbol{\phi} \left(\boldsymbol{x},t\right);\boldsymbol{p}\right) $
 as discussed in Section 2.1. Consequently, this increases the complexity of the physics loss term
$ \mathcal{R}\left(\boldsymbol{\zeta} \left(\boldsymbol{x},t\right);\boldsymbol{p}\right)\ne \mathcal{R}\left(\boldsymbol{\phi} \left(\boldsymbol{x},t\right);\boldsymbol{p}\right) $
 as discussed in Section 2.1. Consequently, this increases the complexity of the physics loss term 
 $ {\unicode{x1D543}}_P $
.
$ {\unicode{x1D543}}_P $
.
 
Figure 3(b) shows a snapshot of results for 
 $ {k}_{\phi }=5,\mathcal{M}=0.5 $
. The true state
$ {k}_{\phi }=5,\mathcal{M}=0.5 $
. The true state 
 $ \boldsymbol{u}\left(\varOmega, t\right) $
 of the partial differential equation contains high-frequency spatial structures, shown in panel (ii), which are less prominent in the biased data
$ \boldsymbol{u}\left(\varOmega, t\right) $
 of the partial differential equation contains high-frequency spatial structures, shown in panel (ii), which are less prominent in the biased data 
 $ \boldsymbol{\zeta} \left(\varOmega, t\right) $
, shown in panel (i). Despite the introduction of a nonlinear differential operator, we demonstrate that the network retains the ability to recover the true state, as shown in panels (ii, iv), with a relative error on the validation set of
$ \boldsymbol{\zeta} \left(\varOmega, t\right) $
, shown in panel (i). Despite the introduction of a nonlinear differential operator, we demonstrate that the network retains the ability to recover the true state, as shown in panels (ii, iv), with a relative error on the validation set of 
 $ e=6.791\times {10}^{-2} $
.
$ e=6.791\times {10}^{-2} $
.
6.4.1. Two-dimensional turbulent flow
Finally, we consider a spatiotemporally chaotic flow, which is governed by the incompressible Navier–Stokes equations evaluated on a two-dimensional domain with periodic boundary conditions, which is also known as the Kolmogorov flow (Fylladitakis, Reference Fylladitakis2018). The partial differential equations are expressions of the mass and momentum conservation, respectively
 $$ {\displaystyle \begin{array}{c}\nabla \cdot \boldsymbol{u}=0,\\ {}{\partial}_t\boldsymbol{u}+\left(\boldsymbol{u}\cdot \nabla \right)\boldsymbol{u}=-\nabla p+\nu \Delta \boldsymbol{u}+\boldsymbol{g}.\end{array}} $$
$$ {\displaystyle \begin{array}{c}\nabla \cdot \boldsymbol{u}=0,\\ {}{\partial}_t\boldsymbol{u}+\left(\boldsymbol{u}\cdot \nabla \right)\boldsymbol{u}=-\nabla p+\nu \Delta \boldsymbol{u}+\boldsymbol{g}.\end{array}} $$
where 
 $ p $
 is the pressure field; and
$ p $
 is the pressure field; and 
 $ \boldsymbol{g} $
 is a body forcing, which enables the dynamics to be sustained by ensuring that the flow energy does not dissipate. The flow density is constant and assumed to be unity, i.e., the flow is incompressible. The use of a pseudospectral discretization allows us to eliminate the pressure term and handle the continuity constraint implicitly, as shown in the spectral representation of the Navier–Stokes equations in B. The spatiotemporal dynamics are chaotic at the prescribed viscosity
$ \boldsymbol{g} $
 is a body forcing, which enables the dynamics to be sustained by ensuring that the flow energy does not dissipate. The flow density is constant and assumed to be unity, i.e., the flow is incompressible. The use of a pseudospectral discretization allows us to eliminate the pressure term and handle the continuity constraint implicitly, as shown in the spectral representation of the Navier–Stokes equations in B. The spatiotemporal dynamics are chaotic at the prescribed viscosity 
 $ \nu =1/42 $
. The periodic forcing is
$ \nu =1/42 $
. The periodic forcing is 
 $ \boldsymbol{g}\left(\boldsymbol{x}\right)=\left(\sin \left(4{\boldsymbol{x}}_{\mathbf{2}}\right),0\right) $
. In order to remove the transient and focus on a statistically stationary regime, the transient time series up to
$ \boldsymbol{g}\left(\boldsymbol{x}\right)=\left(\sin \left(4{\boldsymbol{x}}_{\mathbf{2}}\right),0\right) $
. In order to remove the transient and focus on a statistically stationary regime, the transient time series up to 
 $ {T}_t=180.0 $
 is discarded.
$ {T}_t=180.0 $
 is discarded.
 A snapshot for the two-dimensional turbulent flow case is shown in Figure 3(c) for 
 $ {k}_{\phi }=7,\mathcal{M}=0.5 $
. The bias field
$ {k}_{\phi }=7,\mathcal{M}=0.5 $
. The bias field 
 $ \zeta \left(\varOmega, t\right) $
 as shown in panel (i) contains prominent, high-frequency spatial structures not present in the true state
$ \zeta \left(\varOmega, t\right) $
 as shown in panel (i) contains prominent, high-frequency spatial structures not present in the true state 
 $ \boldsymbol{u}\left(\varOmega, t\right) $
 shown in panel (ii). The biased field bears little resemblance to the true solution. In spite of the chaotic dynamics of the system, we demonstrate that network predictions
$ \boldsymbol{u}\left(\varOmega, t\right) $
 shown in panel (ii). The biased field bears little resemblance to the true solution. In spite of the chaotic dynamics of the system, we demonstrate that network predictions 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right) $
 shown in panel (iv) successfully uncover the solution with a relative error on the validation set of
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right) $
 shown in panel (iv) successfully uncover the solution with a relative error on the validation set of 
 $ e=2.044\times {10}^{-2} $
.
$ e=2.044\times {10}^{-2} $
.
6.5. Robustness
 We analyse the robustness of the methodology by varying the wavenumber 
 $ {k}_{\phi } $
 and magnitude
$ {k}_{\phi } $
 and magnitude 
 $ \mathcal{M} $
 of the bias parameterization (17). Varying these parameters allows us to assess the degree to which the true state of a partial differential equation can be recovered when subjected to spatially varying bias with different degrees of non-convexity. To this end, we perform two parametric studies for each partial differential equation: (i)
$ \mathcal{M} $
 of the bias parameterization (17). Varying these parameters allows us to assess the degree to which the true state of a partial differential equation can be recovered when subjected to spatially varying bias with different degrees of non-convexity. To this end, we perform two parametric studies for each partial differential equation: (i) 
 $ \mathcal{M}=0.5 $
,
$ \mathcal{M}=0.5 $
, 
 $ {k}_{\phi}\in \left\{\mathrm{1,3,5,7,9}\right\} $
; and (ii)
$ {k}_{\phi}\in \left\{\mathrm{1,3,5,7,9}\right\} $
; and (ii) 
 $ {k}_{\phi }=3 $
,
$ {k}_{\phi }=3 $
, 
 $ \mathcal{M}\in \left\{\mathrm{0.01,0.1,0.25,0.5,1.0}\right\} $
. For each case, we compute the relative error,
$ \mathcal{M}\in \left\{\mathrm{0.01,0.1,0.25,0.5,1.0}\right\} $
. For each case, we compute the relative error, 
 $ e $
, between the predicted solution
$ e $
, between the predicted solution 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right) $
 and the true solution
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right) $
 and the true solution 
 $ \boldsymbol{u}\left(\varOmega, t\right) $
, as defined in Eq. (18). We show the mean relative
$ \boldsymbol{u}\left(\varOmega, t\right) $
, as defined in Eq. (18). We show the mean relative 
 $ {\mathrm{\ell}}^2 $
-error for an ensemble of five computations for each test to ensure that the results are representative of the true performance. Empirically, we find that performance is robust to pseudo-random initialization of network parameters with standard deviation of
$ {\mathrm{\ell}}^2 $
-error for an ensemble of five computations for each test to ensure that the results are representative of the true performance. Empirically, we find that performance is robust to pseudo-random initialization of network parameters with standard deviation of 
 $ \mathcal{O}\left({10}^{-4}\right) $
 (result not shown). We employ the same network setups described in Section 6.2 with the same parameters for training. Assessing results using the same parameters for the three partial differential equations allows us to draw conclusions on the robustness of the methodology.
$ \mathcal{O}\left({10}^{-4}\right) $
 (result not shown). We employ the same network setups described in Section 6.2 with the same parameters for training. Assessing results using the same parameters for the three partial differential equations allows us to draw conclusions on the robustness of the methodology.
 First, in the linear convection–diffusion problem as shown in Figure 4(a), we demonstrate that the relative error is largely independent of the form of parameterized bias. The relative magnitude 
 $ \mathcal{M} $
 and Rastrigin wavenumber
$ \mathcal{M} $
 and Rastrigin wavenumber 
 $ {k}_{\phi } $
 have a small impact on the ability of the model to uncover the solution to the partial differential equation, with the performance being consistent across all cases. The model performs best for the case
$ {k}_{\phi } $
 have a small impact on the ability of the model to uncover the solution to the partial differential equation, with the performance being consistent across all cases. The model performs best for the case 
 $ {k}_{\varphi }=7,\mathcal{M}=0.5 $
, achieving a relative error of
$ {k}_{\varphi }=7,\mathcal{M}=0.5 $
, achieving a relative error of 
 $ e=2.568\times {10}^{-1} $
. Second, results for the nonlinear convection-diffusion case show marked improvement in comparison with the linear case, with errors for the numerical study shown in Figure 4(b). Despite the introduction of a nonlinear operator, the relative error is consistently lower regardless of the parameterization of bias. The network remains equally capable of uncovering the true state across a wide range of modalities and magnitudes. Although the physics loss from the network predictions
$ e=2.568\times {10}^{-1} $
. Second, results for the nonlinear convection-diffusion case show marked improvement in comparison with the linear case, with errors for the numerical study shown in Figure 4(b). Despite the introduction of a nonlinear operator, the relative error is consistently lower regardless of the parameterization of bias. The network remains equally capable of uncovering the true state across a wide range of modalities and magnitudes. Although the physics loss from the network predictions 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right) $
 no longer scales in a linear fashion, this is beneficial for the training dynamics. This is because the second-order nonlinearity promotes convexity in the loss-surface, which is then exploited by our gradient-based optimization approach. Third, the PC-CNN is able to uncover the true state of the turbulent flow. Results shown in Figure 4(c) demonstrate an improvement in performance from the standard nonlinear convection-diffusion case. For both fixed Rastrigin wavenumber and magnitude, increasing the value of the parameter tends to decrease the relative error. The relative error in Figure 4(c) decreases as the non-convexity of the bias increases. This is because it becomes increasingly simple to distinguish the bias from the underlying flow field if there is a larger discrepancy. We also emphasize that these tests were run with a fixed computational budget to ensure fair comparison—with additional training, we observe a further decrease in relative error across all experiments. (We also ran tests with biases parameterized with multiple wavenumbers, an example of which is shown in Appendix C).
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right) $
 no longer scales in a linear fashion, this is beneficial for the training dynamics. This is because the second-order nonlinearity promotes convexity in the loss-surface, which is then exploited by our gradient-based optimization approach. Third, the PC-CNN is able to uncover the true state of the turbulent flow. Results shown in Figure 4(c) demonstrate an improvement in performance from the standard nonlinear convection-diffusion case. For both fixed Rastrigin wavenumber and magnitude, increasing the value of the parameter tends to decrease the relative error. The relative error in Figure 4(c) decreases as the non-convexity of the bias increases. This is because it becomes increasingly simple to distinguish the bias from the underlying flow field if there is a larger discrepancy. We also emphasize that these tests were run with a fixed computational budget to ensure fair comparison—with additional training, we observe a further decrease in relative error across all experiments. (We also ran tests with biases parameterized with multiple wavenumbers, an example of which is shown in Appendix C).

Figure 4. Unconvering solutions from biased data: robustness analysis through relative error, 
 $ e $
. (a) Linear convection-diffusion case. (b) Nonlinear convection-diffusion case. (c) Two-dimensional turbulent flow case. Orange-bars denote results for case
$ e $
. (a) Linear convection-diffusion case. (b) Nonlinear convection-diffusion case. (c) Two-dimensional turbulent flow case. Orange-bars denote results for case 
 $ (i) $
: fixing the magnitude and varying the Rastrigin wavenumber. Blue-bars denote results for case
$ (i) $
: fixing the magnitude and varying the Rastrigin wavenumber. Blue-bars denote results for case 
 $ (ii) $
: fixing the Rastrigin wavenumber and varying the magnitude.
$ (ii) $
: fixing the Rastrigin wavenumber and varying the magnitude.
6.6. Physical consistency of the uncovered Navier–Stokes solutions
 Results in Section 6.5 show that, for a generic training setup, we are able to achieve small relative error for a variety of degrees of non-convexity. Because the two-dimensional turbulent flow case is chaotic, infinitesimal perturbations 
 $ \sim O\left(\unicode{x025B} \right) $
 exponentially grow in time, therefore, the residual
$ \sim O\left(\unicode{x025B} \right) $
 exponentially grow in time, therefore, the residual 
 $ \mathcal{R}\left(\boldsymbol{u}\left(\varOmega, t\right)+\unicode{x025B} \left(\varOmega, t\right);\boldsymbol{p}\right)\gg \mathcal{R}\left(\boldsymbol{u}\left(\varOmega, t\right);\boldsymbol{p}\right) $
 where
$ \mathcal{R}\left(\boldsymbol{u}\left(\varOmega, t\right)+\unicode{x025B} \left(\varOmega, t\right);\boldsymbol{p}\right)\gg \mathcal{R}\left(\boldsymbol{u}\left(\varOmega, t\right);\boldsymbol{p}\right) $
 where 
 $ \unicode{x025B} $
 is a perturbation parameter. In this section, we analyze the physical properties of the solutions of the Navier–Stokes equation (Eq. 21) uncovered by the PC-CNN,
$ \unicode{x025B} $
 is a perturbation parameter. In this section, we analyze the physical properties of the solutions of the Navier–Stokes equation (Eq. 21) uncovered by the PC-CNN, 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
. First, we show snapshots of the time evolution of the two-dimensional turbulent flow in Figure 5. These confirm that the model is learning a physical solution in time, as shown in the error on a log scale in panel (iv). The parameterization of the bias is fixed in this case with
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
. First, we show snapshots of the time evolution of the two-dimensional turbulent flow in Figure 5. These confirm that the model is learning a physical solution in time, as shown in the error on a log scale in panel (iv). The parameterization of the bias is fixed in this case with 
 $ {k}_{\varphi }=7,\mathcal{M}=0.5 $
.
$ {k}_{\varphi }=7,\mathcal{M}=0.5 $
.

Figure 5. Uncovering solution from biased data. Temporal evolution of the two-dimensional turbulent flow with 
 $ {k}_{\varphi }=7,\mathcal{M}=0.5 $
.
$ {k}_{\varphi }=7,\mathcal{M}=0.5 $
. 
 $ {T}_t $
 denotes the length of the transient. (i) Biased data,
$ {T}_t $
 denotes the length of the transient. (i) Biased data, 
 $ \boldsymbol{\zeta} $
; (ii) true state,
$ \boldsymbol{\zeta} $
; (ii) true state, 
 $ \boldsymbol{u} $
; (iii) predicted solution,
$ \boldsymbol{u} $
; (iii) predicted solution, 
 $ {\boldsymbol{\eta}}_{\theta } $
; and (iv) squared error,
$ {\boldsymbol{\eta}}_{\theta } $
; and (iv) squared error, 
 $ {\left\Vert {\boldsymbol{\eta}}_{\theta }-\boldsymbol{u}\right\Vert}^2 $
.
$ {\left\Vert {\boldsymbol{\eta}}_{\theta }-\boldsymbol{u}\right\Vert}^2 $
.
 Second, we analyze the statistics of the solution. The mean kinetic energy of the flow at each timestep is directly affected by the introduction of the bias. In the case of our strictly positive bias 
 $ \boldsymbol{\phi} $
 (Eq. (2)), the mean kinetic energy is increased at every point. Results in Figure 6(a) show the kinetic energy time series across the time domain for: the true state
$ \boldsymbol{\phi} $
 (Eq. (2)), the mean kinetic energy is increased at every point. Results in Figure 6(a) show the kinetic energy time series across the time domain for: the true state 
 $ \boldsymbol{u} $
; the biased data
$ \boldsymbol{u} $
; the biased data 
 $ \boldsymbol{\zeta} $
; and the network’s predictions
$ \boldsymbol{\zeta} $
; and the network’s predictions 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
. The chaotic nature of the solution can be observed from the erratic (but not random) evolution of the energy. We observe that the network predictions accurately reconstruct the kinetic energy trace across the simulation domain.
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
. The chaotic nature of the solution can be observed from the erratic (but not random) evolution of the energy. We observe that the network predictions accurately reconstruct the kinetic energy trace across the simulation domain.

Figure 6. Uncovering solution from biased data: analysis of the solutions in the turbulent flow with 
 $ {k}_{\varphi }=7,\mathcal{M}=0.5 $
. (a) Kinetic energy for the two-dimensional turbulent flow. (b) Energy spectrum for the two-dimensional turbulent flow.
$ {k}_{\varphi }=7,\mathcal{M}=0.5 $
. (a) Kinetic energy for the two-dimensional turbulent flow. (b) Energy spectrum for the two-dimensional turbulent flow. 
 $ {T}_t $
 denotes the length of the transient.
$ {T}_t $
 denotes the length of the transient.
 Third, we analyze the energy spectrum, which is characteristic of turbulent flows, in which the energy content decreases with the wavenumber. Introducing the bias at a particular wavenumber artificially alters the energy spectrum due to increased energy content. In Figure 6(b), we show the energy spectrum for the two-dimensional turbulent flow, where the unphysical increase in energy content is visible for 
 $ \mid \boldsymbol{k}\mid \ge 7 $
. Model predictions
$ \mid \boldsymbol{k}\mid \ge 7 $
. Model predictions 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right) $
 correct the energy content for
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}}\left(\boldsymbol{\zeta} \left(\varOmega, t\right)\right) $
 correct the energy content for 
 $ \mid \boldsymbol{k}\mid <21 $
, and successfully characterize and reproduce scales of turbulence. Although the energy content exponentially decreases with the spatial frequency, the true solution is correctly uncovered up to large wavenumbers, i.e., when aliasing occurs (
$ \mid \boldsymbol{k}\mid <21 $
, and successfully characterize and reproduce scales of turbulence. Although the energy content exponentially decreases with the spatial frequency, the true solution is correctly uncovered up to large wavenumbers, i.e., when aliasing occurs (
 $ \mid \boldsymbol{k} \mid \gtrsim\ 29 $
).
$ \mid \boldsymbol{k} \mid \gtrsim\ 29 $
).
7. Results on reconstruction from sparse information
 First, we discuss the generation of the sparse data. Next, we show the ability of the PC-CNN to infer the high-resolution solution of the partial differential equation on points that are not present in the training set. A high-resolution solution of the partial differential equation is generated on the grid 
 $ {\varOmega}_H $
 prior to extracting a low-resolution grid
$ {\varOmega}_H $
 prior to extracting a low-resolution grid 
 $ {\varOmega}_L $
 with the scale factor of
$ {\varOmega}_L $
 with the scale factor of 
 $ \kappa =N/M $
 (Section 2.2). Both the solver and physics loss are discretized with
$ \kappa =N/M $
 (Section 2.2). Both the solver and physics loss are discretized with 
 $ K=N2 $
 wavenumbers in the Fourier domain, which complies with the Nyquist–Shannon sampling criterion. The downsampling by
$ K=N2 $
 wavenumbers in the Fourier domain, which complies with the Nyquist–Shannon sampling criterion. The downsampling by 
 $ \kappa $
 is performed by extracting the value of the solution at spatial locations
$ \kappa $
 is performed by extracting the value of the solution at spatial locations 
 $ {\varOmega}_L\cap {\varOmega}_H $
, that is a low-resolution representation of the high-resolution solution
$ {\varOmega}_L\cap {\varOmega}_H $
, that is a low-resolution representation of the high-resolution solution
 $$ \boldsymbol{u}\left({\varOmega}_L,t\right)\triangleq {\left.\boldsymbol{u}\left({\varOmega}_H,t\right)\right|}^{\varOmega_L}, $$
$$ \boldsymbol{u}\left({\varOmega}_L,t\right)\triangleq {\left.\boldsymbol{u}\left({\varOmega}_H,t\right)\right|}^{\varOmega_L}, $$
where 
 $ \mid {}^{\varOmega_L} $
 is the corestriction which extracts points only in the relevant domain. While the data is sparse in space, a minimal temporal resolution is required for the computation of the time derivatives.
$ \mid {}^{\varOmega_L} $
 is the corestriction which extracts points only in the relevant domain. While the data is sparse in space, a minimal temporal resolution is required for the computation of the time derivatives.
7.1. Comparison with standard upsampling
 We show the results for a scale factor 
 $ \kappa =7 $
, with results for further scale factors shown in D. Results are compared with interpolating upsampling methods, i.e., bi-linear, and bi-cubic interpolation to demonstrate the ability of the method. We quantify the accuracy by computing the relative error between the true solution,
$ \kappa =7 $
, with results for further scale factors shown in D. Results are compared with interpolating upsampling methods, i.e., bi-linear, and bi-cubic interpolation to demonstrate the ability of the method. We quantify the accuracy by computing the relative error between the true solution, 
 $ \boldsymbol{u}\left({\varOmega}_H,t\right) $
, and the corresponding network realization,
$ \boldsymbol{u}\left({\varOmega}_H,t\right) $
, and the corresponding network realization, 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
$ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
 $$ e=\sqrt{\frac{\sum_{t\in \mathcal{T}}\parallel \boldsymbol{u}\left({\varOmega}_{\boldsymbol{H}},t\right)-{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_{\mathbf{L}},t\right)\right){\parallel}_{\varOmega_{\boldsymbol{H}}}^2}{\sum_{t\in \mathcal{T}}\parallel \boldsymbol{u}\left({\varOmega}_{\boldsymbol{H}},t\right){\parallel}_{\varOmega_{\boldsymbol{H}}}^2}}. $$
$$ e=\sqrt{\frac{\sum_{t\in \mathcal{T}}\parallel \boldsymbol{u}\left({\varOmega}_{\boldsymbol{H}},t\right)-{\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_{\mathbf{L}},t\right)\right){\parallel}_{\varOmega_{\boldsymbol{H}}}^2}{\sum_{t\in \mathcal{T}}\parallel \boldsymbol{u}\left({\varOmega}_{\boldsymbol{H}},t\right){\parallel}_{\varOmega_{\boldsymbol{H}}}^2}}. $$
 Upon discarding the transient, a solution 
 $ \boldsymbol{u}\left({\varOmega}_H,t\right)\in {\mathrm{\mathbb{R}}}^{70\times 70} $
 is generated by time integration over
$ \boldsymbol{u}\left({\varOmega}_H,t\right)\in {\mathrm{\mathbb{R}}}^{70\times 70} $
 is generated by time integration over 
 $ 12\times {10}^3 $
 time steps, with
$ 12\times {10}^3 $
 time steps, with 
 $ \Delta t=1\times {10}^{-3} $
. We extract the low-resolution solution
$ \Delta t=1\times {10}^{-3} $
. We extract the low-resolution solution 
 $ \boldsymbol{u}\left({\varOmega}_L,t\right)\in {\mathrm{\mathbb{R}}}^{10\times 10} $
. We extract
$ \boldsymbol{u}\left({\varOmega}_L,t\right)\in {\mathrm{\mathbb{R}}}^{10\times 10} $
. We extract 
 $ 2048 $
 samples at random from the time domain of the solution, each sample consisting of
$ 2048 $
 samples at random from the time domain of the solution, each sample consisting of 
 $ \tau =2 $
 consecutive time steps. The Adam optimizer (Kingma and Ba, Reference Kingma, Ba, Bengio and LeCun2015) is employed for training with a learning rate of
$ \tau =2 $
 consecutive time steps. The Adam optimizer (Kingma and Ba, Reference Kingma, Ba, Bengio and LeCun2015) is employed for training with a learning rate of 
 $ 3\times {10}^{-4} $
. We take
$ 3\times {10}^{-4} $
. We take 
 $ {\lambda}_P={10}^3 $
 as the regularisation factor for the loss, as shown in Eq. (7), and train for a total of
$ {\lambda}_P={10}^3 $
 as the regularisation factor for the loss, as shown in Eq. (7), and train for a total of 
 $ {10}^3 $
 epochs, which is empirically determined to provide sufficient convergence, as per the results of a hyperparameter study (result not shown).
$ {10}^3 $
 epochs, which is empirically determined to provide sufficient convergence, as per the results of a hyperparameter study (result not shown).
 
Figure 7 shows a snapshot of results for the streamwise component of the velocity field, comparing network realizations 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
 with the interpolated alternatives. Bi-linear and bi-cubic interpolation are denoted by
$ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
 with the interpolated alternatives. Bi-linear and bi-cubic interpolation are denoted by 
 $ BL\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right), BC\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
 respectively. We observe that network realizations of the high-resolution solution yield qualitatively more accurate results as compared with the interpolation. Artifacts indicative of the interpolation scheme used are present in both of the interpolated fields, whereas the network realization captures the structures present in the high-resolution field correctly. Across the entire solution domain the model
$ BL\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right), BC\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
 respectively. We observe that network realizations of the high-resolution solution yield qualitatively more accurate results as compared with the interpolation. Artifacts indicative of the interpolation scheme used are present in both of the interpolated fields, whereas the network realization captures the structures present in the high-resolution field correctly. Across the entire solution domain the model 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 achieves a relative error of
$ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 achieves a relative error of 
 $ e=6.972\times {10}^{-2} $
 compared with
$ e=6.972\times {10}^{-2} $
 compared with 
 $ e=2.091\times {10}^{-1} $
 for bi-linear interpolation and
$ e=2.091\times {10}^{-1} $
 for bi-linear interpolation and 
 $ e=1.717\times {10}^{-1} $
 for bi-cubic interpolation.
$ e=1.717\times {10}^{-1} $
 for bi-cubic interpolation.

Figure 7. Reconstruction from sparse information: physics-constrained convolutional neural network (PC-CNN) compared with traditional interpolation methods to reconstruct a solution from a sparse grid 
 $ {\varOmega}_L\in {\mathrm{\mathbb{R}}}^{10\times 10} $
 (100 points) to a high-resolution grid
$ {\varOmega}_L\in {\mathrm{\mathbb{R}}}^{10\times 10} $
 (100 points) to a high-resolution grid 
 $ {\varOmega}_H\in {\mathrm{\mathbb{R}}}^{70\times 70} $
 (4900 points). Panel (i) shows the low-resolution input,
$ {\varOmega}_H\in {\mathrm{\mathbb{R}}}^{70\times 70} $
 (4900 points). Panel (i) shows the low-resolution input, 
 $ \boldsymbol{u}\left({\varOmega}_L,t\right) $
; (ii) bi-linear interpolation,
$ \boldsymbol{u}\left({\varOmega}_L,t\right) $
; (ii) bi-linear interpolation, 
 $ BL\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
; (iii) bi-cubic interpolation,
$ BL\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
; (iii) bi-cubic interpolation, 
 $ BC\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
; (iv) true high-resolution field,
$ BC\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
; (iv) true high-resolution field, 
 $ \boldsymbol{u}\left({\varOmega}_H,t\right) $
; (v) model prediction of the high-resolution field,
$ \boldsymbol{u}\left({\varOmega}_H,t\right) $
; (v) model prediction of the high-resolution field, 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
; and (vi) energy spectra for each of the predictions. Vertical lines
$ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
; and (vi) energy spectra for each of the predictions. Vertical lines 
 $ {f}^n $
 denote the Nyquist frequencies for spectral grid,
$ {f}^n $
 denote the Nyquist frequencies for spectral grid, 
 $ {f}_{{\hat{\varOmega}}_{\boldsymbol{k}}}^n $
, and for the low-resolution grid,
$ {f}_{{\hat{\varOmega}}_{\boldsymbol{k}}}^n $
, and for the low-resolution grid, 
 $ {f}_{\varOmega_{\boldsymbol{L}}}^n $
.
$ {f}_{\varOmega_{\boldsymbol{L}}}^n $
.
 Although the relative error provides a notion of predictive accuracy, it is crucial to assess the physical characteristics of the reconstructed field (Pope, Reference Pope and Pope2000). The energy spectrum, which is characteristic of turbulent flows, represents a multi-scale phenomenon where energy content decreases with the wavenumber. From the energy spectrum of the network’s prediction, 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
, we gain physical insight into the multi-scale nature of the solution. Results in Figure 7 show that the energy content of the low-resolution field diverges from that of the high-resolution field, which is a consequence of spectral aliasing. Network realizations
$ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
, we gain physical insight into the multi-scale nature of the solution. Results in Figure 7 show that the energy content of the low-resolution field diverges from that of the high-resolution field, which is a consequence of spectral aliasing. Network realizations 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
 are capable of capturing finer scales of turbulence compared to both interpolation approaches, prior to diverging from the true spectrum as
$ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_L,t\right)\right) $
 are capable of capturing finer scales of turbulence compared to both interpolation approaches, prior to diverging from the true spectrum as 
 $ \mid \boldsymbol{k}\mid =18 $
. These results show the efficacy of the method, recovering in excess of
$ \mid \boldsymbol{k}\mid =18 $
. These results show the efficacy of the method, recovering in excess of 
 $ 99\% $
 of the system’s total energy. The physics loss,
$ 99\% $
 of the system’s total energy. The physics loss, 
 $ {\mathrm{\mathcal{L}}}_P $
, enables the network to act beyond simple interpolation. The network is capable of de-aliasing, thereby inferring unresolved physics. Parametric studies show similar results across a range of scale factors
$ {\mathrm{\mathcal{L}}}_P $
, enables the network to act beyond simple interpolation. The network is capable of de-aliasing, thereby inferring unresolved physics. Parametric studies show similar results across a range of scale factors 
 $ \kappa $
 (shown in Appendix D).
$ \kappa $
 (shown in Appendix D).
8. Conclusions
We propose a physics-constrained convolutional neural network (PC-CNN) to solve inverse problems that are relevant to spatiotemporal partial differential equations (PDEs). First, we introduce the physics-constrained convolutional neural network (PC-CNN), which provides the means to compute the physical residual, i.e., the areas of the field in which the physical laws are violated by the prediction provided by the convolutional neural network (CNN). We formulate an optimization problem by leveraging prior knowledge of the underlying physical system to regularise the predictions from the PC-CNN. This is augmented by a simple time-windowing approach for the computation of temporal derivative. Second, we apply the PC-CNN to solve an inverse problem in which we are given data that is contaminated by a spatially varying systematic error (i.e., bias, or epistemic uncertainty). The task is to uncover from the biased data the true state, which is the solution of the PDE. We numerically test the method to remove biases from data generated by three partial differential equations of increasing complexity (linear and nonlinear diffusion-convection, and Navier–Stokes). The PC-CNN successfully infers the true state, and the bias as a by-product, for a large range of magnitudes and degrees of non-convexity. This demonstrates that the method is accurate and robust. Third, we apply the PC-CNN to solve an inverse problem in which we are given sparse information on the solution of a PDE. The task is to reconstruct the solution in space with high resolution but without using the full high-resolution data in the training. We reconstruct the spatiotemporal chaotic solution of the Navier–Stokes equations on a high-resolution grid from a sparse grid containing only 1% of the information. We demonstrate that the proposed PC-CNN provides accurate physical results, both qualitatively and quantitatively, as compared to traditional interpolation methods, such as bi-cubic interpolation. For both inverse tasks, we investigate the physical consistency of the inferred solutions for the two-dimensional turbulent flow. We find that the PC-CNN predictions provide correct physical properties of the underlying partial differential equation (Navier–Stokes), such as the energy spectrum. This work opens opportunities for solving inverse problems with nonlinear partial differential equations. Future work will be focused on experimental data.
Acknowledgements
The authors are grateful to G. Rigas for insightful discussions on turbulence.
Data availability statement
Codes and data are accessible via GitHub: https://github.com/MagriLab/pisr; https://github.com/MagriLab/picr; and https://github.com/MagriLab/kolsol.
Author contribution
Conceptualization: D.K. & L.M.; Funding acquisition: L.M.; Investigation: D.K. & L.M.; Methodology: D.K. & L.M.; Project administration: L.M.; Software: D.K.; Supervision: L.M.; Validation: D.K. & L.M.; Visualization: D.K.; Writing-original draft: D.K. & L.M.; writing—review and editing: D.K. & L.M.
Funding statement
D. Kelshaw. & L. Magri. acknowledge support from the UK Engineering & Physical Sciences Research Council, and ERC Starting Grant PhyCo 949,388. L.M. acknowledges support from the grant EU-PNRR YoungResearcher TWIN ERC-PI_0000005.
Competing interest
None.
Ethical standard
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.
Appendix
A. Architectural details
Here, we provide details for the network architectures used to parameterize the mappings employed for each of the inverse problems.
A.1. Uncovering solutions from biased data
For the bias removal task, we use a UNet architecture to extract relevant features from the biased field. The architecture, as well as the general training loop, can be seen in Figure 8. This particular network configuration was found empirically. The input and output layers each feature two channels, corresponding to the two velocity components of the flow being studied. The network does not overfit, also thanks to the physics-based regularisation term in the loss function.

Figure 8. Uncovering solutions from biased data. The model 
 $ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 is responsible for mapping the biased state
$ {\boldsymbol{\eta}}_{\boldsymbol{\theta}} $
 is responsible for mapping the biased state 
 $ \boldsymbol{\zeta} \left(\varOmega, t\right) $
 to the true solution
$ \boldsymbol{\zeta} \left(\varOmega, t\right) $
 to the true solution 
 $ \boldsymbol{u}\left(\varOmega, t\right) $
. Convolutional layers are parameterized as torch.Conv2D(c_{in}, c_{out}, k), where c_{in}, c_{out} denote the number of input and output channels, respectively; and k is the spatial extent of the filter. The term
$ \boldsymbol{u}\left(\varOmega, t\right) $
. Convolutional layers are parameterized as torch.Conv2D(c_{in}, c_{out}, k), where c_{in}, c_{out} denote the number of input and output channels, respectively; and k is the spatial extent of the filter. The term 
 $ h $
 represents the activation, tanh in this case. The terms
$ h $
 represents the activation, tanh in this case. The terms 
 $ {\unicode{x1D543}}_D,{\unicode{x1D543}}_P,{\unicode{x1D543}}_C $
 denote the data loss, physics loss, and constraint loss respectively. The combination of these losses forms the objective loss
$ {\unicode{x1D543}}_D,{\unicode{x1D543}}_P,{\unicode{x1D543}}_C $
 denote the data loss, physics loss, and constraint loss respectively. The combination of these losses forms the objective loss 
 $ {\mathcal{L}}_{\theta } $
, which is used to update the network’s parameters. The term
$ {\mathcal{L}}_{\theta } $
, which is used to update the network’s parameters. The term 
 $ \tau $
 denotes the number of contiguous time-steps passed to the network, required for computing temporal derivatives. We provide an in-depth explanation of these losses in Section 4.1.
$ \tau $
 denotes the number of contiguous time-steps passed to the network, required for computing temporal derivatives. We provide an in-depth explanation of these losses in Section 4.1.
A.2. Reconstruction from sparse information
For the reconstruction task, a three-layer convolutional neural network is used for the model. Figure 9 provides an overview of the architecture, as well as the general training loop. The baseline network configuration is similar to that of Dong et al. (Reference Dong, Loy, He and Tang2014), with the additional constraints imposed through the physics-based loss function. The input and output layers each feature two channels, corresponding to the two velocity components of the flow being studied.

Figure 9. Reconstruction from sparse information. The model 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 is responsible for mapping the low-resolution field
$ {\boldsymbol{f}}_{\boldsymbol{\theta}} $
 is responsible for mapping the low-resolution field 
 $ \boldsymbol{u}\left({\varOmega}_L,t\right) $
 to the high-resolution field
$ \boldsymbol{u}\left({\varOmega}_L,t\right) $
 to the high-resolution field 
 $ \boldsymbol{u}\left({\varOmega}_H,t\right) $
. The upsampling layer (blue) performs bi-cubic upsampling to obtain the correct spatial dimensions. Convolutional layers are parameterized as torch.Conv2D(c_{in}, c_{out}, k), where c_{in}, c_{out} denote the number of input and output channels, respectively; and k is the spatial extent of the filter. The term
$ \boldsymbol{u}\left({\varOmega}_H,t\right) $
. The upsampling layer (blue) performs bi-cubic upsampling to obtain the correct spatial dimensions. Convolutional layers are parameterized as torch.Conv2D(c_{in}, c_{out}, k), where c_{in}, c_{out} denote the number of input and output channels, respectively; and k is the spatial extent of the filter. The term 
 $ h $
 represents the activation, tanh in this case. The terms
$ h $
 represents the activation, tanh in this case. The terms 
 $ {\unicode{x1D543}}_D,{\unicode{x1D543}}_P $
 denote the data loss and physics loss, respectively, the combination of which forms the objective loss
$ {\unicode{x1D543}}_D,{\unicode{x1D543}}_P $
 denote the data loss and physics loss, respectively, the combination of which forms the objective loss 
 $ {\mathcal{L}}_{\theta } $
, which is used to update the network’s parameters,
$ {\mathcal{L}}_{\theta } $
, which is used to update the network’s parameters, 
 $ \boldsymbol{\theta} $
. The term
$ \boldsymbol{\theta} $
. The term 
 $ \tau $
 denotes the number of contiguous time-steps passed to the network, required for computing temporal derivatives. We provide an in-depth explanation of these losses in Section 4.1.
$ \tau $
 denotes the number of contiguous time-steps passed to the network, required for computing temporal derivatives. We provide an in-depth explanation of these losses in Section 4.1.
B. Pseudo-spectral discretization of Navier–Stokes residual
Operating in the Fourier domain eliminates the diverge-free constraint (Canuto et al., Reference Canuto, Hussaini, Quarteroni and Zang1988). The equations for the spectral representation of the Kolmogorov flow are.
 $ \mathcal{R}\left({\hat{\boldsymbol{u}}}_k;\boldsymbol{p}\right)=\left(\frac{d}{dt}+\nu {\left|\boldsymbol{k}\right|}^2\right){\hat{\boldsymbol{u}}}_k-{\hat{\boldsymbol{f}}}_k+\boldsymbol{k}\frac{\boldsymbol{k}\cdot {\hat{\boldsymbol{f}}}_k}{{\left|\boldsymbol{k}\right|}^2}-{\hat{\boldsymbol{g}}}_k, $
 (B.1)
$ \mathcal{R}\left({\hat{\boldsymbol{u}}}_k;\boldsymbol{p}\right)=\left(\frac{d}{dt}+\nu {\left|\boldsymbol{k}\right|}^2\right){\hat{\boldsymbol{u}}}_k-{\hat{\boldsymbol{f}}}_k+\boldsymbol{k}\frac{\boldsymbol{k}\cdot {\hat{\boldsymbol{f}}}_k}{{\left|\boldsymbol{k}\right|}^2}-{\hat{\boldsymbol{g}}}_k, $
 (B.1)
with 
 $ {\hat{\boldsymbol{f}}}_k=-{\left(\hat{\boldsymbol{u}\cdot \nabla \boldsymbol{u}}\right)}_k $
, where nonlinear terms are handled pseudospectrally, employing the 23 de-aliasing rule to avoid the unphysical culmination of energy at high frequencies (Canuto et al., Reference Canuto, Hussaini, Quarteroni and Zang1988). A solution is produced by time-integration of the dynamical system with the explicit forward-Euler scheme, choosing a time-step
$ {\hat{\boldsymbol{f}}}_k=-{\left(\hat{\boldsymbol{u}\cdot \nabla \boldsymbol{u}}\right)}_k $
, where nonlinear terms are handled pseudospectrally, employing the 23 de-aliasing rule to avoid the unphysical culmination of energy at high frequencies (Canuto et al., Reference Canuto, Hussaini, Quarteroni and Zang1988). A solution is produced by time-integration of the dynamical system with the explicit forward-Euler scheme, choosing a time-step 
 $ \Delta t $
 that satisfies the Courant–Friedrichs–Lewy (CFL) condition. Initial conditions are generated by producing a random field scaled by the wavenumber, which retains the spatial structures of varying lengthscale in the physical domain (Ruan and McLaughlin, Reference Ruan and McLaughlin1998).
$ \Delta t $
 that satisfies the Courant–Friedrichs–Lewy (CFL) condition. Initial conditions are generated by producing a random field scaled by the wavenumber, which retains the spatial structures of varying lengthscale in the physical domain (Ruan and McLaughlin, Reference Ruan and McLaughlin1998).
C. Multi-wave number bias removal
 Further to the parameterization outlined in Section 6.1, we consider stationary biases that contain multiple wave numbers. We define this composite bias as the mean of two bias fields, with 
 $ {k}_{\phi}\in \left\{3,7\right\},\mathcal{M}=0.5 $
. Bias removal was performed for the Kolmogorov flow case, following the same procedure as outlined in Section 2.1. Results for the bias removal achieved a relative error of
$ {k}_{\phi}\in \left\{3,7\right\},\mathcal{M}=0.5 $
. Bias removal was performed for the Kolmogorov flow case, following the same procedure as outlined in Section 2.1. Results for the bias removal achieved a relative error of 
 $ 5.83\times {10}^{-2} $
, with results for multiple time-steps shown in Figure 10. The structure of the predicted bias captures the extrema of the bias, however, there is a small discrepancy in the magnitude as the field should be symmetric along both axes. This further demonstrates the robustness of the proposed approach.
$ 5.83\times {10}^{-2} $
, with results for multiple time-steps shown in Figure 10. The structure of the predicted bias captures the extrema of the bias, however, there is a small discrepancy in the magnitude as the field should be symmetric along both axes. This further demonstrates the robustness of the proposed approach.

Figure 10. Removal of a stationary bias field containing multiple wave numbers. The first three columns show the biased, original, and predicted flow fields respectively. The rightmost column depicts the predicted additive bias. Results are shown for multiple time-steps in the simulation trajectory.
D. Parametric study on the scale factor
 Experiments for reconstruction from sparse information were also run for 
 $ \kappa \in \left\{\mathrm{7,9,11,13,15}\right\} $
 using the same experimental setup; ultimately, demonstrating reconstruction results for a range of scale factors. Results for each of these scale factors are shown in Figure 11. We observe that the relative
$ \kappa \in \left\{\mathrm{7,9,11,13,15}\right\} $
 using the same experimental setup; ultimately, demonstrating reconstruction results for a range of scale factors. Results for each of these scale factors are shown in Figure 11. We observe that the relative 
 $ {\mathrm{\ell}}^2 $
-error increases with increasing
$ {\mathrm{\ell}}^2 $
-error increases with increasing 
 $ \kappa $
; this is to be expected as the available information becomes more sparse.
$ \kappa $
; this is to be expected as the available information becomes more sparse.

Figure 11. Reconstruction from sparse information. Performance for the reconstruction task is shown for a range of scale factors 
 $ \kappa $
, measuring performance with the relative
$ \kappa $
, measuring performance with the relative 
 $ {\mathrm{\ell}}^2 $
-errors,
$ {\mathrm{\ell}}^2 $
-errors, 
 $ e $
, between the predicted high-resolution fields
$ e $
, between the predicted high-resolution fields 
 $ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_{\mathbf{L}},t\right)\right) $
, and the true high-resolution fields
$ {\boldsymbol{f}}_{\boldsymbol{\theta}}\left(\boldsymbol{u}\left({\varOmega}_{\mathbf{L}},t\right)\right) $
, and the true high-resolution fields 
 $ \boldsymbol{u}\left({\varOmega}_{\boldsymbol{H}},t\right) $
.
$ \boldsymbol{u}\left({\varOmega}_{\boldsymbol{H}},t\right) $
.
 
  











































































 
              
Comments
No Comments have been published for this article.