Impact statement
Many existing machine learning techniques for surrogate modeling of spatial fields are unable to handle variable mesh topologies and associated unstructured meshes. Further, many techniques also process entire solution snapshots as inputs and outputs, presenting scaling difficulties when applied to very large engineering simulation datasets. The deep learning methods presented in this work allow for fast surrogate models of the solution fields to be constructed via training on heterogeneous data snapshots that may involve variable geometries and mesh topologies. The solution is constructed as a continuous field. Demonstrations are shown on a vehicle aerodynamics application, and the performance is enhanced by leveraging different approaches for training and the use of random Fourier features.
1. Introduction
High-fidelity numerical simulations are ubiquitous in engineering design and analysis but are often prohibitively expensive in design applications. Data-driven and machine-learning surrogate-modeling techniques offer an interesting alternative, particularly in situations where model accuracy may be acceptably traded for computational savings. However, many existing methods face limitations when confronted with unstructured and varying mesh topologies across the parameter or design-variable space. This confines such methods to problems that can be defined with a shared discretization or requires additional lossy interpolation to map solutions onto consistent meshes. These limitations pose a significant challenge for problems involving multiscale phenomena, commonly found in fluid and structural mechanics, where solutions frequently contain regions with large gradients and tightly clustered mesh cells. Additionally, the domains may contain intricate and varying geometric features among solution instances. In such scenarios, interpolating the solutions to a common and often Cartesian mesh results in unacceptable loss of critical information and fidelity.
Many advances in machine learning have been driven by methods which approximate mappings involving high-dimensional spaces, with the cardinality of input or output spaces $ \sim \mathcal{O}\left({10}^2-{10}^5\right) $ (Krizhevsky et al., Reference Krizhevsky, Sutskever and Hinton2012; Lecun et al., Reference Lecun, Bottou, Bengio and Haffner1998; Ronneberger et al., Reference Ronneberger, Fischer and Brox2015; Vincent et al., Reference Vincent, Larochelle, Bengio and Manzagol2008). This includes matrix-decomposition-based methods (Kutz et al., Reference Kutz, Brunton, Brunton and Proctor2016; Willcox and Peraire, Reference Willcox and Peraire2002) and deep-learning autoencoder techniques (Bhatnagar et al., Reference Bhatnagar, Afshar, Pan, Duraisamy and Kaushik2019; Thuerey et al., Reference Thuerey, Weißenow, Prantl and Hu2020; Xu and Duraisamy, Reference Xu and Duraisamy2020) used in scientific applications. Computational limitations arise when applied to scientific or engineering simulations of scale, where the mesh-cell cardinality $ N $ may be in the tens-of-millions or even billions in super-computing settings (Arroyo et al., Reference Arroyo, Dombard, Duchaine, Gicquel, Martin, Odier and Staffelbach2021), with the total degrees-of-freedom for a solution state an even-larger multiple of $ N $ . This requires mapping between high-dimensional snapshots in autoencoder-style models, and decomposing even larger snapshot matrices for projection-based ROMs, dynamic mode decomposition, or related Koopman methods (Schmid, Reference Schmid2022).
A recent line of research instead approximates infinite-dimensional (continuous) functions by mapping between lower-dimensional spaces using simple coordinate-based fully connected multi-layer perceptron (MLP) neural networks. That is, instead of mapping between snapshots over the full domain of the problem, for example, with a nonautoassociative autoencoder convolutional neural network (CNN) (Bhatnagar et al., Reference Bhatnagar, Afshar, Pan, Duraisamy and Kaushik2019), a mapping is regressed between inputs and outputs at each individual point in space. The resulting key distinction is that coordinate inputs $ \mathbf{x} $ are taken pointwise, for example, as a single physical-coordinate tuple $ \left(x,y,z\right) $ , instead of entire-solution snapshots. A driving application is object and scene representation for rendering in computer graphics, in which objects are represented by continuous implicit fields such as the signed-distance function (SDF) zero-level set (Davies et al., Reference Davies, Nowrouzezahrai and Jacobson2020; Park et al., Reference Park, Florence, Straub, Newcombe and Lovegrove2019), SDF decision boundary (Mescheder et al., Reference Mescheder, Oechsle, Niemeyer, Nowozin and Geiger2019), or as a density/differential-opacity along a light ray (Mildenhall et al., Reference Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi and Ng2021). This concept is known by several names, including coordinate-based networks, neural fields, neural implicits, and implicit neural representations (Xie et al., Reference Xie, Takikawa, Saito, Litany, Yan, Khan, Tombari, Tompkin, sitzmann and Sridhar2022), the latter of which provides a framework that encompasses and generalizes these methods, extending them to other problem scenarios, including the recovery of typical supervised-learning problems. Using a continuous representation is key to attaining discretization independence when applied to scientific simulation data. Every point in each mesh may be included separately, eliminating the need for lossy interpolation of solution data onto a common Cartesian mesh.
In this work, coordinate-based MLPs are applied to the prediction of partial differential equation (PDE) solutions with variable geometry. The predictive models must be able to concurrently handle unstructured data and variation in physical design and operating conditions, as described by design-variable vector $ \boldsymbol{\mu} $ , to be useful in design optimization or other engineering tasks. The physical coordinate inputs are augmented to include an evaluation of the SDF to provide global information about the domain at each point; this may be viewed as a form of concatenation-based local conditioning as the SDF is a function of space over the relevant domains. In addition to the SDF, the network predictions are globally conditioned upon the design variables $ \boldsymbol{\mu} $ . Several different techniques for conditioning neural network predictions are used in the literature (Dumoulin et al., Reference Dumoulin, Perez, Schucher, Strub, Vries, Courville and Bengio2018; Park et al., Reference Park, Florence, Straub, Newcombe and Lovegrove2019; Perez et al., Reference Perez, Strub, de Vries, Dumoulin and Courville2018; Xie et al., Reference Xie, Takikawa, Saito, Litany, Yan, Khan, Tombari, Tompkin, sitzmann and Sridhar2022) with concatenation-based conditioning and the use of hypernetworks (Ha et al., Reference Ha, Dai and Le2016) explored here. The method which utilizes concatenation-based conditioning is termed design-variable multilayer perceptron (DV-MLP), while design-variable hypernetworks (DVH) of course utilize a hypernetwork structure. Most conditioning schemes involve learning an additional embedding to condition the networks upon, whereas here, the design variables and SDF are used instead; both of which are known once a design is selected, simplifying the overall scheme.
The outline of the article is as follows: Background information relating to existing discretization-dependent methods along with approaches for generalization and conditioning of predictions are given in the remainder of the introduction, Sections 1.1–1.3. The proposed methods are treated in greater detail in Section 2. Numerical experiments for 2D vehicle aerodynamics with complex, realistic automotive shapes are given in Sections 3 and 4. Section 3 provides a baseline result for single and multiple vehicle speeds, and Section 4 explores the effects of using random Fourier features, where accuracy and generalization properties are improved. Conclusions follow in Section 5, with additional information given in the Appendix.
1.1. Discretization-dependent methods
As a classical dimensionality-reduction technique, proper orthogonal decomposition (POD) has been used to construct surrogate and reduced-order models (Benner et al., Reference Benner, Gugercin and Willcox2015; Dolci and Arina, Reference Dolci and Arina2016; Salmoiraghi et al., Reference Salmoiraghi, Scardigli, Telib and Rozza2018; Willcox and Peraire, Reference Willcox and Peraire2002). Despite many attractive properties, conventional POD implementations process discretized data and require the use of a fixed topology mesh across all parameter regimes, fixing the number of degrees-of-freedom. This is restrictive in many engineering problems in which various solution features (e.g., relative motion of bodies, crack propagation, etc.) may emerge in different regions of parameter space. Further, data may be available from multiple sources with varying discretization and mesh topologies. Other snapshot-based methods inherit these disadvantages, including autoencoders and their variants.
CNNs have been used to construct surrogate models for both steady-state (Bhatnagar et al., Reference Bhatnagar, Afshar, Pan, Duraisamy and Kaushik2019; Guo et al., Reference Guo, Li and Iorio2016; Ronneberger et al., Reference Ronneberger, Fischer and Brox2015; Tangsali et al., Reference Tangsali, Krishnamurthy and Hasnain2021; Thuerey et al., Reference Thuerey, Weißenow, Prantl and Hu2020) and time-varying parametric problems (Hasegawa et al., Reference Hasegawa, Fukami, Murata and Fukagata2020; Xu and Duraisamy, Reference Xu and Duraisamy2020) by including an additional time-advance model such as a long short-term memory (LSTM) or temporal-convolutional network. However, CNNs place even greater restrictions on the discretization than POD-based methods, requiring inputs and outputs to be defined on regular Cartesian grids of consistent dimension for all parameter regimes. Overcoming this restriction requires interpolation from the computational mesh to a Cartesian grid overlain on the problem domain, equivalent to pixelization. The interpolation results in a number of undesirable effects, including a reduced-fidelity representation of the domain geometry, and a loss of information in regions of tightly clustered mesh points, such as within boundary layers, shocks, and wakes. CNN-based models may then be conceptualized as image-to-image mappings, where researchers Guo et al. (Reference Guo, Li and Iorio2016) note improved results when using a signed-distance field as the network input in contrast to a binarized representation, inspiring its use as an additional input feature in this work for coordinate-based networks.
Another more problematic but surprisingly overlooked issue is that the memory requirements for 3D convolutions, commonly implemented on a single GPU, are not affordable for typical resolutions in realistic engineering problems. When taking mini-batch training into consideration, even storing the output of a single hidden layer (a 5-dimensional tensor) requires memory typically on the order of $ \mathcal{O}(10)-\mathcal{O}\left({10}^2\right) $ GB for a 3D Cartesian field with 40 million cells. As a result, most reported works using 3D CNNs for engineering problems are limited to below 5–6 million degrees of freedom (Mohan et al., Reference Mohan, Daniel, Chertkov and Livescu2019; Santos et al., Reference Santos, Xu, Jo, Landry, Prodanovic and Pyrcz2020) and often still require lossy interpolation (Gao et al., Reference Gao, Zhang, Jia, Zhu, Zhou and Wang2022).
1.2. Hypernetworks and methods for conditioning neural networks
Hypernetworks constitute a metamodeling approach where one neural network is used to generate the weights of another main network (Ha et al., Reference Ha, Dai and Le2016) and are part of a broader class of proposed techniques where network weights are conditioned on model inputs or features (Bertinetto et al., Reference Bertinetto, Henriques, Valmadre, Torr and Vedaldi2016; Jaderberg et al., Reference Jaderberg, Simonyan, Zisserman and Kavukcuoglu2015; Jia et al., Reference Jia, De Brabandere, Tuytelaars, Gool, Lee, Sugiyama, Luxburg, Guyon and Garnett2016). Hypernetworks were originally applied to convolutional and recurrent neural networks for image- and natural-language-processing tasks, with the goal of reducing the number of trainable parameters while maintaining or improving model accuracy. Hypernetwork applications have largely remained within computer science, but they have recently been applied to scientific machine learning in some instances. Pan et al. (Reference Pan, Brunton and Kutz2023) leveraged a hypernetwork structure known as neural implicit flow (NIF) to learn latent representations of turbulence on arbitrary meshes in a scheme similar to the DVH used here. HyperPINNs apply hypernetworks to physics-informed neural networks (PINNs) (Raissi et al., Reference Raissi, Perdikaris and Karniadakis2019) for parametric PDE solutions of the 1D viscous Burgers equation and the Lorenz system, with improved accuracy seen over baseline PINN models despite a smaller main network (de Avila Belbute-Peres et al., Reference de Avila Belbute-Peres, Chen and Sha2021).
Other methods of conditioning include the use of feed-forward encoders, auto-decoders (Park et al., Reference Park, Florence, Straub, Newcombe and Lovegrove2019), concatenation-based conditioning, and feature-wise transformations (Dumoulin et al., Reference Dumoulin, Perez, Schucher, Strub, Vries, Courville and Bengio2018; Perez et al., Reference Perez, Strub, de Vries, Dumoulin and Courville2018). Hypernetworks are the most general conditioning method, as all the other methods may be derived from a hypernetwork, which outputs only a portion of the network weights (Xie et al., Reference Xie, Takikawa, Saito, Litany, Yan, Khan, Tombari, Tompkin, sitzmann and Sridhar2022). Further, the conditioning may be global or local. Global conditioning uses a single embedding for an entire instance, whereas local conditioning uses an embedding which is itself a function of space over the domain of interest. In this work, a combination of the two is used, where the inclusion of the SDF coordinate may be viewed as local conditioning, while the use of design variables $ \boldsymbol{\mu} $ is a form of global conditioning.
1.3. Other discretization-independent methods
In addition to neural fields, some graph neural networks (GNNs), point-cloud networks, and operator regression methods also allow for discretization independence. GNNs have been developed to generalize convolutional-like models to problems defined on non-Euclidean domains, or with non-regular Cartesian structure. GNNs may be classified as either spectral (Bruna et al., Reference Bruna, Zaremba, Szlam and LeCun2014; Defferrard et al., Reference Defferrard, Bresson and Vandergheynst2016; Henaff et al., Reference Henaff, Bruna and LeCun2015; Kipf and Welling, Reference Kipf and Welling2017) or spatial (Duvenaud et al., Reference Duvenaud, Maclaurin, Iparraguirre, Bombarell, Hirzel, Aspuru-Guzik and Adams2015) approaches, although the two may be generalized by the message-passing graph neural network (MPGNN) (Gilmer et al., Reference Gilmer, Schoenholz, Riley, Vinyals and Dahl2017). Spectral GNNs require a consistent discretization among all instances similar to POD, while MPGNNs allow the meshes to vary. MPGNNs have been used for body-force predictions of aerodynamic flows (Ogoke et al., Reference Ogoke, Meidani, Hashemi and Farimani2021). Additionally, MPGNNs are used as a sub-component for certain learning and prediction schemes, with a focus on PDEs in either a mesh-based (Pfaff et al., Reference Pfaff, Fortunato, Sanchez-Gonzalez and Battaglia2020; Xu et al., Reference Xu, Pradhan and Duraisamy2021) or mesh-free scenario (Sanchez-Gonzalez et al., Reference Sanchez-Gonzalez, Godwin, Pfaff, Ying, Leskovec and Battaglia2020) in addition to some neural operator methods discussed later. Additionally, Geodesic CNNs can handle varying mesh topologies and have been used for aerodynamics and optimization (Baque et al., Reference Baque, Remelli, Fleuret and Fua2018), along with heat transfer through porous media (Mallya et al., Reference Mallya, Baqué, Yvernay, Pozzetti, Fua and Haussener2023).
Point cloud neural networks are useful in situations in which the data is available in the form of unstructured point clouds, such as the raw output of a LIDAR unit or other three-dimensional sensor. As such they are often seen in the context of autonomous vehicles or robotic vision. PointNet (Qi, Su, et al., Reference Qi, Su, Mo and Guibas2017) and PointNet++ (Qi, Yi, et al., Reference Qi, Yi, Su and Guibas2017) are architectures designed for point clouds and are used for scene recognition, classification, and segmentation tasks. PointNet++ has been adapted in the context of predicting viscous, incompressible flows over 2D shapes lying on unstructured meshes (Kashefi et al., Reference Kashefi, Rempe and Guibas2021).
Another class of relevant techniques capable of handling unstructured data includes operator-regression methods, such as those based on DeepONet (Cai et al., Reference Cai, Wang, Lu, Zaki and Karniadakis2021; Lu et al., Reference Lu, Jin, Pang, Zhang and Karniadakis2021; Wang et al., Reference Wang, Wang and Perdikaris2021), Neural Operator (Kovachki et al., Reference Kovachki, Li, Liu, Azizzadenesheli, Bhattacharya, Stuart and Anandkumar2023; Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al., Reference Li, Kovachki, Azizzadenesheli, Liu, Stuart, Bhattacharya and Anandkumar2020; Li, Kovachki, Azizzadenesheli, Liu, Stuart, et al., Reference Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, Stuart and Anandkumar2020), and GMLS Nets (Trask et al., Reference Trask, Patel, Gross and Atzberger2019). DeepONet’s structure may be viewed as a partial hypernetwork for just the output layer and is thus similar to what is pursued here with DVHs, with another major difference corresponding to hypernetwork (branch network) inputs. The DeepONet branch network takes as input a sampling of input functions across the domain, and in fact DeepONet is conceived in 1D with reference to an operator representation theorem, which further assumes the sensors have consistent sampling locations which span the domain. Unstructured meshes with varying topology among instances generally do not have a shared set of points to place sensors among all instances, complicating the use of DeepONet in the present scenario.
2. Methods
Engineering models are typically expressed as systems of PDEs describing the design under consideration, with the highest-fidelity versions termed the full-order model (FOM). Quantities of interest (QoIs) relevant to assessing the design are extracted from numerically generated FOM solutions. In the context of design, the FOM may be parameterized by vector $ \boldsymbol{\mu} \in {\mathrm{\mathbb{R}}}^{n_{\mu }} $ so that important and defining design elements may be varied and the QoI response measured. Often, these variations include the shape or geometry of some problem element, and these geometric design variables may be collected and represented as $ {\boldsymbol{\mu}}_{\mathrm{geo}}\in {\mathrm{\mathbb{R}}}^{n_{\mu_{\mathrm{geo}}}} $ . These parameters are often used with and defined in the context of a separate parametric geometry model, relating the design variables to the fine details of the physical design. In this work, the design variables $ \boldsymbol{\mu} $ are used in place of a learned representation in order to condition predicted PDE solutions $ \mathbf{q}\left(\mathbf{x};\boldsymbol{\mu} \right)\in {\mathrm{\mathbb{R}}}^{n_q} $ . Two methods of global conditioning are used, including concatenation-based conditioning and the use of dense hypernetworks. Additionally, the signed-distance function $ {f}_{\mathrm{sdf}}\left(\mathbf{x};\Omega \right) $ or minimum distance function $ \phi \left(\mathbf{x};\Omega \right) $ are used as inputs to the main network as an implicit representation of the domain geometry, along with the spatial coordinates $ \mathbf{x}\in {\mathrm{\mathbb{R}}}^{n_x} $ . This may be viewed as a form of local concatenation-based conditioning.
Consider FOMs for steady-state, parametric PDEs, written generally as
(2.1) $$ \mathcal{R}\left(\mathbf{q}\left(\mathbf{x};\boldsymbol{\mu} \right);\boldsymbol{\mu} \right)=\mathbf{0},\hskip1em \mathbf{x}\in \Omega, $$
with boundary conditions prescribed as
(2.2) $$ \mathcal{B}\left(\mathbf{q}\left(\mathbf{x};\boldsymbol{\mu} \right);\boldsymbol{\mu} \right)=\mathbf{0},\hskip1em \mathbf{x}\in \mathrm{\partial \Omega}, $$
where $ \mathbf{q} $ is the solution state, $ \Omega /\mathrm{\partial \Omega}\subset {\mathrm{\mathbb{R}}}^{n_x} $ are the problem domain/boundary which are implicit functions of $ {\boldsymbol{\mu}}_{\mathrm{geo}} $ , $ \mathbf{x}\in \Omega $ are the spatial coordinates, $ \mathcal{R} $ is the PDE operator, and $ \mathcal{B} $ is the boundary-condition operator. The geometric design variables $ {\boldsymbol{\mu}}_{\mathrm{geo}} $ are included as part of $ \boldsymbol{\mu} $ , which may also contain PDE coefficients or numerical values of boundary or operating conditions. In this context, the elements of $ \boldsymbol{\mu} $ apply to the entire physical domain over which the FOM is solved, in contrast to problem scenarios involving spatially distributed parameters, such as a thermal conductivity or another physical property.
2.1. Shape and scene representation via coordinate-based neural networks
Given a set of points representing a surface $ \mathcal{S}=\partial \mathcal{V} $ of an object $ \mathcal{V}\subset {\mathrm{\mathbb{R}}}^m $ in $ m $ -dimensional physical space, the signed-distance function may be defined as
(2.3) $$ {f}_{\mathrm{sdf}}\left(\mathbf{x};\mathcal{S}\right)=\left\{\begin{array}{ll}\phi \left(\mathbf{x};\mathcal{S}\right), & \mathbf{x}\notin \mathcal{V},\\ {}-\phi \left(\mathbf{x};\mathcal{S}\right), & \mathbf{x}\in \mathcal{V},\end{array}\right. $$
where
(2.4) $$ \phi \left(\mathbf{x};\mathcal{S}\right)=\underset{{\mathbf{x}}^{\ast}\in \mathcal{S}}{\min}\hskip0.2em d\left(\mathbf{x},{\mathbf{x}}^{\ast}\right) $$
is a minimum-distance function (MDF), and $ d\left(\cdot, \cdot \right) $ is the Euclidean-distance function. Stated simply, the SDF is the minimum distance between the field point $ \mathbf{x} $ and the surface $ \mathcal{S} $ in consideration. It takes positive values for points outside the object ( $ \mathbf{x}\notin \mathcal{V} $ ), negative values for points inside ( $ \mathbf{x}\in \mathcal{V} $ ), and is identically zero on the surface.
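As a concrete illustration, the minimal NumPy sketch below evaluates the MDF and SDF of Equations (2.3) and (2.4) against a sampled set of surface points. The surface sampling and the inside/outside predicate are assumptions introduced for illustration; the paper does not prescribe how the SDF is computed.

```python
import numpy as np

def minimum_distance(x, surface_pts):
    """MDF phi(x; S): minimum Euclidean distance from field point x to sampled surface points."""
    # surface_pts: (n_s, m) array of points on S; x: (m,) field point
    return np.min(np.linalg.norm(surface_pts - x, axis=1))

def signed_distance(x, surface_pts, is_inside):
    """SDF: +phi outside the object, -phi inside, zero on the surface.

    `is_inside` is a hypothetical user-supplied predicate (e.g., a point-in-polygon
    test); it is not specified by the paper.
    """
    phi = minimum_distance(x, surface_pts)
    return -phi if is_inside(x) else phi
```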
Coordinate-based MLPs have been used to represent 3D objects for rendering tasks. The object’s surface is implicitly represented within a volumetric field, including as the zero-level-set of a directly regressed signed-distance field (Davies et al., Reference Davies, Nowrouzezahrai and Jacobson2020; Park et al., Reference Park, Florence, Straub, Newcombe and Lovegrove2019) or decision boundary (interior/exterior) (Chen and Zhang, Reference Chen and Zhang2019; Mescheder et al., Reference Mescheder, Oechsle, Niemeyer, Nowozin and Geiger2019), or as an emitted radiance and density/differential-opacity field (Mildenhall et al., Reference Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi and Ng2021). Many of these methods include loss terms describing a rendering process, such that the entire image generation process contributes to the loss during training. This concept may be generalized by implicit neural representations, introduced along with sin-activation SIREN networks by Sitzmann, Martel, et al. (Reference Sitzmann, Martel, Bergman, Lindell and Wetzstein2020), from which defining Equations (2.5) and (2.6) are taken. In this setting, a function of interest $ \Phi $ with input coordinates $ \mathbf{x} $ , written $ \Phi :\mathbf{x}\to \Phi \left(\mathbf{x}\right) $ , is defined by a set of constraints $ \mathcal{C} $ ,
(2.5) $$ {\mathcal{C}}_m\left(\mathbf{a}\left(\mathbf{x}\right),\Phi \left(\mathbf{x}\right),{\nabla}_{\mathbf{x}}\Phi \left(\mathbf{x}\right),\dots \right)=0,\hskip1em \forall \mathbf{x}\in {\Omega}_m,\hskip1em m=1,\dots, M, $$
which optionally depend on the function values $ \Phi $ , function gradients $ {\nabla}_{\mathbf{x}}\Phi $ , and additional quantities $ \mathbf{a}\left(\mathbf{x}\right) $ which are needed to compute the constraints. When a neural network $ N $ with parameters $ \theta $ is used to approximate $ \Phi $ , then this is referred to as an implicit neural representation. To train the neural-network approximation, a loss function with $ M $ terms is defined by penalizing deviation from the constraints,
(2.6) $$ \mathcal{L}=\int_{\Omega}\sum_{m=1}^M{1}_{\Omega_m}\left(\mathbf{x}\right)\left\Vert {\mathcal{C}}_m\left(\mathbf{a}\left(\mathbf{x}\right),\Phi \left(\mathbf{x}\right),{\nabla}_{\mathbf{x}}\Phi \left(\mathbf{x}\right),\dots \right)\right\Vert \hskip0.2em d\mathbf{x}, $$
where the indicator function $ {1}_{\Omega_m} $ activates over valid locations within the domain $ {\Omega}_m $ . A key distinction between this and other methods is that coordinate inputs $ \mathbf{x} $ are taken pointwise, for example as a single physical-coordinate tuple $ \left(x,y,z\right) $ , instead of as entire-solution snapshots. Surprisingly, many problems may be cast in this form, including the surface-representation tasks discussed above and variations on classic deep-learning problems, such as classifying MNIST hand-written digits. The approaches proposed in this work may be viewed through this lens, where the only constraint is that the predictions match the data, recovering basic supervised regression. However, this does not directly align with the main thrust of implicit neural representations, where an implicit function is regressed and additional constraints are imposed.
A popular approach to generalization across instances involves autodecoders such as DeepSDF, which uses a form of concatenation-based conditioning, where concurrently learned embedding-vector $ {\mathbf{z}}_j $ is concatenated with spatial coordinates $ \mathbf{x} $ as network input (Park et al., Reference Park, Florence, Straub, Newcombe and Lovegrove2019). This is a powerful concept, as it allows for a single network to predict the SDF for many shapes, but it may come at the expense of blurred fine-object details (Davies et al., Reference Davies, Nowrouzezahrai and Jacobson2020), along with the need to solve an optimization problem for $ {\mathbf{z}}_j $ in order to make predictions for an unseen case. An alternative to embedding vector concatenation is weight-encoding (Davies et al., Reference Davies, Nowrouzezahrai and Jacobson2020), where a neural network is overfit to each shape separately without an embedding vector. In this scenario, the weights themselves are viewed as the embedding. To make predictions for a rendering, the weights for the shape under consideration are loaded and used. This method may provide greater accuracy but has the additional complications of training a separate network for each object and for loading and unloading weights. Additionally, this does not allow one to make predictions for objects outside of the training set. This leads naturally to the concept of applying hypernetworks to the coordinate-based MLPs as an alternative to training networks separately, an idea explored briefly by Sitzmann, Chan, et al. (Reference Sitzmann, Chan, Tucker, Snavely, Wetzstein, Larochelle, Ranzato, Hadsell, Balcan and Lin2020).
2.2. Problem setup
Denote the solution snapshot for a single instance $ j $ of the FOM as
(2.7) $$ {S}_j={\left\{\left({\mathbf{q}}_i,{\mathbf{x}}_i\right)\right\}}_{i=1}^{n_j}, $$
where the solution output-input pairs are defined at $ {n}_j $ spatial (mesh) locations. Considering a dataset
containing $ {n}_D $ snapshots, a distinctive feature of this approach is that each snapshot may correspond to a solution domain with different spatial extent and discretization, with varying number and location of mesh points. Models are sought which can approximate the solution snapshots stored in $ D $ , without interpolation of ground-truth data or prediction. In other words, given the generative factors or design variables for a problem $ \boldsymbol{\mu} \in \mathcal{M}\subset {\mathrm{\mathbb{R}}}^{n_{\mu }} $ , predict the system state $ \mathbf{q} $ at any location $ \mathbf{x}\in \Omega \left(\boldsymbol{\mu} \right) $ . Denote the input space as $ \mathbf{x}\in \mathcal{X}\subset {\mathrm{\mathbb{R}}}^{n_x} $ , and the output space as $ \mathbf{q}\in \mathcal{Q}\subset {\mathrm{\mathbb{R}}}^{n_q} $ .
The desired model should generalize across solution instances, and thus approximate the mapping $ f:\mathcal{M}\times \mathcal{X}\to \mathcal{Q} $ , without direct knowledge of the system state. Note that this is in contrast to initial-value problems, where the initial state to be integrated forward in time is necessary. Because the problems considered may contain complex and variable geometry, the input space is augmented to include an additional minimum-distance-function coordinate, defined as
(2.9) $$ {\mathbf{x}}^{\prime }=\left[\mathbf{x},\phi \left(\mathbf{x};\mathrm{\partial \Omega}\right)\right]. $$
In the scenarios considered, all mesh-distances are positive, so the use of MDF and SDF is equivalent. This defines an augmented input space, $ {\mathcal{X}}^{\prime}\subset {\mathrm{\mathbb{R}}}^{n_x+1} $ , where $ {\mathbf{x}}^{\prime}\in {\mathcal{X}}^{\prime } $ , and in turn, the desired mapping is
(2.10) $$ f:\mathcal{M}\times {\mathcal{X}}^{\prime}\to \mathcal{Q}. $$
The model approximation is then written as $ f\left({\mathbf{x}}^{\prime };\boldsymbol{\mu} \right)\approx \hat{\mathbf{q}}\left({\mathbf{x}}^{\prime };\boldsymbol{\mu} \right) $ or just $ \hat{\mathbf{q}} $ in compact notation. Each design-variable vector $ \boldsymbol{\mu} $ defines a spatial solution field over a domain; thus, it is natural to condition the models upon the design variables.
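To make the data layout concrete, a minimal sketch of how such a heterogeneous dataset might be assembled is shown below. It reuses the minimum_distance helper from the SDF sketch above; the dictionary layout and field names are illustrative assumptions rather than the implementation used in this work.

```python
import numpy as np

def build_instance(mu_j, mesh_pts, q_j, surface_pts):
    """Package one FOM solution: design variables, augmented coordinates, and state."""
    # Augment each mesh point with its minimum distance to the body surface.
    phi = np.array([minimum_distance(x, surface_pts) for x in mesh_pts])   # (n_j,)
    x_aug = np.concatenate([mesh_pts, phi[:, None]], axis=1)               # (n_j, n_x + 1)
    return {"mu": mu_j, "x": x_aug, "q": q_j}

# The dataset D is simply a list: each instance keeps its own mesh size n_j,
# so nothing is interpolated onto a common grid.
# D = [build_instance(mu_j, X_j, Q_j, S_j) for (mu_j, X_j, Q_j, S_j) in fom_solutions]
```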
2.3. Method 1: design-variable MLP (DV-MLP)
DV-MLP uses concatenation-based conditioning to account for different solution instances, where the augmented coordinates $ {\mathbf{x}}^{\prime } $ are concatenated with the design variables $ \boldsymbol{\mu} $ . This is a form of global conditioning as the same $ \boldsymbol{\mu} $ is used for all spatial locations for a given solution field. The main network (denoted as $ {N}_m $ with weights $ {\theta}_m $ ) and resulting prediction are written as
(2.11) $$ \hat{\mathbf{q}}\left({\mathbf{x}}^{\prime };\boldsymbol{\mu} \right)={N}_m\left(\left[{\mathbf{x}}^{\prime },\boldsymbol{\mu} \right];{\theta}_m\right). $$
Simple fully connected layers are used, where the hidden state of each layer has the same dimension or number of nodes, and no skip or recurrent connections are used. See Figure 1a for a schematic.
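A minimal TensorFlow/Keras sketch of DV-MLP is given below. The framework, the swish activation, and the default input/output sizes are assumptions for illustration; the hidden width of 50 and depth of 5 follow the architecture described in Section 3.1.

```python
import tensorflow as tf

def build_dv_mlp(n_x_aug=3, n_mu=8, n_q=3, hidden=50, n_hidden_layers=5):
    """Concatenation-conditioned main network: (x', mu) -> q_hat."""
    x_aug = tf.keras.Input(shape=(n_x_aug,), name="x_aug")   # (x, y, phi)
    mu = tf.keras.Input(shape=(n_mu,), name="mu")            # global design variables
    h = tf.keras.layers.Concatenate()([x_aug, mu])
    for _ in range(n_hidden_layers):
        h = tf.keras.layers.Dense(hidden, activation="swish")(h)
    q_hat = tf.keras.layers.Dense(n_q, name="q_hat")(h)
    return tf.keras.Model([x_aug, mu], q_hat)
```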
2.4. Method 2: design-variable hypernetworks (DVH)
DVH generates the weights and biases of the main network $ {\theta}_m $ using a dense hypernetwork which takes as input the design variables $ \boldsymbol{\mu} $ . Davies et al. (Reference Davies, Nowrouzezahrai and Jacobson2020) noted that an autodecoder trained to represent many shapes had marginally worse performance representing the fine-grained details of the objects as compared to over-fitting a network on each case separately. In an effort to avoid this loss in fine-grained detail and to avoid training a separate model for each case, a hypernetwork is used to generate main network weights $ {\theta}_m $ for each solution instance. The hypernetwork is written as
(2.12) $$ {\theta}_m={N}_h\left(\boldsymbol{\mu}; {\theta}_h\right), $$
and the main-network prediction is written as
(2.13) $$ \hat{\mathbf{q}}\left({\mathbf{x}}^{\prime };\boldsymbol{\mu} \right)={N}_m\left({\mathbf{x}}^{\prime };{\theta}_m\left(\boldsymbol{\mu} \right)\right), $$
where $ {\theta}_h $ and $ {\theta}_m $ are the weights and biases of the hypernetwork and main network. In the following experiments, simple MLPs are used for both the main network and hypernetwork. All of the weights and biases contained in $ {\theta}_m $ are generated at once as one large vector which is sliced and reshaped as required. The training loss depends on the main-network predictions, but only the hypernetwork weights $ {\theta}_h $ are adjusted during training; the loss gradients are back-propagated through the hypernetwork. See Figure 1b for a schematic. The use of a hypernetwork in this way is a form of global conditioning upon the design variables $ \boldsymbol{\mu} $ .
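The sketch below illustrates the DVH evaluation path under the same assumptions as the DV-MLP sketch (TensorFlow/Keras, swish activation): the hypernetwork maps $ \boldsymbol{\mu} $ to one flat weight vector, which is sliced and reshaped into the dense layers of the main network, so gradients reach $ {\theta}_h $ through the generated $ {\theta}_m $ .

```python
import math
import tensorflow as tf

def main_layer_shapes(n_in, hidden, n_hidden_layers, n_out):
    """(W, b) shapes of the dense main network, e.g. 5 hidden layers of width 50."""
    dims = [n_in] + [hidden] * n_hidden_layers + [n_out]
    return [((dims[i], dims[i + 1]), (dims[i + 1],)) for i in range(len(dims) - 1)]

def build_hypernetwork(n_mu, shapes, hidden=50, n_hidden_layers=5):
    """N_h: mu -> flat theta_m; the caller slices/reshapes theta_m per layer."""
    n_theta_m = sum(math.prod(w) + math.prod(b) for w, b in shapes)
    mu = tf.keras.Input(shape=(n_mu,))
    h = mu
    for _ in range(n_hidden_layers):
        h = tf.keras.layers.Dense(hidden, activation="swish")(h)
    return tf.keras.Model(mu, tf.keras.layers.Dense(n_theta_m)(h))

def main_forward(x_aug, theta_m, shapes, act=tf.nn.swish):
    """N_m evaluated with generated weights for one solution instance.

    x_aug: (N, n_x') points; theta_m: flat 1-D weight vector for that instance.
    """
    h, k = x_aug, 0
    for i, (w_shape, b_shape) in enumerate(shapes):
        nw, nb = math.prod(w_shape), math.prod(b_shape)
        W = tf.reshape(theta_m[k:k + nw], w_shape); k += nw
        b = tf.reshape(theta_m[k:k + nb], b_shape); k += nb
        h = h @ W + b
        if i < len(shapes) - 1:   # linear output layer
            h = act(h)
    return h
```

For a single instance, `theta_m = hyper(mu[None, :])[0]` yields the flat weight vector consumed by `main_forward`; the batch-by-case training mode of Section 2.4.2 vectorizes this evaluation over several instances at once.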
2.4.1. Network-size scaling considerations
Dense hypernetworks have an important scaling consideration relating the number of trainable hypernetwork parameters in $ {\theta}_h $ to the size of the main network $ {\theta}_m $ . Consider a main network with $ {L}_m $ hidden layers each with a hidden dimension $ H $ , and an analogous hypernetwork with hidden dimension $ H $ except for the final hidden layer which has dimension $ {H}_L $ . Since all of the main network weights are generated at once, the output layer of the hypernetwork has roughly $ {H}_L $ times as many weights as the main network. That is,
(2.14) $$ \left|{\theta}_h\right|\approx {H}_L\left|{\theta}_m\right|\sim \mathcal{O}\left({L}_m{H}^2{H}_L\right), $$
showing that the total number of hypernetwork weights is linear in main-network depth $ {L}_m $ , quadratic in main-network hidden dimension $ H $ , and linear in hypernetwork-final-hidden dimension $ {H}_L $ . This is intuitive given the use of dense layers throughout, and Equation 2.14 is developed in more detail in Appendix Section A.1. Once a main-network architecture is chosen, with $ {L}_m $ and $ H $ fixed, $ {H}_L $ is the remaining term that drives the number of weights and must be managed for the given training hardware and problem at hand, suggesting the choice $ {H}_L<H $ . This leads to an encoder-like interpretation of the hypernetwork, where the final hidden state is an $ {H}_L $ -dimensional embedding for a linear neural-network generator: the hypernetwork output layer. The interpretation exists regardless, even if $ {H}_L\ge H $ , but the bottleneck structure brings it out and is reminiscent of autoencoder-style models.
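A quick way to sanity-check the scaling of Equation 2.14 is to count dense-layer parameters directly, as in the helper below; the layer widths used here follow Section 3.1, but the resulting counts are illustrative and are not taken from Table 2.

```python
def dense_param_count(dims):
    """Weights + biases of an MLP with layer widths dims[0] -> dims[-1]."""
    return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(len(dims) - 1))

# Main network: 5 hidden layers of width H = 50 (inputs x' in R^3, outputs q in R^3).
n_theta_m = dense_param_count([3] + [50] * 5 + [3])

# Hypernetwork: same hidden width, final hidden layer H_L, output layer of size |theta_m|.
H_L = 50
n_theta_h = dense_param_count([8] + [50] * 4 + [H_L, n_theta_m])
# n_theta_h / n_theta_m is roughly H_L, i.e. about 50x, consistent with Section 2.4.1.
```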
The scaling relationship of Equation 2.14 also highlights a difference between this and many other hypernetwork models. Frequently, the goal is to reduce the number of trainable parameters in a given main network while retaining predictive accuracy, while here, the differences in accuracy and generalization are studied between vector-embedded and weight-embedded coordinate-based networks. In the following experiments, $ {H}_L $ is chosen to be simply $ {H}_L=H=50 $ , across main and hypernetworks, meaning that the size of the DVH model is roughly 50 times larger than the DV-MLP model, as seen in Section A.2 of the Appendix. This is taken into account by comparing the time and resources needed to train each type of model while also assessing predictive accuracy. Further, the training methods which follow in Section 2.4.2 are found to have a large impact on the required training time. Differing hypernetwork architectures which reduce the number of trainable parameters are possible but are outside the scope of this paper.
2.4.2. Training considerations
Equation 2.14 implies that the computational complexity of training DVH models may follow a similar scaling relative to training DV-MLP models. This is investigated by considering two approaches for training DVH models, with differences relating to details of model evaluation. Consider a minibatch $ j $ consisting of $ N $ spatial locations drawn from each of $ M $ solution instances. Represent the minibatch as a tuple of tensors, $ \left({\mathbf{M}}_j,{\mathbf{X}}_j,{\mathbf{Q}}_j\right) $ , where tensor $ {\mathbf{M}}_j $ holds hypernetwork inputs, $ {\mathbf{X}}_j $ holds main network inputs, and $ {\mathbf{Q}}_j $ holds the target solution variables. The tensor shapes vary between the following methods, which are described below and represented in Figure 2.
• Method 1: Fully-mixed batches. Minibatches are created which consist of points from different cases, with both the hypernetwork and main network forward-propagated for each data point, meaning the hypernetwork input vector $ \boldsymbol{\mu} $ is tiled across the mesh. In this scenario, all tensors have two axes, $ \dim \left({\mathbf{M}}_j\right)=\left(M\times N\right)\times {n}_{\mu } $ , $ \dim \left({\mathbf{X}}_j\right)=\left(M\times N\right)\times {n}_{x^{\prime }} $ , and $ \dim \left({\mathbf{Q}}_j\right)=\left(M\times N\right)\times {n}_q $ , as shown in Figure 2a. Let $ {\mathcal{C}}_M $ / $ {\mathcal{C}}_H $ be the per-evaluation costs of the main network and hypernetwork, respectively; then forward propagating the minibatch is proportional to
(2.15) $$ {\mathcal{C}}_{FM}\propto \left(M\times N\right){\mathcal{C}}_H+\left(M\times N\right){\mathcal{C}}_M. $$
The data are fully mixed by shuffling over all locations and solution instances used in building the minibatch. The batch size then corresponds to the number of spatial locations where predictions are sought.
• Method 2: Batch-by-case. This method takes advantage of the fact that design variables $ \boldsymbol{\mu} $ apply to an entire solution instance and that the hypernetwork is a neural-network generator. A single forward pass of the hypernetwork is combined with multiple forward passes of the main network for a given number of spatial locations, all coming from the same solution instance, defined by $ \boldsymbol{\mu} $ . In this scenario, the design-variable tensor has two axes, $ \dim \left({\mathbf{M}}_j\right)=M\times {n}_{\mu } $ , while the other tensors have three: $ \dim \left({\mathbf{X}}_j\right)=M\times N\times {n}_{x^{\prime }} $ and $ \dim \left({\mathbf{Q}}_j\right)=M\times N\times {n}_q $ , as represented in Figure 2b. The complexity of forward-propagating the minibatch scales as
(2.16) $$ {\mathcal{C}}_{BBC}\propto M{\mathcal{C}}_H+\left(M\times N\right){\mathcal{C}}_M. $$
The batch size is then the number of cases per mini-batch, where the same number of spatial locations $ N $ are evaluated for each case in each minibatch.
Comparing the first terms of Equations 2.15 and 2.16 shows that the batch-by-case hypernetwork cost is reduced by a factor of $ N $ , owing to far fewer of the expensive hypernetwork evaluations. The actual computational complexity will be measured through profiling in later sections.
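A sketch of one batch-by-case forward pass and loss evaluation is given below, reusing `main_layer_shapes` and `build_hypernetwork` from the DVH sketch in Section 2.4. The einsum-based broadcasting and the mean-squared-error loss are illustrative assumptions; note the single hypernetwork call per case, versus one per point in the fully-mixed mode.

```python
import math
import tensorflow as tf

def batch_by_case_loss(hyper, mu_batch, x_batch, q_batch, shapes, act=tf.nn.swish):
    """mu_batch: (M, n_mu); x_batch: (M, N, n_x'); q_batch: (M, N, n_q).

    Row m of x_batch/q_batch holds N points from the single instance mu_batch[m].
    """
    theta = hyper(mu_batch)                                # (M, n_theta): one call per case
    h, k = x_batch, 0
    for i, (w_shape, b_shape) in enumerate(shapes):
        nw, nb = math.prod(w_shape), math.prod(b_shape)
        W = tf.reshape(theta[:, k:k + nw], (-1,) + w_shape); k += nw    # (M, in, out)
        b = tf.reshape(theta[:, k:k + nb], (-1, 1) + b_shape); k += nb  # (M, 1, out)
        h = tf.einsum("mni,mio->mno", h, W) + b            # per-case weights on N points
        if i < len(shapes) - 1:
            h = act(h)
    return tf.reduce_mean(tf.square(h - q_batch))          # assumed MSE training loss
```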
3. Numerical experiments I: vehicle aerodynamics
External vehicle aerodynamics are considered, with parametric vehicle shapes lying on unstructured, nonparametric meshes. The incompressible RANS equations were solved using Star CCM+ with the $ k $ - $ \varepsilon $ turbulence model. The dataset—generated by General Motors, Inc.—consists of 2D slices along the vehicle centerline for 124 unique vehicle shapes at speeds of 90 and 130 kilometers-per-hour (kph). The incompressible RANS equations are shown and discussed in greater detail in Appendix Section A.5, including a discussion of nondimensional flow variables. The simulations utilize unstructured polyhedral meshes of varying size, with an example mesh shown in Figure 5a. Each vehicle shape is parameterized by 8 geometric parameters as summarized in Table 1, with all 124 shapes overlain on one set of axes in Figure 5b. The vehicle designs were selected using Latin hypercube sampling, and random subsets of the dataset are chosen for training and validation.
The drag coefficient is an important quantity commonly used in assessing aerodynamic designs and is given by
$$ {C}_D=\frac{F_D}{\frac{1}{2}{\rho}_{\infty }{U}_{\infty}^2A}, $$
where $ {F}_D $ is the drag force, consisting of pressure-force and skin-friction components, subscript $ \infty $ corresponds to freestream conditions, and $ A $ is the frontal area. Only three of the design variables, the windshield angle, vehicle length, and floor-to-roof height, are significantly correlated with $ {C}_D $ , as shown in Figure 3. The Pearson correlation coefficient for each is given on the plots, with a p-value $ p<0.05 $ as the criterion for determining significance. The floor-to-roof height shows the strongest correlation, with taller vehicles having greater drag.
The flow field structure for all vehicles and both speeds is typical of a time-averaged vehicle wake. Dominant features include a region of high pressure in front of the vehicle as the flow stagnates upon the grille, a low-pressure region on and above the vehicle’s roof, and a large separated wake behind the vehicle as the pressure recovers to free-stream values, with a decaying free-shear layer emanating from the vehicle’s rear-upper surface. The pressure forces on the vehicle contribute strongly to the drag, and thus examining the minimum and maximum pressures in the field gives insight into the variation in the dataset. Scatter plots of the min/max dimensional and nondimensional static pressure are shown in Figure 4. When dimensional static pressure is considered, as in the two left-most plots, the data are grouped by vehicle speed. However, when nondimensional pressure is used, as in the two right-most plots, the distributions collapse and one trend is observed for both vehicle speeds.
The computational fluid dynamics (CFD) meshes vary in size and topology, with the number of cells ranging from 108,748 to 115,751. In all sections, a spatial batch size of 54,000 points is used, resulting in two minibatches per case for a total of 108,000 mesh points used for each case. Mesh points are randomly dropped from each solution as required to make the minibatches, corresponding to 0.7%–6.7% of the points being left out for each training case. Reported training-error metrics include all mesh points, even if they were dropped from the training set, with further discussion of this in Section 3.2. The effect of varying spatial batch sizes is discussed in the Appendix, Section A.6.
The spatial input quantities are
$$ {\mathbf{x}}^{\prime }=\left(x,y,\phi \right), $$
and the predicted state is
$$ \mathbf{q}=\left(p,u,v\right). $$
The design-variable vectors differ slightly depending on the number of speeds considered and are given as
$$ \boldsymbol{\mu} ={\boldsymbol{\mu}}_{\mathrm{geo}}\hskip1em \left(\mathrm{single}\ \mathrm{speed}\right),\hskip2em \boldsymbol{\mu} =\left[{\boldsymbol{\mu}}_{\mathrm{geo}},{U}_{\infty}\right]\hskip1em \left(\mathrm{multiple}\ \mathrm{speeds}\right). $$
All spatial and predicted state quantities in $ {\mathbf{x}}^{\prime } $ and $ \mathbf{q} $ are nondimensionalized according to Section A.5 of the Appendix. Nondimensional quantities and dimensional analysis are widely used in fluid mechanics, leading to satisfying similarity solutions and increased interpretability. Following nondimensionalization, all input and output vectors are min–max normalized component-wise using training-set stats before use with the neural networks. Details on this normalization and other implementation details related to training and initialization are given in Appendix Section A.3. All reported errors are in the fully-dimensional units of the state, with the exception of training curves which correspond to fully-normalized quantities; see Section A.4 of the Appendix for error-metric definitions. Without normalization, the loss and loss-gradients may be biased toward the state quantities with the largest unit-scale. Normalization eases this problem and ensures that smaller output quantities are not ignored during training.
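As a small illustration, component-wise min–max normalization with training-set statistics might look as follows; the $ \left[0,1\right] $ target range is an assumption, since the exact convention lives in Appendix Section A.3 rather than in this excerpt.

```python
import numpy as np

def minmax_fit(train_array):
    """Per-component minima/maxima from training data only; train_array: (n_pts, n_feat)."""
    return train_array.min(axis=0), train_array.max(axis=0)

def minmax_apply(array, lo, hi):
    """Scale each component to [0, 1] using the training-set statistics."""
    return (array - lo) / (hi - lo)

# Fit on training inputs/outputs, then apply the same scaling to validation data, e.g.:
# x_lo, x_hi = minmax_fit(x_train); x_train_n = minmax_apply(x_train, x_lo, x_hi)
```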
3.1. Model architecture and training options
In this section, 5-hidden-layer networks are used for both main and hypernetworks, each with a hidden dimension of 50 for all layers. Model architecture summaries are given in Appendix Section A.2, with the number of trainable weights for Section 3 given in Table 2. As noted in Section 2.4.1, the DVH model has roughly $ {H}_L=50 $ times as many trainable weights as the corresponding DV-MLP model, and the computational consequences of this are explored and quantified here through profiling. Three model-method combinations are considered and include DV-MLP with its sole fully-mixed training mode, DVH method 1 (M1) fully mixed batches, and DVH method 2 (M2) batch-by-case. See Section 2.4.2 for greater detail. The average step time, average compute time, and maximum memory usage during training among the models and methods are reported in Table 3, where the spatial batch size is 54,000 points for DV-MLP and DVH M1 and a corresponding case batch size of 1 for DVH M2. All profiled results were obtained using tensorboard callbacks. Three different levels of precision are compared for each method, including double precision (float64), single precision (float32), and mixed precision (combination of float32 and float16). The reported values are averaged over the first 100 minibatches for 100 epochs of training, and the $ \Delta t $ step standard deviation is given in parentheses.
There are several important takeaways from these profiled results. First, the precision has a large impact on the step-times and memory requirements, which decrease massively between float64 and float32 in all instances. Subsequent smaller improvements are seen moving between float32 and mixed precision. Next, M2 batch-by-case training decreases the step time and memory requirements by roughly an order of magnitude as compared to M1 fully mixed for a given precision. Further, the DVH M2 step times are comparable to the DV-MLP step times and consume less memory despite the much greater parameter count. Two average times are given, $ \Delta t $ step, which includes all operations in a single optimizer update, and $ \Delta t $ compute, which includes only GPU operations per update. Rows marked with an asterisk (*) in Table 3 may benefit from improvements in the data pipeline, as the overhead associated with those operations becomes appreciable as the $ \Delta t $ step times decrease.
When comparing the accuracy of the resulting fully trained models, it was found that using single precision generally improved the mean-relative-L2 error (MRL2E) by roughly 1–8 percentage points as compared to mixed precision. The use of double precision negligibly improved the predictive performance beyond that seen with single precision, and in some cases, single precision performed best overall. Thus, given the greatly improved step times from double to single precision, and the smaller improvements afforded by using mixed precision, all models in later sections are trained using single precision unless otherwise stated.
A piecewise learning-rate schedule was used for all experiments, consisting of periods of constant and exponentially decaying learning rates. This was devised to combat large variation around training-loss plateaus seen in some instances and stabilize DVH M1 training dynamics which showed large training-loss fluctuations. Additionally, several influential, state-of-the-art works in machine learning have used a similar scheme whereby the learning rate is manually decreased by a factor of 10 when plateaus in the training loss are observed, a process repeated up to three times (He et al., Reference He, Zhang, Ren and Sun2016; Krizhevsky et al., Reference Krizhevsky, Sutskever and Hinton2012; Simonyan and Zisserman, Reference Simonyan and Zisserman2014). This aided model convergence, and it is now commonplace to start training with a large initial learning rate which is decreased during a subsequent annealing period. The large initial learning rate is thought to aid generalization (Y. Li et al., Reference Li, Wei and Ma2019), while it has been shown that the learning rate is proportional to the variance of training-loss fluctuations around a local minimum (Murata et al., Reference Murata, Müller, Ziehe, Amari, Mozer, Jordan and Petsche1996), providing a greater rationale for an annealing period. To define the learning rate at optimizer step $ i $ , written as $ {\alpha}^{(i)} $ , denote the number of optimizer steps with constant learning rate as $ {s}_c $ , the decay rate $ r $ , and decay steps as $ {s}_d $ , then
$$ {\alpha}^{(i)}=\left\{\begin{array}{ll}{\alpha}_0, & i\le {s}_c,\\ {}{\alpha}_0\hskip0.2em {r}^{\left(i-{s}_c\right)/{s}_d}, & i>{s}_c,\end{array}\right. $$
defines the learning rate schedule, where $ {\alpha}_0 $ is the initial learning rate. The interpretation is that the learning rate will decrease by a factor of $ r $ every $ {s}_d $ steps when the exponential term is active. To convert between optimizer steps and dataset iterations or epochs, use
$$ i={n}_{\mathrm{epochs}}\times {n}_{\mathrm{cases}}\times {n}_{\mathrm{batches}\;\mathrm{per}\;\mathrm{case}}. $$
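A minimal sketch of this schedule and the step/epoch conversion is given below; variable names mirror the symbols above, and the exponential form follows the stated interpretation (a decrease by a factor of $ r $ every $ {s}_d $ steps once the constant period ends).

```python
def piecewise_lr(i, alpha0, s_c, r, s_d):
    """Learning rate at optimizer step i: constant for s_c steps, then exponential decay."""
    if i <= s_c:
        return alpha0
    return alpha0 * r ** ((i - s_c) / s_d)

def steps_from_epochs(n_epochs, n_cases, n_batches_per_case):
    """Convert dataset iterations (epochs) into optimizer steps."""
    return n_epochs * n_cases * n_batches_per_case
```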
It is commonplace to use train-validation-test splits when training and evaluating neural networks. The training set is used to perform optimizer updates, and the validation set is used to assess generalization and/or over-fitting during training, while the test set is completely unobserved until after training is complete. In this work, due to a limited number of available solutions, only training and validation groups are used, without cross-validation or hyperparameter tuning.
3.2. Single vehicle speed
Training information is given in Table 4 where $ {s}_t $ is the total number of training steps, and the values shown in parentheses are the corresponding number of epochs. Note that DVH M1 uses $ {\alpha}_0=5\times {10}^{-5} $ instead of $ {\alpha}_0=1\times {10}^{-3} $ as reported in the table due to large instabilities which frequently resulted in training-loss divergence with higher learning rates.
DV-MLP and DVH models are trained using the Adam optimizer (Kingma and Ba, Reference Kingma and Ba2014) with default options and single precision. In the case of DVH, both training methods described in Section 2.4.2 were employed, where DVH M1/M2 correspond to methods 1 and 2, respectively. Table 5 presents error metrics, revealing that DVH M2 performs best across the board, with both DVH methods yielding similar errors that are lower than those of DV-MLP. DVH M2 slightly outperforms DVH M1 while requiring significantly less training time, approximately 4.4 ms per optimizer step versus 40.5 ms, as shown in Table 3.
Each model’s ability to infill or predict a training-group case at truncated spatial locations may be evaluated since training cases had points randomly dropped out or truncated during minibatch construction. Error metrics were calculated separately for the truncated and retained locations, with the case-wise mean-absolute error (MAE) for truncated locations plotted against that for retained locations in Figure 6 for DV-MLP and DVH M2. The dashed line is the identity line $ y(x)=x $ , meaning points which lie above the dashed line correspond to truncated-point errors which are larger than those for the retained points, and the opposite for points below the line. The points are generally located near the line, some above and some below, indicating that the performance is very similar between the retained and truncated groups. This is quantified in Table 6 where the mean error metrics for retained and truncated points are given separately, while the training errors reported in Table 5 include all points for the instance. This shows that the error metrics are nearly identical between the two groups for all models, demonstrating that the models are effective in this in-filling scenario.
Pressure-field predictions near a validation-group vehicle are shown in Figure 7 for DV-MLP and DVH M2, with predictions for all field variables over the full domain shown in the Appendix, Section A.7. The error contour colorbars are limited to $ \pm 4\times \mathrm{RMSE} $ centered on the average case error in order to see more fine-grained detail. Infrequent, comparably large errors obscure these details in many instances by skewing the colorbar, and points where the error is outside of this band are left white. The pressure field predictions match the ground truth well, with the main features captured, including the high-pressure region in front, the low-pressure region due to flow acceleration on the roof, and the slowly recovering pressure in the wake seen in the full-domain plots. Some distortion of the contour lines is present in the predictions. The largest errors are seen near the vehicle surface, near the ground in front, and at locations along the free-shear layer and in the wake. The accuracy of the predicted pressure drag coefficients is reported in Section 4.2.
3.3. Multiple speeds and generalization: low-data regime
The goal in situations such as surrogate-based design optimization is to obtain an accurate and generalizable surrogate using the least amount of training data and resources. The effect of the training dataset size on the generalization and convergence properties is explored by varying the number of available training cases. Two vehicle speeds of 90 and 130 kph are used, giving a total of 248 available solution instances. DV-MLP and DVH are compared while the number of training cases is varied from 5 to 199 instances, corresponding to training fractions ranging from 0.02 to 0.8. Training method 2 (batch-by-case) is used to train DVH models given that similar or better accuracy may be obtained as compared to fully-mixed training in a fraction of the time, as demonstrated for a single vehicle speed in Section 3.2. The dataset and training options are given in Table 7. The number of epochs remains fixed, resulting in varying learning rate schedule entries as the number of training cases is adjusted. Given a training fraction $ {f}_{\mathrm{train}} $ , the number of training cases is chosen to be $ \mathrm{ceil}\left({f}_{\mathrm{train}}{n}_{\mathrm{cases}}\right) $ , then the number of steps is given as
$$ {s}_t={n}_{\mathrm{epochs}}\times \mathrm{ceil}\left({f}_{\mathrm{train}}{n}_{\mathrm{cases}}\right)\times {n}_{\mathrm{batches}\;\mathrm{per}\;\mathrm{case}}. $$
The number of epochs is given in parentheses in Table 7, and $ {n}_{\mathrm{batches}\;\mathrm{per}\;\mathrm{case}}=2 $ .
In the low-data regime, it is evident that neither model demonstrates strong generalization capabilities, leading to a substantial disparity between training and validation losses. The effect is more pronounced for DV-MLP, as illustrated by the training curves in Figure 8. In the figure, the solid lines represent the training loss, the dashed lines the validation loss, and darker lines correspond to a greater number of training cases. It is worth noting that the training loss reaching a plateau at a higher value with less data can be attributed, in part, to the fact that the number of epochs, rather than the number of optimizer updates, is held fixed. This should be kept in mind while interpreting the results.
During training the best weights based on the validation loss are saved along with the final weights. Figure 9a shows the MRL2E versus the number of training cases using the final weights for each flow quantity. In the very low-data regime, a large gap between training and validation losses is seen for both models, with DV-MLP exhibiting poorer performance. As the number of cases is increased, this gap is closed more quickly for DVH than DV-MLP. When 199 cases are used, the models perform similarly, with a slight advantage observed for DVH. Similar plots are generated using the validation best weights, as shown in Figure 9b. In this scenario, there is a smaller gap between the training and validation losses for each model. However, a persistent gap between DVH and DV-MLP remains across all training fractions considered, most notable for the $ y $ -velocity $ v $ .
Dimensional error metrics computed using the validation-best weights with 199 training instances, corresponding to the rightmost point in Figure 9b, are reported for both DV-MLP and DVH in Table 8. This shows that DVH performs best across the board, as expected. The RMSEs with two vehicle speeds are larger than those reported for a single speed in Table 5, and this is due in part to the 130 kph solutions having higher pressures and velocities than at 90 kph. To examine this further, the nondimensional and dimensional error metrics are broken out by vehicle speed instead of the training group for DVH in Table 9. This reveals that the nondimensional errors are similar for the two vehicle speeds, with the 130 kph errors slightly larger. The dimensional pressure errors for 90 kph lie between the training and validation errors for DVH M2 at a single speed, while the velocity component errors are slightly larger.
DVH pressure field predictions for a single vehicle shape at both 90 and 130 kph are shown in Figure 10, where the 90 kph case is from the training set while the 130 kph case is from the validation set. Good agreement between ground truth and prediction is seen at both speeds. Additional plots for velocities $ u $ and $ v $ are shown in the Appendix Section A.7.2 with similar good agreement.
4. Numerical experiments II: effect of random Fourier features
Spectral bias is a noted difficulty in training neural networks, where low-frequency signal content is learned more quickly and readily than high-frequency content. Consequently, the networks exhibit a bias toward low-frequency signal content (Rahaman et al., Reference Rahaman, Baratin, Arpit, Draxler, Lin, Hamprecht, Bengio and Courville2019). This issue may be addressed through the use of random Fourier features (Tancik et al., Reference Tancik, Srinivasan, Mildenhall, Fridovich-Keil, Raghavan, Singhal, Ramamoorthi, Barron and Ng2020) or positional encoding techniques (Mildenhall et al., Reference Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi and Ng2021). With these methods, the input coordinates are processed by sinusoidal terms of varying frequency before being fed into the MLP. Given network inputs $ \mathbf{x} $ , a Fourier-feature mapping is written as
$$ \gamma \left(\mathbf{x}\right)={\left[{a}_1\cos \left(2\pi {\mathbf{b}}_1^T\mathbf{x}\right),{a}_1\sin \left(2\pi {\mathbf{b}}_1^T\mathbf{x}\right),\dots, {a}_m\cos \left(2\pi {\mathbf{b}}_m^T\mathbf{x}\right),{a}_m\sin \left(2\pi {\mathbf{b}}_m^T\mathbf{x}\right)\right]}^T, $$
where coefficients $ {a}_i $ , frequency-vectors $ {\mathbf{b}}_i $ , and their number $ m $ are parameters of the mapping. It has been shown that the simple strategy of setting $ {a}_i=1 $ and drawing each $ {\mathbf{b}}_i $ randomly from a sampling distribution is effective in practice, with the width $ \sigma $ of the sampling distribution having much greater importance than the distribution shape (Tancik et al., Reference Tancik, Srinivasan, Mildenhall, Fridovich-Keil, Raghavan, Singhal, Ramamoorthi, Barron and Ng2020). This width has a major impact on the convergence and generalization properties of the network, with underfitting observed for $ \sigma $ “too small” and overfitting observed for $ \sigma $ “too large.” Positional-encoding strategies are generally a special case of Fourier features (Rahimi and Recht, Reference Rahimi, Recht, Platt, Koller, Singer and Roweis2007) where the frequencies are specified as a geometric progression and applied along each input dimension $ {x}_i $ separately (Mildenhall et al., Reference Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi and Ng2021; Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Guyon, Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett2017). This may be written as
$$ \gamma \left({x}_i\right)=\left[\sin \left({2}^0\pi {x}_i\right),\cos \left({2}^0\pi {x}_i\right),\dots, \sin \left({2}^{m-1}\pi {x}_i\right),\cos \left({2}^{m-1}\pi {x}_i\right)\right], $$
where $ m $ may be set independently along each input axis. Another similar alternative includes the specific weight-initialization of sin-activated SIREN networks (Sitzmann, Martel, et al., Reference Sitzmann, Martel, Bergman, Lindell and Wetzstein2020). A downside to random Fourier features is that the width $ \sigma $ and the number of features $ m $ must generally be found by trial and error or hyperparameter tuning. In all experiments, a zero-mean truncated isotropic Gaussian sampling distribution is used, $ {\mathbf{b}}_i\sim \mathcal{N}\left(\mathbf{0},\sigma \mathbf{I}\right) $ , with $ \sigma =3 $ for all results here. Samples greater than 2 $ \sigma $ from the mean are discarded. The number of features is selected to be $ m=256 $ , the same number of features chosen in Tancik et al., Reference Tancik, Srinivasan, Mildenhall, Fridovich-Keil, Raghavan, Singhal, Ramamoorthi, Barron and Ng2020, despite the smaller hidden dimension of $ H=50 $ used in experiments here.
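The NumPy sketch below draws the frequency matrix from a truncated isotropic Gaussian and applies the mapping $ \gamma $ . The per-component rejection step is one reading of “samples greater than 2 $ \sigma $ from the mean are discarded,” and the implementation details are illustrative assumptions.

```python
import numpy as np

def sample_frequencies(m, n_in, sigma, rng=None):
    """Draw m frequency vectors b_i ~ N(0, sigma*I), redrawing entries beyond 2*sigma."""
    rng = rng or np.random.default_rng(0)
    B = rng.normal(0.0, sigma, size=(m, n_in))
    mask = np.abs(B) > 2.0 * sigma
    while mask.any():
        B[mask] = rng.normal(0.0, sigma, size=mask.sum())
        mask = np.abs(B) > 2.0 * sigma
    return B  # fixed after sampling; not trainable

def fourier_features(x, B):
    """gamma(x) with a_i = 1: x is (N, n_in), B is (m, n_in); returns (N, 2m)."""
    proj = 2.0 * np.pi * x @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)
```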
4.1. Model architectures and training options
The options and architectures of Section 3.1 are used again here, but with the insertion of a random Fourier-feature layer after the input layer of the main network. Note that the Fourier layer does not contain any trainable parameters; the sampled $ {\mathbf{b}}_i $ are fixed. Applying this to DVH is straightforward, and the same set of random Fourier features is used for all main networks; however, Fourier features are not used in the hypernetwork. For DV-MLP, Fourier features are applied only to the spatial inputs $ {\mathbf{x}}^{\prime } $ . The resulting output $ \gamma \left({\mathbf{x}}^{\prime}\right) $ is then concatenated with $ \boldsymbol{\mu} $ and passed into the MLP; naively feeding all inputs through the Fourier features resulted in poorly performing models (see Section A.8 of the Appendix).
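As an illustration of this wiring, a hypothetical Keras functional-API fragment is sketched below, reusing the make_fourier_layer sketch above: Fourier features are applied only to $ {\mathbf{x}}^{\prime } $ and concatenated with $ \boldsymbol{\mu} $ before the dense layers. The depth, activation, and layer names are placeholders rather than the architecture of Table 16.

```python
from tensorflow import keras

def build_dvmlp_ff(n_xprime, n_mu, n_q, H=50, n_hidden=6, m=256, sigma=3.0):
    """Hypothetical DV-MLP main network with a random Fourier layer on x' only."""
    gamma = make_fourier_layer(n_xprime, m=m, sigma=sigma)  # from the sketch above

    x_in = keras.Input(shape=(n_xprime,), name="x_prime")   # spatially varying inputs
    mu_in = keras.Input(shape=(n_mu,), name="mu")            # design variables

    h = keras.layers.Lambda(gamma, name="fourier_features")(x_in)   # (batch, 2m)
    h = keras.layers.Concatenate(name="concat_mu")([h, mu_in])
    for _ in range(n_hidden):
        h = keras.layers.Dense(H, activation="swish")(h)     # activation is illustrative
    q_out = keras.layers.Dense(n_q, activation=None, name="state")(h)

    return keras.Model(inputs=[x_in, mu_in], outputs=q_out)
```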
Insertion of a random Fourier layer with $ m=256 $ results in models with more weights than a model without one, because $ \dim \left(\gamma \left({\mathbf{x}}^{\prime}\right)\right)=2m\gg \dim \left({\mathbf{x}}^{\prime}\right) $ , leading to many more parameters in the first layer of the MLP. Models are profiled with the Fourier-feature layer added, as shown in Table 10, and the number of trainable parameters is given in Table 11 for comparison against Table 2. Double precision is not considered here due to its much greater training time. Additionally, only DVH training method 2 (batch-by-case) is considered due to its lower training time and resource requirements.
4.2. Single vehicle speed
DV-MLP and DVH using random Fourier features are first trained on a single vehicle speed using the same dataset and training options as given in Table 4. Error metrics are reported in Table 12, with the corresponding percentage improvement in RMSE relative to the models without Fourier features indicated in parentheses. With the use of Fourier features, DVH and DV-MLP exhibit similar performance, and DV-MLP is now best for several entries. For DV-MLP, this is a rather large improvement: the errors are roughly halved for a 35% increase in training step time (from 3.7 ms to 5.0 ms) when using single precision. DVH shows large, though less dramatic, improvements, with the training step time increased by 23% (from 4.4 to 5.4 ms) when using single precision.
The pointwise absolute-error probability distributions with and without Fourier features are visualized using kernel density estimates, computed using the FFTKDE function of the Python library KDEpy (Odland, Reference Odland2018) with the Silverman method for kernel bandwidth selection. Other implementations were found to be very slow by comparison due to the large number of points: $ \sim $ 11.1 million training and $ \sim $ 2.8 million validation mesh points per vehicle speed. These are shown in Figure 11(a–c) for DV-MLP and Figure 11(d–f) for DVH. In all instances, the absolute-error distributions are narrowed when using Fourier features, meaning smaller errors are more prevalent.
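A sketch of how such an error density can be computed with KDEpy is given below; the variable names are placeholders and the pointwise absolute errors are assumed to be precomputed as a flat array.

```python
import numpy as np
from KDEpy import FFTKDE

def error_density(abs_err, n_grid=2**12):
    """Kernel density estimate of a pointwise absolute-error distribution,
    using FFTKDE with Silverman bandwidth selection."""
    x, density = FFTKDE(kernel="gaussian", bw="silverman").fit(abs_err).evaluate(n_grid)
    return x, density

# Example (illustrative arrays): abs_err_p = np.abs(p_pred - p_true).ravel()
# x, pdf = error_density(abs_err_p)
```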
The MRL2E in computing the pressure drag coefficient is shown in Table 13 for both the Fourier models of this section and the non-Fourier models of Section 3.2. DVH M2 FF shows the smallest overall percent error over both the training and validation groups. In general, the Fourier models perform better than the non-Fourier models, with the lone exception of the DVH M2 training group errors.
4.3. Multiple speeds and generalization: low-data regime
The impact of Fourier features on generalization in the low-data regime with multiple vehicle speeds is studied here in an analogous fashion to Section 3.3. The dataset and training options of Table 7 apply. Figure 12a shows the training and validation MRL2E for each output quantity as the number of training instances is varied. As before, in the very-low-data regime, there is a large gap between training and validation losses, with DV-MLP showing a greater disparity. However, as the number of training instances is increased to around 40, the difference between DV-MLP and DVH is greatly reduced, and thereafter their performance is very similar. Figure 12b shows the corresponding plots using the best weights per validation loss. Without Fourier features there was a persistent gap between DVH and DV-MLP, but when Fourier features are used this gap is largely eliminated.
Dimensional error metrics computed using the validation-best weights with 199 training instances, corresponding to the rightmost point in Figure 12b, are reported for both DV-MLP and DVH in Table 14. The percentage improvement in RMSE compared to a model without Fourier features is given in parentheses. As with a single vehicle speed, substantial improvements are seen for both DV-MLP and DVH, with the effect larger for DV-MLP. As in the single-speed scenario, DV-MLP now performs best for several entries.
DVH predictions and errors of the $ x $ -velocity field at both speeds are shown in Figure 13, where neither speed is included in the training dataset. The predicted fields match the ground truth well and capture the dominant flow features including the small recirculating regions in front of the vehicle, acceleration and flow turning over the roof, and a decaying free-shear layer in the wake. Similar plots for the pressure and $ y $ -velocity predictions are given in the Appendix, Section A.9.
For further comparison, vertical line probes are placed near the vehicle, with one in front, one through the vehicle's highest point, and two in the wake. The probe in front of the vehicle is offset by 1 m, while those in the wake are offset by 1 and 3 m. The ground truth and DVH predictions with and without Fourier features are interpolated from mesh points to the line-probe locations using the griddata function from the scipy.interpolate library, with the results shown in Figure 14. Generally, the line-probe predictions match the ground truth well, though some oscillation is present in the predictions. This is most prevalent for the pressure probes in the vehicle wake, though the effect appears exaggerated by the narrow x-axis limits.
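The probe interpolation can be reproduced with a few lines of scipy; the probe coordinates and array names below are illustrative, following the offsets described above.

```python
import numpy as np
from scipy.interpolate import griddata

def probe_field(mesh_xy, field, x_probe, y_min, y_max, n=200):
    """Interpolate an unstructured 2D field onto a vertical line probe at x = x_probe."""
    y_probe = np.linspace(y_min, y_max, n)
    pts = np.column_stack([np.full(n, x_probe), y_probe])
    return y_probe, griddata(mesh_xy, field, pts, method="linear")

# Example (illustrative coordinates): a probe 1 m behind the vehicle base at x_base
# y, p_line = probe_field(mesh_xy, p_pred, x_probe=x_base + 1.0, y_min=0.0, y_max=3.0)
```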
Pointwise absolute-error probability distributions for DV-MLP and DVH using Fourier features are visualized in Figure 15 using kernel density estimates, again computed using the FFTKDE function of the Python library KDEpy (Odland, Reference Odland2018) with the Silverman method for kernel bandwidth selection. The distribution shapes are similar for both network types, showing a peak followed by a gradual trailing off as the errors increase.
The MRL2E in predicting the pressure drag coefficient using the Fourier models of this section and the non-Fourier models of Section 3.3 are shown in Table 15. In general, the Fourier models slightly outperform the non-Fourier models, with the lone exception being the validation set using DVH without Fourier features.
5. Summary and conclusions
The past few years have witnessed significant activity in the use of neural networks to develop surrogate representations of physical fields (e.g. Bhatnagar et al., Reference Bhatnagar, Afshar, Pan, Duraisamy and Kaushik2019; Guo et al., Reference Guo, Li and Iorio2016; Xu and Duraisamy, Reference Xu and Duraisamy2020) on a given discretized mesh. The present work develops surrogate models of the solution as a continuous field, allowing learning and prediction on unseen meshes, mesh topologies, and geometries with arbitrary discretization. This is achieved through the use of coordinate-based neural networks, which map between low-dimensional spaces, inspired by a line of research involving scene and object representation for rendering tasks. Two methods of global conditioning were compared, namely concatenation-based conditioning as in DV-MLP, and conditioning the network weights through a hypernetwork as with DVH. Input features include spatially varying quantities, collected in $ {\mathbf{x}}^{\prime } $ , and non-spatially-varying design variables, collected in $ \boldsymbol{\mu} $ . The spatial coordinates $ \mathbf{x} $ are paired with an MDF evaluation $ \phi \left(\mathbf{x};\boldsymbol{\mu} \right) $ to implicitly encode the geometry, acting as a form of local conditioning.
The utilization of batch-by-case training for DVH models significantly reduced the computational cost associated with training. This approach decreased step times and memory consumption by approximately an order of magnitude compared to fully-mixed training at a given backend precision policy. This training method enabled DVH models with significantly more parameters to be trained in a comparable amount of time as DV-MLP models. The effectiveness of DVH and DV-MLP was evaluated in the context of a challenging vehicle aerodynamics problem, utilizing realistic vehicle shapes with solutions lying on unstructured meshes of variable topology. Baseline results indicated that DVH consistently outperformed DV-MLP by a few percentage points, while also demonstrating superior convergence and generalization properties in the low-data regime.
By incorporating a random Fourier-feature layer to process the spatially varying inputs $ {\mathbf{x}}^{\prime } $ , the RMSEs were reduced by approximately 40–60%, with greater improvements seen for DV-MLP than for DVH. In this scenario, DV-MLP results were improved so substantially that the performance gap between DVH and DV-MLP was essentially closed, including the convergence and generalization properties in the low-data regime. Both DVH and DV-MLP exhibited strong generalization capabilities when multiple vehicle speeds were considered, with minimal disparities between training and validation errors. The pressure field errors were near 2%, $ x $ -velocity errors around 1%, and $ y $ -velocity errors around 6–7%. The pressure drag coefficients were also very well predicted, with the best models having errors of 1–2%. The main features of the flow fields were well captured, with the largest errors often clustering in regions close to the vehicle, in the fine details of the grill, and along the free-shear layer of the wake. Line probes showed some oscillation in the network predictions compared to the ground truth, consistent with discrepancies seen in the contour levels. Several researchers have noted the effectiveness of positional-encoding techniques in a variety of problem scenarios (Mildenhall et al., Reference Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi and Ng2021; Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Guyon, Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett2017; Zhong et al., Reference Zhong, Bepler, Davis and Berger2019). The success of these techniques may be explained using Neural Tangent Kernel theory (Jacot et al., Reference Jacot, Gabriel and Hongler2018), where the use of random Fourier features results in a stationary, shift-invariant kernel with a tunable bandwidth controlled by the sampled frequency vectors $ {\mathbf{b}}_i $ (Tancik et al., Reference Tancik, Srinivasan, Mildenhall, Fridovich-Keil, Raghavan, Singhal, Ramamoorthi, Barron and Ng2020). However, this result is in the context of dense, nearly uniform sampling of input coordinates without an additional signed-distance input, and further study is warranted in the current scenario with unstructured, non-uniform meshes.
The results suggest that both methods can be accurate and effective alternatives to CNNs for surrogate modeling of PDE solution fields over complex geometries and arbitrary mesh topologies. It is again emphasized that CNNs typically require a fixed grid topology and have a large memory footprint for 3D problems since the entire grid is an input. In contrast, coordinate-based networks take pointwise information and design variables as inputs, allowing model size and memory requirements to be decoupled from the solution field degrees-of-freedom.
Acknowledgements
Special thanks to Dr. Shaowu Pan of Rensselaer Polytechnic Institute, who helped guide the early stages of this research and contributed to important discussions relating to memory requirements for CNNs.
Data availability statement
Code for model implementations may be found at https://github.com/jamesduv/DISMv2. The vehicle aerodynamics dataset is proprietary to General Motors and cannot be made public.
Author contribution
J.D.: Conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing—original draft. K.D.: Conceptualization, funding acquisition, methodology, project administration, resources, supervision, writing—review & editing.
Funding statement
This work is funded by General Motors, Inc. under a contract titled “Deep Learning and Reduced Order Modeling for Automotive Aerodynamics,” and by Advanced Research Projects Agency-Energy (ARPA-E) DIFFERENTIATE program under the project “Multi-source Learning-accelerated Design of High-efficiency Multi-stage Compressor,” in collaboration with Raytheon Technologies Research Center (RTRC). Computing resources were provided in part by the NSF via grant 1531752 MRI: Acquisition of Conflux, A Novel Platform for Data-Driven Computational Physics.
Competing interest
The authors declare none.
Appendix
A.1. Network scaling: further details
Considering a main network with $ {L}_m $ hidden layers, each with hidden dimension $ H $ , the total number of weights in the main network is
where the first term corresponds to the first hidden layer, the second term to all remaining hidden layers, and the final term to the output layer. With low-dimensional input and output spaces, typically $ {n}_{x^{\prime }},{n}_{\mu },{n}_q\ll H $ , so the quadratic $ {H}^2 $ terms dominate, leading to the simplifying approximations
Next consider a hypernetwork with $ {L}_h $ hidden layers, where the first $ {L}_h-1 $ layers also have a hidden dimension of $ H $ , while the final hidden layer has dimension $ {H}_L $ . The total number of weights in the hypernetwork is
where the terms are again in forward-propagation order. Retaining quadratic terms leads to the approximation
The third term in this expression typically dominates, corresponding to the hypernetwork output layer, revealing the important scaling consideration: a dense hypernetwork model will have roughly $ {H}_L $ times as many trainable weights as a similar DV-MLP model. Writing this proportionality and substituting Equation A.2 gives the final result:
corresponding to Equation 2.14 of the main text.
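The scaling argument can be checked numerically with a short parameter-count routine. The sketch below assumes fully connected layers with biases and treats the main-network input as the concatenation of $ {\mathbf{x}}^{\prime } $ and $ \boldsymbol{\mu} $ (for DVH, $ \boldsymbol{\mu} $ would instead enter the hypernetwork); it illustrates the bookkeeping rather than reproducing the exact expressions above.

```python
def n_weights_main(n_xprime, n_mu, n_q, H, L_m):
    """Trainable parameters of a dense main network with L_m hidden layers of width H."""
    first = (n_xprime + n_mu + 1) * H      # inputs (plus bias) -> first hidden layer
    hidden = (L_m - 1) * (H + 1) * H       # remaining hidden layers
    out = (H + 1) * n_q                    # output layer
    return first + hidden + out            # ~ L_m * H**2 when inputs/outputs are small

def n_weights_hyper(n_mu, H, L_h, H_L, n_main):
    """Trainable parameters of a hypernetwork emitting all n_main main-network weights."""
    first = (n_mu + 1) * H                              # design variables -> first hidden layer
    hidden = (L_h - 2) * (H + 1) * H + (H + 1) * H_L    # interior layers, last of width H_L
    out = (H_L + 1) * n_main                            # output layer: dominant, ~ H_L * n_main
    return first + hidden + out
```

For representative sizes, the ratio n_weights_hyper / n_weights_main is close to $ {H}_L $ , consistent with the proportionality above.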
A.2. Model architecture summaries
A summary of the model architectures used is given in Table 16, with details on model inputs and outputs summarized in Table 17.
In Section 3, the architectures are exactly as described in Table 16. When random Fourier features are applied in Section 4, the only difference is that spatial inputs $ {\mathbf{x}}^{\prime } $ are fed through a random Fourier layer before entering the main network. The number of weights in each scenario are given in the respective sections.
A.3. Neural network implementation details
Model implementation and training. All models are implemented via Python 3.X classes using Tensorflow v2.X (Abadi et al., Reference Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, Ghemawat, Goodfellow, Harp, Irving, Isard, Jia, Jozefowicz, Kaiser, Kudlur and Zheng2015) and are trained using Nvidia RTX A6000 (48 GB) GPUs. The DV-MLP implementation uses off-the-shelf Tensorflow-Keras sequential models, constructed using the Keras functional API. DVH models subclass the Tensorflow-Keras Model class (tf.keras.Model), overwriting the call method as required, depending on the batching method used. The DVH implementation uses a Tensorflow-Keras sequential model for the hypernetwork, required during class instantiation. Tensorflow-Keras models are useful as they provide high-level abstraction and contain easy-to-use functions for common tasks, such as training models and saving/loading network weights. All model weights are initialized using the Glorot-uniform weight-initialization scheme (Glorot and Bengio, Reference Glorot, Bengio, Teh and Titterington2010) and trained using calls to tf.keras.Model.fit() or custom training loops. The Adam optimizer with default settings is used for all numerical experiments. Mixed or single precision is used in training all models, and model checkpoints are used to save the model weights corresponding to the best training and validation losses obtained as training progresses. A simple mean-squared-error (MSE) loss function is used, written for a single minibatch as
where $ {N}_{tot} $ is the total number of mesh points in the minibatch.
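A minimal custom training step consistent with this loss might look as follows; this is a sketch only, omitting the batch-by-case logic, mixed-precision loss scaling, and checkpointing described above.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()  # default settings

@tf.function
def train_step(model, inputs, q_true):
    """One optimizer update on a minibatch of mesh points using an MSE loss."""
    with tf.GradientTape() as tape:
        q_pred = model(inputs, training=True)
        loss = tf.reduce_mean(tf.square(q_pred - q_true))  # mean over points and components
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```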
Normalization. All inputs and outputs are min-max normalized using the statistics of the training group, on a signal-by-signal basis, so that they lie approximately in the range $ \left[0,1\right] $ . Some members of the validation group may fall slightly outside this range if they lie below the training-set minimum or above the training-set maximum. Vectors $ \mathbf{x}/\boldsymbol{\mu} /\mathbf{q} $ are normalized component-wise. The formula for computing the normalization is
$$ {\overline{r}}^j=\frac{r^j-\min \left(\mathbf{r}\right)}{\max \left(\mathbf{r}\right)-\min \left(\mathbf{r}\right)}, $$
where $ r $ is an element of $ \mathbf{x} $ , $ \mathbf{q} $ , or $ \boldsymbol{\mu} $ from either the training or validation group. Vector $ \mathbf{r} $ is the collection of all instances of $ r $ from the training dataset, $ {r}^j $ is dimensional, and $ {\overline{r}}^j $ is the normalized quantity. The predictions are fully dimensionalized for computing errors and plotting by rearranging Equation A.8 for $ {r}^j $ . If a nondimensional input or output is used, the nondimensionalization must also be undone to achieve a fully dimensional result.
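The corresponding normalization and denormalization are sketched below, with per-signal minima and maxima computed from the training group; function and variable names are illustrative.

```python
import numpy as np

def fit_minmax(r_train):
    """Column-wise (per-signal) minima and maxima from the training group."""
    return r_train.min(axis=0), r_train.max(axis=0)

def normalize(r, r_min, r_max):
    """Map dimensional values to approximately [0, 1]; validation data may fall slightly outside."""
    return (r - r_min) / (r_max - r_min)

def denormalize(r_bar, r_min, r_max):
    """Invert the min-max normalization to recover dimensional values."""
    return r_bar * (r_max - r_min) + r_min
```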
A.4. Error metrics
The error metrics of root-mean-squared error (RMSE) and mean-absolute error (MAE) are computed by averaging across all $ {n}_j $ mesh points in each case, and then averaging across all $ {n}_c $ cases. The RMSE for the $ k $ th component of the state is then defined as
$$ {\mathrm{RMSE}}_k=\frac{1}{n_c}\sum \limits_{j=1}^{n_c}\sqrt{\frac{1}{n_j}\sum \limits_{i=1}^{n_j}{\left({\hat{q}}_{i,k}^j-{q}_{i,k}^j\right)}^2}. $$
The mean-absolute error (MAE) is computed analogously as
$$ {\mathrm{MAE}}_k=\frac{1}{n_c}\sum \limits_{j=1}^{n_c}\frac{1}{n_j}\sum \limits_{i=1}^{n_j}\left|{\hat{q}}_{i,k}^j-{q}_{i,k}^j\right|. $$
Both RMSE and MAE have units consistent with the predicted quantities, making them more intuitive than MSE alone. They provide similar measures, though the RMSE penalizes larger errors more than the MAE.
The mean-relative-L2-error (MRL2E) is also reported. To ease notation, gather all predictions for case $ j $ in matrix $ {\hat{\mathbf{Q}}}^j\in {\mathrm{\mathbb{R}}}^{n_j\times {n}_q} $ , and all ground truth in matrix $ {\mathbf{Q}}^j\in {\mathrm{\mathbb{R}}}^{n_j\times {n}_q} $ , so that $ {\hat{\mathbf{Q}}}^j\left[:,k\right] $ is the full snapshot for the $ k $ th component of the predicted state for the $ j $ th case. The MRL2E for the $ k $ th state component may then be expressed as
$$ {\mathrm{MRL2E}}_k=\frac{1}{n_c}\sum \limits_{j=1}^{n_c}\frac{{\left\Vert {\hat{\mathbf{Q}}}^j\left[:,k\right]-{\mathbf{Q}}^j\left[:,k\right]\right\Vert}_2}{{\left\Vert {\mathbf{Q}}^j\left[:,k\right]\right\Vert}_2}\;\left(\times 100\%\right). $$
The final multiplication by 100% is placed in parentheses as this operation is not always performed. The distinction is made clear in context by reporting the value with or without the percentage symbol.
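These definitions translate directly into code. The sketch below assumes per-case lists of prediction and ground-truth arrays, each of shape (n_j, n_q); names are illustrative.

```python
import numpy as np

def error_metrics(preds, truths, k):
    """RMSE, MAE, and MRL2E (in %) for state component k, averaged over cases."""
    rmse, mae, rl2 = [], [], []
    for q_hat, q in zip(preds, truths):          # one (n_j, n_q) array per case
        err = q_hat[:, k] - q[:, k]
        rmse.append(np.sqrt(np.mean(err**2)))    # per-case RMSE over its n_j points
        mae.append(np.mean(np.abs(err)))         # per-case MAE
        rl2.append(np.linalg.norm(err) / np.linalg.norm(q[:, k]))  # per-case relative L2
    n_c = len(preds)
    return sum(rmse) / n_c, sum(mae) / n_c, 100.0 * sum(rl2) / n_c
```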
A.5. RANS equations
The Reynolds-averaged Navier–Stokes (RANS) equations are derived by ensemble averaging the Navier–Stokes equations and substituting the Reynolds-decomposed state variables. This decomposition separates the state variables into mean (ensemble-averaged) and fluctuating components, $ q=\overline{q}+{q}^{\prime } $ , where $ q $ is a generic state variable, $ \overline{q} $ is the mean, and $ {q}^{\prime } $ is the fluctuating component. In the incompressible limit, the steady RANS equations may be written as
where the velocity vector $ \mathbf{u}=u\hat{\mathbf{i}}+v\hat{\mathbf{j}}+w\hat{\mathbf{k}} $ . Nondimensional flow quantities and inputs are used, denoted by a tilde $ \tilde{\cdot} $ . The Reynolds number $ \operatorname{Re} $ is an important nondimensional similarity parameter describing the relative importance of inertial and viscous forces in a flow, and since it has no dimensional counterpart the tilde is omitted.
Free-stream flow conditions are used to define the Reynolds number, using the vehicle length $ {L}_v $ as the length scale. The free-stream dynamic pressure is written as $ {q}_{\infty }=\frac{1}{2}\rho {\left|{\mathbf{u}}_{\infty}\right|}^2 $ , and the nondimensional forms of other relevant quantities are then given as
The nondimensional pressure of Equation A.16 is equivalent to the pressure coefficient $ {c}_p $ , and for incompressible flows the expression is further simplified by using the gauge pressure, that is, taking $ {p}_{\infty }=0 $ . This is allowed because only derivatives of pressure enter into the incompressible momentum equation, Equation A.13. If a compressible flow were considered, this equivalency would not hold, as pressure enters directly into the compressible energy equation.
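A small worked computation of these free-stream quantities is included below; it reflects the standard definitions named above (dynamic pressure, length-based Reynolds number, and pressure coefficient with gauge pressure), and the variable names and example values are illustrative.

```python
def freestream_quantities(rho, u_inf, L_v, mu_visc, p_gauge):
    """Free-stream dynamic pressure, Reynolds number, and pressure coefficient (p_inf = 0)."""
    q_inf = 0.5 * rho * u_inf**2           # free-stream dynamic pressure
    Re = rho * u_inf * L_v / mu_visc       # Reynolds number based on vehicle length
    c_p = p_gauge / q_inf                  # pressure coefficient using gauge pressure
    return q_inf, Re, c_p

# Example with illustrative values: 90 kph (25 m/s) in air
# q_inf, Re, c_p = freestream_quantities(1.2, 25.0, 4.5, 1.8e-5, p_gauge=-150.0)
```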
A.6. Effect of spatial batch size
A spatial batch size of 54,000 points is used for all results presented in the main text, but this is relatively large compared to many other existing works. Here, the effect of varying the spatial batch size on non-Fourier DVH and DV-MLP predictive performance and training time is quantified, with spatial batch sizes ranging from 512 to 108,000 points. Using a smaller spatial batch size results in a larger number of optimizer updates per epoch, so in what follows the number of epochs is scaled such that the number of optimizer updates is approximately equal to that used with a spatial batch size of 54,000; the number of required epochs is rounded up so that at least the same number of optimizer updates is performed when smaller batch sizes are considered. Additionally, both vehicle speeds are considered, the same piecewise learning rate schedule is used, and single-precision training is employed with the same model architectures as the main text.
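This epoch scaling is straightforward to compute; the sketch below assumes the number of updates per epoch is the point count divided by the spatial batch size, rounded up.

```python
import math

def scaled_epochs(base_epochs, n_points, base_batch=54_000, new_batch=512):
    """Epochs needed at a smaller spatial batch size to perform at least as many
    optimizer updates as base_epochs at the base spatial batch size."""
    base_updates = base_epochs * math.ceil(n_points / base_batch)
    updates_per_epoch = math.ceil(n_points / new_batch)
    return math.ceil(base_updates / updates_per_epoch)
```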
Figure 16 shows the variation in training RMSE for each flow quantity as the spatial batch size is varied. Larger errors are generally seen at smaller spatial batch sizes for both DVH and DV-MLP, where the second-to-rightmost points correspond to the results in the main text. This point represents the best predictive performance for DVH, while DV-MLP shows the best results at a larger spatial batch size of 108,000 points. However, the trends in the plot are noisy, as only a single replicate was trained at each condition. It is observed that DVH outperforms DV-MLP at every spatial batch size considered.
The models were also profiled using Tensorboard callbacks in an analogous fashion to Sections 3.1 and 4.1 of the main text. The optimizer step time versus spatial batch size is shown in Figure 17. DVH has larger optimizer step times for all spatial batch sizes considered, but the variation is smaller than might be anticipated given the greater relative expense of evaluating the hypernetwork as the spatial batch size decreases.
A.7. Baseline results: additional figures
A.7.1. Single vehicle speed
Full domain predictions for an unseen vehicle shape corresponding to Figure 7 are shown below.
A.7.2. Multiple vehicle speeds
Additional plots of the ground-truth and predicted $ x $ -velocity and $ y $ -velocity fields for a single vehicle shape at both speeds of 90 and 130 kph are shown below, corresponding to the same vehicle shape as Figure 10. As with the pressure field, the velocity-component predictions closely match the ground truth at both speeds, with the largest errors seen near the vehicle surface and in the free-shear layer of the wake.
A.8. Fourier features for DV-MLP
Naively applying Fourier features to all DV-MLP inputs $ {\mathbf{x}}^{\prime } $ and $ \boldsymbol{\mu} $ results in poor convergence and generalization, as shown in Figure 23a. When applied only to the spatial inputs $ {\mathbf{x}}^{\prime } $ , the models converge readily, as shown in Figure 23b. The network architecture and dataset are identical between the two models other than which inputs are processed by Fourier features.
A.9. Fourier features: Additional figures
Additional figures show DVH pressure-field and $ y $ -velocity-field predictions at speeds of 90 and 130 kph, where neither instance was included in the training set. These correspond to the same case as shown in Figure 13.