Impact Statement
High-resolution precipitation projections are fundamental for assessing the local impact of global warming and informing targeted mitigation and adaptation strategies. This work introduces a novel regional climate model (RCM) emulator that operates at the kilometre-scale with hourly temporal resolution, combining an innovative training framework with a new deep learning architecture based on graph neural networks (GNNs). Unlike existing approaches, the emulator is not specific to any single climate model and reduces biases by incorporating reanalysis and observational data during training. It demonstrates potential for spatial transferability, future climate extrapolation, and generalisability across different domains and scenarios.
1. Introduction
Every year across the world, natural catastrophes cause casualties and significant damage to properties and assets, driven by the increasing frequency of weather-related extremes linked to global warming (IPCC, 2023). Precipitation-related events (floods, droughts, landslides) have a tremendous social and economic impact and are all projected to increase (Araújo et al., Reference Araújo, Ramos, Soares, Melo, Oliveira and Trigo2022; Pei et al., Reference Pei, Qiu, Yang, Liu, Ma, Li, Cao and Wufuer2023; Collins et al., Reference Collins, Beverley, Bracegirdle, Catto, McCrystall, Dittus, Freychet, Grist, Hegerl, Holland, Holmes, Josey, Joshi, Hawkins, Lo, Lord, Mitchell, Monerie, Priestley, Scaife, Screen, Senior, Sexton, Shuckburgh, Siegert, Simpson, Stephenson, Sutton, Thompson, Wilcox and Woollings2024). Disaster risk forecasting depends heavily on the ability to correctly quantify the hazard posed by the natural phenomenon, which is not straightforward, particularly for severe precipitation. The difficulty stems largely from the evolution of convective systems, which are notoriously challenging to model due to complex land–atmosphere interactions, particularly in regions of complicated topography. Severe precipitation is thus one of the most common and disastrous extreme events, yet also one of the most difficult to model.
In this context, climate projections are fundamental for addressing the challenges posed by climate change and developing the most appropriate strategies of mitigation and adaptation. Climate projections are derived from global climate models (GCMs), which simulate the global climate under specified greenhouse gas emissions scenarios. GCMs operate at global to sub-continental scale (typically $ 50 $–$ 250 $ km), too coarse for investigating the climate change impact at the local scale, which requires a much finer resolution. Downscaling is used to bridge the gap between the two resolutions (Wilby, Reference Wilby2004; Giorgi and Gutowski, Reference Giorgi and Gutowski2015; Laflamme et al., Reference Laflamme, Linder and Pan2016). There are two types of downscaling: statistical and dynamical. Statistical downscaling works by deriving statistical relationships between observed small-scale and simulated GCM large-scale variables using analogue methods or regression analysis. GCM future projections are then statistically downscaled to estimate the future behaviour of the small-scale phenomena. Statistical downscaling is cost-effective, but it depends on the availability of high-resolution observations, and the many existing methodologies each have their own limitations (Maraun et al., Reference Maraun, Widmann and Gutiérrez2019). Dynamical downscaling (or regional climate modelling) instead uses regional climate models (RCMs) to refine the output of GCMs into much more detailed local scales (Leung et al., Reference Leung, Mearns, Giorgi and Wilby2003). RCM resolutions typically range from $ 50 $ to $ 10 $ km, down to the kilometre scale ($ \le 3 $ km), at which point they are usually referred to as convection-permitting regional climate models (CP-RCMs) (Coppola et al., Reference Coppola, Sobolowski, Pichelli, Raffaele, Ahrens, Anders, Ban, Bastin, Belda, Belusic, Caldas-Alvarez, Cardoso, Davolio, Dobler, Fernandez, Fita, Fumiere, Giorgi, Goergen, Güttler, Halenka, Heinzeller, Hodnebrog, Jacob, Kartsios, Katragkou, Kendon, Khodayar, Kunstmann, Knist, Lavín-Gullón, Lind, Lorenz, Maraun, Marelle, van Meijgaard, Milovac, Myhre, H-J, Piazza, Raffa, Raub, Rockel, Schär, Sieck, Soares, Somot, Srnec, Stocchi, Tölle, Truhetz, Vautard, de Vries and Warrach-Sagi2020). Dynamical downscaling is more physically grounded, yet computationally expensive, especially when CP-RCMs are considered (Coppola et al., Reference Coppola, Sobolowski, Pichelli, Raffaele, Ahrens, Anders, Ban, Bastin, Belda, Belusic, Caldas-Alvarez, Cardoso, Davolio, Dobler, Fernandez, Fita, Fumiere, Giorgi, Goergen, Güttler, Halenka, Heinzeller, Hodnebrog, Jacob, Kartsios, Katragkou, Kendon, Khodayar, Kunstmann, Knist, Lavín-Gullón, Lind, Lorenz, Maraun, Marelle, van Meijgaard, Milovac, Myhre, H-J, Piazza, Raffa, Raub, Rockel, Schär, Sieck, Soares, Somot, Srnec, Stocchi, Tölle, Truhetz, Vautard, de Vries and Warrach-Sagi2020; Kendon et al., Reference Kendon, Prein, Senior and Stirling2021). CP-RCMs can explicitly model convective systems, but become prohibitively expensive when long climate projections are needed or many simulations are required to estimate the uncertainty of the climate projections.
These limitations of traditional downscaling approaches have recently led to an increasing interest in research at the intersection between climate science and machine learning (ML), to explore the potential added value that data-driven techniques can bring to the field. Rampal et al. (Reference Rampal, Hobeichi, Gibson, Baño-Medina, Abramowitz, Beucler, González-Abad, Chapman, Harder and Gutiérrez2024a) provide a clear classification of the possible ML applications in climate downscaling, and we use their terminology throughout the manuscript. In this context, we define observational downscaling as the task where an ML algorithm is used to reproduce high-resolution observations. Perfect prognosis (PP) and super-resolution (SR) are the two main approaches and differ in the predictor fields used to infer the high-resolution field. The former uses large-scale reanalysis data, while the latter relies on coarse-resolution observational data. Instead, we define RCM emulation as the case where an ML algorithm is asked to reproduce the high-resolution output of an RCM, starting from either GCM or RCM simulations as predictor fields. To date, two alternative frameworks have been studied to train RCM emulators, distinguished by the domain of the predictor fields: the perfect framework and the imperfect framework. The former uses coarsened RCM fields while the latter relies on the GCM fields directly. Emulators trained in the perfect framework are easier to train, yet cannot fully capture the inconsistencies that may exist between the GCM and RCM. In the imperfect framework, instead, the emulator must learn both the downscaling function and how to account for deviations between GCM and RCM, thus providing improved performance. However, training is more challenging and leads to model-dependent patterns, which may be more difficult to interpret and may lack physical consistency (Boé et al., Reference Boé, Mass and Deman2023; Baño-Medina et al., Reference Baño-Medina, Iturbide, Fernández and Gutiérrez2024). We refer the reader to Rampal et al. (Reference Rampal, Hobeichi, Gibson, Baño-Medina, Abramowitz, Beucler, González-Abad, Chapman, Harder and Gutiérrez2024a) and van der Meer et al. (Reference van der Meer, de Roda Husman and Lhermitte2023) for a more in-depth analysis of these two frameworks.
Regarding the deep learning (DL) architecture, convolutional neural networks (CNNs) and generative models have been the preferred choice. Among others, Doury et al. (Reference Doury, Somot, Gadat, Ribes and Corre2022) used a CNN-based UNet architecture to emulate the downscaling of daily near-surface temperature as well as daily precipitation (Doury et al., Reference Doury, Somot and Gadat2024). Rampal et al. (Reference Rampal, Gibson, Sherwood, Abramowitz and Hobeichi2024b) used a conditional generative adversarial network (cGAN) to downscale daily precipitation. Addison et al. (Reference Addison, Kendon, Ravuri, Aitchison and Watson2024) proposed a diffusion-based DL model to estimate daily precipitation at high resolution.
Recently, some studies have used reanalysis data to downscale earth system model (ESM) simulations through the SR framework. For instance, Hess et al. (Reference Hess, Aich, Pan and Boers2025) used a consistency model (CM) to downscale precipitation, and Schmidt et al. (Reference Schmidt, Schmidt, Strnad, Ludwig and Hennig2025) used a score-based diffusion model to downscale multiple variables such as near-surface wind speeds, surface air temperature, and sea level pressure. In both cases, the generative model is trained solely on the high-resolution reanalysis fields, which it learns to reproduce. During inference, the corresponding coarse-resolution fields from ESM simulations are used as conditioning input.
In this work, we present a novel RCM emulator for high-resolution precipitation projections, called GNN4CD (graph neural networks for climate downscaling; see Figure 1), which integrates an innovative training approach and a new graph neural network (GNN)-based model. Instead of training directly on climate model outputs, we adopt a hybrid imperfect framework: we train the model to perform reanalysis to observation downscaling (Figure 1a), and we show that it can subsequently function effectively as an RCM emulator to downscale future projections. Using reanalysis predictor fields during training helps to reduce biases and leads to an emulator that is not specific to any single climate model, unlike existing approaches. We also propose a novel DL model for the emulator, based on GNNs (Battaglia et al., Reference Battaglia, Hamrick, Bapst, Sanchez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, Gulcehre, Song, Ballard, Gilmer, Dahl, Vaswani, Allen, Nash, Langston, Dyer, Heess, Wierstra, Kohli, Botvinick, Vinyals, Li and Pascanu2018; Schlichtkrull et al., Reference Schlichtkrull, Kipf, Bloem, Van Den Berg, Titov and Welling2018; Sanchez-Lengeling et al., Reference Sanchez-Lengeling, Reif, Pearce and Wiltschko2021). GNNs can capture complex relationships and dependencies within data, making them well-suited for a range of applications, and they have become increasingly relevant in climate-related studies (Lam et al., Reference Lam, Sanchez-Gonzalez, Willson, Wirnsberger, Fortunato, Alet, Ravuri, Ewalds, Eaton-Rosen, Hu, Merose, Hoyer, Holland, Vinyals, Stott, Pritzel, Mohamed and Battaglia2023; Price et al., Reference Price, Sanchez-Gonzalez, Alet, Andersson, El-Kadi, Masters, Ewalds, Stott, Mohamed, Battaglia, Lam and Willson2025). In this work, we adopt GNNs to address the challenges posed by irregular grids and non-rectangular domains typical of climate data (e.g., land-only regions), where data are often undefined over large areas (e.g., the sea). In such settings, CNNs require interpolation and padding, which can introduce biases and waste computation on irrelevant regions. GNNs naturally handle these irregular structures, offering a flexible and computationally efficient alternative. Unlike standard CNNs, which require inputs of identical size to work properly, GNNs intrinsically support graphs with a variable number of nodes and edges. This key characteristic leads to a flexible model that can easily be tested for domain transferability, i.e., estimating the target variable across geographical areas distinct from those encountered during training. To the best of our knowledge, this is the first application of GNNs to RCM emulation, providing a novel approach that supports generalisation across grids and improves adaptability to real-world climate data.

Figure 1. The hybrid imperfect framework, applied to the GNN4CD emulator. Scheme of (a) training: reanalysis to observation downscaling, (b) inference: reanalysis to observation downscaling, (c) inference: RCM emulation.
Our emulator is built to operate at high spatial and temporal resolution, comparable to that of CP-RCM simulations. Working at a convection-permitting scale is crucial for capturing the dynamics of severe precipitation events, including localised thunderstorms and extreme rainfall associated with convective systems (Luu et al., Reference Luu, Vautard, Yiou and Soubeyroux2022). The choice of hourly data is driven by the need to capture precipitation extremes localised in space and time, which can cause floods within a short time window. Daily observational datasets smooth out these short-duration events, whereas hourly precipitation can capture them, which is essential for flood hazard studies (Fantini, Reference Fantini2019).
We evaluate the performance of the GNN4CD emulator in two different settings: reanalysis to observation downscaling (Figure 1b) and RCM emulation (Figure 1c). In the former, reanalysis data are used as predictors with observations as ground truth (as in training), allowing for a first thorough assessment of the model's reanalysis to observation downscaling capabilities. In the latter, predictor data come from RCM simulations, and the GNN4CD model is properly used as an emulator to derive historical and future projections at the CP-RCM scale. In both cases, the model is evaluated on a geographical area larger than the region used for training.
2. Data
In this study, we refer to the atmospheric predictor data as “low-resolution,” relative to the higher-resolution target dataset, reflecting their role as the coarser input in the downscaling framework rather than an absolute classification of resolution.
Five atmospheric variables are used as low-resolution predictors, at five pressure levels (Table 1), each reported on a $ 0.25^{\circ} $ longitude-latitude grid ($ \sim 25 $ km for Europe) with hourly temporal resolution. As training input data, these variables are taken from the ERA5 reanalysis dataset (Hersbach et al., Reference Hersbach, Bell, Berrisford, Hirahara, Horányi, Muñoz-Sabater, Nicolas, Peubey, Radu, Schepers, Simmons, Soci, Abdalla, Abellan, Balsamo, Bechtold, Biavati, Bidlot, Bonavita, De Chiara, Dahlgren, Dee, Diamantakis, Dragani, Flemming, Forbes, Fuentes, Geer, Haimberger, Healy, Hogan, Hólm, Janisková, Keeley, Laloyaux, Lopez, Lupu, Radnoti, de Rosnay, Rozum, Vamborg, Villaume and Thépaut2020) from the European Centre for Medium-Range Weather Forecasts (ECMWF). In the inference phase, either ERA5 or RCM data are used as predictors. RCM data come from a CP simulation (Pichelli et al., Reference Pichelli, Coppola, Sobolowski, Ban, Giorgi, Stocchi, Alias, Belušić, Berthou, Caillaud, Cardoso, Chan, Christensen, Dobler, de Vries, Goergen, Kendon, Keuler, Lenderink, Lorenz, Mishra, H-J, Schär, Soares, Truhetz and Vergara-Temprado2021) with the RegCM model (Coppola et al., Reference Coppola, Stocchi, Pichelli, Alavez, Glazer, Giuliani, Di Sante, Nogherotto and Giorgi2021), upscaled to the training resolution of $ 25 $ km.
Table 1. Variables used as predictors (P) and target (T), each reported with its symbol, unit, pressure levels, space and time resolutions

A remapping of the global multi-resolution terrain elevation data (Danielson and Gesch, Reference Danielson and Gesch2011) to a $ 3 $ km grid is also used as a predictor. Additionally, land-use information is included on the same high-resolution grid. These data are taken from the Community Land Model (CLM) version $ 4.5 $ (Oleson et al., Reference Oleson, Lawrence, Bonan, Drewniak, Huang, Koven, Levis, Li, Riley, Subin, Swenson, Thornton, Bozbiyik, Fisher, Heald, Kluzek, Lamarque, Lawrence, Leung, Lipscomb, Muszala, Ricciuto, Sacks, Sun, Tang and Yang2013; Thiery et al., Reference Thiery, Davin, Lawrence, Hirsch, Hauser and Seneviratne2017). Elevation and land use are the only predictors available at high resolution, yet they are static, i.e., a single value per spatial location, independent of time.
The GRidded Italian Precipitation Hourly Observations (GRIPHO) dataset, a high-resolution hourly precipitation dataset for Italy (Fantini, Reference Fantini2019), is used as the target to train the DL model. Originally developed as input to hydrological models and to validate RCM simulations, GRIPHO was created exclusively from in-situ precipitation station data. After a quality check of the station time series, the data were re-gridded onto a Lambert Conformal Conic grid, which is neither orthogonal nor regular in longitude-latitude coordinates. The choice of a curvilinear grid was primarily informed by the average station density ($ \sim 10 $ km) and offers methodological advantages over a regular latitude-longitude grid, as it ensures uniform grid cell areas and facilitates subsequent computational analyses. GRIPHO currently represents the only high-resolution, station-based precipitation dataset covering the entire Italian peninsula, spanning the period from 2001 to 2016. We used the GRIPHO dataset at a $ 3 $ km spatial resolution, consistent with Pichelli et al. (Reference Pichelli, Coppola, Sobolowski, Ban, Giorgi, Stocchi, Alias, Belušić, Berthou, Caillaud, Cardoso, Chan, Christensen, Dobler, de Vries, Goergen, Kendon, Keuler, Lenderink, Lorenz, Mishra, H-J, Schär, Soares, Truhetz and Vergara-Temprado2021).
All low- and high-resolution predictor variables are normalised to zero mean and unit variance, which is common practice in ML to ensure comparable feature contributions and improve numerical stability. Normalisation statistics are computed on the training set and then applied to both training and inference data. The target dataset is preprocessed to comply with the instrument sensitivity, i.e., all values strictly less than $ 0.1 $ mm are set to zero, and the dataset is then rounded to one decimal place. The target data are then transformed using $ \log \left(1+y\right) $. The logarithmic transformation compresses the range of target values, improving model stability, which is particularly useful for highly skewed data.
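For illustration, a minimal numpy sketch of this preprocessing chain (threshold, rounding, log transform, and predictor normalisation) is given below; the function and variable names are ours, not from the GNN4CD codebase.

```python
import numpy as np

def preprocess_target(y, threshold=0.1):
    """Apply the instrument-sensitivity threshold, round to one decimal,
    then compress the skewed distribution with log(1 + y)."""
    y = np.where(y < threshold, 0.0, y)   # values below 0.1 mm become zero
    y = np.round(y, 1)                    # match rain-gauge precision
    return np.log1p(y)                    # log(1 + y) transform

def normalise_predictors(x_train, x_infer):
    """Zero-mean unit-variance normalisation; statistics are computed
    on the training set and reused unchanged at inference time."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    return (x_train - mean) / std, (x_infer - mean) / std
```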
3. Hybrid imperfect framework
We propose the hybrid imperfect framework as a third alternative to the now-established perfect and imperfect frameworks for training RCM emulators. Here, hybrid refers to the use of different types of predictor data during training and inference: reanalysis and observations during training, and climate model outputs during inference. The term imperfect reflects the conceptual similarities to the imperfect framework, especially the use of two different source systems for input and target during training, which can exhibit biases, spatial misalignments, or different error structures that the model must cope with. Reanalyses are typically better aligned with observations (in space, time and dynamics) than a GCM is with an RCM, but the relationship is still far from perfect. During inference, either GCM or RCM predictors can potentially be used as input to the emulator trained in the hybrid imperfect framework. As a starting point, this study examines the scenario in which RCM predictors are employed.
We considered it important to explore a third alternative method to train the emulator, as both existing frameworks have several practical limitations. For example, both still require long future simulations of the RCM (or CP-RCM) to serve as target data for training the RCM emulator, thus not resolving the prohibitive cost issue of dynamical downscaling. In contrast, we show that our model learns effectively from a limited amount of available observational data, making the construction of the emulator entirely cost-effective. Moreover, the use of reanalysis data helps to mitigate biases and uncertainties that may be inherent in climate model simulations. Reanalysis assimilates a wide range of observational data, providing a more accurate representation of present-day climate conditions. Thus, the emulator trained with the hybrid imperfect framework is expected to develop a broader foundation in atmospheric dynamics, with the potential to intrinsically learn effective domain adaptation skills.
While the hybrid imperfect framework offers promising advantages in bias correction and generalisation across different domains, it also presents possible limitations that should be carefully addressed. First, the domain mismatch between training and inference predictors introduces a risk of distributional shift, where the emulator may encounter patterns never seen during training. Second, observational datasets used as targets may contain their own uncertainties, inconsistencies, or sparse coverage. Finally, reanalysis data are limited to the present day; thus, the model can only learn from historical climate. In this study, we show both strengths and limitations in using the hybrid imperfect framework in the current setup. However, we believe that a phase of fine-tuning post-training but prior to inference could mitigate these potential limitations and could be a cost-effective way of improving the emulator’s ability to generalise to future scenarios and different RCM simulations. This may be particularly helpful when GCM predictors are employed and will be the subject of future studies.
4. Deep learning model
The GNN4CD emulator uses a new DL model, specifically designed for this task. The architecture relies on GNNs to model the downscaling from the coarse grid of atmospheric predictors to the fine grid of precipitation data and to account for spatial interactions between the high-resolution locations. To this aim, the given spatial grids need to be converted into graphs, i.e., mathematical objects with nodes and edges. The definition of the graphs and architecture is described in detail in the following sections.
4.1. Graph conceptualization
Each point within the low-resolution and high-resolution grids corresponds to a specific geographical location, suggesting that they can be naturally modelled as nodes. For convenience, we model both grids as a unified heterogeneous graph featuring two node types and two edge types (a construction sketch follows the list):
• Low nodes: the first set of nodes, generated from the points on the low-resolution grid with a spatial resolution of approximately $ 25 $ km.
• High nodes: the second set of nodes, created from the points on the high-resolution grid with a spatial resolution of $ 3 $ km.
• Low-to-High edges: unidirectional edges connecting Low to High nodes, ensuring each High node is linked to a fixed number $ k $ of Low nodes. These edges model the downscaling of atmospheric variable information from the Low nodes to the High nodes (Figure 2a).
• High-within-High edges: bidirectional edges that capture relationships between High nodes based on an 8-neighbours approach, ensuring each node is connected to its eight nearest neighbours (Figure 2b).
Figure 2. Graph conceptualisation: Low nodes (blue dots) and High nodes (red dots) and close-up of (a) Low-to-High unidirectional edges (orange), connecting Low nodes to High nodes, (b) High-within-High bidirectional edges (red), linking High nodes.
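The two edge sets could be built, for instance, as in the following PyTorch sketch. Here `low_xy` and `high_xy` are assumed tensors of projected node coordinates, and the brute-force distance search is for clarity only (a spatial index would be preferable at scale); this is our illustration, not the published construction.

```python
import torch
from torch_geometric.utils import to_undirected

def knn_edges(src_xy, dst_xy, k):
    """Connect each destination node to its k nearest source nodes.
    Returns a (2, num_dst * k) edge index in [source; destination] form."""
    dist = torch.cdist(dst_xy, src_xy)              # (num_dst, num_src)
    nearest = dist.topk(k, largest=False).indices   # k nearest sources per dst
    dst = torch.arange(dst_xy.size(0)).repeat_interleave(k)
    return torch.stack([nearest.reshape(-1), dst])

# low_xy, high_xy: (num_nodes, 2) tensors of projected coordinates [km]
# Low-to-High edges: each High node receives information from k = 9 Low nodes.
low_to_high = knn_edges(low_xy, high_xy, k=9)

# High-within-High edges: query 9 neighbours (the nearest is the node itself),
# drop self-loops to keep 8, then symmetrise to make the edges bidirectional.
hh = knn_edges(high_xy, high_xy, k=9)
high_within_high = to_undirected(hh[:, hh[0] != hh[1]])
```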
4.2. Model design
Precipitation data contain a significant amount of zeros, as rain events only occur during a limited number of hours. In the case of the GRIPHO dataset, almost $ 90\% $ of the values fall below the meteorological threshold, assumed to be $ 0.1 $ mm, effectively rendering them zeros. In view of this, we explored two different designs for the DL model. In both cases, the model outputs an estimate for the time step $ t $ based on a time series of predictors spanning $ \left[t-L,\dots, t\right] $, where $ L $ is a hyper-parameter.
In the first approach, we addressed the challenge posed by zero precipitation values by adopting a Hurdle modelling scheme (Cragg, Reference Cragg1971). The method relies on the construction of two distinct models, which are subsequently combined: a Regressor and a Classifier. The Classifier is trained on the entire dataset and discerns between two classes: $ 0 $, i.e., precipitation below the threshold, and $ 1 $, i.e., precipitation above the threshold. Conversely, the Regressor is trained exclusively on targets where precipitation values exceed the threshold, and provides a quantitative estimate of hourly precipitation. During the inference phase, predictions from the Regressor and Classifier models are computed independently and then multiplied to yield a single estimate of the precipitation value. We refer to this model design as RC (Figure 3a). In addition, we considered a second approach where a single Regressor is used. This model is trained using the full GRIPHO dataset as target, i.e., including instances with zero precipitation, and outputs a quantitative estimate of hourly precipitation. We refer to this model design as R-all (Figure 3b). To enhance training efficiency and focus on meaningful targets, time steps containing only values below the threshold were excluded. This resulted in a reduction of approximately 50% in the training set size for the RC Regressor, whereas the impact was negligible for the RC Classifier and the R-all Regressor, with $ 99.9\% $ of time steps retained in both cases. In both the RC and R-all approaches, the models are trained on the complete graph, while the loss is computed exclusively on nodes with valid target values by applying a masking strategy.
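A schematic of how the two RC components could be combined at inference time is sketched below; `regressor` and `classifier` are hypothetical handles to the two trained networks, and we assume a single wet/dry logit per High node (a two-logit variant would be analogous).

```python
import torch

@torch.no_grad()
def rc_inference(regressor, classifier, graph):
    """Hurdle-model inference: the Classifier gates the Regressor output.
    Both models are assumed to return one value per High node."""
    wet = (torch.sigmoid(classifier(graph)) > 0.5).float()  # 1: rain, 0: dry
    log_amount = regressor(graph)                 # trained in log(1 + y) space
    amount = torch.expm1(log_amount).clamp(min=0.0)  # back to mm/h
    return amount * wet                           # zero out dry nodes
```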

Figure 3. Schematic views of (a) RC, designed as a combination of Regressor and Classifier components, (b) R-all consisting of a single Regressor, (c) architecture, composed of four modules: a RNN-based pre-processor, a GNN-based downscaler, a GNN-based processor, and a FCNN-based predictor.
The two alternative designs offer valuable insights due to their differing model complexities and problem formulations. The RC case adopts a two-model structure, separating the classification of precipitation occurrence from the regression of its intensity. This allows for targeted learning but increases the number of parameters and model complexity. Moreover, the classification task remains particularly imbalanced and complex to solve, and even the data obtained by considering only the values larger than the threshold remain very skewed. In contrast, the R-all case adopts a single model, thus resulting in a simpler architecture with fewer parameters but requiring the model to simultaneously handle both occurrence and intensity prediction within a single task, thus increasing the problem complexity. In the experiments, we will analyse both model designs to see whether the added value justifies the cost of using two models instead of one.
4.3. Architecture
Regressor and Classifier components share the same structure, which consists of four primary modules (see Figure 3c). The first is a pre-processor module, which acts at the Low-node level and handles the predictors' temporal component through a recurrent neural network (RNN), specifically a gated recurrent unit (GRU). The RNN encoder captures the temporal dependencies across time steps, outputting a sequence that is flattened and passed through a fully connected layer to produce a fixed-dimensional latent representation. This encoding serves as input to the graph-based components of the model. The next stage is the downscaler module, which uses a graph convolution (GC) layer to map the preprocessed atmospheric variables, represented as Low node features, to learned attributes on the High nodes. This transformation is crucial for bridging the different spatial scales of the input and output data. The downscaler incorporates additional high-resolution spatial attributes (elevation and land-use data), ensuring that the model is well informed about local geographical features. Following the downscaling step, the core of the model is a multi-layer processor network comprising several graph attention layers (GAT, Veličković et al., Reference Veličković, Cucurull, Casanova, Romero, Lio and Bengio2018). These layers are designed to dynamically attend to neighbouring nodes, thereby capturing complex spatial relationships. Each GAT layer is followed by batch normalisation and a ReLU activation to stabilise training and introduce non-linearity, respectively. The use of multiple GAT layers allows the model to progressively refine its understanding of spatial dependencies, essential for accurately predicting local precipitation. The final component of the architecture is the predictor, a fully connected neural network (FCNN) that takes the processed graph features and returns the desired output on each High node. The model is entirely implemented in PyTorch (Paszke et al., Reference Paszke, Gross, Massa, Lerer, Bradbury, Chanan, Killeen, Lin, Gimelshein, Antiga, Desmaison, Kopf, Yang, DeVito, Raison, Tejani, Chilamkurthy, Steiner, Fang, Bai and Chintala2019) and PyTorch Geometric (Fey and Lenssen, Reference Fey and Lenssen2019), using GraphConv (Morris et al., Reference Morris, Ritzert, Fey, Hamilton, Lenssen, Rattan and Grohe2021) and GATv2Conv (Brody et al., Reference Brody, Alon and Yahav2022) as GC and GAT layers, respectively.
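To make the four-module pipeline concrete, a minimal PyTorch Geometric sketch follows. It mirrors the description above, but all widths and depths, and the use of the two static features as the High-node input to the bipartite convolution, are our illustrative assumptions rather than the published configuration; `n_vars=25` corresponds to five variables at five pressure levels, and `seq_len=25` to the $ \left[t-24,\dots, t\right] $ window introduced below.

```python
import torch
from torch import nn
from torch_geometric.nn import GraphConv, GATv2Conv

class GNN4CDNet(nn.Module):
    """Illustrative four-module pipeline: GRU pre-processor ->
    GraphConv downscaler -> stacked GATv2Conv processor -> FCNN predictor."""

    def __init__(self, n_vars=25, seq_len=25, hid=128, gat_layers=4):
        super().__init__()
        # 1) Pre-processor: encode each Low node's predictor time series.
        self.rnn = nn.GRU(n_vars, hid, batch_first=True)
        self.to_latent = nn.Linear(seq_len * hid, hid)
        # 2) Downscaler: bipartite Low-to-High graph convolution; the High
        #    side carries the two static features (elevation, land use).
        self.down = GraphConv((hid, 2), hid)
        # 3) Processor: attention over High-within-High edges.
        self.gats = nn.ModuleList([GATv2Conv(hid, hid) for _ in range(gat_layers)])
        self.norms = nn.ModuleList([nn.BatchNorm1d(hid) for _ in range(gat_layers)])
        # 4) Predictor: per-node fully connected head.
        self.head = nn.Sequential(nn.Linear(hid, hid), nn.ReLU(), nn.Linear(hid, 1))

    def forward(self, x_low, x_high_static, e_low2high, e_high):
        # x_low: (num_low, seq_len, n_vars); x_high_static: (num_high, 2)
        h, _ = self.rnn(x_low)                          # (num_low, seq_len, hid)
        h = self.to_latent(h.flatten(1))                # flatten, then project
        z = self.down((h, x_high_static), e_low2high)   # Low -> High features
        for gat, bn in zip(self.gats, self.norms):
            z = torch.relu(bn(gat(z, e_high)))          # refine spatial structure
        return self.head(z).squeeze(-1)                 # one value per High node
```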
The downscaling task requires the specification of the spatial and temporal length scales at which predictors can influence outputs. While this represents an assumption, it is grounded in the underlying physics of the problem and adjusted by empirical evidence. From a spatial point of view, we assume that the relevant region extends to approximately $ 50 $ km around the target location. From a temporal perspective, we assume that the precipitation prediction at time $ t $ is influenced by the atmospheric variables up to $ 24 $ hours before. These ranges are explicitly considered in designing the graph and the emulator architecture. The spatial assumption is realised by choosing a value of $ k=9 $ as the number of nearest Low nodes to which each High node is connected (see Subsection 4.1), effectively representing the downscaling relationship in terms of graph edges. This assumption dictates the graph structure, but has no direct influence on the GNN parameters, as GNNs accept graphs with an arbitrary number of edges. The temporal assumption is instead reflected in choosing $ L=24 $ (see Subsection 4.2), thus designing the predictors' time series as $ \left[t-24,\dots, t\right] $, encompassing a total of $ 25 $ time steps. This assumption sets the length of the time series of predictors passed as input to the RNN module, thus defining the RNN sequence length parameter.
4.4. Regressor loss function
Mean square error (MSE) is the standard loss function for regression problems, yet it becomes less effective when the target data are highly imbalanced or skewed, as in the case of precipitation data. MSE leads to estimates that are biased towards frequent values, which is detrimental for modelling rare events; we therefore chose a modified MSE loss function which can explicitly address imbalance and skewness. Multiple studies in the literature use weighted MSE losses for training in such unfavourable conditions (Wang et al., Reference Wang, Wang, Wang, Xue and Wang2022; Scheepens et al., Reference Scheepens, Schicker, Hlaváčková-Schindler and Plant2023). The most sensitive aspect is the definition of an optimal weighting strategy, which should be consistent with the target data distribution and training objectives, e.g., giving more weight to the tail of the distribution in order to estimate the extreme events. Recently, Szwarcman et al. (Reference Szwarcman, Guevara, Macedo, Zadrozny, Watson, Rosa and Oliveira2024) proposed a formulation to quantise the reconstruction loss and improve the synthesis of extreme weather data with variational autoencoders (VAEs). This modification of the MSE loss was designed to address the skewed distribution typical of weather data by giving more weight to rare, extreme values. The idea is to penalise the loss according to the observed value frequency, by quantising the target data and averaging losses for each bin. In this study, we adopt the same loss to train the Regressor, and we call it quantised MSE (QMSE) loss (Equation 4.1). The only parameters of QMSE are the bins into which the target data are quantised. During training, examples are dynamically assigned to the corresponding bins based on their true target values, ensuring that the weights in the QMSE loss reflect the actual value distribution within each batch. This approach is beneficial because batches are formed by randomly selecting time instants, and each batch includes all points in the graph for the chosen time instants. Consequently, the target distribution can vary slightly between batches.
$$ \mathrm{QMSE}\left(y,\hat{y}\right)=\sum_{j}\frac{1}{\mid {\Omega}_j\mid}\sum_{i\in {\Omega}_j}{\left({y}_i-{\hat{y}}_i\right)}^2 \qquad (4.1)$$
In Equation 4.1, $ j $ represents the bin index, defined based on a histogram of the training data, and $ {\Omega}_j $ is the set of target indices whose values fall within bin $ j $; thus $ \mid {\Omega}_j\mid $ is the observed frequency and $ 1/\mid {\Omega}_j\mid $ weighs the loss inversely to the frequency. The quantities $ {y}_i $ and $ {\hat{y}}_i $ are the true and estimated target values, respectively. Finally, in the training, we used a combined MSE-QMSE loss (Equation 4.2) with a coefficient $ \overline{\alpha} $ which accounts for the different scales and balances the contribution of the two terms.
$$ \mathrm{Loss}=\mathrm{MSE}\left(y,\hat{y}\right)+\overline{\alpha}\;\mathrm{QMSE}\left(y,\hat{y}\right) \qquad (4.2)$$
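A possible PyTorch realisation of this combined loss, with the dynamic per-batch bin assignment, is sketched below; the function names are ours, and `bin_edges` would be the logarithmically equispaced bin boundaries described in Subsection 4.6.

```python
import torch
import torch.nn.functional as F

def qmse_loss(y_hat, y, bin_edges):
    """Quantised MSE (Equation 4.1): average the squared error within each
    occupied bin, so frequent low values cannot swamp rare extremes."""
    bins = torch.bucketize(y, bin_edges)       # dynamic per-batch assignment
    loss = y.new_zeros(())
    for j in bins.unique():                    # only occupied bins contribute
        mask = bins == j
        loss = loss + ((y_hat[mask] - y[mask]) ** 2).mean()
    return loss

def combined_loss(y_hat, y, bin_edges, alpha_bar=0.025):
    """Combined MSE-QMSE loss (Equation 4.2); alpha_bar balances the terms."""
    return F.mse_loss(y_hat, y) + alpha_bar * qmse_loss(y_hat, y, bin_edges)
```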
4.5. Classifier loss function
The Classifier is trained using focal loss (FL) (Lin et al., Reference Lin, Goyal, Girshick, He and Dollár2020), specifically designed to address class imbalance during training. In this setting, standard cross entropy (CE) loss tends to focus on minimising errors for the majority class, often leading to poor performance on the minority class (Johnson and Khoshgoftaar, Reference Johnson and Khoshgoftaar2019). FL introduces a modulating term $ \gamma $ in the CE formulation to dynamically scale the CE loss. Thanks to this scaling factor, the contribution of easy examples is down-weighted and the model is guided to focus on hard examples. Additionally, a hyper-parameter $ \alpha $ helps to handle the class imbalance. The formulation of the FL loss is given in Equation 4.3, where $ p $ is a function of the Classifier output, i.e., it depends on the input data, and $ y\in \left\{0,1\right\} $ is the ground-truth class.
$$ \mathrm{FL}\left({p}_t\right)=-{\alpha}_t{\left(1-{p}_t\right)}^{\gamma}\log \left({p}_t\right),\qquad {p}_t=\begin{cases}p, & y=1\\ 1-p, & y=0\end{cases},\qquad {\alpha}_t=\begin{cases}\alpha, & y=1\\ 1-\alpha, & y=0\end{cases} \qquad (4.3)$$
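The following is a minimal PyTorch sketch of the binary focal loss, using the $ \alpha =0.75 $, $ \gamma =2 $ values reported in Subsection 4.6 and assuming a single wet/dry logit per node.

```python
import torch

def focal_loss(logits, y, alpha=0.75, gamma=2.0):
    """Binary focal loss (Equation 4.3). The (1 - p_t)^gamma factor
    down-weights easy examples; alpha re-balances the two classes."""
    y = y.float()
    p = torch.sigmoid(logits)                  # probability of class 1 (rain)
    p_t = p * y + (1.0 - p) * (1.0 - y)
    alpha_t = alpha * y + (1.0 - alpha) * (1.0 - y)
    ce = -torch.log(p_t.clamp(min=1e-8))       # per-example cross entropy
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```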
4.6. Training and evaluation
As mentioned in Section 2, the GRIPHO observational data are available for a period of $ 16 $ years, with spatial coverage of the entire Italian territory. To evaluate the emulator's capability to generalise to spatial domains not used during training, we chose to train only on northern Italy and use the whole peninsula for the inference phase (see Figure 4a). In this setting, the geographical area considered for training is approximately $ 120{,}000 $ km$ {}^2 $, with around $ 400 $ Low nodes and $ 14{,}000 $ High nodes. The evaluation area is instead approximately $ 300{,}000 $ km$ {}^2 $, with around $ 1000 $ Low nodes and $ 33{,}000 $ High nodes. The time range spanned by the GRIPHO dataset is rather limited, which is usually the case when dealing with high-resolution observational data, which are quite difficult to find. We decided to use the first $ 15 $ years for training ($ 2001 $–$ 2006 $ and $ 2008 $–$ 2015 $) and validation ($ 2007 $) and leave the last available year ($ 2016 $) to test the GNN4CD model in the reanalysis to observation downscaling task. Considering the number of time instants and High nodes, we obtain a ratio of approximately $ 75 $–$ 12.5 $–$ 12.5 $ for the train-validation-test datasets. In the RCM emulation setting, we considered the area of the Italian peninsula covered by RegCM and three different time slices of the RegCM simulations: historical ($ 1996 $–$ 2005 $), mid-century ($ 2041 $–$ 2049 $) and end-of-century ($ 2090 $–$ 2099 $). All projections were performed under the RCP8.5 scenario (Riahi et al., Reference Riahi, Rao, Krey, Cho, Chirkov, Fischer, Kindermann, Nakicenovic and Rafaj2011; IPCC, 2014), which represents a high-emissions pathway associated with the most pronounced climate change signal. This choice increases the difficulty of the emulation task, particularly for capturing changes in the distribution of extreme precipitation events.

Figure 4. (a) training (northern Italy) and inference (entire Italy) areas, (b) locations of original stations used to create the GRIPHO dataset, and (c) percentage of valid time steps for each station.
The RC Regressor and Classifier components and the R-all Regressor component were all trained separately for $ 50 $ epochs, i.e., approximately $ 24 $ hours each on $ 4\times $ NVIDIA Ampere GPUs on Leonardo, the new pre-exascale Tier-0 EuroHPC supercomputer, currently hosted by CINECA in Bologna, Italy (Turisini et al., Reference Turisini, Amati and Cestari2023). The trained emulator needs only a few minutes to compute the hourly precipitation estimates for an entire year over Italy, far less than the dynamical downscaling of a CP-RCM, which needs approximately a couple of days on an equivalent high-resolution grid.
We used the validation year to empirically tune the hyper-parameters that appear in the model architecture and loss functions. Considering the computational cost of training the model, we opted for manual hyper-parameter tuning to limit resource usage to what was strictly necessary, while still achieving quick convergence to good values. Systematic hyper-parameter tuning is expected to further improve performance. Thus, we view this not as a limitation, but as a promising direction for future enhancement, contingent on the availability of additional computational resources and time. For the Regressor components, we empirically tested the number of bins in the QMSE and concluded that it does not have a strong impact on the training, provided that the coefficient $ \overline{\alpha} $ is properly tuned. We chose to use logarithmically equispaced bins with a bin size of $ \log \left(0.5+1\right) $. Empirical findings showed a trade-off when training the emulator with the combined MSE-QMSE loss: lower values of $ \overline{\alpha} $ favour average results, while higher values of $ \overline{\alpha} $ lead to improved accuracy in the tail of the distribution. We chose $ \overline{\alpha}=0.025 $ for the RC Regressor component and $ \overline{\alpha}=0.005 $ for the R-all model, both of which lean toward the latter behaviour. This choice reflects our interest in accurately capturing the full distribution of precipitation values, but with particular emphasis on the extremes. For the FL, we started from the parameter values suggested in Lin et al. (Reference Lin, Goyal, Girshick, He and Dollár2020) ($ \alpha =0.25 $ and $ \gamma =2 $) and operated a manual grid search to adjust these values to our specific case. We found that the FL loss used to train the RC Classifier component works best with parameters $ \alpha =0.75 $ and $ \gamma =2 $.
5. Metrics
In evaluating the reanalysis to observation downscaling and RCM emulation capabilities of GNN4CD, we used several key metrics, introduced here:
• probability density function;
• seasonal diurnal cycles (average, frequency and intensity);
• spatial average, $ 99 $th and $ 99.9 $th percentiles;
• percentage bias for spatial average, $ 99 $th and $ 99.9 $th percentiles;
• spatial correlation;
• spatial change in future projections.
The first fundamental metric is the probability density function (PDF), defined as the normalised frequency of occurrence of events within a certain bin. The PDF is obtained by considering all the hourly values for all the grid points over the desired spatial area and temporal period. Next, seasonal diurnal cycles are crucial to investigate temporal patterns on a sub-daily scale. Diurnal cycles are obtained by averaging the values of all time steps corresponding to the same hour, for each hour of the day, over the considered time span. Specifically, we consider the diurnal cycles of precipitation average, frequency, and intensity. The average is obtained by dividing the total sum by the number of instances, considering both zero and non-zero precipitation cases. The frequency is defined as the percentage of non-zero precipitation cases over the total, also referred to as the percentage of rainy hours. Finally, the intensity is computed similarly to the average, but considering only non-zero precipitation cases. The diurnal cycles are presented separately for each season. Seasons in climatological studies are usually defined as DJF (December, January, February), MAM (March, April, May), JJA (June, July, August), and SON (September, October, November), and we adopt the same terminology. As spatial statistics, we consider the average and the $ 99 $th and $ 99.9 $th percentiles (p$ 99 $ and p$ 99.9 $). These metrics are computed individually for each location in space, for the given period of time, and are visualised through spatial maps. Both zero and non-zero precipitation cases are considered in the computation. The p$ 99 $ and p$ 99.9 $ quantities are defined as the values below which $ 99\% $ and $ 99.9\% $ of the values fall, respectively. The former represents a high-end threshold and illustrates whether the estimates capture extreme events. The latter focuses on even more extreme events, providing insights into the ability to capture the rarest precipitation occurrences, which are usually responsible for flood episodes. For all these quantities, we also consider the percentage bias, which quantifies the relative error in the estimates, indicating whether they overestimate or underestimate the reference values. The percentage bias is computed by taking the difference between the estimates and the reference values, then dividing by the reference values and multiplying by $ 100 $. To quantify the degree of agreement between the spatial patterns of the estimates and those of the reference, we adopt the Pearson correlation coefficient (PCC), computed for the maps of precipitation average, p$ 99 $ and p$ 99.9 $. This coefficient measures the linear correlation between two sets of data and takes values between $ -1 $ and $ 1 $, with $ 0 $ separating negative from positive correlation. The final metric we consider is the spatial change, used to evaluate future projections. This quantity is the difference between the future projection quantities (mid-century or end-of-century) and the historical amount, computed consistently within the same model.
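Since these metrics are simple reductions of the hourly precipitation arrays, a compact numpy sketch may help fix the definitions; the array names and the (time, nodes) shape are our assumptions.

```python
import numpy as np

def spatial_stats(x):
    """Per-location average and extreme percentiles over the time axis;
    x has shape (time, nodes) and includes zero-precipitation hours."""
    return {"mean": x.mean(axis=0),
            "p99": np.percentile(x, 99, axis=0),
            "p99.9": np.percentile(x, 99.9, axis=0)}

def percentage_bias(est, ref):
    """Relative error in percent; positive values indicate overestimation."""
    return 100.0 * (est - ref) / ref

def spatial_pcc(est_map, ref_map):
    """Pearson correlation between two flattened spatial maps."""
    return np.corrcoef(est_map.ravel(), ref_map.ravel())[0, 1]

def diurnal_cycle(x, hour_of_day):
    """Mean over all time steps sharing the same hour of day (0..23)."""
    return np.array([x[hour_of_day == h].mean() for h in range(24)])
```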
6. Results and discussion
In this section, we present the inference results of GNN4CD, assessing its performance in both the reanalysis to observation downscaling and RCM emulation settings.
For the reanalysis to observation downscaling task, GNN4CD estimates are evaluated against GRIPHO observations with a focus on PDFs and on spatial average, p$ 99 $ and p$ 99.9 $ maps of hourly precipitation. We also examine the seasonal diurnal cycles, which are particularly relevant, given that one of the key added values of the CP-RCMs lies in their improved representation of sub-daily precipitation patterns, especially the afternoon convective peak typically observed in summer (Ban et al., Reference Ban, Caillaud, Coppola, Pichelli, Sobolowski, Adinolfi, Ahrens, Alias, Anders, Bastin, Belušić, Berthou, Brisson, Cardoso, Chan, Christensen, Fernández, Fita, Frisius, Gašparac, Giorgi, Goergen, Haugen, Hodnebrog, Kartsios, Katragkou, Kendon, Keuler, Lavin-Gullon, Lenderink, Leutwyler, Lorenz, Maraun, Mercogliano, Milovac, H-J, Raffa, Remedio, Schär, Soares, Srnec, Steensen, Stocchi, Tölle, Truhetz, Vergara-Temprado, de Vries, Warrach-Sagi, Wulfmeyer and Zander2021). Additionally, we evaluate the GNN4CD performance on $ 10 $ documented flood episodes (observed between $ 2011 $ and $ 2016 $) as an initial step towards assessing its potential for impact-oriented applications.
In the RCM emulation setting, we compare the output of GNN4CD to the RegCM simulations. To this end, we analyse the PDFs across the three time slices and examine the spatial distribution of average and extreme precipitation changes for the future periods. In the PDF comparison, we include a $ 10 $-year subset of the GRIPHO observational dataset. For the historical period, our aim is for the emulator to reproduce results that are closer to GRIPHO, thereby mitigating the biases typically present in climate model simulations. For the future time slices, where no ground truth is available, we conduct a comparative analysis between the emulator's estimates and the RegCM outputs. Here, particular attention is given to assessing the emulator's ability to capture changes in the precipitation distribution associated with global warming.
Furthermore, given that the use of a time series of predictors to downscale a single precipitation time step represents a relatively novel approach, we conducted an evaluation to assess the added value of this methodology. The corresponding findings are presented and discussed in Appendix A.3.
6.1. Reanalysis to observation downscaling
The aim of this initial evaluation is to provide a comprehensive assessment of the GNN4CD behaviour in a setting that closely resembles the training environment.
We begin by assessing the PDFs and diurnal cycles (Figures 5 and A1, respectively, for GNN4CD RC and R-all). The PDFs computed over the entire Italian domain show good agreement between the estimates and the observational reference (panels a). GNN4CD tends to slightly overestimate precipitation amounts (in the order of a few millimetres per hour), while higher values, from the p$ 99 $ onward and in the tail of the distribution, are more accurately captured. The p$ 99 $ and p$ 99.9 $ computed on the same aggregated data are also close to the GRIPHO reference, especially for the RC model (Table 2). Panels c-d-e display the diurnal cycles of hourly precipitation average, frequency (percentage of rainy hours), and intensity, respectively. Each panel is further divided by season and presents the daily evolution at one-hour intervals. Overall, GNN4CD provides a good match with GRIPHO observations, particularly in terms of average precipitation and frequency in the RC model configuration. Both RC and R-all exhibit a larger bias in average JJA precipitation compared to other seasons. In the RC case, this arises predominantly from excessively high precipitation intensities; in the R-all case, overly frequent precipitation is the main contributor. Nonetheless, in both cases GNN4CD captures the evolution of the precipitation well, with very good timing of the precipitation peak in the late afternoon (around 17:00-18:00).

Figure 5. Results in the reanalysis to observation downscaling setting and comparison with GRIPHO observations for the testing year $ 2016 $: PDF of hourly precipitation [mm/h] with bin size of $ 0.5 $ mm for (a) Italy (I), (b) northern Italy (N) and central-south Italy (C-S), where the insets provide a magnified view of the tail of the distribution; (c) average [mm/h], (d) frequency [%] and (e) intensity [mm/h] seasonal diurnal cycles for Italy (I).
Table 2. Extreme percentiles computed for GRIPHO and the GNN4CD RC and R-all model designs for Italy (I), northern Italy (N), and central-south Italy (C-S)

Next, we examine the spatial maps and the corresponding maps of percentage bias (Figures 6 and A2 for GNN4CD RC and R-all, respectively). In the case of average precipitation (panels a), GNN4CD generally leads to overestimation, with the largest bias occurring in areas of complex topography. Systematic biases have been identified in the Apulia region, as well as along the Tyrrhenian and Adriatic coasts of the Tuscany and Marche regions. These biases are more pronounced for the RC model. In the case of p$ 99 $ and p$ 99.9 $ (panels b-c), GNN4CD still tends to overestimate extreme precipitation in regions characterised by complex topography, although this overestimation is less pronounced relative to the bias observed in average precipitation. Conversely, a clear underestimation is evident in plain and hilly areas. The RC model exhibits overestimation in Apulia and along the Tyrrhenian coastlines for the extreme percentiles as well. The overestimation in regions of complex topography is likely linked to the well-known issue of gauge undercatch, as the stations used to create the reference observational dataset are primarily located in valleys. Regions of complex topography are instead rarely covered, leading to poorer interpolation and thus affecting the DL model learning (Figure 4b). The temporal coverage of station data is also very uneven and may have negatively influenced the quality of the GRIPHO dataset in the less covered locations (Figure 4c).

Figure 6. Results in the reanalysis to observation downscaling setting and comparison with GRIPHO observations for the testing year $ 2016 $ for (a) average precipitation [mm/h] and percentage bias [%], (b) p$ 99 $ [mm/h] and percentage bias [%], (c) p$ 99.9 $ [mm/h] and percentage bias [%].
Similarly, we analyse the seasonal spatial maps for GRIPHO and the corresponding percentage bias for GNN4CD RC and R-all. The evaluation focuses on average precipitation and extreme percentiles, presented in Figures A3, A4, and A5. Figure A6 shows instead the seasonal PDFs. Both RC and R-all models tend to be wetter in JJA and drier in MAM. The seasonal PDFs suggest that the principal cause might be an overestimation of the occurrence of low-to-moderate events for the RC model and low events for the R-all model. The tails are instead better represented, in line with the p$ 99.9 $ maps showing the lowest biases.
Moreover, we evaluate the PCC for the entire Italian peninsula and the north and central-south areas (Table 3). In five cases out of nine, the correlation coefficients computed for the R-all model are higher than the RC values. Specifically, R-all performs better in all the metrics for the central-south area. Nevertheless, the RC model also reports positive correlation coefficients for all the cases investigated, with the highest values observed for the north of Italy, as expected. However, the systematic biases that we observe in some parts of the central-south area (e.g., the Apulia region) may have a detrimental influence on the aggregated spatial correlation values, where the gap with northern Italy is much more pronounced than in other metrics we assessed.
Table 3. Spatial correlation between the reference GRIPHO maps and the GNN4CD RC and R-all estimated maps for precipitation average, p$ 99 $ and p$ 99.9 $; results are shown for Italy (I), northern Italy (N), and central-south Italy (C-S)

GNN4CD is further evaluated in representing the total precipitation for $ 10 $ flood episodes occurring within the GRIPHO time span. All flood events exceed the $ 99 $th percentile of the precipitation distribution in the affected area, except one case, which remains above the $ 90 $th percentile; seven events also surpass the $ 99.9 $th percentile. For this specific application, both the RC and R-all models were retrained by excluding the time steps of the floods from the training set, in order to allow a fair evaluation. Figure 7 displays the comparison with the GRIPHO observational reference for all the events, for both the RC and R-all model designs. The results are promising, as all flood events are captured in terms of both spatial extent and severity. The overestimation of precipitation amounts is consistent with the patterns observed in previous evaluations, except for the second event, where the overestimation is more pronounced. This case requires a more in-depth study of the physical event to understand whether it is particularly out-of-distribution compared to the cases encountered during training, and will be the subject of further studies.

Figure 7. Total precipitation [mm] for $ 10 $ flood events in Italy. Events $ 1 $, $ 4 $, $ 5 $, $ 6 $, $ 8 $, $ 10 $ took place in northern Italy (N), events $ 2 $, $ 7 $ in central Italy (C), and events $ 3 $, $ 9 $ in southern Italy (S).
Overall, GNN4CD demonstrates consistent performance in the reanalysis to observation downscaling task over the grid points of the Italian peninsula, indicating a degree of spatial transferability across the precipitation distribution. This includes the DL model’s capacity to represent both average and extreme spatial patterns, as well as the characteristics of individual flood events across the entire domain. A comparison of the PDFs relative to northern Italy and central-south Italy further illustrates the spatial transferability of GNN4CD, although some regional differences remain evident (Figures 5b and A1b). Indeed, notable degradation in performance is observed in certain specific regions where systematic biases persist and worsen the aggregate results. In future work, we aim to address these issues by investigating the underlying causes of the observed discrepancies and developing targeted solutions to improve the spatial transferability and overall performance of the model in the affected areas.
6.2. RCM emulation
In the RCM emulation setting, GNN4CD is used as a proper emulator to downscale RCM simulations.
We begin by comparing the PDFs (Figure 8). Panels a and d show the precipitation PDFs estimated by GNN4CD RC and R-all for the historical period. We compare them to the original climate model output and to the $ 10 $-year subset of the GRIPHO observational dataset. The PDFs estimated by the RC and R-all models tend to yield a higher frequency of low precipitation values while underestimating the tail of the distribution, relative to the observational data. In contrast, RegCM tends to overestimate precipitation in the distribution tail when compared to observations. As expected, GNN4CD estimates are closer to the observed PDFs than to the RegCM distribution, especially in the RC case. The R-all model shows instead a more evident underestimation. Nevertheless, results are generally promising and in line with the reanalysis to observation downscaling case. Panels b-c and e-f present a comparison between the PDFs shown in panels a and d and those generated by the GNN4CD and RegCM models for the mid-century and end-of-century projections. The results indicate that the emulator generally captures the climate change signal exhibited by the RegCM model, reflected in a shift of the precipitation distribution towards more frequent and intense events when moving to the future time slices. The magnitude of the shift differs slightly between the RC and R-all models, with the latter being closer to the RegCM shift and the former slightly under-representing the change between mid-century and end-of-century. These findings are quite remarkable, considering that GNN4CD was not trained on any precipitation scenario data, yet it shows a surprising ability to capture general trends even beyond the climate regimes on which it was trained.

Figure 8. PDF of hourly precipitation [mm/h] with bin size of $ 0.5 $ mm for Italy (I); comparison of the GRIPHO $ 10 $-year subset (grey) and RegCM historical (black) with (a) historical GNN4CD RC (blue), (b) mid-century RegCM (pink) and GNN4CD RC (orange), (c) end-of-century RegCM (magenta) and GNN4CD RC (dark orange), (d) historical GNN4CD R-all (blue), (e) mid-century RegCM (pink) and GNN4CD R-all (orange), (f) end-of-century RegCM (magenta) and GNN4CD R-all (dark orange); the insets provide a magnified view of the tail of the distribution.
Next, we compare the spatial changes projected by GNN4CD RC and R-all, and RegCM. The results when moving from historical to end-of-century are shown in Figure 9, while Figure A7 shows the case of moving from historical to mid-century. Panels a-b show the reference maps in terms of average precipitation for the historical period and the corresponding changes, computed consistently for each of the models. The emulator's projections show an end-of-century dry change signal in the central regions towards the Tyrrhenian coast and on the island of Sardinia, in line with the projections of RegCM. The intensification of the signal from mid-century to end-of-century, both wet and dry, is also confirmed by the projections of GNN4CD. The emulator also agrees in representing the wet change signal over the Alpine chain for the mid-century time slice. The same agreement is observed for the precipitation increase in the Apulia, Basilicata, Veneto and Friuli-Venezia-Giulia regions, mainly evident in the end-of-century estimates. However, there are multiple cases in which the projections of GNN4CD and RegCM disagree. For instance, GNN4CD projects a wetter climate signal in the Padania region (central-northern Italy) for the mid-century, more evident for the R-all model. The same behaviour is observed over the Alpine chain in the end-of-century change. Panels c-d and e-f show the same results (historical reference and corresponding change) in terms of p$ 99 $ and p$ 99.9 $. When looking at the p$ 99 $ end-of-century change, the disagreement between the projections of GNN4CD and RegCM is even more evident. In this case, both RC and R-all project a wetter climate over almost all the peninsula, whereas the RegCM projections show a dry signal along the Tyrrhenian coast. Notably, the projections of GNN4CD and RegCM for the p$ 99.9 $ change show an overall alignment. The emulator generally agrees with the sign of the RegCM change signal, although it continues to project a generally wetter climate.

Figure 9. Maps for GNN4CD RC, GNN4CD R-all, and RegCM showing (a) historical average hourly precipitation [mm/h] and (b) end-of-century average change [%]; (c)-(d) the same for p$99$ and (e)-(f) the same for p$99.9$.
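As a rough illustration of how the change maps in Figure 9 can be constructed, the following sketch derives per-grid-point percentiles and the relative change [%] between two time slices. The array shapes, variable names, and stand-in data are our assumptions, not the authors’ pipeline.

```python
import numpy as np

def percentile_change_map(pr_hist, pr_fut, q=99.9):
    """Relative change [%] of a per-grid-point percentile between two
    time slices; inputs have shape (time, ny, nx) in mm/h."""
    p_hist = np.percentile(pr_hist, q, axis=0)
    p_fut = np.percentile(pr_fut, q, axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        change = 100.0 * (p_fut - p_hist) / p_hist
    # Mask grid points that are dry at the chosen percentile.
    return np.where(p_hist > 0, change, np.nan)

# Stand-in arrays: 2000 hours on a 20 x 20 grid.
rng = np.random.default_rng(0)
pr_hist = rng.gamma(0.4, 2.0, size=(2000, 20, 20))
pr_eoc = rng.gamma(0.4, 2.3, size=(2000, 20, 20))
dp999 = percentile_change_map(pr_hist, pr_eoc, q=99.9)
```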
Additionally, we derive box plots from the spatial maps of average precipitation, p$99.9$, and percentage of rainy hours over Italy (Figure 10). These quantities are displayed for each time slice for GNN4CD RC, GNN4CD R-all, and RegCM; box plots are also displayed for the corresponding relative percentage bias maps. The boxes span from the first to the third quartile of the data, with a horizontal line indicating the median. The whiskers extend from the edges of the box to the most extreme data point that falls within 1.5 times the interquartile range of the lower and upper quartiles. The box plots of the average precipitation maps highlight the opposite behaviour of the RC and R-all models: the former yields a much higher median than RegCM, whereas the latter is much closer, with a slightly lower value. Accordingly, the corresponding relative bias box plot shows a median very close to zero for the R-all case. For the p$99.9$ map statistics, both the RC and R-all models yield lower medians than RegCM, with the R-all value being the smaller. The median percentage of rainy hours is instead higher in both cases, again with the larger difference produced by the R-all model. For the average precipitation map, the spread in the estimates of the two model designs is comparable; in the p$99.9$ case the RC model exhibits a significantly larger spread, while for the rainy hours the R-all model exhibits the larger spread. When the same statistics are derived for northern Italy and central-south Italy (Figure A8), similar conclusions can be drawn. However, northern Italy exhibits much greater spreads, which may have degraded the statistics aggregated over the whole of Italy.

Figure 10. Box plots for RegCM (red), GNN4CD RC (green) and GNN4CD R-all (blue), derived for Italy (I) from the spatial maps of (a) average precipitation [mm/h], (b) p$99.9$ [mm/h] and (c) percentage of rainy hours [%]; the lower panels show the box plots for the relative bias maps of the same quantities.
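The whisker convention described above (most extreme point within 1.5 times the interquartile range) matches matplotlib’s default; a minimal sketch under that assumption, with stand-in data in place of the flattened spatial maps:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in flattened maps of average precipitation [mm/h]; in the
# paper these would be the per-grid-point values behind Figure 10a.
rng = np.random.default_rng(1)
maps = {
    "RegCM": rng.gamma(2.0, 0.05, 5000),
    "GNN4CD RC": rng.gamma(2.2, 0.05, 5000),
    "GNN4CD R-all": rng.gamma(1.9, 0.05, 5000),
}

fig, ax = plt.subplots()
# whis=1.5 draws whiskers to the most extreme points within
# 1.5 * IQR of the quartiles, as described in the text.
ax.boxplot(list(maps.values()), whis=1.5, showfliers=False)
ax.set_xticklabels(list(maps.keys()))
ax.set_ylabel("average precipitation [mm/h]")

# The relative bias panels would instead box-plot
# 100 * (model_map - regcm_map) / regcm_map for each model.
fig.savefig("boxplot_avg_precip.png", dpi=150)
```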
The observed performance in projecting the precipitation change signal across both spatial and temporal dimensions suggests that the proposed emulator is potentially capable of projecting precipitation changes associated with global warming, and that it may also possess a degree of spatial transferability in the RCM emulation setting.
7. Conclusions
In this work, we proposed GNN4CD, a combination of a new GNN-based DL model and an innovative training strategy that can be used for both reanalysis to observation downscaling and RCM emulation. Using low-resolution atmospheric variables as input, GNN4CD can efficiently derive precipitation estimates at high resolution. Unlike most climate emulators available in the literature (Addison et al., 2024; Baño-Medina et al., 2024; Doury et al., 2024), GNN4CD was trained on reanalysis and observations. Reanalysis data have the advantage of not suffering from the intrinsic biases of climate models, which can affect training when outputs from climate model simulations are used as predictors. This training strategy, which we refer to as a hybrid imperfect framework, should facilitate the emulator’s ability to generalise to climates and models unseen during training.
When used for reanalysis to observation downscaling, GNN4CD was able to reproduce the observed precipitation distribution and the extreme percentiles with relatively good accuracy. In this respect, we observed a trade-off between optimising the initial part and the tail of the distribution. The chosen loss configuration led to greater accuracy on extreme events at the expense of low precipitation values, which were often overestimated; we plan to investigate different loss functions to improve this aspect. Nevertheless, GNN4CD estimated the total precipitation quite well during the $10$ selected flood events. The sub-daily variability (diurnal cycles) was also well replicated in all seasons, with some overestimation in summer; despite this, the summer afternoon convective precipitation peak was well captured. The link between the overestimation in regions of complex topography and the limitations of the current observational data should be evaluated further and will be the subject of future studies.
When used for RCM emulation, with climate data as predictors, GNN4CD was evaluated on its ability to generate future precipitation projections and to emulate the downscaling function between typical RCM and CP-RCM resolutions. The results were remarkable, especially considering that the emulator was tested under the RCP8.5 scenario, a high-end emissions pathway that poses a particularly challenging setting for emulation. Despite this, GNN4CD successfully reproduced the shift of the precipitation distribution toward more frequent and intense precipitation events, demonstrating a promising ability to capture general trends even beyond the climate regimes where it was trained. Further research is needed to explore the potential added value of fine-tuning the trained model on climate simulation data prior to inference.
Moreover, GNN4CD proved capable of estimating precipitation over a spatial domain larger than the training area without significant degradation in performance. However, some regions exhibited systematic biases, which should be carefully examined in future evaluations. Further research will be dedicated to incorporating additional regions beyond Italy during both the training and inference phases, to help delineate the limits of the emulator’s spatial transferability. This matters because spatial transferability is a distinctive feature of GNN4CD, with the potential to extend the emulator’s application to remote and/or data-sparse regions of the world.
Of the two model designs examined (RC and R-all), neither demonstrated consistently superior performance: they produced generally comparable results, with instances where each outperformed the other. For future work, we plan to prioritise the development of the R-all model, given its reduced computational cost and the convenience of working with a single model.
While GNN4CD shows promising skill in emulating regional climate simulations and capturing key aspects of precipitation variability and change, its current implementation has limitations that require further investigation. In particular, the RCM emulation skill is evaluated using coarsened CP-RCM data, which inherently preserve some physical information from the high-resolution simulations. Additionally, potential uncertainties in the observational datasets and the fact that reanalysis inputs are confined to the historical record may limit the emulator’s generalisation to future climate conditions. These considerations highlight the importance of developing strategies, such as targeted fine-tuning or hybrid training schemes, that can enhance the robustness and generalisation capacity of GNN4CD in more realistic applications. Future work will address these challenges by extending the framework to more robustly support downscaling from GCM-based scenarios and by evaluating the emulator across multiple ensemble members from different RCMs, in order to assess its potential to enhance the CP-RCM ensemble. In addition, we plan to broaden the application of the emulator to climate variables beyond precipitation.
All these future research directions will contribute to further establishing the effectiveness and reliability of GNN4CD. In turn, this would make high-resolution ensembles of climate projections accessible at a fraction of the cost and time required by dynamical methods.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2025.10022.
Acknowledgements
The Authors are grateful to the two Reviewers for their constructive feedback and valuable suggestions, which have substantially contributed to improving the quality and clarity of the manuscript.
Author contribution
Conceptualization: E.C., G.S., L.B.; Methodology: V.B., E.C., V.A., S.D.G.; Software: V.B.; Visualisation: V.B.; Data curation: V.B., S.D.G.; Writing – original draft: V.B.; Writing – review and editing: V.B., E.C., G.S., V.A., S.D.G.; Supervision: E.C., G.S., L.B. All authors approved the final submitted draft.
Competing interests
The authors declare none.
Data availability statement
The code for the GNN4CD emulator is available at https://github.com/valebl/GNN4CD.
Ethics statement
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.
Funding statement
V.B. and E.C. acknowledge support from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101003469 (XAIDA). E.C. acknowledges support from the European Union’s Horizon Europe programme under Grant Agreement No. 101081555 (Impetus4Change). V.B. acknowledges co-funding from Assicurazioni Generali through grant D_S_ELIBER_GENERALI_SISSA_CoordCentroDS_0708. G.S. acknowledges co-funding from Next Generation EU, in the context of the National Recovery and Resilience Plan, Investment PE1 - Project FAIR “Future Artificial Intelligence Research”; this resource was co-financed by the Next Generation EU [DM 1555 del 11.10.22]. L.B. acknowledges that the study was partially carried out within the PNRR research activities of the consortium iNEST, funded by the European Union NextGenerationEU (PNRR, Missione 4 Componente 2, Investimento 1.5 - D.D. 1058 23/06/2022, ECS 00000043).
A. Appendix
A.1. Reanalysis to observation downscaling additional results

Figure A1. Same as Figure 5 but for the R-all model.

Figure A2. Same as Figure 6 but for the R-all model.

Figure A3. Seasonal results in the reanalysis to observation downscaling setting for the testing year $2016$ for the hourly average precipitation; (a) GRIPHO observational reference [mm/h], (b) GNN4CD RC percentage bias [%], (c) GNN4CD R-all percentage bias [%].

Figure A4. Same as Figure A3 but for p$99$.

Figure A5. Same as Figure A3 but for p$99.9$.

Figure A6. Seasonal results in the reanalysis to observation downscaling setting for the testing year $2016$ for the PDF of hourly precipitation [mm/h] with bin size of $0.5$ mm for Italy (I); comparison between GRIPHO and (a) GNN4CD RC, (b) GNN4CD R-all; the insets provide a magnified view of the tail of the distribution.
A.2. RCM emulation additional results

Figure A7. Same as Figure 9, but computing the change for the mid-century period.

Figure A8. Same as Figure 10 but for (a) northern Italy (N) and (b) central-south Italy (C-S).
A.3. On the influence of predictors on time series length
To the best of our knowledge, previous studies that served as key references for this work, such as Doury et al. (2022), Doury et al. (2023), van der Meer et al. (2023), Addison et al. (2024), and Hess et al. (2025), did not incorporate a time series of predictors as input for downscaling. However, these studies were not conducted at sub-daily temporal resolutions, as is the case in our work. We identified one related study (Schmidt et al., 2025) that employs a time series of predictors to generate sub-daily precipitation estimates. Nevertheless, their methodological framework differs substantially from ours, as they use a score-based diffusion model trained in a super-resolution setting, with coarse-resolution conditioning applied only at inference time.
Given the importance of capturing sub-daily variability, we believe that incorporating a time series of predictors can be particularly beneficial for hourly-scale modelling. To support this hypothesis, we performed an ablation study comparing the original GNN4CD configuration with three alternative setups (a sketch of the corresponding sequence truncation follows the list):
• GNN4CD $\left[t-24,\dots, t\right]$: the original model, using the full predictor sequence from $t-24$ to $t$
• GNN4CD $\left[t-12,\dots, t\right]$: the same model, but using a reduced sequence from $t-12$ to $t$
• GNN4CD $\left[t-6,\dots, t\right]$: the same model, but using a reduced sequence from $t-6$ to $t$
• GNN4CD $\left[t\right]$: a baseline using only time-$t$ predictors, without the recurrent component
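To make the four setups concrete, here is a minimal sketch of how the predictor time series could be truncated before entering the recurrent encoder. The tensor layout (time, nodes, features) and the helper name are hypothetical and not taken from the GNN4CD repository.

```python
import torch

def truncate_predictor_sequence(x: torch.Tensor, history_hours: int) -> torch.Tensor:
    """Keep the last history_hours + 1 steps, i.e. [t - history_hours, ..., t].

    x has shape (seq_len, n_nodes, n_features), with seq_len = 25
    covering [t-24, ..., t] in the original configuration.
    """
    if history_hours == 0:
        # GNN4CD [t]: current-time predictors only, no temporal context.
        return x[-1:].clone()
    return x[-(history_hours + 1):].clone()

x = torch.randn(25, 1000, 5)              # stand-in predictors on 1000 graph nodes
x12 = truncate_predictor_sequence(x, 12)  # GNN4CD [t-12, ..., t] -> 13 steps
x6 = truncate_predictor_sequence(x, 6)    # GNN4CD [t-6, ..., t]  ->  7 steps
x0 = truncate_predictor_sequence(x, 0)    # GNN4CD [t]            ->  1 step
```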
The evaluations are performed over the validation year $2007$ for the Italian peninsula, considering the R-all model configuration. The metrics considered in the comparison are the spatial maps of average, p$99$ and p$99.9$, the PDF, and the diurnal cycles. The results show that the full time series (GNN4CD $\left[t-24,\dots, t\right]$) provides the most accurate results, especially in the spatial maps (Figure A9) and PDF metrics (Figure A10a). While the performance of the truncated models does not degrade drastically, the version without temporal context (GNN4CD $\left[t\right]$) performs notably worse, particularly in the central-south Italy region, and fails to reproduce the diurnal cycles (Figure A10b-c-d). Given that the computational savings from using a shorter sequence were minimal (on the order of a few minutes per epoch), we retained the original configuration.
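For completeness, a brief sketch of how the seasonal diurnal-cycle diagnostics in Figure A10 can be computed from hourly data; the 0.1 mm/h wet-hour threshold and the variable names are illustrative assumptions rather than the authors’ exact choices.

```python
import numpy as np

def diurnal_cycle(pr, hours, wet_threshold=0.1):
    """Average [mm/h], wet-hour frequency [%] and wet-hour intensity
    [mm/h] for each hour of the day.

    pr:    1-D array of hourly precipitation [mm/h]
    hours: matching array of hour-of-day values (0-23)
    """
    avg, freq, inten = np.zeros(24), np.zeros(24), np.zeros(24)
    for h in range(24):
        vals = pr[hours == h]
        wet = vals[vals >= wet_threshold]
        avg[h] = vals.mean()
        freq[h] = 100.0 * wet.size / vals.size
        inten[h] = wet.mean() if wet.size else 0.0
    return avg, freq, inten

# Stand-in data for one ~92-day season.
rng = np.random.default_rng(2)
pr = rng.gamma(0.3, 1.5, size=24 * 92)
hours = np.tile(np.arange(24), 92)
avg, freq, inten = diurnal_cycle(pr, hours)
```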

Figure A9. Comparison between the four setups for the R-all model configuration, i.e., GNN4CD $\left[t-24,\dots, t\right]$, GNN4CD $\left[t-12,\dots, t\right]$, GNN4CD $\left[t-6,\dots, t\right]$ and GNN4CD $\left[t\right]$, in terms of relative percentage bias [%] with respect to GRIPHO, considering the validation year $2007$; (a) average, (b) p$99$ and (c) p$99.9$ spatial maps.

Figure A10. Comparison between the four setups for the R-all model configuration, i.e., GNN4CD $\left[t-24,\dots, t\right]$, GNN4CD $\left[t-12,\dots, t\right]$, GNN4CD $\left[t-6,\dots, t\right]$ and GNN4CD $\left[t\right]$, with respect to GRIPHO, considering the validation year $2007$ and Italy (I); (a) hourly precipitation PDF [mm/h] using a bin size of $0.5$ mm and (b) average [mm/h], (c) frequency [%], and (d) intensity [mm/h] seasonal diurnal cycles.