
Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework

Published online by Cambridge University Press:  13 October 2025

Valentina Blasone*
Affiliation:
Department of Mathematics, Informatics and Geosciences, University of Trieste , Trieste, Italy
Erika Coppola
Affiliation:
Earth System Physics, Abdus Salam International Centre for Theoretical Physics (ICTP) , Trieste, Italy
Guido Sanguinetti
Affiliation:
Theoretical and Scientific Data Science, Scuola Internazionale Superiore di Studi Avanzati (SISSA) , Trieste, Italy
Viplove Arora
Affiliation:
Theoretical and Scientific Data Science, Scuola Internazionale Superiore di Studi Avanzati (SISSA) , Trieste, Italy
Serafina Di Gioia
Affiliation:
Earth System Physics, Abdus Salam International Centre for Theoretical Physics (ICTP) , Trieste, Italy
Luca Bortolussi
Affiliation:
Department of Mathematics, Informatics and Geosciences, University of Trieste , Trieste, Italy
*
Corresponding author: Valentina Blasone; Email: valentina.blasone@phd.units.it

Abstract

Extreme precipitation events are projected to increase both in frequency and intensity due to climate change. High-resolution climate projections are essential to effectively model the convective phenomena responsible for severe precipitation and to plan any adaptation and mitigation action. Existing numerical methods struggle with either insufficient accuracy in capturing the evolution of convective dynamical systems, due to the low resolution, or are limited by the excessive computational demands required to achieve kilometre-scale resolution. To fill this gap, we propose a novel deep learning regional climate model (RCM) emulator called graph neural networks for climate downscaling (GNN4CD) to estimate high-resolution precipitation. The emulator is innovative in architecture and training strategy, using graph neural networks (GNNs) to learn the downscaling function through a novel hybrid imperfect framework. GNN4CD is initially trained to perform reanalysis to observation downscaling and then used for RCM emulation during the inference phase. The emulator is able to estimate precipitation at very high resolution both in space ($ 3 $km) and time ($ 1 $h), starting from lower-resolution atmospheric data ($ \sim 25 $km). Leveraging the flexibility of GNNs, we tested its spatial transferability in regions unseen during training. The model trained on northern Italy effectively reproduces the precipitation distribution, seasonal diurnal cycles, and spatial patterns of extreme percentiles across all of Italy. When used as an RCM emulator for the historical, mid-century, and end-of-century time slices, GNN4CD shows the remarkable ability to capture the shifts in precipitation distribution, especially in the tail, where changes are most pronounced.

Information

Type
Application Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open materials
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Impact Statement

High-resolution precipitation projections are fundamental for assessing the local impact of global warming and informing targeted mitigation and adaptation strategies. This work introduces a novel regional climate model (RCM) emulator that operates at the kilometre-scale with hourly temporal resolution, combining an innovative training framework with a new deep learning architecture based on graph neural networks (GNNs). Unlike existing approaches, the emulator is not specific to any single climate model and reduces biases by incorporating reanalysis and observational data during training. It demonstrates potential for spatial transferability, future climate extrapolation, and generalisability across different domains and scenarios.

1. Introduction

Every year across the world, natural catastrophes cause casualties and significant damage to property and assets, driven by the increasing frequency of weather-related extremes linked to global warming (IPCC, 2023). Precipitation-related events (floods, droughts, landslides) have a tremendous social and economic impact and are all projected to increase (Araújo et al., Reference Araújo, Ramos, Soares, Melo, Oliveira and Trigo2022; Pei et al., Reference Pei, Qiu, Yang, Liu, Ma, Li, Cao and Wufuer2023; Collins et al., Reference Collins, Beverley, Bracegirdle, Catto, McCrystall, Dittus, Freychet, Grist, Hegerl, Holland, Holmes, Josey, Joshi, Hawkins, Lo, Lord, Mitchell, Monerie, Priestley, Scaife, Screen, Senior, Sexton, Shuckburgh, Siegert, Simpson, Stephenson, Sutton, Thompson, Wilcox and Woollings2024). Disaster risk forecasting depends strongly on the ability to correctly quantify the hazard posed by the natural phenomenon, which is not straightforward for severe precipitation: its severity is largely driven by the evolution of convective systems, which are notoriously challenging to model due to complex land–atmosphere interactions, especially in regions of complicated topography. Severe precipitation is thus one of the most common and disastrous extreme events, yet also one of the most difficult to model.

In this context, climate projections are fundamental to address the challenges posed by climate change and develop the most appropriate strategies of mitigation and adaptation. Climate projections are derived from global climate models (GCMs), which simulate the global climate under specified greenhouse gas emissions scenarios. GCMs operate at global to sub-continental scale (typically $ 50 $ - $ 250 $ km), too coarse for investigating the climate change impact at the local scale, which requires a much finer resolution. Downscaling is used to bridge the gap between the two resolutions (Wilby, Reference Wilby2004; Giorgi and Gutowski, Reference Giorgi and Gutowski2015; Laflamme et al., Reference Laflamme, Linder and Pan2016). There are two types of downscaling: statistical and dynamical. Statistical downscaling works by deriving statistical relationships between observed small-scale and simulated GCM large-scale variables using analogue methods or regression analysis. Then, future projections from the GCM are statistically downscaled to estimate future projections of the small-scale phenomena. Statistical downscaling is cost-effective, but it depends on the availability of high-resolution observations, and many existing methodologies exist, each with its own limitations (Maraun et al., Reference Maraun, Widmann and Gutiérrez2019). Dynamical downscaling (or regional climate modelling) instead uses regional climate models (RCMs) to refine the output of GCMs into much more detailed local scales (Leung et al., Reference Leung, Mearns, Giorgi and Wilby2003). RCM resolutions typically range from $ 50 $ km down to $ 10 $ km and can reach the kilometre scale ( $ \le 3 $ km), at which point they are referred to as convection-permitting regional climate models (CP-RCMs) (Coppola et al., Reference Coppola, Sobolowski, Pichelli, Raffaele, Ahrens, Anders, Ban, Bastin, Belda, Belusic, Caldas-Alvarez, Cardoso, Davolio, Dobler, Fernandez, Fita, Fumiere, Giorgi, Goergen, Güttler, Halenka, Heinzeller, Hodnebrog, Jacob, Kartsios, Katragkou, Kendon, Khodayar, Kunstmann, Knist, Lavín-Gullón, Lind, Lorenz, Maraun, Marelle, van Meijgaard, Milovac, Myhre, H-J, Piazza, Raffa, Raub, Rockel, Schär, Sieck, Soares, Somot, Srnec, Stocchi, Tölle, Truhetz, Vautard, de Vries and Warrach-Sagi2020). Dynamical downscaling is more physically grounded, yet computationally expensive, especially when CP-RCMs are considered (Coppola et al., Reference Coppola, Sobolowski, Pichelli, Raffaele, Ahrens, Anders, Ban, Bastin, Belda, Belusic, Caldas-Alvarez, Cardoso, Davolio, Dobler, Fernandez, Fita, Fumiere, Giorgi, Goergen, Güttler, Halenka, Heinzeller, Hodnebrog, Jacob, Kartsios, Katragkou, Kendon, Khodayar, Kunstmann, Knist, Lavín-Gullón, Lind, Lorenz, Maraun, Marelle, van Meijgaard, Milovac, Myhre, H-J, Piazza, Raffa, Raub, Rockel, Schär, Sieck, Soares, Somot, Srnec, Stocchi, Tölle, Truhetz, Vautard, de Vries and Warrach-Sagi2020; Kendon et al., Reference Kendon, Prein, Senior and Stirling2021). CP-RCMs can explicitly model convective systems, but become prohibitively expensive when long climate projections are needed or many simulations are required to estimate the uncertainty of the climate projections.

These limitations of traditional downscaling approaches have recently led to an increasing interest in research at the intersection between climate science and machine learning (ML), to explore the potential added value that data-driven techniques can bring to the field. Rampal et al. (Reference Rampal, Hobeichi, Gibson, Baño-Medina, Abramowitz, Beucler, González-Abad, Chapman, Harder and Gutiérrez2024a) provided a clear classification of the possible ML applications in climate downscaling, and we will use their terminology throughout the manuscript. In this context, we define observational downscaling as the task where an ML algorithm is used to reproduce high-resolution observations. Perfect prognosis (PP) and super-resolution (SR) are the two main approaches and differ in the predictor fields used to infer the high-resolution field. The former uses large-scale reanalysis data, while the latter relies on coarse-resolution observational data. Instead, we define RCM emulation as the case where an ML algorithm is asked to reproduce the high-resolution output of an RCM, starting from either GCM or RCM simulations as predictive fields. To date, two alternative frameworks have been studied to train RCM emulators, based on the domain of the predictor fields: the perfect framework and the imperfect framework. The former uses coarsened RCM fields, while the latter relies on the GCM fields directly. Emulators trained in the perfect framework are easier to train, yet cannot fully capture the inconsistencies that may exist between the GCM and RCM. In the imperfect framework, instead, the emulator needs both to learn the downscaling function and to account for deviations between GCM and RCM, thus providing improved performance. However, training is more challenging and leads to model-dependent patterns, which may be more difficult to interpret and may lack physical consistency (Boé et al., Reference Boé, Mass and Deman2023; Baño-Medina et al., Reference Baño-Medina, Iturbide, Fernández and Gutiérrez2024). We refer the reader to Rampal et al. (Reference Rampal, Hobeichi, Gibson, Baño-Medina, Abramowitz, Beucler, González-Abad, Chapman, Harder and Gutiérrez2024a) and van der Meer et al. (Reference van der Meer, de Roda Husman and Lhermitte2023) for a more in-depth analysis of these two frameworks.

Regarding the deep learning (DL) architecture, convolutional neural networks (CNNs) and generative models have been the preferred choice. Among others, Doury et al. (Reference Doury, Somot, Gadat, Ribes and Corre2022) used a CNN-based UNet architecture to emulate the downscaling of daily near-surface temperature as well as daily precipitation (Doury et al., Reference Doury, Somot and Gadat2024). Rampal et al. (Reference Rampal, Gibson, Sherwood, Abramowitz and Hobeichi2024b) used a conditional generative adversarial network (cGAN) to downscale daily precipitation. Addison et al. (Reference Addison, Kendon, Ravuri, Aitchison and Watson2024) proposed a diffusion-based DL model to estimate daily precipitation at high resolution.

Recently, some studies have used reanalysis data to downscale Earth system model (ESM) simulations through the SR framework. For instance, Hess et al. (Reference Hess, Aich, Pan and Boers2025) used a consistency model (CM) to downscale precipitation, and Schmidt et al. (Reference Schmidt, Schmidt, Strnad, Ludwig and Hennig2025) used a score-based diffusion model to downscale multiple variables such as near-surface wind speeds, surface air temperature, and sea level pressure. In both cases, the generative model is trained solely on the high-resolution reanalysis fields, which it learns to reproduce. During inference, the corresponding coarse-resolution fields from ESM simulations are used as conditioning input.

In this work, we present a novel RCM emulator for high-resolution precipitation projections, called GNN4CD (graph neural networks for climate downscaling; see Figure 1), which integrates an innovative training approach and a new graph neural network (GNN)-based model. Instead of training directly on climate model outputs, we adopt a hybrid imperfect framework: we train the model to perform reanalysis to observation downscaling (Figure 1a), and we show that it can subsequently function effectively as an RCM emulator to downscale future projections. Using reanalysis predictor fields during training helps reduce biases and leads to an emulator that is not specific to any single climate model, unlike existing approaches. We also propose a novel DL model for the emulator, based on GNNs (Battaglia et al., Reference Battaglia, Hamrick, Bapst, Sanchez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, Gulcehre, Song, Ballard, Gilmer, Dahl, Vaswani, Allen, Nash, Langston, Dyer, Heess, Wierstra, Kohli, Botvinick, Vinyals, Li and Pascanu2018; Schlichtkrull et al., Reference Schlichtkrull, Kipf, Bloem, Van Den Berg, Titov and Welling2018; Sanchez-Lengeling et al., Reference Sanchez-Lengeling, Reif, Pearce and Wiltschko2021). GNNs can capture complex relationships and dependencies within data, making them well-suited for a range of applications, and they have become increasingly relevant in climate-related studies (Lam et al., Reference Lam, Sanchez-Gonzalez, Willson, Wirnsberger, Fortunato, Alet, Ravuri, Ewalds, Eaton-Rosen, Hu, Merose, Hoyer, Holland, Vinyals, Stott, Pritzel, Mohamed and Battaglia2023; Price et al., Reference Price, Sanchez-Gonzalez, Alet, Andersson, El-Kadi, Masters, Ewalds, Stott, Mohamed, Battaglia, Lam and Willson2025). In this work, we adopt GNNs to address the challenges posed by irregular grids and non-rectangular domains typical of climate data (e.g., land-only regions), where data are often undefined over large areas (e.g., the sea). In such settings, CNNs require interpolation and padding, which can introduce biases and waste computation on irrelevant regions. GNNs naturally handle these irregular structures, offering a flexible and computationally efficient alternative. Unlike standard CNNs, which require identical image sizes to work properly, GNNs intrinsically support graphs with a variable number of nodes and edges. This key characteristic leads to a flexible model that can easily be tested for domain transferability, i.e., estimating the target variable across geographical areas distinct from those encountered during training. To the best of our knowledge, this is the first application of GNNs to RCM emulation, providing a novel approach that supports generalisation across grids and improves adaptability to real-world climate data.

Figure 1. The hybrid imperfect framework, applied to the GNN4CD emulator. Scheme of (a) training: reanalysis to observation downscaling, (b) inference: reanalysis to observation downscaling, (c) inference: RCM emulation.

Our emulator is built to operate at high spatial and temporal resolution, comparable to CP-RCM simulations. Working at the convection-permitting scale is crucial for capturing the dynamics of severe precipitation events, including localised thunderstorms and extreme rainfall associated with convective systems (Luu et al., Reference Luu, Vautard, Yiou and Soubeyroux2022). The choice of using hourly data is driven by the need to capture precipitation extremes localised in space and time, which can cause floods in a short time window. Daily observational datasets smooth out these short-duration events, whereas hourly precipitation data can capture these types of extremes, which are essential for flood hazard studies (Fantini, Reference Fantini2019).

We evaluate the performance of the GNN4CD emulator in two different settings: reanalysis to observation downscaling (Figure 1b) and RCM emulation (Figure 1c). In the former, reanalysis data are used as predictors with observations as ground truth (like in the training), allowing for a first thorough assessment of the model reanalysis to observation downscaling capabilities. In the latter, predictor data come from RCM simulations, and the GNN4CD model is properly used as an emulator to derive historical and future projections at the CP-RCM scale. In both cases, the model is evaluated in a geographical area larger than the region used for training.

2. Data

In this study, we refer to the atmospheric predictor data as “low-resolution,” relative to the higher-resolution target dataset, reflecting their role as the coarser input in the downscaling framework rather than an absolute classification of resolution.

Five atmospheric variables are used as low-resolution predictors, at five pressure levels (Table 1), each reported on a grid of $ 0.25{}^{\circ} $ degrees of longitude-latitude ( $ \sim 25 $ km for Europe) with hourly temporal resolution. As training input data, these variables are taken from the ERA5 reanalysis dataset (Hersbach et al., Reference Hersbach, Bell, Berrisford, Hirahara, Horányi, Muñoz-Sabater, Nicolas, Peubey, Radu, Schepers, Simmons, Soci, Abdalla, Abellan, Balsamo, Bechtold, Biavati, Bidlot, Bonavita, De Chiara, Dahlgren, Dee, Diamantakis, Dragani, Flemming, Forbes, Fuentes, Geer, Haimberger, Healy, Hogan, Hólm, Janisková, Keeley, Laloyaux, Lopez, Lupu, Radnoti, de Rosnay, Rozum, Vamborg, Villaume and Thépaut2020) from the European Centre for Medium-Range Weather Forecasts (ECMWF). In the inference phase, either ERA5 or RCM data are used as predictors. RCM data come from a CP simulation (Pichelli et al., Reference Pichelli, Coppola, Sobolowski, Ban, Giorgi, Stocchi, Alias, Belušić, Berthou, Caillaud, Cardoso, Chan, Christensen, Dobler, de Vries, Goergen, Kendon, Keuler, Lenderink, Lorenz, Mishra, H-J, Schär, Soares, Truhetz and Vergara-Temprado2021) with the RegCM model (Coppola et al., Reference Coppola, Stocchi, Pichelli, Alavez, Glazer, Giuliani, Di Sante, Nogherotto and Giorgi2021), upscaled to the training resolution of $ 25 $ km.

Table 1. Variables used as predictors (P) and target (T), each reported with its symbol, unit, pressure levels, space and time resolutions

A remapping of the global multi-resolution terrain elevation data (Danielson and Gesch, Reference Danielson and Gesch2011) to a grid of $ 3 $ km is also used as a predictor. Additionally, the land-use information is included on the same high-resolution grid. These data are taken from the Community Land Model (CLM) version $ 4.5 $ (Oleson et al., Reference Oleson, Lawrence, Bonan, Drewniak, Huang, Koven, Levis, Li, Riley, Subin, Swenson, Thornton, Bozbiyik, Fisher, Heald, Kluzek, Lamarque, Lawrence, Leung, Lipscomb, Muszala, Ricciuto, Sacks, Sun, Tang and Yang2013; Thiery et al., Reference Thiery, Davin, Lawrence, Hirsch, Hauser and Seneviratne2017). Elevation and land-use are the only predictors available at high resolution, yet they are static, i.e., a single value per spatial location, independent of time.

The GRidded Italian Precipitation Hourly Observations (GRIPHO) dataset, a high-resolution hourly precipitation dataset for Italy (Fantini, Reference Fantini2019), is used as the target to train the DL model. Originally developed as input to hydrological models and to validate RCM simulations, GRIPHO was created exclusively from in-situ precipitation station data. After a quality check of the station data time series, the data were re-gridded on a Lambert Conformal Conic grid, which is neither orthogonal nor regular in longitude-latitude coordinates. The choice of a curvilinear grid was primarily informed by the average station density ( $ \sim 10 $ km) and offers methodological advantages over a regular latitude-longitude grid, as it ensures uniform grid cell areas and facilitates subsequent computational analyses. GRIPHO currently represents the only high-resolution, station-based precipitation dataset available for the entire Italian peninsula, covering the period from 2001 to 2016. We used the GRIPHO dataset at a $ 3 $ km spatial resolution, consistent with Pichelli et al. (Reference Pichelli, Coppola, Sobolowski, Ban, Giorgi, Stocchi, Alias, Belušić, Berthou, Caillaud, Cardoso, Chan, Christensen, Dobler, de Vries, Goergen, Kendon, Keuler, Lenderink, Lorenz, Mishra, H-J, Schär, Soares, Truhetz and Vergara-Temprado2021).

All low- and high-resolution predictor variables are normalised to zero-mean unit-variance, which is common practice in ML to ensure comparable feature contributions and to improve numerical stability. Normalisation statistics are computed on the training set and then applied to both training and inference data. The target dataset is preprocessed to comply with the instrument sensitivity, i.e., all values strictly less than $ 0.1 $ mm are set to zero, and then the dataset is rounded to one decimal place. The target data are then transformed using $ \log \left(1+y\right) $ . The logarithmic transformation compresses the range of target values, improving model stability, which is particularly useful in the case of highly skewed data.
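
To make the preprocessing concrete, the following is a minimal sketch assuming the predictors and the GRIPHO target are available as NumPy arrays; array layouts and function names are illustrative and not the actual GNN4CD code.

```python
import numpy as np

def preprocess_target(y_mm, threshold=0.1):
    """Target preprocessing as described above: values below the instrument
    sensitivity become zero, amounts are rounded to one decimal place, and
    the log(1 + y) transform compresses the skewed range."""
    y = np.where(y_mm < threshold, 0.0, y_mm)
    y = np.round(y, 1)
    return np.log1p(y)

def normalise_predictors(x_train, x_infer):
    """Zero-mean unit-variance normalisation; statistics are computed on the
    training set only and then applied to both training and inference data."""
    mean = x_train.mean(axis=0, keepdims=True)
    std = x_train.std(axis=0, keepdims=True)
    return (x_train - mean) / std, (x_infer - mean) / std
```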

3. Hybrid imperfect framework

We propose the hybrid imperfect framework as a third alternative to the now-established perfect and imperfect frameworks to train RCM emulators. Here, hybrid refers to the use of different types of predictor data during training and inference: reanalysis and observations during training, and climate model outputs during inference. The term imperfect reflects the conceptual similarities to the imperfect framework, especially the use of two different source systems for input and target during training, which can show biases, spatial misalignments, or different error structures that the model needs to cope with. Reanalyses are typically better aligned with observations (in space, time and dynamics) than a GCM is with an RCM, but still, the relationship is far from perfect. During inference, either GCM or RCM predictors can potentially be used as input to the emulator trained in the hybrid imperfect framework. As a starting point, this study examines the scenario in which RCM predictors are employed.

We found it valuable to explore a third method to train the emulator, as both existing frameworks have several practical limitations. For example, both frameworks still require long future simulations of the RCM (or CP-RCM) to serve as target data to train the RCM emulator, thus not resolving the prohibitive cost issue of dynamical downscaling. In contrast, we show that our model learns effectively from a limited amount of available observational data, making the construction of the emulator entirely cost-effective. Moreover, the use of reanalysis data helps to mitigate biases and uncertainties that may be inherent in climate model simulations. Reanalysis assimilates a wide range of observational data, providing a more accurate representation of present-day climate conditions. Thus, the emulator trained with the hybrid imperfect framework is expected to develop a broader foundation in atmospheric dynamics, with the potential to intrinsically learn effective domain adaptation skills.

While the hybrid imperfect framework offers promising advantages in bias correction and generalisation across different domains, it also presents possible limitations that should be carefully addressed. First, the domain mismatch between training and inference predictors introduces a risk of distributional shift, where the emulator may encounter patterns never seen during training. Second, observational datasets used as targets may contain their own uncertainties, inconsistencies, or sparse coverage. Finally, reanalysis data are limited to the present day; thus, the model can only learn from historical climate. In this study, we show both strengths and limitations in using the hybrid imperfect framework in the current setup. However, we believe that a phase of fine-tuning post-training but prior to inference could mitigate these potential limitations and could be a cost-effective way of improving the emulator’s ability to generalise to future scenarios and different RCM simulations. This may be particularly helpful when GCM predictors are employed and will be the subject of future studies.

4. Deep learning model

The GNN4CD emulator uses a new DL model, specifically designed for this task. The architecture relies on GNNs to model the downscaling from the coarse grid of atmospheric predictors to the fine grid of precipitation data and to account for spatial interactions between the high-resolution locations. To this aim, the given spatial grids need to be converted into graphs, i.e., mathematical objects with nodes and edges. The definition of the graphs and architecture is described in detail in the following sections.

4.1. Graph conceptualization

Each point within the low-resolution and high-resolution grids corresponds to a specific geographical location, suggesting that they can be naturally modelled as nodes. For convenience, we model both grids as a unified heterogeneous graph featuring two node types and two edge types (a minimal construction sketch in code is given after the list):

  • Low nodes: first set of nodes, generated from the points on the low-resolution grid with a spatial resolution of approximately $ 25 $ km.

  • High nodes: second set of nodes, created from the points on the high-resolution grid with a spatial resolution of $ 3 $ km.

  • Low-to-High edges: unidirectional edges, which connect Low to High nodes, ensuring each High node is linked to a fixed number $ k $ of Low nodes. These edges model the downscaling of atmospheric variable information from the Low nodes to the High nodes (Figure 2a).

    Figure 2. Graph conceptualisation: Low nodes (blue dots) and High nodes (red dots) and close-up of (a) Low-to-High unidirectional edges (orange), connecting Low nodes to High nodes, (b) High-within-High bidirectional edges (red), linking High nodes.

  • High-within-High edges: bidirectional edges that capture relationships between High nodes based on an 8-neighbours approach, ensuring each node is connected to its eight nearest neighbours (Figure 2b).
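
A minimal construction sketch of this heterogeneous graph is given below, using PyTorch Geometric's HeteroData container. The node and edge type names, the brute-force neighbour search, and the coordinate format are illustrative assumptions rather than the published implementation.

```python
import torch
from torch_geometric.data import HeteroData

def build_graph(low_pos, high_pos, k=9, n_neigh=8):
    """Build the heterogeneous graph: Low and High node sets, unidirectional
    Low-to-High edges (k nearest Low nodes per High node) and bidirectional
    High-within-High edges (8 nearest High neighbours). Positions are [N, 2]
    tensors of projected coordinates. Brute-force distances are used here for
    clarity; a spatial index would be preferable for the full Italian grid."""
    data = HeteroData()
    data['low'].pos, data['high'].pos = low_pos, high_pos

    # Low-to-High: each High node receives information from its k nearest Low nodes.
    d_lh = torch.cdist(high_pos, low_pos)                    # [n_high, n_low]
    low_idx = d_lh.topk(k, largest=False).indices            # [n_high, k]
    high_idx = torch.arange(high_pos.size(0)).repeat_interleave(k)
    data['low', 'to', 'high'].edge_index = torch.stack([low_idx.reshape(-1), high_idx])

    # High-within-High: connect each High node to its nearest neighbours, both directions.
    d_hh = torch.cdist(high_pos, high_pos)
    d_hh.fill_diagonal_(float('inf'))                        # exclude self-loops
    dst = d_hh.topk(n_neigh, largest=False).indices.reshape(-1)
    src = torch.arange(high_pos.size(0)).repeat_interleave(n_neigh)
    edge_index = torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])
    data['high', 'within', 'high'].edge_index = torch.unique(edge_index, dim=1)
    return data
```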

4.2. Model design

Precipitation data contain a significant amount of zeros, as rain events only occur during a limited number of hours. In the case of the GRIPHO dataset, almost $ 90\% $ of the values fall below the meteorological threshold assumed as $ 0.1 $ mm, effectively rendering them as zeros. In view of this, we explored two different designs for the DL model. In both cases, the model outputs an estimate for the time step $ t $ based on a time series of predictors spanning $ \left[t-L,\dots, t\right] $ , where $ L $ is a hyper-parameter.

In the first approach, we addressed the challenge posed by zero precipitation values by adopting a Hurdle modelling scheme (Cragg, Reference Cragg1971). The method relies on the construction of two distinct models, which are subsequently combined: a Regressor and a Classifier. The Classifier is trained on the entire dataset and distinguishes between two classes: $ 0 $ , i.e., precipitation below the threshold, and $ 1 $ , i.e., precipitation above the threshold. Conversely, the Regressor is exclusively trained on targets where precipitation values exceed the threshold, and provides a quantitative estimation of hourly precipitation. During the inference phase, predictions from the Regressor and Classifier models are computed independently and then multiplied to yield a single estimate of the precipitation value. We refer to this model design as RC (Figure 3a). In addition, we considered a second approach where a single Regressor is used. This model is trained using the full GRIPHO dataset as target, i.e., including instances with zero precipitation, and outputs a quantitative estimation of hourly precipitation. We refer to this model design as R-all (Figure 3b). To enhance training efficiency and focus on meaningful targets, time steps containing only values below the threshold were excluded. This resulted in a reduction of approximately $ 50\% $ in the training set size for the RC Regressor, whereas the impact was negligible for the RC Classifier and the R-all Regressor, with $ 99.9\% $ of time steps retained in both cases. In both the RC and R-all approaches, the models are trained on the complete graph, while the loss is computed exclusively on nodes with valid target values by applying a masking strategy.

Figure 3. Schematic views of (a) RC, designed as a combination of Regressor and Classifier components, (b) R-all, consisting of a single Regressor, (c) the architecture, composed of four modules: an RNN-based pre-processor, a GNN-based downscaler, a GNN-based processor, and an FCNN-based predictor.

The two alternative designs offer valuable insights due to their differing model complexities and problem formulations. The RC case adopts a two-model structure, separating the classification of precipitation occurrence from the regression of its intensity. This allows for targeted learning but increases the number of parameters and model complexity. Moreover, the classification task remains particularly imbalanced and difficult to solve, and even the data restricted to values above the threshold remain very skewed. In contrast, the R-all case adopts a single model, resulting in a simpler architecture with fewer parameters but requiring the model to handle both occurrence and intensity prediction within a single task, thus increasing the problem complexity. In the experiments, we will analyse both model designs to see whether the added value justifies the cost of using two models instead of one.
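
As an illustration of the RC combination described above, the sketch below gates the Regressor's estimate with the Classifier's 0/1 decision at each High node. The sigmoid/threshold step and the back-transform from $ \log(1+y) $ space are assumptions about implementation details not spelled out in the text.

```python
import torch

def rc_inference(regressor, classifier, graph_batch, threshold=0.5):
    """Hurdle-style combination of the RC design: the Classifier's 0/1 decision
    multiplies the Regressor's quantitative estimate at every High node."""
    with torch.no_grad():
        log_intensity = regressor(graph_batch)               # estimate in log(1 + y) space
        rain_prob = torch.sigmoid(classifier(graph_batch))   # probability of class 1 (wet)
        rain_flag = (rain_prob >= threshold).float()         # hard 0/1 classification
        return torch.expm1(log_intensity) * rain_flag        # gated precipitation [mm/h]
```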

4.3. Architecture

Regressor and Classifier components share the same structure, which consists of four primary modules (see Figure 3c). The first is a pre-processor module, which acts at the Low-node level and handles the predictors’ temporal component through a recurrent neural network (RNN), specifically a gated recurrent unit (GRU). The RNN encoder captures the temporal dependencies across time steps, outputting a sequence that is flattened and passed through a fully connected layer to produce a fixed-dimensional latent representation. This encoding serves as input to the graph-based components of the model. The next stage involves the downscaler module, which uses a graph convolution (GC) layer to map the preprocessed atmospheric variables, represented as Low node features, to learned attributes on the High nodes. This transformation is crucial for bridging the different spatial scales within the input and output data. The downscaler incorporates additional high-resolution spatial attributes (elevation and land use data), ensuring that the model is well-informed about local geographical features. Following the downscaling step, the core of the model is a multi-layer processor network comprising several graph attention layers (GAT, Veličković et al., Reference Veličković, Cucurull, Casanova, Romero, Lio and Bengio2018). These layers are designed to dynamically attend to neighbouring nodes, thereby capturing complex spatial relationships. Each GAT layer is followed by batch normalisation and ReLU activations to stabilise training and introduce non-linearity, respectively. The use of multiple GAT layers allows the model to progressively refine its understanding of spatial dependencies, essential for accurately predicting local precipitation. The final component of the architecture is the predictor, a fully connected neural network (FCNN) that takes the processed graph features and returns the desired output on each High node. The model is entirely implemented in PyTorch (Paszke et al., Reference Paszke, Gross, Massa, Lerer, Bradbury, Chanan, Killeen, Lin, Gimelshein, Antiga, Desmaison, Kopf, Yang, DeVito, Raison, Tejani, Chilamkurthy, Steiner, Fang, Bai and Chintala2019) and PyTorch Geometric (Fey and Lenssen, Reference Fey and Lenssen2019), using GraphConv (Morris et al., Reference Morris, Ritzert, Fey, Hamilton, Lenssen, Rattan and Grohe2021) and GATv2Conv (Brody et al., Reference Brody, Alon and Yahav2022) as GC and GAT layers, respectively.

The downscaling task requires the specification of the spatial and temporal length scales at which predictors can influence outputs. While this represents an assumption, it is grounded in the underlying physics of the problem and refined by empirical evidence. From a spatial point of view, we assume that the relevant region extends to approximately $ 50 $ km around the target location. From a time perspective, we assume that the precipitation prediction at time $ t $ is influenced by the atmospheric variables up to $ 24 $ hours before. These ranges are explicitly considered in designing the graph and the emulator architecture. The spatial assumption is realised by choosing $ k=9 $ as the number of nearest Low nodes to which each High node is connected (see Subsection 4.1), effectively encoding the downscaling relationship in the graph edges. This assumption dictates the graph structure, but has no direct influence on the GNN parameters, as GNNs accept graphs with an arbitrary number of edges. The temporal assumption is instead reflected in choosing $ L=24 $ (see Subsection 4.2), thus designing the predictors’ time series as $ \left[t-24,\dots, t\right] $ , encompassing a total of $ 25 $ time steps. This assumption sets the length of the time series of predictors that are passed as input to the RNN module, thus defining the RNN sequence length parameter.
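
The sketch below illustrates how the four modules could be wired together in PyTorch and PyTorch Geometric, using the GraphConv and GATv2Conv layers named above and the $ L=24 $ choice (a sequence of $ 25 $ time steps). Layer widths, the number of GAT layers, and the exact way static High-node features enter the downscaler are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GraphConv, GATv2Conv

class GNN4CDSketch(nn.Module):
    """Illustrative reconstruction of the four-module architecture."""
    def __init__(self, n_vars=25, seq_len=25, rnn_hidden=64, latent=128,
                 static_dim=2, gat_layers=3):
        super().__init__()
        # 1) RNN-based pre-processor on Low nodes (GRU over the predictor time series).
        self.gru = nn.GRU(n_vars, rnn_hidden, batch_first=True)
        self.to_latent = nn.Linear(seq_len * rnn_hidden, latent)
        # 2) GNN-based downscaler: bipartite graph convolution from Low to High nodes,
        #    with static high-resolution attributes (elevation, land use) as High features.
        self.downscaler = GraphConv((latent, static_dim), latent)
        # 3) GNN-based processor: stacked graph attention layers on the High grid.
        self.processor = nn.ModuleList(GATv2Conv(latent, latent) for _ in range(gat_layers))
        self.norms = nn.ModuleList(nn.BatchNorm1d(latent) for _ in range(gat_layers))
        # 4) FCNN-based predictor returning one value per High node.
        self.predictor = nn.Sequential(nn.Linear(latent, latent), nn.ReLU(), nn.Linear(latent, 1))

    def forward(self, x_low_seq, x_high_static, edge_lh, edge_hh):
        # x_low_seq: [n_low, seq_len, n_vars]; x_high_static: [n_high, static_dim]
        h, _ = self.gru(x_low_seq)                            # [n_low, seq_len, rnn_hidden]
        h_low = self.to_latent(h.reshape(h.size(0), -1))      # flatten time, project to latent
        h_high = self.downscaler((h_low, x_high_static), edge_lh)
        for gat, norm in zip(self.processor, self.norms):
            h_high = torch.relu(norm(gat(h_high, edge_hh)))   # GAT + batch norm + ReLU
        return self.predictor(h_high).squeeze(-1)             # one estimate per High node
```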

4.4. Regressor loss function

Mean square error (MSE) is the standard loss function for regression problems, yet it becomes less effective when the target data is highly imbalanced or skewed, as in the case of precipitation data. MSE leads to estimates that are biased towards frequent values, which is detrimental for modelling rare events, thus we chose to use a modified MSE loss function which can explicitly address imbalance and skewness. Multiple studies in the literature use weighted MSE loss for training in such unfavourable conditions (Wang et al., Reference Wang, Wang, Wang, Xue and Wang2022; Scheepens et al., Reference Scheepens, Schicker, Hlaváčková-Schindler and Plant2023). The most sensitive part is in the definition of an optimal weighting strategy, which should be consistent with the target data distribution and training objectives, e.g., giving more weight to the tail of the distribution, in order to estimate the extreme events. Recently, Szwarcman et al. (Reference Szwarcman, Guevara, Macedo, Zadrozny, Watson, Rosa and Oliveira2024) proposed a formulation to quantise the reconstruction loss and improve the synthesis of extreme weather data with variational autoencoders (VAEs). This modification of the MSE loss was designed to address the skewed distribution typical of weather data by giving more weight to rare, extreme values. The idea is to penalise the loss according to the observed values frequency, by quantising the target data and averaging losses for each bin. In this study, we adopt the same loss to train the Regressor, and we call it quantised MSE (QMSE) loss (Equation 4.1). The only parameters of QMSE are the bins into which the target data are quantised. During training, examples are dynamically assigned to the corresponding bins based on their true target values, ensuring that the weights in the QMSE loss reflect the actual value distribution within each batch. This approach is beneficial because batches are formed by randomly selecting time instants, and each batch includes all points in the graph for the chosen time instants. Consequently, the target distribution can vary slightly between batches.

(4.1) $$ \mathrm{QMSE}=\sum \limits_{j=1}^{B}\frac{1}{\mid {\Omega}_j\mid}\sum \limits_{i\in {\Omega}_j}{\left({y}_i-{\hat{y}}_i\right)}^2 $$

In Equation 4.1, $ j $ represents the bin index, defined based on a histogram of the training data, and $ {\Omega}_j $ is the set of target indices whose values fall within bin $ j $ ; thus $ \mid {\Omega}_j\mid $ is the observed frequency of bin $ j $ and $ 1/\mid {\Omega}_j\mid $ weights the loss inversely to that frequency. The quantities $ {y}_i $ and $ {\hat{y}}_i $ are the true and estimated target values, respectively. Finally, in the training, we used a combined MSE-QMSE loss (Equation 4.2) with a coefficient $ \overline{\alpha} $ which accounts for the different scales of the two terms and balances their contributions.

(4.2) $$ L=\mathrm{MSE}+\overline{\alpha}\cdot \mathrm{QMSE}\hskip0.1em $$
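
A minimal PyTorch sketch of Equations 4.1 and 4.2 is given below. The bin edges are assumed to be a one-dimensional tensor (see Section 4.6 for the logarithmically equispaced choice), and the default $ \overline{\alpha} $ is the RC Regressor value; batching and masking details are omitted.

```python
import torch

def qmse_loss(y_pred, y_true, bin_edges):
    """Quantised MSE (Equation 4.1): squared errors are averaged within each
    occupied bin of the true-target histogram and the per-bin means are summed,
    so rare (extreme) values are not drowned out by the frequent low values."""
    sq_err = (y_true - y_pred) ** 2
    bin_idx = torch.bucketize(y_true, bin_edges)
    loss = y_true.new_zeros(())
    for j in bin_idx.unique():
        mask = bin_idx == j
        loss = loss + sq_err[mask].mean()   # (1/|Omega_j|) * sum_{i in Omega_j} (y_i - y_hat_i)^2
    return loss

def combined_loss(y_pred, y_true, bin_edges, alpha_bar=0.025):
    """Combined loss of Equation 4.2; alpha_bar balances the MSE and QMSE terms."""
    mse = torch.mean((y_true - y_pred) ** 2)
    return mse + alpha_bar * qmse_loss(y_pred, y_true, bin_edges)
```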

4.5. Classifier loss function

The Classifier is trained using focal loss (FL) (Lin et al., Reference Lin, Goyal, Girshick, He and Dollár2020), specifically designed to address class imbalance during training. In this setting, standard cross entropy (CE) loss tends to focus on minimising errors for the majority class, often leading to poor performance on the minority class (Johnson and Khoshgoftaar, Reference Johnson and Khoshgoftaar2019). FL introduces a modulating term $ \gamma $ in the CE formulation to dynamically scale the CE loss. Thanks to this scaling factor, the contribution of easy examples is down-weighted and the model is guided to focus on hard examples. Additionally, a hyper-parameter $ \alpha $ helps to handle the class imbalance. The formulation of FL is given in Equation 4.3, where $ p $ is a function of the Classifier output, i.e., it depends on the input data, and $ y\in \left\{0,1\right\} $ is the ground-truth class.

(4.3) $$ {\displaystyle \begin{array}{c} FL\left({p}_t\right)=-{\alpha}_t{\left(1-{p}_t\right)}^{\gamma}\cdot \log \left({p}_t\right)\\ {}{p}_t=\left\{\begin{array}{ll}p& \mathrm{if}\;y=1\\ {}1-p& \mathrm{otherwise}\end{array}\right.\hskip2em {\alpha}_t=\left\{\begin{array}{ll}\alpha & \mathrm{if}\;y=1\\ {}1-\alpha & \mathrm{otherwise}\end{array}\right.\end{array}} $$
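
A direct PyTorch transcription of Equation 4.3 is sketched below; the default $ \alpha $ and $ \gamma $ are the values found to work best in this study (Section 4.6), and the mean reduction over the batch is an assumption.

```python
import torch

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Binary focal loss of Equation 4.3; p is the predicted probability of the
    positive (wet) class and y the 0/1 ground-truth class."""
    p_t = torch.where(y == 1, p, 1.0 - p)
    alpha_t = torch.where(y == 1, torch.full_like(p, alpha), torch.full_like(p, 1.0 - alpha))
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-8))).mean()
```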

4.6. Training and evaluation

As mentioned in Section 2, the GRIPHO observational data are available for a period of $ 16 $ years, with spatial coverage of the entire Italian territory. To evaluate the emulator’s capability of generalisation in spatial domains not used during training, we chose to train only on northern Italy and use the whole peninsula for the inference phase (see Figure 4a). In this setting, the geographical area considered for training is approximately $ \mathrm{120,000} $ km $ {}^2 $ with around $ 400 $ Low nodes and $ \mathrm{14,000} $ High nodes. The evaluation area, instead, is approximately $ \mathrm{300,000} $ km $ {}^2 $ , with around $ 1000 $ Low nodes and $ \mathrm{33,000} $ High nodes. The time range spanned by the GRIPHO dataset is rather limited, as is usually the case for high-resolution observational data, which are difficult to obtain. We decided to use the first $ 15 $ years for training ( $ 2001 $ - $ 2006 $ and $ 2008 $ - $ 2015 $ ) and validation ( $ 2007 $ ) and leave the last available year ( $ 2016 $ ) to test the GNN4CD model in the reanalysis to observation downscaling task. Considering the number of time instants and High nodes, we obtain an approximate $ 75/12.5/12.5 $ split for the train, validation, and test datasets. In the RCM emulation setting, we considered the area of the Italian peninsula covered by RegCM and three different time slices of the RegCM simulations: historical ( $ 1996 $ - $ 2005 $ ), mid-century ( $ 2041 $ - $ 2049 $ ) and end-of-century ( $ 2090 $ - $ 2099 $ ). All projections were performed under the RCP8.5 scenario (Riahi et al., Reference Riahi, Rao, Krey, Cho, Chirkov, Fischer, Kindermann, Nakicenovic and Rafaj2011; IPCC, 2014), which represents a high-emissions pathway associated with the most pronounced climate change signal. This choice increases the difficulty of the emulation task, particularly for capturing changes in the distribution of extreme precipitation events.

Figure 4. (a) training (northern Italy) and inference (entire Italy) areas, (b) locations of original stations used to create the GRIPHO dataset, and (c) percentage of valid time steps for each station.

The RC Regressor and Classifier components and the R-all Regressor component were all trained separately for $ 50 $ epochs, i.e., approximately $ 24 $ hours each on $ 4\times $ NVIDIA Ampere GPUs on Leonardo, the new pre-exascale Tier-0 EuroHPC supercomputer, currently hosted by CINECA in Bologna, Italy (Turisini et al., Reference Turisini, Amati and Cestari2023). The trained emulator needs only a few minutes to compute the hourly precipitation estimates for an entire year over Italy, much less than the dynamical downscaling of a CP-RCM, which needs approximately a couple of days on an equivalent high-resolution grid.

We used the validation year to empirically tune the hyper-parameters that appear in the model architecture and loss functions. Considering the computational cost of training the model, we opted for manual hyper-parameter tuning to limit the resource usage to what was strictly necessary, while still achieving quick convergence to good values. Systematic hyper-parameter tuning is expected to further improve performance. Thus, we view this not as a limitation, but as a promising direction for future enhancement, contingent on the availability of additional computational resources and time. For the Regressor components, we empirically tested the number of bins in QMSE and concluded that it does not have a strong impact on training, provided that the coefficient $ \overline{\alpha} $ is properly tuned. We chose to use logarithmically equispaced bins with a bin size of $ \log \left(0.5+1\right) $ . Empirical findings showed a trade-off when training the emulator with the combined MSE-QMSE loss: lower values of $ \overline{\alpha} $ favour average results, while higher values lead to improved accuracy in the tail of the distribution. We chose $ \overline{\alpha}=0.025 $ for the RC Regressor component and $ \overline{\alpha}=0.005 $ for the R-all model, both of which lean toward the latter behaviour. This choice reflects our interest in accurately capturing the full distribution of precipitation values, but with particular emphasis on the extremes. For the FL, we started from the parameter values suggested in Lin et al. (Reference Lin, Goyal, Girshick, He and Dollár2020) ( $ \alpha =0.25 $ and $ \gamma =2 $ ) and performed a manual grid search to adjust these values to our specific case. We found that the FL loss used to train the RC Classifier component works best with parameters $ \alpha =0.75 $ and $ \gamma =2 $ .
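
For concreteness, bin edges consistent with the logarithmically equispaced choice above could be built as follows and passed to the QMSE sketch of Section 4.4; the $ 300 $ mm/h upper bound is an illustrative assumption.

```python
import math
import torch

# Logarithmically equispaced bin edges in log(1 + y) space with bin size log(0.5 + 1).
bin_size = math.log1p(0.5)
bin_edges = torch.arange(0.0, math.log1p(300.0) + bin_size, bin_size)
# bin_edges can be passed directly to the qmse_loss sketch in Section 4.4.
```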

5. Metrics

In evaluating the reanalysis to observation downscaling and RCM emulation capabilities of GNN4CD, we used several key metrics, which are here introduced:

  • probability density function;

  • seasonal diurnal cycles (average, frequency and intensity);

  • spatial average, $ 99 $ th and $ 99.9 $ th percentiles;

  • percentage bias for spatial average, $ 99 $ th and $ 99.9 $ th percentiles;

  • spatial correlation;

  • spatial change in future projections.

The first fundamental metric is the probability density function (PDF), defined as the normalised frequency of occurrence of events within a certain bin. The PDF is obtained by considering all the hourly values for all the grid points over the desired spatial area and temporal period.

Next, seasonal diurnal cycles are crucial to investigate temporal patterns on a sub-daily scale. Diurnal cycles are obtained by averaging the values of all time steps corresponding to the same hour, for each hour of the day, over the considered time span. Specifically, we consider the diurnal cycles of precipitation average, frequency, and intensity. The average is obtained by dividing the total sum by the number of instances, considering both zero and non-zero precipitation cases. The frequency is defined as the percentage of non-zero precipitation cases over the total, also referred to as the percentage of rainy hours. Finally, the intensity is computed similarly to the average, but considering only non-zero precipitation cases. The diurnal cycles are presented separately for each season. Seasons in climatological studies are usually defined as DJF (December, January, February), MAM (March, April, May), JJA (June, July, August), and SON (September, October, November), and we adopt the same terminology.

As spatial statistics, we consider the average and the $ 99 $ th and $ 99.9 $ th percentiles (p $ 99 $ and p $ 99.9 $ ). These metrics are computed individually for each location in space, for the given period of time, and are visualised through spatial maps. Both zero and non-zero precipitation cases are considered in the computation. The p $ 99 $ and p $ 99.9 $ quantities are defined as the values below which $ 99\% $ and $ 99.9\% $ of the values fall, respectively. The former represents a high-end threshold and illustrates whether the estimates capture the extreme events. The latter focuses on even more extreme events, providing insights into the ability to capture the rarest precipitation occurrences that are usually responsible for flood episodes. For all these quantities, we also consider the percentage bias, which quantifies the relative error in estimates, indicating whether the estimates overestimate or underestimate the reference values. The percentage bias is computed by taking the difference between the estimates and the reference values, then dividing by the reference values and multiplying by $ 100 $ .

To quantify the degree of agreement between the spatial patterns of the estimates and those of the reference, we adopt the Pearson correlation coefficient (PCC). We computed the PCC for the maps of precipitation average, p $ 99 $ and p $ 99.9 $ . This coefficient measures linear correlation between two sets of data and can take values between $ -1 $ and $ 1 $ , with $ 0 $ separating negative from positive correlation. The final metric we consider is the spatial change, used to evaluate future projections. This quantity indicates the difference between the future projection quantities (mid-century or end-of-century) and the historical amount, computed consistently within the same model.
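
The following NumPy sketch shows how the main aggregate metrics above could be computed for a pair of reference and estimated fields. Array shapes, the assumption of a continuous hourly series starting at 00:00, and the restriction to the average diurnal cycle (frequency, intensity, and seasonal splits follow the same pattern) are simplifications for illustration.

```python
import numpy as np

def aggregate_metrics(y_ref, y_est, bin_size=0.5):
    """Sketch of the Section 5 metrics for hourly precipitation arrays of shape [time, points]."""
    # Probability density function over all hours and grid points.
    edges = np.arange(0.0, max(y_ref.max(), y_est.max()) + bin_size, bin_size)
    pdf_ref = np.histogram(y_ref, bins=edges, density=True)[0]
    pdf_est = np.histogram(y_est, bins=edges, density=True)[0]

    # Diurnal cycle of the average (one value per hour of the day).
    hour = np.arange(y_est.shape[0]) % 24
    diurnal_est = np.array([y_est[hour == h].mean() for h in range(24)])

    # Spatial maps of average, p99 and p99.9 (one value per grid point).
    def maps(a):
        return {"avg": a.mean(axis=0),
                "p99": np.percentile(a, 99, axis=0),
                "p99.9": np.percentile(a, 99.9, axis=0)}
    m_ref, m_est = maps(y_ref), maps(y_est)

    # Percentage bias and Pearson spatial correlation of each map.
    bias = {k: 100.0 * (m_est[k] - m_ref[k]) / m_ref[k] for k in m_ref}
    pcc = {k: np.corrcoef(m_ref[k], m_est[k])[0, 1] for k in m_ref}
    return pdf_ref, pdf_est, diurnal_est, bias, pcc
```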

6. Results and discussion

In this section, we present the inference results of GNN4CD, assessing its performance in both the reanalysis to observation downscaling and RCM emulation settings.

For the reanalysis to observation downscaling task, GNN4CD estimates are evaluated against GRIPHO observations with a focus on PDFs, spatial average, p $ 99 $ and p $ 99.9 $ maps of hourly precipitation. We also examine the seasonal diurnal cycles, which are particularly relevant, given that one of the key added values of the CP-RCMs lies in their improved representation of sub-daily precipitation patterns, especially the afternoon convective peak typically observed in summer (Ban et al., Reference Ban, Caillaud, Coppola, Pichelli, Sobolowski, Adinolfi, Ahrens, Alias, Anders, Bastin, Belušić, Berthou, Brisson, Cardoso, Chan, Christensen, Fernández, Fita, Frisius, Gašparac, Giorgi, Goergen, Haugen, Hodnebrog, Kartsios, Katragkou, Kendon, Keuler, Lavin-Gullon, Lenderink, Leutwyler, Lorenz, Maraun, Mercogliano, Milovac, H-J, Raffa, Remedio, Schär, Soares, Srnec, Steensen, Stocchi, Tölle, Truhetz, Vergara-Temprado, de Vries, Warrach-Sagi, Wulfmeyer and Zander2021). Additionally, we evaluate the GNN4CD performance on $ 10 $ documented flood episodes (observed between $ 2011 $ and $ 2016 $ ) as an initial step towards assessing its potential for impact-oriented applications.

In the RCM emulation setting, we compare the output of GNN4CD to the RegCM simulations. To this end, we analyse the PDFs across the three time slices and examine the spatial distribution of average and extreme precipitation changes for the future periods. In the PDFs comparison, we include a $ 10 $ -year subset of the GRIPHO observational dataset. For the historical period, our aim is for the emulator to reproduce results that are closer to GRIPHO, thereby mitigating the biases typically present in climate model simulations. For the future time slices, where no ground truth is available, we conduct a comparative analysis between the emulator’s estimates and the RegCM outputs. Here, particular attention is given to assessing the emulator’s ability to capture changes in the precipitation distribution associated with global warming.

Furthermore, given that the use of a time series of predictors to downscale a single precipitation time step represents a relatively novel approach, we conducted an evaluation to assess the added value of this methodology. The corresponding findings are presented and discussed in Appendix A.3.

6.1. Reanalysis to observation downscaling

The aim of this initial evaluation is to provide a comprehensive assessment of the GNN4CD behaviour in a setting that closely resembles the training environment.

We begin by assessing the PDFs and diurnal cycles (Figures 5 and A1, respectively, for GNN4CD RC and R-all). The PDFs computed over the entire Italian domain show a good agreement between the estimates and the observational reference (panels a). GNN4CD tends to slightly overestimate precipitation amounts (of the order of a few millimetres per hour), while higher values, from the p $ 99 $ onward and in the tail of the distribution, are more accurately captured. The p $ 99 $ and p $ 99.9 $ computed on the same aggregated data are also close to the GRIPHO reference, especially for the RC model (Table 2). Panels c-d-e display the diurnal cycles of hourly precipitation average, frequency (percentage of rainy hours), and intensity, respectively. Each panel is further divided by seasons and presents the daily evolution at one-hour intervals. Overall, GNN4CD provides a good match with GRIPHO observations, particularly in terms of average precipitation and frequency in the RC model configuration. Both RC and R-all exhibit a larger bias in average JJA precipitation compared to other seasons. In the RC case, this arises predominantly from excessively high precipitation intensities; in the R-all case, overly frequent precipitation is the main contributor. Nonetheless, in both cases, GNN4CD captures the evolution of precipitation well, with very good timing of the precipitation peak in the late afternoon (around 17:00-18:00).

Figure 5. Results in the reanalysis to observation downscaling setting and comparison with GRIPHO observations for the testing year $ 2016 $ for the PDF of hourly precipitation [mm/h] with bin size of $ 0.5 $ mm for (a) Italy (I) (b) northern Italy (N) and central-south Italy (C-S); the insets provide a magnified view of the tail of the distribution; (c) average [mm/h], (d) frequency [%] and (e) intensity [mm/h] seasonal diurnal cycles for Italy (I).

Table 2. Extreme percentiles computed for GRIPHO and the GNN4CD RC and R-all model designs for Italy (I), northern Italy (N), and central-south Italy (C-S)

Next, we examine the spatial maps and the corresponding maps of percentage bias (Figures 6 and A2 for GNN4CD RC and R-all, respectively). In the case of average precipitation (panels a), GNN4CD generally leads to overestimation, with the largest bias occurring in areas of complex topography. Systematic biases across the estimates have been identified in the Apulia region, as well as along the Tyrrhenian and Adriatic coasts of the Tuscany and Marche regions. These biases are more pronounced for the RC model. In the case of p $ 99 $ and p $ 99.9 $ (panels b-c), GNN4CD still tends to overestimate extreme precipitation in regions characterised by complex topography, although this overestimation is less pronounced relative to the bias observed in average precipitation. Conversely, a clear underestimation is evident in plain and hilly areas. The RC model exhibits overestimation in Apulia and along the Tyrrhenian coastlines, also for the extreme percentiles. The overestimation in regions of complex topography is likely linked to the well-known issue of gauge undercatch, as the stations used to create the reference observational dataset are primarily located in valleys. Regions of complex topography are instead rarely covered, leading to poorer interpolation and thus affecting the DL model’s learning (Figure 4b). The temporal coverage of station data is also very uneven and may have negatively influenced the quality of the GRIPHO dataset in the less covered locations (Figure 4c).

Figure 6. Results in the reanalysis to observation downscaling setting and comparison with GRIPHO observations for the testing year $ 2016 $ for (a) average precipitation [mm/h] and percentage bias [%], (b) p $ 99 $ [mm/h] and percentage bias [%], (c) p $ 99.9 $ [mm/h] and percentage bias [%].

Similarly, we analyse the seasonal spatial maps for GRIPHO and the corresponding percentage bias for GNN4CD RC and R-all. The evaluation focuses on average precipitation and extreme percentiles, presented in Figures A3, A4, and A5. Figure A6 shows instead the seasonal PDFs. Both RC and R-all models tend to be wetter in JJA and drier in MAM. The seasonal PDFs suggest that the principal cause might be an overestimation of the occurrence of low-to-moderate events for the RC model and low events for the R-all model. The tails are instead better represented, in line with the p $ 99.9 $ maps showing the lowest biases.

Moreover, we evaluate the PCC for the entire Italian peninsula and for the north and central-south areas (Table 3). In five cases out of nine, the correlation coefficients computed for the R-all model are higher than the RC values. Specifically, R-all performs better in all the metrics for the central-south area. Nevertheless, the RC model also reports positive correlation coefficients for all the cases investigated, with the highest values observed for the north of Italy, as expected. However, the systematic biases that we observe in some parts of the central-south area (e.g., the Apulia region) may have a detrimental influence on the aggregated spatial correlation values, where the gap with northern Italy is much more pronounced than in other metrics we assessed.

Table 3. Spatial correlation between the reference GRIPHO maps and the GNN4CD RC and R-all estimated maps for precipitation average, p $ 99 $ and p $ 99.9 $ ; results are shown for Italy (I), northern Italy (N), and central-south Italy (C-S)

GNN4CD is further evaluated in representing the total precipitation for $ 10 $ flood episodes occurring within the GRIPHO time span. All flood events exceed the $ 99 $ th percentile of the precipitation distribution in the affected area (seven of them also surpass the $ 99.9 $ th percentile), except one case, which remains above the $ 90 $ th percentile. For this specific application, both the RC and R-all models have been retrained by excluding the time steps of the floods from the training set, in order to allow a fair evaluation. Figure 7 displays the comparison with the GRIPHO observational reference for all the events, for both the RC and R-all model designs. The results are promising, as all flood events are captured in terms of both spatial extent and severity. The overestimation of precipitation amounts is consistent with the patterns observed in previous evaluations, except for the second event, where the overestimation is more pronounced. This case requires a more in-depth study of the physical event to understand whether it is particularly out-of-distribution compared to the cases encountered during training, and will be the subject of further studies.

Figure 7. Total precipitation [mm] for $ 10 $ flood events in Italy. Events $ 1 $ , $ 4 $ , $ 5 $ , $ 6 $ , $ 8 $ , $ 10 $ took place in northern Italy (N), events $ 2 $ , $ 7 $ in central Italy (C), events $ 3 $ , $ 9 $ in southern Italy (S).

Overall, GNN4CD demonstrates consistent performance in the reanalysis to observation downscaling task over the grid points of the Italian peninsula, indicating a degree of spatial transferability across the precipitation distribution. This includes the DL model’s capacity to represent both average and extreme spatial patterns, as well as the characteristics of individual flood events across the entire domain. A comparison of the PDFs relative to northern Italy and central-south Italy further illustrates the spatial transferability of GNN4CD, although some regional differences remain evident (Figures 5b and A1b). Indeed, notable degradation in performance is observed in certain specific regions where systematic biases persist and worsen the aggregate results. In future work, we aim to address these issues by investigating the underlying causes of the observed discrepancies and developing targeted solutions to improve the spatial transferability and overall performance of the model in the affected areas.

6.2. RCM emulation

In the RCM emulation setting, GNN4CD is used as a proper emulator to downscale RCM simulations.

We begin by comparing the PDFs (Figure 8). Panels a and d show the precipitation PDFs estimated by GNN4CD RC and R-all for the historical period, compared with the original climate model output and with the $ 10 $ -year subset of the GRIPHO observational dataset. Relative to the observations, the PDFs estimated by the RC and R-all models tend to yield a higher frequency of low precipitation values while underestimating the tail of the distribution. In contrast, RegCM tends to overestimate precipitation in the distribution tail when compared to observations. As expected, the GNN4CD estimates are closer to the observed PDFs than to the RegCM distribution, especially for the RC case; the R-all model instead shows a more pronounced underestimation. Nevertheless, results are generally promising and in line with the reanalysis to observation downscaling case. Panels b-c and e-f compare the PDFs shown in panels a and d with those generated by the GNN4CD and RegCM models for the mid-century and end-of-century projections. The results indicate that the emulator generally captures the climate change signal exhibited by the RegCM model, reflected in a shift of the precipitation distribution towards more frequent and intense events when moving to the future time slices. The magnitude of the shift differs slightly between the RC and R-all models, with the latter closer to the RegCM shift and the former slightly under-representing the change between mid-century and end-of-century. These findings are notable: GNN4CD was not trained on any precipitation scenario data, yet it captures general trends even beyond the climate regimes in which it was trained.

Figure 8. PDF of hourly precipitation [mm/h] with bin size of $ 0.5 $ mm for Italy (I); comparison of GRIPHO $ 10 $ -years (grey) and RegCM historical (black) with (a) historical GNN4CD RC (blue), (b) mid-century RegCM (pink) and GNN4CD RC (orange), (c) end-of-century RegCM (magenta) and GNN4CD RC (dark orange), (d) historical GNN4CD R-all (blue), (e) mid-century RegCM (pink) and GNN4CD R-all (orange), (f) end-of-century RegCM (magenta) and GNN4CD R-all (dark orange); the insets provide a magnified view of the tail of the distribution.
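The empirical PDFs in Figure 8 can be obtained from a fixed-width histogram of the hourly values; a minimal sketch with the 0.5 mm binning follows, where the upper bound of the bins is an assumption introduced only for illustration.

```python
# Illustrative sketch: empirical PDF of hourly precipitation with 0.5 mm bins.
import numpy as np

def hourly_pdf(pr: np.ndarray, bin_width: float = 0.5, pr_max: float = 150.0):
    """Return bin centres and normalised frequencies; values above pr_max are ignored."""
    values = pr[np.isfinite(pr)].ravel()
    edges = np.arange(0.0, pr_max + bin_width, bin_width)
    counts, edges = np.histogram(values, bins=edges, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, counts
```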

Next, we compare the spatial change projected by GNN4CD RC and R-all with that projected by RegCM. The results when moving from historical to end-of-century are shown in Figure 9, while Figure A7 shows the change from historical to mid-century. Panels a-b show the reference maps of average precipitation for the historical period and the corresponding changes, computed consistently for each of the models. The emulator’s projections show an end-of-century dry change signal in the central regions towards the Tyrrhenian coast and over the island of Sardinia, in line with the RegCM projections. The intensification of the signal from mid-century to end-of-century, both wet and dry, is also confirmed by the GNN4CD projections. The emulator also agrees in representing the wet change signal over the Alpine chain for the mid-century time slice, as well as the precipitation increase in the Apulia, Basilicata, Veneto and Friuli-Venezia-Giulia regions, mainly evident in the end-of-century estimates. However, there are several cases in which the projections of GNN4CD and RegCM disagree. For instance, GNN4CD projects a wetter climate signal in the Padania region (central-northern Italy) for the mid-century, more evident for the R-all model; the same behaviour is observed over the Alpine chain in the end-of-century change. Panels c-d and e-f show the same results (historical reference and corresponding change) in terms of p $ 99 $ and p $ 99.9 $ . When looking at the p $ 99 $ end-of-century change, the disagreement between the projections of GNN4CD and RegCM is even more evident: both RC and R-all project a wetter climate over almost all of the peninsula, whereas the RegCM projections show a dry signal along the Tyrrhenian coast. Notably, the projections of GNN4CD and RegCM for the p $ 99.9 $ change are broadly aligned: the emulator agrees with the sign of the RegCM change signal, although it generally continues to project a wetter climate.

Figure 9. Maps for GNN4CD RC, GNN4CD R-all, and RegCM showing (a) historical average hourly precipitation [mm/h] and (b) end-of-century average change [%]; (c)-(d) the same for p $ 99 $ and (e)-(f) the same for p $ 99.9 $ .
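The change maps in Figures 9 and A7 are relative differences between a future and the historical statistic, computed per grid point and consistently for each model; the sketch below is illustrative only, and the mask for near-zero historical values is an assumption.

```python
# Illustrative sketch: relative change [%] between a future and the historical
# time slice for a given statistic (average, p99, or p99.9).
import numpy as np

def percentage_change(hist_map: np.ndarray, future_map: np.ndarray) -> np.ndarray:
    with np.errstate(divide="ignore", invalid="ignore"):
        change = (future_map - hist_map) / hist_map * 100.0
    # Mask grid points where the historical value is (numerically) zero.
    return np.where(np.abs(hist_map) > 1e-6, change, np.nan)
```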

Additionally, we derive box plots from the spatial maps of average precipitation, p $ 99.9 $ , and percentage of rainy hours over Italy (Figure 10). These quantities are displayed for each time slice for GNN4CD RC and R-all and for RegCM; box plots are also displayed for the corresponding relative percentage bias maps. The boxes span from the first to the third quartile of the data, with a horizontal line indicating the median value. The whiskers extend from the edges of the box to the most extreme data point that falls within 1.5 times the interquartile range of the lower and upper quartiles. The average precipitation box plots highlight the opposite behaviour of the RC and R-all models: the former leads to a much higher median than RegCM, whereas the latter is much closer, with a slightly lower value. Accordingly, the corresponding relative bias box plot shows a median value very close to zero for the R-all case. For the p $ 99.9 $ map statistics, both the RC and R-all models lead to lower median values, smaller in the case of R-all. The median percentage of rainy hours is instead higher in both cases, again with the larger difference produced by the R-all model. For the average precipitation map, the spread in the estimates of the two model designs is comparable; in the p $ 99.9 $ case the RC model exhibits a significantly larger spread, while in the rainy hours case the R-all model exhibits the larger spread. When the same statistics are derived for northern Italy and central-south Italy (Figure A8), similar conclusions can be drawn. However, northern Italy exhibits much greater spreads, which may degrade the statistics aggregated over the whole of Italy.

Figure 10. Box-plots for RegCM (red) and GNN4CD RC (green) and R-all (blue), derived for Italy (I) from the spatial maps of (a) average precipitation [mm/h], (b) p $ 99.9 $ [mm/h] and (c) percentage of rainy hours [%]; the lower panels show the box plots for the relative bias maps of the same quantities.
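The box-plot statistics described above (quartiles, median, and whiskers at the most extreme values within 1.5 times the interquartile range) can be sketched as follows; this is an illustrative helper, not the plotting code used for Figure 10.

```python
# Illustrative sketch: box-plot statistics from a spatial map of a given metric.
import numpy as np

def box_stats(map_values: np.ndarray) -> dict:
    v = map_values[np.isfinite(map_values)].ravel()
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    whisker_low = v[v >= q1 - 1.5 * iqr].min()
    whisker_high = v[v <= q3 + 1.5 * iqr].max()
    return {"q1": q1, "median": med, "q3": q3,
            "whisker_low": whisker_low, "whisker_high": whisker_high}
```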

The observed performance in projecting the precipitation change signal across both spatial and temporal dimensions suggests that the proposed emulator is potentially capable of projecting precipitation changes associated with global warming, and that it may also possess a degree of spatial transferability in the RCM emulation setting.

7. Conclusions

In this work, we proposed GNN4CD, a combination of a new GNN-based DL model with an innovative training strategy that can be used for both reanalysis to observation downscaling and RCM emulation. Using low-resolution atmospheric variables as input, GNN4CD can efficiently derive precipitation estimates at high resolution. Unlike most climate emulators available in the literature (Addison et al., 2024; Baño-Medina et al., 2024; Doury et al., 2024), GNN4CD was trained on reanalysis and observations. Reanalysis data have the advantage of not suffering from the intrinsic biases of climate models, which can affect training when outputs from climate model simulations are used as predictors. This training strategy, which we refer to as a hybrid imperfect framework, should facilitate the ability of the emulator to generalise to climates and models unseen during training.

When used for reanalysis to observation downscaling, GNN4CD was able to reproduce the observed precipitation distribution and the extreme percentiles with relatively good accuracy. In this respect, we observed a trade-off between optimising the low-intensity part of the distribution and its tail. The chosen loss configuration led to greater accuracy on extreme events, at the expense of low precipitation values, which were often overestimated; we plan to investigate different loss functions to improve this aspect. Nevertheless, GNN4CD estimated the total precipitation quite well during the selected $ 10 $ flood events. The sub-daily variability (diurnal cycles) was also well replicated for all seasons, with some overestimation in summer; despite this, the summer afternoon convective precipitation peak was well captured. The link between the overestimation in regions of complex topography and the limitations of the current observational data will be the subject of future work.

When used for RCM emulation, with climate data predictors, GNN4CD was evaluated in generating future precipitation projections and emulating the downscaling function between typical RCM and CP-RCM resolutions. Results were quite remarkable, especially considering that the emulator was tested under the RCP8.5 scenario, i.e., a high-end emissions pathway that poses a particularly challenging setting for emulation. Despite this, GNN4CD successfully reproduced the shift of the precipitation distribution toward more frequent and intense precipitation events, demonstrating a promising ability to capture general trends even beyond the climate regimes where it was trained. Further research is needed to explore the potential added value of fine-tuning the trained model using climate simulation data prior to inference.

Moreover, GNN4CD proved capable of estimating precipitation over a spatial domain larger than the training area, without a significant degradation in performance. However, some regions exhibited systematic biases, which should be carefully examined in future evaluations. Further research will be dedicated to incorporating additional regions beyond Italy during both the training and inference phases, to help understand the limits of the emulator’s spatial transferability. This is important because spatial transferability is a distinctive feature of GNN4CD and has the potential to extend the emulator’s application to remote and/or data-sparse regions of the world.

Between the two model designs examined (RC and R-all), neither demonstrated a consistently superior performance. They both produced generally comparable results, with instances where each outperformed the other. For future work, we plan to prioritise the development of the R-all model, given its reduced computational cost and the advantage of working with a single model.

While the proposed GNN4CD shows promising skill in emulating regional climate simulations and capturing key aspects of precipitation variability and change, its current implementation has limitations that require further investigation. In particular, the RCM emulation skill is evaluated using coarsened CP-RCM data, which inherently preserve some physical information from the high-resolution simulations. Additionally, potential uncertainties in the observational datasets and the restriction of reanalysis inputs to the historical period may limit the emulator’s generalisation to future climate conditions. These considerations highlight the importance of developing strategies, such as targeted fine-tuning or hybrid training schemes, that can enhance the robustness and generalisation capacity of GNN4CD in more realistic applications. Future work will aim to address these challenges by extending the framework to more robustly support downscaling from GCM-based scenarios and by evaluating the emulator across multiple ensemble members from different RCMs, in order to assess its potential to enhance the CP-RCM ensemble. In addition, we plan to broaden the application of the emulator to other climate variables beyond precipitation.

All these future research directions will contribute to further establishing the effectiveness and reliability of GNN4CD. In turn, this would make high-resolution ensembles of climate projections accessible at a fraction of the cost and time required by dynamical methods.

Open peer review

To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2025.10022.

Acknowledgements

The Authors are grateful to the two Reviewers for their constructive feedback and valuable suggestions, which have substantially contributed to improving the quality and clarity of the manuscript.

Author contribution

Conceptualization: E.C., G.S., L.B.; Methodology: V.B., E.C., V.A., S.D.G.; Software: V.B.; Visualisation: V.B.; Data curation: V.B., S.D.G.; Writing – original draft: V.B.; Writing – review and editing: V.B., E.C., G.S., V.A., S.D.G.; Supervision: E.C., G.S., L.B.; all authors approved the final submitted draft.

Competing interests

The authors declare none.

Data availability statement

The code for the GNN4CD emulator is available at https://github.com/valebl/GNN4CD.

Ethics statement

The research meets all ethical guidelines, including adherence to the legal requirements of the study country.

Funding statement

V.B. and E.C. acknowledge support from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101003469 (XAIDA). E.C. acknowledges support from the European Union’s Horizon Europe programme under Grant Agreement No. 101081555 (Impetus4Change). V.B. acknowledges co-funding from Assicurazioni Generali through grant D_S_ELIBER_GENERALI_SISSA_CoordCentroDS_0708. G.S. acknowledges co-funding from Next Generation EU, in the context of the National Recovery and Resilience Plan, Investment PE1 – Project FAIR “Future Artificial Intelligence Research”. This resource was co-financed by the Next Generation EU [DM 1555 del 11.10.22]. L.B. acknowledges that the study was partially carried out within the PNRR research activities of the consortium iNEST funded by the European Union Next-GenerationEU (PNRR, Missione 4 Componente 2, Investimento 1.5 – D.D. 1058 23/06/2022, ECS 00000043).

A. Appendix

A.1. Reanalysis to observation downscaling additional results

Figure A1. Same as Figure 5 but for the R-all model.

Figure A2. Same as Figure 6 but for the R-all model.

Figure A3. Seasonal results in the reanalysis to observation downscaling setting for the testing year $ 2016 $ for the hourly average precipitation; (a) GRIPHO observational reference [mm/h], (b) GNN4CD RC percentage bias [%], (c) GNN4CD R-all percentage bias [%].

Figure A4. Same as Figure A3 but for p $ 99 $ .

Figure A5. Same as Figure A3 but for p $ 99.9 $ .

Figure A6. Seasonal results in the reanalysis to observation downscaling setting for the testing year $ 2016 $ for the PDF of hourly precipitation [mm/h] with bin size of $ 0.5 $ mm for Italy (I); comparison between GRIPHO and (a) GNN4CD RC, (b) GNN4CD R-all; the insets provide a magnified view of the tail of the distribution.

A.2. RCM emulation additional results

Figure A7. Same as Figure 9, but computing the change for the mid-century period.

Figure A8. Same as Figure 10 but for (a) northern Italy (N) and (b) central-south Italy (C-S).

A.3. On the influence of predictors on time series length

To the best of our knowledge, previous studies that served as key references for this work, such as Doury et al. (2022), Doury et al. (2023), van der Meer et al. (2023), Addison et al. (2024), and Hess et al. (2025), did not incorporate a time series of predictors as input for downscaling. However, these studies were not conducted at sub-daily temporal resolutions, as is the case in our work. We identified one related study (Schmidt et al., 2025) that employs a time series of predictors to generate sub-daily precipitation estimates. Nevertheless, their methodological framework differs substantially from ours, as they use a score-based diffusion model trained in a super-resolution setting, with coarse-resolution conditioning applied only at inference time.

Given the importance of capturing sub-daily variability, we believe that incorporating a time series of predictors can be particularly beneficial for hourly-scale modelling. To support this hypothesis, we performed an ablation study comparing the original GNN4CD configuration with three alternative setups:

  • GNN4CD $ \left[t-24,\dots, t\right] $ : the original model

  • GNN4CD $ \left[t-12,\dots, t\right] $ : the same model, but using a reduced sequence from $ t-12 $ to $ t $

  • GNN4CD $ \left[t-6,\dots, t\right] $ : the same model, but using a reduced sequence from $ t-6 $ to $ t $

  • GNN4CD $ \left[t\right] $ : a baseline using only time t predictors, without the recurrent component

The evaluations are performed over the validation year $ 2007 $ for the Italian peninsula, considering the R-all model configuration. The metrics considered in the comparison are the spatial maps of average, p $ 99 $ and p $ 99.9 $ , the PDF, and the diurnal cycles. The full time series (GNN4CD $ \left[t-24,\dots, t\right] $ ) provides the most accurate results, especially for the spatial maps (Figure A9) and the PDF (Figure A10a). While the performance of the truncated models does not degrade drastically, the version without temporal context (GNN4CD $ \left[t\right] $ ) performs notably worse, particularly in the central-south Italy region, and fails to reproduce the diurnal cycles (Figure A10b-c-d). Given that the computational savings from using a shorter sequence were minimal (on the order of a few minutes per epoch), we retained the original configuration.
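For clarity, the temporal predictor window used in the four setups can be sliced directly from the hourly predictor series; the sketch below is illustrative only, and the array layout and function name are assumptions rather than the GNN4CD data pipeline.

```python
# Illustrative sketch: building the predictor sequence [t-lag, ..., t] for the
# recurrent pre-processor; lag=24 corresponds to the original configuration,
# lag=12, 6, or 0 to the ablation variants.
import numpy as np

def predictor_window(predictors: np.ndarray, t: int, lag: int = 24) -> np.ndarray:
    """predictors: hourly array of shape (time, channels, ...)."""
    if t < lag:
        raise ValueError("Not enough preceding time steps for the requested lag.")
    return predictors[t - lag: t + 1]
```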

Figure A9. Comparison between the four alternative setups for the R-all model configuration, i.e., GNN4CD $ \left[t-24,\dots, t\right] $ , GNN4CD $ \left[t-12,\dots, t\right] $ , GNN4CD $ \left[t-6,\dots, t\right] $ and GNN4CD $ \left[t\right] $ in terms of relative percentage bias [%] with respect to GRIPHO, considering the validation year $ 2007 $ ; (a) average, (b) p $ 99 $ and (c) p $ 99.9 $ spatial maps.

Figure A10. Comparison between the four alternative setups for the R-all model configuration, i.e., GNN4CD $ \left[t-24,\dots, t\right] $ , GNN4CD $ \left[t-12,\dots, t\right] $ , GNN4CD $ \left[t-6,\dots, t\right] $ and GNN4CD $ \left[t\right] $ with respect to GRIPHO, considering the validation year $ 2007 $ and Italy (I); (a) hourly precipitation PDF [mm/h] using a bin size of $ 0.5 $ mm and (b) average [mm/h] (c) frequency [%], and (d) intensity [mm/h] seasonal diurnal cycles.
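For completeness, the seasonal diurnal-cycle metrics (average, frequency, and intensity) used in Figure A10 and throughout the evaluation can be sketched as below; the 0.1 mm/h wet-hour threshold and the grouping details are assumptions made for illustration only.

```python
# Illustrative sketch: seasonal diurnal cycles of average precipitation,
# wet-hour frequency [%], and wet-hour intensity from an hourly series.
import numpy as np
import pandas as pd

def diurnal_cycles(times: pd.DatetimeIndex, pr: np.ndarray, wet_thr: float = 0.1) -> pd.DataFrame:
    """pr: area-mean hourly precipitation [mm/h] aligned with `times`."""
    df = pd.DataFrame({"pr": pr}, index=times)
    df["season"] = df.index.month % 12 // 3   # 0=DJF, 1=MAM, 2=JJA, 3=SON
    df["hour"] = df.index.hour
    grouped = df.groupby(["season", "hour"])["pr"]
    return pd.DataFrame({
        "average": grouped.mean(),
        "frequency": grouped.apply(lambda x: 100.0 * (x > wet_thr).mean()),
        "intensity": grouped.apply(lambda x: x[x > wet_thr].mean()),
    })
```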

Footnotes

This research article was awarded the Open Materials badge for transparent practices. See the Data Availability Statement for details.

References

Addison, H, Kendon, E, Ravuri, S, Aitchison, L and Watson, P (2024) Machine Learning Emulation of Precipitation from km-scale Regional Climate Simulations Using a Diffusion Model.CrossRefGoogle Scholar
Araújo, JR, Ramos, AM, Soares, PMM, Melo, R, Oliveira, SC and Trigo, RM (2022) Impact of extreme rainfall events on landslide activity in Portugal under climate change scenarios. Landslides 19(10), 2279–2293. https://doi.org/10.1007/s10346-022-01895-7.CrossRefGoogle Scholar
Ban, N, Caillaud, C, Coppola, E, Pichelli, E, Sobolowski, S, Adinolfi, M, Ahrens, B, Alias, A, Anders, I, Bastin, S, Belušić, D, Berthou, S, Brisson, E, Cardoso, RM, Chan, SC, Christensen, OB, Fernández, J, Fita, L, Frisius, T, Gašparac, G, Giorgi, F, Goergen, K, Haugen, JE, Hodnebrog, Ø, Kartsios, S, Katragkou, E, Kendon, EJ, Keuler, K, Lavin-Gullon, A, Lenderink, G, Leutwyler, D, Lorenz, T, Maraun, D, Mercogliano, P, Milovac, J, H-J, Panitz, Raffa, M, Remedio, AR, Schär, C, Soares, PMM, Srnec, L, Steensen, BM, Stocchi, P, Tölle, MH, Truhetz, H, Vergara-Temprado, J, de Vries, H, Warrach-Sagi, K, Wulfmeyer, V and Zander, MJ (2021) The first multi-model ensemble of regional climate simulations at kilometer-scale resolution, part I: Evaluation of precipitation. Climate Dynamics 57(1), 275302. https://doi.org/10.1007/s00382-021-05708-w.CrossRefGoogle Scholar
Baño-Medina, J, Iturbide, M, Fernández, J and Gutiérrez, JM (2024) Transferability and explainability of deep learning emulators for regional climate model projections: Perspectives for future applications. Artificial Intelligence for the Earth Systems 3(4), e230099. https://doi.org/10.1175/AIES-D-23-0099.1.CrossRefGoogle Scholar
Battaglia, P, Hamrick, JBC, Bapst, V, Sanchez, A, Zambaldi, V, Malinowski, M, Tacchetti, A, Raposo, D, Santoro, A, Faulkner, R, Gulcehre, C, Song, F, Ballard, A, Gilmer, J, Dahl, GE, Vaswani, A, Allen, K, Nash, C, Langston, VJ, Dyer, C, Heess, N, Wierstra, D, Kohli, P, Botvinick, M, Vinyals, O, Li, Y and Pascanu, R (2018) Relational inductive biases, deep learning, and graph networks. https://arxiv.org/pdf/1806.01261.pdf.Google Scholar
Boé, J, Mass, A and Deman, J (2023) A simple hybrid statistical–dynamical downscaling method for emulating regional climate models over western Europe. Evaluation, application, and role of added value? Climate Dynamics 61(1), 271–294. https://doi.org/10.1007/s00382-022-06552-2.CrossRefGoogle Scholar
Brody, S, Alon, U and Yahav, E (2022) How attentive are graph attention networks? https://arxiv.org/abs/2105.14491.Google Scholar
Collins, M, Beverley, JD, Bracegirdle, TJ, Catto, J, McCrystall, M, Dittus, A, Freychet, N, Grist, J, Hegerl, GC, Holland, PR, Holmes, C, Josey, SA, Joshi, M, Hawkins, E, Lo, E, Lord, N, Mitchell, D, Monerie, P-A, Priestley, MDK, Scaife, A, Screen, J, Senior, N, Sexton, D, Shuckburgh, E, Siegert, S, Simpson, C, Stephenson, DB, Sutton, R, Thompson, V, Wilcox, LJ and Woollings, T (2024) Emerging signals of climate change from the equator to the poles: New insights into a warming world. Frontiers in Science 2, 1340323. https://doi.org/10.3389/fsci.2024.1340323.CrossRefGoogle Scholar
Coppola, E, Sobolowski, S, Pichelli, E, Raffaele, F, Ahrens, B, Anders, I, Ban, N, Bastin, S, Belda, M, Belusic, D, Caldas-Alvarez, A, Cardoso, RM, Davolio, S, Dobler, A, Fernandez, J, Fita, L, Fumiere, Q, Giorgi, F, Goergen, K, Güttler, I, Halenka, T, Heinzeller, D, Hodnebrog, Ø, Jacob, D, Kartsios, S, Katragkou, E, Kendon, E, Khodayar, S, Kunstmann, H, Knist, S, Lavín-Gullón, A, Lind, P, Lorenz, T, Maraun, D, Marelle, L, van Meijgaard, E, Milovac, J, Myhre, G, H-J, Panitz, Piazza, M, Raffa, M, Raub, T, Rockel, B, Schär, C, Sieck, K, Soares, PMM, Somot, S, Srnec, L, Stocchi, P, Tölle, MH, Truhetz, H, Vautard, R, de Vries, H and Warrach-Sagi, K (2020) A first-of-its-kind multi-model convection permitting ensemble for investigating convective phenomena over europe and the mediterranean. Climate Dynamics 55(1), 334. https://doi.org/10.1007/s00382-018-4521-8.CrossRefGoogle Scholar
Coppola, E, Stocchi, P, Pichelli, E, Alavez, JAT, Glazer, R, Giuliani, G, Di Sante, F, Nogherotto, R and Giorgi, F (2021) Non-hydrostatic RegCM4 (RegCM4-NH): Model description and case studies over multiple domains. Geoscientific Model Development 14(12), 7705–7723. https://doi.org/10.5194/gmd-14-7705-2021. https://gmd.copernicus.org/articles/14/7705/2021/.CrossRefGoogle Scholar
Cragg, JG (1971) Some statistical models for limited dependent variables with application to the demand for durable goods. Econometrica 39(5), 829–844. http://www.jstor.org/stable/1909582.CrossRefGoogle Scholar
Danielson, JJ and Gesch, DB (2011) Global Multi-Resolution Terrain Elevation Data 2010 (GMTED2010). 10.3133/ofr20111073CrossRefGoogle Scholar
Doury, A, Somot, S and Gadat, S (2024) On the suitability of a convolutional neural network based RCM-emulator for fine spatio-temporal precipitation. Climate Dynamics 62, 8587–8613. https://doi.org/10.1007/s00382-024-07350-8.CrossRefGoogle Scholar
Doury, A, Somot, S, Gadat, S, Ribes, A and Corre, L (2022) Regional climate model emulator based on deep learning: Concept and first evaluation of a novel hybrid downscaling approach. Climate Dynamics. https://doi.org/10.1007/s00382-022-06343-9. https://insu.hal.science/insu-03863754.Google Scholar
Doury, A, Somot, S, Gadat, S, Ribes, A and Corre, L (2023) Regional climate model emulator based on deep learning: Concept and first evaluation of a novel hybrid downscaling approach. Climate Dynamics 60(10), 1751–1779. https://doi.org/10.1007/s00382-022-06343-9.CrossRefGoogle Scholar
Fantini, A (2019) Climate Change Impact on Flood Hazard Over Italy. https://hdl.handle.net/11368/2940009.Google Scholar
Fey, M and Lenssen, JE (2019) Fast graph representation learning with Pytorch Geometric. https://arxiv.org/abs/1903.02428.Google Scholar
Giorgi, F and Gutowski, WJ (2015) Regional dynamical downscaling and the CORDEX initiative. Annual Review of Environment and Resources 40(1), 467–490. https://doi.org/10.1146/annurev-environ-102014-021217.CrossRefGoogle Scholar
Hersbach, H, Bell, B, Berrisford, P, Hirahara, S, Horányi, A, Muñoz-Sabater, J, Nicolas, J, Peubey, C, Radu, R, Schepers, D, Simmons, A, Soci, C, Abdalla, S, Abellan, X, Balsamo, G, Bechtold, P, Biavati, G, Bidlot, J, Bonavita, M, De Chiara, G, Dahlgren, P, Dee, D, Diamantakis, M, Dragani, R, Flemming, J, Forbes, R, Fuentes, M, Geer, A, Haimberger, L, Healy, S, Hogan, RJ, Hólm, E, Janisková, M, Keeley, S, Laloyaux, P, Lopez, P, Lupu, C, Radnoti, G, de Rosnay, P, Rozum, I, Vamborg, F, Villaume, S and Thépaut, J-N (2020) The era5 global reanalysis. Quarterly Journal of the Royal Meteorological Society 146(730), 19992049. https://doi.org/10.1002/qj.3803.CrossRefGoogle Scholar
Hess, P, Aich, M, Pan, B and Boers, N (2025) Fast, scale-adaptive and uncertainty-aware downscaling of earth system model fields with generative machine learning. Nature Machine Intelligence 7, 363–373. https://doi.org/10.1038/s42256-025-00980-5.CrossRefGoogle Scholar
IPCC (2014) Climate Change 2014: Synthesis Report. Contribution of Working Groups I, II and III to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Geneva: IPCC. https://www.ipcc.ch/report/ar5/syr/.Google Scholar
IPCC (2023) Climate Change 2022 – Impacts, Adaptation and Vulnerability: Working Group II Contribution to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press. https://doi.org/10.1017/9781009325844.Google Scholar
Johnson, JM and Khoshgoftaar, TM (2019) Survey on deep learning with class imbalance. Journal of Big Data 6(1), 1–54. https://doi.org/10.1186/s40537-019-0192-5.CrossRefGoogle Scholar
Kendon, EJ, Prein, AF, Senior, CA and Stirling, A (2021) Challenges and outlook for convection-permitting climate modelling. Philosophical Transactions of the Royal Society A 379(2195), 20190547. https://doi.org/10.1098/rsta.2019.0547.CrossRefGoogle ScholarPubMed
Laflamme, EM, Linder, E and Pan, Y (2016) Statistical downscaling of regional climate model output to achieve projections of precipitation extremes. Weather and Climate Extremes 12, 15–23. ISSN 2212-0947. https://doi.org/10.1016/j.wace.2015.12.001. https://www.sciencedirect.com/science/article/pii/S221209471530058X.CrossRefGoogle Scholar
Lam, R, Sanchez-Gonzalez, A, Willson, M, Wirnsberger, P, Fortunato, M, Alet, F, Ravuri, S, Ewalds, T, Eaton-Rosen, Z, Hu, W, Merose, A, Hoyer, S, Holland, G, Vinyals, O, Stott, J, Pritzel, A, Mohamed, S and Battaglia, P (2023) Learning skillful medium-range global weather forecasting. Science 382(6677), 1416–1421. https://doi.org/10.1126/science.adi2336.CrossRefGoogle ScholarPubMed
Leung, LR, Mearns, LO, Giorgi, F and Wilby, RL (2003) Regional climate research: Needs and opportunities. Bulletin of the American Meteorological Society 84(1), 89–95. http://doi.org/10.1175/BAMS-84-1-89.Google Scholar
Lin, T-Y, Goyal, P, Girshick, R, He, K and Dollár, P (2020) Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(2), 318–327. https://doi.org/10.1109/TPAMI.2018.2858826.CrossRefGoogle ScholarPubMed
Luu, LN, Vautard, R, Yiou, P and Soubeyroux, JM (2022) Evaluation of convection-permitting extreme precipitation simulations for the south of France. Earth System Dynamics 13(1), 687–702. https://doi.org/10.5194/esd-13-687-2022. https://esd.copernicus.org/articles/13/687/2022/.CrossRefGoogle Scholar
Maraun, D, Widmann, M and Gutiérrez, JM (2019) Statistical downscaling skill under present climate conditions: A synthesis of the VALUE perfect predictor experiment. International Journal of Climatology 39(9), 3692–3703. https://doi.org/10.1002/joc.5877.CrossRefGoogle Scholar
Morris, C, Ritzert, M, Fey, M, Hamilton, WL, Lenssen, JE, Rattan, G and Grohe, M (2021) Weisfeiler and Leman go neural: Higher-order graph neural networks. https://arxiv.org/abs/1810.02244.Google Scholar
Oleson, K, Lawrence, M, Bonan, B, Drewniak, B, Huang, M, Koven, D, Levis, S, Li, F, Riley, J, Subin, M, Swenson, S, Thornton, E, Bozbiyik, A, Fisher, R, Heald, L, Kluzek, E, Lamarque, J, Lawrence, J, Leung, R, Lipscomb, W, Muszala, P, Ricciuto, M, Sacks, J, Sun, Y, Tang, J and Yang, Z-L (2013) Technical Description of Version 4.5 of the Community Land Model (CLM). Technical Report 420, National Center for Atmospheric Research. https://doi.org/10.5065/D6RR1W7M.CrossRefGoogle Scholar
Paszke, A, Gross, S, Massa, F, Lerer, A, Bradbury, J, Chanan, G, Killeen, T, Lin, Z, Gimelshein, N, Antiga, L, Desmaison, A, Kopf, A, Yang, E, DeVito, Z, Raison, M, Tejani, A, Chilamkurthy, S, Steiner, B, Fang, L, Bai, J and Chintala, S (2019) Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., pp. 80248035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.Google Scholar
Pei, Y, Qiu, H, Yang, D, Liu, Z, Ma, S, Li, J, Cao, M and Wufuer, W (2023) Increasing landslide activity in the taxkorgan river basin (eastern pamirs plateau, China) driven by climate change. Catena 223, 106911. https://doi.org/10.1016/j.catena.2023.106911.CrossRefGoogle Scholar
Pichelli, E, Coppola, E, Sobolowski, S, Ban, N, Giorgi, F, Stocchi, P, Alias, A, Belušić, D, Berthou, S, Caillaud, C, Cardoso, RM, Chan, S, Christensen, OB, Dobler, A, de Vries, H, Goergen, K, Kendon, EJ, Keuler, K, Lenderink, G, Lorenz, T, Mishra, AN, H-J, Panitz, Schär, C, Soares, PMM, Truhetz, H and Vergara-Temprado, J (2021) The first multi-model ensemble of regional climate simulations at kilometer-scale resolution part 2: Historical and future simulations of precipitation. Climate Dynamics 56, 35813602. https://doi.org/10.1007/s00382-021-05657-4.CrossRefGoogle Scholar
Price, I, Sanchez-Gonzalez, A, Alet, F, Andersson, TR, El-Kadi, A, Masters, D, Ewalds, T, Stott, J, Mohamed, S, Battaglia, P, Lam, R and Willson, M (2025) Probabilistic weather forecasting with machine learning. Nature 637(8044), 84–90. https://doi.org/10.1038/s41586-024-08252-9.CrossRefGoogle ScholarPubMed
Rampal, N, Gibson, PB, Sherwood, S, Abramowitz, G and Hobeichi, S (2024b) A reliable generative adversarial network approach for climate downscaling and weather generation. ESS Open Archive. https://doi.org/10.22541/essoar.171352077.78968815/v2.CrossRefGoogle Scholar
Rampal, N, Hobeichi, S, Gibson, PB, Baño-Medina, J, Abramowitz, G, Beucler, T, González-Abad, J, Chapman, W, Harder, P and Gutiérrez, JM (2024a) Enhancing regional climate downscaling through advances in machine learning. Artificial Intelligence for the Earth Systems 3(2), 230066. https://doi.org/10.1175/AIES-D-23-0066.1. https://journals.ametsoc.org/view/journals/aies/3/2/AIES-D-23-0066.1.xml.CrossRefGoogle Scholar
Riahi, K, Rao, S, Krey, V, Cho, C, Chirkov, V, Fischer, G, Kindermann, G, Nakicenovic, N and Rafaj, P (2011) RCP 8.5—A scenario of comparatively high greenhouse gas emissions. Climatic Change 109(1), 33–57. ISSN 1573-1480. https://doi.org/10.1007/s10584-011-0149-y.CrossRefGoogle Scholar
Sanchez-Lengeling, B, Reif, E, Pearce, A and Wiltschko, AB (2021) A Gentle Introduction to Graph Neural Networks. https://distill.pub/2021/gnn-intro/.10.23915/distill.00033CrossRefGoogle Scholar
Scheepens, DR, Schicker, I, Hlaváčková-Schindler, K and Plant, C (2023) Adapting a deep convolutional RNN model with imbalanced regression loss for improved spatio-temporal forecasting of extreme wind speed events in the short to medium range. Geoscientific Model Development 16(1), 251–270. https://doi.org/10.5194/gmd-16-251-2023. https://gmd.copernicus.org/articles/16/251/2023/.CrossRefGoogle Scholar
Schlichtkrull, M, Kipf, TN, Bloem, P, Van Den Berg, R, Titov, I and Welling, M (2018) Modeling relational data with graph convolutional networks. In The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15. Springer, pp. 593–607. https://doi.org/10.1007/978-3-319-93417-4_38.CrossRefGoogle Scholar
Schmidt, J, Schmidt, L, Strnad, F, Ludwig, N and Hennig, P (2025) A generative framework for probabilistic, spatiotemporally coherent downscaling of climate simulation. npj Climate and Atmospheric Science 8(1), 270.CrossRefGoogle ScholarPubMed
Szwarcman, D, Guevara, J, Macedo, MMG, Zadrozny, B, Watson, C, Rosa, L and Oliveira, DAB (2024) Quantizing reconstruction losses for improving weather data synthesis. Scientific Reports 14(1), 3396. https://doi.org/10.1038/s41598-024-52773-2.CrossRefGoogle ScholarPubMed
Thiery, W, Davin, EL, Lawrence, DM, Hirsch, AL, Hauser, M and Seneviratne, SI (2017) Present-day irrigation mitigates heat extremes. Journal of Geophysical Research: Atmospheres 122, 1403–1422. https://doi.org/10.1002/2016JD025740.CrossRefGoogle Scholar
Turisini, M, Amati, G and Cestari, M (2023) Leonardo: A Pan-European pre-exascale supercomputer for HPC and AI applications. https://arxiv.org/abs/2307.16885.Google Scholar
van der Meer, M, de Roda Husman, S and Lhermitte, S (2023) Deep learning regional climate model emulators: A comparison of two downscaling training frameworks. Journal of Advances in Modeling Earth Systems 15(6), e2022MS003593. https://doi.org/10.1029/2022MS003593.CrossRefGoogle Scholar
Veličković, P, Cucurull, G, Casanova, A, Romero, A, Lio, P and Bengio, Y (2018) Graph attention networks. https://arxiv.org/abs/1710.10903.Google Scholar
Wang, C, Wang, P, Wang, P, Xue, B and Wang, D (2022) A spatiotemporal attention model for severe precipitation estimation. IEEE Geoscience and Remote Sensing Letters 19, 1–5. https://doi.org/10.1109/LGRS.2021.3084293.Google Scholar
Wilby, RL (2004) Guidelines for use of climate scenarios developed from statistical downscaling methods. In Supporting Material of the Inter-Governmental Panel on Climate Change. Available from the DDC of IPCC TGCIA, 27.Google Scholar

Author comment: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR1

Comments

Dear Professor Monteleoni,

I am pleased to submit our manuscript entitled “Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect approach”, authored by Valentina Blasone, Erika Coppola, Guido Sanguinetti, Viplove Arora, Serafina Di Gioia and Luca Bortolussi, for consideration for publication in the Environmental Data Science journal. This paper is an original contribution; it is not published in the present or any other form elsewhere, nor is it under consideration for publication in any other journal. It presents a new machine learning-based climate emulator, combined with an innovative training strategy, to derive high-resolution precipitation projections, and we believe it will be of significant interest to your readers.

Please feel free to contact me if you need any additional information.

Sincerely,

Valentina Blasone

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

This study tackles the problem of RCM emulation with deep learning approaches, which is an active area of research. The setup proposed by the authors intends to be model-agnostic, and leverages a graph-based architecture, which is relatively new in this field. The paper is generally well written, easy to follow, and the evaluation part is based on relevant diagnostics. However, I struggled a bit to understand the reasons behind the relatively complex setup and the added value of the graph approach. I guess these issues could be removed in large part by improving the positioning with respect to the state-of-the-art and including a comparison to existing approaches. For these reasons, I consider that the paper in its present form deserves substantial revision before publication in EDS.

Main comments.

1. My first concern is about the experimental setup and objectives of the present work. I recommend the authors clarify the positioning of their work with respect to the state-of-the-art in the introduction. In particular, the different types of statistical downscaling can be confusing. Page 3, L38-40, perfect model framework is not equivalent to perfect prognosis and imperfect model framework is not the same as MOS. As two main approaches, I would suggest: observation downscaling and RCM emulation, and within each method there are 2 possible options, corresponding to perfect prognosis/super-resolution and perfect/imperfect frameworks respectively (see Rampal et al. 2024, especially their Figure 3, which introduces the different setups very clearly).

What I understand is that you train your model in the framework of observation downscaling with the perfect prognosis approach. The model trained under these conditions is then used to perform RCM emulation (in a perfect model framework), which is a kind of transfer learning. Is that right? This design is quite complicated, and it should be carefully explained. The terms real-world and model-world are also confusing. As an attempt to limit the vocabulary, I think real world could be replaced by downscaling and model world by emulation. I also wonder if there are similar configurations in the literature. I am aware of some observation downscaling/perfect prognosis works in studies by Bano-Medina et al. and Rampal et al. for instance, but they don’t seem to apply their models for RCM emulation. In addition, I don’t understand why your framework is named hybrid imperfect approach? Is that hybrid because you’re using as RCM emulator a model trained for observation downscaling? And why imperfect? If your main objective is to perform RCM emulation, then you should really explain why you don’t simply train a RCM emulator?

2. Another major concern relates to the motivations and added value of using a GNN rather than more standard approaches, such as convolutional networks.

GNN is a complicated architecture, its use should be carefully justified, and if possible explicitly demonstrated. Some reasons are given p7 L16-19. I understand the two different coordinate systems, but is it really a sufficient reason for a GNN approach? GRIPHO could also be processed on a regular lat/lon grid? Regarding domain transferability, I guess it can also be achieved with CNN.

As the use of a GNN is presented as the main innovation of your work, comparing its performance to that of a baseline model seems essential.

3. The list of large-scale predictors is quite long for downscaling a single parameter. I guess it could be reduced without loss of quality. Did you look at predictor importance? Using time series of predictors is quite unusual (as far as I know) and less documented, and I agree this should to some extent improve the results. However, your temporal length of 24h seems very long. As it is quite a new setup, I would recommend testing the sensitivity of the results to the size of this temporal window, or at least evaluating the gain of using time series (and RNN).

4. I agree precipitation is a highly skewed variable that deserves specific attention. Thresholding and logarithmic transformation are common pre-processing steps that often give good results. In addition, the authors propose a strategy based on 2 models. Could you clarify what you mean by ‘the regressor is trained only on targets where precipitation values exceed the threshold’? Do you mask the nodes where precipitation is below the threshold? Do you also apply the precipitation thresholding on the GNN outputs? In the end, between RC and R-all, which strategy would you recommend?

Regarding the imbalance problem, Ravuri et al. proposed importance sampling to design meaningful datasets with skewed data. It can be an avenue for future work.

Specific comments

1. GRIPHO grid: what is the reason for choosing a Lambert grid compared to a regular lat/lon grid?

2. More details are needed on the GNN design.

- I understand the processor directly operates on the high-res grid, is that right? If yes, it can be computationally expensive; did you consider introducing a coarser-scale grid for the processor?

- Can you confirm the processor consists of 5 message-passing layers? How is this number chosen?

- Is there only one mesh level (compared to some approaches that include various refinement layers)?

- P7 L12-13: I guess processor should be replaced by predictor here.

3. Carefully define α, which is used twice, in eq (3.1) and (3.3), but with different meanings and values I suppose. This should be modified (and the definition of α in eq (3.1) should be given).

4. P10 L23: could you clarify what you mean by hourly average precipitation, frequency and intensity? In my interpretation, average precipitation is computed using zero and non-zero precipitation, while intensity is computed with non-zero precipitation only, but I’m not sure. If frequency is defined by percentage of wet hours, I don’t understand why the unit is mm/hr (Figure 5 and similar).

5. Figure 5: do you have an idea why the intensity is overestimated in JJA? Is it observed for both northern and south-central Italy?

6. Figures 5 and 6: legend is missing for panel (c).

7. Interpretation of results from Figures 5/6: do you have any ideas to explain the overestimation over complex topography (and underestimation in plain areas)? Can it be related to the tuning of the QMSE loss? I don’t really see the diffusive behaviour of RC, is it also observed when comparing the power spectra?

8. Figure 7: it does not seem fair to show cases from both training and testing datasets. I recommend showing only events from the testing set.

9. P12 L7-12: the domain transferability is demonstrated to some extent. I am a bit puzzled by the correlations in Table 2, which are significantly lower in the Central-South domain, compared to the North one. Given these results, it may be worth being more cautious in the conclusions.

10. P13 L37: could you clarify the difference you make between downscaling and emulation? I have the feeling both terms are used without distinction throughout the paper, but in the end there seems to be a subtle (and maybe important) difference. This clarification is also linked to major point 1.

11. P13 L45-46: I would be more cautious here, as the downscaling is far from perfect. You could add ‘with a relatively good accuracy’ at the end of the sentence.

12. Figure 9: add legend of panel (b).

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

Title: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect approach

Authors: Blasone, V., Coppola, E., Sanguinetti, G., Arora, V., Di Gioia, S. and L. Bortolussi

Journal: Environmental Data Science

Nr: EDS-2024-0049

In this manuscript the authors present a novel approach to downscale climate information to local spatiotemporal scales. Specifically, hourly precipitation at so-called km-scales. These scales are fine enough to resolve sub-daily mesoscale characteristics of precipitation (e.g., diurnal cycles) and short duration, high impact extreme precipitation events. The novelty of the approach hinges on the chosen architecture and the training strategy. Graph Neural Networks (GNN) such as the one employed here have demonstrated considerable flexibility when transferred to domains not seen in training. Spatial transferability is a known challenge in the emerging sub-discipline of AI for climate downscaling and demonstrating this capability represents a significant advance. The training strategy, a so-called Hybrid Imperfect Approach, aims to leverage the advantages conferred by using reanalysis data and observations. The reasons for this are twofold. One, training on observations negates the potential confounding effects of model errors and biases. Two, this training approach has, in principle, better generalization properties as it is not climate model-specific. The authors demonstrate this generalization capability, albeit in a limited way, in their “model-world” analyses. This is also a significant advance as machine learning algorithms trained in the “perfect model” framework often struggle to adapt to new inputs/predictors unseen in training (whether they come from global or regional climate models). A third advance this manuscript presents is the ability of deep learning algorithm to extrapolate to climate states unseen in training. Specifically, the GNN algorithm can capture the distribution shifts in hourly precipitation under future climate conditions. The shifts in the extremes are well represented, which is a prerequisite for employing these tools in developing ensembles of local scale climate projections. Extrapolation is a known challenge for machine learning and a long-running and well-documented criticism of traditional empirical statistical downscaling (ESD) techniques. While the examples provided here is not a comprehensive demonstration, it is a very promising first step.

While the scientific contribution of the paper is significant, I believe the writing could use some improvement. There are several unclear passages and run-on sentences that detract from its overall impact, some missing discussion points, and one analysis that seems poorly justified. Further, the figures are of only middling quality. Improving them would greatly enhance the manuscript. I have several major comments/suggestions and many more minor comments/suggestions. However, I do not believe addressing these will require a major revision.

This is an important contribution to the field. As such, I believe it is worth taking the time to strengthen its impact. This can be done through improved figures, more discussion on strengths and limitations, removal of superfluous and ill-posed analyses, and clearer more concise writing. A note on syntax: in the “specific comments” section I use a P<x>L<y> notation where “P<x>” is page number and “L<y>” is line number(s).

<b>General comments</b>

1. The Impact Statement could be strengthened. I understand that these have strict character/word limits. Therefore, I suggest the authors shorten the first sentence. Modify the second and third sentence to emphasize the main advances (spatial transferability, future extrapolation, generalizability) and why they are important.

2. Throughout the manuscript the authors refer to ERA5 (~25km grid spacing) as “low resolution”. This is incorrect. Its effective resolution is sufficient to resolve synoptic scale circulations. State-of-the-art GCMs at ~50-250km grid spacing are appropriately referred to as “low resolution” while RCMs and CPRCMs are operating at resolutions that may be considered “high” to “very high”, respectively. What the authors effectively demonstrate is a downscaling from “moderate to high” resolution and not “low to high”. If they had shown that the algorithm effectively downscales a GCM to 3km then the latter would apply.

3. The introduction contains a helpful review of perfect model and imperfect frameworks. While it is beyond the scope to go into an in-depth discussion of strengths and weaknesses, a few additional sentences would help the reader understand the motivation for seeking a third way. For example, the perfect model framework does not, in the end, address the prohibitive cost issues that plague CPRCM simulations since it requires long future simulations for training. Also, the authors could link the discussion of bias mitigation, a few paragraphs later, more explicitly to these frameworks. Lastly, the authors provide many references for both perfect and imperfect frameworks, but they neglect to mention approaches that are similar to their own such as the approaches outlined by Hess et al., (2024) and Schmidt et al. (2025). These should be included for completeness.

4. A bit more discussion of GRIPHO is needed. What is the method used for interpolation to the common grid? How is observational uncertainty quantified (e.g., undercatch, instrument errors), if it is quantified at all? I would also suggest the authors include station locations in one of their figures (e.g., Figure 4). I suspect that areas of complex topography will also be areas with more sparse measurements, which leads to unreliable interpolation (see, Lussana et al., 2019). These areas are also where, unsurprisingly, the GNN4CD algorithm exhibits a mismatch with the observations.

5. There is a paragraph just before the description of the emulator that discusses the spatial and temporal length scale assumptions. However, it is unclear in the presentation of the emulator architecture, exactly where and how these elements are implemented. It appears the temporal issues are handled by the RNN pre-processor. The GAT layer presumably handles the spatial influence. However, details about how these layers treat the spatial and temporal scales are missing. Please provide some additional details about these elements of the emulator.

6. Is the empirical tuning of the hyperparameters a problem?

7. I worry that the fourteen years for training, one year for validation and one year for testing is quite unbalanced and may result in overfitting. General guidance in machine learning is 60-80% for training, 10-20% for validation, and 10-20% for testing. While I recognize and appreciate that general rules are meant to be broken, the split in this study is closer to 90-5-5. It would be helpful if the authors could defend this choice.

8. I appreciate the attempt by the authors to include an impacts analysis by examining the ability of the emulator to reproduce the precipitation signatures associated with historical flood events. However, I do not think it is appropriate to include results that come from the training period, even if some of the events are outside the training region. Considering the timing of these events (all Oct. or Nov.), even those outside the training region are not likely to be synoptically isolated. Only one of the events is completely outside the training period/region. As a solution, I propose the authors include only the 2016 event, which they can emphasize as a promising first result towards event-based impact assessments using AI-ML emulators. The authors could then include the other events as supplementary material, noting that inclusion in the main body of the results would not be appropriate but that they largely support the use of GNN4CD in this manner and that further research can/will be done to confirm this.

9. The conclusion section needs to be a bit more measured with respect to the advances conferred by GNN4CD. This is not to detract from their significance – it is really impressive! – but rather to more helpfully frame knowledge gaps and research directions. I save the specific suggestions for the following section but can generally say that there should be some discussion of the limitations of the present approach. For example, the emulator downscales coarsened CPRCM output for future climates, which is not the same as downscaling from an “unseen” GCM, because the “perfect model” set-up using coarsened data already contains much of the information needed to reproduce the precipitation signal. Also, the reader only ever sees seasonal performance in the diurnal cycle plots; all other figures show aggregated results. As such, we don’t know if the biases, for example, are uniform in sign and magnitude across different seasons or if the aggregated pattern arises from a single season. Such information can help discern where the emulator struggles and what processes it captures well.
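
To illustrate general comment 5 above, here is a minimal, purely hypothetical sketch of the kind of architectural detail that would help: a recurrent encoder that compresses the lagged temporal window at each node, followed by a graph-attention layer that mixes information only between nodes connected within the chosen spatial radius. The variable names, dimensions, and layer choices are assumptions made for illustration and are not taken from the manuscript.

```python
import torch
from torch import nn
from torch_geometric.nn import GATConv


class TemporalSpatialBlock(nn.Module):
    """Illustrative sketch: RNN pre-processor (temporal scale) + GAT layer (spatial scale)."""

    def __init__(self, n_vars: int = 5, hidden: int = 64, heads: int = 4):
        super().__init__()
        # Temporal scale: the GRU collapses each node's window of lagged
        # predictors into a single hidden state.
        self.temporal_encoder = nn.GRU(n_vars, hidden, batch_first=True)
        # Spatial scale: the attention layer exchanges information only
        # between nodes linked by an edge (i.e. within the graph radius).
        self.spatial_mixer = GATConv(hidden, hidden, heads=heads, concat=False)

    def forward(self, x_seq: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x_seq: [num_nodes, time_window, n_vars]
        _, h = self.temporal_encoder(x_seq)       # h: [1, num_nodes, hidden]
        h = h.squeeze(0)                          # [num_nodes, hidden]
        return self.spatial_mixer(h, edge_index)  # [num_nodes, hidden]
```

Spelling out, at a similar level of detail, how long the temporal window is and how the graph edges are constructed would answer the comment.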
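
The 90-5-5 figure mentioned in general comment 7 follows from simple arithmetic, assuming the fourteen/one/one year split described in the comment:

```python
splits = {"train": 14, "val": 1, "test": 1}  # years, as described in the comment
total = sum(splits.values())
for name, n_years in splits.items():
    print(f"{name}: {100 * n_years / total:.1f}%")
# train: 87.5%  val: 6.2%  test: 6.2%  -- i.e. roughly 90-5-5
```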

Specific comments

1. P1L24: spell out GNN4CD.

2. P1L29: replace “low-resolution” with “medium-resolution”.

3. P1L38: replace “effect” with “impacts”.

4. P2L30: replace “intense” with “expensive”.

5. P2L31: replace “prohibitive” with “prohibitively expensive”.

6. P2L36-38: The authors imply that ML can “improve” upon traditional models. I am unaware of any results that show emulators exhibit improvements over RCMs. I could be wrong but would encourage the authors to include a reference if this is the case.

7. P3L5: The Addison study is now on arXiv (see references below).

8. P3L19-21: The HIA approach is not well defined. It would be helpful if the authors could be more precise in its definition and describe (briefly) why it is a promising alternative to the purely “perfect” and “imperfect” training frameworks.

9. P5L34: I just wanted to state that, overall, the description of the GNN4CD emulator is excellent. Well-balanced in terms of technical detail and clearly and logically organized.

10. P7L47-48: The authors state that the scaling factor helps representation of rare events. But if the Classifier is solely focused on a binary categorization of rain/no-rain, how can it even begin to discern extreme events?

11. P9L29-32: Since the authors justify the use of a manual tuning of the hyper-parameters by invoking the “cost” of training, it would be helpful to know what this cost is. Also, it might be useful to describe the advantages/disadvantages of an automatic tuning approach (a sketch of one such approach is given at the end of these specific comments).

12. P11L24-27: This is a run-on sentence, and its meaning is not entirely clear. I suggest splitting the sentence in two and taking care with the sentence structure. Here is a suggested rewording: “The RC model exhibits a larger bias in average JJA precipitation compared to other seasons (Figure 5c). This arises predominantly due to too high precipitation intensities. The R-all model also exhibits large biases in JJA average precipitation; in this case too frequent precipitation is the main contributor (Figure A1c).” Hopefully, this kind of restructuring can be helpful in other sections of the manuscript where the sentences are too long and try to include too many details.

13. P11L27-29: The peak in the JJA diurnal precipitation cycle is quite clearly late afternoon/early evening (17:00-18:00).

14. P11L31-34: Another confusing sentence. Precipitation is clearly overestimated over areas of complex topography. But what does “where observations are higher” mean? As an aside, this section is where a more robust consideration of the limitations of the observational dataset could fit (see general comment 4).

15. P11L40: Add comma after “Conversely”.

16. P12L45-48: This is a nice result and augurs well for the emulator’s ability to extrapolate to unseen/unknown future climates. However, I think the authors need to temper this statement, as the test is performed in a perfect model setup and, as such, considerable information about the resulting precipitation distribution is already contained in the coarsened data (so-called upscaled added value). A more robust test would have been to take a GCM, or an uncoarsened RCM, as input.

17. P13L11-12: The authors need to be a bit more transparent here. The emulator is not just “wetter”; it actually exhibits opposite-signed responses over the entirety of Italy for p99, and over specific areas for the average change. Interestingly, there is good agreement for p99.9, which raises important questions about the stability of the emulator. These inconsistencies should not be viewed as negatives. Rather, they serve an important purpose in highlighting knowledge gaps.

18. P13L13: Delete “instead”.

19. P13L29: Replace “to” with “with”.

20. P13L37: Replace “low-resolution” with “medium-resolution”.

21. P13L41: “biases”.

22. P13L42-43: Re-word. I suggest, “This training strategy, which we refer to as HIA, should facilitate the ability of the emulator to generalize to climates and models unseen in training.”

23. P13L47: Replace “leaded” with “leads”.

24. P13L52: Delete “instead”.

25. P14L13-15: Another confusing run-on sentence. This is an important implication, so it is well worth re-wording. Suggestion: “This is important as spatial transferability is a unique feature of this emulator and has the potential to extend the emulator’s application to remote and/or data sparse regions of the world.”

26. P14L20-22: I’ve often wondered why large resolution jumps are a problem in ML. In dynamical downscaling the problem is clear. In traditional ESD we often make jumps from hundreds of kilometres to point scales. The authors needn’t address this in the manuscript, but I am curious. Hess et al. (2024), for example, claim (implicitly) that the resolution jump isn’t a problem; the only limitation is the resolution of the training “target” dataset.

27. P14L23-26: Run-on sentence. Consider breaking in two. Suggestion: “These future research directions will help further establish the effectiveness and reliability of the GNN4CD emulator. Doing so will put high-resolution ensembles of climate projections, generated at a fraction of the cost and time compared to dynamical methods, within reach.”
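
To illustrate what the “automatic tuning approach” raised in specific comment 11 could look like in practice, here is a hedged sketch of a plain random search wrapped around an existing training/validation routine. The hyper-parameter names, ranges, and the `train_and_validate` function are assumptions made for the example and are not those of GNN4CD; the trade-off is extra compute in exchange for reproducible, broader coverage of the search space.

```python
import random

# Hypothetical search space: names and ranges are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "hidden_size": [32, 64, 128],
    "num_gnn_layers": [2, 3, 4],
}


def random_search(train_and_validate, n_trials: int = 20, seed: int = 0):
    """train_and_validate(config) -> validation loss; assumed to be provided."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyper-parameter for this trial.
        config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        loss = train_and_validate(config)  # the expensive step the authors refer to
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss
```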

Figures

The figures could use some improvement. Doing so will greatly increase the impact of the manuscript. I only comment on the figures in the main body of the text, but my points also apply to the figures in the appendix.

Figure 4. This figure takes up a lot of space yet communicates very little. I suggest adding e.g., station locations to the map in panel a. Panel b is unnecessary and can be removed. It is also misleading because 2007-2016 is in the “training” period but the caption states it is in the “inference” period.

Figure 5. The PDFs are nearly indecipherable. I suggest taking the approach of Addison et al. (2024); see their Figure 3a. This will more clearly show the separation and/or overlap (a minimal example of this kind of presentation is sketched at the end of these figure comments). The yellow lines are very faint; choose a color with better contrast. In panel c the frequency is shown in mm/hr. This is incorrect: frequency should be either a fraction or a percentage. It is also unclear whether the diurnal cycles are computed over the training region or all of Italy. Lastly, the caption is missing critical details.

Figures 6, 7, 9. It is difficult to tell due to the small size of some of the panels, but it appears that the “rainbow” colormaps are not perceptually uniform (quite evident in Figures 6 & 9). Some alternatives can be found here: https://colorcet.com/.

Figure 8. See the comments about the PDFs in Figure 5. The issue is even worse here, as the authors are trying to show the shift in the future distribution simulated by the CPRCM and how well the emulator reproduces it. It is impossible to discern this as currently displayed.

Figure 10. I suggest showing box-plots as they contain more information than just mean + 95% confidence interval. Also, the lines/colors in the bottom panels are unreadable. Lastly, the caption should state whether these are computed over all of Italy or just the training region.
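
As a concrete illustration of the presentation suggested for the PDFs in Figure 5 (and, by extension, Figure 8), the sketch below plots hourly-precipitation densities on a logarithmic frequency axis so that separation or overlap in the tails becomes visible. It is a generic example, not a reproduction of the manuscript’s figures or of Addison et al. (2024); the bin edges and labels are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_pdf_comparison(obs, model, bins=np.arange(0.0, 100.0, 2.0)):
    """Overlay two hourly-precipitation PDFs with a log-scale density axis."""
    fig, ax = plt.subplots()
    for data, label in [(obs, "GRIPHO"), (model, "GNN4CD")]:
        density, edges = np.histogram(data, bins=bins, density=True)
        centres = 0.5 * (edges[:-1] + edges[1:])
        ax.plot(centres, density, label=label)
    ax.set_yscale("log")  # tail behaviour is only readable on a log axis
    ax.set_xlabel("hourly precipitation (mm/h)")
    ax.set_ylabel("density")
    ax.legend()
    return fig
```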

References

Addison, H., Kendon, E., Ravuri, S., Aitchison, L., & Watson, P. A. (2024). Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model (No. arXiv:2407.14158). arXiv. https://doi.org/10.48550/arXiv.2407.14158

Hess, P., Aich, M., Pan, B., & Boers, N. (2024). Fast, Scale-Adaptive, and Uncertainty-Aware Downscaling of Earth System Model Fields with Generative Foundation Models (No. arXiv:2403.02774). arXiv. https://doi.org/10.48550/arXiv.2403.02774

Lussana, C., Tveito, O. E., Dobler, A., & Tunheim, K. (2019). seNorge_2018, daily precipitation, and temperature datasets over Norway. Earth System Science Data, 11(4), 1531–1551. https://doi.org/10.5194/essd-11-1531-2019

Schmidt, J., Schmidt, L., Strnad, F., Ludwig, N., & Hennig, P. (2025). A Generative Framework for Probabilistic, Spatiotemporally Coherent Downscaling of Climate Simulation (No. arXiv:2412.15361). arXiv. https://doi.org/10.48550/arXiv.2412.15361

Recommendation: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR4

Comments

No accompanying comment.

Decision: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R0/PR5

Comments

No accompanying comment.

Author comment: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR6

Comments

Dear Editor-in-Chief,

We are pleased to submit the revised version of our manuscript entitled "Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework". We believe that the Reviewers' comments have been very helpful in improving and enriching the work, and we hope that we have responded to all their comments in an appropriate manner.

We confirm that this manuscript is original, has not been published, and is not currently being considered for publication elsewhere. All authors have approved the manuscript and agree with its submission.

Thank you for considering this manuscript for publication.

Sincerely,

Valentina Blasone

On behalf of all the Authors

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR7

Conflict of interest statement

No competing interests.

Comments

I would like to thank the authors for their thoughtful and thorough responses to my comments. I am satisfied that the manuscript is ready for publication.

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR8

Conflict of interest statement

Reviewer declares none.

Comments

I thank the authors for the revision and answers to all my comments. I’m overall happy with the revision, and I only have some residual minor comments detailed below.

• 1. P6L49 ‘The term hybrid refers to the use of different domains’: in this context ‘domain’ can be misleading as it can also refer to geographical domain (which could apply in your case).

• 2. P8L33-24 ‘time steps with only targets below the threshold are removed, reducing the dataset to approximately 50% of its original size’: you don’t apply the same rule to the R-all case?

• 3. Figure 7: for each case it would be interesting to know how extreme the event is (for instance, is it a 90th or 99th percentile event?).

• 4. P14L42 ‘with slight overestimation of the precipitation estimate’: I wouldn’t say case (2) is a slight overestimation! This case would merit more in-depth investigation (however, I can understand that it is beyond the scope of this paper).

• 5. Figures 8a and 8d: you should be cautious when comparing with GRIPHO, since it does not correspond to the same period as the GNN4CD and RegCM estimates; it cannot be considered the ground truth here.

• 6. P15L27: “than”; P15L37: “These findings”.

• 7. Figure 10: the legend indicates that Bias % is computed against GRIPHO, but it would make more sense to compute it against RegCM (is it just a typo?).

• 8. I would add somewhere that choosing to emulate the most extreme RCP8.5 scenario makes the task particularly challenging for GNN4CD, and the results even more remarkable.

Recommendation: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR9

Comments

No accompanying comment.

Decision: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R1/PR10

Comments

No accompanying comment.

Author comment: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R2/PR11

Comments

Dear Editor-in-Chief,

I am pleased to submit the revised version of our manuscript. We believe that this version further improves the quality and clarity of the manuscript.

Thank you for considering this manuscript for publication in the Environmental Data Science journal.

Sincerely,

Valentina Blasone

On behalf of all the co-authors

Review: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R2/PR12

Conflict of interest statement

Reviewer declares none.

Comments

I thank the authors for their answers to my last comments.

Recommendation: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R2/PR13

Comments

No accompanying comment.

Decision: Graph neural networks for hourly precipitation projections at the convection permitting scale with a novel hybrid imperfect framework — R2/PR14

Comments

No accompanying comment.