Impact Statement
Accurate power production forecasts, particularly for solar and wind power which are sensitive to weather conditions, are critical for grid stability, optimizing renewable energy integration, and supporting the transition to cleaner energy. We predict national power output in France using time-varying images of weather and power generation units' capacity as input data for different machine learning models. The key finding is that image-based models outperform time series-based models. The results of this research provide a practical model benchmark usable by practitioners and policymakers.
1. Introduction
To meet the 2050 net-zero scenario of the European Union (EU) (United Nations Convention on Climate Change, 2015), reinforced by the European Green Deal, which aims at decreasing net greenhouse gas emissions by 55% by 2030 (European Commission, 2019), sustainable energy sources have become key to clean power production and reduced emissions from the energy sector in Europe. As power demand increases, however, reliance on fossil fuels remains high, accounting for 68% of the global primary energy consumed in 2023 and 40% of the electricity produced in the EU (British Petroleum [BP], 2024; Ritchie and Rosado, 2020). Electrification, coupled with more renewable and other low-carbon power supplies, is needed to reduce dependence on fossil fuels. To meet the $ {\mathrm{CO}}_2 $ emissions goals of the EU, solar and wind power generation need to double their capacity by 2030 to produce 48% of Europe's energy share (International Renewable Energy Agency [IRENA], 2020b).
France has set a target of reducing its emissions by 33% by 2030 compared to 1990, and has pledged to reach greenhouse gas neutrality by 2050 (Ministère de la Transition Ecologique, 2020). This involves an increase in renewable power capacity installed throughout the country. The capacity of solar and wind power plants has tripled since 2012, and this growth is expected to accelerate, with capacity planned to double between 2017 and 2028 (Ministère de la Transition Ecologique, 2019). Increasing renewable capacity comes with grid distribution challenges to prevent gaps between supply and demand, especially during the day when production may exceed consumption (Liu et al., 2023a). Accurate forecasts of power generation can improve the stability, reliability, quality, and penetration level of renewable energy (International Renewable Energy Agency [IRENA], 2020a). Solar and wind power sources depend on environmental and climate variables such as temperature, solar radiation, and wind speed, making their load highly variable (Engeland et al., 2017; Wang et al., 2019b). This variability creates obstacles for grid operators, who must constantly balance demand with supply. This is one of the reasons why specific models for understanding and predicting day-to-day renewable power generation have motivated interest from researchers and practitioners.
Many studies have addressed the problem of short- (10 min–1 h) to medium-term (3 h–3 days) forecasting of renewable power using weather data from stations or numerical weather prediction (NWP). The impact of weather data and variable importance on forecasting energy supply, photovoltaic (PV), and wind power has been studied thoroughly (Vladislavleva et al., 2013; De Giorgi et al., 2014; Zhong and Wu, 2020; Liu et al., 2023b). At the local scale, Malvoni et al. (2016) used solar radiation and temperature to predict the generation of a Mediterranean PV plant. The effect of various climates around the planet on hourly PV production was also investigated by Alcañiz et al. (2023). Other works, such as Ahmad and Hossain (2020), made use of weather forecasts to maximize hydropower generated from dams, while Couto and Estanqueiro (2022) examined model-based predictive features for wind power predictions. Frequently, the availability of accurate weather observations is a bottleneck when working with a dedicated local area, not to mention their inherent sparsity and noise level, leading researchers to prefer NWP. Yet, when both types of weather data are available, they can be combined (Sharma et al., 2011; López Gómez et al., 2020).
Recent advances in forecasting variable renewable energy generation have seen statistical, machine learning, and deep learning models gain popularity among practitioners (Wang et al., 2019a; Iheanetu, 2022; Krechowicz et al., 2022; Tsai et al., 2023). Thanks to the increase in weather and power data availability and quality, such models have proven useful in revealing driving factors and learning from complex patterns (Sweeney et al., 2020). Depending on the spatial and temporal scale, statistical models can outperform traditional physics-based models, which has motivated the development of hybrid models (Bellinguer et al., 2020; Castillo-Rojas et al., 2023; Gijón et al., 2023). The link function between weather conditions and the power output of PV panels or wind turbines has been thoroughly investigated through different types of models (Dolara et al., 2015; Mayer and Gróf, 2021; Zhou et al., 2022; Bilendo et al., 2023). Still, challenges remain when developing models for a large region or country.
Statistical data-driven models such as auto-regressive moving average (ARMA) and their variants (ARIMA, ARIMAX, SARIMA, and SARIMAX) have demonstrated reasonable performance, as shown in recent work (Chen and Folly, 2018; Ryu et al., 2022). Support vector machines, k-nearest neighbors, generalized additive models (GAM), and tree-based and boosted models have also given good performance in forecasting power output from weather data (Kim et al., 2019; Condemi et al., 2021). Current trends have seen the use of artificial neural networks, computer vision (CV), and natural language processing models, whose application in renewable power forecasting shows promising performance. Multilayer perceptrons (MLP), convolutional neural networks (CNN), vision transformers (ViT) (Lim et al., 2022; Keisler and Naour, 2025), and sequence architectures such as recurrent neural networks or long short-term memory deep learning models have also been applied in various renewable energy forecasting frameworks (solar and wind) (Elsaraiti and Merabet, 2022; Abdul Baseer et al., 2023). A key advantage is their flexibility and ability to combine several data sources to make predictions, not to mention the different ways they can exploit complex spatiotemporal data.
Research on statistical models is not limited to model architectures. Data preprocessing techniques are also important for improving forecast performance. Principal component analysis (PCA), wavelet decomposition, time series detrending, and exponential smoothing can be applied to extract relevant features, reduce dimension, remove noise, or reveal pertinent phenomena in the data (Liu and Chen, 2019; Iheanetu, 2022). These techniques are mainly used as a first step to improve the robustness and performance of a model. It is important to point out that such techniques can be applied regardless of the type of data at hand, whether time series or gridded data over a region, although the latter option is less explored.
Besides the methodology and models used for forecasting, differences between studies arise from the input and output data. Depending on the purpose and the availability of the data, the time and space resolution as well as the temporal and spatial ranges differ between studies (Engeland et al., 2017). Research works encompass scales from short-term single-plant forecasts with a time resolution of 5–10 minutes (Malvoni et al., 2017; Ryu et al., 2022; Gijón et al., 2023) to medium-term daily forecasts of a region (Kim et al., 2017). However, due to the lack of good-quality available data, regional forecasts are often built from single-plant forecasts aggregated to the desired region, that is, an indirect prediction of the regional power supply. Moreover, the temporal scale rarely exceeds a few years' worth of data (Chen and Folly, 2018; Iheanetu, 2022). Thus, gaps exist between short- to medium-term and regional forecasts, leading to difficulties in comparing results between studies and improving modeling performance.
Most prior studies have used a bottom-up approach based on single-plant models, which neglects the integration of spatial information for prediction. Additionally, many existing models enhance their performance by incorporating lagged values of the target time series itself, such as the power supply from the previous day or hour. To overcome these limitations, in this study we use supervised machine learning models and test the impact of using spatially resolved data as model inputs, while excluding lagged values of the target time series from the inputs. The first goal is to assess the influence of the model calibration procedure, especially the cross-validation protocol, on the error estimation of time series-based models. The second goal is to compare models ingesting explicit weather "images" against models using spatially averaged variables as inputs.
We first explain how we build input datasets for wind and PV production, integrating spatially resolved weather data and generation units' capacity and locations. These input images span the period from January 1, 2012, to December 31, 2023, at hourly resolution, as presented in Section 2. Second, we present three different modeling approaches to handle the gridded weather data and forecast daily wind and PV power production in Section 3.1. Finally, we explore cross-validation and hyperparameter optimization procedures in Section 3.3 to give insights and recommendations for model calibration, before benchmarking widespread state-of-the-art machine learning models on our different modeling approaches in Section 4.
2. Data
In this section, we describe the target power supply data, the input weather data and power units data, and other input data sources, with the processing workflow to prepare them as input for supervised learning approaches. Figure 1 presents the overall approach, with more details given in the following sections.

Figure 1. Global framework of this study represented schematically.
2.1. Target data
We used as targets the wind and solar power from the RTE $ {\mathrm{eCO}}_2\mathrm{mix} $ database. RTE is the public French national Transmission System Operator (TSO) managing the whole electrical grid. RTE provides near-real-time data on electrical consumption, production, flows, and $ {\mathrm{CO}}_2 $ emissions within the $ {\mathrm{eCO}}_2\mathrm{mix} $ application. Electricity production data from RTE cover eight sectors: coal, oil, gas, nuclear, hydro, solar, wind, and bioenergy. We recovered production data for nondispatchable renewable wind and solar power. Solar refers to photovoltaic solar panels and wind to both onshore and offshore turbines.
Time-wise, data are available since January 1, 2012, and were retrieved until December 31, 2023. The resolution is half-hourly from January 1, 2012, to January 31, 2023, and quarter-hourly from February 1, 2023, to December 31, 2023. We aggregated the data to an hourly resolution to be consistent with the time resolution of our inputs (see Section 2.2). As data are available at the country (NUTS0) or regional (NUTS1) scale, we chose to work directly with country-scale data. This dataset excludes Corsica and other French islands or overseas territories, which are considered self-sufficient in electricity.
France is part of the EU electricity market and the EU grid interconnection. In this work, we aimed to model the electrical power produced using solar and wind from France only, without taking into account any connection with neighboring countries. Therefore, we did not integrate imports and exports into our power supply target and retained only the production data, presented in Figure 2.

Figure 2. Power supply and capacity time series for wind and solar in France for the period of interest. The power capacity curves have been smoothed to a yearly resolution.
2.2. Input data
Our input data are based on gridded weather data weighted by the power capacity available at the given time and location, the electricity day-ahead spot price, and other temporal features such as the time or day of the year. We combined several high-quality open-access databases from French governmental or government-affiliated organizations to create coherent inputs.
2.2.1. Weather data
We recovered hourly weather data from the ERA5 reanalysis (Hersbach et al., 2020) on single levels for the period of interest from January 1, 2012, to December 31, 2023. We used the domain bounded by 51° North, 42.5° South, $-$4.55° West, and 7.95° East, which covers France, re-interpolating the original spatial grid of 0.25° $\times$ 0.25°, or about 30 km $\times$ 30 km. The weather variables we selected are those usually used for renewable power prediction: temperature at 2 m, northward and eastward wind speed at 10 and 100 m, instantaneous wind gust speed at 10 m, surface solar radiation downwards, total precipitation, evaporation, and runoff (Table A1). To select the variables relevant to wind and solar power, we used the mutual information between weather variables and power supply targets (Kraskov et al., 2004). We normalized the mutual information to one and kept only variables with a score higher than 20%, as sketched below. This leads to hourly maps with 35 latitude and 51 longitude points for each considered variable, stored in netCDF files.
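A minimal sketch of this screening step, assuming scikit-learn's mutual information estimator (which follows Kraskov et al., 2004); the paper does not specify its implementation, and variable names here are illustrative:

```python
from sklearn.feature_selection import mutual_info_regression

def select_weather_variables(X, y, names, threshold=0.20):
    """Keep variables whose normalized mutual information with the
    power supply target exceeds the 20% threshold used in this work.
    X: (n_samples, n_variables) hourly weather values; y: hourly supply."""
    mi = mutual_info_regression(X, y, random_state=0)
    mi = mi / mi.max()  # normalize the scores so the maximum is one
    return [name for name, score in zip(names, mi) if score >= threshold]
```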
2.2.2. Power units location, capacity, and activity
To get information on the location of facilities with installed solar panels or wind turbines, we used yearly released data from the Opérateurs Réseaux Energies (ORE) agency database of all electrical facilities used for producing or storing electricity in France. The inventory published on December 31, 2023, contained around 84,000 electricity-producing units, among which 2,183 are wind facilities and 72,703 are PV farms. Rooftop PV panels dedicated to autoconsumption are not included. Because the ORE dataset did not provide the exact location of each facility, we merged it with the French governmental city database using the city ID, to allocate each facility to a 30-km grid cell of our weather maps. A city refers to a NUTS4 entity; the city ID is a unique identifier provided to every French city by the Institut National de la Statistique et des Etudes Economiques. Facilities whose city IDs were missing in ORE accounted for less than 2% of the data and were discarded. We assigned facilities to their corresponding wind or solar sector, keeping only PV panels for solar and including both offshore and onshore turbines for wind. The maximum power (in MW) that each facility can produce, as provided by ORE, was used as its capacity. Power capacity data were missing for 0.25% of the data, which were likewise discarded. To account for the activity period of each facility, we added its start and stop dates. If the stop date was not given in the ORE inventory, we assumed that the facility was still in activity. For the start date, we used the start-up date or the date the plant was connected to the grid; we verified that these two dates were close to each other for facilities where both were reported. After adding latitude, longitude, sector, power capacity, and start/stop dates for each facility, we dropped only 4.4% of the initial ORE dataset. Most of those discarded plants are located overseas or in Corsica.
2.2.3. Power-weighted weather maps
We generated power capacity-weighted weather maps by assigning each power facility to the nearest grid cell in the gridded hourly weather data. The weather parameters are then multiplied by power capacity weights built from the capacity $ {P}_{i,j}^t $ (in MW) at time $ t $ and latitude–longitude cell $ i,j $. We use a spatiotemporal normalization of the weights to account for the fact that nondispatchable renewable energy sources have seen their available production capacity increase in the last few years (see Figure 2). Because this behavior is expected to continue, it is important to account for it in the model's input. Figure 3 recaps the weighted weather map creation schematically, and a code sketch is given after Figure 3.

Figure 3. Illustration of power-weighted weather maps creation for wind.
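The sketch below illustrates this construction. The exact spatiotemporal normalization used in the paper is not reproduced; dividing by the overall maximum capacity is our assumption, and the facility tuple layout is illustrative.

```python
import numpy as np

def capacity_maps(facilities, lats, lons, times):
    """Build (time, lat, lon) capacity maps from a facility inventory given
    as (lat, lon, capacity_mw, start, stop) tuples, assigning each facility
    to its nearest grid cell for its activity period."""
    maps = np.zeros((len(times), len(lats), len(lons)))
    for lat, lon, cap, start, stop in facilities:
        i = np.abs(lats - lat).argmin()  # nearest latitude index
        j = np.abs(lons - lon).argmin()  # nearest longitude index
        active = (times >= start) & (times <= stop)
        maps[active, i, j] += cap
    return maps

def weight_weather(weather, capacity):
    """Multiply a (time, lat, lon) weather field by normalized capacity
    weights; normalizing by the all-time maximum is an assumed form of the
    paper's spatiotemporal normalization."""
    return weather * (capacity / capacity.max())
```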
2.2.4. Additional input features
To ensure that models could grasp all of the seasonality and trend, we added two temporal features, as is usually done in the electricity forecasting literature (Chatfield, 1986; Taylor, 2010; Goude et al., 2014): the time step converted to a numerical integer, and the day of the year encoded using a cosine, $ {doy}_{cos}=\cos \left(\frac{2\pi {doy}_{int}}{365}\right) $, where $ {doy}_{int} $ is the day of the year encoded as an integer between 1 and 365. We used these two temporal features for the wind and solar sectors. However, to be more consistent with the physical process of producing electricity with PV panels, we replaced $ {doy}_{cos} $ for solar with the sunshine duration of the day, computed from sunrise and sunset times for every grid cell and time step.
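A short sketch of this encoding, assuming pandas timestamps (the sunshine duration computation is omitted, as it depends on a solar geometry routine not specified here):

```python
import numpy as np
import pandas as pd

def temporal_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Integer time step and cosine-encoded day of year, as described above."""
    doy = np.asarray(index.dayofyear)
    return pd.DataFrame(
        {
            "t": np.arange(len(index)),                  # time step as integer
            "doy_cos": np.cos(2 * np.pi * doy / 365.0),  # seasonal encoding
        },
        index=index,
    )
```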
Even though PV and wind power supply to the grid are related to weather conditions, they also depend on the demand that electricity providers need to meet. The last few years have seen negative electricity prices on the market soar when electrical demand was low and the available renewable power was in oversupply. This has led to a new practice from electricity providers called curtailment, which consists of deliberately restricting the electricity generation from renewable energy sources to prevent negative prices (De Vita et al., 2020; Biber et al., 2022; Yasuda et al., 2022). Thus, we added as input the hourly electricity spot price for France from ENTSO-E. There are different ways participants trade electricity on the market and therefore different electricity prices. We chose the auction day-ahead spot price, as it is the only one that can be freely retrieved through ENTSO-E; it is the price of a $ \mathrm{MW}\;{\mathrm{h}}^{-1} $ decided the day before delivery through an auction.
The above-described data processing methodology and workflow gave us input and target datasets for solar and wind power, designed for a supervised learning approach and consisting of a set of $ \left(X,Y\right) $ observations. $ X $ refers to hourly weather maps gridded over France for each selected weather variable, weighted by the power capacity of plants located in the corresponding cells; it also includes the day-ahead spot price and temporal features such as the time and day of the year or sunshine duration. $ Y $ refers to the corresponding electrical power produced during this hour.
3. Models and calibration
This section describes the models we tested to predict electricity power production from weather variables. It also includes a discussion on model calibration techniques.
3.1. Modeling choices and approaches
As we aimed to develop models able to predict the power production of PV and wind for a given day, from the weather conditions, day-ahead price, and temporal features of that same day, we aggregated all input data from hourly to daily resolution. Aggregation also helped to increase the signal-to-noise ratio and prevent overfitting when predicting daily power from hourly data. This leads to a day-to-day prediction approach that does not use values from previous days. In operation, real forecasts could then be easily obtained with our model by plugging in daily weather forecasts from numerical weather prediction models.
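In code, this aggregation is a one-liner; taking the daily mean (rather than the sum) is our assumption, consistent with a daily target expressed in MW:

```python
import pandas as pd

def to_daily(hourly: pd.DataFrame) -> pd.DataFrame:
    """Aggregate hourly inputs or targets to daily resolution."""
    return hourly.resample("D").mean()
```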
3.1.1. Model architectures
We chose to test three modeling architectures of increasing complexity, as summarized in Figure 4: first, using power-weighted weather images averaged over the whole French territory; second, applying a dimension reduction method to the power-weighted weather maps; and third, applying a vision or image-based technique.

Figure 4. Representation of the three modeling approaches used in this work to make use of weather maps.
Models using spatially averaged images as input
The first approach is to train models on spatially averaged input data, yielding a time series-to-time series regression framework. After averaging, the weather time series are combined with the price and temporal feature series to leverage one-to-one models (models using one input point to predict the corresponding target point). In this family of models, we tested linear regressions, generalized additive models, tree-based models, boosting, and artificial neural networks, all proven capable of reaching state-of-the-art performance (Wood et al., 2014; Gaillard et al., 2016; Krechowicz et al., 2022; Chen et al., 2023; Liu et al., 2023b).
Models using dimensionally reduced input images
The second approach is to use dimension reduction techniques to extract key features from our high-dimensional power-weighted weather maps before combining them with price and other time features for training a model (Teste et al., 2024). Several dimension reduction methods exist, ranging from empirical orthogonal functions, widely used in the earth sciences community, to autoencoders based on deep network architectures. These methods reduce the dimension of the input space while providing rich features. In this work, we focused on PCA and optimized the number of principal components like any other model hyperparameter. After obtaining the principal components, which behave as time series, we applied the same models as for the spatial average: tree-based models, GAM, and NN.
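An illustrative sketch of this pipeline, with a random forest as the downstream regressor; the function name, shapes, and component count are ours, not the paper's:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

def fit_pca_pipeline(maps, scalars, y, n_components=20):
    """maps: (n_days, n_vars, n_lat, n_lon) daily weighted weather maps;
    scalars: (n_days, n_features) price and temporal features."""
    flat = maps.reshape(len(maps), -1)     # flatten variables and space
    pca = PCA(n_components=n_components)   # n_components tuned like any
    pcs = pca.fit_transform(flat)          # other hyperparameter
    X = np.hstack([pcs, scalars])          # principal components + scalars
    model = RandomForestRegressor(random_state=0).fit(X, y)
    return pca, model
```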
Models using images as input
The third approach consists of building models capable of directly ingesting the power-weighted weather maps alongside price and temporal features. Here, we used a CNN architecture, previously shown to perform well in image classification, segmentation, and regression tasks, even though CNNs are now slowly being replaced by better-performing ViTs (Keisler and Naour, 2025).
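As an illustration of this third approach, the following PyTorch sketch shows one way a CNN can ingest the weather maps together with the scalar price and time features; the architecture and layer sizes are ours, not the paper's:

```python
import torch
import torch.nn as nn

class WeatherCNN(nn.Module):
    def __init__(self, n_vars, n_scalars):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_vars, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global spatial pooling
        )
        self.head = nn.Sequential(
            nn.Linear(32 + n_scalars, 64), nn.ReLU(),
            nn.Linear(64, 1),  # daily power production
        )

    def forward(self, maps, scalars):
        # maps: (batch, n_vars, 35, 51); scalars: (batch, n_scalars)
        z = self.conv(maps).flatten(1)
        return self.head(torch.cat([z, scalars], dim=1))
```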
3.2. Train, validation, and test subsets
We split our dataset into a training and a test subset for the evaluation of model performance. Our data are time-dependent: power production changed throughout the years, mainly due to the opening of new facilities. We chose the period from January 1, 2012, to December 31, 2022, as the train set and January 1, 2023, to December 31, 2023, as the test set. Nonetheless, hyperparameter tuning is a key step of model development, as it often makes the difference between poor and high-performing models. To perform hyperparameter optimization (HPO), we can use different cross-validation (CV) methods as well as different optimization frameworks. To ensure the robustness of our model selection procedure, we kept a validation set dedicated to the investigation of cross-validation and optimization methods, spanning the period from January 1, 2022, to December 31, 2022. After a proper model selection and HPO procedure is chosen, this validation set is included in the train set for final HPO and model calibration before evaluation on the test set, as described later.
3.3. Cross-validation and HPO
Cross-validation is used to approximate the generalization error, that is, the error of the trained model exposed to new unseen data (Hyndman and Athanasopoulos, 2018). Different techniques exist for splitting the training set into a new training set to train the model and a left-out test set to evaluate its performance and compute the approximated generalization error. This step is usually combined with HPO to select the best set of hyperparameters for a given model architecture. Selecting the best-suited calibration procedure is a complicated process (Arlot and Celisse, 2009; Bergstra et al., 2011), and we explain below the proposed optimization scheme.
3.3.1. Procedures inspected
Our data are time-dependent because our target is a power supply time series. Different studies have investigated which cross-validation procedure is best suited in this case (Tashman, 2000; Bergmeir and Benítez, 2012; Cerqueira et al., 2019). However, the scope of those studies was mainly synthetic and stationary, not to mention small (a few hundred points) time series. Another major limitation is that, even when real datasets were used, those modeling approaches involved lagged values of the target time series as predictors, which are excluded in our case. Therefore, we chose to study different cross-validation procedures and HPO algorithms to guide the choices for the calibration of our models. We did these experiments using only the models based on spatial averages of input weather images. The following cross-validation procedures were used:
• Hold-out: Split the training set into a train set and a test set.
• K-fold: Split the training set into $ K $ folds. At each iteration, one fold is chosen as the test set while the $ K-1 $ others form the train set. Iterate until every fold has been used as the test set once. The approximated generalization error is the average of the errors made on each test fold.
• Expanding: Split the training set into $ K $ folds following the order of the samples. At the $ {i}^{th} $ iteration, the first $ i $ folds are used as the train set and the $ {\left(i+1\right)}^{th} $ fold as the test set. Repeat until the entire training set has been used. The approximated generalization error is the average of the errors made on each test fold.
• Sliding: Split the training set into $ K $ folds following the order of the samples. At the $ {i}^{th} $ iteration, the $ {i}^{th} $ fold is used as the train set and the $ {\left(i+1\right)}^{th} $ fold as the test set. Repeat until the entire training set has been used. The approximated generalization error is the average of the errors made on each test fold.
• Blocking: Choose a block length $ l $ based on the temporal structure to conserve most of the correlation between neighboring samples. Split the training set into blocks of length $ l $ and attribute blocks to the train or test set at random (inspired by Wood, 2024).
Figure 5 shows the scheme of these five cross-validation methods. We used a 1-year test set for the hold-out method, 10 splits to get yearly folds for every fold-based method, and blocks of 7 days for the blocking method. The block size was chosen to keep most of the temporal structure, based on autocorrelation and partial autocorrelation analysis. We also considered shuffling variants of the K-fold and hold-out methods, which randomly shuffle the samples before attributing folds or subsets. A code sketch of the chronological and blocking splitters is given after Figure 5.

Figure 5. Different cross-validation procedures considered in this work represented schematically. For Hold-Out and K-Fold, only the method without prior random shuffling is represented.
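The sketch below implements the expanding, sliding, and blocking splitters under these settings; the blocking test fraction is illustrative, as the paper does not state it:

```python
import numpy as np

def expanding_splits(folds):
    """Folds in chronological order: train on folds 1..i, test on fold i+1."""
    for i in range(1, len(folds)):
        yield np.concatenate(folds[:i]), folds[i]

def sliding_splits(folds):
    """Train on fold i only, test on fold i+1."""
    for i in range(len(folds) - 1):
        yield folds[i], folds[i + 1]

def blocking_split(n_samples, block_len=7, test_fraction=0.2, seed=0):
    """Assign fixed-length blocks (7 days here) at random to train or test."""
    rng = np.random.default_rng(seed)
    block_id = np.arange(n_samples) // block_len
    blocks = np.unique(block_id)
    test_blocks = rng.choice(blocks, int(len(blocks) * test_fraction),
                             replace=False)
    test = np.isin(block_id, test_blocks)
    return np.where(~test)[0], np.where(test)[0]
```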
Regarding hyperparameter optimization, we compared two optimization algorithms: random search and Bayesian search using Gaussian processes (Bergstra et al., 2011; Bischl et al., 2023).
To assess the impacts of cross-validation and HPO for different model architectures, we repeated the experiments using three models: a random forest, a tree-based boosting scheme (XGBoost), and a feed-forward neural network (MLP). In total, this led to 7 cross-validation methods × 2 HPO algorithms × 3 models, that is, 42 estimators of the generalization error. At first glance, one might think that cross-validation procedures that respect the temporal order of the data are best suited to our approach; still, we wanted to make an informed decision by running the experiments. Our final goal is to choose the pairs of cross-validation techniques and HPO algorithms that give the "best" estimator of the generalization error, where best refers to different criteria ranging from the precision of the generalization error estimate to computational resource usage.
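The paper does not name an HPO implementation; the sketch below pairs scikit-learn's RandomizedSearchCV with scikit-optimize's Gaussian-process-based BayesSearchCV as one possible setup, with the fold construction and parameter ranges being illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV  # Gaussian-process-based Bayesian search

# Chronological (expanding) folds over the daily training samples;
# n is roughly the number of days in 2012-2022 (illustrative).
n = 4018
folds = np.array_split(np.arange(n), 11)
cv_splits = [(np.concatenate(folds[:i]), folds[i]) for i in range(1, 11)]

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": range(50, 501), "max_depth": range(2, 31)},
    n_iter=100, cv=cv_splits, scoring="neg_root_mean_squared_error")

bayes_search = BayesSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": (50, 500), "max_depth": (2, 30)},
    n_iter=100, cv=cv_splits, scoring="neg_root_mean_squared_error")
# Fit either search with search.fit(X_train, y_train).
```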
3.3.2. Cross-validation experiments
As cross-validation's main goal is to obtain an approximation $ \hat{\varepsilon} $ of the generalization error, we monitored how far the estimate was from the real error. To do so, we recorded, for each of the 100 optimization iterations, the test error made during cross-validation on the training part of the data for a given set of hyperparameters. Then, we compared it to the real generalization error $ \varepsilon $ made on the validation set. Here, the training and validation parts refer to those visible in Figure 1. Since we are dealing with a regression task, the error $ \varepsilon $ was taken to be the root-mean-squared error (RMSE) of the modeled and observed daily power production (see Appendix B for metric definitions). Our target being a daily power production time series, the unit of RMSE is MW. Given the real generalization error $ \varepsilon $ and its estimate $ \hat{\varepsilon} $ from cross-validation, for each procedure we computed the difference $ \Delta \varepsilon =\varepsilon -\hat{\varepsilon} $ and analyzed its average $ \overline{\Delta \varepsilon } $ and standard deviation $ \sigma \left(\Delta \varepsilon \right) $ across the hyperparameter sets. We also determined the optimum value of $ \hat{\varepsilon} $ reached after optimization and compared it with the real error through $ \Delta {\varepsilon}_{min} $.
During the experiments, we monitored the time taken to perform one iteration and compared the permutation feature importance of each feature obtained during cross-validation with the one obtained on the validation set. The computation times tell us how costly each error estimation method is; the feature importance tells us whether the cross-validation technique impacts the interpretability of the model. Last, we experimented with different dataset sizes to inspect the influence of data size on cross-validation methods, since the literature only deals with small sample sizes. As the dataset size increases, older and older data are used for training. Computation times can be found in Table 1, and results for random forest on solar are presented in Figures 6 and 7. Results for the other models on solar are in Appendix C, and on wind in Appendix D (Figures D1–D6). The permutation feature importance results showed that, despite the different cross-validation methods, the ranking of the features stayed the same for the different hyperparameter combinations explored, meaning that the method does not impact model interpretability.
Table 1. Average and standard deviation of computing times for 1 iteration for each cross-validation method in seconds

Note. The (S) indicates the shuffling variant of the method. Medals indicate the top three fastest methods for each model and dataset.

Figure 6. Results of different cross-validation techniques for random forest on solar. Each axis represents a monitored quantity for a given HPO procedure. The values for each method are plotted as points, and only the worst and best values for each axis are printed. The (S) indicates the shuffling variant of the method.

Figure 7. Robustness of cross-validation procedure regarding the dataset size for random forest on solar. The marker indicates the average $ \mid \Delta \varepsilon \mid $, while the error bars display the standard deviation. The (S) indicates the shuffling variant of the method.
On the radar chart of Figure 6, we can see that $ \Delta \varepsilon $ is positive both on average and at the optimum. This means that our generalization error estimate $ \hat{\varepsilon} $ is lower than the real error $ \varepsilon $; in other words, cross-validation tends to overestimate the model performance, leading to overconfidence in the model. We can also see that methods that do not preserve the chronological order, or that shuffle the data, perform worse than those that do. Specifically, hold-out, expanding, and sliding lead to the closest estimates, on average and at the optimum, for both searches. However, sliding is the most sensitive to the set of hyperparameters, as its variability $ \sigma \left(\Delta \varepsilon \right) $ is the highest. This might stem from its small training set size, which never exceeds 1 year of data, and is also confirmed by the error bars of Figure 7. The same figure shows that increasing the dataset size by appending older and older data leads to a slight increase in $ \mid \Delta \varepsilon \mid $, meaning that our generalization error estimate moves away from the real one. This is because older data, such as 2012, carry less meaningful information than more recent data, such as 2020, for predicting the validation set (the year 2022). This behavior also explains why some methods display an inflection point at a certain dataset size, meaning that there is an optimum past period to consider for making better predictions on the validation set.
The same conclusions hold for boosting and feed-forward neural networks on the solar dataset (see Figures C1–C4). It is worth mentioning that the neural network shows a high variability and a high $ \Delta \varepsilon $ for the Bayesian search HPO, suggesting that this algorithm might not be the best for optimizing neural network hyperparameters. For the wind dataset (see Appendix D, Figures D1–D6), the hold-out, sliding, and expanding methods are the best at estimating the generalization error for all three model architectures. Yet, we can see for the random forest and boosting models that increasing the dataset size with older data does help to better approximate the generalization error with the expanding and sliding methods. This means that in the wind dataset, older data still carry meaningful information for predicting the most recent validation set, even though there is a pronounced annual trend in the wind power production time series (see Figure 2).
Finally, Table 1 shows that cross-validation procedures involving folds are more computationally intensive per iteration, as one might expect. Combined with the previous graphs, we can conclude that the longer computing times arising from the use of K-fold methods are not worth it, since hold-out and sliding perform better and are between 5 and 10 times faster to compute per iteration.
From the results of those experiments, testing different cross-validation procedures with different HPO algorithms and model architectures, we were able to make recommendations on how to choose a model selection procedure when forecasting a time series from covariate time series. We found that dedicated procedures that keep the chronological order during cross-validation perform better than standard K-fold or shuffled hold-out. Depending on the model architecture and the underlying data, some techniques tend to overestimate or underestimate model performance, leading to over- or underconfidence in the model. This systematic work could be extended to deep learning models that directly ingest images as inputs, to derive recommendations that push their performance even further.
4. Benchmark results and discussion
In this section, we present the results of our models calibrated on the training + validation set and evaluated on the test set. The best hyperparameters for each model were selected from the best generalization error, following the experiments of the previous section, that is, using Bayesian search with either an expanding or hold-out cross-validation method, depending on the model complexity, to save computing time. Expanding was preferred over sliding cross-validation due to the high sensitivity of sliding to hyperparameter sets. We assessed model performance using the RMSE, mean absolute error (MAE), mean absolute percentage error (MAPE), normalized root-mean-squared error (nRMSE), and R2 score (R2); the definitions of these metrics are given in Appendix B. Table 2 contains all our results on the solar dataset, while results for wind can be found in Appendix E, Table E1.
Table 2. Benchmark results for different models using three different modeling approaches on the solar dataset

Note. Medals indicate the top three best-performing models on the test set for each metric.
As nondispatchable renewables capacity increased throughout our study period, the solar and wind power production time series have an increasing trend from 2012 to 2023, as highlighted by Figure 2. This trend requires the models to extrapolate on the test set. Despite reaching state-of-the-art performance in many tasks, tree-based models such as random forest and boosting are known to face difficulties when extrapolating outside of the training domain (Hengl et al., 2018; Malistov and Trushin, 2019). Our case is no exception: despite low errors on the train set, the random forest and boosting model errors soared on the test set (see Tables 2 and E1). To address this issue, many research works propose alternatives such as stochastic or linear trees (Gama and Brazdil, 1999; Zhang et al., 2019; Numata and Tanaka, 2020; Ilic et al., 2021; Raymaekers et al., 2024). We chose to apply two different methods to tackle this extrapolation problem: linear trees and detrending of the time series.
Our detrending scheme consisted of applying a trend estimation method, such as seasonal-trend decomposition using LOESS (STL), to the entire dataset. Once the trend is estimated, we remove it from the data, and the transformed data are passed to the model for calibration. Predictions are obtained by reconstruction from the model's output and the trend estimate. The detrending was done on both the weather inputs and the power output data, as the weighting scheme introduced trends in the covariates.
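A minimal sketch of this detrend-predict-reconstruct scheme, assuming statsmodels' STL as the trend estimator and daily data with an annual period (the paper does not fix these choices):

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

def detrend(series: pd.Series, period: int = 365):
    """Estimate the trend with STL and return (detrended series, trend)."""
    trend = STL(series, period=period).fit().trend
    return series - trend, trend

# Calibrate the model on detrended inputs and target, then reconstruct
# predictions as model output + trend estimate.
```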
Linear trees did not prove to be a silver bullet on the solar dataset, as their performance was only marginally better for the forest and worse in the case of boosting. In contrast, on the wind dataset they proved useful in enhancing the extrapolation performance. However, their performance was still far from that of the tree-based models predicting detrended power supply from detrended weather averages before reconstructing the proper production time series. Despite the error induced by the trend estimation and reconstruction step, this method displays some of the best results on both solar and wind, within the spatial average approach and even beyond. Such behavior could be expected because the trend is estimated on the whole dataset. The extrapolation problem is weaker for GAM and MLP, as they manage to better grasp the trend, achieving better performance on the test set.
Compared with the spatial input averaging approach, using tree-based models with PCA did not achieve better performance, because the extracted principal components exhibit the same trend as the spatial averages. This time, we only applied linear trees, as detrending principal components was more challenging. They exhibited a small improvement on the solar dataset but a larger decrease in performance when used to predict the wind power supply. Combining PCA with GAM does not seem to improve performance on either dataset. For MLP, it depends on the sector, but one thing we noticed after calibration is that networks combined with PCA are deeper than networks without it, meaning that more layers are required to extract meaningful information from the principal components.
Although the increase in complexity between the dimension reduction and spatial average approaches did not lead to clear improvements in model performance for every model architecture, leveraging the entire weather maps with a more complex computer vision architecture, such as a CNN, clearly did. This phenomenon stems from the unsupervised nature of PCA compared to the supervised CNN. In fact, the CNN is the best-performing model on the wind dataset and the second-best on the solar dataset. By utilizing our spatiotemporal weighting scheme, the CNN has a better grasp of the trends in renewables deployment, as highlighted in Figure 8, and avoids extrapolation difficulties. Combined with the MLP results, this highlights the versatility and suitability of neural network-based models for predicting power production from renewable sources.

Figure 8. Power capacity, occlusion attribution, and regional realized power supply for early and late 2023 for wind. Occlusion is an interpretation method that hides part of the input and measures how this impacts the CNN prediction; the higher the impact, the higher the hidden part's importance (Zeiler and Fergus, 2014). Power supply data are obtained from RTE for all of France's regions (NUTS1).
Tables 2 and E1 illustrate the challenges that tree-based models face with extrapolation. Without the detrending scheme, these models would not rank among the top three performers. Instead, neural networks would dominate the podium, with the rankings reflecting the increasing complexity of the modeling approaches. Specifically, as models incorporate more spatially explicit data, their performance improves, with vision models outperforming MLPs combined with PCA, which in turn surpass MLPs on time series. Therefore, we recommend that practitioners incorporate spatial information when designing forecasting models.
The work conducted on cross-validation procedures and HPO schemes allowed us to push state-of-the-art machine learning architectures to their best performance. Such a study could be extended to deep learning models such as CNNs; as deep convolutional neural networks are already amongst the best models for both solar and wind, we did not pursue this path, but a systematic study would likely strengthen their edge further.
5. Conclusion
This study presented datasets and tested a modeling framework based on machine learning, using climate data as well as facility locations as inputs for predicting daily solar and wind supply at the country level in France. Several machine learning models of different complexities were applied to create a benchmark. Attention was paid to the methods used for calibrating the models to avoid reporting overconfident metrics. The proposed method was applied over France and could be extended to any other country or region.
Our model calibration experiments showed that there is no "silver bullet" procedure, as the best choice depends on the data and the model at hand. Under- or overconfidence can arise depending on the calibration, leading to disappointment if the model is put into operation based on the calibration results. Thus, a thorough validation procedure and analysis are required to avoid such phenomena and improve the production launch. Still, a general recommendation can be made in favor of cross-validation methods that keep the temporal structure of the data intact, as they are both more computationally efficient and less biased, leading to more robust models.
Trying to model renewable power supply from weather inputs without including the power capacity at facility locations is pointless, as some state-of-the-art models failed to correctly capture the trend even with this added information. Models that can ingest the entire high-dimensional weather input can learn from spatial patterns to achieve better predictions, improving the forecasts. This means that being spatially explicit, in both the data curation and preparation and in the modeling process, is key to achieving good predictions. Therefore, we encourage other practitioners to include geospatial data in their frameworks. However, one must bear in mind that power capacity inventories are not available everywhere and can be of varying quality depending on the data source.
In summary, geospatial weather information is key for renewable energy forecasting. By providing an open dataset and benchmark, we hope to foster research and improve comparison between studies.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2025.10021.
Author contribution
Conceptualization: E.L., Y.G., P.C.; Methodology: E.L., Y.G.; Data curation: E.L.; Data visualization: E.L.; Supervision: Y.G., P.C.; Writing original draft: E.L.; Writing review & editing: E.L., Y.G., P.C. All authors approved the final submitted draft.
Competing interests
The authors declare none.
Data availability statement
The datasets built for this work can be accessed at https://doi.org/10.5281/zenodo.14287949.
Ethics statement
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.
Funding statement
This research was supported by a grant from the Association Nationale de la Recherche et de la Technologie (ANRT) No. 2024/0010.
A. Appendix A: Weather variables
Table A1. Description of climate variables

Source: ERA5 Documentation (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation).
Note. There are 110,808 hourly weather observations spanning 4,383 days with a 35 $ \times $ 51 grid for each time step.
B. Appendix B: Metrics Definition

$$ \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({y}_i-{\hat{y}}_i\right)}^2} $$

$$ \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|{y}_i-{\hat{y}}_i\right| $$

$$ \mathrm{MAPE}=\frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i-{\hat{y}}_i}{y_i}\right| $$

$$ \mathrm{nRMSE}=\frac{\mathrm{RMSE}}{y_{max}-{y}_{min}} $$

$$ {R}^2=1-\frac{\sum_{i=1}^{n}{\left({y}_i-{\hat{y}}_i\right)}^2}{\sum_{i=1}^{n}{\left({y}_i-\overline{y}\right)}^2} $$

where $ {y}_i $ and $ {\hat{y}}_i $ are the observed and predicted targets, and $ {y}_{max} $, $ {y}_{min} $, and $ \overline{y} $ are the maximum, minimum, and the average of the true target $ y $, respectively.
C. Appendix C: Cross-validation experiment results for solar
C.1. Boosting

Figure C1. Results of different cross-validation techniques for boosted trees on solar. Only the worst and best values for each axis are printed. The (S) indicates the shuffling variant of the method.

Figure C2. Robustness of cross-validation procedure regarding dataset size for boosted trees on solar. The marker indicates the average $ \mid \Delta \varepsilon \mid $, while the error bars display the standard deviation. The (S) indicates the shuffling variant of the method.
C.2. Feed-forward neural network (MLP)

Figure C3. Results of different cross-validation techniques for feed-forward neural network on solar. Only the worst and best values for each axis are printed. The (S) indicates the shuffling variant of the method.

Figure C4. Robustness of cross-validation procedure regarding dataset size for feed-forward neural network on solar. The marker indicates the average $ \mid \Delta \varepsilon \mid $, while the error bars display the standard deviation. The (S) indicates the shuffling variant of the method.
D. Appendix D: Cross-validation experiment results for wind
D.1. Random forest

Figure D1. Results of different cross-validation techniques for random forest on wind. Only the worst and best values for each axis are printed. The (S) indicates the shuffling variant of the method.

Figure D2. Robustness of cross-validation procedure regarding dataset size for random forest on wind. The marker indicates the average $ \mid \Delta \varepsilon \mid $, while the error bars display the standard deviation. The (S) indicates the shuffling variant of the method.
D.2. Boosting

Figure D3. Results of different cross-validation techniques for boosted trees on wind. Only the worst and best values for each axis are printed. The (S) indicates the shuffling variant of the method.

Figure D4. Robustness of cross-validation procedure regarding dataset size for boosted trees on wind. The marker indicates the average $ \mid \Delta \varepsilon \mid $, while the error bars display the standard deviation. The (S) indicates the shuffling variant of the method.
D.3. Feed-forward neural network (MLP)

Figure D5. Results of different cross-validation techniques for feed-forward neural network on wind. Only the worst and best values for each axis are printed. The (S) indicates the shuffling variant of the method.

Figure D6. Robustness of cross-validation procedure regarding dataset size for feed-forward neural network on wind. The marker indicates the average $ \mid \Delta \varepsilon \mid $, while the error bars display the standard deviation. The (S) indicates the shuffling variant of the method.
E. Appendix E: Benchmark results for wind
Table E1. Benchmark results for different models using three different modeling approaches on the wind dataset

Note. Medals indicate the top three best-performing models on the test set for each metric.
F. Appendix F: Comparison with ENTSO-E day-ahead forecasts
In the renewable energy forecasting literature, most studies use numerical weather predictions, that is, forecasted weather, as inputs to the models, and mainly focus on a local scale, such as a single solar or wind farm. In this work, we used ERA5 re-analysis data as the weather inputs, which do not account for weather forecasting error, and we directly predict the supply at the regional scale without any lags. These aspects make comparison with other work difficult. However, we provide in Table F1 a comparison of the spatially explicit CNN results with the ENTSO-E day-ahead forecasts for wind and solar generation in France. The day-ahead forecasts available on ENTSO-E are sourced from each TSO and, since they are run operationally, must use numerical weather forecasts. Since the available forecast data granularity is hourly, we aggregated it to daily forecasts for the sake of comparison. We can see that our approach, combined with the use of re-analysis data, improved the forecasts by 18% on solar and around 20% on wind.
Table F1. Comparison of ENTSO-E day-ahead renewable forecast performance for France with our model's forecast performance in 2023 (test set)

Note. The hourly ENTSO-E forecasts were aggregated to daily to match our work’s granularity.
G. Appendix G: Sensitivity of CNN model to Gaussian noise applied to the weather inputs
Since this study's weather inputs are based on re-analysis data and not forecasted data, we study the degradation of the CNN model performance when mimicking weather forecasts as inputs. To do so, a Gaussian white noise, without any correlation between the different weather variables, is added to each weather map. The noise level is controlled, and the results of the performance degradation are reported in Table G1. It is worth mentioning that adding the same noise level to all the weather predictors does not translate into the same error for every weather variable. Even though this analysis is simple, we can notice that the solar model is less sensitive to the added noise than the wind model. Looking at the scores given by the European Centre for Medium-Range Weather Forecasts for their forecasts in their reference document, we can see in their Figure 26 that the RMSE for wind at 10 m is less than $ 0.5\;\mathrm{m}\;{\mathrm{s}}^{-1} $ (around $ 0.25\;\mathrm{m}\;{\mathrm{s}}^{-1} $) for 60- and 72-hour-ahead forecasts. Our range of wind speed values is between −14 and $ 14\;\mathrm{m}\;{\mathrm{s}}^{-1} $, with an average of 3–$ 4\;\mathrm{m}\;{\mathrm{s}}^{-1} $ where the wind turbines are located. This would mean an error on the forecast variables of around 5–10%, which would lead to a 10–40% degradation of our predictions for a "fake" 3-day-ahead forecast.
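The following sketch illustrates this experiment; the per-variable scaling of the noise and the `predict` interface are our assumptions, as the paper only states that the noise level is controlled:

```python
import numpy as np

def rmse(y_pred, y_true):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def perturb(maps, level, rng):
    """Add uncorrelated white Gaussian noise to (n_days, n_vars, n_lat, n_lon)
    weather maps, scaled by each variable's standard deviation."""
    sigma = maps.std(axis=(0, 2, 3), keepdims=True)
    return maps + rng.normal(0.0, 1.0, maps.shape) * level * sigma

def degradation(predict, maps, y_true, level, n_repeats=100):
    """Mean test RMSE and 95% empirical CI over repeated noise draws;
    `predict` maps perturbed weather maps to daily power."""
    rng = np.random.default_rng(0)
    scores = [rmse(predict(perturb(maps, level, rng)), y_true)
              for _ in range(n_repeats)]
    return np.mean(scores), np.percentile(scores, [2.5, 97.5])
```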
Table G1. Comparison of our model performance when adding Gaussian noise to the weather inputs to mimic weather forecast data

Note. The RMSE is computed on 2023 (test set). The relative change compares the metric with the noise to the metric without. Negative values mean improvement. The experiment was repeated 100 times before computing the mean and a 95% empirical confidence interval.