---
title: BQN Results
description: Small summary of the forecast performance of the Bernstein quantile network
robots: noindex, nofollow
---
# BQN Results
## Todos
- [x] For all hyperparameters run several models, plot mean and variance of performance
- [x] Performance with respect to horizons might be of interest
- [ ] PITs for the different horizons
- [ ] PITs for different degrees
- [x] "mean + std"
- [x] "mean"
- [ ] "mean + quantiles"
- [ ] Wind power plot separately for test and training data
- [ ] Randomize the validation data
----------
## Overview
__Goal__: Implementation of a Bernstein Quantile Network for postprocessing of ensemble weather forecasts and predicting wind power generation in a probabilistic and non-parametric fashion.
__Related work__:
1. Bremnes, J.B. (2020): [_Ensemble Postprocessing Using Quantile Function Regression Based on Neural Networks and Bernstein Polynomials_](https://journals.ametsoc.org/view/journals/mwre/148/1/mwr-d-19-0227.1.xml)
2. Schulz, B., Lerch, S. (2021): [_Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison_](https://arxiv.org/abs/2106.09512)
3. Phipps, K. et al. (2021): [_Evaluating Ensemble Post-Processing for Wind Power Forecasts_](https://onlinelibrary.wiley.com/doi/10.1002/we.2736)
## Data
- Ensemble weather forecasts for southern Sweden (Sweden_Zone3_Ensemble): [u100, v100, t2m, sp, speed]
- Aggregated wind power observations from southern Sweden (Sweden_Zone3_Power): [wind_power]

## Approach
For each forecast horizon and each aggregation method, a neural network is trained.
The _horizons_ are {3, 6, 9, 12, 15, 18, 21, 24} hours ahead.
The _aggregation method_ refers to the level of aggregation of the weather ensemble. The possible methods are {"all", "mean", "mean+std", "single", "single+std"} and refer to the following (see the sketch after this list):
1. "all": All 50 ensemble members are given to the model unaggregated.
2. "mean": For each weather variable, calculate the mean over the ensemble.
3. "mean+std": For each weather variable, calculate the mean and the standard deviation over the ensemble and feed it to the model.
4. "single": Feed each ensemble member to the model as if it was a single weather forecast.
5. "single+std": As above, but also give the standard deviation over the ensemble members as a measure of uncertainty.
The neural network consists of an input layer, 2 to 5 hidden layers, and an output layer. The size of the input layer depends on the level of aggregation and ranges from 5 ("single") through 10 ("mean+std" and "single+std") to 250 ("all"), plus 1 for the lagged wind power data. The sizes of the hidden layers are either guessed or optimized during hyperparameter tuning and are sorted in decreasing order. The output layer has degree + 1 neurons, where degree is the degree of the Bernstein polynomials. The "softplus" activation is used to enforce non-negative outputs.
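A minimal sketch of this architecture, assuming TensorFlow/Keras (the function name and defaults are illustrative, not the actual code):
```python
import tensorflow as tf

def build_bqn(n_features: int, degree: int, hidden_sizes: list[int],
              activation: str = "selu", lr: float = 0.01) -> tf.keras.Model:
    """Funnel-shaped MLP with degree + 1 non-negative outputs."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = inputs
    for size in sorted(hidden_sizes, reverse=True):   # enforce decreasing widths
        x = tf.keras.layers.Dense(size, activation=activation)(x)
    # softplus keeps the coefficient increments non-negative
    outputs = tf.keras.layers.Dense(degree + 1, activation="softplus")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss=make_quantile_loss(degree))    # defined in the loss sketch below
    return model
```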
The outputs of the network are interpreted as increments of the coefficients of the Bernstein polynomials: the j-th coefficient is the cumulative sum of the first j + 1 outputs. The resulting linear combination of Bernstein basis polynomials is interpreted as the quantile function of the distribution of the random variable "wind power". Since the outputs are non-negative, the coefficients are non-decreasing, and a Bernstein polynomial with non-decreasing coefficients is itself non-decreasing; this guarantees an increasing function, a fundamental property of a quantile function.
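Concretely, with non-negative outputs Δ0, ..., Δd, the coefficients are αj = Δ0 + ... + Δj and the predicted quantile function is q(τ) = Σj αj · B_{j,d}(τ), where B_{j,d}(τ) = (d choose j) τ^j (1 − τ)^(d−j) is the Bernstein basis. A minimal numpy sketch (names are illustrative):
```python
import numpy as np
from math import comb

def bernstein_quantiles(increments: np.ndarray, taus: np.ndarray) -> np.ndarray:
    """Evaluate the predicted quantile function at the levels in taus.

    increments: non-negative network outputs, shape (batch, degree + 1)
    taus:       quantile levels in [0, 1], shape (n_taus,)
    returns:    quantiles, shape (batch, n_taus)
    """
    degree = increments.shape[1] - 1
    # cumulative sums of non-negative increments -> non-decreasing coefficients
    coeffs = np.cumsum(increments, axis=1)
    # Bernstein basis B_{j,d}(tau), shape (degree + 1, n_taus)
    basis = np.stack([comb(degree, j) * taus**j * (1 - taus)**(degree - j)
                      for j in range(degree + 1)])
    return coeffs @ basis
```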
The neural network is trained with a quantile loss function (see \[Bremnes\] or \[Schulz\]), and its predictive performance is assessed with the CRPS.
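A sketch of such a loss for Keras, averaging the pinball loss over equidistant quantile levels (the number and placement of the levels here are assumptions). Since the CRPS equals twice the pinball loss integrated over all levels, the same quantity evaluated on a fine grid also approximates the evaluation metric:
```python
import numpy as np
import tensorflow as tf
from math import comb

def make_quantile_loss(degree: int, n_taus: int = 99):
    """Pinball loss averaged over equidistant levels k / (n_taus + 1)."""
    taus = (np.arange(1, n_taus + 1) / (n_taus + 1)).astype("float32")
    basis = np.stack([comb(degree, j) * taus**j * (1 - taus)**(degree - j)
                      for j in range(degree + 1)]).astype("float32")
    basis_t = tf.constant(basis)                  # (degree + 1, n_taus)
    taus_t = tf.constant(taus)                    # (n_taus,)

    def quantile_loss(y_true, y_pred):
        coeffs = tf.cumsum(y_pred, axis=1)        # (batch, degree + 1)
        q = tf.matmul(coeffs, basis_t)            # (batch, n_taus)
        err = y_true - q                          # y_true (batch, 1) broadcasts
        return tf.reduce_mean(tf.maximum(taus_t * err, (taus_t - 1.0) * err))
    return quantile_loss
```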
## Evaluation
To evaluate the BQN approach, a search for optimal hyperparameters was conducted first. The hyperparameters taken into account were:
1. "degree": degree of the Bernstein polynomials
2. "depth": number of dense layers in the neural network
3. "layerX_size": width/number of neurons of the layer X in the network. (The layer sizes get sorted in decreasing order as to create a funnel shaped network)
4. "learning rate": The learning rate of the Adam optimizer, values in {1.0, 0.1, 0.01, 0.001, 0.0001}
5. "activation": activation function used between hidden layers of the network, namely "relu" or "selu"
After the optimal hyperparameters had been determined, ~~5~~ 10 runs for each model configuration were conducted and their outputs were aggregated as described in \[Schulz\] (see the sketch below). The performance of the aggregated model was then assessed with respect to the CRPS.
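Assuming the aggregation is quantile averaging (Vincentization): because the BQN quantile function is linear in its coefficients, averaging the per-run coefficient vectors is equivalent to averaging the runs' quantile functions pointwise. A minimal sketch:
```python
import numpy as np

def aggregate_runs(run_increments: list[np.ndarray]) -> np.ndarray:
    """Average the Bernstein coefficients over several runs.

    run_increments: one array of non-negative outputs per run,
                    each of shape (batch, degree + 1)
    returns:        aggregated coefficients, shape (batch, degree + 1)
    """
    coeffs = [np.cumsum(inc, axis=1) for inc in run_increments]
    return np.mean(coeffs, axis=0)
```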
## Results
__Hyperparameters found to be optimal__
| horizon | aggregation | degree | depth | learning_rate | activation | layer0_size |... |
|---------|-------------|--------|-------|---------------|------------|-------------|----------|
| 3 | single | 14 | 3 | 0.01 | selu | 14 |... |
| 3 | single+std | 9 | 5 | 0.01 | selu | 20 |... |
| 3 | mean | 20 | 3 | 0.001 | selu | 19 |... |
| 3 | mean+std | 4 | 4 | 0.01 | selu | 25 |... |
| 3 | all | 21 | 3 | 0.001 | selu | 62 |... |
| 6 | single | 4 | 2 | 0.1 | relu | 18 |... |
|... | ... |... |... |... |... |... |... |
__CRPS of the averaged models__
Compare to Table 5 in \[Phipps\].
| horizon | all | mean | mean+std | single | single+std |
|---------|-------|-------|----------|--------|------------|
| 3 | 55.91 | 47.03 | 42.88 | 45.2 | 41.41 |
| 6 | 65.64 | 59.13 | 57.54 | 59.63 | 56.85 |
| 9 | 73.75 | 70.37 | 66.81 | 71.56 | 66.72 |
| 12 | 69.78 | 65.36 | 62.95 | 69.98 | 65.73 |
| 15 | 69.98 | 69.66 | 69.06 | 73.57 | 68.56 |
| 18 | 73.47 | 78.52 | 64.26 | 72.22 | 68.99 |
| 21 | 86.78 | 79.2 | 74.64 | 73.57 | 76.21 |
| 24 | 72.64 | 71.22 | 75.1 | 75.29 | 73.91 |
__CRPS Plots__


__Scatter Plots of the different runs__
"Individual" refers to the performance of the individual model runs, "Average" to that of the averaged model.





__Rank Histograms__
[~~Link to the oldest imgur album~~](https://imgur.com/a/BzhANxK)
[~~Link to the old imgur album~~](https://imgur.com/a/eoMqAd8)
[__Link to the newest imgur album__](https://imgur.com/a/ZFvMMry)
Examples

__Forecast Plots__
Example plots of the first forecast and observation in the test set
[~~Link to the imgur album~~](https://imgur.com/a/E93ODaN)
[__Link to the new imgur album__](https://imgur.com/a/kzpN4y0)
__Old__ "Wavy" forecast

__Old__ Forecast starting (almost) at 0 [horizon 12, aggregation mean]

Well-behaved forecast

__First Coefficients__

## Comparison
The table "CRPS of averaged models" shows the mean CRPS for each aggregation method and forecast horizon and can thus be compared to Table 5 in \[Phipps\]. Sadly, the BQN network cannot beat the best strategy at most horizons, except for the 3h ahead forecast. The performance is similar to that of "Linear Raw" or "Random Forrest Raw".
In \[Schulz\], only the aggregation method "all" is used, in accordance with \[Bremnes\]. Our data suggest that aggregating the ensemble weather forecast improves the predictive performance.
~~In contrast to the PIT plots in \[Schulz], our rank histograms were far from uniform. The reason for this is unknown and unexpected, since training with the quantile loss should produce calibrated forecasts.~~
### Improvements
After shuffling the training and test data sets, reducing the size of the neural network, and lowering the maximal degree of the Bernstein polynomials, the rank histograms now appear to be uniform.
## Other Observations
### Rank histograms
Firstly, the peak at the higher ranks can be explained by the fact that the wind power data is an aggregation of multiple power plants. When predicting the power generation of a single plant, the maximum capacity of the plant can be taken into account when scaling the data. In that case, better calibration is expected.
~~Secondly, the "waviness" in some of the histograms is in correspondence with the "waviness" of the forecast distribution.~~
A stricter upper bound on the degree of the Bernstein polynomials reduced the "waviness" of the forecasts significantly.
~~Some models produce very unsharp forecasts. This can be observed in forecast plots, where the distribution starts at 0 no matter the observation. If the coefficient for the first Bernstein polynomial is very small, such distributions arise. Thus this effect occurs frequently, if the first output neuron is on average very small.~~
This problem could be alleviated by the above improvements and averaging over 10 instead of 5 runs.
## Problems
### Hyperparameter tuning
1. Batch size and patience were not taken into account during the tuning.
2. The sizes of the hidden layers should have been selected conditionally, so that they are automatically in decreasing order, or are not sampled at all when the hyperparameter "depth" makes them irrelevant. This was solved in a very hacky way, which did not reduce the HP search space and might have had an impact on convergence; see the sketch below.
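For reference, a minimal sketch of one way to sample the layer sizes conditionally on the depth (plain random sampling with hypothetical bounds, independent of any tuning framework):
```python
import random

def sample_layer_sizes(depth: int, min_size: int = 8, max_size: int = 64) -> list[int]:
    """Sample exactly `depth` widths and sort them in decreasing order,
    so the funnel shape holds by construction and no unused
    layer-size hyperparameters enter the search space."""
    sizes = [random.randint(min_size, max_size) for _ in range(depth)]
    return sorted(sizes, reverse=True)
```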
## Still To Do
1. Take the time series structure into account -> CNNs or other network architectures?