# Attempting to reproduce RainBench
#### Group 5 - Alexander Freeman (5421144), Chakir El Moussaoui (4609395), and Michèle Knol (4962567)
#### CS4240 Deep Learning 2021/2022
-----
In this blog post, we try to reproduce the paper that introduces the RainBench dataset together with its accompanying data-processing pipeline, PyRain. RainBench and PyRain were developed to mitigate the unequal distribution of forecasting resources in the context of climate problems, such as extreme precipitation events, which makes it an important paper to check for reproducibility. This proved quite difficult, however: simply running the network presented in the paper turned out to be the biggest hurdle. The sparse documentation and the specific configurations required made it very hard to even get started, and in the end we were stuck at a roadblock that no one managed to resolve.
-----
## Introduction
Developing countries, mostly in South America and Africa, do not have the resources to compute numerical weather forecasts, for the simple reason that they cannot afford them. However, such forecasts can, for example, predict extreme precipitation events that would be very harmful to harvests and crops, so it is crucial that these countries also have the ability to predict such events. They can then take appropriate measures, in time, to minimize the damage to their crops and harvests. This is especially important for countries whose dependence on their own harvests is tremendous, where even one destroyed crop field could mean widespread hunger among the local population.
In an attempt to make forecasting available to less fortunate countries as well, Schroeder de Witt, Tong, and colleagues [^1] devised a forecasting approach that is based on machine learning and uses global satellite imagery, removing the dependency on numerical weather prediction. To this end, they developed the multi-modal benchmark dataset RainBench, which is dedicated to advancing global precipitation forecasting. It combines weather data from different sources (SimSat, IMERG, and ERA5), covering periods from as early as 1979 (ERA5) to as recent as 2016 (SimSat). They also developed PyRain to enable efficient processing of the huge amounts of data necessary for forecasting precipitation events. The goal is to forecast the precipitation recorded in either the IMERG or the ERA5 dataset, based on a selected set of input sources (SimSat, ERA, SimSat + ERA, or ERA*, a reduced form of ERA5).
In this blog post, we want to answer the question of whether this paper is reproducible. Reproducibility matters [^2]: if results cannot be reproduced, doubts arise about whether they are correct. A paper should therefore describe its code as clearly as possible, so that others can reproduce it and no suspicion is raised about the results obtained with that code.
## Deciding the goal
Before we even started the project, we already had a plan: the first week would be spent deciding whether to do a full reproduction or an extension of the existing code, and we would continue from there. We deliberately did not read the code during this time, so as not to prime ourselves towards any single implementation. The first goal would be to reproduce the IMERG/persistence combination, as the teaching assistant who guided us through the project wagered that it would require the least computation, based on the data sizes (which we will get to later).
### Reproduction or extension
After trying for several days, we decided that a full reproduction was not in the cards. RainBench itself is just a dataset and is the paper's main contribution. As for the models that were used, the paper itself does not contain any real description. PyRain is the data-loading pipeline presented alongside RainBench, and its only description covers how the data is loaded, not any of the models.
The only hints towards the inner workings are the name of the architecture (a ConvLSTM, or Convolutional Long Short-Term Memory network), Figure 4 (which does not explain anything, as it presents the model as a black box), and a passing mention of the original configuration in [^4]. That paper does describe its network, but the model is very specific to its own dataset and cannot be reproduced directly for RainBench. This does not give much to work with, so we decided to start from PyRain and the included models and extend them after running the original benchmarks.

<p style="text-align: center;font-style: italic">Figure 1: General model of RainBench</p>
Looking at the code, we noticed that the [model](https://github.com/FrontierDevelopmentLab/PyRain/blob/master/src/benchmark/models.py) is copied from [^5]. That paper actually *does* describe how a ConvLSTM works, but this only explains the model architecture; the integration with RainBench remains undescribed.
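For context, a ConvLSTM replaces the matrix multiplications of a standard LSTM with convolutions, so the hidden state keeps its spatial structure. Below is a minimal sketch of a single cell in PyTorch following Shi et al. [^5]; the channel counts, kernel size and grid size are illustrative and not taken from PyRain's `models.py`.
```
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Single ConvLSTM cell: an LSTM whose gates are computed by a convolution."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution produces all four gates (input, forget, candidate, output) at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell state, each (batch, hidden_channels, H, W)
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)  # update cell state
        h = torch.sigmoid(o) * torch.tanh(c)                         # update hidden state
        return h, (h, c)

# One step on a toy 32x64 grid (the size of a 5.625-degree global grid).
cell = ConvLSTMCell(in_channels=1, hidden_channels=16)
x = torch.randn(2, 1, 32, 64)
h0 = torch.zeros(2, 16, 32, 64)
c0 = torch.zeros_like(h0)
out, (h1, c1) = cell(x, (h0, c0))
print(out.shape)  # torch.Size([2, 16, 32, 64])
```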
We would have liked to 1) reproduce all the numbers in Table 1, 2) do a hyperparameter check, and finally 3) do an ablation study on the included datasets.

<p style="text-align: center;font-style: italic">Table 1: Precepitation forecasts evaluated by the paper</p>
These goals went from attainable to unattainable over the course of the project, as we first had to get the project running at all. Our progress in that endeavour is described below.
## Getting it running
Before we can explain our process, we must first describe the datasets of which RainBench consists: SimSat, IMERG and ERA5. All three are available via [Google Cloud](https://console.cloud.google.com/storage/browser/aaai_release) in numpy memmap format, at both 5.625-degree and 14.0625-degree resolution. The paper only seems to use the 5.625-degree resolution, so we used that as well. SimSat and IMERG have reasonable file sizes at 273.9 MB and 1.3 GB respectively; ERA5, however, is 49.4 GB. This is important information, especially when looking at cloud-based solutions, as it can make the setup quite cumbersome.
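For readers unfamiliar with the format: a numpy memmap is read lazily from disk, so opening even the 49.4 GB ERA5 file is cheap and RAM is only consumed by the slices you actually touch. The sketch below shows the general access pattern; the filename, dtype and shape are hypothetical placeholders rather than PyRain's actual layout (a 5.625-degree global grid does come out to 32 x 64 cells, though).
```
import numpy as np

# mode="r" maps the file read-only; nothing is read from disk until a slice is accessed.
era5 = np.memmap("era5625/precip.memmap", dtype="float32", mode="r",
                 shape=(100_000, 32, 64))  # hypothetical (time, lat, lon) layout

window = np.array(era5[0:24])  # only these 24 timesteps are actually loaded into RAM
print(window.shape, window.nbytes / 1e6, "MB")
```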
Another important aspect is the project's dependencies. [PyTorch](https://pytorch.org/get-started/locally/) in particular has many versions that will or will not work with a given CUDA version (and operating system), and the project also uses an old version of [PyTorch-Lightning](https://www.pytorchlightning.ai/). Normally these dependencies are pinned in a requirements.txt file or a conda environment, but neither (nor any alternative) was provided, so finding working versions was a matter of trial and error.
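Because nothing is pinned, we ended up sanity-checking the environment by hand before every attempt. A small script like the one below (only standard PyTorch and Lightning attributes, nothing project-specific) at least tells you which combination you are actually running:
```
import torch
import pytorch_lightning as pl

print("torch:", torch.__version__)
print("torch CUDA build:", torch.version.cuda)   # None for a CPU-only build
print("CUDA available:", torch.cuda.is_available())
print("pytorch-lightning:", pl.__version__)
```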
We tried running it locally, on Google Colab, and in the end also on the TU Delft cluster. These three approaches were pursued largely in parallel, apart from solving common problems together.
### Local
Running it locally seemed the most straightforward way to get the project working: clone the repository, download the dataset, configure some settings, run the command and train! Downloading ERA5 took a while, but the project was quickly set up. However, because ERA5 is so big, the next problem popped up very quickly. We had three computers at our disposal, with 8 GB, 16 GB and 32 GB of RAM respectively. The initial loading of the dataset exceeded 16 GB of memory and resulted in the error shown in *Code log 1: Memory error thrown*, leaving us with just one usable device. We were working with SimSat to predict IMERG, but that setup still uses ERA5 variables all over the place, so we could not simply leave ERA5 out.
```
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
--------------------------------------------
0 | net | ConvLSTMForecaster | 8.1 M
--------------------------------------------
8.1 M Trainable params
0 Non-trainable params
8.1 M Total params
32.496 Total estimated model params size (MB)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "run_benchmark.py", line 312, in <module>
main(hparams)
File "run_benchmark.py", line 181, in main
trainer.fit(model)
***stacktrace***
File "C:\Users\chaki\anaconda3\envs\PyRain\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\chaki\anaconda3\envs\PyRain\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
MemoryError
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\chaki\anaconda3\envs\PyRain\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users\chaki\anaconda3\envs\PyRain\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
```
<p style="text-align: center;font-style: italic">Code log 1: Memory error thrown</p>
In itself this problem was not that big of a deal: we at least had a way of continuing. It was discouraging, however, as the listed model training time was a day, and losing two computers meant far fewer chances to train.
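Note that the traceback in Code log 1 fails while pickling the dataset object for a spawned DataLoader worker: on Windows, every worker process receives its own copy of the dataset, multiplying memory use. A generic mitigation, sketched below with a stand-in dataset, is to set `num_workers=0` so that loading stays in the main process; whether PyRain exposes this cleanly is another matter, so treat it as an assumption rather than a verified fix.
```
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for PyRain's dataset object; on Windows it is this object that gets
# pickled into every spawned worker process.
dataset = TensorDataset(torch.randn(64, 1, 32, 64), torch.randn(64, 1, 32, 64))

# num_workers=0 keeps data loading in the main process, avoiding the per-worker
# pickling step that raised the MemoryError above.
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)
x, y = next(iter(loader))
print(x.shape)
```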
We had some issues with CUDA and WSL, but after installing [Windows 10 (non-insider) 2021H2](https://github.com/microsoft/WSL/issues/6014), we had GPU support working.
After these fixes, everything was set to go, and with good hope we ran the terminal command: `python3 run_benchmark.py --inc_time --config_file config.yml --gpus 1 --no_relu --batch_size 8 --imerg --sources simsat`
*(Taken from the provided README, except for `--gpus 1` to enable the GPU and `--batch_size 8` to stay within the available VRAM.)*
Running this resulted in the following error:
```
UNAVAILABLE CATS: ['simsat5625/clbt']
Epoch 0: 4%|█▊ | 191/4967 [00:58<24:11, 3.29it/s, loss=0.0748, v_num=85]Traceback (most recent call last):
File "run_benchmark.py", line 313, in <module>
main(hparams)
File "run_benchmark.py", line 182, in main
trainer.fit(model)
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
***stacktrace***
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 7.
Original Traceback (most recent call last):
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/alexander/PyRain/src/dataloader/memmap_dataloader.py", line 576, in __getitem__
sample_results = self.get_sample_at(sample_mode_id, sample_ts, index)
File "/home/alexander/PyRain/src/dataloader/memmap_dataloader.py", line 416, in get_sample_at
{"interpolate": vbl_section.get("interpolate", "NaN")})]
File "/home/alexander/PyRain/src/dataloader/memmap_dataloader.py", line 250, in __getitem__
a = results_dict[list(results_dict.keys())[0]]
IndexError: list index out of range
```
<p style="text-align: center;font-style: italic">Code log 2: Index error thrown</p>
This error is at the core of our problems, popping up everywhere. The loader seems unable to find a value in the SimSat dataset ('simsat5625/clbt'). Further debugging told us that it was not the entire variable that was missing ('simsat5625/clbt' is actually in the dataset), but a single range of time values referenced somewhere else in the dataset. Finding the cause proved impossible, because the data is memory-mapped and very hard to dissect, and no further information could be found to solve it. We created a [GitHub issue](https://github.com/FrontierDevelopmentLab/PyRain/issues/5), but as of now (14/04/2022) there has been no reply.
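For what it is worth, the dataset is addressed through a `.dill` descriptor file that `memmap_dataloader.py` loads at startup (the `era5625_aaai.dill` that also appears in Code log 4 below), and this file can at least be opened and inspected directly. The sketch below only assumes that the descriptor unpickles to some nested Python object; its exact structure is undocumented, so the printout is where the real detective work starts.
```
import dill

# Load the dataset descriptor that memmap_dataloader.py reads at startup.
with open("era5625_aaai.dill", "rb") as f:
    index = dill.load(f)

print(type(index))
if isinstance(index, dict):
    # Peek at the top-level keys to see which variables and time ranges are declared.
    for key in list(index)[:20]:
        print(key, type(index[key]))
```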
We have gone through every permutation of input sources (simsat, simsat+era, era) and prediction targets (IMERG and ERA5) and ran into either this error or another one, about a missing metric called val_loss, which is used by ReduceLROnPlateau (a scheduler that does what it says on the tin: it reduces the learning rate when the validation loss plateaus). We tried removing the scheduler, but val_loss stays missing, leading to more errors.
```
Epoch 0: 100%|███████████████████████████████████████████████| 4967/4967 [22:18<00:00, 3.71it/s, loss=0.0727, v_num=84]
Traceback (most recent call last):
File "run_benchmark.py", line 313, in <module>
main(hparams)
File "run_benchmark.py", line 182, in main
trainer.fit(model)
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
results = self._run_stage()
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
return self._run_train()
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
self.fit_loop.run()
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 205, in run
self.on_advance_end()
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 306, in on_advance_end
self.epoch_loop.update_lr_schedulers("epoch", update_plateau_schedulers=True)
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 433, in update_lr_schedulers
opt_indices=[opt_idx for opt_idx, _ in active_optimizers],
File "/home/alexander/PyRain/venv/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 476, in _update_learning_rates
f"ReduceLROnPlateau conditioned on metric {monitor_key}"
pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: []. Condition can be set using `monitor` key in lr scheduler dict
Epoch 0: 100%|██████████| 4967/4967 [22:19<00:00, 3.71it/s, loss=0.0727, v_num=84]
```
<p style="text-align: center;font-style: italic">Code log 3: ReduceLROnPlateau error thrown</p>
This error also remains unfixed, bringing the local approach to a halt.
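For completeness: PyTorch Lightning only makes `val_loss` available to a `ReduceLROnPlateau` scheduler if the LightningModule logs a metric under that exact name during validation. The sketch below shows the pattern Lightning expects; it is a generic toy module, not PyRain's actual model, and the optimizer settings are placeholders.
```
import torch
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.net(x), y)
        self.log("val_loss", loss)  # without this line, "val_loss" never exists
        return loss

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=1e-3)
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt)
        # The "monitor" key must name a metric that is actually logged somewhere.
        return {"optimizer": opt,
                "lr_scheduler": {"scheduler": sched, "monitor": "val_loss"}}
```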
### Google Colab
The second approach was via [Google Colab](https://colab.research.google.com/drive/1Md2mWCHWyX6kDxawge7toJEUulmms9tu?usp=sharing). Looking back, this may have been a bit naive given the system requirements, but we tried it nonetheless. Collaborating with the other groups gave us a setup using TPUs (Tensor Processing Units) instead of GPUs, because enabling GPUs reduces the available disk size to below what is needed to load ERA5. This eventually ran into the same undefined val_loss, ending this approach rather quickly.
### Cluster
One of our hypotheses for why the network was not running correctly was that neither our local computers nor Google Colab had the computing power we actually needed. We therefore got access to the TU Delft cluster to run the network on. At first, we ran into a different error that had to do with transferring the files to the cluster:
```
(pytorch) dlgroup5@tud13232:~/PyRain$ python3 run_benchmark.py --sources simsat --no_relu --imerg --inc_time --config_file config.yml
datapath: era5625_aaai.dill
Traceback (most recent call last):
File "run_benchmark.py", line 313, in <module>
main(hparams)
File "run_benchmark.py", line 153, in main
hparams, loaderDict, normalizer, collate = get_data(hparams)
File "/home/dlgroup5/PyRain/src/benchmark/collect_data.py", line 25, in get_data
sample_conf=sample_conf) for p in tvt.split('_')}
File "/home/dlgroup5/PyRain/src/benchmark/collect_data.py", line 25, in <dictcomp>
sample_conf=sample_conf) for p in tvt.split('_')}
File "/home/dlgroup5/PyRain/src/dataloader/memmap_dataloader.py", line 275, in __init__
self.dataset = DatafileJoin(self.datapath)
File "/home/dlgroup5/PyRain/src/dataloader/memmap_dataloader.py", line 111, in __init__
dc = dill.load(open(datapath, "rb"))
File "/home/dlgroup5/miniconda3/envs/pytorch/lib/python3.7/site-packages/dill/_dill.py", line 313, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/home/dlgroup5/miniconda3/envs/pytorch/lib/python3.7/site-packages/dill/_dill.py", line 525, in load
obj = StockUnpickler.load(self)
_pickle.UnpicklingError: invalid load key, '<'.
```
<p style="text-align: center;font-style: italic">Code log 4: Unpickling error thrown</p>
After using another source as the dataset and some other changes (for example, to library versions), it finally started running properly, until the process again halted on the data loader error shown in *Code log 2: Index error thrown*. That said, this is the furthest any of us came in getting the provided network to run on the entire provided dataset.
### Collaboration
When we had a surprisingly hard time running the code, we turned to the other groups working on the same paper and asked how they managed to get it running. Unfortunately, they had difficulties as well. We had separate meetings with them to discuss our progress and help each other, in order to get at least something working to continue from, but in the end none of us managed to get the network running properly.
## The issue with debugging
All in all, we have spent a lot of hours debugging the codebase. The main problems we would like to emphasize here are the lack of documentation and the lack of focus in the codebase. While the project has a README that shows exactly how to run it, gaining any understanding of the internal workings is not easy: the code references many variables in the different datasets without any explanation, and the dataloader is so complex that even debugging with a simpler set of datapoints was not achievable in the limited time we had.
The lack of focus stems mainly from the code trying to do many things at once. Besides the `run_benchmark.py` script, there are many other scripts that seem to output data files without any description of their function, inputs or outputs.
## Reproducibility
After all our struggles we cannot help but conclude that reproducing this project is hard, if not impossible. We fixed countless problems along the way, but two big roadblocks remain for which no solution seems to be available at the moment. The documentation is of no help in this regard and we have exhausted our available options. We want to emphasize that three groups of students (twelve in total) worked on this, and none of us got it to work; we even ran into the same errors along the way.
That being said, the problems with the code are not the only grounds for calling the paper irreproducible. Even without running the provided code, we should at least be able to extract an architecture from the paper, yet there is no mention of hyperparameters such as the number of epochs, the learning rate, the optimizer used, regularization, or anything else.
What is interesting is that during the course we came across a paper about the reproducibility of machine learning research [^3]. In it, the author identifies several properties of a paper that correlate with its actual reproducibility. Some of the mostly objective properties that correlate positively with reproducibility are:
- The number of pages
- Technical algorithmic details
- Being empirically based
- The number of tables
- Hyperparameters being mentioned
- Fewer equations
- The authors replying to questions
In the case of this paper: it effectively has only 5 pages, which is not a lot. No technical algorithmic details are discussed. It does seem to be empirically based, judging from the two tables provided, which is, again, not a lot. No actual hyperparameters are discussed. It has no equations, but arguing that this makes it more reproducible goes against what the author actually meant with this correlation: a large number of equations makes a paper harder to read and generally indicates a more complex and difficult algorithm. As this paper has no equations at all, we argue that this still does not count in its favour. The authors also did not reply to any of the messages sent to them, either by our group or by the other groups working on the same project.
### Concluding remarks
Unfortunately, we did not get past the first steps of the project. Every option was exhausted, yet running, extending and reproducing did not work out.
Going by the correlations mentioned by the author of [^3] and by our own findings, this paper seems to be a good example of a paper that is not reproducible.
To reiterate, here is a quick overview of the issues we had to deal with along the way:
- The specific (and by now outdated) library versions were never mentioned, so we had to find working ones on our own.
- Little to no documentation in the codebase.
- A single vague model in the paper to refer to.
- No hyperparameters are described at all.
- A big dataset that was somehow still missing values, causing the error we never managed to fix.
- And no communication from the authors.
[^1]: Schroeder de Witt, C., Tong, C. et al., (2020). RainBench: Enabling Data-Driven Precipitation Forecasting on a Global Scale.
[^2]: https://ai.facebook.com/blog/how-the-ai-community-can-get-serious-about-reproducibility/
[^3]: Raff, E., (2019). A Step Toward Quantifying Independently Reproducible Machine Learning Research.
[^4]: Casper Kaae Sønderby, Lasse Espeholt, Jonathan Heek, Mostafa Dehghani, Avital Oliver, Tim Salimans, Shreya Agrawal, Jason Hickey, and Nal Kalchbrenner. MetNet: A neural weather model for precipitation forecasting. arXiv preprint arXiv:2003.12140, 2020.
[^5]: Xingjian, S. H. I., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., & Woo, W.-C. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems, 802–810.