# Reproduction: "NeRV: Neural Representations for Videos"
Contributors:
- Anton Lang 5110467
- Dielof van Loon 5346894
- David Schep 5643384
- Remi Makinwa 5270677
Reproduction Code: https://github.com/diebolo/nerv
## 0. Original Paper
Arxiv: https://arxiv.org/abs/2110.13903
Github: https://github.com/haochen-rye/NeRV/tree/master
Project Page: https://haochen-rye.github.io/NeRV/
## **1. Introduction**
In this blog post we reproduce the results from the paper **NeRV: Neural Representations for Videos** by **Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser-Nam Lim, and Abhinav Shrivastava**.
The paper presents a novel neural representation for videos called NeRV, which encodes videos into neural networks. The key insight is that by training a neural network to map a video frame index to the corresponding RGB image, the weights of the model themselves become a representation of the video. This approach eliminates the long and complex pipeline used in traditional video compression methods. The paper shows that NeRV can achieve performance comparable to traditional video compression approaches and outperform standard denoising methods.
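To make the core idea concrete, the sketch below shows a toy frame-index-to-image network in PyTorch. This is our own heavily simplified illustration, not the paper's exact architecture: the real NeRV applies a positional encoding to the frame index, uses a larger MLP, and stacks dedicated NeRV blocks with PixelShuffle upscaling.

```python
import torch
import torch.nn as nn

class TinyNeRV(nn.Module):
    """Toy sketch of the NeRV idea: normalized frame index t -> RGB frame.
    Sizes and layers are illustrative only; the real model uses a positional
    encoding of t, a larger MLP and carefully designed NeRV blocks."""
    def __init__(self, base_hw=(9, 16), channels=16, up_factors=(5, 3, 2, 2, 2)):
        super().__init__()
        self.base_hw, self.channels = base_hw, channels
        # MLP maps the frame index to a small base feature map.
        self.mlp = nn.Sequential(
            nn.Linear(1, 256), nn.GELU(),
            nn.Linear(256, channels * base_hw[0] * base_hw[1]),
        )
        # Each block upscales spatially with a sub-pixel convolution (PixelShuffle).
        blocks = []
        for s in up_factors:
            blocks += [nn.Conv2d(channels, channels * s * s, 3, padding=1),
                       nn.PixelShuffle(s), nn.GELU()]
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, t):  # t: (B, 1) frame index normalized to [0, 1]
        x = self.mlp(t).view(-1, self.channels, *self.base_hw)
        return torch.sigmoid(self.head(self.blocks(x)))

model = TinyNeRV()
frame = model(torch.tensor([[0.5]]))  # decode the middle frame of the video
print(frame.shape)                    # torch.Size([1, 3, 1080, 1920])
```

Training then amounts to overfitting such a network on the frames of a single video, so that the video is effectively "stored" in the trained weights.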
In this reproduction we evaluate how well we can reproduce the encoding of a video, the denoising of a video, and the original ablation study, and we additionally run the algorithm on the Cholec80 dataset. We also discuss a hyperparameter sensitivity analysis and perform a hyperparameter grid search, in which we identify (from our limited testing) settings where hyperparameters other than those suggested in the original paper perform better.
## **2. Reproduction**
For the reproduction, the original GitHub repository of the paper was used. It ran out of the box, without any issues.
Training the model with default parameters on the Big Buck Bunny dataset for 1200 epochs led to results similar to those reported in the original study. Below, a frame of the ground truth (left) and the output of the trained model (right) can be seen. Our output frames achieve an average PSNR of 34.47, which is slightly higher than the original study's 34.21.

In addition to reproducing the results on the Big Buck Bunny video, the denoising findings were also tested. The paper states that the model is good at denoising videos, despite not being explicitly designed for this. To support this point, the authors compare NeRV's denoising capabilities with more conventional methods, using a version of the UVG dataset (all videos of the dataset concatenated) with noise applied to it as the test. For this test, PSNR is used as the measure of how well each method denoises the video:

The exact method used to add noise to the video was not shared in the paper, so a conventional script to apply salt & pepper noise to the frames of a video was written. Furthermore, in the interest of time, noise was not applied to the entire UVG dataset; instead, only the Honeybee video from the dataset was evaluated.
To make the results line up as closely as possible despite this difference, effort was made to match the PSNR of the frames generated by the noise script to the baseline PSNR reported in the paper. In the end, a baseline PSNR of about 28.0 per frame was achieved, compared to the paper's 27.95. A sketch of the noise script is shown below, followed by an example of such a frame:
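Since the paper does not describe its noise procedure, the snippet below is only a sketch of what our script does: apply salt & pepper noise to a frame and check its PSNR against the clean frame, tuning the noise density until it sits near the paper's 27.95 dB baseline. The file names and the density value are illustrative placeholders.

```python
import numpy as np
import cv2

def add_salt_pepper(img, amount=0.01):
    """Flip a fraction `amount` of pixels to pure black or white."""
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < amount / 2] = 0          # pepper
    noisy[mask > 1 - amount / 2] = 255    # salt
    return noisy

def psnr(a, b, max_val=255.0):
    """PSNR in dB between two uint8 images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

# Illustrative usage: tune `amount` until the noisy frames sit around 28 dB.
clean = cv2.imread("honeybee_frame_0001.png")   # hypothetical file name
noisy = add_salt_pepper(clean, amount=0.01)
print(f"baseline PSNR: {psnr(clean, noisy):.2f} dB")
cv2.imwrite("honeybee_noisy_0001.png", noisy)
```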

Training the model for 1200 epochs on this new dataset and evaluating it against the non-noisy ground-truth dataset results in a PSNR of 37.77. Below, a comparison is made between the non-noisy ground-truth footage of the Honeybee video (left) and the output of the model trained on the noisy data (right):

The footage is almost identical, except for the absence of the honeybee. We suspect the reason is that the Honeybee video is mostly still, consisting of flowers swaying very slowly in the background and the titular honeybee flying around. As a consequence, the frames change very little throughout the video, so the easiest way for the model to represent the overall video is to produce an almost unchanging frame.
## **3. Evaluation on New Data**
It was recommended to use the Cholec80 dataset for the new-data evaluation. Cholec80 is an endoscopic video dataset containing 80 videos of cholecystectomy surgeries performed by 13 surgeons. The dataset was converted into 1733 frames, which were used to train an instance of NeRV. Because of the large VRAM of the available GPU (24 GB, RTX 4090), we used a large batch size (10). This sped up training and was expected to increase overfitting. Overfitting would be a good thing in this case, since the frame data needs to be encoded as accurately as possible. The following settings were used for training on the Cholec80 set.
| Beta | Epochs | Warmup | Loss | Batch | PSNR | MSSIM |
| --- | ------ | ------ | ------- | ----- | ----- | ------ |
| 0.5 | 1200 | 0.2 | Fusion6 | 10 | 25.55 | 0.8273 |
It can be seen that the PSNR and MSSIM values are quite low compared to the other examples in this post. This is probably due to the large batch size, teaching us that a higher batch size was not favorable in this case.

Left: ground truth; middle: batch size 1 with 1200 epochs; right: batch size 10 with 1200 epochs.
We retrained the model with batch size 1 and the better-performing parameters from the hyperparameter search presented later in this post. The result is significantly better with the smaller batch size, which is also reflected in the performance metrics.
| Beta | Epochs | Warmup | Loss | Batch | PSNR | MSSIM |
| --- | ------ | ------ | ---- | ----- | ----- | ------ |
| 0.9 | 1200 | 0.1 | L2 | 1 | 34.05 | 0.9490 |
### Strides
We also needed to adjust the output resolution, since Cholec80 is 480p. The default strides are 5, 3, 2, 2, 2, which produce a 1080p output. As shown in the figure from the NeRV paper, these strides are upscale factors that can be changed to adjust the resolution of the result. Since 480 is not divisible by 9 (the height of the initial feature map), we opted for 540p, which was achieved with strides 5, 3, 2, 2, 1.
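As a quick sanity check on the stride choice, the output resolution is simply the initial feature-map size multiplied by the product of the strides. Assuming the 9×16 base feature map NeRV uses for 16:9 video, the arithmetic looks like this:

```python
from math import prod

def output_resolution(strides, base_hw=(9, 16)):
    """Output height/width = base feature-map size times the product of the strides."""
    factor = prod(strides)
    return base_hw[0] * factor, base_hw[1] * factor

print(output_resolution([5, 3, 2, 2, 2]))  # (1080, 1920) -> default 1080p
print(output_resolution([5, 3, 2, 2, 1]))  # (540, 960)   -> our 540p Cholec80 setting
```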

## **4. Ablation Study**
The NeRV paper provides an ablation study on five different parts of the architecture:
1. Input embedding
2. Upscale layer
3. Normalization layer
4. Activation layer
5. Loss objective
For the reproduction of this ablation study, a single part was chosen: the loss objective. NeRV uses a composite loss function that combines L1 loss with SSIM loss.
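The sketch below shows the general form of such a combined objective, with α = 0.7 matching our reading of the paper. The repo's default `Fusion6` loss may weight or compose the terms slightly differently, and the `pytorch_msssim` package is used here purely for illustration.

```python
import torch
from pytorch_msssim import ssim  # assumed dependency for this sketch

def l1_ssim_loss(pred, target, alpha=0.7):
    """Combined objective: alpha * L1 + (1 - alpha) * (1 - SSIM).
    alpha = 0.7 follows our reading of the paper; the repo's 'Fusion6'
    variant may differ in the exact weighting."""
    l1 = torch.mean(torch.abs(pred - target))
    ssim_term = 1 - ssim(pred, target, data_range=1.0)
    return alpha * l1 + (1 - alpha) * ssim_term

# Illustrative usage with random tensors standing in for predicted / ground-truth frames.
pred = torch.rand(1, 3, 540, 960, requires_grad=True)
target = torch.rand(1, 3, 540, 960)
loss = l1_ssim_loss(pred, target)
loss.backward()
```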
The result of the original ablation study is as follows:
|Loss function| PSNR| MSSIM|
|----|-----|------|
| L1|35.77|0.959|
| SSIM|35.69|**0.971**|
| L1+SSIM|**37.29**|0.970|
It is clear that the loss function was chosen based on the significant improvement in PSNR over the other two loss functions. The MSSIM does not change much between loss functions, with a slight but insignificant win for SSIM.
The original ablation study uses the entire UVG dataset. Due to resource constraints, a single video, ReadySteadyGo, was chosen for this reproduction. Secondly, the model was only trained for 300 epochs instead of the 1200 epochs used in the NeRV paper. This explains the large difference in absolute values between our results and the original paper; because of this, we only look at relative differences.
Below is the result of our ablation reproduction.
|Loss function| PSNR| MSSIM|
|----|-----|------|
| L1|**19.50**|0.6443|
| SSIM|18.60|0.6509|
| L1+SSIM|19.34|**0.6514**|
Our data shows that L1 has the highest PSNR while L1+SSIM has the highest MSSIM. This is inconsistent with the original paper. The discrepancy can likely be explained by the shorter training time and smaller dataset. Even though L1 has the best PSNR, it can still be argued that L1+SSIM is the better loss function, as its PSNR is very close to that of L1 while its MSSIM is the highest of the three.
## **5. Hyperparameter sensitivity analysis**
A brief sensitivity analysis of some of the model's hyperparameters was undertaken. It should be noted that no large grid search could be performed due to computational constraints. The dataset used in all hyperparameter experiments was the Big Buck Bunny dataset, and the models were trained for 300 epochs. Pruning hyperparameters were not investigated, since the paper provides sufficient detail on these.
### 5.1 Learning Rate
First, the learning rate was investigated. For this, the default learning rate of 5e-4 was modified and the model was trained for 300 epochs. The following figure was obtained:

This figure shows that the learning rate scheduling does not seem to be ideal in the current implementation. Especially in later epochs, the model seems to benefit from higher learning rates than the current cosine decay provides. This results in the 5e-3 learning rate outperforming the default learning rate in this case.
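For context, our understanding of the schedule is a linear warmup over the first fraction of training followed by cosine decay toward zero. The sketch below reproduces that behaviour (the repo's exact implementation may differ in detail) and shows how small the learning rate becomes in the final epochs, which is consistent with our observation that a larger base rate helps late in training.

```python
import math

def lr_at(epoch, total_epochs, base_lr=5e-4, warmup_frac=0.2):
    """Linear warmup for the first `warmup_frac` of training, then cosine decay to 0."""
    warmup_epochs = warmup_frac * total_epochs
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Learning rate at a few points of a 300-epoch run with the default settings.
for e in (0, 60, 150, 250, 299):
    print(e, f"{lr_at(e, 300):.2e}")
```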
### 5.2 Hyper Parameter Grid Search
Next, a simple hyperparameter grid search was performed. For this, the default learning rate of 5e-4 was used and the models were trained for 300 epochs. The chosen hyperparameters are:
* beta = [0.5, 0.9]
* batch_size = [1, 3]
* warmup = [0.2, 0.1]
* loss_type = ["Fusion6", "Fusion5", "L2"]
For each list, the first item is the default setting. It should also be noted that the batch size was limited to 3, while the paper used 6 for the 1500-epoch state-of-the-art comparison runs; this was due to the hardware limitations of the 1070 Ti that was used. The model was trained 24 times, and a slightly better set of hyperparameters for the chosen metric and training duration was found; a sketch of the grid loop is shown below, followed by the resulting metrics:
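Here, `train_and_eval` is a hypothetical placeholder for launching the repo's training script with the given settings for 300 epochs and reading back the final PSNR and MS-SSIM on Big Buck Bunny; only the loop structure is meant to reflect what we actually ran.

```python
from itertools import product

grid = {
    "beta": [0.5, 0.9],
    "batch_size": [1, 3],
    "warmup": [0.2, 0.1],
    "loss_type": ["Fusion6", "Fusion5", "L2"],
}

def train_and_eval(beta, batch_size, warmup, loss_type):
    """Placeholder: in our runs each configuration was a separate 300-epoch
    training run of the repo's training script. Replace this stub with an
    actual call that returns the final PSNR and MS-SSIM."""
    return 0.0, 0.0

results = []
for beta, batch_size, warmup, loss_type in product(*grid.values()):
    psnr, ms_ssim = train_and_eval(beta, batch_size, warmup, loss_type)
    results.append({"beta": beta, "batch_size": batch_size, "warmup": warmup,
                    "loss_type": loss_type, "psnr": psnr, "ms_ssim": ms_ssim})

# 2 * 2 * 2 * 3 = 24 configurations, ranked by PSNR.
results.sort(key=lambda r: r["psnr"], reverse=True)
print(results[0])
```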

This should not be taken as a definitive improvement, but rather as a starting point for further evaluation. The potentially improved hyperparameters are:
* default - beta: 0.5, warmup: 0.2, loss: Fusion6, Batchsize: 1
* best - beta: 0.9, warmup: 0.1, loss: L2, Batchsize: 1
A comparison with the default parameters on the left and the "improved" parameters on the right can be seen below. On the bunny's ears, for example, visual improvements can be observed.

Another interesting observation is that a batch size of 3 produced significantly worse results. This is unexpected, since the state-of-the-art comparison in the paper uses a batch size of 6. We hypothesize that the paper adjusted the learning rate for the larger batch size; this is, however, not mentioned in the paper.
Finally, below is a table of the obtained PSNR and MS-SSIM for most of the training configurations. The results for batch size 3 have been omitted for brevity, since they were so much worse.
|Beta|Warmup| Loss |Batch| PSNR| MSSIM|
|----|------|-------|-----|-----|------|
| 0.5| 0.1 |Fusion6| 1 |32.15|0.9593|
| 0.5| 0.1 |Fusion6| 3 |29.07|0.9235|
| 0.5| 0.1 |Fusion5| 1 |31.31|0.9587|
| 0.5| 0.1 | L2 | 1 |32.93|0.9513|
| 0.5| 0.2 |Fusion6| 1 |32.11|0.9591|
| 0.5| 0.2 |Fusion5| 1 |31.29|0.9582|
| 0.5| 0.2 | L2 | 1 |32.87|0.9506|
| 0.9| 0.1 |Fusion6| 1 |32.27|0.9604|
| 0.9| 0.1 |Fusion5| 1 |31.38|0.9598|
| 0.9| 0.1 | L2 | 1 |33.17|0.9538|
| 0.9| 0.2 |Fusion6| 1 |32.16|0.9593|
| 0.9| 0.2 |Fusion5| 1 |31.30|0.9590|
| 0.9| 0.2 | L2 | 1 |33.14|0.9534|
## **6. Discussion & Conclusion**
In this blog post, we reproduced the results from the paper "NeRV: Neural Representations for Videos" by Hao Chen et al. We successfully trained the NeRV model on the Big Buck Bunny dataset and achieved similar performance to the original paper, with an average PSNR of 34.47 compared to their reported 34.21.
Furthermore, we evaluated the denoising capabilities of the NeRV model on the Honeybee video from the UVG dataset. We applied salt and pepper noise to the video frames and trained the NeRV model on this noisy data. The trained model showed good denoising performance, achieving a PSNR of 37.77 when evaluated against the non-noisy ground truth footage.
We also evaluated the NeRV model on a new dataset, the Cholec80 endoscopic video dataset. By adjusting the strides to match the resolution of the dataset, we were able to train the model successfully. We found that using a smaller batch size (1 vs. 10) performed better, achieving a PSNR of 34.05 and an MSSIM of 0.9490.
In addition to the main experiments, we conducted an ablation study on the loss function used in NeRV. While our results were inconsistent with the original paper, likely due to the shorter training time and smaller dataset, we found that the combined L1+SSIM loss function achieved the highest MSSIM score, suggesting its potential as a good choice for the loss function.
Finally, we performed a hyperparameter sensitivity analysis, investigating the learning rate and running a small grid search over several other hyperparameters. We found that a higher learning rate of 5e-3 outperformed the default rate, and identified a potentially improved set of hyperparameters (beta=0.9, warmup=0.1, loss=L2, batch size=1) that yielded better performance on the Big Buck Bunny dataset.
Overall, our reproduction validates the effectiveness of the NeRV approach for video representation and demonstrates its potential applications in areas such as video compression and denoising.
### Contributions of team members:
- **Anton**: Hyperparameter grid search and learning rate analysis
- **Dielof**: Evaluation on New Data
- **David**: Ablation study
- **Remi**: Bunny reproduction, denoising reproduction (and the required scripts)
### AI Disclosure
While the content and experiments in this blog post are our own work, some text has been edited and improved using the AI language model Claude from Anthropic.