# Reproduction: RGB Stream Is Enough for Temporal Action Detection
###### Authors:
Reuben Gardos Reid (r.j.gardosreid@student.tudelft.nl, 5457734),
Leo Hendriks (l.j.f.hendriks@student.tudelft.nl, 4605594),
Abel Van Steenweghen (a.vansteenweghen@student.tudelft.nl, 4876431),
Gautham Venkataraman (g.venkataraman@student.tudelft.nl, 5334152)
###### tags: `reproduction` `new data` `ablation` `temporal action detection`
---
This blog post gives an overview of the single-stage temporal action detection algorithm DaoTAD and covers our reproduction process and results. The goal of our project is to reproduce the results in Table 3 of the original paper, "RGB Stream is Enough for Temporal Action Detection"[^RGBEnough]. Additionally, we perform a brief ablation study on the data augmentation and train the model on a new dataset, MultiTHUMOS[^MultiTHUMOS]. This project was part of the course Deep Learning (CS4240) at Delft University of Technology.
#### What is temporal action detection?
Before we dive into the methodology and results of our reproduction, let's have a quick primer on what **temporal action detection (TAD)** is.
Consider object detection, where the goal is to locate objects in an image and to classify the objects that have been found. In temporal action detection, by contrast, the goal is to detect an action and find its temporal boundaries, i.e. the beginning and end time of that action in a video.
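Concretely, a TAD model outputs a set of scored temporal segments per video. The snippet below merely illustrates that output format; the video name, timestamps, and scores are made up.
```python
# Illustrative only: a TAD prediction is a set of scored temporal segments per video.
detections = {
    "video_test_0000004": [
        # (start in seconds, end in seconds, action class, confidence)
        (12.3, 18.9, "CricketBowl", 0.91),
        (19.0, 24.5, "CricketShot", 0.84),
    ],
}
```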
Current TAD methods can be divided into two categories:
- Two-stage methods: first generate a number of segment proposals, then classify each proposal into an action class.
- One-stage methods: perform localization and classification at the same time. Most one-stage methods use two data streams: RGB frames and optical flow.
The RGB stream is simply the sequence of raw RGB frames of the video.
Optical flow is the apparent motion of parts of the image between consecutive frames. Computing optical flow is time-consuming, however, which is why current one-stage methods compute it offline in advance.
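To give an idea of why this precomputation is expensive, the sketch below computes dense optical flow for every consecutive frame pair of a video with OpenCV's Farnebäck method. This is a generic illustration, not the exact flow extractor used by two-stream TAD methods (which often rely on other algorithms such as TV-L1).
```python
import cv2

def precompute_flow(video_path):
    """Estimate dense optical flow between consecutive frames; slow, hence done offline."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Per-pixel (dx, dy) motion between the previous and current frame.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev_gray = gray
    cap.release()
    return flows
```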
#### What is DaoTAD?
DaoTAD is a one-stage TAD method that relies only on the RGB stream and does not use optical flow. Dropping the dependency on optical flow makes image-level data augmentation possible, which allows for a far less computationally intensive TAD method with comparable performance.
### Data Augmentation
Data augmentation is useful to artificially increase the size of a dataset, without extra labelling cost, by making slightly modified copies of the original data. DaoTAD uses both temporal-level (TLDA) and image-level data augmentation (ILDA). For ILDA, DaoTAD uses four techniques: random crop, photo distortion, random rotation, and random flip.
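To give a feel for what image-level augmentation looks like when it must be applied consistently across the frames of a clip, here is a minimal sketch using torchvision. The crop size, rotation range, and jitter ranges are our assumptions, not DaoTAD's exact settings.
```python
import random
import torchvision.transforms.functional as TF

def image_level_augment(frames):
    """Apply the same randomly drawn image-level augmentations to every frame of a clip.

    `frames` is a list of PIL images; all parameters below are illustrative.
    """
    # Random horizontal flip, shared across the whole clip.
    if random.random() < 0.5:
        frames = [TF.hflip(f) for f in frames]
    # Random rotation with one shared angle.
    angle = random.uniform(-10, 10)
    frames = [TF.rotate(f, angle) for f in frames]
    # Photometric distortion: shared brightness and contrast jitter.
    brightness, contrast = random.uniform(0.8, 1.2), random.uniform(0.8, 1.2)
    frames = [TF.adjust_contrast(TF.adjust_brightness(f, brightness), contrast) for f in frames]
    # Random crop at one shared location (crop size is an assumption).
    w, h = frames[0].size
    th, tw = 112, 112
    top, left = random.randint(0, h - th), random.randint(0, w - tw)
    frames = [TF.crop(f, top, left, th, tw) for f in frames]
    return frames
```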
### Structure of DaoTAD
In this section we give a short overview of the structure of the DaoTAD classifier. The image below shows a schematic overview of the DaoTAD architecture.

DaoTAD consists of three main parts:
#### Backbone
The backbone consists of a 3D feature extractor followed by a spatial reduction module (SRM). ResNet50 I3D is used to extract the features because of its efficiency and performance. The spatial reduction module reduces the computational cost while still letting the classifier use the whole frame. Based on their ablation study, the authors chose average pooling for this module.
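In essence, the SRM collapses the spatial dimensions of the 3D feature map. A minimal sketch, assuming the backbone output has shape (batch, channels, time, height, width) and that plain average pooling is used:
```python
import torch
import torch.nn as nn

class SpatialReductionModule(nn.Module):
    """Collapse the spatial dimensions of a 3D-CNN feature map by average pooling."""
    def forward(self, x):
        # x: (batch, channels, time, height, width) -> (batch, channels, time)
        return x.mean(dim=(3, 4))

srm = SpatialReductionModule()
feat = torch.randn(2, 2048, 32, 7, 7)  # ResNet50 I3D-style features; shapes are illustrative
print(srm(feat).shape)                 # torch.Size([2, 2048, 32])
```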
#### Neck
Next comes the neck, which consists of a temporal downsample module (TDM) followed by a temporal feature pyramid network (TFPN). The TDM is used to deal with the large variance in temporal scale. It consists of four 1D convolution layers with kernel size 3 and stride 2, each halving the temporal dimension. The output of each of these layers and the original input of the neck are then sequentially summed in the TFPN. The idea behind this module is to place the frames of a time span in the context of the frames before and after it.
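The sketch below illustrates this idea with a stack of stride-2 1D convolutions followed by a simple top-down sum over the resulting pyramid. The channel count, activation, and exact summation scheme are our assumptions rather than DaoTAD's precise configuration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDownsample(nn.Module):
    """Four stride-2 1D convolutions, each roughly halving the temporal length."""
    def __init__(self, channels=512):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1) for _ in range(4)]
        )

    def forward(self, x):                  # x: (batch, channels, time)
        feats = [x]
        for conv in self.convs:
            feats.append(F.relu(conv(feats[-1])))
        return feats                       # temporal lengths T, T/2, T/4, T/8, T/16

def temporal_fpn(feats):
    """Top-down pyramid: upsample each coarser level and sum it into the next finer one."""
    out = [feats[-1]]
    for feat in reversed(feats[:-1]):
        up = F.interpolate(out[0], size=feat.shape[-1], mode='nearest')
        out.insert(0, feat + up)
    return out

pyramid = temporal_fpn(TemporalDownsample()(torch.randn(2, 512, 32)))
print([p.shape[-1] for p in pyramid])      # [32, 16, 8, 4, 2]
```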
#### Head
The head is the final part of the DaoTAD module and consists of the Temporal Prediction Module (TPM). The TPM has two branches:
- the classification branch that predicts the class probabilities for each segment and
- the regression branch that predicts the temporal boundaries for each segment.
Both branches consist of five 1D convolution layers with kernel size 3 and stride 1. They differ only in the last layer, which produces class scores in the classification branch and boundary regression values in the regression branch.
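A rough sketch of such a two-branch head is given below; the channel width and single-anchor setup are assumptions for illustration, not the exact DaoTAD head.
```python
import torch
import torch.nn as nn

def branch(channels, out_channels):
    """Five 1D convolutions (kernel 3, stride 1); only the final layer differs per branch."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.Conv1d(channels, out_channels, kernel_size=3, padding=1))
    return nn.Sequential(*layers)

class TemporalPredictionModule(nn.Module):
    def __init__(self, channels=512, num_classes=20, num_anchors=1):
        super().__init__()
        # Classification branch: class scores per anchor per temporal position.
        self.cls_branch = branch(channels, num_anchors * num_classes)
        # Regression branch: (start, end) offsets per anchor per temporal position.
        self.reg_branch = branch(channels, num_anchors * 2)

    def forward(self, x):                  # x: (batch, channels, time)
        return self.cls_branch(x), self.reg_branch(x)

cls_out, reg_out = TemporalPredictionModule()(torch.randn(2, 512, 32))
print(cls_out.shape, reg_out.shape)        # (2, 20, 32) and (2, 2, 32)
```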
## Reproduction
### Hardware
The authors of the paper mention that they ran their experiments on 4 NVIDIA GTX 1080 Ti GPUs with a batch size of 16, a combination that amounts to 44 GB of GPU memory. Unfortunately, we were limited to a single GPU on Google Cloud Platform (GCP). In order to fit a batch size of 16 on one GPU, we resorted to an instance with an NVIDIA A100. Our A100 instance ([`a2-highgpu-1g` on GCP](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a2_vms)) had the following software and hardware specifications:
- CPU: Intel Xeon (12 vCPUs) @ 2.20 GHz
- RAM: 85,484 MiB
- OS: Ubuntu 18.04.6 LTS x86_64 (kernel 5.4.0)
- GPU: NVIDIA A100 (driver 455.32)
- GPU memory: 40 GB HBM2
- CUDA version: 11.1
- cuDNN version: 8.0.5 (for CUDA 11.1)
- Python version: 3.8.5
- PyTorch version: 1.8.2
We tried to match the original specifications as closely as we could. However, due to our limited hardware options and the substantial changes in the CUDA, cuDNN, and PyTorch APIs since the paper was published, this is the toolchain we settled on.
### THUMOS14
THUMOS14[^THUMOS14] is the dataset used for evaluation in the original paper. It comprises 1,010 videos, 432 of which have temporal action annotations relevant for temporal action detection. This set of 432 videos is split into 220 validation videos (used for training) and 212 test videos, and the annotations cover 20 action classes. For example, the clip below comes from a video in the dataset that contains both the `CricketShot` and `CricketBowl` classes.

The code provided by the authors of the original paper includes scripts for downloading and processing the videos and annotations. Instructions for completing the process can be found [in the DaoTAD repository](https://github.com/andohuman/vedatad/tree/main/tools/data/thumos14).
### MultiTHUMOS
Next to reproducing the paper's results on the THUMOS14 dataset, we also wanted to investigate the model's performance on a similar dataset. For this we chose MultiTHUMOS[^MultiTHUMOS], which has 65 different action classes compared to the 20 of THUMOS14. In general, MultiTHUMOS is denser than THUMOS14, meaning that there are on average more labels per frame and more action classes per video. We investigate how well DaoTAD performs on MultiTHUMOS and how it compares to other classifiers built specifically for this dataset.
The MultiTHUMOS dataset can be considered an extension of the labels in the original THUMOS14 dataset. It does not come with any new videos; instead of class annotations for just 20 actions, the authors identify and provide class annotations for 65 actions (including the original 20) on the same videos present in THUMOS14.
Naturally, the MultiTHUMOS download only contains the new annotation labels, which have to be downloaded and extracted while preserving the directory structure. We wrote a simple [script](https://github.com/andohuman/vedatad/blob/main/tools/data/multithumos/annotations/create_split.py) that splits the MultiTHUMOS annotations and generates annotation files consistent in file structure and naming conventions with the original code that works with THUMOS14.
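The sketch below shows roughly what such a split amounts to, assuming the MultiTHUMOS annotations follow the THUMOS-style per-class text files in which each line reads `video_name start_time end_time`; the actual script in the repository additionally matches the naming conventions expected by the training code.
```python
import glob
import os

def split_annotations(src_dir, out_dir):
    """Split per-class annotation files into a validation (training) set and a test set."""
    for path in glob.glob(os.path.join(src_dir, '*.txt')):
        val_lines, test_lines = [], []
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                video = line.split()[0]
                # The split is encoded in the video name, as in THUMOS14.
                (val_lines if video.startswith('video_validation') else test_lines).append(line)
        name = os.path.basename(path)
        for split, lines in (('val', val_lines), ('test', test_lines)):
            os.makedirs(os.path.join(out_dir, split), exist_ok=True)
            with open(os.path.join(out_dir, split, name), 'w') as f:
                f.writelines(lines)
```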
The DaoTAD classifier can almost immediately be applied to the MultiTHUMOS dataset. The only difference is the number of classes, so a small change needs to be made to the last layer of the classification branch in the TPM. This is done by changing the number of classes in the training configuration file.
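In the config this boils down to something like the following; the field names are illustrative of vedatad-style configs rather than a verbatim excerpt.
```python
# Sketch of the relevant config change (field names are illustrative).
num_classes = 65  # MultiTHUMOS has 65 action classes instead of THUMOS14's 20

model = dict(
    head=dict(
        num_classes=num_classes,  # resizes the last layer of the classification branch
        # ... remaining head settings stay unchanged ...
    ),
)
```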
To speed up training, we initialise the classifier with the weights trained on THUMOS14 (excluding the last layer of the classification branch). We do this because we expect the optimal weights of all other layers to be similar to those learned on THUMOS14. To accomplish this, we extended DaoTAD's weight loading: in the config one can now specify a list of weight-name prefixes to exclude from loading, and any excluded weights are instead initialised randomly. We can thus achieve our goal simply by adding the names of the weights of the last layer to the exclusion list in the config.
```python
weights = dict(
    filepath='daotad_i3d_r50_e700_thumos14_rgb.pth',
    # Add any prefixes to "exclude" to not load them from the checkpoint before starting training.
    exclude=("head.retina_cls.", ),
)
```
*New config file functionality to exclude certain layers from loading weights from a checkpoint.*
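Under the hood, such an exclusion can be implemented by filtering the checkpoint's state dict by prefix before loading it with `strict=False`. The following is a minimal sketch of that idea, not the exact vedatad code.
```python
import torch

def load_weights_with_exclude(model, filepath, exclude=()):
    """Load a checkpoint, skipping every parameter whose name starts with an excluded prefix."""
    checkpoint = torch.load(filepath, map_location='cpu')
    state_dict = checkpoint.get('state_dict', checkpoint)
    filtered = {k: v for k, v in state_dict.items()
                if not any(k.startswith(prefix) for prefix in exclude)}
    # strict=False leaves the excluded layers at their random initialisation.
    model.load_state_dict(filtered, strict=False)
```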
### Ablation
Lastly, we performed an ablation study on the ILDA. For this we disabled the four data augmentation techniques in the config file; the random crop was replaced by a centred crop to keep the correct frame format. We compare the performance of this ablated model both with the results of our regular DaoTAD run and with the results reported in the paper.
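In practice the ablation amounts to editing the data pipeline in the config: the random crop is replaced by a centre crop and the other three augmentations are dropped. The snippet below is only a schematic of that change; the transform names and sizes are placeholders, not the exact vedatad pipeline entries.
```python
# Schematic of the change to the training pipeline (transform names and sizes are placeholders).
train_pipeline_with_ilda = [
    dict(type='RandomCrop', size=(112, 112)),
    dict(type='PhotoMetricDistortion'),
    dict(type='RandomRotate', max_angle=10),
    dict(type='RandomFlip', flip_ratio=0.5),
]

train_pipeline_ablated = [
    dict(type='CenterCrop', size=(112, 112)),  # deterministic crop keeps the frame format intact
]
```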
## Results
#### What is mAP? What is IoU?
mAP (mean average precision) is a common metric for evaluating action and object detection. To understand what mAP measures, it is important to first understand the IoU (intersection over union) threshold. IoU compares two overlapping objects or durations: it is calculated by dividing the area covered by the intersection of the two objects (or the time covered by both durations) by the area covered by their union.

*Source: [pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/](https://pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/)*
This measure is then compared against a threshold (0.5 by default) to determine whether the two objects are "the same". In the context of TAD, with a threshold of 0.5, two durations are considered "the same" if their intersection covers more than half of their union. If this is the case *and* the labels match, the detection is registered as a true positive. Using the IoU and the threshold, precision and recall can be calculated, and the area under the precision-recall curve gives the AP. The mAP is the mean of the AP over all action classes.
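For temporal action detection the same idea reduces to a one-dimensional overlap between two (start, end) intervals, as the small helper below makes explicit.
```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# Two detections sharing 3 seconds out of a 7-second union:
print(temporal_iou((10.0, 15.0), (12.0, 17.0)))  # ~0.43, below a 0.5 threshold
```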
### THUMOS14 Results
Here we compare the mAP obtained with the paper's published weights and with the weights we trained ourselves.
We first verified that the weights released by the authors matched the performance claimed in the paper. Second, we trained DaoTAD from scratch using the same seeds as the authors to see whether we could match their performance.
| Type | Method | Optical Flow | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
| :---:| :-----:|:----:|:----:|:---:|:---:|:---:|:---:|
| One-stage | Paper | :x: | 62.8 | 59.5 | 53.8 | 43.6 | 30.1 |
| One-stage | Ours | :x: | 59.6 | 56.0 | 50.7 | 40.9 | 29.2 |
*mAP (%) on the THUMOS14 test set at different tIoU thresholds.*
The paper's weights produce slightly better mAP scores than ours. While our results are fairly consistent with the paper, the gap is curious given that, to the best of our knowledge, we matched all of the training parameters and hardware configuration aside from the GPU.
We also observed that a model trained on an NVIDIA T4 GPU in an earlier run produced worse mAP scores than our current model. Perhaps the batch size and the GPU specifications play a more important role than expected.
### MultiTHUMOS Results
Here we compare the mAP performance of DaoTAD to the three best performing models on MultiTHUMOS.
| Type | Method | Optical Flow | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
| :---:| :-----:|:----:|:----:|:---:|:---:|:---:|:---:|
| Two-stage | MLAD | :heavy_check_mark: | - | - | 51.5 | - | - |
| Two-stage | PDAN | :heavy_check_mark: | - | - | 47.6 | - | - |
| Two-stage | TGM | :heavy_check_mark: | - | - | 46.4 | - | - |
| One-stage | DaoTAD | :x: | 37.6 | 34.3 | 29.8 | 23.7 | 16.2 |
*The performance of the top three models was obtained from [paperswithcode](https://paperswithcode.com/sota/action-detection-on-multi-thumos). The mAP is given for different tIoU thresholds. For the top three models only the performance on tIoU 0.5 is known.*
We can observe that DaoTAD scores significantly lower on the MultiTHUMOS dataset. We believe this might be because DaoTAD struggles with the denser, overlapping action labels introduced in MultiTHUMOS. The top three models all make use of optical flow, so it might also be the case that the RGB stream alone is in fact *not* enough for this kind of temporal action detection.
We also suspect that training DaoTAD directly on the MultiTHUMOS dataset, without initialising from the weights trained on THUMOS14, might affect the results. However, we could not verify this hypothesis as we did not have enough GCP credits left to spare.
### Ablation Results
Here we compare the results of the ablated version of our DaoTAD implementation with the scores reported in the paper's ablation study.
| Method | Ablation | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
| :-----:|:----:|:----:|:---:|:---:|:---:|:---:|
| Paper | :x: | 62.8 | 59.5 | 53.8 | 43.6 | 30.1 |
| Paper | :heavy_check_mark: | - | - | 48.4 | 38.7 | 24.3 |
| Ours | :x: | 59.6 | 56.0 | 50.7 | 40.9 | 29.2 |
| Ours | :heavy_check_mark: | 59.3 | 55.6 | 48.4 | 39.4 | 26.1 |
*mAP on THUMOS14 at different tIoU thresholds; a check mark in the "Ablation" column means ILDA was disabled.*
In line with the paper, there is a significant performance drop caused by the ILDA ablation. Noteworthy is that the mAP scores of our implementation decrease considerably less than those reported in the paper, with our ablated model even scoring slightly better than the paper's ablated results at the higher thresholds.
These results confirm that ILDA is crucial for temporal action detection when leaving out optical flow.
## Lessons learned
The reproduction process was not without its hiccups, and we learned a couple of lessons the hard way.
First, we overlooked that running on a single GPU with the paper's per-GPU batch size would effectively reduce the batch size. The authors used 4 GPUs, each with a batch size of 4, for an effective batch size of 16. The first time we trained the model on THUMOS14 from scratch, we trained with a batch size of 4 on a single GPU, and the mAP was 15-20% lower across all IoU thresholds. After switching to a batch size of 16 on the single GPU, the results were much closer to those reported in the DaoTAD paper.
We also learned a lesson about automating back-ups. For most of the project, we took manual snapshots of the VM disks whenever we reached an important milestone. However, relying on manual back-ups does not work well when you forget to make them: if something happens to the VM or the disk before you remember to start the back-up, you may end up retraining all of your models days before the deadline.
## Conclusion
We were able to reproduce the experiments described in the paper. Despite matching the experimental setup to the best of our abilities, our DaoTAD implementation did not quite reach the reported scores.
Our experiments on the MultiTHUMOS dataset showed significantly lower scores than the top-performing models, but these scores could likely be improved by training the model from scratch on this dataset and performing hyperparameter tuning.
The ablation study showed a drop in performance when disabling ILDA, although the drop was smaller for our implementation than reported in the paper.
We also identify two possible directions for improvement. One of the major hurdles in reproducing this paper was setting up the right toolchain, as not all combinations of GPU hardware, CUDA, cuDNN, and PyTorch work nicely with each other. In the interest of reproducibility, it would be better to capture the software requirements in a device-agnostic container using container management tools such as Docker or Podman. This would allow independent researchers to get started quickly and focus on the project without wasting time and effort on setting up the right toolchain.
The second direction for improvement is to extend the code so that the trained network can run inference on an arbitrary input video stream. At the time of writing, the project only covers training and evaluating the network. A simple Python script that uses the trained network to display inference results on any input video would demonstrate the project far better than reading about it.
## Contributions
- Reuben
A lot of my efforts were spent setting up our Google Cloud environment, and making sure training went smoothly. I also spent some time troubleshooting dependency and deprecation issues (as some of the libraries used have made breaking changes since the paper was released). I wrote a script to visualize the output of the logfiles, and made some changes to the test script to allow for reevaluation of test results without running the entire test process again.
- Leo
My contributions to this project are mostly related to the MultiTHUMOS dataset. I implemented the loading of the THUMOS14-trained weights for use in training on the MultiTHUMOS dataset. The overview of the DaoTAD structure in this blog post was also one of my contributions. My experience with working on a VM in Google Cloud Platform was extremely limited and I was more often a burden than a help in that area. I greatly appreciate Gautham's and Reuben's help when using the VM.
- Abel
I mainly contributed to the ablation study and to writing parts of this blog post. I tried to replicate the whole ablation study reported in the paper, but because of operational issues with the Google Cloud console we were only able to run an ablation of the whole ILDA pipeline. I had no prior experience with PyTorch or Google Cloud, so I am really grateful for the help and efforts of Reuben and Gautham in setting up and running the project. Despite this lack of experience, I learned a lot about implementing deep learning models in the cloud.
- Gautham
As I had previous experience with PyTorch, CUDA, and the Unix environment, my contributions were mainly setting up the instances with the required libraries and drivers. This was a cumbersome task in itself, as we tried very hard to replicate the exact specifications used by the authors of the paper but unfortunately had to settle for as close as possible. Besides that, I wrote a simple script to extract and parse the MultiTHUMOS annotations so that we could train the model with them. Overall, my contributions could be summarised as being the IT admin of the group. I'm grateful to Reuben for his considerable efforts in monitoring our training, and I'm thankful to Abel and Leo, who were able to do their tasks in the environment I set up without any hitches.
## References
[^RGBEnough]: Wang, C., Cai, H., Zou, Y., & Xiong, Y. (2021). RGB stream is enough for temporal action detection. arXiv preprint arXiv:2107.04362.
[^MultiTHUMOS]: Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., & Fei-Fei, L. (2018). Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision, 126(2), 375-389.
[^THUMOS14]: Jiang, Y. G., Liu, J., Zamir, A. R., Toderici, G., Laptev, I., Shah, M., & Sukthankar, R. (2014). THUMOS challenge: Action recognition with a large number of classes.