# Learning context in images
**Understanding and reproduction of 'Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection'**
Made by Pierre Bongrand (5396077) & Yu Miyawaki (4250182)
---
In this blog, we will summarize, discuss, partly reproduce, and provide simpler explanations of the paper entitled: "Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection" authored by Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, Jonathan Huang[^article]. One of the major purposes of the blog is to provide a more in-depth and intuitive explanation of both their theory and findings. A Jupyter Notebook containing details about the implementation is available [here](https://colab.research.google.com/drive/1TZFYt3wjXG-4sv3pWEnymQlxvazCF7TR?usp=sharing).
## Introduction
This article draws its inspiration from two previous works. The first and main basis for this work is a paper published in 2015 that introduced Faster R-CNN, a region-based convolutional network method for object detection. We will see later on how Context R-CNN leverages this model.
In 2017, a new attention-based machine learning technique, the Transformer, was published[^attention]. The basic idea is that at each step the model applies a self-attention mechanism that directly models relationships between all words in a sentence, regardless of their respective positions. When predicting the next word of a sentence, the Transformer compares a candidate word to every other word in the sentence. The result of these comparisons is an attention score for every other word, and these scores determine how much each of the other words should contribute to the next representation of the candidate word[^google]. Since these results were state of the art for the task of translation, transformers and their self-attention mechanism attracted a lot of attention and are now widely used, for instance by GPT-3[^GPT3] and many other models.
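To make this concrete, here is a small toy example of our own (not taken from the paper or from any Transformer implementation): given made-up word embeddings, a softmax over dot products produces attention scores that decide how much each word contributes to the new representation of a candidate word.

```python
import numpy as np

# Made-up 4-dimensional embeddings for the words of a short sentence.
words = ["a", "zebra", "crossed", "the", "road"]
embeddings = np.random.rand(len(words), 4)

candidate = embeddings[1]                        # representation of "zebra"
scores = embeddings @ candidate                  # dot product with every word
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention scores

# New representation of "zebra": weighted sum of all word embeddings.
new_candidate = weights @ embeddings
print(dict(zip(words, weights.round(2))))
```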
The authors of the Context R-CNN paper took inspiration from this attention-based model and applied it to computer vision by building on the Faster R-CNN model. Their intuition was that the strength of the attention model for text analysis lies in its overall understanding of the sentence: the model takes information from all words in the sentence, not just the previous $x$ words. However, in video analysis over long time frames, an object of interest may only be visible 1% of the time, which means that the model must be able to relate two images of the same object taken days apart.
## Purpose
The purpose of this paper is to use an attention mechanism on static monitoring camera data to perform object detection. Object detection consists of analysing an image and returning what is in the image and where it is located. If an object of interest is visible, the model should return a bounding box containing it.
These static monitoring cameras are used to monitor animal behaviour, occurrences and populations for biologists in natural reserves, or car traffic patterns in cities. Both of these applications could have a significant impact on the safety of the biologists and of the city. Besides, as we have seen in recent years, the use of video surveillance is becoming more and more common, so such a model could have an extremely important impact on our society. One domain of application could be citizen tracking and identification.

## Challenges
However, with such a task and with the dataset that comes with it, there are many challenges to accomplishing these goals. Here is the list of the most essential challenges the researchers had to solve:
- There is a very high number of pictures in which the object of interest is not visible. The dataset is therefore very unbalanced, which may be challenging for training.
- Even if the object of interest is visible to a human on the monitor, there is no guarantee that it will be centred, focused, well-lit, or at an appropriate scale. This requires significant flexibility of the classifier with respect to these parameters.
- On a similar note, the objects of interest may only be partially observed, for example when they are partly out of the camera's field of view.
- The image quality of the available dataset is quite poor.
- Lastly, in both the city-traffic and the natural-reserve context, there are quite a few background distractors, which can create false positives.
## Data
Three datasets were used in total by the authors to evaluate their results: Snapshot Serengeti (SS)[^SS], Caltech Camera Traps (CCT)[^CCT] and finally CityCam (CC)[^CC]. Although our assignment was about the CityCam data, we could not obtain it because the link to the dataset was no longer valid. That is why we used the SS dataset instead of CC.
### [Snapshot Serengeti](https://github.com/tensorflow/models/tree/master/research/object_detection/test_images)
The Snapshot Serengeti dataset consists of hundreds of camera traps in Serengeti National Park, Tanzania. These camera traps are providing a powerful new window into the dynamics of Africa’s most elusive wildlife species. This grid of camera traps has been operating continuously since 2010 and has produced millions of images.
### [CityCam](https://www.citycam-cmu.com/dataset)
The CityCam dataset contains 10 types of vehicle classes, around 60K frames and 900K annotated objects. It covers 17 cameras monitoring downtown intersections and parkways in a high-traffic city, and “clips” of data are sampled multiple times per day, across months and years. The data is diverse, covering day and nighttime, rain and snow, and high and low traffic density.
In this paper, the authors used 13 camera locations for training and 4 cameras for testing, with both parkway and downtown locations in both sets.
## Method for object detection
We will explain the global architecture of the model by first describing each component one by one, together with the challenge it solves. Then we will describe how the components interact and explain the flow of information through the architecture.
### Faster R-CNN[^faster]
Faster R-CNN is one of the most important components of the model, as it is used repeatedly throughout the network. Moreover, it is the component that performs both the extraction of the regions of interest in an image and their labelling.
Faster R-CNN is mainly composed of two subcomponents:
- **Region Proposal Network (RPN)**: It takes as input an image of the dataset and then, returns a collection of class agnostic bounding box proposals. In short, the objective of this subcomponent is to learn whether the content of a rectangle is an object or a background.
  - A feature map is obtained from the input image via ResNet-101.
    - The feature map is the output produced by passing the input through the last convolutional layer of the learned backbone.
  - Anchor boxes are set on the feature map.
    - For each point (anchor) on the feature map, nine anchor boxes are created. The information obtained by comparing each anchor box with the ground truth becomes the training targets for the RPN.
  - The RPN training targets (the output layer of the RPN model) are created by comparing the anchor boxes with the ground truth.
    - The output is "whether the content of an anchor box is background or an object" and, "if it is an object, how far it is shifted from the ground-truth box".
- **Object Detection**: This stage extracts instance-level features via the ROIAlign operation which then undergo classification and box refinement. In short, the objective of this subcomponent is to learn what is in the detected rectangle.
  - The input image is passed through ResNet-101 (same structure and weights as used in the RPN) to obtain the feature map.
  - Each proposal is converted into a fixed-size feature via ROIAlign.
  - The resulting vectors are passed through two fully-connected layers with 4096 units each, and finally two types of outputs are obtained: one for class classification and one for bounding-box offset regression.
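To make this two-stage pipeline more tangible, here is a minimal sketch using the pre-trained Faster R-CNN that ships with torchvision. Note our assumptions: torchvision's default model uses a ResNet-50-FPN backbone (not the ResNet-101 backbone of the paper), and the `weights="DEFAULT"` argument requires torchvision ≥ 0.13.

```python
# Sketch: running a pre-trained two-stage detector (RPN -> ROIAlign ->
# classification + box regression) with torchvision's Faster R-CNN.
# Assumes torchvision >= 0.13 and a ResNet-50-FPN backbone, unlike the paper.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# Dummy RGB image with values in [0, 1]; replace with a real camera frame.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    detections = model([image])[0]  # dict with "boxes", "labels", "scores"

print(detections["boxes"].shape, detections["scores"][:5])
```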

### ResNet-101
ResNet-101 is used as the backbone of the above model for transfer learning. It makes it possible to train a very deep network that expresses rich features without suffering from vanishing gradients. The main idea is the insertion of shortcut (residual) connections.
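As an illustration of this idea, here is a minimal residual block sketch in PyTorch; it is a generic version of the shortcut mechanism, not the exact bottleneck block used in ResNet-101.

```python
# A minimal residual ("shortcut") block: the input is added back to the
# convolved output, so gradients can flow through the identity branch.
import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```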
### Memory Banks
As the goal of this paper is to make use of context, the model needs to store some information about the context of each frame. The authors created two types of memory bank containing information about the surrounding frames: a Long Term Memory (M_long) and a Short Term Memory (M_short). The long term memory bank aims to provide contextual information over a very long time frame, from days to weeks, while the short term memory bank focuses on the surrounding frames taken within the last and next few hours.
The first-stage (RPN) box features from Faster R-CNN are routed through two attention-based modules that index into these memory banks (M_short & M_long).
- **Long Term Memory Bank (M_long)**
  - Given a predefined time horizon $i_{t-k}:i_{t+k}$, a frozen, pre-trained detector is run on each frame.
  - The long term memory bank is built from the feature vectors corresponding to the resulting detections.
  - Three strategies are used to limit the size of M_long:
    - Only instance-level feature tensors obtained after cropping proposals from the RPN are taken, and only a spatially pooled representation of each such tensor, concatenated with a spatiotemporal encoding of the datetime and box position, is saved (yielding per-box embedding vectors).
    - The number of proposals for which features are stored is limited.
    - A pre-trained single-frame Faster R-CNN with a ResNet-101 backbone is used as a frozen feature extractor.
  - → Memory bank capacity: 8500 contextual features (one month of data).
- **Short Term Memory Bank (M_short)**
  - For small window sizes it is feasible to hold features for all box proposals in memory (unlike for the long time horizon).
  - The stacked tensor of cropped instance-level features across all frames within a small window around the current frame (≤ 5 frames) is taken and globally pooled across the spatial dimensions (width and height).
  - Output: a matrix of shape (# proposals per frame × # frames) × (feature depth) containing a single embedding vector per box proposal.
  - → This matrix (M_short) is then passed into the short term attention block.
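The short term bank is simple enough to sketch directly. Below is a small illustration of our own, where random tensors stand in for the cropped instance-level detector features; the shapes follow the description above.

```python
# Assembling a toy short term memory bank M_short from cropped
# instance-level features of the frames in a small window (<= 5 frames).
import torch

num_frames, proposals_per_frame, depth = 5, 8, 2048
# One tensor of cropped instance features per frame: [proposals, 7, 7, depth].
window = [torch.randn(proposals_per_frame, 7, 7, depth) for _ in range(num_frames)]

# Global spatial pooling over width and height: one embedding per box proposal.
pooled = [feats.mean(dim=(1, 2)) for feats in window]

# Stack across frames: (# proposals per frame * # frames) x (feature depth).
m_short = torch.cat(pooled, dim=0)
print(m_short.shape)  # torch.Size([40, 2048])
```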
**Implementation:**
As we can see in the first image below, a context feature bank is built up.

In the image below, all of the picked-up context embeddings are put into the contextual memory bank.

### Attention Modules Architecture

As can be seen in the image above, the attention block uses two main inputs: the input features (A) and the context features (B). We will go through each of them. By taking a relevance-weighted sum over the memory bank $M$ and adding it back to the original object features, a context feature is constructed for each object.
- Input Features (A)
  - The tensor of input features from the current frame, which in our setting has shape $[n \times 7 \times 7 \times 2048]$, with $n$ the number of proposals emitted by the first stage of Faster R-CNN.
  - We first spatially pool $A$ across the feature width and height dimensions, yielding $A^{pool}$ with shape $[n \times 2048]$.
  - Normalize the pooled features.
- Context Features (B)
  - $B$ can be either $M_{short}$ or $M_{long}$.
- Basic Functioning
  - Define $k(\cdot; \theta)$ as the key function, $q(\cdot; \theta)$ as the query function, $v(\cdot; \theta)$ as the value function, and $f(\cdot; \theta)$ as the final projection that returns us to the correct output feature length, to be added back into the input features.
  - A distinct $\theta$ ($\theta_{long}$ or $\theta_{short}$) is used for long term or short term attention respectively. In the experiments, $k$, $q$, $v$ and $f$ are all fully-connected layers with output dimension $2048$.
  - Compute attention weights $w$ using standard dot-product attention: $w = \operatorname{softmax}\left(q(A^{pool}; \theta)\, k(B; \theta)^T / \sqrt{d}\right)$, with $d = 2048$ the feature depth.
  - Then, construct a context feature $F^{context}$ for each box by taking a projected, weighted sum of the context features: $F^{context} = f\left(w\, v(B; \theta); \theta\right)$.
  - Finally, add $F^{context}$ as a per-feature-channel bias back into the original input features $A$.
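To summarize, here is a simplified re-implementation sketch of the attention block in PyTorch. It assumes standard scaled dot-product attention and omits details such as feature normalization and the spatiotemporal encodings, so it should be read as an illustration rather than the authors' exact code.

```python
# Simplified sketch of the Context R-CNN attention block: k, q, v and f are
# fully-connected layers; the context bias is added back onto the input.
import torch
from torch import nn

class ContextAttention(nn.Module):
    def __init__(self, d=2048):
        super().__init__()
        self.k, self.q, self.v, self.f = (nn.Linear(d, d) for _ in range(4))
        self.d = d

    def forward(self, a, m):
        # a: input features [n, 7, 7, d]; m: memory bank (M_short or M_long) [m, d].
        a_pool = a.mean(dim=(1, 2))                                       # [n, d]
        w = torch.softmax(self.q(a_pool) @ self.k(m).T / self.d ** 0.5, dim=-1)
        f_context = self.f(w @ self.v(m))                                 # [n, d]
        # Add F_context as a per-feature-channel bias to the input features.
        return a + f_context[:, None, None, :]

attention = ContextAttention()
out = attention(torch.randn(4, 7, 7, 2048), torch.randn(30, 2048))
print(out.shape)  # torch.Size([4, 7, 7, 2048])
```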
### Global architecture
Now that we have explained every component of the architecture, we will go through it at a higher level and describe how information circulates through the model.
The process of classification and training can be seen in the image below.

First of all, since object detection models work on images, the first step is to split the video into individual frames. The frame being analysed is called the keyframe; it can be seen on the left side of the image above. As we can see, the keyframe is surrounded by two other frames, which will be very helpful later on for the context mechanism.
Then, the first stage of Faster R-CNN is applied to the keyframe and its surrounding frames. This stage extracts the main regions of interest and describes each of them with features: every region of interest gets a vector or matrix of features assigned to it. Note that both the keyframe and its surrounding frames have their main features extracted.
Only the extracted features of the keyframe are transmitted to the two attention modules as well as to the second stage of Faster R-CNN, while the extracted features of the surrounding frames are transmitted to the Short Term Memory (M_short).
Next, the first attention block is leveraged: it makes use of both the keyframe features and the short term context features. Its output is then transmitted to the second attention block, which combines these features with the long term memory.
Finally, the second stage of Faster R-CNN is called. It takes each of the feature vectors it is given and labels the corresponding box with a class, such as Car, Bus, Deer or Elephant.
Note: there are also other hidden stages of Faster R-CNN throughout the architecture, as the image above is very simplified: the non-maximum suppression stage, the ROIAlign stage, bounding-box refinement, and others…
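The toy sketch below (random tensors, a stripped-down attention function and made-up sizes, all our own) traces this flow end to end: pooled keyframe features are biased first by short term attention over the surrounding frames and then by long term attention over the memory bank, before being handed to the second stage.

```python
# Toy end-to-end flow: keyframe features -> short term attention ->
# long term attention -> (second-stage classifier, not shown).
import torch

def attend(queries, memory):
    """Stripped-down dot-product attention returning a context bias."""
    w = torch.softmax(queries @ memory.T / memory.shape[-1] ** 0.5, dim=-1)
    return w @ memory                             # [n, d]

n, d = 8, 2048                                    # proposals per frame, depth
keyframe = torch.randn(n, 7, 7, d)                # first-stage keyframe features
neighbours = [torch.randn(n, 7, 7, d) for _ in range(4)]       # surrounding frames

a_pool = keyframe.mean(dim=(1, 2))                # pooled keyframe features
m_short = torch.cat([f.mean(dim=(1, 2)) for f in neighbours])  # short term bank
m_long = torch.randn(8500, d)                     # stand-in long term bank

a = a_pool + attend(a_pool, m_short)              # short term attention bias
a = a + attend(a, m_long)                         # long term attention bias
print(a.shape)                                    # [n, 2048] -> second stage
```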
## Results
### Quantitative results
The quantitative results are very promising: Context R-CNN outperforms the single-frame model by a very significant margin on all three datasets, as we can see in the image below.

Another interesting finding is the increase in mAP (mean Average Precision) when increasing the time horizon of the long term memory. By increasing the time horizon, the authors increase the quantity of information placed in the memory. In the experiments on the SS dataset, the mAP score increases at every increase of the time horizon. This is a strong indication that the contextual features themselves help, since no parameters are added; only the quantity of memory is increased.

However, even when the time horizon was extended from 1 minute to 1 month, the mAP only improved by about 10%. The authors attribute this to the sampling strategy, in which a burst of highly relevant images is captured for each motion trigger. In other words, the SS dataset already performs well without long-term context.
Lastly, the comparison between variants of their model indicates that the long term memory is the component that helps the most. As we can see when comparing the results of the highlighted boxes, the long term attention mechanism (in red) is responsible for more than 95% of the best performance of the model.

### Qualitative results
The qualitative results of this paper are also very interesting. We can visualize the attention weights of the model (in green), as can be seen in the image below: the larger the image, the higher the attention weight. Moreover, it is fascinating to see that the model can identify that the same animal appeared at different moments, with a big leap in time between the appearances. This information can be seen on the timeline of the image below.

Lastly, it is very interesting to see the cases in which the contextual model performs better than the single-frame one. In each comparison, the single-frame Faster R-CNN is on the left and the Context R-CNN is on the right.
1. In this first example, we see that the car moving out of the frame was detected by the contextual model. This may be because the short term memory used the previous frames, where the car was fully visible.

2. In this example, the model was able to identify an animal that was very far from the camera and occluded. We can suppose that in another frame close to this one the animal was fully visible.

3. Finally, in this last example, the previous model identified the red square as being a car, while it was actually background. The contextual model does not make this mistake, probably because it observes that this element does not move over time and therefore cannot be a car.

[^CC]: https://www.citycam-cmu.com/dataset
[^SS]: https://github.com/tensorflow/models/tree/master/research/object_detection/test_images
[^CCT]: https://beerys.github.io/CaltechCameraTraps/
[^notebook]: https://colab.research.google.com/drive/1TZFYt3wjXG-4sv3pWEnymQlxvazCF7TR?usp=sharing
[^article]: https://arxiv.org/abs/1912.03538
[^attention]: https://arxiv.org/abs/1706.03762
[^GPT3]: https://arxiv.org/abs/2005.14165
[^google]: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
[^faster]: https://arxiv.org/pdf/1506.01497.pdf