# Meta Faster-RCNN

## Abstract Summary

- meta-learning based few-shot object detection to **transfer knowledge from base to novel classes**
- coarse-to-fine **proposal-based** object detection framework
- incorporates **prototype-based classifiers** into both the proposal generation and classification stages
- proposal generation: incorporates a **lightweight matching network** (between the query image feature map and spatially pooled class features)
- **attentive feature alignment method** to reduce spatial misalignment between proposals and few-shot class examples
- experiments conducted on **FSOD benchmarks**

## Introduction

The authors consider FSOD in a two-stage setup, which involves both **proposal generation** and **bbox regression and classification**. Their main goal is to introduce a **prototype-based classifier** for both the RPN and the detector head.

### The two main issues

- proposal generation performs poorly on novel classes: it misses high-IoU boxes, so what should be positive examples for novel classes are treated as negatives by an RPN trained over base classes
- RoI alignment (alignment between region proposals and true objects) is very noisy, especially on top of noisy proposal generation for novel classes

### Contributions

- propose (1) a coarse-grained prototype matching network to generate proposals for few-shot novel classes in a fast and effective manner, and (2) a fine-grained prototype matching network with attentive feature alignment between proposals and novel class examples
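The prototype-plus-matching idea above can be sketched minimally in NumPy. This is an illustrative stand-in, not the paper's architecture: the function names are mine, and cosine similarity replaces the paper's learned matching network.

```python
import numpy as np

def class_prototype(support_feats):
    """Average K-shot support features into one class prototype.

    support_feats: (K, C, H, W) features of K cropped class examples
    (shapes are illustrative, not the paper's exact tensors).
    """
    return support_feats.mean(axis=0)  # (C, H, W)

def coarse_match_scores(query_map, prototype):
    """Coarse matching sketch: spatially pool the prototype to a (C,)
    vector, then score every query location against it. Cosine
    similarity stands in for the paper's learned matching network."""
    p = prototype.mean(axis=(1, 2))                 # (C,) pooled prototype
    C, H, W = query_map.shape
    q = query_map.reshape(C, -1)                    # (C, H*W)
    sim = (p @ q) / (np.linalg.norm(p) * np.linalg.norm(q, axis=0) + 1e-8)
    return sim.reshape(H, W)                        # per-location score map
```

Locations with high scores would then be kept as class-specific proposal candidates instead of relying on a class-agnostic objectness score.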
### Faster-RCNN vs Meta Faster-RCNN

- RPN: the original uses dense sliding windows to perform binary classification and bbox regression; this work uses a lightweight convolutional matching network to measure the similarity between each sliding window and the spatially pooled prototype of each class (linear object/non-object classifier vs metric learning) --> **coarse-grained prototype matching network**
- Classifier/detector: attentive feature alignment between generated proposals and class prototypes using high-resolution feature maps (how? first estimate the soft correspondences and then learn to align, in order to discover foreground regions --> spatial feature alignment + foreground attention module) --> **fine-grained prototype matching network**
- Overview: Meta-RPN and Meta-Classifier, trained incrementally without the need to train at meta-test time.
- Learn a Faster-RCNN detector over base classes using the shared backbone from their few-shot detector ![](https://i.imgur.com/P6OSYLN.png)
- the two classifiers have complementary strengths and weaknesses: the softmax-based classifier is better at predicting base classes, whereas the metric-learning based classifier is better at adapting to novel classes, **thus the authors propose to jointly learn the two classifiers.** ![](https://i.imgur.com/2UGWQwm.png)
- the object detection model therefore consists of two separate networks: the original Faster-RCNN and the proposed few-shot object detector with a two-stage coarse-to-fine prototype matching network

## Approach

### Task Definition, Model Architecture

- the key idea of FSOD is to learn how to match the query image with few-shot class examples from the base classes training set, **so it can generalize to few-shot novel classes.**
- the proposed detection model can be divided into four modules:
1. **Feature extraction**
   - a Siamese network with a ResNet backbone is used to extract features from both **query images** and **K-shot class examples**
   - given a query image, CNN features are extracted from the shared backbone (typically the output after the 4th ResNet block)
   - K-shot class examples are sampled for each novel class by cropping the images, including the context of the surrounding regions; these crops are fed into the same CNN backbone to obtain their features
   - the **per-class K-shot features are then averaged to obtain the class prototype**
2. **Object detection for base classes**
   - after feature extraction, the RPN generates **category-agnostic proposals for all base classes in the image**
   - the R-CNN classifier is then employed to produce class probabilities and bounding box coordinates **over all base classes**
3. **Proposal generation for novel classes**
   - below is the Meta-RPN used for generating proposals for novel classes ![](https://i.imgur.com/3FaAiE2.png)
   - first, it performs spatial average pooling to get the averaged prototype **for each novel class**, yielding a global representation of that class
   - a small subnet is then used, just like in the original RPN, to extract features with a 3x3 convolution and ReLU
   - instead of an object/non-object linear classifier, the authors use a non-linear CNN-based feature fusion network for binary classification and bbox regression
4. **Proposal classification and refinement for novel classes**
   - first, a Siamese network is again used to generate features for both the class prototypes and the generated proposals (with high-resolution feature maps)
   - RoI alignment is performed between the class prototypes and the generated proposals
   - since directly performing the alignment yields poor spatial alignment, an attention module is used to improve it
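The Meta-RPN head in module 3 can be sketched as follows: fuse each sliding-window feature with the pooled class prototype and feed the result through a small non-linear network for objectness and box deltas. Everything here is a toy stand-in under stated assumptions: the layer sizes, fusion-by-concatenation, and class name `TinyMetaRPNHead` are illustrative, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class TinyMetaRPNHead:
    """Toy sketch of the Meta-RPN idea: instead of a linear
    object/non-object classifier, fuse a window feature with the
    spatially pooled class prototype, then apply a small non-linear
    head for binary objectness plus bbox-delta regression."""

    def __init__(self, c):
        self.w1 = rng.standard_normal((2 * c, c)) * 0.1   # fusion layer
        self.w_cls = rng.standard_normal((c, 1)) * 0.1    # objectness head
        self.w_reg = rng.standard_normal((c, 4)) * 0.1    # box-delta head

    def forward(self, window_feat, prototype):
        p = prototype.mean(axis=(1, 2))                   # (C,) pooled prototype
        fused = relu(np.concatenate([window_feat, p]) @ self.w1)
        objectness = 1.0 / (1.0 + np.exp(-(fused @ self.w_cls)))  # sigmoid
        deltas = fused @ self.w_reg
        return objectness.item(), deltas
```

Because the prototype enters the head directly, the same weights can score windows against any novel class at meta-test time without retraining.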
![](https://i.imgur.com/X2UQRK1.png)

### Training Framework

Training can be split into three parts:

- Meta-training with base classes --> for each episode, the meta-training dataset is sampled for K-shot instances from the base classes (in order to simulate FSOD for novel classes)
- Learning the detection head for base classes --> fix the backbone parameters and train the RPN and R-CNN modules over the base classes
- Fine-tuning with joint base and novel classes --> sample a small balanced dataset of joint base and novel classes
- The key difference between meta-learning and fine-tuning: during meta-testing, only the support set of novel classes is used to compute the prototypes, without any training, while during fine-tuning, original novel class images are used as query images to fine-tune both the Meta-RPN and the Meta-Classifier.

## Related Work

- this work is most similar to "Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector" by Qi Fan et al.

## Experimental Setup