# Hallucination improves object detection

## Introduction

### Motivation

- In the extremely low-shot setting, the lack of data variation is a problem, especially for novel classes.
- The RPN is a good starting point, since it finds the most promising regions (the highest-IoU boxes), but in a low-shot setting there is simply not enough data to provide that variation.
- **This work trains a network to transfer shared within-class variation from base classes to novel classes.**
  - One reason is that shared variation is hard to encode in the RPN.
- The paper proposes a hallucinator at the RoI head (**after the RPN**) that generates examples in the **RoI feature space** (i.e., the feature space of the RoI regions/boxes/proposals produced by the RPN).
- This can be seen as a form of data augmentation for building a better classifier.

### Contributions

Contributions are three-fold:
1. The authors explore the problem that arises from the lack of within-class variation in training data in the few-shot learning setting.
2. The authors propose a hallucinator for novel-class data that transfers shared modes of within-class variation from base classes to novel classes.
3. The authors claim that their proposed model outperforms TFA (from "Frustratingly Simple Few-Shot Object Detection") in the low-shot setting.

The authors also claim that their work is the first to show the effectiveness of hallucination for few-shot object detection.
## Related Work

### Object detection

- The authors define two main groups: (1) serial and (2) parallel object detection networks.
- Serial detectors first generate promising RoIs and then feed each proposal box to a classifier that predicts whether the region contains an object.

### Few-shot object detection

There are several lines of work under this paradigm:
- learning better feature representations through **metric learning**
- modified fine-tuning techniques
- **meta-learning techniques**
- techniques to improve the region proposal generation process via **attention mechanisms and class-aware features**
- additional information such as **semantic relations** and **multi-scale representations**

### Data Hallucination

- Most work focuses on classification tasks and the learned feature space:
  - learning from base classes the shared feature transformations used to generate novel-class features;
  - pairwise deformations between examples of the same class;
  - a combined meta-learner and hallucinator.

## Approach

- The authors build their model on two existing state-of-the-art baselines: (1) TFA and (2) CoRPN.

### TFA

- TFA is a two-stage fine-tuning few-shot detector: train on base classes, then fine-tune on novel classes (in the figure, blue areas are trained and grey areas are kept fixed).

![](https://i.imgur.com/oWkX1Pl.png)

- It is built on top of the Faster R-CNN baseline, but uses a cosine-similarity classifier to reduce intra-class variance in the few-shot setting.
- It has a ResNet-101 backbone pretrained on ImageNet, with a feature pyramid network.
- The training procedure: in the first stage, the model is trained on base-class instances; in the second stage, it is fine-tuned on novel-class instances, where only the box regressor and classifier are trained while the rest of the network is kept frozen.
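The cosine-similarity classifier that TFA swaps in for the usual linear box classifier can be sketched as below. This is a minimal illustration, not the authors' code; the class name and the scaling factor `scale` are my assumptions (TFA uses a learnable or fixed scaling constant, whose value is not taken from the paper).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Cosine-similarity box classifier in the style of TFA.

    Scores are scaled cosine similarities between the RoI feature and
    each per-class weight vector, which bounds the logit magnitude and
    reduces intra-class variance compared to an unnormalized dot product.
    """
    def __init__(self, feat_dim: int, num_classes: int, scale: float = 20.0):
        super().__init__()
        # One weight vector per class (hypothetical init; TFA reuses
        # base-class weights at fine-tuning time).
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize both RoI features and class weights, then take
        # the dot product: each logit is scale * cos(x, w_k).
        x = F.normalize(x, dim=-1)
        w = F.normalize(self.weight, dim=-1)
        return self.scale * x @ w.t()
```

Because both vectors are unit-normalized, every logit lies in `[-scale, scale]` regardless of feature magnitude, which is what makes this head better suited to the few-shot regime.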
### CoRPN

![](https://i.imgur.com/rjulK8E.png)

- CoRPN has exactly the same architecture and training procedure as TFA, except for the **proposal generation procedure**.
- CoRPN has multiple RPNs, where each RPN predicts RoIs (high-IoU boxes), so that if one RPN misses a box, the others can still capture it.
- The loss of CoRPN is as follows:

![](https://i.imgur.com/9zvpjWu.png)

where L_div is the divergence loss and L_coop is the cooperative loss. L_div encourages the RPNs to be different from each other, while L_coop encourages the RPNs to cooperate by **setting a lower bound on each RPN's response (prediction) to the boxes (anchors)**.

![](https://i.imgur.com/j9CrU3v.png) ![](https://i.imgur.com/HIeMmnn.png)

### The hallucinator model

- As can be seen, the hallucinator is placed after the "box head" (the RoI feature extractor), but contributes no outputs to the box regressor.
- It can therefore be seen as a way of augmenting the variation of RoI features.
- The hallucinated examples are appended to the original RoI training examples to train the box classifier.
- **Only the original examples are used to train the box regressor, not the hallucinated ones.**
- The hallucinator is a two-layer MLP with ReLU. Its input size is three times the feature size, and the output size of each linear layer is the same as the feature size.

![](https://i.imgur.com/vkIGPCf.png)

- The hallucinator, parameterized by phi, takes as input the class prototype, a seed example, and a noise vector.
- The prototypes are used to capture global category information. This can also be seen as **a form of regularization**, preventing the hallucinator from simply copying the seed examples.
- During training, the prototypes computed at the base-class training stage and the novel-class fine-tuning stage differ. All base-class examples are used to compute the base-class prototypes **before training the hallucinator**, and these base-class prototypes are not updated during training.
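The hallucinator architecture described above (two-layer MLP with ReLU, input of size three times the feature size, each linear layer outputting the feature size) can be sketched as follows. The class and argument names are mine, not the authors'; only the shapes follow the description.

```python
import torch
import torch.nn as nn

class Hallucinator(nn.Module):
    """Sketch of the hallucinator head: a two-layer MLP with ReLU.

    The class prototype, a seed RoI feature, and a noise vector are
    concatenated, giving an input of size 3 * feat_dim; each linear
    layer outputs feat_dim, matching the paper's description.
    """
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, prototype: torch.Tensor,
                seed: torch.Tensor,
                noise: torch.Tensor) -> torch.Tensor:
        # Concatenate prototype, seed example, and noise along the
        # feature dimension, then map back to a single RoI feature.
        return self.net(torch.cat([prototype, seed, noise], dim=-1))
```

The output lives in the same RoI feature space as the seed, so it can be appended directly to the real examples when training the box classifier.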
- However, the novel-class prototypes are updated dynamically, using both the training and hallucinated examples, while training the classifier (the prototypes are updated whenever a new hallucinated example is generated).

![](https://i.imgur.com/jnZppBr.png)

- The loss is computed using the pretrained classifier weights (w_yi and w_k) and the seed example x_i of category c_k.

![](https://i.imgur.com/jq2A6a9.png)

#### Training style

- Iterative training, as opposed to end-to-end joint training: **EM-style training**.
- Proposals are randomly sampled as seed examples to feed the hallucinator.

#### Actual training procedure

![](https://i.imgur.com/49iCJks.png)

Training on base classes:
- First train a plain detector (without the hallucinator) on the base classes, then train the hallucinator in this pretrained RoI feature space (guided by the pretrained classifier).

Fine-tuning on novel classes:
- Initially, a batch consists of an imbalanced set of positive and negative examples, with negative examples in the majority.
- First generate hallucinated examples of the novel classes using the trained hallucinator, then randomly replace background examples with hallucinated ones to obtain a **refined training batch**.
- Then train the classifier again on the refined batch containing the hallucinated examples.
- Next, fine-tune the hallucinator using the updated classifier, then use the updated hallucinator to fine-tune the classifier, and so on.

## Evaluation

- Both the baselines and the proposed models are (Cb + Cn)-way few-shot detectors.
- The **standard evaluation procedure** follows the "Frustratingly Simple Few-Shot Object Detection" paper, **including ground-truth boxes among the training examples in the RoI head**.
- The fine-tuning stages on PASCAL VOC and COCO are different.
For PASCAL VOC, the classifier is trained on a balanced dataset containing both base and novel classes, whereas for COCO, a Cn-way classifier is first trained on the novel classes and is then trained (Cb + Cn)-way on a balanced few-shot dataset.
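The "refined training batch" step from the fine-tuning procedure above can be sketched as follows. This is a minimal sketch under my own assumptions: the function name, the convention that label `0` marks background, and the uniform random replacement policy are all hypothetical, not taken from the paper.

```python
import torch

def refine_batch(features: torch.Tensor, labels: torch.Tensor,
                 hallucinated: torch.Tensor, hal_labels: torch.Tensor,
                 background: int = 0):
    """Randomly replace background RoI features in a batch with
    hallucinated novel-class features, producing a 'refined' batch.

    Assumes `background` is the label of negative (background) RoIs.
    """
    features = features.clone()
    labels = labels.clone()
    # Indices of background examples, which dominate the raw batch.
    bg_idx = (labels == background).nonzero(as_tuple=True)[0]
    n = min(len(bg_idx), len(hallucinated))
    # Pick n background slots uniformly at random and overwrite them
    # with hallucinated features and their novel-class labels.
    pick = bg_idx[torch.randperm(len(bg_idx))[:n]]
    features[pick] = hallucinated[:n]
    labels[pick] = hal_labels[:n]
    return features, labels
```

Only background slots are overwritten, so the real positive examples survive into the refined batch; the classifier is then retrained on this mixture while the box regressor still sees only original examples.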