# A Comparative Review of Few-Shot Object Detection

Claimed to be the only survey that systematically compares few-shot object detection methods.

## Taxonomy (based on dataset settings)

In terms of novel classes, the problems can be defined as:

- LS-FSOD: a small novel set and an optional dataset without target supervision for learning generic notions.
- SS-FSOD: additionally has target-domain data without annotations (extra unlabelled examples).
- WS-FSOD: a small novel set with image-level labels only (weakly labelled); sometimes an unlabelled novel set and a base set are included to compensate for the inaccurate supervisory signals.

The base set is usually for learning **task-agnostic notions**, whereas learning from the weakly labelled or unlabelled novel set provides **task-specific guidance**.

## Main Challenges

- Still adopts variants of the classic deep learning framework.
- Imbalance problems.
- **Large intra-class variations**: bad because we ultimately want to group examples of the same class together, and large intra-class variation can lead to misclassification.
- **Low inter-class distance**: fine-grained problems; the model needs to define clear boundaries between different classes.
- Limited supervisory signals can lead to **low-density sampling**, which yields a poor data distribution with high intra-class variation, low inter-class distance, and data shift.
- Degradation: large training error due to irrelevant features.
- Domain shift in the RPN caused by the shift between base and novel classes. Generic notions tend to transfer knowledge poorly from the source to the target domain, and the similarity between base and novel class information may also affect the quality of the RoIs.
- **Data bias** in small-scale datasets (e.g. in scale, context, and intra-class diversity) leads to overfitting, caused by learning not-so-robust features.
- Due to low-density sampling, different data splits may lead to inaccurate/unstable results (reported performance figures may be misleading).
- Insufficient instance samples may amplify noise and bias in the data. Naive data augmentation only forms a loose cluster for each class, even though intra-class diversity is increased.
- Incomplete annotations (the SS problem): annotations of base class data in which novel classes appear may be incomplete, so these novel instances are suppressed as background during learning.
- Inaccurate supervisory signals (the WS problem): image-level tags tend to be associated only with the most discriminative parts of an object.

## Into the Two-Stage Detector

- The two-stage detector is a coarse-to-fine approach: it first screens **class-agnostic candidates** (e.g. the "contains an object or not" notion), then performs RoI alignment/pooling followed by post-processing with NMS.
- The one-stage detector, in contrast, uses a **class-specific approach**, generating a set of bounding boxes and associated class probability distributions at each spatial location.

## LS-FSOD

- Two types, depending on whether a foreground-foreground imbalance problem exists: balanced LS-FSOD (more common) and imbalanced LS-FSOD.
- Approaches: meta-learning (e.g. **metric-learning based**), optimization-based, and model-based learning (e.g. transfer learning).

### Metric-based learning

1. Aspects for an overview of existing methods: data preprocessing, embedding network, RPN vs meta-RPN, support-only vs support-query guidance, the aggregator, the scoring function, the loss function, and the fusion node.

Characteristics:

- Aggregator:
  1. MUL: channel-wise multiplication, known for feature re-weighting to learn feature co-occurrence.
  2. SUB: subtraction, known for being a distance metric.
  3. CAT: concatenation, which stacks features along the channel axis for subsequent networks to explore a good way of fusing them.
- Scoring function (an essential element of metric-learning based methods):
  - Examples include cosine similarity, Pearson similarity, and the Relation Network. It mainly measures the similarity between an RoI (foreground feature) and each prototype of a given category.
  - Most binary classifiers are built upon a one-vs-all scheme.
  - Two ways of learning negative proposals: (1) use a fixed-length vector as a negative prototype; (2) label negative (background) boxes with pseudo-labels randomly sampled from a list of unseen categories.
- Loss functions:
  - In metric-learning based methods it is essential to incorporate an **auxiliary loss** so that the class prototypes not only have reduced intra-class variation within each class but also increased distances between one another. Examples include:
  - Meta R-CNN uses a **meta loss** to diversify the class prototypes (class-attentive vectors produced by the PRN branch), encouraging them to encode class-specific information.
  - The work on attention RPN and the multi-relation network employs a **two-way contrastive loss**. It first samples a triplet consisting of a query and two supports of different categories, then generates positive proposals using the query and the support of the matching category. Next, it constructs a balanced training set of three **proposal-support pairs**: (pc, sc), (pb, sc), (p*, sn) --> the positive support paired with a positive proposal and with a negative proposal, plus a pair with a negative support. Binary cross-entropy is used to learn a good representation.
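A minimal NumPy sketch of the metric-learning pipeline above, showing the three aggregators (MUL/SUB/CAT) and cosine-similarity scoring of an RoI feature against class prototypes. Feature dimensions and variable names are illustrative, not from any specific paper:

```python
import numpy as np

def aggregate(query_roi, prototype, mode="MUL"):
    """Fuse a query RoI feature with a class prototype (both shape (C,))."""
    if mode == "MUL":   # channel-wise re-weighting, learns feature co-occurrence
        return query_roi * prototype
    if mode == "SUB":   # difference acts as a distance-style feature
        return query_roi - prototype
    if mode == "CAT":   # stack along the channel axis for a later fusion network
        return np.concatenate([query_roi, prototype])
    raise ValueError(f"unknown aggregator: {mode}")

def cosine_score(query_roi, prototypes):
    """Score one RoI against each class prototype by cosine similarity."""
    q = query_roi / (np.linalg.norm(query_roi) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    return p @ q  # shape: (num_classes,), each entry in [-1, 1]

rng = np.random.default_rng(0)
roi = rng.standard_normal(256)         # one RoI feature from the query image
protos = rng.standard_normal((5, 256)) # 5-way prototypes (e.g. class-mean supports)
scores = cosine_score(roi, protos)
pred = int(np.argmax(scores))          # predicted class index for this RoI
```

In a real detector the prototypes would typically be class-mean support embeddings (the GVP mentioned later), and the scores would feed the RoI classification head.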
- Margin-based ranking loss: a multi-task loss with two parts: (1) a hinge-loss variant for foreground-background classification; (2) a max-margin contrastive loss enforcing that all RoIs satisfy max-margin category separation (inter-class distance) and semantic clustering (margin between prototypes). Some works:
  - Bansal et al. propose a margin loss encouraging the matching score between a proposal and its true category to be high.
  - Li et al. propose that novel class prototypes can be embedded into the margin between base class prototypes. **However, a large class margin between base class prototypes makes it hard for novel classes to find good prototypes**, so Li et al. propose a max-margin loss to adaptively adjust the class margin.
  - Zhang et al. improve the traditional contrastive loss in two respects: (1) adaptively adjusting the inter-class distance with **learnable margins**; (2) using **focal loss** to adaptively adjust the contribution of various kinds of samples to the gradient.
  - Li et al. also propose a transformation-invariance principle to learn an embedding network that produces consistent class prototypes for an image and its transformed version.
- Fusion nodes: nodes placed at different positions of a one/two-stage detector to aggregate query features (proposals) and support prototypes in different ways.
- Training/testing process: the model usually starts from a network pretrained on ImageNet to learn basic notions (e.g. low-level visual features); the pretrained network is mainly for feature extraction. The model is then trained on episodes to learn how to match instances in the query set. At meta-test time, D_novel can be pre-processed to obtain class-sensitive vectors, which then serve as task-specific parameters for fine-tuning the model.
- Two main evaluation settings:
  1. Randomly sample 500 episodes, each consisting of a support set and a query set.
     The support set is n-way k-shot, and the query set contains at least 10 images per category.
  2. Randomly sample classes given a split ratio of base and novel classes.
- Solutions for acquiring support patches to pair with a query image:
  1. Use a pre-trained Mask R-CNN to evaluate the difficulty of a ground-truth box.
  2. Remove images containing a box smaller than 32x32 during training.
  3. Randomly sample m support patches and choose only the patch most similar to the query image.
  4. Sample more images to build a larger support set for each task (e.g. k = 200).
- Other ways of acquiring support prototypes (besides the general visual prototype, GVP):
  1. LVP (learnable visual prototype): exploits a list of learnable kernels to automatically acquire a prototype for each class.
  2. SP (semantic prototype): semantic prototypes learned from a large-scale corpus provide task-specific parameters for the detector.

### Optimization-based

- Learns from base datasets to provide suitable gradient guidance or a uniformly good initial weight. However, weight initialization is difficult for FSOD, since the two subtasks are hard to balance. Meta-RetinaNet is an example.

### Model-based

- Designs a model or learning strategy that quickly adapts to a new episode.

Transfer-learning methods: pre-training and fine-tuning.

- Base class training
- Fine-tuning
- Regularization
- View the classifier as an encoder
- Background depression: estimate a background mask and use it to mask the background features, letting the network focus on the foreground objects. Another method is to use a pretrained saliency map to reweight the features.
- FPN:

## Semi-Supervised Learning

SS-FSOD can further reduce the annotation burden encountered in traditional SSOD (semi-supervised object detection).
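The background depression idea above can be sketched in a few lines of NumPy. This assumes a soft foreground mask is already available (in practice estimated from ground-truth boxes or a pretrained saliency model); the function name, `floor` parameter, and toy shapes are illustrative:

```python
import numpy as np

def background_depression(feat_map, fg_mask, floor=0.1):
    """Down-weight background locations of a CNN feature map.

    feat_map: (C, H, W) features; fg_mask: (H, W) soft foreground mask in [0, 1].
    `floor` keeps a small amount of background signal instead of zeroing it out.
    """
    # background locations -> weight `floor`, foreground locations -> weight 1.0
    weights = floor + (1.0 - floor) * fg_mask
    return feat_map * weights[None, :, :]  # broadcast over the channel axis

feat = np.ones((8, 4, 4))                 # toy feature map, 8 channels
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                      # toy foreground region in the center
out = background_depression(feat, mask)
```

Saliency-based reweighting is the same operation with `fg_mask` replaced by a normalized saliency map.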
There are two main ways of tackling the problem: **self-training** and **self-supervised** methods.

- Self-supervised learning:
  - A qualified detector should meet several requirements, e.g. robust feature encoding.
  - Pretext tasks (VAE/GAN-based reconstruction).
  - Consistency regularization.
- Self-training

## Weakly Supervised Learning