# Mask R-CNN

### Introduction

Mask R-CNN is Kaiming He's 2017 work. It performs instance segmentation alongside object detection and achieves excellent results, outperforming the COCO 2016 challenge winners without bells and whistles. Its network design is also relatively simple: on top of Faster R-CNN, a third branch for segmentation is added to the original two branches (classification + bounding-box regression), as shown in the following figure.

![](https://i.imgur.com/lKcRBsb.png)

So why does this network work so well, and what are its details? They are introduced one by one below. Before introducing Mask R-CNN, first understand what segmentation is, since that is what Mask R-CNN does. The figure below introduces several different kinds of segmentation; Mask R-CNN performs instance segmentation.

- **Semantic segmentation**: classify an image pixel by pixel.
- **Instance segmentation**: detect the objects in an image and segment each detected object.
- **Panoptic segmentation**: describe all objects in the image.

The following picture shows the difference between these kinds of segmentation. As can be seen, panoptic segmentation is the most difficult:

![](https://i.imgur.com/1H3re6I.png)

Key points:

- Instance segmentation must not only find the objects in the image correctly, but also segment them accurately. It can therefore be seen as a combination of object detection and semantic segmentation.
- Mask R-CNN is an extension of Faster R-CNN: for each proposal box of Faster R-CNN, an FCN is used for segmentation, and the segmentation task runs alongside the localization and classification tasks.
- RoIAlign is introduced in place of Faster R-CNN's RoIPooling.
RoIPooling does not align input and output pixel-to-pixel; this may not matter much for the bounding box, but it has a large impact on mask accuracy. Using RoIAlign improves mask accuracy by a relative 10% to 50%, as explained in a later section.

- A dedicated mask branch decouples mask and class prediction: the mask branch only performs segmentation, while class prediction is handled by a separate branch. This differs from the original FCN, which predicts the class of each pixel at the same time as it predicts the mask.
- Without bells and whistles, Mask R-CNN surpassed all state-of-the-art models of the time.
- Training took about two days on an 8-GPU server.

#### Mask R-CNN algorithm steps

- First, input the image to be processed and apply the corresponding pre-processing operations (or use a pre-processed image).
- Then, feed it into a pre-trained backbone network (ResNeXt, etc.) to obtain the corresponding feature map.
- Next, set a predetermined number of RoIs (anchors) for each point of this feature map, giving multiple candidate RoIs.
- Then, send these candidate RoIs to the RPN for binary classification (foreground vs. background) and bounding-box regression, filtering out most candidates.
- Next, apply RoIAlign to the remaining RoIs (i.e., first align the original-image pixels with the feature map, then align the feature map with the fixed-size output feature).
- Finally, classify the RoIs (N-way classification), regress the bounding boxes, and generate masks (an FCN is run inside each RoI).

#### Mask R-CNN architecture decomposition

Here, Mask R-CNN is decomposed into the following three modules:

1. Faster R-CNN
2. RoIAlign
3. FCN
These three modules are the core of the algorithm.

### FCN

The FCN is a classic semantic segmentation algorithm that can accurately segment the objects in a picture. It is an end-to-end network whose main operations are convolution and deconvolution: the image is first convolved and pooled to shrink the feature map, then deconvolution (an interpolation operation) repeatedly enlarges the feature map, and finally each pixel is classified. In this way, an accurate segmentation of the input image is achieved. The overall architecture is shown in the figure below.

![](https://i.imgur.com/jUlzglk.png)

##### Analysis and comparison of RoIPooling and RoIAlign

![](https://i.imgur.com/Pr2FdXC.jpg)

**The biggest difference between RoIPooling and RoIAlign is that the former uses two quantization operations, while the latter uses no quantization at all and relies on bilinear interpolation instead.**

![](https://i.imgur.com/0WNo9kL.png)

#### How does Mask R-CNN achieve good results?

The difficulty of instance segmentation is that the target must be detected and segmented at the same time, so object detection (framing the target's position) and semantic segmentation (classifying pixels to segment the target) must be combined. Before Mask R-CNN, Faster R-CNN performed best in object detection, while the FCN performed best in semantic segmentation, so the natural approach is to combine Faster R-CNN with the FCN.

#### So how does Mask R-CNN do it?

Mask R-CNN is based on Faster R-CNN, so let us first review it. Faster R-CNN is a typical two-stage object detector: first the RPN generates candidate regions, then the candidate regions pass through RoIPooling for detection (classification plus coordinate regression), with classification and regression sharing the preceding network.
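The dense proposals the RPN reviewed above starts from can be made concrete with a small sketch. This is a toy NumPy routine, not the paper's implementation: the stride is an illustrative default, and the 5 scales and 3 aspect ratios mirror the configuration mentioned later in the training section.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(32, 64, 128, 256, 512),
                     ratios=(0.5, 1.0, 2.0)):
    """Place len(scales) * len(ratios) anchor boxes (x1, y1, x2, y2)
    at every feature-map cell, mapped to image coordinates via the stride."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell centre in the image
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # area s^2, width/height ratio r
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# 15 anchors (5 scales x 3 aspect ratios) at every position of a tiny 2x3 feature map
boxes = generate_anchors(2, 3)
print(boxes.shape)  # (90, 4)
```

The RPN then scores each of these boxes as foreground or background and regresses offsets to them, which is how the "predetermined number of RoIs per feature-map point" in the steps above is realized.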
#### What improvements does Mask R-CNN make?

Mask R-CNN is also two-stage, and its RPN part is the same as Faster R-CNN's. On top of Faster R-CNN, it adds a third branch that outputs a mask for each RoI. (This is the biggest difference from traditional methods, which generally generate a mask with some algorithm first and then classify it; here the two run in parallel.) Naturally, this becomes a multi-task problem.

#### Mask R-CNN Network

Mask R-CNN's basic structure uses the same two-stage procedure as Faster R-CNN: first the RPN proposes regions, then each RoI found by the RPN is classified, localized, and given a binary mask. This differs from networks that first find the mask and then classify it.

Mask R-CNN's loss function:

![](https://i.imgur.com/q9Olqj3.jpg)

Mask representation: because there is no fully connected layer and RoIAlign is used, a one-to-one correspondence between output and input pixels can be maintained.

#### RoIAlign

The purpose of RoIPooling is to derive a small feature map (e.g. 7×7) from each RoI proposed by the RPN. RoI sizes vary, but after RoIPooling they all become 7×7: the RPN proposes RoI coordinates as [x, y, w, h], RoIPooling takes them as input, and a 7×7 feature map comes out for classification and localization. The problem is that if the RPN outputs an 8×8 RoI, there is no guarantee that input and output pixels correspond one-to-one. First, the output cells contain different amounts of information (some cover 1 input cell, some cover 2); second, their coordinates cannot be mapped back to the input (to which input pixel should an output pixel that covers 2 inputs correspond?). This has little effect on classification, but a large effect on segmentation.
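The misalignment described above can be shown numerically. The sketch below (illustrative stride and values, not the paper's code) contrasts RoIPooling's floor quantization of an RoI edge with RoIAlign's exact coordinate sampled by bilinear interpolation:

```python
import numpy as np

stride = 16                                  # feature-map stride relative to the image
roi_x1 = 17.0                                # RoI edge in image coords; on the feature
                                             # map it is 17 / 16 = 1.0625

# RoI Pooling: quantize (floor) the coordinate, discarding the fractional part.
pooled_x1 = np.floor(roi_x1 / stride)        # 1.0 -> a 1-pixel shift here, and up to
                                             # stride - 1 image pixels in general

# RoI Align: keep the exact coordinate and read it off by bilinear interpolation.
feat_row = np.array([10.0, 20.0, 30.0])      # one row of feature values
x = roi_x1 / stride                          # 1.0625, between cells 1 and 2
x0 = int(np.floor(x))
frac = x - x0
value = feat_row[x0] * (1 - frac) + feat_row[x0 + 1] * frac
print(pooled_x1, value)                      # 1.0 20.625
```

Because no coordinate is ever rounded, every output value of RoIAlign traces back to exact input locations, which is what the mask branch needs.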
RoIAlign obtains the output coordinates with an interpolation algorithm instead of quantizing them, and the value in each grid cell is no longer a simple max over quantized cells but is likewise computed by interpolation.

![](https://i.imgur.com/3yaTulq.png)

**Comparison of RoIPool and RoIAlign performance**

![](https://i.imgur.com/fWtlLvl.jpg)

From the previous analysis we can draw the qualitative conclusion that RoIAlign greatly improves detection performance; the table above gives the quantitative picture: RoIAlign raises mask AP by 10.5 points and box AP by 9.5 points.

**Comparison of multinomial and binary loss**

![](https://i.imgur.com/DXN7D7z.jpg)

As the table above shows, Mask R-CNN uses two branches to decouple classification from mask generation, and uses a per-class binary loss instead of a multinomial loss, which eliminates competition between the masks of different classes. The class label predicted by the classification branch selects which mask to output, so the mask branch does not need to classify again, and performance improves.

**Performance comparison of MLP and FCN mask heads**

![](https://i.imgur.com/9gIhB4Y.jpg)

In the table above, the MLP head uses fully connected layers to generate the mask, while the FCN head uses convolutions. The latter has far fewer parameters, which saves a lot of memory and speeds up training (fewer parameters to compute and update). In addition, because the MLP's features are relatively abstract, some useful spatial information is lost in the final mask; the figure on the right shows the difference intuitively. Quantitatively, the FCN head raises mask AP by 2.1 points.
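The decoupled binary mask loss described above can be sketched in NumPy. This is a toy illustration under assumed shapes, not the paper's code: `mask_loss` is a hypothetical helper that applies per-pixel sigmoid binary cross-entropy to the mask of one class only, so the other classes' mask logits never compete.

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, cls):
    """Per-pixel sigmoid binary cross-entropy on the mask of the ground-truth
    class `cls` only; the other K-1 class masks contribute nothing."""
    logits = mask_logits[cls]                      # (m, m) slice for one class
    p = 1.0 / (1.0 + np.exp(-logits))              # per-pixel sigmoid
    eps = 1e-12                                    # numerical safety for log()
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()

K, m = 3, 4                                        # toy sizes: 3 classes, 4x4 masks
rng = np.random.default_rng(0)
logits = rng.normal(size=(K, m, m))                # one m x m logit map per class
gt = (rng.random((m, m)) > 0.5).astype(float)      # binary ground-truth mask
print(mask_loss(logits, gt, cls=1))
```

A softmax (multinomial) loss would instead normalize across the K maps at every pixel, forcing the classes to compete; with the sigmoid form above, changing another class's logits leaves the loss untouched.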
### Network Architecture

For clarity, there are two axes of variation:

- Different backbones: ResNet-50, ResNet-101, ResNeXt-50, ResNeXt-101.
- Different head architectures: when Faster R-CNN uses ResNet-50, the features fed to the RPN come from conv4; this is called ResNet-50-C4.

Besides these structures, the authors also use a more effective backbone: the FPN.

![](https://i.imgur.com/UniTrFP.jpg)

![](https://i.imgur.com/Crp0wBY.jpg)

In the figure above, the red bounding box marks the detected target. We can see with the naked eye that this detection is not very good: the whole box sits slightly to the right, and some pixels on the left are not included. The final result shown on the right is much better.

#### Equivariance in Mask R-CNN

Equivariance means that the output changes correspondingly as the input changes.

![](https://i.imgur.com/TrGLGzn.jpg)

Equivariance 1: the fully convolutional features (of the Faster R-CNN network) are equivariant to transformations of the image; as the image is transformed, the features change correspondingly.

![](https://i.imgur.com/8QeOV6l.jpg)

Equivariance 2: the fully convolutional operation on the RoI (the FCN head) is equivariant to transformations inside the RoI.

![](https://i.imgur.com/WjrL72G.jpg)

Equivariance 3: the RoIAlign operation preserves equivariance before and after the RoI transformation.

![](https://i.imgur.com/Ao0NDRc.jpg)

Full convolution in the RoI

![](https://i.imgur.com/x4ZWgVp.jpg)

Dimension alignment of RoIAlign

![](https://i.imgur.com/wWR7WiU.jpg)

### Network Training

Training is basically the same as for Faster R-CNN: an RoI with IoU > 0.5 is a positive sample, and Lmask is computed only for positive samples. Images are resized to 800 pixels on the short side, the positive-to-negative sample ratio is 1:3, and the RPN uses 5 scales and 3 aspect ratios.
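The IoU > 0.5 labeling rule in the training setup above can be sketched as follows (toy boxes; in practice this runs over thousands of proposals and the positives:negatives are then subsampled at 1:3):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

gt = [0, 0, 10, 10]                                    # one ground-truth box
rois = [[0, 0, 10, 10], [5, 0, 15, 10], [20, 20, 30, 30]]
labels = [1 if iou(r, gt) > 0.5 else 0 for r in rois]
print(labels)  # [1, 0, 0] -- only IoU > 0.5 counts as positive,
               # and only positives contribute to Lmask
```

The half-overlapping second box has IoU 1/3, so it is labeled negative and its (undefined) mask target is simply never supervised.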
#### Inference Details

With a ResNet backbone, Mask R-CNN generates 300 candidate regions for classification and regression; with the FPN, it generates 1000. After non-maximum suppression, **the mask branch is run only on the top-100 scoring boxes**. Unlike during training, the mask branch is not run in parallel here; the authors explain that this improves both accuracy and efficiency. The mask branch predicts masks for k categories, but according to the classification result, only the mask of the predicted class is selected; it is then resized to the RoI size and binarized with a threshold of 0.5. (**The resize requires interpolation, which is why a second binarization is needed. The size m can be seen in the figure above: the predicted mask is not RoI-sized but a relatively small image, so the resize operation is required.**)

### Experimental results

First, the instance segmentation results of Mask R-CNN on the COCO dataset:

![](https://i.imgur.com/eMJNcvB.png)

Comparison of Mask R-CNN with other instance segmentation algorithms (MNC and FCIS are the winners of the COCO 2015 and 2016 segmentation challenges, respectively):

![](https://i.imgur.com/jvI83jA.png)

Table 2 compares some design details:

##### (a) Mask R-CNN with different feature-extraction networks. ResNet-50-C4 means the extracted features are the output of ResNet's stage 4, i.e. the input of RoIPool or RoIAlign is the output of stage 4. Deeper or better feature-extraction networks bring further improvements.

##### (b) Comparison between sigmoid and softmax mask losses.

##### (c) Comparison of RoIPool, RoIWarp, and RoIAlign on ResNet-50-C4, showing the effectiveness of RoIAlign; the type of pooling has little effect once RoIAlign is used.
##### (d) Comparison of RoIPool and RoIAlign on ResNet-50-C5. RoIPool performs worse here than when features are extracted from C4; after all, the higher the level at which features are quantized, the larger the error. Moreover, RoIAlign on C5 features works better than on C4 features, indicating that the error introduced by RoIAlign is very small. This experiment matters because it largely solves the long-standing problem of poor detection and segmentation with large receptive fields (large strides).

##### (e) Comparison of FCN and MLP mask branches.

![](https://i.imgur.com/3QIzLhf.png)

Besides the instance segmentation results, the paper also reports object detection results, shown in Table 3. Simply replacing RoIPool with RoIAlign in Faster R-CNN already brings a significant improvement, and Mask R-CNN further benefits detection because the extra mask supervision is available during training.

![](https://i.imgur.com/t3cJrsO.png)

### Summary

1. Mask R-CNN is a very flexible framework: different branches can be added to complete different tasks, covering object classification, object detection, semantic segmentation, instance segmentation, human pose estimation, and more.
2. It is indeed a good algorithm!

Goals of Mask R-CNN:

- High speed
- High accuracy (high classification, detection, and instance segmentation accuracy)
- Simple and intuitive
- Easy to use