###### tags: `Paper Notes`

# YOLOv4

* Paper: YOLOv4: Optimal Speed and Accuracy of Object Detection
* Institution: Institute of Information Science, Academia Sinica, Taiwan
* Year: 2020

### Background

* Goal of YOLOv4: build a real-time, high-performance object detector that can be trained on a single GPU.
* The authors evaluate many object detection techniques and introduce many of the underlying concepts along the way, so the paper can be read as half a textbook.
* Note that many of the techniques mentioned in the paper (including ones newly proposed by the authors), such as CmBN and modified SAT, are not actually used in YOLOv4.

### Object Detection Models

* As shown in Figure 2, a modern object detector can be divided into roughly 4 parts:
    * **Input**: the image itself.
    * **Backbone**: extracts the initial features from the image. Common models include VGG16, ResNet, EfficientNet, and CSPDarknet53. The backbone is usually pretrained on ImageNet.
    * **Neck**: aggregates the feature maps from different layers of the backbone. Common models include SPP, FPN, and PAN.
    * **Head**: takes the features aggregated by the neck and predicts the bounding boxes (bbox).
        * Common heads fall into 2 types: one-stage (dense) and two-stage (sparse). A one-stage detector has to predict whether a bbox exists at every grid cell, hence "dense". A two-stage detector relies on ROI pooling and only has to make predictions for the ROIs, hence "sparse".
        * Dense Prediction (one-stage):
            * RPN, SSD, YOLO, RetinaNet (anchor based)
            * CornerNet, CenterNet, MatrixNet, FCOS (anchor free)
        * Sparse Prediction (two-stage):
            * Faster R-CNN, R-FCN, Mask R-CNN (anchor based)
            * RepPoints (anchor free)

<center><img src ="https://i.imgur.com/JVUMw5g.png"></center>
<center>Figure 2: Object detector.</center>

### Model Architecture

* The main architecture of YOLOv4 is:
    * Backbone: CSPDarknet53 [81]
    * Neck: SPP [25] + PAN [49]
    * Head: YOLOv3 [63]
* Darknet53:
    * As shown in Figure A, Darknet53 has 53 conv. layers in total. The last one is the layer labeled Connected (FC); it is implemented as a 1x1 conv. layer and therefore counts toward the 53. The remaining 52 conv. layers form the main body of the network.
    * Each conv. layer consists of:
        * Conv2D
        * BatchNormalization
        * LeakyReLU (YOLOv4 uses Mish instead)
    * The input size in Figure A is 256x256, but the network does not require a fixed input size. The paper uses 512x512.

<center><img src ="https://i.imgur.com/BLy2lrt.png"></center>
<center>Figure A: Darknet53 architecture.</center>

* CSPNet (Cross Stage Partial Network):
    * CSPNet reduces the computation of a CNN by 10% to 20% without lowering accuracy, and in some cases even improves it.
    * As shown in Figure B, CSPNet splits the base layer into two parts according to a ratio $\gamma$. One part is left untouched; the other goes through the stage's blocks followed by a transition. The two parts are then concatenated and passed through one more transition.
    * In CSPDarknet53, the base layer is the feature map of the conv. layer in front of each ResBlock body, and each transition is a single conv. layer.

<center><img src ="https://i.imgur.com/5g7eJLm.png"></center>
<center>Figure B: CSPNet diagram.</center>

* SPP (Spatial Pyramid Pooling) + PAN (Path Aggregation Network):
    * The explanation here follows [Zhihu @周威](https://zhuanlan.zhihu.com/p/150127712); see Figure C.
    * SPP is used mainly in process1.
    * The 19x19, 38x38, and 76x76 feature-map sizes below correspond to a 608x608 input (strides 32, 16, and 8).
    * process1:
      ```python
      # Concatenate, MaxPooling2D and UpSampling2D are Keras layers.
      # DarknetConv2D_BN_Leaky, compose and darknet (the backbone model)
      # come from the referenced implementation and are not defined here.

      # input shape = 19x19
      y19 = DarknetConv2D_BN_Leaky(512, (1,1))(darknet.output)
      y19 = DarknetConv2D_BN_Leaky(1024, (3,3))(y19)
      y19 = DarknetConv2D_BN_Leaky(512, (1,1))(y19)

      # SPP: stride-1 max pooling with three kernel sizes keeps the 19x19
      # resolution; the results are concatenated with the unpooled input.
      maxpool1 = MaxPooling2D(pool_size=(13,13), strides=(1,1), padding='same')(y19)
      maxpool2 = MaxPooling2D(pool_size=(9,9), strides=(1,1), padding='same')(y19)
      maxpool3 = MaxPooling2D(pool_size=(5,5), strides=(1,1), padding='same')(y19)
      y19 = Concatenate()([maxpool1, maxpool2, maxpool3, y19])

      y19 = DarknetConv2D_BN_Leaky(512, (1,1))(y19)
      y19 = DarknetConv2D_BN_Leaky(1024, (3,3))(y19)
      y19 = DarknetConv2D_BN_Leaky(512, (1,1))(y19)
      ```
    * process2:
      ```python
      # upsampling: 19x19 -> 38x38
      y19_upsample = compose(DarknetConv2D_BN_Leaky(256, (1,1)), UpSampling2D(2))(y19)

      # input shape = 38x38; concatenate with the upsampled 19x19 branch
      y38 = DarknetConv2D_BN_Leaky(256, (1,1))(darknet.layers[204].output)
      y38 = Concatenate()([y38, y19_upsample])

      y38 = DarknetConv2D_BN_Leaky(256, (1,1))(y38)
      y38 = DarknetConv2D_BN_Leaky(512, (3,3))(y38)
      y38 = DarknetConv2D_BN_Leaky(256, (1,1))(y38)
      y38 = DarknetConv2D_BN_Leaky(512, (3,3))(y38)
      y38 = DarknetConv2D_BN_Leaky(256, (1,1))(y38)
      ```
    * process3:
      ```python
      # upsampling: 38x38 -> 76x76
      y38_upsample = compose(DarknetConv2D_BN_Leaky(128, (1,1)), UpSampling2D(2))(y38)

      # input shape = 76x76; concatenate with the upsampled 38x38 branch
      y76 = DarknetConv2D_BN_Leaky(128, (1,1))(darknet.layers[131].output)
      y76 = Concatenate()([y76, y38_upsample])
      ```
    * process4: downsample the output of process3 (Conv2D, filters=256, size=(3,3), strides=(2,2)), then concatenate it with the output of process2.
    * process5: downsample the output of process4 (Conv2D, filters=256, size=(3,3), strides=(2,2)), then concatenate it with the output of process1.

<center><img src ="https://i.imgur.com/tyn5wck.jpg"></center>
<center>Figure C: Overall YOLOv4 architecture. (Image source: Zhihu @周威)</center>

* YOLO HEAD 1 consists of several conv. layers. Its final layer has shape 76x76x(num_anchor\*(num_classes+5)). HEAD 2 and HEAD 3 follow the same pattern.

### Experiments & Results

* IOU Loss: CIOU (Complete IOU Loss)
    * $A$, $B$: the two boxes being compared (the predicted box and the ground truth box)
    * $w^{gt}$, $h^{gt}$: width and height of the ground truth box
    * $w$, $h$: width and height of the predicted box
    * $\rho(A_{ctr}, B_{ctr})$: distance between the centers of the two boxes
    * $c$: diagonal length of the smallest box enclosing both boxes

$$
\begin{aligned}
L_{CIOU} &= 1 - IOU(A, B) + \frac{\rho^{2}(A_{ctr}, B_{ctr})}{c^{2}} + \alpha \cdot v \\
\alpha &= \frac{v}{(1 - IOU(A, B)) + v} \\
v &= \frac{4}{\pi^{2}} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^{2}
\end{aligned}
$$

* Data Augmentation: Mosaic

<center><img src ="https://i.imgur.com/PLigRKe.png"></center>
<center>Figure 3: Mosaic augmentation.</center>

### References

* [YOLO V4: network architecture walkthrough (very detailed), Zhihu (zhihu.com)](https://zhuanlan.zhihu.com/p/150127712)
* [YOLO V4: loss function walkthrough (very detailed), Zhihu (zhihu.com)](https://zhuanlan.zhihu.com/p/159209199)
* [25] Spatial pyramid pooling in deep convolutional networks for visual recognition.
* [49] Path aggregation network for instance segmentation.
* [63] YOLOv3: An incremental improvement.
* [81] CSPNet: A new backbone that can enhance learning capability of CNN.
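
As a supplement to the CIOU formula in the Experiments & Results section, the loss for a single pair of boxes can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used by YOLOv4: the function name `ciou_loss`, the `(cx, cy, w, h)` box format, and the small `eps` term added for numerical safety are all choices made here and do not come from the paper or the referenced articles.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIOU loss for two boxes given as (cx, cy, w, h) with w, h > 0.

    Hypothetical helper for illustration; the box format and eps are assumptions.
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Convert center/size format to corner coordinates.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2

    # IOU(A, B): intersection area over union area.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # rho^2: squared distance between the two box centers.
    rho2 = (px - gx) ** 2 + (py - gy) ** 2

    # c^2: squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(p_x2, g_x2) - min(p_x1, g_x1)
    enclose_h = max(p_y2, g_y2) - min(p_y1, g_y1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # v measures aspect-ratio mismatch; alpha is its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)))
    print(ciou_loss((0.0, 0.0, 1.0, 1.0), (3.0, 0.0, 1.0, 1.0)))
```

For two identical boxes, every term vanishes and the loss is approximately 0. For two non-overlapping boxes of the same shape, the IOU term contributes 1 and the center-distance term still provides a signal, which is the motivation for adding it on top of the plain IOU loss.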