TinaFace: Strong but Simple Baseline for Face Detection

###### tags: `Paper Notes`

# TinaFace: Strong but Simple Baseline for Face Detection

* Organization: Media Intelligence Technology Co.,Ltd
* Year: 2020

### Background

* Face detection has made great progress in recent years, but much of the current research has produced methods designed only for face detection. In this paper, the authors argue that face detection is no different from generic object detection; at most it is a special case (one-class object detection). Methods that work for object detection therefore apply to face detection as well.
* Using ResNet-50 [11] as the backbone, the authors combine a number of simple and effective object detection techniques to build TinaFace.
* TinaFace achieves 92.1% average precision (AP) on the hard test set of the WIDER FACE dataset. With test time augmentation (TTA), it reaches 92.4% AP.

### Model Architecture

<center><img src="https://i.imgur.com/PSC3era.png"></center>
<center>Figure 1: TinaFace architecture. A denotes the number of anchors per grid cell.</center>

* Figure 1 shows the TinaFace architecture. TinaFace is based on the one-stage object detector RetinaNet [19], with modifications. The parts in red boxes mark where it differs from RetinaNet. The overall architecture is:
    * Backbone: ResNet-50 [11]
    * Neck: FPN [18] + Inception [36]
    * Head: FCN (Fully Convolutional Network)
    * Loss: Focal Loss [19] + DIoU Loss [61] + Cross-Entropy Loss
* Deformable Convolution Networks (DCN):
    * See Figure A. A standard convolutional layer samples its input at fixed positions of the previous feature map, so the shape of its receptive field is rigid and cannot adapt to the object we care about. By adding learned offsets to the sampling positions, DCN [4] can deform its receptive field toward the locations we want.
    * In stages 4 and 5 of the backbone, the authors use DCN instead of ordinary convolutional layers.

<center><img src="https://i.imgur.com/z4aiSoN.png" style="zoom:50%;" /></center>
<center>Figure A: Illustration of the effect of DCN. The dots mark the sampled inputs on the previous feature map.</center>

* Inception Module:
    * See Figure B.

<center><img src="https://i.imgur.com/wqFdaPG.png" style="zoom:50%;" /></center>
<center>Figure B: Inception architecture.</center>

* IoU-aware Branch:
    * The output of the purple box in Figure 1.(e) is the detection confidence, computed as:

$$
\text{score} = p_{i}^{\alpha} \, \text{IoU}_{i}^{1-\alpha}
$$

    * $p_{i}$, $\text{IoU}_{i}$: the classification score and the predicted IoU of the $i$-th detected box.
    * $\alpha \in [0, 1]$: a hyperparameter that balances the influence of the classification score and the predicted IoU.
* Distance-IoU Loss (DIoU):
    * DIoU Loss is used as the bbox regression loss. It is defined as:

$$
L_{DIoU} = 1 - \text{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}}
$$

    * $b$, $b^{gt}$: the center points of the predicted bbox and the ground truth bbox.
    * $\rho(\cdot, \cdot)$: the Euclidean distance between the two center points.
    * $c$: the diagonal length of the smallest box that encloses both the predicted bbox and the ground truth bbox.
    * Classification and IoU prediction use focal loss and cross-entropy loss, respectively.

### Experiments & Results

* WIDER FACE dataset:
    * Contains 32,203 images and 393,703 faces.
    * Split into train/val/test subsets at a ratio of 50%/10%/40%.
    * Each subset is further divided into three difficulty levels: Easy/Medium/Hard.
* Normalization Method: Batch Normalization (BN) is widely used, but it performs poorly when the batch size is below 4. Since not everyone has access to GPUs with large memory, the authors use Group Normalization [44] instead.
* Anchor and Assigner: for the 6 FPN levels, the anchor sizes are $2^{\frac{4}{3}} \times \{ 4, 8, 16, 32, 64, 128\}$, and the aspect ratio is set to the mean aspect ratio of the ground truth boxes.
* Table 2 compares TinaFace with other models. TTA stands for test time augmentation.

<center><img src="https://i.imgur.com/tUV2fPC.png" style="zoom:50%;" /></center>
<center>Table 2: Comparison of TinaFace with other models.</center>

* For training settings, data augmentation, and TTA, see page 5 of the original paper.

### References

* [4] Jifeng Dai et al. "Deformable convolutional networks"
* [11] Kaiming He et al. "Deep residual learning for image recognition"
* [18] Tsung-Yi Lin et al. "Feature pyramid networks for object detection"
* [19] Tsung-Yi Lin et al. "Focal loss for dense object detection"
* [36] Christian Szegedy et al. "Going deeper with convolutions" (Inception)
* [61] Zhaohui Zheng et al. "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression"
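
### Code Sketch

The three formulas in these notes (the IoU-aware detection score, the DIoU loss, and the anchor sizes) can be checked with a few lines of plain Python. This is only a minimal sketch, not the authors' implementation. The names `Box`, `iou`, `diou_loss`, `detection_score`, and `anchor_sizes` are hypothetical, boxes are assumed to be in `(x1, y1, x2, y2)` corner format, and the default `alpha=0.5` is an illustrative choice. The term $c$ is taken as the diagonal length of the smallest enclosing box, following the DIoU formulation in [61].

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) corners with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diou_loss(pred: Box, gt: Box) -> float:
    """L_DIoU = 1 - IoU + rho^2(b, b_gt) / c^2 for a single box pair."""
    # rho^2: squared Euclidean distance between the two box centers.
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    # c^2: squared diagonal of the smallest box enclosing both boxes.
    ew = max(pred[2], gt[2]) - min(pred[0], gt[0])
    eh = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = ew ** 2 + eh ** 2
    penalty = rho2 / c2 if c2 > 0 else 0.0
    return 1.0 - iou(pred, gt) + penalty


def detection_score(cls_score: float, pred_iou: float, alpha: float = 0.5) -> float:
    """score = p^alpha * IoU^(1 - alpha); alpha=0.5 is an illustrative default."""
    return cls_score ** alpha * pred_iou ** (1.0 - alpha)


def anchor_sizes() -> List[float]:
    """Anchor sizes 2^(4/3) * {4, 8, 16, 32, 64, 128}, one per FPN level."""
    return [2 ** (4 / 3) * s for s in (4, 8, 16, 32, 64, 128)]


if __name__ == "__main__":
    print(diou_loss((0, 0, 1, 1), (2, 0, 3, 1)))
    print(detection_score(0.81, 0.49))
    print(anchor_sizes())
```

With `alpha = 1` the score reduces to the plain classification score, and with `alpha = 0` it reduces to the predicted IoU, which is how the hyperparameter balances the two terms. For two boxes that do not overlap, the IoU term gives no signal, but the center-distance penalty still shrinks as the boxes move closer together.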