---
tags: Human Face
---

# MTCNN Face Detection

Paper: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

## Contribution

1. Proposes a multi-task network that combines face detection and face alignment.
2. Proposes Online Hard Sample Mining to improve model accuracy.

## Inference pipeline

![](https://i.imgur.com/9rEi2wi.png)

1. Build an image pyramid.
2. P-Net (proposal net)
    - Goal: feed the image pyramid into P-Net to obtain a large number of candidate face boxes, then refine them with bounding box regression.
    - Preprocess: slide a 12x12 window with stride 4 over each level of the image pyramid, and feed each 12x12 window into P-Net.
    - Predicted output:
        - face classification (2-dim vector)
        - bounding box regression (4-dim vector)
        - face landmark localization (10-dim vector)
            - coordinates of the left eye, right eye, left mouth corner, right mouth corner, and nose
    - Postprocess:
        - filter all candidate face boxes with NMS
        - refine the face boxes with bounding box regression
    - Bounding box regression:
    ```python=
    def bbox_regression(bbox, reg):
        """
        :param bbox: (x1, y1, x2, y2), where (x1, y1) and (x2, y2) are the
                     top-left and bottom-right corners of the box
        :param reg: the bounding box regression vector predicted by the model
        :return: the refined bbox
        """
        # width and height of the bbox
        bbw = bbox[2] - bbox[0] + 1
        bbh = bbox[3] - bbox[1] + 1
        bbox_c = [bbox[0] + reg[0] * bbw,
                  bbox[1] + reg[1] * bbh,
                  bbox[2] + reg[2] * bbw,
                  bbox[3] + reg[3] * bbh]
        return bbox_c
    ```
    - Architecture: ![](https://i.imgur.com/xoHHxqR.png)
3. R-Net (refine net)
    - Goal: filter P-Net's large set of candidate face boxes more accurately and refine them with bounding box regression.
    - Preprocess: resize the face boxes predicted by P-Net to 24x24 windows and feed each 24x24 window into R-Net.
    - Predicted output & postprocess: same as P-Net.
    - Architecture: ![](https://i.imgur.com/OawL6kO.png)
4. O-Net (output net)
    - Goal: largely the same as R-Net, except that it puts more emphasis on accurately outputting the facial landmarks.
    - Preprocess: resize the face boxes predicted by R-Net to 48x48 windows and feed each 48x48 window into O-Net.
    - Predicted output & postprocess: same as P-Net and R-Net.
    - Architecture: ![](https://i.imgur.com/mzstUYE.png)

## Training Method

- Training data sources
    - WIDERFACE and CelebA
- Data preparation
    - There are four types of training samples:
        - Randomly crop boxes from WIDERFACE images:
            - if a box's IoU with the label is less than 0.3, it is a negative sample
            - if its IoU with the label is greater than 0.65, it is a positive sample
            - if its IoU with the label is between 0.4 and 0.65, it is a part sample
        - Landmark face samples are taken from CelebA.
    - Within a minibatch, keep the four sample types at a ratio of 3:1:1:2 (negative : positive : part : landmark).
- Loss:
    - face classification
        - cross-entropy loss: $L^{det}_{i}=-(y^{det}_{i}\log(p_{i})+(1-y^{det}_{i})\log(1-p_{i}))$
        - only positive and negative samples contribute to this loss
    - bounding box regression
        - L2 loss: $L^{box}_{i}=||\hat{y}^{box}_{i}-y^{box}_{i}||^2_2$
        - only positive and part samples contribute to this loss
    - face landmark localization
        - L2 loss: $L^{landmark}_{i}=||\hat{y}^{landmark}_{i}-y^{landmark}_{i}||^2_2$
        - only landmark face samples contribute to this loss
    - total loss
        - P-Net and R-Net: $L^{total}=\sum_{i}(L^{det}_{i}+0.5L^{box}_{i}+0.5L^{landmark}_{i})$
        - O-Net: $L^{total}=\sum_{i}(L^{det}_{i}+0.5L^{box}_{i}+L^{landmark}_{i})$
- Hard sample mining
    - Applied only to the classification loss: within each batch, only the training samples in the top 70% of classification (det) loss are used for back propagation.
- Training pipeline
    - Train P-Net, R-Net, and O-Net in order.
    - P-Net
        - Randomly crop boxes from WIDERFACE images, label them as negative, positive, or part samples as described above, and take landmark face samples from CelebA.
    - R-Net
        - Run the WIDERFACE images through P-Net once to get face boxes, label them as negative, positive, or part samples as described above, and take landmark face samples from CelebA.
    - O-Net
        - Run the WIDERFACE images through P-Net and R-Net once to get face boxes, label them as negative, positive, or part samples as described above, and take landmark face samples from CelebA.
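The hard-sample-mining selection described above can be sketched as follows. This is a minimal illustration, not the paper's code: `hard_sample_mining` is a hypothetical helper name, and it assumes the per-sample classification (det) losses for one minibatch have already been computed.

```python
def hard_sample_mining(losses, keep_ratio=0.7):
    """Select the hardest samples in a minibatch (illustrative sketch).

    :param losses: per-sample classification (det) losses for one minibatch
    :param keep_ratio: fraction of samples kept for back propagation
                       (0.7 = top 70%, as described above)
    :return: sorted indices of the kept (hardest) samples
    """
    n_keep = max(1, int(len(losses) * keep_ratio))
    # Rank samples by loss, highest (hardest) first, and keep the top fraction;
    # only these samples contribute gradients during back propagation.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:n_keep])
```

In training, the classification loss would then be averaged over only the returned indices before the backward pass, so the easy samples in the batch are ignored.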