---
tags: Human Face
---
# MTCNN Face Detection
Paper: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks
## Contribution
1. Proposes a multi-task network that performs face detection and face alignment jointly.
2. Proposes online hard sample mining to improve model accuracy.
## Inference pipeline

1. Build an image pyramid
2. P-Net (proposal net)
- Purpose: feed the image pyramid into P-Net to obtain a large number of candidate face boxes, then refine them with bounding box regression
- Preprocess: conceptually, slide a 12x12 window with stride 2 over each level of the image pyramid and feed each 12x12 window into P-Net (P-Net is fully convolutional, so in practice each pyramid level can be fed in whole)
- Predict output:
- face classification (2-dim vector)
- bounding box regression (4-dim vector)
- face landmark localization (10-dim vector)
- coordinates of the left eye, right eye, left mouth corner, right mouth corner, and nose
- Postprocess:
- filter all candidate face boxes with NMS
- refine the face boxes with bbox regression
- Bounding box regression:
```python=
def bbox_regression(bbox, reg):
    """
    param bbox: (x1, y1, x2, y2), where (x1, y1) and (x2, y2) are the top-left and bottom-right corners of the bbox
    param reg: the bbox regression vector output by the model
    return: the refined bbox
    """
    # width and height of the bbox
    bbw = bbox[2] - bbox[0] + 1
    bbh = bbox[3] - bbox[1] + 1
    bbox_c = [bbox[0] + reg[0] * bbw,
              bbox[1] + reg[1] * bbh,
              bbox[2] + reg[2] * bbw,
              bbox[3] + reg[3] * bbh]
    return bbox_c
```
- Architecture :

3. R-Net (refine net)
- Purpose: filter the large number of candidate face boxes produced by P-Net more accurately, then refine them with bounding box regression
- Preprocess: resize the face boxes predicted by P-Net to 24x24 windows and feed each 24x24 window into R-Net
- Predict output & Postprocess: same as P-Net
- Architecture :

4. O-Net (output net)
- Purpose: roughly the same as R-Net, except that more emphasis is placed on outputting accurate facial landmarks
- Preprocess: resize the face boxes predicted by R-Net to 48x48 windows and feed each 48x48 window into O-Net
- Predict output & Postprocess: same as P-Net and R-Net
- Architecture :
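Step 1 (building the image pyramid) is commonly implemented by shrinking the input by a fixed factor until the shorter side would fall below P-Net's 12x12 input. The `min_face_size=20` and `factor=0.709` defaults below are conventions from popular MTCNN implementations, not values mandated by the paper:

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709):
    """Scales at which to resize the image before feeding it to P-Net.

    The first scale maps a min_face_size face onto P-Net's 12x12 input;
    each subsequent scale shrinks the image by `factor` until the shorter
    side would drop below 12 pixels.
    """
    scale = 12.0 / min_face_size  # smallest detectable face fills 12x12
    min_side = min(height, width) * scale
    scales = []
    while min_side >= 12:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```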

## Training Method
- Training Data Sources
- WIDERFACE and CelebA
- Data Preparation
- There are four types of training samples:
- Randomly crop boxes from the WIDERFACE images
- If a box's IoU with the ground-truth label is less than 0.3, it is a negative sample
- If its IoU with the label is greater than 0.65, it is a positive sample
- If its IoU with the label is between 0.4 and 0.65, it is a part sample
- Landmark face samples are taken from CelebA
- Within a mini-batch, keep the ratio of the four sample types (negative : positive : part : landmark) at 3:1:1:2
- Loss:
- face classification
- cross-entropy loss : $L^{det}_{i}=-(y^{det}_{i}log(p_{i})+(1-y^{det}_{i})log(1-p_{i}))$
- only positive and negative samples contribute to this loss
- bounding box regression
- l2 loss : $L^{box}_{i}=||\hat{y}^{box}_{i}-y^{box}_{i}||^2_2$
- only positive and part samples contribute to this loss
- face landmark localization
- l2 loss : $L^{landmark}_{i}=||\hat{y}^{landmark}_{i}-y^{landmark}_{i}||^2_2$
- only landmark face samples contribute to this loss
- total loss
- P-Net and R-Net
$L^{total}=\sum_{i}(L^{det}_{i}+0.5L^{box}_{i}+0.5L^{landmark}_{i})$
- O-Net
$L^{total}=\sum_{i}(L^{det}_{i}+0.5L^{box}_{i}+L^{landmark}_{i})$
- Hard Sample Mining
- Hard sample mining is applied only to the classification loss: within each mini-batch, sort the samples by their classification (det) loss and back-propagate only the top 70%
- Training pipeline
- Train P-Net, R-Net, and O-Net in sequence
- P-Net
- Randomly crop boxes from the WIDERFACE images, classify them into negative, positive, and part samples as described above, and take landmark face samples from CelebA
- R-Net
- Run the WIDERFACE images through the trained P-Net to obtain face boxes, classify them into negative, positive, and part samples as described above, and take landmark face samples from CelebA
- O-Net
- Run the WIDERFACE images through the trained P-Net and R-Net to obtain face boxes, classify them into negative, positive, and part samples as described above, and take landmark face samples from CelebA
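The online hard sample mining rule above (back-propagate only the top 70% of classification losses in each mini-batch) can be sketched as follows; `hard_sample_mining` is a hypothetical helper name, and the 0.7 keep ratio matches the paper:

```python
def hard_sample_mining(det_losses, keep_ratio=0.7):
    """Return the indices of the hardest samples in a batch.

    Sort per-sample classification losses in descending order and keep
    the top `keep_ratio` fraction; only these samples contribute
    gradients in the backward pass.
    """
    n_keep = max(1, int(len(det_losses) * keep_ratio))
    order = sorted(range(len(det_losses)),
                   key=lambda i: det_losses[i], reverse=True)
    return order[:n_keep]
```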