---
tags: Human Face
---

# FaceNet Face Recognition (CVPR 2015)

Paper : FaceNet: A Unified Embedding for Face Recognition and Clustering

> In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity.

## Contribution

1. Uses a CNN to extract a face embedding and optimizes the embedding itself directly.
2. Proposes the triplet loss.

## Architecture

![](https://i.imgur.com/2CwETff.png)

## Triplet loss

![](https://i.imgur.com/i6PRv4F.png)

- The embedding is represented by $f(x)\in R^{d}$. It embeds an image $x$ into a $d$-dimensional Euclidean space, and $||f(x)||^2_{2}=1$.
- $x^{a}_i$: image of anchor, $x^{p}_i$: image of positive (same identity as the anchor), $x^{n}_i$: image of negative (different identity); $m$ is the margin.
- objective: $$ ||f(x^{a}_i)-f(x^{p}_i)||^2_{2}+m \leq ||f(x^{a}_i)-f(x^{n}_i)||^2_{2} $$
- loss function: $$ \sum_{i}{\left[||f(x^{a}_i)-f(x^{p}_i)||^2_{2}- ||f(x^{a}_i)-f(x^{n}_i)||^2_{2}+m\right]_{+}} $$ where $[z]_{+}=\max(z, 0)$, so triplets that already satisfy the objective contribute zero loss.
- triplet selection:
    - Motivation: enumerating every possible triplet is infeasible because there are far too many. For example, in a dataset of 1000 people with 100 images each, there are $1000*100*99*100*999$ triplets (about $10^{12}$).
    - Solution:
        - Select anchor-positive and anchor-negative pairs within each mini-batch. Each mini-batch samples 40 people from the dataset with 40 images per person, plus 200 images of people outside those 40, for a total of 1800 images.
        - Within the mini-batch, use all anchor-positive pairs and, for each one, pick the hardest negative (the negative whose embedding is closest to the anchor embedding). This gives $40*40*39$ triplets.
        - Always using the hardest negative can drive the model into a bad local minimum, so an extra condition is placed on negative selection. Negatives that satisfy it are called semi-hard: they are farther from the anchor than the positive, but can still lie inside the margin $m$. $$ ||f(x^{a}_i)-f(x^{p}_i)||^2_{2} \lt ||f(x^{a}_i)-f(x^{n}_i)||^2_{2} $$

## Experiment

- Evaluation method (face verification task)
    - All face pairs $(i, j)$ of the same identity are denoted with $P_{same}$, whereas all pairs of different identities are denoted with $P_{diff}$. $D(x_{i}, x_{j})$ is the squared $L_2$ distance between the embeddings of the two images, and $d$ is the distance threshold.
    - true accepts : $TA(d)=\{(i,j)\in P_{same}|D(x_{i}, x_{j})\leq d\}$
    - false accepts : $FA(d)=\{(i,j)\in P_{diff}|D(x_{i}, x_{j})\leq d\}$
    - validation rate : $VAL(d)=|TA(d)|/|P_{same}|$
    - false accept rate : $FAR(d)=|FA(d)|/|P_{diff}|$
    - $VAL@10^{-3}FAR$ : $VAL(d_{FAR})$ with $d_{FAR}=FAR^{-1}(10^{-3})$, i.e. the validation rate at the threshold where the false accept rate equals $10^{-3}$
- Network architecture comparison

    | architecture                  | $VAL@10^{-3}FAR$ | FLOPS |
    | ----------------------------- |:----------------:|:-----:|
    | NN1 (Zeiler&Fergus 220×220)   |      87.9%       | 1.6B  |
    | NN2 (Inception 224×224)       |      89.4%       | 1.6B  |
    | NN3 (Inception 160×160)       |      88.3%       | 500M  |
    | NN4 (Inception 96×96)         |      82.0%       | 285M  |
    | NNS1 (mini Inception 165×165) |      82.4%       | 220M  |
    | NNS2 (tiny Inception 140×116) |      51.9%       |  20M  |

- Image quality

    ![](https://i.imgur.com/wjzKzBE.png)

- Embedding dimensionality

    ![](https://i.imgur.com/jZKey7J.png =390x250)

- Training data amount

    ![](https://i.imgur.com/yddvih8.png =433x250)

## Reference

code : [link](https://github.com/davidsandberg/facenet)

paper : [link](https://arxiv.org/pdf/1503.03832.pdf)
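
## Sketch

The triplet loss, the semi-hard condition, and the VAL/FAR metrics from the notes above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: embeddings are plain lists of floats that are already L2-normalized, and every function name here (`squared_l2`, `triplet_loss`, `is_semi_hard`, `val_and_far`) is hypothetical. None of it is taken from the paper or from the linked repository.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings (lists of floats)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge-style triplet loss for one triplet.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive (in squared distance).
    """
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


def is_semi_hard(anchor, positive, negative):
    """Semi-hard condition from the notes: the negative is farther
    from the anchor than the positive."""
    return squared_l2(anchor, positive) < squared_l2(anchor, negative)


def val_and_far(distances_same, distances_diff, d):
    """Validation rate and false accept rate at threshold d.

    distances_same: squared distances of all same-identity pairs (P_same).
    distances_diff: squared distances of all different-identity pairs (P_diff).
    """
    true_accepts = sum(1 for x in distances_same if x <= d)
    false_accepts = sum(1 for x in distances_diff if x <= d)
    return (true_accepts / len(distances_same),
            false_accepts / len(distances_diff))


if __name__ == "__main__":
    # Toy unit-length embeddings, chosen by hand for illustration only.
    anchor = [1.0, 0.0]
    positive = [0.6, 0.8]
    easy_negative = [0.0, 1.0]
    hard_negative = [0.8, 0.6]

    print(triplet_loss(anchor, positive, easy_negative, margin=0.2))
    print(triplet_loss(anchor, positive, hard_negative, margin=0.2))
    print(is_semi_hard(anchor, positive, easy_negative))
    print(val_and_far([0.1, 0.3, 0.9, 1.2], [0.4, 1.5, 1.8, 2.0], d=0.5))
```

In the toy data, `hard_negative` is closer to the anchor than the positive, so it fails the semi-hard condition and produces a positive loss, while `easy_negative` already satisfies the objective and contributes zero loss.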