# ArcFace: Additive Angular Margin Loss for Deep Face Recognition(翻譯) ###### tags:`論文翻譯` `deeplearning` [TOC] ## 說明 版面的部份會以段落方式,先原文,再譯文,圖片與表格會插入第一次提到該照片的段落譯文下面 :::warning 個人註解,任何的翻譯不通暢部份都請留言指導 ::: :::danger * [paper hyperlink](https://arxiv.org/abs/1801.07698) ::: **Abstract** — Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability. In this paper, we first introduce an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly enhances the discriminative power. Since ArcFace is susceptible to the massive label noise, we further propose sub-center ArcFace, in which each class contains $K$ sub-centers and training samples only need to be close to any of the $K$ positive sub-centers. Sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces. Based on this self-propelled isolation, we boost the performance through automatically purifying raw web faces under massive real-world noise. Besides discriminative feature embedding, we also explore the inverse problem, mapping feature vectors to face images. Without training any additional generator or discriminator, the pre-trained ArcFace model can generate identity-preserved face images for both subjects inside and outside the training data only by using the network gradient and Batch Normalization (BN) priors. Extensive experiments demonstrate that ArcFace can enhance the discriminative feature embedding as well as strengthen the generative face synthesis. **Abstract** — 近來,人臉辨識中的一個熱門的研究方向,就是在softmax loss function中採用margins的概念來最大化類別分離。在這篇論文中,我們首先引入Additive Angular Margin Loss (ArcFace),它不僅具有清晰的幾何解釋,還顯著的增加了判別的能力。由於ArcFace容易受到大量標籤噪點的影響,我們進一步提出子sub-center ArcFace,其中每個類別包含$K$個sub-centers,訓練樣本只需要靠近$K$個positive sub-centers中的任何一個。sub-centers ArcFace形成一個主導的子類別,該子類別包含大多數乾淨的臉部樣本,然後非主導的子類別則包含難以辨識或有噪點的臉部樣本。基於這種自驅動隔離(self-propelled isolation)機制,我們透過自動淨化大量含有真實世界噪點的網路臉部資料來提升效能。除了判別性特徵嵌入之外,我們還探索了逆向的問題,也就是將特徵向量映射到人臉圖像。無需訓練任何額外的生成器(generator)或判別器(discriminator),預訓練的ArcFace模型只需使用網路梯度(network gradient)和Batch Normalization (BN) priors,即可為訓練資料內部和外部的研究對象生成身份保持(identity-preserved)的人臉圖像。大量的實驗表明,ArcFace可以增強判別性特徵嵌入,同時還能強化生成人臉合成的能力。 ## 1 INTRODUCTION FACE representation using DCNN embedding is the method of choice for face recognition [1], [2], [3], [4], [5], [6]. DCNNs map the face image, typically after a pose normalization step [7], [8], into a feature that should have small intra-class and large inter-class distance. There are two main lines of research to train DCNNs for face recognition. Some train a multi-class classifier which can separate different identities in the training set, such by using a softmax classifier [2], [4], [9], [10], [11], and the others learn directly an embedding, such as the triplet loss [3]. Based on the large-scale training data and the elaborate DCNN architectures, both the softmax-loss-based methods [9] and the triplet-loss-based methods [3] can obtain excellent performance on face recognition. However, both the softmax loss and the triplet loss have some drawbacks. For the softmax loss:(1) the learned features are separable for the closed-set classification problem but not discriminative enough for the open-set face recognition problem; (2) the size of the linear transformation matrix $W \in \mathbb{R}^{d\times N}$ increases linearly with the identities number $N$. 
For the triplet loss: (1) there is a combinatorial explosion in the number of face triplets especially for large-scale datasets, leading to a significant increase in the number of iteration steps; (2) semi-hard sample mining is a quite difficult problem for effective model training. 使用DCNN embedding的人臉表示(FACE representation)是人臉辨識的首選方法 [1]、[2]、[3]、[4]、[5]、[6]。 DCNNs通常會在姿勢正規化步驟(pose normalization step) [7]、[8] 之後將人臉影像映射為應具有較小的類別內(intra-class)距離和較大的類別間(inter-class)距離的特徵。訓練DCNNs來做人臉辨識有兩個主要研究方向。有些會訓練多類別的分類器,可以分離訓練集中不同的身份,例如使用softmax classifier [2]、[4]、[9]、[10]、[11],而另一些則直接學習嵌入(embedding),例如triplet loss[3]。基於大規模訓練資料和精心設計的DCNN架構,無論是基於基於softmax-loss的方法 [9] 還是基於Triplet-loss的方法 [3] 都可以在人臉辨識方面獲得優異的效能。然而,softmax loss和triplet loss都有一些缺點。對於softmax loss:(1)學習到的特徵對於[閉集](https://terms.naer.edu.tw/detail/b06876b900e92f3790ecb5c121916e54/)分類問題是可分離的,但對於[開集](https://terms.naer.edu.tw/detail/def55a9fef8dc4563ffa63683b89d3cc/)人臉辨識問題的辨別力是不足的;(2)線性[變換矩陣](https://terms.naer.edu.tw/detail/998b425f5616f55b3de6c6d9808581c4/)$W \in \mathbb{R}^{d\times N}$的大小隨著identities number(身份的數量?)$N$線性增長。對於triplet loss:(1)人臉三元組的組合是爆炸性的數量,特別是對於大規模資料集,導致迭代步數的明顯增加; (2)semi-hard sample的探勘對於有效的模型訓練來說是一個相當困難的問題。 :::warning semi-hard sample:負樣本到錨點樣本的距離大於正樣本到錨點樣本的距離 ::: To adopt margin benefit but avoid the sampling problem in the Triplet loss [3], recent methods [13], [14], [15] focus on incorporating margin penalty into a more feasible framework, the softmax loss, which has global sample-to-class comparisons within the multiplication step between the embedding feature and the linear transformation matrix. Naturally, each column of the linear transformation matrix is viewed as a class center representing a certain class. Sphereface [13] introduces the important idea of angular margin, however their loss function requires a series of approximations, which results in an unstable training of the network. In order to stabilize training, they propose a hybrid loss function which includes the standard softmax loss. Empirically, the softmax loss dominates the training process, because the integer-based multiplicative angular margin makes the target logit curve very precipitous and thus hinders convergence. 為了能夠利用邊界(margin)的優勢,同時避免Triplet loss [3]中的採樣問題,近來的方法 [13]、[14]、[15] 都專注於將邊界懲罰項(margin penalty)整合到更可行的框架,也就是softmax loss,其於嵌入特徵(embedding feature)與線性變換步驟之間的乘法步驟中做了全域的樣本到類別的比較。很自然地,線性變換矩陣的每一個column就會被視為代表某個類別的類別中心。 Sphereface[13]引入了angular margin的重要概念,不過其損失函數需要一系列近似計算,這導致了網路訓練的不穩定。為了穩定訓練,他們提出了一種混合的損失函數,其中包括標準的softmax loss。根據經驗,softmax loss在訓練過程中占主導地位,因為基於整數的[乘性](https://terms.naer.edu.tw/detail/ff5b6298e53c1581fc78d13074e7a020/)angular margin使目標的logit curve(對數曲線?)非常陡峭,從而阻礙了模型的收斂。 :::warning angular margin,其中angular翻譯為『角』,翻為角邊界覺得很怪,就保留原文 ::: In this paper, we propose an Additive Angular Margin loss [16] to stabilize the training process and further improve the discriminative power of the face recognition model. More specifically, the dot product between the DCNN feature and the last fully connected layer is equal to the cosine distance after feature and center normalization. We utilize the arc-cosine function to calculate the angle between the current feature and the target center. Afterwards, we introduce an additive angular margin to the target angle, and we get the target logit back again by the cosine function. Then, we re-scale all logits by a fixed feature norm, and the subsequent steps are exactly the same as in the softmax loss. 
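:::warning
以下依這段所描述的流程(特徵與權重做 $\ell_2$ 正規化 → 反餘弦取得角度 → 對目標角度加上 additive angular margin → 餘弦還原 target logit → 乘上固定的特徵範數 $s$),用 numpy 寫一個極簡的示意。這不是官方實作,函數與變數命名皆為假設,單純說明 target logit 是怎麼被修改的:
```python
import numpy as np

def arcface_logits(x, W, y, s=64.0, m=0.5):
    """極簡示意:x 為單一樣本特徵 (d,),W 為權重 (d, N),y 為真實類別索引。
    回傳加入 additive angular margin 後、乘上尺度 s 的 logits (N,)。"""
    x = x / np.linalg.norm(x)                          # 特徵 l2 正規化
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # 每個類別中心 l2 正規化
    cos_theta = W.T @ x                                # 內積即為 cos(theta_j)
    theta_y = np.arccos(np.clip(cos_theta[y], -1.0, 1.0))  # 目標類別的角度
    cos_theta[y] = np.cos(theta_y + m)                 # 只對目標類別加上角度邊界 m
    return s * cos_theta                               # 重新縮放,後續流程與 softmax loss 相同

# 簡單試跑
rng = np.random.default_rng(0)
logits = arcface_logits(rng.normal(size=512), rng.normal(size=(512, 10)), y=3)
print(logits.shape)  # (10,)
```
:::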
Due to the exact correspondence between the angle and arc in the normalized hypersphere, our method can directly optimize the geodesic distance margin, thus we call it ArcFace. 在這篇論文中,我們提出了一種Additive(加性) Angular Margin loss [16]來穩定訓練過程並進一步提高人臉辨識模型的判別能力。更具體地說,DCNN feature與最後一個全連接層之間的點積就等於特徵和中心正規化後的餘弦距離。我們利用反餘弦函數來計算目前特徵與目標中心之間的角度。然後,我們對目標角度引入additive angular margin(加性的角度邊界),並透過餘弦函數再次的得到target logit。然後,我們透過固定的特徵範數重新縮放所有logits,後續步驟就跟softmax loss中的步驟完全相同。由於正規化的[超球面](https://terms.naer.edu.tw/detail/dca3ade161c39bf400b37700f390dae0/)中,角度與弧度的精確對應,我們的方法可以直接地最佳化[測地線](https://terms.naer.edu.tw/detail/368e37875da196cd10da3445360b0742/)距離邊界,因此我們稱之為ArcFace。 :::warning logits,理論上,應該指的是softmax之前的tensor(或向量?),經過softmax之後就會變成機率分佈 ::: :::warning [測地線](https://zh.wikipedia.org/zh-tw/%E6%B5%8B%E5%9C%B0%E7%BA%BF)(英語:Geodesic)又稱大地線或短程線,數學上可視作直線在彎曲空間中的推廣;在有度規定義存在之時,測地線可以定義為空間中兩點的局域最短路徑。測地線(英語:geodesic)的名字來自對於地球尺寸與形狀的大地測量學(英語:geodesy)。 --取自維基百科說明 ::: Even though impressive performance has been achieved by the margin-based softmax methods [17], [13], [14], [15], they all need to be trained on well-annotated clean datasets [18], which require intensive human efforts. Wang et al. [18] found that faces with label noise significantly degenerate the recognition accuracy and manually built a high-quality dataset including 1.7M images of 59K celebrities. However, it took 50 annotators to work continuously for one month to clean the dataset, which further demonstrates the difficulty of obtaining a large-scale clean dataset for face recognition. Since accurate manual annotations can be expensive [18], learning with massive noisy data has recently attracted much attention [19], [20], [21]. However, computing time-varying weights for samples [19] or designing piece-wise loss functions [20] according to the current model’s predictions can only alleviate the influence from noisy data to some extent as the robustness and improvement depend on the initial performance of the model. Besides, the co-mining method [21] requires to train twin networks together thus it is less practical for training large models on large-scale datasets. 儘管基於邊界的softmax的方法 [17]、[13]、[14]、[15] 已經取得了令人印象深刻的效能,不過它們都需要在標記良好的乾淨資料集上進行訓練[18],這需要大量的人力投入。Wang等人[18]發現帶有標記噪點(label noise)的人臉照片會明顯降低辨識的準確度,並手動建立了包含59,000位名人的170萬張照片的高品質資料集。不過吼,光清理資料集就要50個人不眠不休連續工作一個月,這進一步說明了獲取用於人臉辨識的大規模乾淨資料集的難度。由於準確的手工註解成本昂貴[18],因此利用大量帶有噪點的資料來做學習的方式在最近逐漸受到關注[19]、[20]、[21]。然而,為樣本計算[時變](https://terms.naer.edu.tw/detail/7809f67d80014652d42fee58ad038052/)權重[19]或根據當前模型的預測設計piece-wise loss functions[20]也就只能在一定程度上減輕噪點資料的影響,因為穩健性和改善取決於模型的初始效能。此外,co-mining method[21]需要一起訓練孿生網路(twin networks),因此在大規模資料集上訓練大型模型就沒那麼實用。 To improve the robustness under massive real-world noise, we relax the intra-class constraint of forcing all samples close to the corresponding positive centers by introducing sub-classes into ArcFace [22]. As illustrated in Figure 1, we design $K$ sub-centers for each class and the training sample only needs to be close to any of the $K$ positive sub-centers instead of the only one positive center. If a training face is a noisy sample, it does not belong to the corresponding positive class. In ArcFace, this noisy sample generates a large wrong loss value, which impairs the model training. In sub-center ArcFace, the intra-class constraint enforces the training sample to be close to one of the multiple positive sub-centers but not all of them. The noise is likely to form a nondominant sub-class and will not be enforced into the dominant sub-class. 
Therefore, sub-center ArcFace is more robust to noise. In our experiments, we find the proposed sub-center ArcFace can encourage one dominant sub-class that contains the majority clean faces and multiple non-dominant sub-classes that include hard or noisy faces. This automatic isolation can be directly employed to clean the training data through dropping non-dominant subcenters and high-confident noisy samples. Based on the proposed sub-center ArcFace, we can automatically obtain large-scale clean training data from raw web face images to further improve the discriminative power of the face recognition model. 為了提高模型在大量真實世界噪點下的穩健性,我們透過在ArcFace中引入子類別的方式,來放寬強制所有樣本接近對應positive center的類別內(intra-class)的約束[22]。如Figure 1所示,我們為每個類別設計$K$個sub-centers,然後,訓練樣本就只需要接近K個positive sub-centers中的任何一個,而不限於單一個positive center。如果訓練的人臉是噪點樣本(noisy sample),那它就不屬於相對應的正類別(positive class)。在ArcFace中,這個噪點樣本會產生很大的錯誤損失值,進而影響模型的訓練。在sub-centers ArcFace中,類別內(intra-class)的約束會強制訓練樣本接近多個positive sub-centers的其中一個,但不是全部。噪點可能會形成非主導的子類別,而且不會強制歸入主導的子類別。因此,sub-center ArcFace對噪點的部份有著更強的穩健性。在我們的實驗中,我們發現所提出的sub-centers ArcFace會促成一個包含大多數乾淨臉孔的主導的子類別和多個包含困難樣本或噪點臉部的非主導的子類別。這種自動隔離的方式可以透過剔除非主導的sub-centers和高置信度噪點樣本的方式直接用來做訓練資料的清理。基於所提出的sub-centers ArcFace,我們可以從原始的網路臉部影像中自動獲得大量乾淨的訓練資料,進一步提升人臉辨識模型的判別力。 ![image](https://hackmd.io/_uploads/B1hw1f8byg.png) Fig. 1. Comparisons of Triplet [3], Tuplet [12], ArcFace and sub-center ArcFace. Triplet and Tuplet conduct local sample-to-sample comparisons with Euclidean margins within the mini-batch. By contrast, ArcFace and sub-center ArcFace conduct global sample-to-class and sample-to-subclass comparisons with angular margins. In Figure 1, we compare the differences between Triplet [3], Tuplet [12], ArcFace and sub-center ArcFace. Triplet loss [3] only considers local sample-to-sample comparisons with Euclidean margins within the mini-batch. Tuplet loss [12] further enhances the comparisons by using all of the negative pairs within the mini-batch. By contrast, the proposed ArcFace and sub-center ArcFace conduct global sample-to-class and sample-to-subclass comparisons with angular margins. 在Figure 1中,我們比較了Triplet[3]、Tuplet[12]、ArcFace和sub-center ArcFace之間的差異。Triplet loss [3] 只考慮小批量內局部樣本與樣本之間的Euclidean margins的比較。Tuplet loss [12] 則是透過使用小批量中的所有負樣本對來進一步增強比較。相較之下,我們所提出的ArcFace和sub-center ArcFace使用angular margins來進行全域的樣本到類別和樣本到子類別的比較。 As the proposed ArcFace is effective for the mapping from the face image to the discriminative feature embedding, we are also interested in the inverse problem: the mapping from a low-dimensional latent space to a highly nonlinear face space. Synthesizing face images [23], [24], [25], [26], [27], [28], [29] has recently brought much attention from the community. DeepDream [30] is proposed to transform a random input to yield a high output activation for a chosen class by employing the gradient from the pre-trained classification model and some regularizers (e.g. total variance [31] for maintaining piece-wise constant patches).Even though DeepDream can keep the selected output response high to preserve identity, the resulting faces are not realistic, lacking natural face statistics. Inspired by the pioneer generative face recognition model (Eigenface [32]) and recent data-free methods [33], [34], [35] for restoring ImageNet images, we employ the statistic prior (e.g. mean and variance stored in the BN layers) to constrain the face generation. In this paper, we show that the proposed ArcFace can also enhance the generative power. 
Without training any additional generator or discriminator like in Generative Adversarial Networks (GANs) [36], the pre-trained ArcFace model can generate identity-preserved and visually reasonable face images only by using the gradient and BN priors.

由於所提出的ArcFace對於從臉部影像映射到判別性特徵嵌入(discriminative feature embedding)是有效的,因此我們對於反過來的問題也同樣感興趣:從一個低維度的潛在空間映射到高度非線性的臉部空間。合成臉部影像[23]、[24]、[25]、[26]、[27]、[28]、[29]最近引起了社群相當多的關注。DeepDream[30]的提出,可以用來轉換隨機輸入,以生成對所選定類別的高輸出激活(high output activation),這是透過預訓練分類模型的梯度以及一些正則化器(例如用於piece-wise constant patches的總變異[31])所實現的。儘管DeepDream可以保持所選定的輸出響應的高度以維持其身份,不過所產生的臉部並不真實,也就缺乏自然的臉部統計數據。受到早期的生成式人臉辨識模型(Eigenface [32])和最近用於恢復ImageNet影像的data-free methods[33]、[34]、[35]的啟發,我們採用了統計先驗(statistical prior)的方式(像是儲存在BN網路層中的平均數和變異數)來限制臉部的生成。在這篇論文中,我們說明所提出的ArcFace同時可以增強生成能力。不需要像生成式對抗網路(GAN)[36]那樣訓練任何額外的生成器或判別器,預訓練的ArcFace模型只需使用梯度和BN priors就可以生成維持身份特徵且視覺上合理的人臉影像。

The advantages of the proposed methods can be summarized as follows:

我們所提出方法的優點可以總結如下:

**Intuitive.** ArcFace directly optimizes the geodesic distance margin by virtue of the exact correspondence between the angle and arc in the normalized hypersphere. The proposed additive angular margin loss can intuitively enhance the intra-class compactness and inter-class discrepancy during discriminative learning of face feature embedding.

**直觀.** ArcFace藉由正規化超球面上角度(angle)與弧(arc)之間確切的對應關係,直接最佳化測地距離邊界(geodesic distance margin)。我們所提出的加性角度邊界損失(additive angular margin loss)能夠在臉部特徵嵌入的判別學習過程中,直觀地增強類別內的緊密度(intra-class compactness)與類別間的差異性(inter-class discrepancy)。

:::warning
測地距離(geodesic distance)指在曲面上兩點之間的最短路徑距離,在此處的超球面上特別對應角度與弧長的關係。ArcFace通過角度增量,將嵌入空間的類別內的特徵聚攏、類別間的特徵分散,從而提升區分度。
:::

**Economical.** We introduce sub-class into ArcFace to improve its robustness under massive real-world noises. The proposed subcenter ArcFace can automatically clean the large-scale raw web faces (e.g. MS1MV0 [37] and Celeb500K [38]) without expensive and intensive human efforts. The automatically cleaned training data, named IBUG-500K, has been released to facilitate future research.

**經濟.** 我們在ArcFace中引入sub-class,以提高其在大量真實世界噪點下的穩健性。我們所提出的sub-center ArcFace可以自動清理大規模的網路上爬下來的人臉資料(例如MS1MV0 [37]和Celeb500K[38]),而不需要昂貴且密集的人力投入。自動清理的訓練資料,名為IBUG-500K,已經發布,以促進未來的研究發展。

**Easy.** ArcFace only needs several lines of code and is extremely easy to implement in the computational-graph-based deep learning frameworks, e.g. MxNet [39], Pytorch [40] and Tensorflow [41]. Furthermore, contrary to the works in [13], [42], ArcFace does not need to be combined with other loss functions in order to have stable convergence.

**簡單.** ArcFace只需要幾行程式碼,並且在基於計算圖的深度學習框架中非常容易實現,像是MxNet [39]、Pytorch [40] 與 Tensorflow [41]。此外,與[13]、[42]中的研究相反,ArcFace不需要以結合其它的損失函數的方式來獲得穩定的收斂。

**Efficient.** ArcFace only adds negligible computational complexity during training. The proposed center parallel strategy can easily support millions of identities for training on a single server (8 GPUs).

**高效率.** ArcFace在訓練過程中只增加了可忽略不計的計算複雜度。所提出的中心平行策略(center parallel strategy)可以輕鬆地支援在單一伺服器(8個GPU)上訓練數百萬個身份。

**Effective.** Using IBUG-500K as the training data, ArcFace achieves state-of-the-art performance on ten face recognition benchmarks including large-scale image and video datasets collected by us. Impressively, our model reaches 97.27% TPR@FPR=1e-4 on IJB-C. Code and pre-trained models have been made available.
**有效性.** 使用IBUG-500K作為訓練資料,ArcFace在十個人臉辨識基準測試(包括我們收集的大規模影像和視訊資料集)上實現了最先進(state-of-the-art)的效能。令人印象深刻的是,我們的模型在IJB-C上達到了97.27% TPR@FPR=1e-4。程式碼和預訓練模型已經公開。

**Engaging.** ArcFace can not only enhance the discriminative power but also strengthen the generative power. By accessing the network gradient and employing the statistic priors stored in the BN layers, the pre-trained ArcFace model can restore identity-preserved and visually plausible face images for both subjects inside and outside the training data.

**吸引力.** ArcFace不僅可以增強判別的能力,還可以增強生成的能力。透過存取網路的梯度並利用保存在BN層中的統計先驗,預訓練的ArcFace模型可以為訓練資料內部和外部的對象恢復身份保留且視覺上可信的臉部影像。

## 2 RELATED WORK

**Face Recognition with Margin Penalty.** As shown in Figure 1, the pioneering work [3] uses the Triplet loss to exploit triplet data such that faces from the same class are closer than faces from different classes by a clear Euclidean distance margin. Even though the Triplet loss makes perfect sense for face recognition, the sample-to-sample comparisons are local within mini-batch and the training procedure for the Triplet loss is very challenging as there is a combinatorial explosion in the number of triplets especially for large-scale datasets, requiring effective sampling strategies to select informative mini-batch [43], [3] and choose representative triplets within the mini-batch [44], [12]. As the Triplet loss trained with semi-hard negative mining converges slower due to the ignorance of too many examples, a double-margin contrastive loss is proposed in [45] to explore more informative and stable examples by distance weighted sampling, thus it converges faster and more accurately. Some other works tried to reduce the total number of triplets with proxies [46], [47], i.e., sample-to-sample comparison is changed into sample-to-proxy comparison. However, sampling and proxy methods only optimize the embedding of partial classes instead of all classes in one iteration step.

**Face Recognition with Margin Penalty.** 如Figure 1所示,這個開創性的研究[3]使用Triplet loss來利用triplet data,這使得來自同一類別的臉部特徵比來自不同類別的臉部特徵更接近,具有明顯的歐幾里德距離邊界(Euclidean distance margin)。儘管Triplet loss對於人臉辨識來說是合理的,不過這個樣本與樣本的比較是小批量內的局部比較,而Triplet loss的訓練過程又非常具有挑戰性,因為三元組的數量會出現組合爆炸,特別是對於大型資料集,因此需要有效的取樣策略來選取具有信息量的小批量[43], [3],並在小批量中選擇具有代表性的三元組[44], [12]。由於使用semi-hard negative mining來訓練的Triplet loss會因忽略太多樣本而造成收斂速度較慢,[45]中提出了double-margin contrastive loss,透過距離加權採樣(distance weighted sampling)的方式來探索更多具有信息量且穩定的樣本,讓收斂的速度更快,且準確性更高。其它有一些研究則是嘗試使用代理(proxies)來減少三元組的總數[46],[47],也就是把樣本與樣本的比較改為樣本與代理的比較。然而,採樣和代理方法僅在一個迭代步驟中最佳化部分類別的嵌入(embedding),而不是所有類別的嵌入。

Margin-based softmax methods [13], [17], [14], [15] focused on incorporating margin penalty into a more feasible framework, softmax loss, which has extensive sample-to-class comparisons. Compared to deep metric learning methods (e.g., Triplet [3], Tuplet [44], [12]), margin-based softmax methods conduct global comparisons at the cost of memory consumption on holding the center of each class as illustrated in Figure 1. Sample-to-class comparison is more efficient and stable than sample-to-sample comparison as (1) the class number is much smaller than sample number, and (2) each class can be represented by a smoothed center vector which can be updated online during training. To further improve the margin-based softmax loss, recent works focus on the exploration of adaptive parameters [48], [49], [50], inter-class regularization [51], [52], mining [53], [54], grouping [55], etc.
基於邊界的softmax方法[13]、[17]、[14]、[15]著重於將邊界的懲罰項(margin penalty)納入一個更可行的框架中,softmax loss,這方法有著大量的樣本到類別的比較。與深度度量學習方法(如Triplet [3]、Tuplet [44]、[12])相比,基於邊界的softmax方法透過保持每個類別中心點來做全域比較,相對需要較大的的記憶體消耗,如Figure 1所示。樣本到類別的比較比樣本到樣本的比較更有效和穩定,因為(1)類別的數量遠小於樣本的數量,(2)每個類別都可以用平滑的中心向量來表示,而且這可以在訓練過程中線上更新。為了進一步改善基於邊界的softmax loss,最近的研究重點是探索自適應參數[48]、[49]、[50]、類別間(inter-class)的正規化[51]、[52]、探勘[53]、[54] 、分組[55]等 **Face Recognition under Noise.** Most of the face recognition datasets [56], [37], [9], [38] are downloaded from the Internet by searching a pre-defined celebrity list, and the original labels are likely to be ambiguous and inaccurate [18]. Learning with massive noisy data has recently drawn much attention in face recognition [57], [19], [20], [21] as accurate manual annotations can be expensive [18] or even unavailable. **Face Recognition under Noise.** 大多數人臉辨識資料集[56]、[37]、[9]、[38]都是透過搜尋預先定義的名人清單從網路上下載的,原始標籤很可能會有含糊不清或是不準確[18]的問題。近來,用著這些大量噪點資料進行學習的這個問題,已經在人臉辨識的領域中引起了廣泛的注意[57]、[19]、[20]、[21],因為準確的手動註解可能很貴[18]甚至不可用。 Wu et al. [57] proposed a semantic bootstrap strategy, which re-labels the noisy samples according to the probabilities of the softmax function. However, automatic cleaning by the bootstrapping rule requires time-consuming iterations (e.g. twice refinement steps are used in [57]) and the labelling quality is affected by the capacity of the original model. Hu et al. [19] found that the cleanness possibility of a sample can be dynamically reflected by its position in the target logit distribution and presented a noise-tolerant end-to-end paradigm by employing the idea of weighting training samples. Zhong et al. [20] devised a noise-resistant loss by introducing a hypothetical training label, which is a convex combination of the original label with probability $\rho$ and the predicted label by the current model with probability $1 − \rho$. However, computing time-varying fusion weight [19] and designing piece-wise loss [20] contain many hand-designed hyperparameters. Besides, re-weighting methods are susceptible to the performance of the initial model. Wang et al. [21] proposed a co-mining strategy which uses the loss values as the cue to simultaneously detect noisy labels, exchange the high-confidence clean faces to alleviate the error accumulation caused by the sampling bias, and re-weight the predicted clean faces to make them dominate the discriminative model training. However, the co-mining method requires training twin networks simultaneously and it is challenging to train large networks (e.g. ResNet100 [58]) on a large-scale dataset (e.g. MS1MV0 [37] and Celeb500K [38]). 
Wu等人[57]提出了一種語意引導(semantic bootstrap )的策略,根據softmax function的機率重新標記噪點樣本。然而,透過引導規則進行自動清理需要耗時的迭代過程(例如[57]中使用了兩次的精煉步驟),且標記的品質也會受到原始模型能力的影響。Hu等人[19]發現一個樣本的乾淨程度可以動態地透過其於target logit distribution中的位置來反映,並透過採用加權訓練樣本的想法提出了一種抗噪性(noise-tolerant)的端到端範例。Zhong等人[20]透過引入hypothetical training label來設計一種noise-resistant loss,這個訓練標記(training label)是機率為$\rho$的原始標記與機率為$1 − \rho$的當前模型預測標記的[凸組合](https://terms.naer.edu.tw/detail/3e035243ef7dc5ae1182957343ac7316/)。然而,計算[時變](https://terms.naer.edu.tw/detail/fc7326289e11d06038712d0f5ae7df51/)融合權重[19]和設計piece-wise([分段](https://terms.naer.edu.tw/detail/6023fbb6b4286cfa58dcc6f3371c434c/)損失)[20]包含許多手動設計的超參數。此外,重新加權方法容易受到初始模型效能的影響。王等人[21]提出了一種co-mining strategy(聯合探勘策略),該策略使用損失值作為[提示](https://terms.naer.edu.tw/detail/c242a85d6402d6f4290c2cab38abb797/)同時檢測噪點標記,交換高置信度的乾淨面臉樣本以減緩採樣偏差所造成的誤差積累,並對預測的乾淨面臉樣本重新加權使他們主導判別模型的訓練。然而,co-mining method需要同時訓練孿生網路,這對於在大規模資料集(例如 MS1MV0 [37] 和 Celeb500K [38])上訓練大型網路(例如 ResNet100 [58])具有挑戰性。 **Face Recognition with Sub-classes.** Practices and theories that lead to “sub-class” have been studied for a long time [59], [60]. The concept of “sub-class” applied in face recognition was first introduced in [59], [60], where a mixture of Gaussians was used to approximate the underlying distribution of each class. For instance, a person’s face images may be frontal view or side view, resulting in different modalities when all images are represented in the same data space. In [59], [60], experimental results showed that subclass divisions can be used to effectively adapt to different face modalities thus improve the performance of face recognition. Wan et al. [61] further proposed a separability criterion to divide every class into sub-classes, which have much less overlaps. The new within-class scatter can represent multi-modality information, therefore optimizing this within-class scatter will separate different modalities more clearly and further increase the accuracy of face recognition. However, these work [59], [60], [61] only employed hand-designed feature descriptor on tiny under-controlled datasets. **Face Recognition with Sub-classes.** 關於「子類別(sub-class)」的實踐和理論已經被研究了很長的一段時間[59],[60]。人臉辨識中應用的「子類別(sub-class)」概念首次在[59]、[60]中被引入,其中使用高斯混合來近似每個類別的底層分佈(underlying distribution)。例如,一個人的臉部影像可能是正面視角或側面視角,當所有的影像都在同一個資料空間中表示時,就會導致不同的模態。在[59]、[60]中,實驗結果說明著,子類別(sub-class)的劃分可以有效適應不同的臉部模態,從而提高人臉辨識的表現。Wan等人[61]進一步提出了一種可分離性的準則,將每個類別劃分為重疊性更少的子類別。這個新的within-class scatter(類別內散佈?)可以表示多模態信息,因此,最佳化這個within-class scatter將可以更漂亮地分離不同模態,並進一步提高人臉辨識的準確性。然而,這些研究[59]、[60]、[61]就只有在小型且受控的資料集上採用手工設計的特徵[描述子](https://terms.naer.edu.tw/detail/42d877d3c5981ffa48bf7feb958c2ff0/?seq=1)(feature descriptor)。 Concurrent with our work, Softtriple [62] presents a multicenter softmax loss with class-wise regularizer. These multicenters can depict the hidden distribution of the data [63] due to the fact that they can capture the complex geometry of the original data and help reduce the intra-class variance. On the fine-grained visual retrieval problem, the Softtriple [62] loss achieves better performance than the softmax loss as capturing local clusters is essential for this task. Even though the concept of “sub-class” has been employed in face recognition [59], [60], [61] and fine-grained visual retrieval [62], none of these work has considered the large-scale (e.g. 0.5 million classes) face recognition problem under massive noise (e.g. around 50% noisy samples within the training data). 
與此同時,Softtriple [62] 提出了帶有class-wise regularizer的multicenter softmax loss。這些multicenters可以描述資料的隱藏分佈[63],因為它們可以捕捉原始資料的複雜幾何結構並有助於減少類別內的方差(intra-class variance)。在細粒度視覺檢索問題上,Soft-triple loss[62] 比起softmax loss有著更好的效能,因為捕獲local clusters對於該任務至關重要。儘管「子類別(sub-class)」的概念已被應用於人臉辨識[59]、[60]、[61]和細粒度視覺檢索[62],但這些研究都沒有考慮大規模(例如50萬個類別)大量噪點下的人臉辨識問題(例如訓練資料中約50%的噪點樣本)。 **Face Synthesis by Model Inversion.** Identity-preserving face generation [64], [65], [66], [29] has been extensively explored under the framework of GAN [36]. Even though GAN models can yield high-fidelity images [67], [68], training a GAN’s generator requires access to the original data. Due to the emerging concern of data privacy, an alternative line of work in security focuses on model inversion, that is, image synthesis from a single CNN. Model inversion can not only help researchers to visualize neural networks to understand their properties [69] but also can be used for data-free distillation, quantization and pruning [33], [34], [35]. Fredrikson et al. [70] propose the model inversion attack to obtain class images from a network through a gradient descent on the input. As the pixel space is so large compared to the feature space, optimizing the image pixels by gradient descent [31] requires heavy regularization terms, such as total variation [31] or Gaussian blur [71]. Even though previous model inversion methods [70], [30] can transform an input image (random noise or a natural image) to yield a high output activation for a chosen class, it leaves intermediate representations constraint-free. Therefore, the resulting images are not realistic, lacking natural image statistics. **Face Synthesis by Model Inversion.** Identity-preserving face generation(身份保持的人臉生成)[64]、[65]、[66]、[29]在GAN [36]的框架下得到了廣泛的研究。儘管GAN模型能夠生成高度擬真的影像 [67]、[68],但訓練GAN的生成器需要存取原始資料。由於資料隱私逐漸受到重視,安全領域的另一個研究方向就專注在模型的逆向(model inversion),也就是從單一的CNN模型進行影像合成。Model inversion不僅可以幫助研究人員視覺化神經網路以了解其特性[69],還可以用於無資料蒸餾(data-free distillation)、量化和剪枝[33]、[34]、[35]。Fredrikson等人[70]提出模型逆向攻擊,透過輸入的梯度下降從網路中取得類別影像。由於像素空間相較於特徵空間來說非常大,因此透過梯度下降[31]最佳化影像像素需要大量的正規化項,像是total variation([總變差](https://terms.naer.edu.tw/detail/bcf0c4a0a18b3819f0b7809ac5655e13/))[31]或Gaussian blur([高斯模糊](https://terms.naer.edu.tw/detail/6ae11d162acd03f93d2f20f1296fb321/))[71]。儘管先前的模型逆向方法[70]、[30]可以將輸入影像(隨機噪點或自然影像)轉換為產生所選定類別的高輸出激活,不過其間的中間表示並未受到限制。因此,生成的圖像自然就缺乏真實感,也不會有自然影像的統計特徵。 The pioneer generative face recognition model is Eigen-face [32], which can project a training face image or a new face image (mean-subtracted) on the eigenfaces and thereby record how that face differs from the mean face. The eigenvalue associated with each eigenface represents how much the image vary from the mean image in that direction. The recognition process with the eigenface method is to project query images into the facespace spanned by eigenfaces calculated, and to find the closest match to a face class in that face-space. Even though raw pixel features used in Eigenface are substituted by the deep convolutional features, the procedure of employing the statistic prior (e.g. mean and variance) to reconstruct face images can be an inspiration. Recently, [33], [34], [35] have proposed a data-free method employing the statistics (e.g. mean and variance) stored in the BN layers to restore ImageNet images. Inspired by these works, we synthesize face images by inverting the pre-trained ArcFace model and considering the face prior (e.g. mean and variance) stored in the BN layers. 
早期的的生成式人臉辨識模型是Eigenface([特徵臉](https://zh.wikipedia.org/zh-tw/%E7%89%B9%E5%BE%81%E8%84%B8)) [32],它可以將訓練的人臉影像或新的人臉影像(減去平均值)投影到特徵臉上,從而記錄該人臉不同於平均人臉的地方。每個特徵臉所對應的特徵值(eigenvalue)表示影像在該方向上相對於平均影像的變化量。使用特徵臉方法的辨識過程是將查詢影像投影到由所計算出的特徵臉所生成的臉部空間中,然後在該臉部空間中尋找與某個臉部類別最接近的匹配。儘管特徵臉中使用的原始像素特徵已經被深度卷積特徵所取代,不過,使用統計先驗(例如平均值和方差)來重建臉部影像的過程依然具有啟發性。最近,[33]、[34]、[35]提出了一種無資料方法(data-free method),利用儲存在BN層中的統計資料(例如平均值和方差)來還原ImageNet的影像。受這些研究的啟發,我們透過反向計算預訓練的ArcFace模型並考慮儲存在BN層中的人臉先驗知識(face prior)(例如平均值和變異數)來合成人臉影像。 ## 3 PROPOSED APPROACH ### 3.1 ArcFace The most widely used classification loss function, softmax loss, is presented as follows: $$ L_1=-\log\frac{e^{W^T_{y_i} x_i+b_{y_i}}}{\sum_{j=1}^{N}e^{W^T_j x_i+b_j}} \tag{1} $$ 最常使用的類別損失函數,也就是softmax loss,如下: $$ L_1=-\log\frac{e^{W^T_{y_i} x_i+b_{y_i}}}{\sum_{j=1}^{N}e^{W^T_j x_i+b_j}} \tag{1} $$ where $x_i \in \mathbb{R}^d$ denotes the deep feature of the $i$-th sample, belonging to the $y_i$-th class. The embedding feature dimension $d$ is set to $512$ in this paper following [72], [73], [13], [14]. $W_j\in\mathbb{R}^d$ denotes the $j$-th column of the weight $W\in\mathbb{R}^{d \times N}$, $b_j \in \mathbb{R}^N$ is the bias term, and the class number is $N$. Traditional softmax loss is widely used in deep face recognition [4], [9]. However, the softmax loss function does not explicitly optimize the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance degeneration for deep face recognition under large intra-class appearance variations (e.g. pose variations [74], [75] and age gaps [76], [77]) and large-scale test scenarios [78], [79], [80]. 其中$x_i \in \mathbb{R}^d$表示第$i$個樣本的深層特徵,屬於第$y_i$個類別。嵌入特徵的維度$d$在本文中設置為$512$,這個設定是參考如下文獻[72], [73], [13], [14]。$W_j\in\mathbb{R}^d$表示權重$W\in\mathbb{R}^{d \times N}$的第$j$列,$b_j \in \mathbb{R} ^N$則是偏差項,類別數量為$N$。傳統的softmax loss廣泛應用於深度人臉辨識[4]、[9]中然而,softmax loss function並未明確地最佳化特徵嵌入,以增強同類別樣本的相似度並擴大不同類別樣本間的差異性,這導致在面臨大範圍的類別內的外觀變化(如姿勢變化 [74], [75]和年齡差異 [76], [77])以及大規模測試場景 [78], [79], [80]時,深度人臉辨識的效能的退化問題。 For simplicity, we fix the bias $b_j=0$ as in [13]. Then, we transform the logit as $W^T_j x_i=\left \| W_j \right \|\left \| x_i \right \|\cos\theta_j$, where $\theta_j$ is the angle between the weight $W_j$ and the feature $x_i$. Following [13], [14], [82], we fix the individual weight $\left \| W_j \right \|=1$ by $\ell_2$ normalization. Following [83], [14], [82], [15], we also fix the embedding feature $\left \| x_i \right \|$ by $\ell_2$ normalization and re-scale it to $s$. The normalization step on features and weights makes the predictions only depend on the angle between the feature and the weight. The learned embedding features are thus distributed on a hypersphere with a radius of $s$. 
為了簡單起見,我們將偏差固定為$b_j=0$,如[13]所示。然後,我們將logit轉換為$W^T_j x_i=\left \| W_j \right \|\left \| x_i \right \|\cos\theta_j$,其中$\theta_j$是權重$W_j$和特徵$x_i$之間的角度。依據[13]、[14]、[82],我們透過 $\ell_2$ 正規化將每個單獨的權重 $\left \| W_j \right \|$ 固定為 1。同時依據[83]、[14]、[82]、[15],我們對嵌入特徵 $\left \| x_i \right \|$ 做 $\ell_2$ 正規化,並將其重新縮放為 $s$。特徵和權重的正規化步驟使得預測單純取決於特徵和權重之間的角度。因此,學習到的嵌入特徵會分佈在半徑為$s$的超球面上。 $$ {L_2}=-\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}}+\sum_{j=1,j\neq y_i}^{N}e^{s\cos\theta_{j}}} \tag{2} $$ :::warning * $\cos\theta_{y_i}=\dfrac{W^T_{y_i}x_i}{\Vert W_{y_i}\Vert\Vert x_i\Vert}$,計算的是嵌入特徵與類別中心之間的相似度 * $s$是個縮放因子 ::: Since the embedding features are distributed around each feature center on the hypersphere, we employ an additive angular margin penalty $m$ between $x_i$ and $W_{y_i}$ to simultaneously enhance the intra-class compactness and inter-class discrepancy as illustrated in Figure 2. Since the proposed additive angular margin penalty is equal to the geodesic distance margin penalty in the normalized hypersphere, we name our method as ArcFace. 由於嵌入特徵分佈在超球面上的每個特徵中心周圍,所以我們在$x_i$和$W_{y_i}$之間採用additive angular margin penalty,$m$,同時增強類別內的緊湊性與類別間的差異,如圖所示如Figure 2所示。 ![image](https://hackmd.io/_uploads/By2JruWf1l.png) Fig. 2. Training the deep face recognition model by the proposed ArcFace loss ($K$=1) and sub-center ArcFace loss (e.g. $K$=3). Based on a $\ell_2$ normalization step on both embedding feature $x_i \in \mathbb{R}^{512}$ and all sub-centers $W \in \mathbb{R}^{512 \times N \times K}$, we get the subclass-wise similarity score $\mathcal{S} \in \mathbb{R}^{N \times K}$ by a matrix multiplication $W^T x_i$. After a max pooling step, we can easily get the class-wise similarity score $\mathcal{S'} \in \mathbb{R}^{N \times 1}$. Afterwards, we calculate the $arccos\theta_{y_i}$ and get the angle between the feature $x_i$ and the ground truth center $W_{y_i}$. Then, we add an angular margin penalty $m$ on the target (ground truth) angle $\theta_{y_i}$. After that, we calculate $\cos(\theta_{y_i}+m)$ and multiply all logits by the feature scale $s$. Finally, the logits go through the softmax function and contribute to the cross entropy loss. 利用我們所提出的ArcFace loss ($K$=1)與sub-center ArcFace loss (e.g. $K$=3)所訓練的深度人臉辨識模型。基於對嵌入特徵 $x_i \in \mathbb{R}^{512}$ 和所有sub-centers$W \in \mathbb{R}^{512 \times N \times K}$進行 $\ell_2$正規化的步驟,我們通過矩陣乘法$W^T x_i$ 得到subclass-wise的相似度分數$\mathcal{S} \in \mathbb{R}^{N \times K}$。經過max pooling之後,我們可以輕鬆得的class-wise的相似度分數$\mathcal{S'} \in \mathbb{R}^{N \times 1}$。隨後,我們計算 $arccos\theta_{y_i}$,從而得到特徵$x_i$ 與真實的中心$W_{y_i}$之間的角度(angle)$\theta_{y_i}$。接著,我們在目標(真實標記)角度$\theta_{y_i}$上加入angular margin penalty $m$。然後計算 $\cos(\theta_{y_i}+m)$,並將所有logits乘以特徵縮放因子$s$。最後,logits通過softmax function,並投入於交叉熵損失(cross entropy loss)。 :::warning * $\cos\theta_{y_i}=\dfrac{W^T_{y_i}x_i}{\Vert W_{y_i}\Vert\Vert x_i\Vert}$,因為分母都正規化為1了,所以$\cos\theta_{y_i}=W^T_{y_i}x_i$,得到相似度之後再利用反餘弦函數來計算$arccos$,python的話就`math.acos(cos_theta)`,就可以得到角度 * 利用max pooling是因為,arcface就是只關心最近的那個sub-center ::: $$ {L_3}=-\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,j\neq y_i}^{N}e^{s\cos\theta_{j}}}\tag{3} $$ We select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2-D feature embedding networks with the Norm-Softmax and ArcFace loss, respectively. As illustrated in Figure 3, all face features are pushed to the arc space with a fixed radius based on the feature normalization. 
The Norm-Softmax loss provides roughly separable feature embedding but produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss can obviously enforce a more evident margin between the nearest classes. 我們從8個不同的身份中選擇包含足夠樣本(每個類別約1,500張影像)的人臉影像,以分別使用 Norm-Softmax和ArcFace loss來訓練2-D的特徵嵌入網路。如Figure 3所示,所有的臉部特徵都在特徵正規化的基礎上被壓縮到一個具有固定半徑的弧空間。Norm-Softmax loss提供了大致可分離的特徵嵌入,但在決策邊界中產生了明顯的模糊性,而我們所提出的ArcFace loss顯然可以在最接近的類別之間強化更明確的邊界。 ![image](https://hackmd.io/_uploads/HysYVLL41x.png) Fig. 3. Toy examples under the Norm-Softmax and ArcFace loss on 8 identities with 2D features. Dots indicate samples and lines refer to the center direction of each identity. Based on the feature normalization, all face features are pushed to the arc space with a fixed radius. The geodesic distance margin between closest classes becomes evident as the additive angular margin penalty is incorporated. **Numerical Similarity.** In SphereFace [13], [42], ArcFace, and CosFace [14], [15], three different kinds of margin penalty are proposed, e.g. multiplicative angular margin $m_1$, additive angular margin $m_2$, and additive cosine margin $m_3$, respectively. From the view of numerical analysis, different margin penalties, no matter add on the angle [13] or cosine space [14], all enforce the intra-class compactness and inter-class diversity by penalizing the target logit [81]. In Figure 4(b), we plot the target logit curves of SphereFace, ArcFace and CosFace under their best margin settings. We only show these target logit curves within $[20^{\circ}, 100^{\circ}]$ because the angles between $W_{y_i}$ and $x_i$ start from around $90^{\circ}$ (random initialization) and end at around $30^{\circ}$ during ArcFace training as shown in Figure 4(a). Intuitively, there are three numerical factors in the target logit curves that affect the performance, i.e. the starting point, the end point and the slope. **Numerical Similarity.** 在SphereFace [13]、[42]、ArcFace和CosFace [14]、[15]中,提出了三種不同類型的邊界懲罰,像是multiplicative angular margin $m_1$、additive angular margin $m_2$及additive cosine margin $m_3$。從數值分析的角度來看,不同的邊界懲罰,無論是在角度[13]或是餘弦空間[14]中加入懲罰項,都是透過懲罰target logit的方式來強化類別內(intra-class)的緊湊性和類別間(inter-class)的多樣性[81]。在Figure 4(b)中,我們繪製了SphereFace、ArcFace和CosFace在最佳邊界設定下的target logit curves。我們單純的呈現$[20^{\circ}, 100^{\circ}]$範圍內的target logit curves,因為$W_{y_i}$和$x_i$之間的角度從$90^{\circ}$左右開始(隨機初始化),並且ArcFace的訓練會以大約$30^{\circ}$結束,如Figure 4(a)所示。直觀來說,目標對數曲線(target logit curves)中有三個數值因素影響效能,也就是起點、終點和斜率。 ![image](https://hackmd.io/_uploads/Bk_eS884ye.png) Fig. 4. Target logit analysis. (a) $\theta_j$ distributions from start to end during ArcFace training. (2) Target logit curves for softmax, SphereFace, ArcFace, CosFace and combined margin penalty ($\cos(m_1 \theta+m_2)-m_3$). By combining all of the margin penalties, we implement SphereFace, ArcFace and CosFace in a united framework with $m_1$, $m_2$ and $m_3$ as the hyper-parameters. 透過結合所有的邊界懲罰,我們在一個統一的框架中以$m_1$、$m_2$和$m_3$作為超參數實現SphereFace、ArcFace和CosFace。 $$ {L_4}=-\log\frac{e^{s(\cos(m_1\theta_{y_i}+m_2)-m_3)}}{e^{s(\cos(m_1\theta_{y_i}+m_2)-m_3)}+\sum_{j=1,j\neq y_i}^{N}e^{s\cos\theta_{j}}}. \tag{4} $$ As shown in Figure 4(b), by combining all of the above-motioned margins ($\cos(m_1 \theta+m_2)-m_3$), we can easily get some other target logit curves which also achieve high performance. 
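:::warning
以下依 Eq.4 的統一框架,用 PyTorch 寫一個示意性的 combined margin 模組(非官方程式碼,類別與參數命名為假設)。當 $(m_1, m_2, m_3)$ 分別取 $(1.35, 0, 0)$、$(1, 0.5, 0)$、$(1, 0, 0.35)$ 時,大致對應 SphereFace、ArcFace、CosFace 的設定:
```python
import torch
import torch.nn.functional as F

class CombinedMarginLoss(torch.nn.Module):
    """Eq.4 的示意實作:logit = s * (cos(m1*theta + m2) - m3),僅作用在目標類別上。"""
    def __init__(self, num_classes, embedding_size=512, s=64.0, m1=1.0, m2=0.5, m3=0.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_classes, embedding_size))
        self.s, self.m1, self.m2, self.m3 = s, m1, m2, m3

    def forward(self, x, labels):
        # 特徵與類別中心都做 l2 正規化,內積即為 cos(theta_j)
        cos_theta = F.linear(F.normalize(x), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos_theta)                     # 反餘弦取得角度
        target = torch.zeros_like(cos_theta).scatter_(1, labels.view(-1, 1), 1.0)
        # 只對目標類別套用 (m1, m2, m3) 邊界,其餘類別維持 cos(theta_j)
        logits = torch.where(target.bool(),
                             torch.cos(self.m1 * theta + self.m2) - self.m3,
                             cos_theta)
        return F.cross_entropy(self.s * logits, labels)

# 以 ArcFace 設定 (1, 0.5, 0) 試跑
loss_fn = CombinedMarginLoss(num_classes=10)
loss = loss_fn(torch.randn(4, 512), torch.tensor([0, 1, 2, 3]))
print(loss.item())
```
:::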
如Figure 4(b)所示,透過組合所有上述邊界($\cos(m_1 \theta+m_2)-m_3$),我們可以輕鬆的獲得其它也能實現高效能的target logit curves。 **Geometric Difference.** Despite the numerical similarity between ArcFace and previous works, the proposed additive angular margin has a better geometric attribute as the angular margin has the exact correspondence to the geodesic distance. As illustrated in Figure 5, we compare the decision boundaries under the binary classification case. The proposed ArcFace has a constant linear angular margin throughout the whole interval. By contrast, SphereFace and CosFace only have a nonlinear angular margin. **Geometric Difference.** 儘管ArcFace和先前的研究在數值上相似,但所提出的additive angular margin具有更好的幾何屬性,因為angular margin與測地距離具有精確的對應關係。如Figure 5所示,我們比較了在二元分類情況下的決策邊界。所提出的ArcFace在整個區間內有著恆定的線性的角度邊界(constant linear angular margin)。相較之下,SphereFace和CosFace僅具有非線性角度邊界。 ![image](https://hackmd.io/_uploads/r10QHI8VJx.png) Fig. 5. Decision margins of different loss functions under binary classification case. The dashed line represents the decision boundary, and the grey areas are the decision margins. The minor difference in margin designs can have a significant influence on model training. For example, the original SphereFace [13] employs an annealing optimization strategy. To avoid divergence at the beginning of training, joint supervision from softmax is used in SphereFace to weaken the multiplicative integer margin penalty. We implement a new version of SphereFace without the integer requirement on the margin by employing the arc-cosine function instead of using the complex double angle formula. In our implementation, we find that $m=1.35$ can obtain similar performance compared to the original SphereFace without any convergence difficulty. 在邊界設計的微小差異可能會對模型訓練有著重大影響。舉例來說,原始的SphereFace[13] 採用了[退火](https://terms.naer.edu.tw/detail/734e0b263e54bd6d889e348fe72b4ace/)最佳化策略。為了避免在訓練初期發散,SphereFace中使用了softmax的聯合監督(joint supervision)來弱化乘性整數邊界懲罰(multiplicative integer margin penalty)。我們透過使用[反餘弦](https://terms.naer.edu.tw/detail/f17e9a5fb2f9687875a75b6ad29df113/)函數而不是使用複雜的[倍角公式](https://terms.naer.edu.tw/detail/40175fb6b01ffa7488efa817fe285709/)來實現新版本的SphereFace,這樣在邊界上就沒有了整數的要求。在我們的實作中,我們發現$m=1.35$就可以在沒有收斂性困難的情況下得到跟原始SphereFace類似的效能。 **Other Intra and Inter Losses.** Other loss functions can be designed based on the angular representation of features and centers. For examples, we can design a loss to enforce intra-class compactness and inter-class discrepancy on the hypersphere. 其它的損失函數可以以基於特徵和中心的角度表示來設計。舉例來說,我們可以設計一個損失來強制超球面上的類別內的緊湊性及類別間的差異。 Intra-Loss is designed to improve the intra-class compactness by decreasing the angle/arc between the sample and the ground truth center. Intra-Loss主要是透過減少樣本與真實中心之間之間的角度/弧度來提高類別內的緊湊性。 $$ L_5=L_2 + \frac{1}{\pi} \theta_{y_i}.\tag{5} $$ :::warning $$ {L_2}=-\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}}+\sum_{j=1,j\neq y_i}^{N}e^{s\cos\theta_{j}}} \tag{2} $$ ::: Inter-Loss targets at enhancing inter-class discrepancy by increasing the angle/arc between different centers. Inter-Loss的目標則是透過增加不同中心之間的角度/弧度來增強類別間的差異。 $$ L_6=L_2 - \frac{1}{\pi \left ( N -1\right )} \sum_{j=1,j\neq y_i}^{N} \arccos(W^T_{y_i}W_j).\tag{6} $$ To enhance inter-class separability, RegularFace [51] explicitly distances identities by penalizing the angle between an identity and its nearest neighbor, while Minimum Hyper-spherical Energy (MHE) [84] encourages the angular diversity of neuron weights inspired by the Thomson problem. 
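:::warning
呼應上面的 Eq.5 與 Eq.6,以下是兩個附加懲罰項的簡單示意(假設性的 numpy 寫法,僅表達公式意義,非官方實作):
```python
import numpy as np

def intra_penalty(x, W, y):
    """Eq.5 的附加項:theta_{y_i} / pi。x 為 (d,) 已正規化特徵,W 為 (d, N) 已正規化中心。"""
    cos_t = np.clip(W[:, y] @ x, -1.0, 1.0)
    return np.arccos(cos_t) / np.pi

def inter_penalty(W, y):
    """Eq.6 的附加項(取負號):-(1 / (pi*(N-1))) * sum_{j != y} arccos(W_y^T W_j)。"""
    N = W.shape[1]
    cos_ij = np.clip(W[:, y] @ W, -1.0, 1.0)                 # 目標中心與所有中心的餘弦
    angles = np.arccos(cos_ij)
    return -(angles.sum() - angles[y]) / (np.pi * (N - 1))   # 排除自身
```
:::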
Recently, fixed classifier methods [85], [86], [87] exhibit little or no reduction in classification performance while allowing a noticeable reduction in computational complexity, trainable parameters and communication cost. In these methods, inter-class separability is not learned but inherited from a pre-defined high-dimensional geometry [87]. 為了增強類別之間的可分離性,RegularFace [51] 透過懲罰本身與其最近鄰居之間的角度顯式地拉開不同身份的距離,而Minimum Hyper-spherical Energy (MHE)[84]則是受到Thomson problem的啟發,鼓勵神經元權重的角度多樣性。近來,固定分類器(fixed classifier)方法[85]、[86]、[87]在分類效能上幾乎無損,同時可以降低計算複雜性、可訓練參數和通訊成本。在這些方法中,類別間的可分離性不是利用學習學到的,而是從預先定義的高維度幾何中所繼承的[87]。 Triplet-loss aims at enlarging the angle/arc margin between triplet samples. In FaceNet [3], Euclidean margin is applied on the normalized features. Here, we employ the triplet-loss by the angular representation of our features as $\arccos({x_{i}^{pos} x_{i}}) + m \leq \arccos({x_i^{neg} x_{i}})$. Triplet-loss旨在擴大triplet samples(三個樣本?)之間的角度/弧度邊界。在FaceNet[3] 中,歐幾里德邊距是用來正規化特徵的。在這裡,我們透過特徵的角度表示(angular representation)來採用triplet-loss為$\arccos({x_{i}^{pos} x_{i}}) + m \leq \arccos({x_i^{neg} x_{i}})$。 ### 3.2 Sub-center ArcFace Even though ArcFace has shown its power in efficient and effective face feature embedding, this method assumes that training data are clean. However, this is not true especially when the dataset is in large scale. How to enable the margin-based softmax loss to be robust to noise is one of the main challenges impeding the development of face recognition[18]. In this paper, we address this problem by proposing the idea of using sub-classes for each identity, which can be directly adopted by ArcFace and will significantly increase its robustness. 儘管ArcFace已經說明了其於高效且有效的臉部特徵嵌入方面強大的能力,不過這個方法假設訓練資料是乾淨的。然而,事實並非如此,特別是當資料集的規模較大的時候。如何使margin-based的softmax loss面對噪點依然穩健是阻礙人臉辨識發展的主要挑戰之一[18]。在這篇論文中,我們透過為每個身份(identity)使用sub-classes的想法來解決這個問題,這個想法可以直接被ArcFace採用,而且還可以明顯提升其穩健性。 As illustrated in Figure 2, we set $K$ sub-centers for each identity. Based on a $\ell_2$ normalization step on both embedding feature $x_i \in \mathbb{R}^{512}$ and all sub-centers $W \in \mathbb{R}^{512 \times N \times K }$, we get the subclass-wise similarity scores $\mathcal{S} \in \mathbb{R}^{N \times K}$ by a matrix multiplication $W^T x_i$. Then, we employ a max pooling step on the subclass-wise similarity score $\mathcal{S} \in \mathbb{R}^{N \times K}$ to get the class-wise similarity score $\mathcal{S'} \in \mathbb{R}^{N \times 1}$. 如Figure 2所示,我們為每個身份(identity)設定$K$個sub-centers。基於對嵌入特徵$x_i \in \mathbb{R}^{512}$和所有的sub-centers $W \in \mathbb{R}^{512 \times N \times K}$的$\ell_2$正規化步驟,我們透過矩陣乘法$W^T x_i$得到subclass-wise的相似性分數$\mathcal{S} \in \mathbb{R}^{N \times K}$。然後,我們對subclass-wise的相似度分數$\mathcal{S} \in \mathbb{R}^{N \times K}$做max pooling來得到class-wise的相似性分數$\mathcal{S'} \in \mathbb{R}^{N \times 1}$。 The proposed sub-center ArcFace loss can be formulated as: 我們所提出的sub-center ArcFace loss可以表示為: $$ {L_7} = -\log\frac{e^{s\cos({\theta}_{y_i}+m)}}{e^{s\cos({\theta}_{y_i}+m)}+\sum_{j=1,j\neq y_i}^{N}e^{s\cos{\theta}_{j}}} \tag{7} $$ where $\theta_{j} = arccos\left (\max_k \left ( W^T_{j_k} x_i \right ) \right )$, $k \in \left \{ 1,\cdots, K \right \}$. 
其中$\theta_{j} = arccos\left (\max_k \left ( W^T_{j_k} x_i \right ) \right )$, $k \in \left \{ 1,\cdots, K \right \}$。 :::warning ArcFace loss: $$ {L_3}=-\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,j\neq y_i}^{N}e^{s\cos\theta_{j}}}\tag{3} $$ ::: In Figure 6(a), we have visualized the clustering results of one identity from the CASIA dataset [56] after employing the sub-center ArcFace loss ($K=10$) for training. It is obvious that the proposed sub-center ArcFace loss can automatically cluster faces such that hard samples and noisy samples are separated away from the dominant clean samples. Note that some sub-classes are empty as $K=10$ is too large for a particular identity. In Figure 6(b), we show the angle distribution on the CASIA dataset [56]. We use the pre-trained ArcFace model to predict the feature center of each identity and then calculate the angle between the sample and its corresponding feature center. As we can see from Figure 6(b), most of the samples are close to their centers, however, there are some noisy samples which are far away from their centers. This observation on the CASIA dataset matches the noise percentage estimation ($9.3\% \sim 13.0\%$) in [18]. To automatically obtain a clean training dataset, the noisy tail is usually removed by a hard threshold (e.g. angle $\geq77^{\circ}$ or cosine $\leq 0.225$). Since sub-center ArcFace can automatically divide the training samples into dominant sub-classes and non-dominant sub-classes, clean samples (in red) can be separated from hard and noisy samples (in blue). More specifically, the majority of clean faces ($85.6\%$) go to the dominant sub-class, while the rest hard and noisy faces go to the non-dominant sub-classes. 在Figure 6(a)中,我們視覺化了CASIA資料集[56]中的一個身份使用sub-center ArcFace loss ($K=10$)進行訓練後的聚類結果。很明顯的,我們所提出的sub-center ArcFace loss可以自動地對人臉進行聚類,從而將不好判定的樣本(hard samples)和噪點樣本與主要的乾淨樣本分開。請注意,某些sub-classes是空的,因為$K=10$的設定對於特定身份來說太大了。在Figure 6(b) 中,我們顯示了CASIA資料集[56]上的角度分佈(angle distribution)。我們使用預訓練的ArcFace model來預測每個身份的特徵中心,然後計算樣本與其對應特徵中心之間的角度。從Figure 6(b)可以看的出來,多數的樣本都靠近其中心點,當然,也是有一些噪點樣本遠離其中心點。在CASIA資料集的觀察結果與[18]中的噪點百分比估計值($9.3\% \sim 13.0\%$)相符。為了自動化獲得乾淨的訓練資料集,通常會透過硬閾值(hard threshold)(例如角度 $\geq77^{\circ}$或餘弦$\leq 0.225$)來去除雜點尾部(noisy tail)的資料。由於sub-center ArcFace可以自動地將訓練樣本分為主要與非主要的sub-classes,因此可以將乾淨的樣本(紅色)與不好判定的樣本和噪點樣本(藍色)分開。更具體地說,大多數乾淨的人臉($85.6\%$)都是屬於主要的sub-class,而其餘不好判定的樣本和噪點樣本人臉則是屬於非主要的sub-class。 ![image](https://hackmd.io/_uploads/ryuvHUI4ke.png) Fig. 6. (a) The sub-classes of one identity from the CASIA dataset [56] after using the sub-center ArcFace loss ($K = 10$). Noisy samples and hard samples (e.g. profile and occluded faces) are automatically separated from the majority of clean samples. (b) Angle distribution of samples from the dominant and non-dominant sub-classes. Clean data are automatically isolated by the sub-center ArcFace. Even though using sub-classes can improve the robustness under noise, it undermines the intra-class compactness as hard samples are also kept away as shown in Figure 6(b). In [37], MS1MV0 (around 10M images of 100K identities) is released with the estimated noise percentage around $47.1\% \sim 54.4\%$ [18].In [88], MS1MV0 is refined by a semi-automatic approach into a clean dataset named MS1MV3 (around 5.1M images of 93K identities). Based on these two datasets, we can get the clean and noisy labels on MS1MV0. 
In Figure 7(b) and Figure 7(a), we show the angle distributions of samples to their closest sub-centers (training settings: [MS1MV0, ResNet50, Sub-center ArcFace $K$=3]).In general, there are four categories of samples: (1) easy clean samples belonging to dominant sub-classes ($57.24\%$), (2) hard noisy samples belonging to dominant sub-classes ($12.40\%$), (3) hard clean samples belonging to non-dominant sub-classes ($4.28\%$), and (4) easy noisy samples belonging to non-dominant sub-classes ($26.08\%$).In Figure 7(a), we show the angle distribution of samples to their corresponding centers from the ArcFace model (training settings: [MS1MV0, ResNet50, ArcFace $K$=1]).By comparing the percentages of noisy samples in Figure 7(b) and Figure 7(a) we find that sub-center ArcFace can significantly decrease the noise rate to around one third (from $38.47\%$ to $12.40\%$) and this is the reason why sub-center ArcFace is more robust under noise. During the training of sub-center ArcFace, samples belonging to non-dominant sub-classes are pushed to be close to these non-dominant sub-centers as shown in Figure 7(c).Since we have not set any constraint on sub-centers, the sub-centers of each identity can be quite different and even orthogonal. In Figure 7(d), we show the angle distributions of non-dominant samples to their dominant sub-centers. By combining Figure 7(b) and Figure 7(d), we find that the clean and noisy data have some overlaps but a constant angle threshold (between $70^{\circ}$ and $80^{\circ}$) can be easily searched to drop most of the high-confident noisy samples. 儘管使用sub-classes可以提高在噪點下的穩健性,但它同時也會破壞類別內的緊湊性,因為難以判定的樣本也會被拉遠遠的,如Figure 6(b)所示。在[37]中所放出的MS1MV0(大約100K個身份的10M影像)資料集,估計噪點百分比約$47.1\%\sim 54.4\%$[18]。在[88]中,MS1MV0透過半自動化的方法提煉出名為MS1MV3的乾淨資料集(約93K個身份的510萬張影像)。基於這兩個資料集,我們可以得到MS1MV0上的乾淨標記和噪點標記。在Figure 7(b)和Figure 7(a)中,我們顯示了樣本到最近sub-centers的角度分佈(訓練設定:[MS1MV0,ResNet50,Sub-center ArcFace $K$=3])。一般來說,樣本分為四類:(1)易於分類的乾淨樣本,屬於主要的sub-classes($57.24\%$),(2)難分類的噪點樣本,屬於主要的sub-classes($12.40\%$),(3)難分類的乾淨樣本,屬於非主要的sub-classes($4.28\%$),以及(4)易於分類的噪聲樣本,屬於非主要的sub-classes($26.08\%$)。在Figure 7(a)中,我們顯示了ArcFace model中樣本與其對應中心的角度分佈(訓練設定:[MS1MV0,ResNet50,ArcFace $K$=1])。透過比較Figure 7(b)和Figure 7(a)中的噪點樣本的分比,我們發現sub-center ArcFace可以將噪點率明顯降低到三分之一左右(從$38.47\%$到$12.40\%$),這也就是為什麼sub-center ArcFace在噪點下更穩健的原因。在sub-center ArcFace的訓練過程中,屬於非主要的sub-classes樣本被推向靠近這些非主要的sub-centers,如Figure 7(c)所示。由於我們沒有對sub-centers設定任何的約束,因此每個身份的sub-centers可以有很大的差異,甚至是正交(無關)的。在Figure 7(d)中,我們說明了非主要樣本與其主要的sub-centers的角度分佈。結合Figure 7(b)和Figure 7(d)兩圖,我們發現乾淨資料和噪點資料有些許的重疊,但這可以透過固定的角度閾值(在$70^{\circ}$和$80^{\circ}$之間)輕鬆篩選掉大部分高置信度的噪點樣本。 ![image](https://hackmd.io/_uploads/SJ7CBULEkl.png) Fig. 7. Data distribution of ArcFace ($K$=1) and the proposed sub-center ArcFace ($K$=3) before and after dropping non-dominant sub-centers. MS1MV0[37] is used here. $K=3\downarrow1$ denotes sub-center ArcFace with non-dominant sub-centers dropping. Based on the above observations, we propose a straightforward approach to recapture intra-class compactness. We directly drop non-dominant sub-centers after the network has enough discriminative power. Meanwhile, we introduce a constant angle threshold to drop high-confident noisy data. After that, we retrain the ArcFace model from scratch on the automatically cleaned dataset. 
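:::warning
呼應這一段的清理流程,以下是一個假設性的離線清理示意(非論文原始碼,75 度的閾值與函數命名僅供說明):先以樣本數最多的 sub-center 當作主導 sub-center,再把與主導 sub-center 夾角過大的樣本視為高置信度噪點丟棄,留下的資料即可用來重新訓練:
```python
import numpy as np

def clean_by_subcenters(features, labels, subcenters, angle_thresh_deg=75.0):
    """features: (M, 512) 已 l2 正規化的樣本特徵;labels: (M,) 身份索引;
    subcenters: (N, K, 512) 已 l2 正規化的 sub-centers。
    回傳要保留的樣本索引(示意流程,閾值 75 度為假設值)。"""
    N, K, _ = subcenters.shape
    # 每個樣本與自己身份的 K 個 sub-centers 的餘弦相似度 (M, K)
    sims = np.einsum('md,mkd->mk', features, subcenters[labels])
    nearest = sims.argmax(axis=1)                      # 每個樣本最近的 sub-center
    keep = []
    for y in range(N):
        idx = np.where(labels == y)[0]
        if idx.size == 0:
            continue
        # 樣本數最多的 sub-center 視為主導 sub-class,其餘為非主導(將被丟棄)
        dominant = np.bincount(nearest[idx], minlength=K).argmax()
        cos_dom = np.clip(sims[idx, dominant], -1.0, 1.0)
        angles = np.degrees(np.arccos(cos_dom))        # 與主導 sub-center 的夾角
        keep.extend(idx[angles <= angle_thresh_deg])   # 夾角過大者視為高置信度噪點
    return np.asarray(sorted(keep))
```
:::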
基於上述觀察,我們提出了一個簡單的方法來重新獲得類別內的緊湊性。當網路有足夠的判別力後,我們可以直接丟棄掉非主要的sub-centers。與此同時,我們引入了固定的角度閾值來篩選掉高置信度的噪點資料。接續的,我們在自動清理的資料集上從頭開始重新訓練ArcFace model。 ### 3.3 Inversion of ArcFace In the above sections, we have explored how the proposed ArcFace can enhance the discriminative power of a face recognition model. In this section, we take a pre-trained ArcFace model as a white-box and reconstruct identity preserved as well as visually plausible face images only using the gradient of the ArcFace loss and the face statistic priors (e.g. mean and variance) stored in the BN layers. As shown in Figure 8 and illustrated in Algorithm 1, the pre-trained ArcFace model has encoded substantial information of the training distribution. The distribution, stored in BN layers via running mean and running variance, can be effectively employed to generate visually plausible face images, avoiding convergence outside natural faces with high confidence. 在上面的章節中,我們探討了我們所提出的ArcFace可以如何的增強人臉辨識模型的判別能力。在這一章節中,我們將預訓練的ArcFace作為[白箱](https://terms.naer.edu.tw/detail/d281159594dd0cae34b754054a432a18/),然後單純使用ArcFace loss的梯度和儲存在BN layers中的人臉統計先驗資訊(如平均值和變異數)重建具有身份保持性且視覺上合理的人臉影像。如Figure 8所示以及Algorithm 1所示,預訓練的 ArcFace model已經對訓練分佈的大量信息做了編碼。透過移動平均值和移動變異數存儲存BN layers中的分佈可以有效地用於生成視覺上合理的人臉影像,避免以高置信度收斂到自然人臉之外的情況。 ![image](https://hackmd.io/_uploads/SytzLLUN1g.png) Fig. 8. ArcFace is not only a discriminative model but also a generative model. Given a pre-trained ArcFace model, a random input tensor can be gradually updated into a pre-defined identity by using the gradient of the ArcFace loss as well as the face statistic priors stored in the Batch Normalization layers. ![image](https://hackmd.io/_uploads/BJH4LIINyx.png) Besides the ArcFace loss (Eq.3) to preserve identity, we also consider the following statistic priors during face generation: 除了ArcFace loss (Eq.3)以保持身份之外,我們還在人臉生成過程中考慮下面的統計先驗: $$ {L_8}= \sum_{i=0}^L \|\tilde \mu_i^r - \mu_i\|_2^2 + \|\tilde \sigma_i^r - \sigma_i\|_2^2, \tag{8} $$ :::warning 一個比較直觀的想法就是,讓資料跟BN layer之間的均值、標準差愈接近愈好,那自然就可以逆向還原? ::: where $\mu_i^r$/$\sigma_i^r$ are the mean/standard deviation of the distribution at layer $i$, and $\mu_i$/$\sigma_i$ are the corresponding mean/standard deviation parameters stored in the $i$-th BN layer of a pre-trained ArcFace model. After jointly optimizing Eq.3 and Eq.8 (${L_3} + \lambda {L_8}, \lambda = 0.05$) for $T$ steps as in Algorithm 1, we can generate faces, when fed into the network, not only have same identity as the pre-defined identity but also have a statistical distribution that closely matches the original data set. 其中$\mu_i^r$/$\sigma_i^r$是第$i$層分佈的平均值/標準差,$\mu_i$/$\sigma_i$則是儲存在預訓練ArcFace model的第$i$層BN層的對應平均值/標準差參數。在像Algorithm 1那樣針對Eq.3 和Eq.8(${L_3} + \lambda {L_8}, \lambda = 0.05$)一起最佳化$T$次後,當我們把這些餵到網路時,我們就可以生成具有預先定義身份的人臉影像,這些影像不僅保有相同的身份特徵,還在統計分布上與原始資料集緊密匹配。 The above approach exploits the relationship between an input image and its class label for the reconstruction process. As the output similarity score is fixed according to predefined $N$ classes, the reconstruction is limited on images of training subjects. To solve open-set face generation from the embedding feature, the constraints on predefined classes need to be removed. Therefore, we substitute the classification loss to the $\ell_2$ loss between feature pairs. Open-set face generation can restore the face image from any embedding feature, while close-set face generation only reconstructs face images from the class centers stored in the linear weight. 
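:::warning
依照 Algorithm 1 與 Eq.8 的想法,以下是一個大幅簡化的 PyTorch 示意(假設性的寫法,非論文原始碼):用 forward hook 收集每個 BN 層輸入的統計量,與 BN 層儲存的 running statistics 做對齊,再加上以目標身份為標記的分類損失(這裡以一般的 cross entropy 代替帶 margin 的 Eq.3,單純示意流程),一起對輸入張量做梯度更新:
```python
import torch
import torch.nn as nn

def invert_identity(model, target_id, steps=1000, lr=0.1, lam=0.05, size=112):
    """model:已預訓練、含 BN 層的網路(假設 forward 直接輸出 N 類 logits)。
    target_id:想要還原的身份索引。回傳最佳化後的輸入影像張量。示意用。"""
    model.eval()
    bn_losses, hooks = [], []

    def bn_hook(module, inputs, output):
        x = inputs[0]
        mean = x.mean(dim=(0, 2, 3))
        var = x.var(dim=(0, 2, 3), unbiased=False)
        # Eq.8 的精神:讓當前輸入的統計量貼近 BN 層儲存的 running mean / running var
        bn_losses.append(((mean - module.running_mean) ** 2).sum()
                         + ((var - module.running_var) ** 2).sum())

    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(bn_hook))

    img = torch.randn(1, 3, size, size, requires_grad=True)   # 從隨機張量開始
    opt = torch.optim.Adam([img], lr=lr)
    label = torch.tensor([target_id])
    for _ in range(steps):
        bn_losses.clear()
        opt.zero_grad()
        logits = model(img)
        loss = nn.functional.cross_entropy(logits, label) + lam * sum(bn_losses)
        loss.backward()
        opt.step()

    for h in hooks:
        h.remove()
    return img.detach()
```
:::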
上述方法利用輸入影像與其類別標記之間的關係來進行重建過程。由於輸出相似度分數是根據預先定義的$N$類別所固定的,因此重建僅限於訓練物件的影像。為了解決從嵌入特徵生成開集人臉的問題,我們需要移除掉對預定義類別的約束。因此,我們將分類損失替換為特徵對(feature pairs)之間的 $\ell_2$損失。開集人臉生成可以從任何嵌入特徵重建人臉影像,而閉集人臉生成則是只能從線性權重中所儲存的類別中心來重建人臉影像。 Concurrent with our work, [33], [34], [35] have proposed a data-free method employing the BN priors to restore ImageNet images for distillation, quantization and pruning. Their model inversion results contain obvious artifact in the background due to the translation augmentation during training. By contrast, our ArcFace model is trained on normalized face crops without background, thus the restored faces exhibit less artifact. Besides, these data-free methods only considered close-set image generation but ArcFace can freely restore both close-set and open-set subjects. In this paper, we show that the proposed additive angular margin loss can also improve face generation. 在我們研究的同時,[33]、[34]、[35]提出了一種data-free方法,這是利用BN prior來恢復 ImageNet影像用於知識蒸餾(distillation)、量化(quantization)和剪枝(pruning)。然而,由於訓練過程中使用平移增強(translation augmentation),他們的模型逆向的結果在背景中包含明顯的瑕疵。相比之下,我們的ArcFace model是在沒有背景的標準化臉部裁切上進行訓練的,因此恢復的臉部明顯較少的瑕疵。此外,這些data-free方法僅考慮閉集的影像生成,不過ArcFace可以自由地還原閉集和開集的物件。在這篇論文中,我們說明了我們所提出的additive angular margin loss也可以改善臉部生成。 ## 4 EXPERIMENTS ### 4.1 Implementation Details **Training Datasets.** As given in Table 1, we separately employ CASIA [56], VGG2 [9], MS1MV0 [37] and Celeb500K [38] as our training data in order to conduct fair comparison with other methods. MS1MV0 (loose cropped version) [37] is a raw data with the estimated noise percentage around $47.1% ∼ 54.4%$ [18]. MS1MV3 [88] is cleaned from MS1MV0 [37] by a semiautomatic approach. We employ ethnicity-specific annotators (e.g. African, Caucasian, Indian and Asian) for large-scale face image annotations, as the boundary cases (e.g. hard samples and noisy samples) are very hard to distinguish if the annotator is not familiar with the identity. Celeb500K [38] is collected in the same way as MS1MV0 [37], using the celebrity name list [37] to search identities from Google and download the top-ranked face images. We download 25M images of 500K identities, and employ RetinaFace [8] to detect faces larger than 50×50 from the original images. By employing the proposed sub-center ArcFace, we can automatically clean MS1MV0 [37] and Celeb500K [38]. After removing the overlap identities (about 50K) through the ID string, we combine the automatically cleaned MS1MV0 and Celeb500K and obtain a large-scale face image dataset, named IBUG-500K, including 11.96 million images of 493K identities. Figure 9 illustrates the gender, race, pose, age and image number distributions of the proposed IBUG-500K dataset. **Training Datasets.** 如Table 1所示,我們分別採用CASIA [56]、VGG2 [9]、MS1MV0 [37]和Celeb500K [38]作為我們的訓練資料,以便與其它方法進行公平的比較。MS1MV0(寬鬆裁剪版本)[37] 是一個原始資料,估計噪點百分比約為$47.1% ∼ 54.4%$[18]。MS1MV3 [88] 是通過半自動的方法從MS1MV0[37]清理後得到的資料。我們僱用特定種族的標註人員(像是非洲人、高加索人、印度人和亞洲人)來處理大規模的人臉圖像標記,因為如果標記人員不熟悉身份的話,那會很難區分邊界案例(例如難以判定的樣本和噪點樣本)。Celeb500K [38]的收集方式與MS1MV0 [37]相同,使用名人的清單[37]從Google搜尋身份並下載排名在前面的人臉圖像。我們下載了500K個身份的25M張的影像,並使用RetinaFace [8]從原始影像中偵測大於50×50的人臉。透過採用所提出的sub-center ArcFace方法,我們可以自動地的清理MS1MV0 [37] 和 Celeb500K [38]。在透過ID字串去除重複的身份(約50K)之後,我們將利用自動清理的MS1MV0和Celeb500K兩個資料集結合起來,最終得到一個大規模的人臉影像資料集,命名為IBUG-500K,包括493K個身份的1196萬張影像。Figure 9說明了所提出的IBUG-500K資料集的性別、種族、姿勢、年齡和圖像數量的分佈。 ![image](https://hackmd.io/_uploads/BJHR8U8NJl.png) Fig. 9. IBUG-500K statistics. 
We show the (a) gender, (b) race, (c) yaw pose, (d) age and (e) image number distributions of the proposed large-scale training dataset.

TABLE 1: Face datasets for training and testing. “(D)” refers to the distractors. IBUG-500K is the training data automatically refined by the proposed sub-center ArcFace. LFR2019-Image and LFR2019-Video are the proposed large-scale image and video test sets.

![image](https://hackmd.io/_uploads/Bkd8UULE1l.png)

**Test Datasets.** During training, we explore efficient face verification datasets (e.g. LFW [89], CFP-FP [74], AgeDB [76]) to check the convergence status of the model. Besides the most widely used LFW [89] and YTF [90] datasets, we also report the performance of ArcFace on the recent datasets (e.g. CPLFW [75] and CALFW [77]) with large pose and age variations. We also extensively test the proposed ArcFace on large-scale image datasets (e.g. MegaFace [78], IJB-B [79], IJB-C [80] and LFR2019-Image [88]) and large-scale video datasets (LFR2019-Video [88]). Detailed dataset statistics are presented in Table 1. For the LFR2019-Image dataset, there are 274K images from the 5.7K LFW identities [89] and 1.58M distractors downloaded from Flickr. For the LFR2019-Video dataset, there are 200K videos of 10K identities collected from various shows, films and television dramas. The length of each video ranges from 1 to 30 seconds. Both the LFR2019-Image dataset and the LFR2019-Video dataset are manually cleaned to ensure the unbiased evaluation of different face recognition models.

**Test Datasets.** 在訓練過程中,我們探索高效率的人臉驗證資料集(例如LFW [89]、CFP-FP [74]、AgeDB [76])來確認模型的收斂狀態。除了最廣泛使用的LFW [89]和YTF [90]資料集之外,我們還報告了ArcFace在最近的資料集(有著大量姿勢、年齡變化,像是CPLFW [75]和CALFW [77])上的效能。我們也在大規模影像資料集(像是MegaFace [78]、IJB-B [79]、IJB-C [80]和LFR2019-Image [88])和大規模影片資料集(LFR2019-Video [88])上廣泛地測試所提出的ArcFace。詳細的資料集統計資料如Table 1所示。LFR2019-Image資料集包含來自5.7K個LFW身份[89]的274K張影像,以及從Flickr下載的1.58M張干擾影像(distractors)。LFR2019-Video資料集的話則是從各種節目、電影和電視劇中收集的10K個身份的200K個影片。每個影片的長度從1秒到30秒不等。LFR2019-Image資料集和LFR2019-Video資料集均經過手動清洗,以確保對不同人臉辨識模型的無偏評估。

**Experimental Settings.** For data preprocessing, we follow the recent papers [13], [14] to generate the normalized face crops (112×112) by utilizing five facial points predicted by RetinaFace [8]. For the embedding network, we employ the widely used CNN architectures, ResNet50 and ResNet100 [58], [91] without the bottleneck structure. After the last convolutional layer, we explore the BN [92]-Dropout [93]-FC-BN structure to get the final 512-D embedding feature. In this paper, we use ([training dataset, network structure, loss]) to facilitate understanding of different experimental settings.

**Experimental Settings.** 對於資料預處理的部份,我們依著近來的論文[13]、[14],利用RetinaFace [8]預測的五個臉部關鍵點來生成標準化的臉部裁剪(112×112)。對於嵌入網路的部份,我們採用被廣泛使用、不含瓶頸(bottleneck)結構的CNN架構,ResNet50和ResNet100 [58]、[91]。在最後一個卷積層之後,我們使用BN[92]-Dropout[93]-FC-BN的結構來獲得最終的512維嵌入特徵。在這篇論文中,我們使用([training dataset, network structure, loss])的寫法以便於理解不同的實驗設置。

We follow [14] to set the feature scale $s$ to 64 and choose the angular margin $m$ of ArcFace at $0.5$. All recognition experiments in this paper are implemented by MXNet [39]. We set the batch size to 512 and train models on eight NVIDIA Tesla P40 (24GB) GPUs. We set the momentum to 0.9 and weight decay to $5e-4$. For the ArcFace training, we employ the SGD optimizer and follow [14], [9] to design the learning rate schedules for different datasets. On CASIA, the learning rate starts from 0.1 and is divided by 10 at 20, 28 epochs. The training process is finished at 32 epochs.
On VGG2, the learning rate is decreased at 6, 9 epochs and we finish training at 12 epochs. On MS1MV3 and IBUG-500K, we refer to the verification accuracy on CFP-FP and AgeDB to reduce the learning rate at 8, 14 epochs and terminate at 18 epochs.

我們依著[14]將特徵尺度$s$設定為64,然後ArcFace的angular margin $m$則是設置為$0.5$。這篇論文中的所有人臉辨識實驗都是使用MXNet[39]實現。我們將batch size設定為512,並在8塊NVIDIA Tesla P40 (24GB)的GPU上訓練模型。我們將momentum設定為0.9,weight decay設定為$5e-4$。對於ArcFace的訓練,我們採用SGD optimizer,並依著[14]、[9]針對不同資料集設計不同的learning rate調度策略。在CASIA上,learning rate從0.1開始,在第20、28個epoch時除以10。訓練過程在第32個epoch時結束。在VGG2上,學習率在第6、9個epoch時降低,我們在第12個epoch時完成訓練。在MS1MV3和IBUG-500K上,我們參考CFP-FP和AgeDB上的驗證精度,在第8、14個epoch時降低learning rate,並在第18個epoch終止。

For the training of the proposed sub-center ArcFace on MS1MV0 [37], we also employ the same learning rate schedule as on MS1MV3 to train the first round of model $(K=3)$. Then, we drop non-dominant sub-centers $(K = 3 ↓ 1)$ and high-confident noisy data $(> 75^\circ)$ by using the first round model through an off-line way. Finally, we retrain the model from scratch using the automatically cleaned data. For the experiments of the sub-center ArcFace on Celeb500K [38], the only difference is the learning rate schedule, which is same as on IBUG-500K.

對於在MS1MV0 [37]上訓練由我們所提出的sub-center ArcFace,我們也採用與MS1MV3相同的學習率策略來訓練第一輪模型$(K=3)$。然後,我們透過離線方式使用第一輪模型來去除non-dominant sub-centers $(K = 3 ↓ 1)$和高置信度的噪點資料$(> 75^\circ)$。最後,我們使用自動化清理的資料從頭開始重新訓練模型。對於sub-center ArcFace在Celeb500K上的實驗[38],唯一的差異是學習率的策略,這部份是跟IBUG-500K上的設定相同。

During testing of the face recognition models, we only keep the feature embedding network without the fully connected layer (160MB for ResNet50 and 250MB for ResNet100) and extract the $512$-D features (8.9 ms/face for ResNet50 and 15.4 ms/face for ResNet100) for each normalized face. To get the embedding features for templates (e.g. IJB-B and IJB-C) or videos (e.g. YTF and LFR2019-Video), we simply calculate the feature center of all images from the template or all frames from the video.

在人臉辨識模型的測試過程中,我們就只保留沒有全連接層的特徵嵌入網路(ResNet50為160MB,ResNet100為250MB),並為每張標準化人臉提取512維的特徵(ResNet50為8.9 ms/face,ResNet100為15.4 ms/face)。為了獲得模板(像是IJB-B和IJB-C)或是影片(例如YTF和LFR2019-Video)的嵌入特徵,我們只需計算模板中所有影像或影片中所有幀的特徵中心。

### 4.2 Ablation Study on ArcFace

In Table 2, we first explore the angular margin setting for ArcFace on the CASIA dataset with ResNet50. The best margin observed in our experiments is $0.5$. Using the proposed combined margin framework in Eq. 4, it is easier to set the margin of SphereFace and CosFace which we find to have optimal performance when set at $1.35$ and $0.35$, respectively. Our implementations for both SphereFace and CosFace can lead to excellent performance without any difficulty in convergence. The proposed ArcFace achieves the highest verification accuracy on all three test sets. In addition, we perform extensive experiments with the combined margin framework (some of the best performance is observed for CM1 (1, 0.3, 0.2) and CM2 (0.9, 0.4, 0.15)) guided by the target logit curves in Figure 4(b). The combined margin framework leads to better performance than individual SphereFace and CosFace but upper-bounded by the performance of ArcFace.

在Table 2中,我們首先使用ResNet50來探索ArcFace在CASIA資料集上的angular margin設定。在我們的實驗中觀察到的最佳邊界是$0.5$。基於Eq. 4中提出的combined margin framework(結合邊界框架),我們更容易設定SphereFace和CosFace的邊界,並發現它們的邊界分別設定為$1.35$和$0.35$時有最佳效能。我們對SphereFace和CosFace的實作都能在沒有任何收斂困難的情況下獲得優異的效能。我們所提出的ArcFace在三個測試集上實現了最高的驗證精度。此外,我們依據Figure 4(b)所示的目標logit曲線(target logit curves),在combined margin framework上做了大量實驗(部分最佳效能來自於CM1 (1, 0.3, 0.2)和CM2 (0.9, 0.4, 0.15))。combined margin framework的效能超越了單一的SphereFace和CosFace,不過仍然受限於ArcFace的效能上限。
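:::warning
編者補充:下面用 PyTorch 寫一個 combined margin framework(Eq. 4)的最小示意,說明 $s\cdot(\cos(m_1\theta + m_2) - m_3)$ 的計算流程;`CombinedMarginHead` 與其參數名稱皆為假設,並非論文官方(MXNet)實作。$(m_1, m_2, m_3)=(1, 0.5, 0)$ 即 ArcFace,$(1.35, 0, 0)$ 對應 SphereFace,$(1, 0, 0.35)$ 對應 CosFace,CM1、CM2 則分別為 $(1, 0.3, 0.2)$ 與 $(0.9, 0.4, 0.15)$。
:::

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedMarginHead(nn.Module):
    """combined margin framework 的示意:logit = s * (cos(m1*θ + m2) - m3)。"""

    def __init__(self, feat_dim=512, num_classes=10000, s=64.0, m1=1.0, m2=0.5, m3=0.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))  # 每一列視為一個類別中心
        self.s, self.m1, self.m2, self.m3 = s, m1, m2, m3

    def forward(self, embeddings, labels):
        # 特徵與類別中心都做 L2 正規化後,內積即為 cos(θ)
        cos_theta = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        cos_theta = cos_theta.clamp(-1 + 1e-7, 1 - 1e-7)
        theta_y = torch.acos(cos_theta.gather(1, labels.view(-1, 1)))    # 目標類別的角度
        target_logit = torch.cos(self.m1 * theta_y + self.m2) - self.m3  # 只對目標類別施加邊界
        logits = cos_theta.clone()
        logits.scatter_(1, labels.view(-1, 1), target_logit)
        return self.s * logits      # 後續步驟與一般的 softmax cross-entropy 完全相同

# 用法示意:head = CombinedMarginHead(m1=1.0, m2=0.5, m3=0.0)  # 即 ArcFace
#           loss = F.cross_entropy(head(embeddings, labels), labels)
```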
![image](https://hackmd.io/_uploads/SyUbD88Vyl.png)

TABLE 2: Verification results (%) of different loss functions ([CASIA, ResNet50, Loss*]).

Besides the comparison with margin-based methods, we conduct a further comparison between ArcFace and other losses which aim at enforcing intra-class compactness (Eq. 5) and inter-class discrepancy (Eq. 6). As the baseline, we choose the softmax loss. After weight and feature normalization, we have observed obvious performance drops on CFP-FP and AgeDB with the feature re-scale parameter $s$ set as $64$. To obtain comparable performance as the softmax loss, we have searched the best scale parameter $s=20$ for Norm-Softmax. By combining the Norm-Softmax with the intra-class loss, the performance improves on CFP-FP and AgeDB. However, combining the Norm-Softmax with the inter-class loss only slightly improves the accuracy. Employing margin penalty within triplet samples is less effective than inserting margin between samples and centers as in ArcFace, indicating local comparisons in the Triplet loss are not as effective as global comparisons in ArcFace. Finally, we incorporate the Intra-loss, Inter-loss and Triplet-loss into ArcFace, but no obvious improvement is observed, which leads us to believe that ArcFace is already enforcing intra-class compactness, inter-class discrepancy and classification margin.

除了與基於邊界(margin-based)的方法進行比較之外,我們還對ArcFace和其它類型的loss做了進一步的比較,這些loss主要是為了加強類別內的緊湊性(Eq. 5)和類別間的差異(Eq. 6)。我們選擇softmax loss作為比較基線。在權重與特徵正規化之後,我們觀察到,當特徵重新縮放參數$s$設定為$64$的時候,CFP-FP和AgeDB的效能明顯下降。為了獲得與softmax loss相當的效能,我們搜尋到Norm-Softmax的最佳縮放參數為$s=20$。透過結合Norm-Softmax與類別內損失(intra-class loss),CFP-FP和AgeDB的效能得到提升。然而,結合Norm-Softmax與類別間損失(inter-class loss)只能略微提高準確性。在triplet samples中使用邊界懲罰(margin penalty)的效果不如ArcFace那樣在樣本和中心之間插入邊界(margin),這說明了Triplet loss中的局部比較不如ArcFace中的全域比較來得有效。最後,我們將Intra-loss、Inter-loss和Triplet-loss整合到ArcFace中,但沒有觀察到明顯的提升,這讓我們相信ArcFace本身已經有效地加強了類別內的緊湊性、類別間的差異性和分類邊界。

### 4.3 Ablation Study on Sub-center ArcFace

In Table 3, we conduct extensive experiments to investigate the proposed sub-center ArcFace on noisy training data (e.g. MS1MV0 [37] and Celeb500K [38]). Models trained on the manually cleaned MS1MV3 [88] are taken as the reference. We train ResNet50 networks under different settings and evaluate the performance by adopting TPR@FPR=1e-4 on IJB-C, which is more objective and less affected by the noise within the test data [94].

在Table 3中,我們做了大量的實驗,在有噪點的訓練資料(像是MS1MV0 [37]和Celeb500K [38])上研究我們所提出的sub-center ArcFace。在手動清理的MS1MV3 [88]上訓練的模型作為參照。我們在不同設置下訓練ResNet50網路,並在IJB-C上採用TPR@FPR=1e-4來評估效能,這樣的指標更為客觀,受測試資料中噪點的影響也較小[94]。

From Table 3, we have the following observations:

* ArcFace has an obvious performance drop (from (14) $96.50\%$ to (1) $90.27\%$) when the training data is changed from the clean MS1MV3 to the noisy MS1MV0. By contrast, sub-center ArcFace is more robust ((2) $93.72\%$) under massive noise.
* 當訓練資料從乾淨的MS1MV3變成有噪點的MS1MV0時,ArcFace的效能明顯下降(從(14)$96.50\%$到(1)$90.27\%$)。相比之下,sub-center ArcFace在大量噪點下更具穩健性((2) $93.72\%$)。

* Too many sub-centers (too large $K$) can obviously undermine the intra-class compactness and decrease the accuracy (from (2) $93.72\%$ to (5) $67.94\%$). This observation indicates that noise tolerance and intra-class compactness should be balanced during training. Considering the GPU memory consumption, we select $K$=3 in this paper.

* 太多的sub-centers(也就是太大的$K$)會明顯破壞類別內的緊湊性並降低準確度(從(2)$93.72\%$到(5)$67.94\%$)。這項觀察結果表明,在訓練過程中應該要平衡噪點的容忍度和類別內的緊湊性。考量到GPU記憶體的消耗,這篇論文中我們選擇$K$=3。

* The nearest sub-center assignment by the max pooling is slightly better than the softmax pooling [62] ((2) $93.72\%$ vs. (3) $93.55\%$). Thus, we choose the more efficient max pooling operator in the following experiments.

* 透過max pooling所分配的最接近的sub-center略優於softmax pooling[62]((2) $93.72\%$ vs. (3) $93.55\%$)。因此,我們在後續實驗中選擇更高效的最大池化操作(max pooling operator)(示意程式碼參見Table 3之後)。

* Dropping non-dominant sub-centers and high-confident noisy samples can achieve better performance than adding regularization [62] to enforce compactness between sub-centers ((7) $95.92\%$ vs. (10) $93.64\%$). Besides, the performance of our method is not very sensitive to the constant threshold ((6) $95.91\%$, (7) $95.92\%$ and (8) $95.74\%$), and we select $75^{\circ}$ as the threshold for dropping high-confident noisy samples in the following experiments.

* 比起添加正規化[62]來強制sub-centers之間的緊湊性,排除掉non-dominant sub-centers和高置信度的噪點樣本可以達到更好的效能((7) $95.92\%$ vs. (10) $93.64\%$)。此外,我們方法的表現對常數閾值不是那麼敏感((6) $95.91\%$、(7) $95.92\%$和(8) $95.74\%$),我們選擇$75^{\circ}$作為後續實驗中排除高置信度噪點樣本的閾值。

* Co-mining [21] and re-weighting methods [19], [20] can also improve the robustness under massive noise, but sub-center ArcFace can do better through automatic clean and noisy data isolation during training ((7) $95.92\%$ vs. (11) $93.82\%$, (12) $93.65\%$ and (13) $93.60\%$).

* Co-mining[21]和re-weighting的方法[19]、[20]也可以提高大量噪點下的穩健性,不過sub-center ArcFace可以透過訓練過程中自動化清理和噪點資料隔離做得更好((7)$95.92\%$ 與 (11) $93.82\%$、(12) $93.65\%$ 和 (13) $93.60\%$)。

* On the clean dataset (MS1MV3), sub-center ArcFace achieves similar performance as ArcFace ((16) $96.43\%$ vs. (14) $96.50\%$). By employing the threshold of $75^{\circ}$ on MS1MV3, $4.18\%$ hard samples are removed, but the performance only slightly decreases, thus we estimate MS1MV3 still contains some noises.

* 在乾淨的資料集(MS1MV3)上,sub-center ArcFace實現了與ArcFace相似的效能((16) $96.43\%$ vs. (14) $96.50\%$)。透過在MS1MV3上採用$75^{\circ}$的閾值,$4.18\%$難以判定的樣本被移除,但效能僅略有下降,因此我們估計MS1MV3仍包含一些噪點樣本。

* The proposed sub-center ArcFace trained on noisy MS1MV0 can achieve comparable performance compared to ArcFace trained on manually cleaned MS1MV3 ((7) $95.92\%$ vs. (14) $96.50\%$).

* 與在手動清理的MS1MV3上訓練的ArcFace相比,在帶噪點樣本的MS1MV0上訓練的sub-center ArcFace可以實現與之相當的效能((7) $95.92\%$ vs. (14) $96.50\%$)。

* By enlarging the training data, sub-center ArcFace can easily achieve better performance even though it is trained from noisy web faces ((19) $96.91\%$ vs. (13) $96.50\%$).

* 透過擴大訓練資料,sub-center ArcFace可以輕鬆獲得更好的效能,即使它是從充滿噪點的網路人臉資料訓練((19) $96.91\%$ vs. (13) $96.50\%$)。

![image](https://hackmd.io/_uploads/HJTWdIIVJe.png)

TABLE 3: Ablation experiments of different settings of the proposed sub-center ArcFace on MS1MV0, MS1MV3 and Celeb500K. The 1:1 verification accuracy (TPR@FPR=1e-4) is reported on the IJB-B and IJB-C datasets. ([MS1MV0 / MS1MV3 / Celeb500K, ResNet50, Sub-center ArcFace])
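:::warning
編者補充:承上面關於max pooling的觀察,下面是sub-center ArcFace的最小示意(PyTorch、假設性寫法,並非官方實作):每個類別配置$K$個sub-centers,取樣本與$K$個sub-centers之間最大的$\cos\theta$作為該類別的相似度,再套用additive angular margin。
:::

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterArcFaceHead(nn.Module):
    """sub-center ArcFace 的示意:K 個 sub-centers 以 max pooling 選出最近的一個。"""

    def __init__(self, feat_dim=512, num_classes=10000, K=3, s=64.0, m=0.5):
        super().__init__()
        self.K, self.s, self.m = K, s, m
        # 權重排列假設為 [class0_sub0, ..., class0_sub{K-1}, class1_sub0, ...]
        self.weight = nn.Parameter(torch.randn(num_classes * K, feat_dim))

    def forward(self, embeddings, labels):
        cos_all = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        n = cos_all.shape[0]
        # (N, num_classes, K) -> 對 K 個 sub-centers 取最大值,即 nearest sub-center assignment
        cos_theta = cos_all.view(n, -1, self.K).max(dim=2).values
        cos_theta = cos_theta.clamp(-1 + 1e-7, 1 - 1e-7)
        theta_y = torch.acos(cos_theta.gather(1, labels.view(-1, 1)))
        target_logit = torch.cos(theta_y + self.m)          # additive angular margin
        logits = cos_theta.clone()
        logits.scatter_(1, labels.view(-1, 1), target_logit)
        return self.s * logits

# 訓練完成後,可檢查每個樣本與其 dominant sub-center 的夾角,
# 夾角大於常數閾值(論文中為 75 度)者可視為高置信度噪點而剔除,再從頭重新訓練。
```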
### 4.4 Benchmark Results

**Results on LFW, YTF, CFP-FP, CPLFW, AgeDB, CALFW.** LFW [89] and YTF [90] datasets are the most widely used benchmarks for unconstrained face verification on images and videos. In this paper, we follow the *unrestricted with labelled outside data* protocol to report the performance. As reported in Table 4, ArcFace models trained on MS1MV3 and IBUG-500K with ResNet100 beat the baselines (e.g. SphereFace [13] and CosFace [14]) on both LFW and YTF, which shows that the additive angular margin penalty can notably enhance the discriminative power of deeply learned features, demonstrating the effectiveness of ArcFace. As the margin-based softmax loss has been widely used in recent methods, the performance begins to be saturated around $99.8\%$ and $98.0\%$ on LFW and YTF, respectively. However, the proposed ArcFace is still among the most competitive face recognition methods.

LFW [89]和YTF [90]資料集是影像和影片上無約束人臉驗證中使用最廣泛的基準。在這篇論文中,我們遵循*unrestricted with labelled outside data*協定來報告效能。如Table 4所示,使用ResNet100在MS1MV3和IBUG-500K上訓練的ArcFace模型在LFW和YTF上都擊敗了基線(例如SphereFace [13]和CosFace [14]),這表明additive angular margin懲罰可以顯著增強深度學習特徵的判別能力,展現了ArcFace的有效性。由於基於邊界(margin-based)的softmax loss已在近來的方法中被廣泛使用,LFW和YTF上的效能分別在$99.8\%$和$98.0\%$左右開始飽和。然而,我們所提出的ArcFace仍然是最具競爭力的人臉辨識方法之一。

![image](https://hackmd.io/_uploads/BkpBu8U41l.png)

TABLE 4: Verification performance (%) of different methods on LFW and YTF. ([Dataset*, ResNet100, ArcFace])

Besides on LFW and YTF datasets, we also report the performance of ArcFace on the recently introduced datasets (e.g. CFP-FP [74], CPLFW [75], AgeDB [76] and CALFW [77]) which show large pose and age variations. Among all of the recent face recognition models, our ArcFace models trained on MS1MV3 and IBUG-500K are evaluated as the top-ranked face recognition models as shown in Table 5, outperforming counterparts by an obvious margin on pose-invariant and age-invariant face recognition. In Figure 10, we show the results of the ArcFace model trained on IBUG-500K by illustrating the angle distributions of both positive and negative pairs on LFW, YTF, CFP-FP, CPLFW, AgeDB and CALFW. We can clearly find that the intra-variance due to pose and age gaps significantly increases the angles between positive pairs thus making the best threshold for face verification increasing and generating more confusion regions on the histogram.

除了LFW和YTF資料集之外,我們還報告了ArcFace在最近引入的資料集(例如CFP-FP [74]、CPLFW [75]、AgeDB [76]和CALFW [77])上的效能,這些資料集有著較大的姿勢和年齡變化。在所有最新的人臉辨識模型中,我們在MS1MV3和IBUG-500K上訓練的ArcFace模型被評為排名最高的人臉辨識模型,如Table 5所示,在姿勢不變(pose-invariant)和年齡不變(age-invariant)的人臉辨識上明顯優於其它方法。在Figure 10中,我們透過展示LFW、YTF、CFP-FP、CPLFW、AgeDB和CALFW上正負樣本對的角度分佈,來說明在IBUG-500K上訓練的ArcFace模型的結果。我們可以清楚地發現,由於姿勢和年齡差距所導致的類別內變異(intra-variance)顯著增加了正樣本對之間的角度,從而使人臉驗證的最佳閾值提高,並在直方圖上產生更多的混淆區域。

![image](https://hackmd.io/_uploads/BJl0OL8V1l.png)

Fig. 10. Angle distributions of both positive and negative pairs on LFW, YTF, CFP-FP, CPLFW, AgeDB and CALFW. The red histogram indicates positive pairs while the blue histogram indicates negative pairs. All angles are represented in degree. ([IBUG-500K, ResNet100, ArcFace])

![image](https://hackmd.io/_uploads/HJkcO88N1g.png)

TABLE 5: Verification performance (%) of different methods on CFP-FP, CPLFW, AgeDB and CALFW. ([Dataset*, ResNet100, ArcFace])

**Results on MegaFace.** The MegaFace dataset [78] includes 1M images of 690K different individuals as the gallery set and 100K photos of 530 unique individuals from FaceScrub [112] as the probe set.
As we observed an obvious performance gap between identification and verification in the previous work (e.g. CosFace [14]), we performed a thorough manual check of the whole MegaFace dataset and found many face images with wrong labels, which significantly affects the performance. Therefore, we manually refined the whole MegaFace dataset and report the correct performance of ArcFace on MegaFace. In Table 6, we use “R” to denote the refined version of MegaFace and the performance comparisons also focus on the refined version.

**Results on MegaFace.** MegaFace資料集[78]包含690K個不同身份的1M張影像作為gallery set,以及來自FaceScrub [112]的530個不同身份的100K張照片作為probe set。由於我們在先前的研究中(像是CosFace [14])觀察到辨識和驗證之間存在明顯的效能差距,因此我們對整個MegaFace資料集做了徹底的手動檢查,然後發現很多標記錯誤的人臉影像,這會明顯影響效能。因此,我們手動修正了整個MegaFace資料集,並報告ArcFace在MegaFace上的正確效能。在Table 6中,我們使用「R」表示MegaFace的修正版本(refined version),效能的比較也集中在修正後的版本。

![image](https://hackmd.io/_uploads/HJz7xwIVJl.png)

TABLE 6: Face identification and verification evaluation of different methods on MegaFace Challenge 1 using FaceScrub as the probe set. “Id” refers to the rank-1 face identification accuracy with 1M distractors, and “Ver” refers to the face verification TPR at 10^−6^ FPR. “R” refers to data refinement on both probe set and 1M distractors of MegaFace. ArcFace obtains state-of-the-art performance under both small and large protocols.

On MegaFace, there are two testing scenarios (identification and verification) under two protocols (large or small training set). The training set is defined as large if it contains more than 0.5M images. For the fair comparison, we train ArcFace on CASIA and IBUG-500K under the small protocol and large protocol, respectively. In Table 6, ArcFace trained on CASIA achieves the best single-model identification and verification performance, not only surpassing the strong baselines (e.g. SphereFace [13] and CosFace [14]) but also outperforming other published methods [72], [84].

在MegaFace上,有兩種協議(大或小訓練集)下的兩種測試場景(辨識和驗證)。如果訓練集包含超過0.5M的影像,那就定義為大規模的訓練集。為了公平比較,我們分別在small protocol和large protocol下在CASIA和IBUG-500K上訓練ArcFace。在Table 6中,在CASIA上訓練的ArcFace實現了最佳的單模型辨識和驗證效能,不僅超越了強大的基線(如SphereFace [13]和CosFace [14]),而且還優於其它已發表的方法[72]、[84]。

Under the large protocol, ArcFace trained on IBUG-500K surpasses ArcFace trained on MS1MV3 by a clear margin (0.47% improvement on identification), which indicates that large-scale training data is very beneficial and the proposed sub-center ArcFace is effective for automatic data cleaning under different data scales. As shown in Figure 11, ArcFace trained on IBUG-500K forms an upper envelope of other models under both identification and verification scenarios. Compared to MC-FaceGraph [109], ArcFace trained on IBUG-500K obtains comparable results on identification and better results on verification. Considering 18.8M images of 636K identities are used in MC-FaceGraph [109], the performance of our method is very impressive, as we only use images automatically cleaned from noisy web data. Similar to LFW, the identification results on MegaFace are also saturated (around 99%). Therefore, the performance gap of 0.04% on identification is negligible and our model is among the most competitive face recognition methods.

在large protocol下,在IBUG-500K上訓練的ArcFace明顯優於在MS1MV3上訓練的ArcFace(辨識率提高了0.47%),這表明大規模的訓練資料是非常有幫助的,並且我們所提出的sub-center ArcFace在不同資料規模下的自動資料清理都十分有效。如Figure 11所示,在IBUG-500K上訓練的ArcFace在辨識和驗證場景下都形成了其它模型曲線的上包絡線(upper envelope)。對比MC-FaceGraph [109],在IBUG-500K上訓練的ArcFace在辨識方面獲得了相當的結果,在驗證方面獲得了更好的結果。考慮到MC-FaceGraph [109]使用了636K個身份的18.8M張影像,我們的方法的效能非常令人印象深刻,因為我們只使用從充滿噪點的網路資料中自動清理出來的影像。與LFW類似,MegaFace上的辨識結果也已經飽和(99%左右)。因此,辨識上0.04%的效能差距可以忽略不計,我們的模型是最具競爭力的人臉辨識方法之一。
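:::warning
編者補充:§4.4 反覆使用 TPR@FPR 這類指標(例如 MegaFace 的 Ver@FPR=1e-6、IJB 的 TPR@FPR=1e-4)。下面是一個簡化的示意函式(假設性寫法,非論文官方評估工具),說明如何從 1:1 比對分數估計在指定 FPR 下的 TPR;實務上通常會以完整的 ROC 曲線內插求得。
:::

```python
import numpy as np

def tpr_at_fpr(genuine_scores, impostor_scores, target_fpr=1e-6):
    """genuine_scores / impostor_scores:正樣本對與冒名樣本對的餘弦相似度。"""
    impostor_sorted = np.sort(np.asarray(impostor_scores))[::-1]   # 由大到小排序
    k = max(int(target_fpr * len(impostor_sorted)), 1)             # 允許通過的冒名比對數
    threshold = impostor_sorted[k - 1]                             # 對應 target_fpr 的門檻(近似)
    tpr = float(np.mean(np.asarray(genuine_scores) >= threshold))
    return tpr, threshold
```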
![image](https://hackmd.io/_uploads/r1Xo6ULVke.png)

Fig. 11. CMC and ROC curves of different models on MegaFace. Results are evaluated on both original and refined MegaFace dataset.

**Results on IJB-B and IJB-C.** The IJB-B dataset [79] contains 1,845 subjects with 21.8K still images and 55K frames from 7,011 videos. The IJB-C dataset [80] is a further extension of IJB-B, having 3,531 subjects with 31.3K still images and 117.5K frames from 11,779 videos. On IJB-B and IJB-C datasets, there are two evaluation protocols, 1:1 verification and 1:N identification.

**Results on IJB-B and IJB-C.** IJB-B資料集[79]包含1,845個主體,涵蓋21.8K張靜態影像以及來自7,011部影片的55K幀畫面。IJB-C資料集[80]是IJB-B的進一步擴展,包含3,531名主體,涵蓋31.3K張靜態影像以及來自11,779部影片的117.5K幀畫面。在IJB-B和IJB-C資料集上,設有兩種評估協議:1:1驗證和1:N辨識。

For the widely used 1:1 verification protocol, there are 12,115 templates with 10,270 genuine matches and 8M impostor matches on IJB-B, and there are 23,124 templates with 19,557 genuine matches and 15,639K impostor matches on IJB-C. In Table 7, we compare the TPR (@FPR=1e-4) of ArcFace with the previous state-of-the-art models. We first employ the VGG2 [9] dataset as the training data and the ResNet50 as the embedding network to train ArcFace for the fair comparison with the most recent softmax-based methods [9], [113], [94]. As we can see from the results, the proposed additive angular margin can obviously boost the performance on both IJB-B and IJB-C compared to the softmax loss (about 3 ∼ 5%, which is a significant reduction in the error).

在被廣泛使用的1:1驗證協議中,IJB-B上有12,115個模板,其中包含10,270個真實匹配(genuine matches)和8M個冒名匹配(impostor matches),IJB-C的話則是有23,124個模板,其中包含19,557個真實匹配和15,639K個冒名匹配。在Table 7中,我們把ArcFace的TPR (@FPR=1e-4)與先前最先進的模型進行比較。我們首先使用VGG2 [9]資料集作為訓練資料,使用ResNet50作為嵌入網路來訓練ArcFace,以便與最新基於softmax的方法[9]、[113]、[94]進行公平比較。從結果中可以看到,相較於softmax loss,由我們所提出的additive angular margin可以明顯提升IJB-B和IJB-C上的表現(約3 ∼ 5%,這明顯降低了誤差)。

![image](https://hackmd.io/_uploads/BJVwlvL41l.png)

TABLE 7: 1:1 verification (TPR@FPR=1e-4) on IJB-B and IJB-C.

Drawing support from more training data (IBUG-500K) and a deeper neural network (ResNet100), ArcFace can further improve the TPR (@FPR=1e-4) to 96.02% and 97.27% on IJB-B and IJB-C, respectively. Compared to the joint margin-based and mining-based method (e.g. CurricularFace [54]), our method further decreases the error rate by 22.57% and 29.09% on IJB-B and IJB-C, which indicates that the automatically cleaned data by the proposed sub-center ArcFace are effective to boost the performance. In Table 8, we compare the proposed sub-center ArcFace with FaceGraph [109] on large-scale cleansing. In FaceGraph [109], one million celebrities (87.0M face images) [37] are cleaned into a noise-free dataset named MC-FaceGraph (including 18.8M face images of 636.2K identities) by employing a global-local graph convolutional network. Even though the proposed sub-center ArcFace is only applied to half million identities, the cleaned dataset, IBUG-500K (including 11.96M face images of 493K identities), still outperforms MC-FaceGraph [109].
Under the evaluation metric of TPR@FPR=1e-5, the ArcFace model trained on IBUG-500K surpasses the counterpart trained on MC-FaceGraph by 0.66% and 0.45% on IJB-B and IJB-C, respectively. In Figure 12, we show the full ROC curves of the proposed ArcFace on IJB-B and IJB-C, and ArcFace achieves impressive performance even at FPR=1e-6, setting a new baseline.

受助於更多訓練資料(IBUG-500K)和更深層的神經網路(ResNet100),ArcFace可以進一步將IJB-B和IJB-C上的TPR(@FPR=1e-4)分別提高到96.02%和97.27%。與joint margin-based和mining-based的方法(如CurricularFace [54])相比,我們的方法在IJB-B和IJB-C上進一步將錯誤率降低了22.57%和29.09%,這說明了透過由我們所提出的sub-center ArcFace所自動清理的資料可以有效提升效能。在Table 8中,我們將sub-center ArcFace與FaceGraph [109]在大規模清洗上做了比較。在FaceGraph [109]中,透過採用global-local graph convolutional network,將100萬名人(8,700萬張人臉影像)[37]清理為名為MC-FaceGraph的無噪點資料集(包括636.2K個身份的1880萬張人臉圖像)。儘管sub-center ArcFace僅應用於50萬個身份,但清理後的資料集IBUG-500K(包括493K個身份的1,196萬張人臉影像)仍然優於MC-FaceGraph [109]。在TPR@FPR=1e-5的評估指標下,在IBUG-500K上訓練的ArcFace模型在IJB-B和IJB-C上分別比在MC-FaceGraph上訓練的模型高出0.66%和0.45%。在Figure 12中,我們展示了ArcFace在IJB-B和IJB-C上的完整ROC曲線,即便在FPR=1e-6的條件下,ArcFace的表現仍令人印象深刻,設立了一個新的基準。

![image](https://hackmd.io/_uploads/BybFlwIN1g.png)

Fig. 12. ROC curves of 1:1 verification protocol on IJB-B and IJB-C. ([Dataset*, ResNet100, ArcFace])

For the 1:N end-to-end mixed protocol, there are 10,270 probe templates containing 60,758 still images and video frames on IJB-B, and there are 19,593 probe templates containing 127,152 still images and video frames on IJB-C. In Table 8, we report the Rank-1 identification accuracy of our method compared to baseline models. ArcFace trained on IBUG-500K achieves impressive performance on both IJB-B (95.94%) and IJB-C (97.21%), setting a new record on this benchmark.

對於1:N end-to-end的混合協議,IJB-B包含10,270個探測模板(probe templates),涵蓋60,758張靜態影像與影片幀畫面;IJB-C則包含19,593個探測模板,涵蓋127,152張靜態影像與影片幀畫面。在Table 8中,我們報告了我們的方法與基準模型相比的Rank-1辨識精度。在IBUG-500K上訓練的ArcFace在IJB-B (95.94%)和IJB-C (97.21%)上均取得了令人印象深刻的效能,創下了該基準的新紀錄。

![image](https://hackmd.io/_uploads/HJ25xwUNJe.png)

TABLE 8: 1:1 verification (TPR@FPR=1e-5) and 1:N identification (Rank-1) on IJB-B and IJB-C. ([Dataset*, ResNet100, ArcFace])

**Results on LFR2019-Image and LFR2019-Video.** The Lightweight Face Recognition (LFR) Challenge [88] targets benchmarking face recognition methods under strict computation constraints (i.e. computational complexity < 1.0 GFlops). For a fair comparison, all participants in the challenge must use MS1MV3 [88] as the training data. On LFR2019-Image, trillion-level pairs between gallery and probe set are used for evaluation and TPR@FPR=1e-8 is selected as the main evaluation metric. On LFR2019-Video, billion-level pairs between all videos are used for evaluation and TPR@FPR=1e-4 is employed as the main evaluation metric.

**Results on LFR2019-Image and LFR2019-Video.** Lightweight Face Recognition (LFR)挑戰賽[88]的目標是在嚴格的計算限制(即計算複雜度< 1.0 GFlops)下對人臉辨識方法進行基準測試。為了公平比較,所有挑戰的參與者都必須使用MS1MV3 [88]作為訓練資料。在LFR2019-Image上,使用gallery和probe set之間兆級的配對(trillion-level pairs)進行評估,並選擇TPR@FPR=1e-8作為主要評估指標。在LFR2019-Video上,使用所有影片之間十億級的配對(billion-level pairs)進行評估,並採用TPR@FPR=1e-4作為主要評估指標。

In Table 9, we compare the performance of ArcFace with the top-ranked competition solutions [88]. For the design of our lightweight model, we explore EfficientNet-B0 [118] as the backbone. When training from scratch with the proposed ArcFace loss, EfficientNet-B0 can obtain 86.44% on LFR2019-Image and 61.47% on LFR2019-Video, respectively.
Following the top-ranked solutions, we also employ knowledge distillation [119] to boost the performance of our lightweight model. ArcFace trained on MS1MV3 with ResNet100 provides a high-performance teacher network, achieving 92.75% on LFR2019-Image and 64.89% on LFR2019-Video. With the assistance of the teacher network, our lightweight model is trained by minimizing (1) the ArcFace loss, (2) the $\ell_2$ regression loss between 512-D features of the teacher and student networks, and (3) the KL loss [119] between class-wise similarities predicted by the teacher and student networks. The weights of the $\ell_2$ regression loss and the KL loss are set to 1.0 and 0.1, respectively. With knowledge distillation, our method finally achieves 88.65% on LFR2019-Image and 63.60% on LFR2019-Video. As shown in Figure 13, our method obtains comparable performance with the champion of the LFR2019-Image track and envelops the ROC curves of all top-ranked challenge solutions in the LFR2019-Video track, surpassing the champion by 0.37%.

在Table 9中,我們比較了ArcFace與排名靠前的競賽解決方案[88]的效能。輕量級模型設計的部份,我們探索EfficientNet-B0 [118]作為骨幹。當使用我們所提出的ArcFace loss從頭開始訓練時,EfficientNet-B0分別在LFR2019-Image上獲得86.44%,在LFR2019-Video上獲得61.47%。依循排名靠前的解決方案,我們也採用知識蒸餾[119]來提高輕量級模型的效能。使用ResNet100在MS1MV3上訓練的ArcFace提供了高效能的教師網路(teacher network),分別在LFR2019-Image上達到92.75%,在LFR2019-Video上達到64.89%。在教師網路的幫助下,我們的輕量級模型透過最小化以下三項來進行訓練:(1) ArcFace loss;(2) 教師和學生網路的512維特徵之間的$\ell_2$回歸損失;以及(3) 教師與學生網路預測的類別相似度之間的KL loss [119]。$\ell_2$ regression loss和KL loss的權重分別設定為1.0和0.1。在知識蒸餾的幫助下,我們的方法最終在LFR2019-Image上達到88.65%,在LFR2019-Video上則是63.60%。如Figure 13所示,我們的方法獲得了與LFR2019-Image track冠軍可比擬的效能,並且在LFR2019-Video track中包絡(envelop)了所有排名靠前的解決方案的ROC曲線,超過冠軍0.37%。

![image](https://hackmd.io/_uploads/rJlA13DEJe.png)

Table 9: Verification results (%) on the LFR2019-Image (TPR@FPR=1e-8) and LFR2019-Video (TPR@FPR=1e-4) datasets. ([Dataset*, Network*, ArcFace])

![image](https://hackmd.io/_uploads/Hk-ie3w4kg.png)

Fig. 13. ROC curves of 1:1 verification protocol on the LFR2019-Image and LFR2019-Video datasets. ([MS1MV3, EfficientNet-B0, ArcFace])

### 4.5 Inversion of ArcFace

This section demonstrates the capability of the proposed ArcFace model in terms of effectively synthesizing identity-preserved face images from subjects' centers (the close-set setting) or features (the open-set setting).

本節展示了所提出的ArcFace模型在有效合成具有身份保留(identity-preserved)特性的臉部影像方面的能力,這些影像可以從主體的類別中心(close-set setting)或嵌入特徵(open-set setting)生成。

We adopt the ArcFace (ResNet50) trained on MS1MV3 to conduct the inversion experiments, which include two settings, i.e. close-set and open-set. In the close-set mode, centers stored in the linear layer are selected as the targets to generate face images. Identity preservation is constrained by a classification loss (e.g. Softmax, SphereFace, CosFace and ArcFace). In the open-set mode, embedding features predicted by the pre-trained models are used as the targets to generate face images. Identity preservation is constrained by an $\ell_2$ loss. Each time, we synthesize 256 face images of different identities at the resolution of 112 × 112 in one mini-batch using one NVIDIA V100 GPU. We employ the Adam optimizer [120] at a learning rate of 0.25 and the iteration lasts 20K steps. Regularization parameters [30] for total variation and $\ell_2$ norm of the generated faces are set as $1e-3$ and $1e-4$, respectively.

我們採用在MS1MV3上訓練的ArcFace(ResNet50)進行反演實驗,包括閉集(close-set)和開集(open-set)兩種設定。在close-set模式下,儲存在線性層中的類別中心被選為生成臉部影像的目標,身份保留受到分類損失(例如Softmax、SphereFace、CosFace和ArcFace)的約束。在open-set模式中,則是使用預訓練模型所預測的嵌入特徵作為生成人臉影像的目標,身份保留受到$\ell_2$ loss的約束。每一次的實驗中,我們使用一塊NVIDIA V100 GPU在一個mini-batch中合成256張不同身份的人臉影像,解析度為112 × 112。我們採用Adam optimizer [120],learning rate為0.25,迭代20K步。生成人臉的total variation和$\ell_2$範數的正規化參數[30]分別設定為$1e-3$和$1e-4$。
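:::warning
編者補充:下面把 Algorithm 1 的最佳化流程寫成一個 PyTorch 示意(假設性寫法,非官方實作):從隨機張量出發,最小化「身份損失 + $\lambda\cdot$BN 統計先驗(Eq.8)+ TV/$\ell_2$ 正則化」。其中 `bn_prior_loss` 即前文 Eq.8 的示意函式,`id_loss_fn` 在 close-set 模式為 ArcFace 分類損失、在 open-set 模式為與目標特徵的 $\ell_2$ 損失;超參數取自上一段(Adam、lr=0.25、20K 步、$\lambda=0.05$、TV 與 $\ell_2$ 正則化權重分別為 1e-3 與 1e-4)。
:::

```python
import torch

def total_variation(x):
    """影像平滑先驗:相鄰像素差的平方平均。"""
    return ((x[:, :, 1:, :] - x[:, :, :-1, :]).pow(2).mean()
            + (x[:, :, :, 1:] - x[:, :, :, :-1]).pow(2).mean())

def invert_identity(model, id_loss_fn, steps=20000, lr=0.25,
                    lam_bn=0.05, lam_tv=1e-3, lam_l2=1e-4, batch=256):
    model.eval()                                              # 固定 BN 的 running 統計量
    x = torch.randn(batch, 3, 112, 112, requires_grad=True)   # 隨機初始化的輸入張量
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (id_loss_fn(model, x)                 # 身份損失(close-set:ArcFace;open-set:ℓ2)
                + lam_bn * bn_prior_loss(model, x)   # Eq.8 的 BN 統計先驗(見前文示意函式)
                + lam_tv * total_variation(x)        # total variation 正則化
                + lam_l2 * x.pow(2).mean())          # ℓ2 範數正則化
        loss.backward()
        opt.step()
    return x.detach()

# 註:為了簡潔,這裡每一步會 forward 兩次(身份損失與 BN 先驗各一次);
# 實作上可以在同一次 forward 中同時收集兩者。
```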
In order to quantitatively validate how well the proposed method can preserve the identity of the subject and how visually plausible the reconstructed face image is, three metrics are adopted: (1) Frechet Inception Distance (FID) [121]; (2) cosine similarity from a third-party model ([IBUG-500K, ResNet100, ArcFace]); and (3) face verification accuracy on LFW for open-set experiments.

為了定量驗證所提出的方法可以在多大程度上保留主體的身份,以及重建的人臉影像在視覺上的可信度,我們採用了三個指標:(1) Frechet Inception Distance (FID) [121];(2) 來自第三方模型的餘弦相似度([IBUG-500K、ResNet100、ArcFace]);(3) 開集實驗中在LFW上的人臉驗證精度。

**Close-set Face Generation.** In Table 10, we quantify the realism and identity preservation of the reconstructed faces from different face recognition models. For each model, we synthesize training identities by using the 5K randomly selected class indexes. For each identity, different random inputs are gradually updated by the network gradient into identity-preserved face images. The proposed ArcFace model obviously outperforms the baseline methods (e.g. softmax, SphereFace and CosFace) in image quality, achieving an FID score of 70.39. By employing the powerful ArcFace model trained on IBUG-500K, we calculate all cosine similarities between real training faces and corresponding generated faces. The average cosine similarity of ArcFace is 0.6248, surpassing all the baseline models by a clear margin.

**Close-set Face Generation.** 在Table 10中,我們量化了來自不同人臉辨識模型所重建人臉的真實性和身份保留度。對於每個模型,我們透過隨機選擇5K個類別索引來合成訓練身份。對於每個身份,不同的隨機輸入透過網路梯度逐漸更新為身份保留的人臉影像。我們所提出的ArcFace模型在影像品質上明顯優於基線方法(例如softmax、SphereFace和CosFace),FID分數為70.39。透過採用在IBUG-500K上訓練的強大ArcFace模型,我們計算真實訓練人臉和對應生成人臉之間的所有餘弦相似度。ArcFace的平均餘弦相似度為0.6248,明顯超過所有基線模型。

In Figure 14, we show the synthesized faces from the proposed ArcFace in comparison with the baseline CosFace model. As can be seen, ArcFace is able to reconstruct identity-preserved faces only by using the model parameters without training any additional discriminator and generator like in GAN [36]. Considering the image quality is only constrained by the classification loss and the BN priors, it is quite understandable that there exist some identity-unrelated artifacts in the generation results. Besides, there are many grey images in MS1MV3 and this statistic information is also stored in the BN parameters, thus some generated faces are not colorful. Compared to the baseline CosFace model, our ArcFace can depict better facial features of the real faces in terms of identity preservation and image quality.

在Figure 14中,我們展示了所提出的ArcFace與基線CosFace模型在合成臉部上的比較。可以看得出來,ArcFace可以單純使用模型參數來重建保留身份的人臉,而不用像GAN [36]那樣訓練任何額外的判別器和生成器。考慮到影像的品質僅受分類損失和BN prior的約束影響,生成結果中存在一些與身份無關的瑕疵也是可以理解的。此外,MS1MV3中有很多灰階影像,並且這些統計資訊也都儲存在BN的參數中,因此一些生成的人臉不是彩色的。與基線CosFace模型相比,我們的ArcFace在身份保留和影像品質方面可以更好地描繪真實人臉的臉部特徵。

![image](https://hackmd.io/_uploads/SJC2l3DEJx.png)

Fig. 14. Close-set face generation. ArcFace can generate identity-preserved face images only by using the model parameters without training any additional discriminator and generator like in GAN. The first column is the identity from the training data. Columns 2 to 4 are the outputs from our ArcFace model.
Columns 5 to 7 are the outputs from the baseline CosFace model.

![image](https://hackmd.io/_uploads/BynUZ3DNJx.png)

Table 10: FID and cosine similarity of different model inversion results. ArcFace model (ResNet50) for inversion is trained on MS1MV3, but the generated face images also exhibit high similarity from the view of the more powerful ArcFace model (ResNet100) trained on IBUG-500K. The margin parameter for each method is given in the bracket.

**Open-set Face Generation.** In Table 11, we compare inversion results of different models on LFW. For each pre-trained model, we first calculate the embedding features of 13,233 face images from LFW, and then we generate faces constrained to these target features through an $\ell_2$ loss. As we can see, ArcFace maintains best reconstruction quality and identity preservation, consistently outperforming the baseline models in both FID and average cosine similarity metrics. On the real faces of LFW, the ArcFace model (ResNet50) achieves 99.81% verification accuracy. On the generated faces, the verification accuracy slightly drops to 97.75% by using the same model ([MS1MV3, ResNet50, ArcFace]) for testing. For unbiased evaluation, we report the matching accuracy on LFW by employing the powerful ArcFace model (ResNet100) trained on IBUG-500K and this model is more susceptible to artifacts in the generated results. Even though there is a further drop in the verification accuracy (93.30%), the results compared to the baseline models further demonstrate the advantages of ArcFace in the [inversion problem](https://terms.naer.edu.tw/detail/6fb44439bd8dfd1bfb679aaf9c27a0f9/).

**Open-set Face Generation.** 在Table 11中,我們比較了不同模型在LFW上的逆向生成結果。對於每個預訓練模型,我們首先計算來自LFW的13,233張人臉影像的嵌入特徵,然後透過$\ell_2$ loss生成受這些目標特徵約束的人臉。我們可以看得出來,ArcFace保持了最佳的重建品質和身份保留,在FID和平均餘弦相似度指標方面始終優於基線模型。在LFW的真實人臉上,ArcFace模型(ResNet50)實現了99.81%的驗證準確率。在所生成的人臉上,使用相同的模型([MS1MV3,ResNet50,ArcFace])進行測試,驗證精度略微下降至97.75%。為了進行無偏的評估,我們透過採用在IBUG-500K上所訓練的較為強大的ArcFace模型(ResNet100)來報告LFW上的匹配準確率,該模型對於生成結果中的瑕疵較為敏感。儘管驗證精度進一步下降(93.30%),但與基線模型相比的結果進一步證明了ArcFace在[反演問題](https://terms.naer.edu.tw/detail/6fb44439bd8dfd1bfb679aaf9c27a0f9/)上的優勢。

![image](https://hackmd.io/_uploads/HkStZhD4Jx.png)

Table 11: FID, cosine similarity and verification accuracy on LFW of different model inversion results. The cosine similarity and the verification accuracy are tested by the ArcFace model (ResNet100) trained on IBUG-500K. The margin parameter for each method is given in the bracket.

Figure 15 illustrates our synthesis from features of LFW faces that contain appearance variations (e.g. age, gender, race, pose and occlusion). Similar to the previous experiment, our ArcFace model robustly depicts identity-preserved faces. The success of robustly handling those challenging factors comes from two properties: (1) the ArcFace network was trained to ignore those facial variations in its embedding features, and (2) real face distributions stored in the BN layers can be effectively exploited for face image synthesis. Even though ArcFace can invert most of the faces with realism and identity preservation, there exist some confusions during generation. In Figure 15(f), we show some inversion results from ArcFace containing gender confusions. Even though these confusions can be easily distinguished by human eyes, they exhibit high similarity from the view of the machine. In Figure 16, we further conduct an ablation study about ArcFace inversion without BN constraints.
As we can see from these results, constraints from the BN layers can make the generated faces more visually plausible. Without the BN constraints, the resulting face images lack natural image statistics and can be quite easily identified as unnatural.

Figure 15說明了我們從LFW人臉特徵合成的結果,這些人臉包含外觀變化(如年齡、性別、種族、姿勢與遮蔽)。與先前的實驗類似,我們的ArcFace模型能夠穩健地生成保持身份一致性的臉部影像。能夠穩健處理這些具有挑戰性因素的原因來自於兩個屬性:(1) ArcFace網路被訓練成可以在其嵌入特徵中忽略這些臉部變化;(2) 儲存在BN層中的真實臉部分佈可以有效地被用於臉部影像合成。儘管ArcFace能夠以良好的真實感與身份保留反演出大多數的人臉,不過在生成過程中仍然存在一些混淆的情況。在Figure 15(f)中,我們給出了ArcFace的一些包含性別混淆的反演結果。儘管這些混淆可以透過人眼輕鬆區分,不過從機器的角度來看,它們表現出很高的相似性。在Figure 16中,我們進一步對沒有BN約束的ArcFace反演進行了消融研究。從這些結果可以看得出來,BN層的約束可以使生成的人臉在視覺上更加合理。在沒有BN約束的情況下,所產生的人臉影像缺乏自然影像的統計訊息,並且很容易被辨識為非自然影像。

![image](https://hackmd.io/_uploads/H1B--hvEJe.png)

Fig. 15. Open-set face generation from the pre-trained ArcFace model. We show the ArcFace inversion results (right) under age, gender, race, pose and occlusion variations by only using the embedding features from LFW [89] test samples (left). In the bottom, we show some bad cases (e.g. gender confusion) generated from the ArcFace inversion.

![image](https://hackmd.io/_uploads/ryX4Z3DN1e.png)

Fig. 16. Open-set face generation without and with BN constraints. The first row is the original LFW [89] samples. The second row is the ArcFace inversion results without BN constraints, and the third row is the ArcFace inversion results with BN constraints.

## 5 CONCLUSIONS

In this paper, we first propose an Additive Angular Margin Loss function, named ArcFace, which can effectively enhance the discriminative power of deep feature embedding for face recognition. We further introduce sub-class into ArcFace to relax the intra-class constraint under massive real-world noises. The proposed sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces. This automatic isolation can be employed to clean large-scale web faces and we demonstrate that our method consistently outperforms the state of the art through the most comprehensive experiments. Apart from enhancing discriminative power, ArcFace can also strengthen the model's generative power, mapping feature vectors to face images. The pre-trained ArcFace model can generate identity-preserved face images for both subjects inside and outside the training data only by using the network gradient and BN priors. As the proposed ArcFace inversion only focuses on approximating the target identity feature, the facial poses and expressions are not controllable. In the future, we will explore controlling intermediate neuron activations to target specific facial poses and expressions during inversion. In addition, we will also explore how to make the face recognition model not invertible so that face images cannot be easily reconstructed from model weights to protect privacy.

在本篇論文中,我們首先提出了一種加性角度邊界損失函數(Additive Angular Margin Loss function),稱為ArcFace,它可以有效增強人臉辨識中深度特徵嵌入的判別能力。我們進一步在ArcFace中引入sub-class,以放鬆在大量現實世界噪點資料下的類別內約束。我們所提出的sub-center ArcFace鼓勵形成一個包含大多數乾淨人臉的主要子類別(dominant sub-class)和一些包含難以判定或噪點人臉的非主要子類別(non-dominant sub-classes)。這種自動隔離的方法可用於清理大規模的網路人臉資料,我們透過最全面的實驗證明我們的方法始終優於現有技術。除了增強判別能力外,ArcFace還可以增強模型的生成能力,將特徵向量映射到人臉影像。預訓練的ArcFace模型只需使用網路梯度和BN priors,就可以為訓練資料內部和外部的主體生成身份保留的人臉影像。由於ArcFace的反演僅著重於近似目標身份特徵,因此臉部姿勢和表情是無法控制的。在未來的研究中,我們將探索在反演過程中控制中間神經元的激活(activation),以控制特定的臉部姿勢和表情。此外,我們還將探索如何使人臉辨識模型不可逆,使得人臉影像無法輕易地從模型權重重建,以保護隱私。