# CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition

Apr 01, 2020

[CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition](https://arxiv.org/abs/2004.00288)

Introduces an adaptive decision boundary as an extension of the ArcFace loss. The modulation coefficient depends on the proportion of positive cosine similarities: the proportion is smaller in early stages of training and grows in later stages, so the penalty for hard samples* increases along with it.

*Hard samples are those that lie outside the decision boundary.

## Method

As in most discriminative softmax-based losses, there are positive and negative similarities, and the outputs are scaled by $s$.

$$\mathcal{L} = - \log \frac{\exp(s\,\color{green}{T}(\cos\theta_{y_i}))}{\exp(s\,\color{green}{T}(\cos\theta_{y_i})) + \sum_{j \ne y_i} \exp(s\,\color{red}{N}(t^{(i)}, \cos\theta_j))}$$

Positive similarity, as in ArcFace:

$$T(\cos \theta_{y_i}) = \cos(\theta_{y_i} + m)$$

Negative similarity:

$$N(t^{(i)}, \cos\theta_j)=\begin{cases} \cos\theta_j, & T(\cos\theta_{y_i})\ge\cos\theta_j\\ \cos\theta_j(t^{(i)} + \cos\theta_j), & T(\cos\theta_{y_i})<\cos\theta_j \end{cases}$$

Unlike in other papers (that I've seen so far), there is a quadratic term for hard samples.

$t^{(i)}$ is estimated as a moving average of $r^{(i)}$, the mean of the batch positive similarities:

$$r^{(i)}=\frac{1}{B}\sum_i \cos\theta_{y_i}$$

(in formula (9) the averaging seemed to be missing)

Note that $t^{(i)}$ may be negative, in which case hard examples are smoothed out.
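A minimal NumPy sketch of the modulation described above, assuming a per-batch cosine-similarity matrix; the function names, the EMA coefficient `alpha`, and the defaults `m=0.5`, `s=64.0` are my own illustrative choices, not prescribed by the paper:

```python
import numpy as np

def curricularface_logits(cos_theta, labels, t, m=0.5, s=64.0):
    """CurricularFace-style logits.

    cos_theta: (B, C) cosine similarities between embeddings and class centers
    labels:    (B,) ground-truth class indices
    t:         scalar estimate t^{(i)} of the mean positive similarity
    m, s:      additive angular margin and feature scale (as in ArcFace)
    """
    B = cos_theta.shape[0]
    cos_pos = cos_theta[np.arange(B), labels]          # cos(theta_{y_i})
    theta_pos = np.arccos(np.clip(cos_pos, -1.0, 1.0))
    target = np.cos(theta_pos + m)                     # T(cos theta_{y_i})

    out = cos_theta.copy()
    # Hard negatives (cos theta_j > T): apply the quadratic term cos*(t + cos)
    hard = out > target[:, None]
    out = np.where(hard, out * (t + out), out)
    out[np.arange(B), labels] = target                 # positive gets the margin
    return s * out

def update_t(t_prev, cos_pos_batch, alpha=0.01):
    # r^{(i)}: batch mean of positive similarities; t is its moving average
    r = cos_pos_batch.mean()
    return alpha * r + (1 - alpha) * t_prev
```

Note how the curriculum emerges: early in training `t` is small, so `t + cos(theta_j) < 1` and hard negatives are down-weighted; as `t` grows, the same term exceeds 1 and hard negatives are emphasized.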