# Learning Meta Face Recognition in Unseen Domains
reader: ferres
complexity: 4-5
rating: 4
[paper](https://arxiv.org/abs/2003.07733)
Proposes an optimization algorithm in which generalization acts as a constraint. The problem is a real-world scenario: you train a model on one domain and later apply it in a different, unseen domain.

## Method
In a nutshell
> optimize the model parameters, such that after updating on the **meta-train** domains, the model also performs well on the **meta-test** domain.
This claim intuitively leads to the update rule:
$$
\nabla \mathcal{L}_M =
\gamma \color{purple}{\nabla \mathcal{L}_S(\theta)}
+ (1-\gamma) \nabla \left(\mathcal{L}_T(\theta - \alpha \color{purple}{\nabla \mathcal{L}_S(\theta)})\right),
$$
where
* $\nabla \mathcal{L}_M$ -- gradient of the final meta loss
* $\nabla \mathcal{L}_S$ -- gradient of the loss on the meta-train (**S**ource) domains (we pretend this is all the data we see)
* $\nabla \mathcal{L}_T$ -- gradient of the loss on the meta-**T**est domain (we pretend this is the data we will be evaluated on)
In the equation we clearly see a gradient-through-gradient term:
$$
\nabla \left(\mathcal{L}_T(\theta - \alpha \color{purple}{\nabla \mathcal{L}_S(\theta)})\right)
$$
That is a difficulty for implementation. *The losses on the train and test data are not part of the algorithm and may be arbitrary!*
How do we get $\mathcal{L}_S$ and $\mathcal{L}_T$? At each iteration we first split **all** available datasets/domains into meta-train and meta-test, and then sample examples from those datasets, so every iteration gets its own split and its own batches.
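A minimal PyTorch sketch of one such iteration, assuming a generic embedding `model`, a callable `loss_fn(embeddings, batch)` and batches stored as dicts with an `"images"` key (all names are mine, not the paper's); `torch.func.functional_call` (PyTorch ≥ 2.0) evaluates the model at the inner-updated parameters, which is what produces the gradient-through-gradient term:
```python
import torch
from torch.func import functional_call

def meta_step(model, optimizer, batch_mtr, batch_mte, loss_fn,
              alpha=1e-3, gamma=0.5):
    """One meta-optimization step (sketch, not the authors' code).

    batch_mtr / batch_mte: batches sampled from the meta-train / meta-test
    domains; loss_fn(embeddings, batch) returns a scalar loss.
    """
    # Current parameters theta, as a name -> tensor dict for functional_call
    params = dict(model.named_parameters())

    # Meta-train (source) loss at theta
    L_S = loss_fn(model(batch_mtr["images"]), batch_mtr)

    # Inner update theta' = theta - alpha * grad(L_S).
    # create_graph=True keeps the graph, so the outer backward pass can
    # differentiate through this step (the gradient-through-gradient term).
    grads = torch.autograd.grad(L_S, tuple(params.values()), create_graph=True)
    fast_params = {name: p - alpha * g
                   for (name, p), g in zip(params.items(), grads)}

    # Meta-test loss evaluated at the updated parameters theta'
    out_T = functional_call(model, fast_params, (batch_mte["images"],))
    L_T = loss_fn(out_T, batch_mte)

    # Combined meta objective, then a regular optimizer step on theta
    L_M = gamma * L_S + (1 - gamma) * L_T
    optimizer.zero_grad()
    L_M.backward()
    optimizer.step()
    return L_M.detach()
```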
### Losses
However, the authors use their own losses, each with a bit of motivation.
#### Hard-pair Attention Loss
So the idea is to do explicit hard-pair mining within a batch: the model is most likely to struggle on hard examples, so the loss focuses on them.
* Sample $B$ pairs of **G**allery and **P**robe **F**ace embeddings ($F_{g_{b}}$, $F_{p_{b}}$).
* Mine hard samples based on a threshold (it could be any kind of mining). They do it with the similarity matrix $M_{i,j}=(F_{g_{i}})^\top(F_{p_{j}})$ and a threshold $\tau$, so the hard pairs are $\mathcal{N}=\left\{(i,j) \mid M_{i,j}>\tau, i\ne j \right\}$.
* Then use a triplet-like loss:
$$
\mathcal{L}_{h p}=\frac{1}{2|\mathcal{P}|} \sum_{i \in \mathcal{P}}\left\|F_{g_{i}}-F_{p_{i}}\right\|_{2}^{2}-\frac{1}{2|\mathcal{N}|} \sum_{(i, j) \in \mathcal{N}}\left\|F_{g_{i}}-F_{p_{j}}\right\|_{2}^{2}
$$
which says *true pairs should be closer together, false (hard negative) pairs farther apart*; here $\mathcal{P}$ is the set of positive (same-identity) pairs.
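A rough PyTorch sketch of this loss under my reading of the notation: `F_g`, `F_p` are L2-normalized `(B, d)` embedding matrices with row $i$ of each sharing an identity, `tau` is the mining threshold, and I take $\mathcal{P}$ to be all positive pairs (the paper may mine hard positives as well):
```python
import torch

def hard_pair_loss(F_g, F_p, tau=0.4):
    """Hard-pair attention loss (sketch).

    F_g, F_p: (B, d) L2-normalized gallery / probe embeddings,
              where F_g[i] and F_p[i] belong to the same identity.
    """
    B = F_g.size(0)

    # Positive term: squared distances of the genuine pairs
    pos = ((F_g - F_p) ** 2).sum(dim=1)                 # (B,)
    loss_pos = 0.5 * pos.mean()

    # Similarity matrix M_ij = F_g[i]^T F_p[j]; mine hard negatives N
    M = F_g @ F_p.t()                                   # (B, B)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=F_g.device)
    hard = (M > tau) & off_diag

    if hard.any():
        # Squared distances of the mined hard impostor pairs
        d2 = ((F_g.unsqueeze(1) - F_p.unsqueeze(0)) ** 2).sum(dim=2)  # (B, B)
        loss_neg = 0.5 * d2[hard].mean()
    else:
        loss_neg = F_g.new_zeros(())

    # Pull genuine pairs together, push hard impostor pairs apart
    return loss_pos - loss_neg
```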
#### Soft classification loss
Exactly the cross-entropy loss over scaled logits, but computed twice (once per side of each pair):
$$
\mathcal{L}_{c l s}=\frac{1}{2 B} \sum_{i=1}^{B}\left(\mathrm{CE}\left(y_{i}, s \cdot F_{g_{i}} W^{T}\right)+\mathrm{CE}\left(y_{i}, s \cdot F_{p_{i}} W^{T}\right)\right)
$$
This loss has proven useful elsewhere, so why not use it here?
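A small sketch with my naming: `W` is the classifier weight matrix and `s` the logit scale, as in normalized-softmax losses:
```python
import torch
import torch.nn.functional as F

def soft_classification_loss(F_g, F_p, labels, W, s=64.0):
    """Cross-entropy over scaled logits, once per side of each pair (sketch).

    F_g, F_p: (B, d) gallery / probe embeddings
    labels:   (B,) identity labels
    W:        (num_classes, d) classification weights
    """
    logits_g = s * F_g @ W.t()
    logits_p = s * F_p @ W.t()
    # cross_entropy already averages over B, so 0.5 * (...) gives the 1/(2B) factor
    return 0.5 * (F.cross_entropy(logits_g, labels) +
                  F.cross_entropy(logits_p, labels))
```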
#### Domain Alignment
Different domains are expected to cause a shift in the embedding space. To counter this, we would like the per-domain embeddings to lie close to each other, and this loss encourages exactly that.
$$
\begin{aligned}
c_{j} &=\frac{1}{B} \sum_{i=1}^{B}\left(\left(F_{g_{i}}^{\mathcal{D}_{j}}+F_{p_{i}}^{\mathcal{D}_{j}}\right) / 2\right) \\
c_{m t r} &=\frac{1}{n} \sum_{j=1}^{n} c_{j} \\
\mathcal{L}_{d a} &=\frac{1}{n} \sum_{j=1}^{n}\left\|s \cdot\left(c_{j}-c_{m t r}\right)\right\|_{2}^{2}
\end{aligned}
$$
They draw attention to $F_{g_{i}}^{\mathcal{D}_{j}}$, $F_{p_{i}}^{\mathcal{D}_{j}}$, noting that the embeddings are already normalized (did it not converge without normalization, or were there other problems?).
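A sketch of the alignment term, assuming the embeddings come grouped per meta-train domain as a list of `(F_g, F_p)` tuples and are already L2-normalized:
```python
import torch

def domain_alignment_loss(domain_embeddings, s=64.0):
    """Pull per-domain embedding centroids toward their common mean (sketch).

    domain_embeddings: list of (F_g, F_p) tuples, one per meta-train domain,
                       each of shape (B, d) and L2-normalized.
    """
    # Per-domain centroid c_j of the averaged gallery/probe embeddings
    centroids = torch.stack([
        ((F_g + F_p) / 2).mean(dim=0) for F_g, F_p in domain_embeddings
    ])                                     # (n_domains, d)

    c_mtr = centroids.mean(dim=0)          # mean of the domain centroids

    # Mean scaled squared distance of each centroid to the overall mean
    return (s * (centroids - c_mtr)).pow(2).sum(dim=1).mean()
```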
### The loss
The loss for the meta-train ("training") samples in the batch is the sum of all three losses:
$$
\mathcal{L}_{S}=\mathcal{L}_{h p}\left(\mathcal{X}_{S} ; \theta\right)+\mathcal{L}_{c l s}\left(\mathcal{X}_{S} ; \theta\right)+\mathcal{L}_{d a}\left(\mathcal{X}_{S} ; \theta\right)
$$
The loss for the meta-test ("testing") samples uses only two terms; **Domain Alignment** is omitted. Note that the parameters are not the same: $\theta'$ is the result of the inner update.
$$
\mathcal{L}_{T}=\mathcal{L}_{h p}\left(\mathcal{X}_{T} ; \theta^{\prime}\right)+\mathcal{L}_{c l s}\left(\mathcal{X}_{T} ; \theta^{\prime}\right)
$$
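Putting the pieces together, a sketch of how the two losses could be assembled from the building blocks above (the per-domain aggregation on the meta-train side is my assumption; the domain-alignment term is only applied there):
```python
def loss_S(mtr_domains, W, s=64.0, tau=0.4):
    """Meta-train loss: hard-pair + classification + domain alignment (sketch).

    mtr_domains: list of (F_g, F_p, labels) tuples, one per meta-train domain.
    """
    hp = cls = 0.0
    for F_g, F_p, y in mtr_domains:
        hp = hp + hard_pair_loss(F_g, F_p, tau)
        cls = cls + soft_classification_loss(F_g, F_p, y, W, s)
    n = len(mtr_domains)
    da = domain_alignment_loss([(F_g, F_p) for F_g, F_p, _ in mtr_domains], s)
    return hp / n + cls / n + da


def loss_T(F_g, F_p, labels, W, s=64.0, tau=0.4):
    """Meta-test loss: the same terms minus domain alignment,
    evaluated with the inner-updated parameters theta'."""
    return (hard_pair_loss(F_g, F_p, tau) +
            soft_classification_loss(F_g, F_p, labels, W, s))
```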

## Experiments
More results on hyperparameter tuning are provided in the paper; the most important conclusions are below.
### Method ablation conclusions
* The full (second-order) gradient update is better than the first-order approximation.
* Optimal $\gamma$ is about 0.5
* Splitting domains into meta-train and meta-test in the right proportion matters. In the experiments the best setting was to sample 2 domains as meta-train and 1 as meta-test (I think the point is that the meta-train portion should be larger than the meta-test one).
### New domain setup
Interesting results, and the comparison is fair enough; however, the contribution of each individual loss is not shown here.

For reference, the evaluation protocols are described in the paper.
### In-domain setup
The out-of-domain setup is new, but the classic in-domain setup matters as well. They evaluate on CACD-VS, CASIA NIR-VIS 2.0, Multi-PIE, MeGlass and Public-IvS, and the results look decent, usually among the top-2 (see the tables in the paper).
## Ideas to check
* Cross-domain setup with toy VK data
    * Req: implement the paper
* Cross-dataset setup
    * Exp: check the id merge problem
    * Req: implement the paper