# Learning Meta Face Recognition in Unseen Domains
reader: ferres
complexity: 4-5
rating: 4
[paper](https://arxiv.org/abs/2003.07733)
Proposes an optimization algorithm in which generalization acts as a constraint. The problem is a real-world scenario: you train a model on one domain and later apply it in a different, unseen domain.

## Method
In a nutshell
> optimize the model parameters, such that after updating on the **meta-train** domains, the model also performs well on the **meta-test** domain.
This claim intuitively leads to the update rule:
$$
\nabla \mathcal{L}_M =
\gamma \color{purple}{\nabla \mathcal{L}_S(\theta)}
+ (1-\gamma) \nabla \left(\mathcal{L}_T(\theta - \alpha \color{purple}{\nabla \mathcal{L}_S(\theta)})\right),
$$
where
* $\nabla \mathcal{L}_M$ -- gradient of the final meta loss
* $\nabla \mathcal{L}_S$ -- gradient of the loss on the meta-train (**S**ource) domains (we pretend this is all the data we see)
* $\nabla \mathcal{L}_T$ -- gradient of the loss on the meta-**T**est domain (we pretend this is the data we will be evaluated on)
In the equation we clearly see a gradient-through-gradient term:
$$
\nabla \left(\mathcal{L}_T(\theta - \alpha \color{purple}{\nabla \mathcal{L}_S(\theta)})\right)
$$
That is a difficulty for implementation. *The losses on the train and test data are not part of the algorithm and may be arbitrary!*
How do we get $\mathcal{L}_S$ and $\mathcal{L}_T$? At each iteration we first split **all** available datasets/domains into meta-train and meta-test, and then sample examples from those datasets, so every iteration gets its own split and its own batches.
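A minimal PyTorch sketch of one such iteration, assuming a generic embedding `model`, a callable `loss_fn(embeddings, batch)` and batches stored as dicts with an `"images"` key (all names are mine, not the paper's); `torch.func.functional_call` (PyTorch ≥ 2.0) evaluates the model at the inner-updated parameters, which is what produces the gradient-through-gradient term:
```python
import torch
from torch.func import functional_call

def meta_step(model, optimizer, batch_mtr, batch_mte, loss_fn,
              alpha=1e-3, gamma=0.5):
    """One meta-optimization step (sketch, not the authors' code).

    batch_mtr / batch_mte: batches sampled from the meta-train / meta-test
    domains; loss_fn(embeddings, batch) returns a scalar loss.
    """
    # Current parameters theta, as a name -> tensor dict for functional_call
    params = dict(model.named_parameters())

    # Meta-train (source) loss at theta
    L_S = loss_fn(model(batch_mtr["images"]), batch_mtr)

    # Inner update theta' = theta - alpha * grad(L_S).
    # create_graph=True keeps the graph, so the outer backward pass can
    # differentiate through this step (the gradient-through-gradient term).
    grads = torch.autograd.grad(L_S, tuple(params.values()), create_graph=True)
    fast_params = {name: p - alpha * g
                   for (name, p), g in zip(params.items(), grads)}

    # Meta-test loss evaluated at the updated parameters theta'
    out_T = functional_call(model, fast_params, (batch_mte["images"],))
    L_T = loss_fn(out_T, batch_mte)

    # Combined meta objective, then a regular optimizer step on theta
    L_M = gamma * L_S + (1 - gamma) * L_T
    optimizer.zero_grad()
    L_M.backward()
    optimizer.step()
    return L_M.detach()
```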
### Losses
However, the authors use their own losses, each with a bit of motivation.
#### Hard-pair Attention Loss
So the idea is to do explicit hard-pair mining within a batch: the model is most likely to struggle on hard examples, so the loss focuses on them.
* Sample $B$ pairs of **G**allery and **P**robe **F**ace embeddings ($F_{g_{b}}$, $F_{p_{b}}$).
* Mine hard samples based on a threshold (it could be any kind of mining). They do it with the similarity matrix $M_{i,j}=(F_{g_{i}})^\top(F_{p_{j}})$ and a threshold $\tau$, so the hard pairs are $\mathcal{N}=\left\{(i,j) \mid M_{i,j}>\tau, i\ne j \right\}$.
* Then use a triplet-like loss:
$$
\mathcal{L}_{h p}=\frac{1}{2|\mathcal{P}|} \sum_{i \in \mathcal{P}}\left\|F_{g_{i}}-F_{p_{i}}\right\|_{2}^{2}-\frac{1}{2|\mathcal{N}|} \sum_{(i, j) \in \mathcal{N}}\left\|F_{g_{i}}-F_{p_{j}}\right\|_{2}^{2}
$$
which says *true pairs should be closer together, false (hard negative) pairs farther apart*; here $\mathcal{P}$ is the set of positive (same-identity) pairs.
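A rough PyTorch sketch of this loss under my reading of the notation: `F_g`, `F_p` are L2-normalized `(B, d)` embedding matrices with row $i$ of each sharing an identity, `tau` is the mining threshold, and I take $\mathcal{P}$ to be all positive pairs (the paper may mine hard positives as well):
```python
import torch

def hard_pair_loss(F_g, F_p, tau=0.4):
    """Hard-pair attention loss (sketch).

    F_g, F_p: (B, d) L2-normalized gallery / probe embeddings,
              where F_g[i] and F_p[i] belong to the same identity.
    """
    B = F_g.size(0)

    # Positive term: squared distances of the genuine pairs
    pos = ((F_g - F_p) ** 2).sum(dim=1)                 # (B,)
    loss_pos = 0.5 * pos.mean()

    # Similarity matrix M_ij = F_g[i]^T F_p[j]; mine hard negatives N
    M = F_g @ F_p.t()                                   # (B, B)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=F_g.device)
    hard = (M > tau) & off_diag

    if hard.any():
        # Squared distances of the mined hard impostor pairs
        d2 = ((F_g.unsqueeze(1) - F_p.unsqueeze(0)) ** 2).sum(dim=2)  # (B, B)
        loss_neg = 0.5 * d2[hard].mean()
    else:
        loss_neg = F_g.new_zeros(())

    # Pull genuine pairs together, push hard impostor pairs apart
    return loss_pos - loss_neg
```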
#### Soft classification loss
Exactly the cross-entropy loss over scaled logits, but computed twice (once per side of each pair):
$$
\mathcal{L}_{c l s}=\frac{1}{2 B} \sum_{i=1}^{B}\left(\mathrm{CE}\left(y_{i}, s \cdot F_{g_{i}} W^{T}\right)+\mathrm{CE}\left(y_{i}, s \cdot F_{p_{i}} W^{T}\right)\right)
$$
This loss has proven useful elsewhere, so why not use it here?
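A small sketch with my naming: `W` is the classifier weight matrix and `s` the logit scale, as in normalized-softmax losses:
```python
import torch
import torch.nn.functional as F

def soft_classification_loss(F_g, F_p, labels, W, s=64.0):
    """Cross-entropy over scaled logits, once per side of each pair (sketch).

    F_g, F_p: (B, d) gallery / probe embeddings
    labels:   (B,) identity labels
    W:        (num_classes, d) classification weights
    """
    logits_g = s * F_g @ W.t()
    logits_p = s * F_p @ W.t()
    # cross_entropy already averages over B, so 0.5 * (...) gives the 1/(2B) factor
    return 0.5 * (F.cross_entropy(logits_g, labels) +
                  F.cross_entropy(logits_p, labels))
```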
#### Domain Alignment
Different domains are expected to cause a shift in the embedding space. To counter this, we would like the per-domain embeddings to lie close to each other, and this loss encourages exactly that.
$$
\begin{aligned}
c_{j} &=\frac{1}{B} \sum_{i=1}^{B}\left(\left(F_{g_{i}}^{\mathcal{D}_{j}}+F_{p_{i}}^{\mathcal{D}_{j}}\right) / 2\right) \\
c_{m t r} &=\frac{1}{n} \sum_{j=1}^{n} c_{j} \\
\mathcal{L}_{d a} &=\frac{1}{n} \sum_{j=1}^{n}\left\|s \cdot\left(c_{j}-c_{m t r}\right)\right\|_{2}^{2}
\end{aligned}
$$
They draw attention to $F_{g_{i}}^{\mathcal{D}_{j}}$, $F_{p_{i}}^{\mathcal{D}_{j}}$, noting that the embeddings are already normalized (did it not converge without normalization, or were there other problems?).
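A sketch of the alignment term, assuming the embeddings come grouped per meta-train domain as a list of `(F_g, F_p)` tuples and are already L2-normalized:
```python
import torch

def domain_alignment_loss(domain_embeddings, s=64.0):
    """Pull per-domain embedding centroids toward their common mean (sketch).

    domain_embeddings: list of (F_g, F_p) tuples, one per meta-train domain,
                       each of shape (B, d) and L2-normalized.
    """
    # Per-domain centroid c_j of the averaged gallery/probe embeddings
    centroids = torch.stack([
        ((F_g + F_p) / 2).mean(dim=0) for F_g, F_p in domain_embeddings
    ])                                     # (n_domains, d)

    c_mtr = centroids.mean(dim=0)          # mean of the domain centroids

    # Mean scaled squared distance of each centroid to the overall mean
    return (s * (centroids - c_mtr)).pow(2).sum(dim=1).mean()
```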
### The loss
The loss for the meta-train ("training") samples in the batch is the sum of all three losses:
$$
\mathcal{L}_{S}=\mathcal{L}_{h p}\left(\mathcal{X}_{S} ; \theta\right)+\mathcal{L}_{c l s}\left(\mathcal{X}_{S} ; \theta\right)+\mathcal{L}_{d a}\left(\mathcal{X}_{S} ; \theta\right)
$$
The loss for the meta-test ("testing") samples uses only two terms; **Domain Alignment** is omitted. Note that the parameters are not the same: $\theta'$ is the result of the inner update.
$$
\mathcal{L}_{T}=\mathcal{L}_{h p}\left(\mathcal{X}_{T} ; \theta^{\prime}\right)+\mathcal{L}_{c l s}\left(\mathcal{X}_{T} ; \theta^{\prime}\right)
$$
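Putting the pieces together, a sketch of how the two losses could be assembled from the building blocks above (the per-domain aggregation on the meta-train side is my assumption; the domain-alignment term is only applied there):
```python
def loss_S(mtr_domains, W, s=64.0, tau=0.4):
    """Meta-train loss: hard-pair + classification + domain alignment (sketch).

    mtr_domains: list of (F_g, F_p, labels) tuples, one per meta-train domain.
    """
    hp = cls = 0.0
    for F_g, F_p, y in mtr_domains:
        hp = hp + hard_pair_loss(F_g, F_p, tau)
        cls = cls + soft_classification_loss(F_g, F_p, y, W, s)
    n = len(mtr_domains)
    da = domain_alignment_loss([(F_g, F_p) for F_g, F_p, _ in mtr_domains], s)
    return hp / n + cls / n + da


def loss_T(F_g, F_p, labels, W, s=64.0, tau=0.4):
    """Meta-test loss: the same terms minus domain alignment,
    evaluated with the inner-updated parameters theta'."""
    return (hard_pair_loss(F_g, F_p, tau) +
            soft_classification_loss(F_g, F_p, labels, W, s))
```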

## Experiments
More results on hyperparameter tuning are provided in the paper; the most important conclusions are below.
### Method ablation conclusions
* The full (second-order) gradient update is better than the first-order approximation.
* Optimal $\gamma$ is about 0.5
* Splitting domains into meta-train and meta-test in the right proportion matters. In the experiments the best setting was to sample 2 domains as meta-train and 1 as meta-test (I think the point is that the meta-train portion should be larger than the meta-test one).
### New domain setup
Interesting results, and the comparison is fair enough; however, the contribution of each individual loss is not shown here.

For reference, the evaluation protocols are described in the paper.
### In-domain setup
The out-of-domain setup is new, but the classic in-domain setup matters as well. They evaluate on CACD-VS, CASIA NIR-VIS 2.0, Multi-PIE, MeGlass and Public-IvS, and the results look decent, usually among the top-2 (see the tables in the paper).
## Ideas to check
* Cross-domain setup with toy VK data
    * Req: implement the paper
* Cross-dataset setup
    * Exp: check the id merge problem
    * Req: implement the paper