# Attentional Feature-Pair Relation Networks for Accurate Face Recognition
Aug 17, 2019
difficulty: 3
rating: 3
[paper](https://arxiv.org/abs/1908.06255v1)
A new architecture block to cope with imperfect face localization.
Important: this is not self-attention! This is bilinear attention.
As I understand the intuition, the proposed bilinear attention aggregates information about pairs of spatial locations, so the result should be treated as a co-occurrence of spatial features in the picture, which makes sense.
They also complain about the imperfect localization network and say that the proposed method helps a bit.
## Architecture
The whole pipeline picture looks complicated, so let's look closer at the details.

### Backbone

For the backbone it is OK to use any convolutional network; the paper uses ResNet-101. I see no particular preference for this backbone, but no alternatives were considered in the study.
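For illustration, here is a minimal sketch of such a feature extractor, assuming torchvision's ResNet-101 truncated before the classification head. The truncation point and the input resolution are my assumptions, chosen so the output grid is `9x9`:

```python
import torch
import torchvision.models as models

# Sketch: ResNet-101 without the classification head (assumption:
# the truncation point and 288x288 input are mine, not from the paper).
resnet = models.resnet101(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc

x = torch.randn(2, 3, 288, 288)  # 288 / 32 = 9, so the output grid is 9x9
feats = backbone(x)
print(feats.shape)  # torch.Size([2, 2048, 9, 9])
```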
### Reshaping Feature Maps
After the backbone we have `9x9` feature maps. Reshape them into a set of $N = 81$ local feature vectors $F_i \in \mathbb{R}^{D}$, one per spatial location.
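A one-liner sketch of this reshape (shapes assumed from the `9x9` grid and the formula below):

```python
import torch

feats = torch.randn(2, 2048, 9, 9)    # (B, D, 9, 9) backbone output
F = feats.flatten(2).transpose(1, 2)  # (B, 81, D): local features F_i, i = 1..81
```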

### Computing "Similarities"
This is not a similarity in the sense that we compute an inner product. The formula for the Bilinear Attention Map is
$$
\mathcal{A}_{ij} = p^\top \left(
\operatorname{ReLU}\left(
U^\top F_i
\right) \circ
\operatorname{ReLU}\left(
V^\top F_j
\right)
\right)
$$
There are some parameters:
$U\in \mathbb{R}^{D\times L}$, $V\in \mathbb{R}^{D\times L}$, $p\in \mathbb{R}^{L}$.
Having $U \ne V$ makes this operation non-symmetric, so $\mathcal{A}_{ij}\ne\mathcal{A}_{ji}$ in general.
In the end, $U^\top F_i$ and $V^\top F_j$ are just the local features projected into a lower-dimensional space.
### Pooling and softmax
$p$ is a learnable pooling vector that collapses the $L$-dimensional elementwise product into the scalar $\mathcal{A}_{ij}$.
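Putting the formula together, here is a minimal PyTorch sketch of the bilinear attention map. The softmax placement (over $j$ for each $i$) and the dimensions are my assumptions; the paper may normalize differently:

```python
import torch
import torch.nn as nn

class BilinearAttentionMap(nn.Module):
    """A_ij = p^T (ReLU(U^T F_i) ∘ ReLU(V^T F_j)), then softmax.

    A sketch only: dimensions and the softmax axis are assumptions.
    """
    def __init__(self, D: int, L: int):
        super().__init__()
        self.U = nn.Linear(D, L, bias=False)   # U ∈ R^{D×L}
        self.V = nn.Linear(D, L, bias=False)   # V ∈ R^{D×L}
        self.p = nn.Parameter(torch.randn(L))  # p ∈ R^{L}, learnable pooling

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        # F: (B, N, D) local features, N = 81 for a 9x9 grid
        left = torch.relu(self.U(F))   # (B, N, L) = ReLU(U^T F_i) for all i
        right = torch.relu(self.V(F))  # (B, N, L) = ReLU(V^T F_j) for all j
        # A[b, i, j] = p^T (left[b, i] ∘ right[b, j]) for every pair (i, j)
        A = torch.einsum('bil,bjl,l->bij', left, right, self.p)
        return torch.softmax(A, dim=-1)  # (B, N, N); A_ij ≠ A_ji in general

attn = BilinearAttentionMap(D=2048, L=256)
A = attn(torch.randn(2, 81, 2048))
print(A.shape)  # torch.Size([2, 81, 81])
```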
TODO: continue