Rethinking and Improving Relative Position Encoding for Vision Transformer

{%hackmd SybccZ6XD %}
###### tags: `paper`
# Rethinking and Improving Relative Position Encoding for Vision Transformer

What is the key to the success of transformers?
> self-attention: captures long-range dependencies

Drawback
> no position information

Two kinds of position embedding
> Absolute methods
> Relative position methods

Absolute methods
> fixed encodings
> e.g. sine and cosine functions
> ![](https://hackmd.io/_uploads/ryDlkKdL3.png)
> learnable encodings, obtained by training parameters
> ![](https://hackmd.io/_uploads/H1rcZY_Un.png)

Relative position methods
> encode the relative distance between input elements and learn the pairwise relations of tokens.
> https://arxiv.org/abs/1803.02155
> ![](https://hackmd.io/_uploads/H1nraitIh.png)

Contributions
> reduce the computational cost
> propose four new relative position encoding methods

Overall diagram
> blue: newly added parts
> ![](https://hackmd.io/_uploads/BJkDPqdL2.png)

Two relative position modes
> Bias
> Contextual

Original
> $e_{ij} = \frac{(x_iW^Q)(x_jW^K)^T}{\sqrt{d_z}}$

Bias mode
> $e_{ij} = \frac{(x_iW^Q)(x_jW^K)^T + b_{ij}}{\sqrt{d_z}}$
> $b_{ij} = r_{ij}$

Contextual mode
> $e_{ij} = \frac{(x_iW^Q)(x_jW^K)^T + b_{ij}}{\sqrt{d_z}}$
> $b_{ij} = (x_iW^Q)r_{ij}^T$

Piecewise index function
> maps a relative distance to an integer in a finite set; $r_{ij}$ is then indexed by that integer, so encodings are shared among different relative positions.
> ![](https://hackmd.io/_uploads/SkDQscu83.png)

Clip function
> $h(x) = \max(-\beta, \min(\beta, x))$
> https://arxiv.org/abs/1803.02155
> ![](https://hackmd.io/_uploads/HJnUasF8n.png)

Piecewise function
> $g(x) = \begin{cases} [x], & \text{if } |x| \leq \alpha \\ \operatorname{sign}(x) \times \min\left(\beta, \left[\alpha + \frac{\ln(|x|/\alpha)}{\ln(\gamma/\alpha)}(\beta-\alpha)\right]\right), & \text{if } |x| > \alpha \end{cases}$
> where $[\cdot]$ denotes rounding
> $\alpha:\beta:\gamma = 1:2:8$
> ![](https://hackmd.io/_uploads/Symv6sFU3.png)

My own parameter settings
> alpha = 2.5, beta = 5, gamma = 20
> ![](https://hackmd.io/_uploads/rkAFojdIh.png)
> alpha = 4, beta = 5, gamma = 20
> ![](https://hackmd.io/_uploads/SktjiidIh.png)
> alpha = 2.5, beta = 5, gamma = 10
> ![](https://hackmd.io/_uploads/BJJWnodIn.png)

2D relative position calculation $r_{ij}$
> undirected mapping methods: Euclidean and Quantization
> directed mapping methods: Cross and Product

Euclidean
> ![](https://hackmd.io/_uploads/HJTmasdIh.png)

Quantization
> ![](https://hackmd.io/_uploads/rJMH6su8h.png)
> quant maps a set of real numbers {0, 1, 1.41, 2, 2.24, ...} into a set of integers {0, 1, 2, 3, 4, ...}

Cross method
> ![](https://hackmd.io/_uploads/SyztaiuL3.png)

Product method
> ![](https://hackmd.io/_uploads/BJAqpouIn.png)

## Experiments

Baseline: DeiT

Directed mapping outperforms undirected mapping (Directed > Undirected)
Contextual mode outperforms bias mode (Contextual > Bias)
![](https://hackmd.io/_uploads/SyLVRo_8h.png)

Image classification: $\text{clip} \approx \text{piecewise}$
![](https://hackmd.io/_uploads/rkkYCsuIn.png)
![](https://hackmd.io/_uploads/Bykkl3dI3.png)
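
Worked sketch: clip vs. piecewise index

To make the clip function $h(x)$ and the piecewise function $g(x)$ from these notes concrete, here is a minimal Python sketch that evaluates both on a few 1D relative distances. It is only an illustration, not the paper's implementation. The names `clip_index` and `piecewise_index` are hypothetical, $[\cdot]$ is assumed to be Python's built-in `round` (which rounds ties to the nearest even integer), and the values alpha = 2, beta = 4, gamma = 16 are picked only because they follow the 1:2:8 ratio.

```python
import math


def clip_index(x: int, beta: int) -> int:
    """Clip function h(x) = max(-beta, min(beta, x))."""
    return max(-beta, min(beta, x))


def piecewise_index(x: float, alpha: float, beta: float, gamma: float) -> int:
    """Piecewise function g(x) as written in the notes.

    Assumption: [.] is implemented with Python's round().
    """
    if abs(x) <= alpha:
        # Near range: every distance keeps its own index.
        return round(x)
    # Far range: indices grow logarithmically with distance and saturate at beta.
    scaled = alpha + math.log(abs(x) / alpha) / math.log(gamma / alpha) * (beta - alpha)
    magnitude = min(beta, round(scaled))
    # Restore the sign of the original distance.
    return int(math.copysign(magnitude, x))


# Illustrative distances and parameters (hypothetical, ratio 1:2:8).
offsets = [0, 1, 2, 3, 4, 8, 16, 100]
print([clip_index(d, beta=4) for d in offsets])
# [0, 1, 2, 3, 4, 4, 4, 4]
print([piecewise_index(d, alpha=2, beta=4, gamma=16) for d in offsets])
# [0, 1, 2, 2, 3, 3, 4, 4]
```

With the same number of indices, clip treats every distance beyond beta identically, while the piecewise function still separates distances 4, 8 and 16 by using coarser buckets farther away.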