---
# System prepended metadata

title: Rethinking and Improving Relative Position Encoding for Vision Transformer
tags: [paper]

---

###### tags: `paper`

# Rethinking and Improving Relative Position Encoding for Vision Transformer

What is the key to the success of transformers?
> self-attention: capture long-range dependencies

Drawback
> No position information

Two kinds of position embedding
> Absolute methods
> Relative position methods

Absolute methods
> fixed encodings
> e.g., sine and cosine functions
> ![](https://hackmd.io/_uploads/ryDlkKdL3.png)
> learnable encodings, trained as model parameters
> ![](https://hackmd.io/_uploads/H1rcZY_Un.png)
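
A minimal sketch of the fixed sine/cosine encoding mentioned above. It assumes the standard formulation with base 10000 and an even embedding size; the function name `sinusoidal_position_encoding` is made up for illustration.

```python
import numpy as np

def sinusoidal_position_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Fixed absolute encoding; assumes `dim` is even and base 10000."""
    positions = np.arange(num_positions)[:, None]                  # (n, 1)
    # One frequency per (sin, cos) pair of channels.
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = positions * freqs                                     # (n, dim/2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)   # even channels
    pe[:, 1::2] = np.cos(angles)   # odd channels
    return pe

pe = sinusoidal_position_encoding(num_positions=16, dim=8)
```

Each row of `pe` is added to the token embedding at that position, so the encoding depends only on the absolute index of the token.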


Relative position methods
> encode the relative distance between input elements and learn the pairwise relations of tokens.
> https://arxiv.org/abs/1803.02155
> ![](https://hackmd.io/_uploads/H1nraitIh.png)


Contributions
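> proposes an efficient implementation that reduces the computational cost of relative position encoding from $O(n^2d)$ to $O(nkd)$, where $k \ll n$
> proposes four new relative position encoding methods for 2D images, called image RPE (iRPE)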

Overall diagram
> blue: newly added
> ![](https://hackmd.io/_uploads/BJkDPqdL2.png)

Two relative position modes
> Bias
> Contextual

Original
> $e_{ij} = \frac{(x_iW^Q)(x_jW^K)^T}{\sqrt{d_z}}$

Bias mode
> $e_{ij} = \frac{(x_iW^Q)(x_jW^K)^T + b_{ij}}{\sqrt{d_z}}$
> $b_{ij}=r_{ij}$, where $r_{ij} \in \mathbb{R}$ is a learnable scalar

Contextual Mode
> $e_{ij} = \frac{(x_iW^Q)(x_jW^K)^T + b_{ij}}{\sqrt{d_z}}$
> $b_{ij}=(x_iW^Q)r_{ij}^T$, where $r_{ij} \in \mathbb{R}^{d_z}$ is a learnable vector
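
The two modes can be sketched side by side in NumPy. This is only an illustration under assumptions: $r_{ij}$ is taken to be a scalar in bias mode and a $d_z$-dimensional vector in contextual mode, the tables are random instead of learned, and the `bucket` matrix (a clipped 1D offset) is a hypothetical stand-in for the real index function.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_z, num_buckets = 5, 8, 4, 3

x = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, d_z))
W_k = rng.normal(size=(d_model, d_z))

# Hypothetical index function: clipped 1D offset, shifted into 0..num_buckets-1.
offsets = np.arange(n)[None, :] - np.arange(n)[:, None]
bucket = np.clip(offsets, -1, 1) + 1               # (n, n)

q, k = x @ W_q, x @ W_k
content = q @ k.T                                   # (x_i W^Q)(x_j W^K)^T

# Bias mode: b_ij = r_ij, one scalar per bucket (random here, learned in practice).
r_scalar = rng.normal(size=num_buckets)
e_bias = (content + r_scalar[bucket]) / np.sqrt(d_z)

# Contextual mode: b_ij = (x_i W^Q) r_ij^T, one vector per bucket.
r_vector = rng.normal(size=(num_buckets, d_z))
q_dot_r = q @ r_vector.T                            # (n, num_buckets)
b_ctx = np.take_along_axis(q_dot_r, bucket, axis=1) # gather per (i, j)
e_ctx = (content + b_ctx) / np.sqrt(d_z)
```

In this sketch the contextual term is computed once per bucket and then gathered, so the dot products scale with the number of buckets instead of the number of token pairs.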

Piecewise index function
> maps a relative distance to an integer in a finite set; $r_{ij}$ is then indexed by that integer, so encodings are shared among different relative positions.
> ![](https://hackmd.io/_uploads/SkDQscu83.png)

Clip function
> $h(x) = \max(-\beta, \min(\beta, x))$
> https://arxiv.org/abs/1803.02155
> ![](https://hackmd.io/_uploads/HJnUasF8n.png)
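
A small sketch of the clip function $h(x)$ applied to 1D offsets. The name `clip_index` and the choice $\beta = 2$ are assumptions for illustration only.

```python
import numpy as np

def clip_index(x, beta):
    """h(x) = max(-beta, min(beta, x)), applied elementwise."""
    return np.maximum(-beta, np.minimum(beta, x))

offsets = np.arange(-5, 6)                  # relative distances -5..5
clipped = clip_index(offsets, beta=2).tolist()
# clipped == [-2, -2, -2, -2, -1, 0, 1, 2, 2, 2, 2]
```

Every offset beyond $\pm\beta$ collapses to the same index, so all distant positions share one encoding.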


Piecewise function
> $g(x) = \begin{cases} [x], & \text{if } |x|\leq\alpha \\ \operatorname{sign}(x)\times \min\left(\beta, \left[\alpha+\frac{\ln(|x|/\alpha)}{\ln(\gamma/\alpha)}(\beta-\alpha)\right]\right), & \text{if } |x|>\alpha \end{cases}$
> where $[\cdot]$ denotes rounding
> $\alpha:\beta:\gamma = 1:2:8$
> ![](https://hackmd.io/_uploads/Symv6sFU3.png)
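
The piecewise function $g(x)$ can be sketched as follows, assuming $[\cdot]$ is ordinary rounding. The name `piecewise_index` and the example values $\alpha=2$, $\beta=4$, $\gamma=16$ (which follow the 1:2:8 ratio) are chosen here for illustration.

```python
import numpy as np

def piecewise_index(x, alpha, beta, gamma):
    """g(x): identity (rounded) near zero, log-spaced and capped at beta far away."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    near = np.round(x)
    # Clamp before the log so the unused branch never sees log(0).
    safe = np.maximum(abs_x, alpha)
    log_part = alpha + np.log(safe / alpha) / np.log(gamma / alpha) * (beta - alpha)
    far = np.sign(x) * np.minimum(beta, np.round(log_part))
    return np.where(abs_x <= alpha, near, far).astype(int)

indices = piecewise_index([0, 2, 4, 8, 16, 100, -8], alpha=2, beta=4, gamma=16).tolist()
# indices == [0, 2, 3, 3, 4, 4, -3]
```

Compared with clip, distances between $\alpha$ and $\gamma$ still receive distinct (though coarser) indices instead of all collapsing to $\beta$.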


My own parameter settings
> $\alpha = 2.5$, $\beta = 5$, $\gamma = 20$
> ![](https://hackmd.io/_uploads/rkAFojdIh.png)
> $\alpha = 4$, $\beta = 5$, $\gamma = 20$
> ![](https://hackmd.io/_uploads/SktjiidIh.png)
> $\alpha = 2.5$, $\beta = 5$, $\gamma = 10$
> ![](https://hackmd.io/_uploads/BJJWnodIn.png)

2D relative position calculation $r_{ij}$
> undirected mapping methods: Euclidean and Quantization
> directed mapping methods: Cross and Product

Euclidean
> ![](https://hackmd.io/_uploads/HJTmasdIh.png)

Quantization
> ![](https://hackmd.io/_uploads/rJMH6su8h.png)
> quant maps a set of real numbers {0, 1, 1.41, 2, 2.24, ...} into a set of integers {0, 1, 2, 3, 4, ...}
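
A rough sketch of the two undirected mappings on a small patch grid. It assumes the Euclidean method uses the plain 2D distance between patch coordinates, and it approximates `quant` by ranking the distinct distances that occur in the grid; the helper names are hypothetical, and the index function (clip or piecewise) would still be applied afterwards.

```python
import numpy as np

def grid_coords(height, width):
    """(x, y) coordinate of every patch, flattened row by row."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=1)     # (n, 2)

def euclidean_distance(coords):
    """Pairwise Euclidean distance between patches: real-valued, undirected."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def quantize(dist):
    """Assumed quant: rank of each distinct distance (0, 1, 1.41, 2, ... -> 0, 1, 2, 3, ...)."""
    squared = np.rint(dist ** 2).astype(int)   # squared grid distances are integers
    levels = np.unique(squared)                # sorted distinct values
    return np.searchsorted(levels, squared)

coords = grid_coords(3, 3)
dist = euclidean_distance(coords)
quant = quantize(dist)
```

Rounding the raw distance would merge neighbours at distance 1 and 1.41 into one index, whereas the quantized version keeps them apart.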

Cross method
> ![](https://hackmd.io/_uploads/SyztaiuL3.png)

Product method
> ![](https://hackmd.io/_uploads/BJAqpouIn.png)
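
A sketch of the two directed mappings, under the assumption that Cross sums two per-axis encodings while Product indexes a single table by the pair of per-axis indices. Scalar encodings (bias mode), random tables, the clip index function, and $\beta = 1$ are all simplifications made here.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, beta = 3, 3, 1

ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
xs, ys = xs.ravel(), ys.ravel()

def clip_index(x, beta):
    return np.clip(x, -beta, beta)

# Signed offsets per axis, shifted into 0..2*beta so they can index a table.
ix = clip_index(xs[:, None] - xs[None, :], beta) + beta   # (n, n)
iy = clip_index(ys[:, None] - ys[None, :], beta) + beta   # (n, n)
num = 2 * beta + 1

# Cross: one table per axis, the two encodings are summed.
p_x = rng.normal(size=num)
p_y = rng.normal(size=num)
r_cross = p_x[ix] + p_y[iy]

# Product: one table indexed by the (x, y) index pair.
p_xy = rng.normal(size=(num, num))
r_product = p_xy[ix, iy]
```

In this sketch Cross needs `2 * num` parameters and Product needs `num * num`; both keep the sign of the offset, so left/right and up/down neighbours get different encodings.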

## Experiments

Baseline: DeiT
Directed > Undirected
Contextual > Bias
![](https://hackmd.io/_uploads/SyLVRo_8h.png)

Image classification: $\text{clip} \approx \text{piecewise}$
![](https://hackmd.io/_uploads/rkkYCsuIn.png)

![](https://hackmd.io/_uploads/Bykkl3dI3.png)