{%hackmd SybccZ6XD %}

###### tags: `paper`

# Rethinking and Improving Relative Position Encoding for Vision Transformer

What is the key to the success of transformers?
> Self-attention: it captures long-range dependencies.

Drawback
> No position information.

Two kinds of position embedding
> Absolute methods
> Relative position methods

Absolute methods
> Fixed encodings, e.g. sine and cosine functions.
> Learnable encodings trained as parameters.

Relative position methods
> Encode the relative distance between input elements and learn the pairwise relations of tokens.
> https://arxiv.org/abs/1803.02155

Contributions
> Reduces the computational cost.
> Proposes four new relative position encoding methods.

Overall diagram
> Blue: newly added modules.

Two relative position modes
> Bias
> Contextual

Original
> $e_{ij} = \frac{(x_iW^Q)(x_jW^K)^T}{\sqrt{d_z}}$

Bias mode
> $e_{ij} = \frac{(x_iW^Q)(x_jW^K)^T + b_{ij}}{\sqrt{d_z}}$
> $b_{ij}=r_{ij}$

Contextual mode
> $e_{ij} = \frac{(x_iW^Q)(x_jW^K)^T + b_{ij}}{\sqrt{d_z}}$
> $b_{ij}=(x_iW^Q)r_{ij}^T$

Piecewise index function
> Maps a relative distance to an integer in a finite set; $r_{ij}$ can then be indexed by that integer, so encodings are shared among different relative positions.
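The two modes above can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: the function name and the `rel_bias` / `rel_emb` parameters are my own, and the relative tables are assumed to be precomputed from the index function.

```python
import numpy as np

def attention_logits(x, Wq, Wk, rel_bias=None, rel_emb=None):
    """e_ij = ((x_i Wq)(x_j Wk)^T + b_ij) / sqrt(d), where
    bias mode:       b_ij = r_ij            (scalar table, shape (n, n))
    contextual mode: b_ij = (x_i Wq) r_ij^T (embeddings, shape (n, n, d))
    """
    q, k = x @ Wq, x @ Wk
    d = q.shape[-1]
    logits = q @ k.T
    if rel_bias is not None:                 # bias mode: add a learned scalar
        logits = logits + rel_bias
    if rel_emb is not None:                  # contextual mode: dot query with r_ij
        logits = logits + np.einsum('id,ijd->ij', q, rel_emb)
    return logits / np.sqrt(d)
```

Note the difference: the bias mode adds the same $b_{ij}$ regardless of content, while the contextual mode lets $b_{ij}$ depend on the query $x_iW^Q$.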
Clip function
> $h(x) = \max(-\beta, \min(\beta, x))$
> https://arxiv.org/abs/1803.02155

Piecewise function
> $g(x) = \begin{cases} [x], & \text{if } |x|\leq\alpha \\ \text{sign}(x)\times \min(\beta, [\alpha+\frac{\ln(|x|/\alpha)}{\ln(\gamma/\alpha)}(\beta-\alpha)]), & \text{if } |x|>\alpha \end{cases}$
> $\alpha:\beta:\gamma = 1:2:8$

My settings
> alpha = 2.5, beta = 5, gamma = 20
> alpha = 4, beta = 5, gamma = 20
> alpha = 2.5, beta = 5, gamma = 10

2D relative position calculation $r_{ij}$
> Undirected mapping methods: Euclidean and Quantization
> Directed mapping methods: Cross and Product

Euclidean

Quantization
> quant maps a set of real numbers {0, 1, 1.41, 2, 2.24, ...} into a set of integers {0, 1, 2, 3, 4, ...}

Cross method

Product method

## Experiments

Baseline: DeiT

Directed > Undirected

Contextual > Bias

Image classification: $clip \approx piecewise$
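The clip function $h(x)$ and the piecewise function $g(x)$ above can be sketched directly; this is a small illustrative transcription (function names and the defaults, taken from the "My settings" values, are mine), with $[\cdot]$ read as rounding.

```python
import math

def clip_index(x, beta=5):
    """h(x): distances beyond +/-beta all share the boundary index."""
    return max(-beta, min(beta, x))

def piecewise_index(x, alpha=2.5, beta=5, gamma=20):
    """g(x): exact integer index within +/-alpha, logarithmically
    compressed (and capped at beta) for larger distances."""
    if abs(x) <= alpha:
        return round(x)
    compressed = alpha + math.log(abs(x) / alpha) / math.log(gamma / alpha) * (beta - alpha)
    return int(math.copysign(min(beta, round(compressed)), x))
```

Compared with the hard clip, the piecewise map keeps fine resolution for nearby tokens while still distinguishing moderately distant ones before saturating at $\pm\beta$.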