<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2104.09864) | [Note link](https://zhuanlan.zhihu.com/p/574478161) | [Code link](https://github.com/ZhuiyiTechnology/roformer) | arXiv 2021
## Abstract
They propose a novel method named **Rotary Position Embedding (RoPE)** to effectively leverage the positional information. RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation.
RoPE enables valuable properties, including:
1. The flexibility of sequence length
2. Decaying inter-token dependency with increasing relative distances
3. The capability of equipping the linear self-attention with relative position encoding.
They evaluate the enhanced transformer with rotary position embedding, also called **RoFormer**, on various long text classification benchmark datasets. Their experiments show that it consistently outperforms its alternatives.
## Introduction
PLMs achieve a significant improvement in parallelization over RNNs and, compared to CNNs, improve the ability to model longer intra-token relations.
It is noteworthy that the self-attention architecture of current PLMs has been shown to be position-agnostic, and various approaches have been proposed to encode the position information into the learning process.
Here, they introduce a novel method, namely Rotary Position Embedding (RoPE), to incorporate positional information into the learning process of PLMs. **RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation.** They also show that the enhanced transformer with rotary position embedding, namely RoFormer, gives better performance than baseline alternatives and thus demonstrates the efficacy of the proposed RoPE.
Their main contribution:
1. They introduce a novel method, namely Rotary Position Embedding (RoPE), to incorporate positional information into the learning process of PLMs. The key idea is to **encode the relative position by multiplying the context representations with a rotation matrix, which has a clear theoretical interpretation.**
2. They study the properties of RoPE and show that the inter-token dependency decays as the relative distance increases, which is desirable for natural language encoding.
3. They evaluate the proposed RoFormer on various long text benchmark datasets. Their experiments show that it consistently achieves better performance compared to its alternatives.
## Background and Related Work
### Preliminary
The existing approaches of transformer-based position encoding mainly focus on choosing a suitable function to form Equation $(1)$.
$$
\begin{aligned}
\boldsymbol{q}_m &= f_q(\boldsymbol{x}_m, m) \\
\boldsymbol{k}_n &= f_k(\boldsymbol{x}_n, n) \\
\boldsymbol{v}_n &= f_v(\boldsymbol{x}_n, n)
\end{aligned}
\tag{1}
$$
The query and key values are then used to compute the attention weights, while the output is computed as the weighted sum over the value representation.
$$
\begin{aligned}
a_{m, n} & =\frac{\exp \left(\frac{\boldsymbol{q}_m^{\boldsymbol{\top}} \boldsymbol{k}_n}{\sqrt{d}}\right)}{\sum_{j=1}^N \exp \left(\frac{\boldsymbol{q}_m^{\boldsymbol{\top}} \boldsymbol{k}_j}{\sqrt{d}}\right)} \\
\mathbf{o}_m & =\sum_{n=1}^N a_{m, n} \boldsymbol{v}_n
\end{aligned}
\tag{2}
$$
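To make Equations $(1)$ and $(2)$ concrete, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and random inputs are illustrative only.

```python
import numpy as np

def attention(Q, K, V):
    """Equation (2): softmax over q_m^T k_n / sqrt(d), then a weighted sum of values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # q_m^T k_n / sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    a = np.exp(scores)
    a /= a.sum(axis=-1, keepdims=True)               # attention weights a_{m,n}
    return a @ V                                     # o_m = sum_n a_{m,n} v_n

rng = np.random.default_rng(0)
N, d = 4, 8                                          # toy sequence length and head dimension
Q, K, V = rng.normal(size=(3, N, d))                 # stand-ins for q_m, k_n, v_n
out = attention(Q, K, V)                             # shape (N, d)
```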
### Absolute position embedding
A typical choice of Equation $(1)$ is
$$
f_{t:t\in\{q,k,v\}}(\boldsymbol{x}_i, i) := \boldsymbol{W}_{t:t\in\{q,k,v\}}(\boldsymbol{x}_i + \boldsymbol{p}_i) \tag{3}
$$
where $\boldsymbol{p}_i \in \mathbb{R}^d$ is a $d$-dimensional vector depending on the position of token $\boldsymbol{x}_i$.
Vaswani et al. (2017) proposed to generate $\boldsymbol{p}_i$ using the sinusoidal function:
$$
\begin{cases}
\boldsymbol{p}_{i, 2 t} & =\sin \left(i / 10000^{2 t / d}\right) \\
\boldsymbol{p}_{i, 2 t+1} & =\cos \left(i / 10000^{2 t / d}\right)
\end{cases} \tag{4}
$$
in which $\boldsymbol{p}_{i, 2t}$ is the $2t^{th}$ element of the d-dimensional vector $\boldsymbol{p}_i$.
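As a reference, a small NumPy sketch of Equations $(3)$ and $(4)$; `W_q` and `x_i` are hypothetical stand-ins for a learned projection and a token embedding.

```python
import numpy as np

def sinusoidal_embedding(i, d):
    """Equation (4): even entries are sin(i / 10000^{2t/d}), odd entries are cos."""
    t = np.arange(d // 2)
    freq = 1.0 / 10000.0 ** (2 * t / d)
    p = np.empty(d)
    p[0::2] = np.sin(i * freq)
    p[1::2] = np.cos(i * freq)
    return p

# Equation (3): the position vector is added to the token embedding before projection.
rng = np.random.default_rng(0)
d = 16
W_q = rng.normal(size=(d, d))          # hypothetical learned query projection
x_i = rng.normal(size=d)               # hypothetical token embedding at position i = 5
q_i = W_q @ (x_i + sinusoidal_embedding(5, d))
```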
### Relative position embedding
Another setting of Equation $(1)$ is the following:
$$
\begin{aligned}
f_q(\boldsymbol{x}_m) & := \boldsymbol{W}_q\boldsymbol{x}_m \\
f_k(\boldsymbol{x}_n, n) & := \boldsymbol{W}_k(\boldsymbol{x}_n + \tilde{\boldsymbol{p}}_r^k)\\
f_v(\boldsymbol{x}_n, n) & := \boldsymbol{W}_v(\boldsymbol{x}_n + \tilde{\boldsymbol{p}}_r^v)
\end{aligned}
\tag{5}
$$
where $\tilde{\boldsymbol{p}}_r^k, \tilde{\boldsymbol{p}}_r^v \in \mathbb{R}^d$ are trainable relative position embeddings, and $r = \text{clip}(m-n, r_\min, r_\max)$ represents the (clipped) relative distance between positions $m$ and $n$.
One of the most effective relative position embedding variants decomposes $\boldsymbol{q}_m^\top \boldsymbol{k}_n$ as:
$$
\boldsymbol{q}_m^\top \boldsymbol{k}_n = \boldsymbol{x}_m^\top\boldsymbol{W}_q^\top\boldsymbol{W}_k\boldsymbol{x}_n + \boldsymbol{x}_m^\top \boldsymbol{W}_q^\top\boldsymbol{W}_k \tilde{p}_{m-n} + \tilde{p}^\top_{m-n}\boldsymbol{W}_q^\top\boldsymbol{W}_k\boldsymbol{x}_n \tag{6}
$$
where the absolute position embeddings $\boldsymbol{p}_m$ and $\boldsymbol{p}_n$ of Equation $(3)$ are simply replaced with the relative position embedding $\tilde{\boldsymbol{p}}_{m-n}$.
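For intuition, a tiny NumPy sketch of the three terms in Equation $(6)$; all matrices, vectors, and the relative embedding $\tilde{p}_{m-n}$ are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x_m, x_n = rng.normal(size=d), rng.normal(size=d)   # token embeddings at positions m and n
p_rel = rng.normal(size=d)                          # stand-in for the trainable p~_{m-n}

score = (x_m @ W_q.T @ W_k @ x_n                    # content-content term
         + x_m @ W_q.T @ W_k @ p_rel                # content-to-relative-position term
         + p_rel @ W_q.T @ W_k @ x_n)               # relative-position-to-content term
```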
But in this work, they aim to derive the relative position encoding from Equation $(1)$ under some constraints. Next, they show that the derived approach is more interpretable by incorporating relative position information with the rotation of context representations.
## Proposed approach
### Formulation
Transformer-based language modeling usually leverages the position information of individual tokens through the self-attention mechanism. In Equation $(2)$, $\boldsymbol{q}_m^\top \boldsymbol{k}_n$ typically enables knowledge conveyance between tokens at different positions.
To incorporate relative position information, they require the inner product of the query $\boldsymbol{q}_m$ and the key $\boldsymbol{k}_n$ to be formulated by a function $g$ that takes only the word embeddings $\boldsymbol{x}_m$, $\boldsymbol{x}_n$ and their relative position $m-n$ as input, i.e., the inner product should encode position information only in the relative form:
$$
\left\langle f_q\left(\boldsymbol{x}_m, m\right), f_k\left(\boldsymbol{x}_n, n\right)\right\rangle=g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) \tag{7}
$$
The ultimate goal is to find an equivalent encoding mechanism, i.e., to solve for functions $f_q(\boldsymbol{x}_m, m)$ and $f_k(\boldsymbol{x}_n, n)$ that conform to the aforementioned relation.
### Rotary position embedding
They begin with the case $d = 2$ and use the geometric properties of vectors on a 2D plane and their complex form to show that a solution to formulation $(7)$ is:
$$
\begin{aligned}
f_q\left(\boldsymbol{x}_m, m\right) & =\left(\boldsymbol{W}_q \boldsymbol{x}_m\right) e^{i m \theta} \\
f_k\left(\boldsymbol{x}_n, n\right) & =\left(\boldsymbol{W}_k \boldsymbol{x}_n\right) e^{i n \theta} \\
g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) & =\operatorname{Re}\left[\left(\boldsymbol{W}_q \boldsymbol{x}_m\right)\left(\boldsymbol{W}_k \boldsymbol{x}_n\right)^* e^{i(m-n) \theta}\right]
\end{aligned}
\tag{8}
$$
where $\text{Re}[\cdot]$ is the real part of a complex number and $(\boldsymbol{W}_k\boldsymbol{x}_n)^*$ represents the complex conjugate of $(\boldsymbol{W}_k\boldsymbol{x}_n)$. $\theta \in \mathbb{R}$ is a preset non-zero constant. $f_{\{q,k\}}$ can further be written in matrix-multiplication form:
$$
f_{\{q, k\}}\left(\boldsymbol{x}_m, m\right)=\left(\begin{array}{cc}
\cos m \theta & -\sin m \theta \\
\sin m \theta & \cos m \theta
\end{array}\right)\left(\begin{array}{cc}
W_{\{q, k\}}^{(11)} & W_{\{q, k\}}^{(12)} \\
W_{\{q, k\}}^{(21)} & W_{\{q, k\}}^{(22)}
\end{array}\right)\left(\begin{array}{l}
x_m^{(1)} \\
x_m^{(2)}
\end{array}\right)
\tag{9}
$$
where $(x_m^{(1)}, x_m^{(2)})$ is $\boldsymbol{x}_m$ expressed in the 2D coordinates.
Similarly, $g$ can be written in matrix form, which gives a solution to the formulation in [Section (3.1)](#Formulation) for the 2D case.
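A quick NumPy check of the 2D case: rotating the projected query and key as in Equation $(9)$ and taking their inner product reproduces $g$ from Equation $(8)$, which depends on the positions only through $m - n$. The weights, positions, and $\theta$ below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
W_q, W_k = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
x_m, x_n = rng.normal(size=2), rng.normal(size=2)
m, n, theta = 7, 3, 0.5                              # arbitrary positions and a preset θ

def rot(a):                                          # 2D rotation matrix by angle a
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def cplx(v):                                         # view a 2D vector as a complex number
    return v[0] + 1j * v[1]

f_q = rot(m * theta) @ (W_q @ x_m)                   # Equation (9): rotate the projected query
f_k = rot(n * theta) @ (W_k @ x_n)                   # Equation (9): rotate the projected key

g = np.real(cplx(W_q @ x_m) * np.conj(cplx(W_k @ x_n)) * np.exp(1j * (m - n) * theta))
assert np.isclose(f_q @ f_k, g)                      # Equation (7): depends only on m - n
```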
To generalize the result in 2D to any $\boldsymbol{x}_i \in \mathbb{R}^d$ where $d$ is even, they divide the $d$-dimensional space into $d/2$ sub-spaces and combine them using the linearity of the inner product, so that $f_{\{q,k\}}$ becomes:
$$
f_{\{q,k\}} (\boldsymbol{x}_m, m) = \boldsymbol{R}_{\Theta, m}^d \boldsymbol{W}_{\{q,k\}}\boldsymbol{x}_m \tag{10}
$$
where $\boldsymbol{R}_{\Theta, m}^d$ is the block-diagonal rotary matrix composed of $d/2$ planar rotations, with pre-defined parameters $\Theta = \{ \theta_i = 10000^{-2(i-1)/d}, i \in [1,2,\dots,d/2]\}$. Applying their RoPE to the self-attention in Equation $(2)$, they obtain:
$$
\boldsymbol{q}_m^{\top} \boldsymbol{k}_n=\left(\boldsymbol{R}_{\Theta, m}^d \boldsymbol{W}_q \boldsymbol{x}_m\right)^{\top}\left(\boldsymbol{R}_{\Theta, n}^d \boldsymbol{W}_k \boldsymbol{x}_n\right)=\boldsymbol{x}_m^{\top} \boldsymbol{W}_q^{\top} \boldsymbol{R}_{\Theta, n-m}^d \boldsymbol{W}_k \boldsymbol{x}_n
\tag{11}
$$
where $\boldsymbol{R}_{\Theta, n-m}^d = (\boldsymbol{R}_{\Theta, m}^d)^{\top} \boldsymbol{R}_{\Theta, n}^d$, so the attention score depends on the token positions only through their relative distance.
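A minimal NumPy sketch of Equations $(10)$ and $(11)$ that builds the block-diagonal rotary matrix explicitly (only for illustration; practical implementations exploit the sparsity of $\boldsymbol{R}_{\Theta,m}^d$ instead of forming the full matrix). Dimensions and weights are arbitrary.

```python
import numpy as np

def rotary_matrix(m, d):
    """Block-diagonal R^d_{Θ,m}: d/2 independent 2D rotations with θ_i = 10000^{-2(i-1)/d}."""
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        c, s = np.cos(m * t), np.sin(m * t)
        R[2 * i: 2 * i + 2, 2 * i: 2 * i + 2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(0)
d = 8
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x_m, x_n = rng.normal(size=d), rng.normal(size=d)

q = rotary_matrix(5, d) @ W_q @ x_m                  # f_q(x_m, m) with m = 5
k = rotary_matrix(2, d) @ W_k @ x_n                  # f_k(x_n, n) with n = 2
q_shift = rotary_matrix(105, d) @ W_q @ x_m          # shift both positions by 100
k_shift = rotary_matrix(102, d) @ W_k @ x_n
assert np.allclose(q @ k, q_shift @ k_shift)         # Equation (11): score depends on n - m only
```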

### Properties of RoPE
**Long-term decay:** The inner product decays as the relative distance between the two positions increases. This property coincides with the intuition that a pair of tokens with a long relative distance should have less connection.
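To see the decay numerically, the paper bounds the rotated inner product by the averaged magnitude of the partial sums of $e^{i(m-n)\theta_j}$; a small sketch of that quantity, assuming an illustrative head dimension $d = 128$, shows it shrinking as the relative distance grows.

```python
import numpy as np

d = 128                                              # illustrative head dimension (assumption)
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)

def relative_upper_bound(rel_dist):
    """Average magnitude of partial sums of exp(i * (m - n) * θ_j) for a given m - n."""
    partial_sums = np.cumsum(np.exp(1j * rel_dist * theta))
    return float(np.abs(partial_sums).mean())

for dist in (1, 10, 50, 100, 250):
    print(dist, round(relative_upper_bound(dist), 2))  # decays overall as distance grows
```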
**RoPE with linear attention:** Since RoPE injects position information by rotation, which keeps the norm of the hidden representations unchanged, it can be combined with linear attention by multiplying the rotation matrices with the outputs of the non-negative functions:
$$
\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})_m=\frac{\sum_{n=1}^N\left(\boldsymbol{R}_{\Theta, m}^d \phi\left(\boldsymbol{q}_m\right)\right)^{\top}\left(\boldsymbol{R}_{\Theta, n}^d \varphi\left(\boldsymbol{k}_n\right)\right) \boldsymbol{v}_n}{\sum_{n=1}^N \phi\left(\boldsymbol{q}_m\right)^{\top} \varphi\left(\boldsymbol{k}_n\right)}
\tag{12}
$$
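Below is a minimal NumPy sketch of Equation $(12)$. It assumes $\phi = \varphi = \mathrm{elu}(\cdot) + 1$ (one common non-negative map used in linear attention) and the standard base-10000 $\theta_i$; it is an illustration rather than the authors' implementation.

```python
import numpy as np

def apply_rope(X):
    """Rotate each row x_m of X by the block-diagonal R^d_{Θ,m}, where m is the row index."""
    N, d = X.shape
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
    angles = np.arange(N)[:, None] * theta[None, :]  # m * θ_i, shape (N, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = X[:, 0::2], X[:, 1::2]                  # paired dimensions (2i, 2i+1)
    out = np.empty_like(X)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_linear_attention(Q, K, V):
    """Equation (12), assuming φ = ϕ = elu(x) + 1 as the non-negative map."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))  # elu(x) + 1
    Qr, Kr = apply_rope(phi(Q)), apply_rope(phi(K))  # R^d_{Θ,m} φ(q_m) and R^d_{Θ,n} ϕ(k_n)
    num = Qr @ (Kr.T @ V)                            # Σ_n (Rφ(q_m))^T (Rϕ(k_n)) v_n
    den = phi(Q) @ phi(K).sum(axis=0)                # Σ_n φ(q_m)^T ϕ(k_n), left un-rotated
    return num / den[:, None]
```

Note that, following Equation $(12)$, the rotation is applied to the numerator only, while the denominator keeps the un-rotated $\phi(\boldsymbol{q}_m)^\top \varphi(\boldsymbol{k}_n)$.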
### Theoretical Explanation
The paper additionally derives the 2D solution of Equation $(8)$ from the constraint in Equation $(7)$, describes a computationally efficient realization of the rotary matrix multiplication, and proves the long-term decay property.

## Experiments and Evaluation
They evaluate the proposed RoFormer on various NLP tasks as follows:
- Machine translation

- Pre-training Language Modeling

- GLUE benchmarks

- Performer with RoPE

- Chinese data

## Conclusions
In this paper, they proposed a new position embedding method that incorporates explicit relative position dependency in self-attention to enhance the performance of transformer architectures.
They show that the relative position can be naturally formulated using vector products in self-attention, with the absolute position information being encoded through a rotation matrix.
The experimental results also show that the proposed RoFormer achieves better performance on long text tasks.