ROPE 旋轉位置編碼

# ROPE 旋轉位置編碼 **主要應用模型** 1. Google PaLM 2. Grok 3. llama ## Simple recall - Transformer模型結構跟順序位置無關 - encoder: 找到算子 $\mathcal{T}$ 轉換語意序列為高階語意向量序列 $$\mathcal{T}\begin{pmatrix} X\_1\\\\ X\_2\\\\ \vdots\\\\ X\_N \end{pmatrix} \rightarrow\begin{pmatrix} Y\_1\\\\ Y\_2\\\\ \vdots\\\\ Y\_N \end{pmatrix}$$ - $Y=\mathcal{T}=F(A(X))$ - Attention : A - Feedfoward : F - Attention $Q_{i}=X_iW_Q$ $K_{i}=X_iW_K$ $V_{i}=X_iW_V$ $Y_i=\sum_{j=1}^Nsim(Q_i, K_j)V_j$ $sim(Q_i, K_j)=\frac{exp(\frac{Q_iK_j^T}{\sqrt{D}})}{\sum_{j=1}^Nexp(\frac{Q_iK_j^T}{\sqrt{D}})}$ ## 為什麼需要位置編碼 $\mathcal{T}=F(A(X))$ - $F$是與位置無關的 - 可以修改$X$或是$A$ ## 如何加入位置編碼 **通常不會加入固定整數，因為整數無法融合至深度學習網路當中，所以把這些位置以embedding的形式加入** ### learning embedding - 簡單實現，Bert/GPT - 序列長度外推困難，沒有學習到過的位置 ### 自定義絕對位置編碼 - 二維函數 $f(position, dimension)$ - 要求 - 函數隨著position, dimension增長應該是的界的，無界對網路不友好 - 足夠區分度，對每個函數隨著position，對應的行向量應該是不同的 - **語意應該是與相對位置有關，而不是絕對位置** ## 修改Attention $Q_{i}=X_iW_Q$ $K_{i}=X_iW_K$ $V_{i}=X_iW_V$ $Y_i=\sum_{j=1}^Nsim(Q_i, K_j)V_j$ $sim(Q_i, K_j)=\frac{exp(\frac{Q_iK_j^T}{\sqrt{D}})}{\sum_{j=1}^Nexp(\frac{Q_iK_j^T}{\sqrt{D}})}$ ### 想法 - i和j之間的語意相似性應該包含相對的距離信息 - 希望相似性只依賴相對距離，而不是絕對位置 $$Q\_{i}K\_j^T=g(X\_{i},X\_j,i-j)$$ ## 行向量和矩陣 - 定義線性算子 \$\mathcal{A}\$ - 可以作用到行向量 \$\mathcal{A}(X\_i) = X\_{i} A\$ - 也可以作用到矩陣 \$\mathcal{A}(X) = XA\$ - 右乘矩陣等於對每個行向量逐個施加行變換 - 旋轉變換 $R(\theta)= \begin{pmatrix} cos\theta& sin\theta\\\\ -sin\theta& cos\theta \end{pmatrix}$ - 用對角陣的正交空間上施加不同的行變換，假設有兩個方陣A, B，設$X=(X^1, X^2)$，得 $$(X^1,X^2)\begin{pmatrix} A & 0 \\\\ 0 & B \end{pmatrix} = (X^1A, X^2 B)$$ ### 關於二維子空間的旋轉矩陣 $R(\theta)= \begin{pmatrix} cos\theta& sin\theta\\\\ -sin\theta& cos\theta \end{pmatrix}$ - 物理意義 - $XR(\theta)$對$X$逆時針旋轉$\theta$ ### 在高維中旋轉 **假設空間是偶數維的，把原始空間切分乘一個個獨立正交的二維子空間，在上面做獨立旋轉** ![image](https://hackmd.io/_uploads/SkADKl8dJe.png) - 性質 - 在獨立二維子空間上做不同角度的旋轉 - $XR(\theta)=(X^1, X^2)\begin{pmatrix} R(\theta\_{1}) & 0 \\\\ 0 & R(\theta\_2) \end{pmatrix}=(X^1R(\theta\_1), X^2R(\theta\_2)$ - $R(\theta)=\widehat{R}(\theta\_1)\widehat{R}(\theta\_2)\ldots\widehat{R}(\theta\_{D/2}$，逐個在不同子空間進行旋轉，定義$\widehat{R}(\theta\_k)= \begin{pmatrix} 1 & 0 & 0 & 0 &0\\\\ 0 & \ddots & 0 &0 & 0\\\\ 0 & 0 & R(\theta\_{k}) &0 & 0 \\\\ 0 & 0 &0 & \ddots & 0\\\\ 0 & 0 &0 &0 &1 \\\\ \end{pmatrix})$，表示只對第k個子空間做旋轉，其他子空間不動 ## 旋轉位置編碼 - 希望 - \$Q\_{i}K\_j^T=g(X\_{i},X\_j,i-j)\$ - 假設 - $Q_I, K_j$都是二維的向量 - $i, j$是它們對應的position，這裡\$\eta\_{i},\eta\_{j}\$ 是\$Q\_i, K\_j\$ 弧度表示 - 基於 - 點積只和模長與夾角有關 \$Q\_iK\_j^T=\\|Q\_i\\|\\|K\_j\\| cos(\eta\_{j}-\eta\_i)\$ - 如何融入位置信息 - 把兩個向量各自按照\$i, j\$角度旋轉後來計算點積 - \$Q\_iR(i)(K\_jR(j))^T\$ - 新的向量內積帶上了位置信息 ### 二維空間的一個解 \$Q_i=X_iW_QR(i\theta)\$ \$K_j=X_jW_KR(j\theta)\$ \$Q_iK_j^T=X_iW_QR(i\theta)R(j\theta)^TW^T_KX_j^T\$ \$Q_iK_j^T=X_iW_QR(i\theta)R(-j\theta)W^T_KX_j^T\$ \$Q_iK_j^T=X_iW_QR((i-j)\theta)W^T_KX_j^T\$ \$Q_iK_j^T=g(X_i, X_j, i-j)\$ #### 性質 - \$R(\theta)^T=R(-\theta)\$ - \$R(\theta_1)R(\theta_2)=R(\theta_1+\theta_2)\$ ### 推廣到高維空間 **假設空間是偶數維的，把整個空間分割成\$d=D/2\$個子空間，在各個子空間上分別按照獨立的角度來旋轉** - 基礎旋轉角度序列\$\Theta=(\theta\_{1},\theta\_2,\ldots,\theta\_{d})\$ - \$i\$ 位置的旋轉角度序列\$i\Theta=(i\theta\_{1},i\theta\_2,\ldots,i\theta\_{d})\$ - \$X\_{i}R(i\Theta)\$ 表示對\$X\_{i}\$ 在各個子空間分別做角度為\$i\theta\_1, i\theta\_2,\ldots,i\theta\_{d}\$ \$R(i\Theta)=\begin{pmatrix} R(i\theta\_{1}) & 0 &0 & 0\\\\ 0 & R(i\theta\_2) & 0 &0 \\\\ 0 & 0 &\ldots &0 \\\\ 0 & 0 & 0 &R(i\theta\_{d})\\\\ \end{pmatrix}\$ ### ROPE在高維空間 \$Q_i=X_iW_QR(i\theta)\$ \$K_j=X_jW_KR(j\theta)\$ \$Q_iK_j^T=X_iW_QR(i\theta)R(j\theta)^TW^T_KX_j^T\$ \$Q_iK_j^T=X_iW_QR(i\theta)R(-j\theta)W^T_KX_j^T\$ \$Q_iK_j^T=X_iW_QR((i-j)\theta)W^T_KX_j^T\$ \$Q_iK_j^T=g(X_i, X_j, i-j)\$ 其中 $R(i\theta)R(j\theta)^T=$ \begin{split} R(i\Theta)R(j\Theta)^{T} &= \widehat{R}(i\theta\_1)\widehat{R}(i\theta\_2)\ldots\widehat{R}( i\theta\_{d})\widehat{R}(j\theta\_{d})^{T}\ldots \widehat{R}(j\theta\_{2})^{T} \widehat{R}(j\theta\_{1})^{T} \\\\ &= (\widehat{R}(i\theta\_1)\widehat{R}(j\theta\_1)^T)(\widehat{R}(i\theta\_2)\widehat {R}(j\theta\_2)^T)\ldots(\widehat{R}(i\theta\_{d})\widehat{R}(j\theta\_{d})^T)\ \\\ &= \widehat{R}((i-j)\theta\_1)\widehat{R}((i-j)\theta\_2)\ldots \widehat{R}((i-j)\theta\_{d})\ \\\ &= R((i-j)\Theta)\\\\ \end{split} - 空間是\$D\$ 維度，\$d=D/2\$ - 有\$d\$ 個正交的二維子空間\$\mathcal{X}\_1, \mathcal{X}\_2, \dots, \mathcal{X}\_{d}\$ - 每個子空間\$\mathcal{X}\_{k}\$ 有一個旋轉角度基準\$\theta\_{k}\$, 一個基準旋轉矩陣\$R(\theta \_{k})\$ - 合併後的基準角度序列和旋轉序列是\$\Theta, R(\Theta)\$ - 每個子空間對應於三角函數中的一個週期\$2\pi/\theta\_{k}\$ - 對於每個位置\$i\$, 角度序列和旋轉序列是\$i\Theta, R(i\Theta)\$ | $\theta$ | $\theta_2$ | $\theta_1$ | ... | $\theta_d$ | | ----------- | ------------- | ------------- | --- | ------------- | | $R(\theta)$ | $R(\theta_1)$ | $R(\theta_2)$ | | $R(\theta_d)$ | | $i\theta$ | $i\theta_1$ | $i\theta_2$ | ... | $i\theta_d$ | #### 具體設定 - \$\theta\_{k}=10000^{-(k-1)/d}, k\in[1,2,\ldots,d]\$， - 記\$B=10000^{1/d}\$, 則\$\theta\_{k}=1/B^{k-1}\$ 是等比數列 - \$B>1, k\rightarrow \infty, \theta\_{k}\rightarrow 0, T\rightarrow\infty\$，越長，後面的位置編碼越小，不怎麼做旋轉 | $\theta$ | $0$ | $1/B$ | ... | $1/B^{d-1}$ | | ----------- | ------------- | ------------- | --- | ------------- | | $T$ | $2\pi$ | $2B\pi$ | | $2B^{d-1}\pi$ | #### 優勢 - 在多個block 前向傳遞的過程中position的訊息不會遺失 - 每個block都會先做QKV的投影，然後QK投影之後會做位置旋轉變換 ### 如何避免大量旋轉矩陣相乘 **對每個\$Q_i\$，乘上不同的旋轉矩陣** \$\mathcal{R}(Q)=\begin{pmatrix} Q\_1 R(1\Theta)\\\\ Q\_2 R(2\Theta)\\\\ \ldots \\\\ Q\_N R(N\Theta)\\\\ \end{pmatrix}\$ - 而每個\$R(i\Theta)\$ 是稀疏矩陣，直接matmul代價太大 #### 在二維空間求解假設是二維空間，\$Q=(U,V)\$, 記錄 \$cos=\begin{pmatrix}cos1\theta \\\\ cos 2\theta\\\ \ldots,\\\ cos N\theta \end{pmatrix}, sin=\begin{pmatrix}sin 1\theta \\\\ sin 2\theta\\\ \ldots,\\\ sin N\theta \end{pmatrix}\$ 那麼 \$\begin{aligned} \mathcal{R}(U,V)&=\begin{pmatrix} u\_1 cos 1\theta-v\_1 sin 1\theta, u\_1 sin 1\theta + v\_1 cos 1\theta\\\\ u\_2 cos 2\theta-v\_2 sin 2\theta, u\_2 sin 2\theta + v\_2 cos 2\theta\\\\ \ldots\\\\ u\_N cos N\theta-v\_N sin N\theta, u\_N sin N \theta + v\_N cos N\theta\\\\ \end{pmatrix}\\\\ &=(Ucos - Vsin, Usin+Vcos) \\\\ &= (U,V)cos +(-V, U)sin \end{aligned}\$ 同樣的，在高維度空間，我們可以把\$Q\$ 拆分成\$D/2\$ 個列向量\$U\_1,V\_1,U\_2,V\ _2,\ldots,U\_{D/2},V\_{D/2}\$ #### 進一步在各個子空間中，如果第\$k\$ 個子空間\$\mathcal{X}\_{d}\$ 的兩個列向量為\$(U\_{k}, V\_{k})\$, 我們有對應的旋轉結果 \$(U\_{1},V\_1)cos +(-V\_1, U\_{1})sin\$ \$(U\_{2},V\_2)cos +(-V\_2, U\_{2})sin\$ \$(U\_{d},V\_d)cos +(-V\_d, U\_{d})sin\$ 我們可以做拼接記\$\hat{U}=(U\_1, U\_2, \ldots, U\_d)\$, \$\hat{V}=(V\_1, V\_2, \ldots , V\_d)\$ \$(\hat{U},\hat{V})cos +(-\hat{V}, \hat{U})sin\$