# Self-attention

The Transformer architecture consists of two parts, an Encoder and a Decoder. The Encoder side led to bidirectional models such as BERT that are mainly used for representation learning, while the Decoder side led to models such as GPT that are mainly used for autoregressive generation. Many of the large generative models we know today (such as the GPT series and LLaMA) are mainly based on the Decoder architecture, but some generative models (such as T5 and BART) use an Encoder-Decoder structure.

## Single-head attention

Single-head attention uses only one attention head to compute the weights. Compared with multi-head attention it is simpler and a little cheaper to compute, but it keeps the core idea of the attention mechanism. For an input sequence, attention computes how strongly each token is related to every other token: the similarity between Query and Key determines the attention weights, and those weights are then used to take a weighted sum of the Value vectors, which produces a new representation for the token.

Note: "word" here actually means a token, not a word in the traditional NLP sense.

Example: the sentence 中和有一個永和路 ("Zhonghe has a Yonghe Road") can be split into the following four tokens.

|Token|中和|有|一個|永和路|
|:-:|:-:|:-:|:-:|:-:|

Suppose the attention matrix (1) comes out as follows:

$$
\begin{matrix}
& 中和 & 有 & 一個 & 永和路 \\
中和 & 0.5 & 0 & 0 & 0.5 \\
有 & 0 & 1 & 0 & 0 \\
一個 & 0 & 0 & 1 & 0 \\
永和路 & 0.5 & 0 & 0 & 0.5 \\
\end{matrix}
\tag{1}
$$

Each token then gets a new representation:

中和 = $0.5\times中和+0\times有+0\times一個+0.5\times永和路$

有 = $0\times中和+1\times有+0\times一個+0\times永和路$

一個 = $0\times中和+0\times有+1\times一個+0\times永和路$

永和路 = $0.5\times中和+0\times有+0\times一個+0.5\times永和路$

Self-attention determines how related different tokens are by computing the similarity between Query (Q) and Key (K). Related tokens receive higher attention weights and unrelated tokens receive lower ones. The weights are normalized with softmax so that they sum to 1, and are then used to take a weighted sum of the Value (V) vectors. The result is a new vector representation in which the token incorporates information from the other tokens. In the example above, 中和 and 永和路 are strongly related, while 有 and 一個 are filler words and stay independent of the other tokens.

### Single-head attention formula

Every article on this topic shows the following figure, so I'll include it too. This is how single-head attention is computed.

![image](https://hackmd.io/_uploads/BkS-FA_pJx.png)

$$
Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Q: Query. Represents what the current token is looking for.

K: Key. Represents the content that the query is matched against.

V: Value. Represents the information that is finally extracted, usually paired with the Key.

![image](https://hackmd.io/_uploads/HyqHTlYayl.png)

We won't explain why these operations are used yet; let's first go through what the operations are. Earlier we computed attention over a sentence (中和有一個永和路), so how does that relate to the three inputs Q, K, and V? The answer is that every token is projected into Q, K, and V. Before that, the sentence has to be turned into vectors (through an embedding layer). Assume tokenization and embedding are already done.

|Token|中和|有|一個|永和路|
|:-:|:-:|:-:|:-:|:-:|
|Vector|$X_1$|$X_2$|$X_3$|$X_4$|

where $X_i\in R^{d_x\times1}$, that is, $X_1= \left[\begin{array}{c}x_{11}\\x_{12}\\ \vdots \\x_{1d_x}\\ \end{array}\right]$

So the whole sentence (中和有一個永和路) is

$$X=\left[\begin{array}{c}X_1^T\\X_2^T\\X_3^T\\X_4^T\\ \end{array}\right] \in R^{4\times d_x}$$

Here I assumed the tokenizer yields 4 tokens. The sequence length differs from model to model, so in general we write $n$ for the number of tokens in the sentence.

$$X=\left[\begin{array}{c}X_1^T\\X_2^T\\ \vdots \\X_n^T\\ \end{array}\right] \in R^{n\times d_x}$$

![image](https://hackmd.io/_uploads/SJRZmWFa1e.png)

The full single-head attention computation is shown in the figure above. We split it into five steps and go through them one by one.

----------------

(1) Compute the Query, Key, and Value matrices

First $X$ is projected to Q, K, and V. The projection is simple: define three learnable parameter matrices

$$
W_Q\in R^{d_x\times d_q}, W_K\in R^{d_x\times d_k}, W_V\in R^{d_x\times d_v}
$$

where $d_q=d_k$ (so that the product $QK^T$ in the next step is defined).

$Q = X W_Q \rightarrow Q\in R^{n\times d_k}$

$K = X W_K \rightarrow K\in R^{n\times d_k}$

$V = X W_V \rightarrow V\in R^{n\times d_v}$

![image](https://hackmd.io/_uploads/Sy0Rqgtpke.png)

--------------

(2) Dot product (MatMul) to get the score

Compute the dot product of Query and Key to get the score:

$$
score = Q\times K^T \in R^{n \times n}
$$

![image](https://hackmd.io/_uploads/rJOtCgKTJx.png)

--------------

(3) Scale: in self-attention the dot-product result is divided by $\sqrt{d_k}$, mainly to keep the values from growing so large that they distort the softmax distribution, and to keep the gradients stable. Concretely, each element of the score matrix is the dot product of one row of $Q$ with one column of $K^T$, which means $d_k$ products added together. If the components of the two vectors are independent with mean 0 and variance 1, this sum has variance $d_k$, so a large $d_k$ produces scores of large magnitude. Softmax then becomes extreme (close to one-hot), a region where its gradients are very small, and the model becomes hard to train. To address this, the Transformer paper divides the dot product by $\sqrt{d_k}$, which brings the variance back to about 1 regardless of $d_k$. The scaling step therefore keeps the scores in a reasonable range, so that softmax does not saturate and the gradients do not become too small to learn from.

$$
score_{scaled} =\frac{score}{\sqrt{d_k}}=\frac{Q\times K^T}{\sqrt{d_k}}
$$

--------------

(4) Softmax: apply the softmax function to the scaled dot-product result to obtain the attention weight matrix (attention score). The main reason is that we want the attention score to behave like a probability: the weights of each token lie between 0 and 1, and each row sums to 1.

$$
attention_{score} = softmax(score_{scaled}) = softmax\left(\frac{Q\times K^T}{\sqrt{d_k}}\right)
$$

--------------

(5) Weighted sum: multiply the attention score by Value to get the weighted sum

$$
O = attention_{score} \times V \in R^{n \times d_v}
$$

![image](https://hackmd.io/_uploads/B1sb0xF61x.png)

```
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadAttention(nn.Module):
    def __init__(self, embed_dim):
        """
        :param embed_dim: embedding dimension for Query, Key and Value
        """
        super().__init__()
        self.embed_dim = embed_dim

        # Linear projections for Query, Key and Value
        # (nn.Linear also adds a bias term, which the formulas above omit)
        self.query_linear = nn.Linear(embed_dim, embed_dim)
        self.key_linear = nn.Linear(embed_dim, embed_dim)
        self.value_linear = nn.Linear(embed_dim, embed_dim)

        # sqrt(d_k), kept as a plain float so it works on any device
        self.scale = math.sqrt(embed_dim)

    def forward(self, x):
        """
        :param x: input data, Shape: (batch_size, seq_len, embed_dim)
        :return: output, Shape: (batch_size, seq_len, embed_dim),
                 and attention weights, Shape: (batch_size, seq_len, seq_len)
        """
        # (1) project to Query, Key, Value
        Q = self.query_linear(x)
        K = self.key_linear(x)
        V = self.value_linear(x)

        # (2) matmul and (3) scale
        attention_scores = torch.matmul(Q, K.transpose(-2, -1))
        attention_scores = attention_scores / self.scale

        # (4) softmax over the key dimension, so each row sums to 1
        attention_weights = F.softmax(attention_scores, dim=-1)

        # (5) weighted sum of the Value vectors
        output = torch.matmul(attention_weights, V)

        return output, attention_weights


batch_size = 1
seq_len = 4    # input sequence length
embed_dim = 6  # assume the word embedding dimension is 6

# generate data
x = torch.randn(batch_size, seq_len, embed_dim)

# init
attention = SingleHeadAttention(embed_dim)

# forward
output, attention_weights = attention(x)

print("Output:\n", output)
print("Output shape:\n", output.shape)
print("Attention Weights:\n", attention_weights)
print("Attention Weights shape:\n", attention_weights.shape)
```

The output is shown below. The attention matrix is 4×4, which matches the input sequence length we set. Each entry is the attention weight between a pair of tokens, and each row sums to 1.

![image](https://hackmd.io/_uploads/rkE7vEtayg.png)

## Multi-Head Attention

Every article on this topic shows the following figure, so I'll include it too. This is how multi-head attention is computed.

![image](https://hackmd.io/_uploads/HJa0fZK61x.png)

The core idea of Multi-Head Attention (MHA) is to learn different relationship patterns through several independent attention computations, rather than merely splitting the input into different subspaces. Concretely, the input $X$ goes through three linear transformations ($W_Q$, $W_K$, $W_V$) to give Q, K, and V, which are then split into multiple heads. Each head computes its attention weights independently, and finally the outputs of all heads are concatenated and passed through a linear layer that projects them back to the original dimension.

I used to think the Q, K, V projections were an extra Linear operation added to project into different spaces, but these linear transformations ($W_Q$, $W_K$, $W_V$) are in fact the standard part of Multi-Head Attention, not an additional design. Some implementations also slice the feature vector directly to emulate multiple heads, but that is not how the original Transformer was designed. In theory one could also use a single large matrix to learn all attention patterns, but that would increase the parameter count and the computational cost. As before, we split the computation into steps, four this time.

----------------

(1) Compute the Query, Key, and Value matrices

As in single-head attention, the input is first projected to Q, K, and V. The computation is the same as in single-head attention, so I won't repeat it.

$Q = X W_Q \rightarrow Q\in R^{n\times d_{dim}}$

$K = X W_K \rightarrow K\in R^{n\times d_{dim}}$

$V = X W_V \rightarrow V\in R^{n\times d_{dim}}$

![image](https://hackmd.io/_uploads/rkPzdQFTyx.png)

----------------

(2) Split the dimension into multiple heads

The feature dimension of Q, K, and V is split across the heads. Suppose the embedding dimension of the input is $d_{dim}$ and we use $h$ heads. Each head then gets a dimension of

$$
d_k = \frac{d_{dim}}{h}
$$

so Q, K, and V can be split into $h$ groups $(Q_i\in R^{n\times d_{k}}, K_i\in R^{n\times d_{k}}, V_i\in R^{n\times d_{k}})$

$Q=\left[\begin{array}{cccc} Q_1&Q_2&\cdots&Q_h\end{array}\right]$

$K=\left[\begin{array}{cccc} K_1&K_2&\cdots&K_h\end{array}\right]$

$V=\left[\begin{array}{cccc} V_1&V_2&\cdots&V_h\end{array}\right]$

Example with 2 heads:

![image](https://hackmd.io/_uploads/r1wEOQK6kg.png)

----------------

(3) Compute the attention of each head

This is the same computation as in single-head attention:

$$
Score_{i} = Q_i \times K_i^T \in R^{n \times n}
$$

$$
ScaledScore_i = \frac{Score_i}{\sqrt{d_k}}
$$

$$
Attention_i = Softmax(ScaledScore_i)
$$

$$
O_i = Attention_i \times V_i \in R^{n \times d_k}
$$

$\forall i=1,...,h$

![image](https://hackmd.io/_uploads/SynH_7K6ye.png)

----------------

(4) Concatenate the outputs of all heads, then apply a linear projection that learns how to combine them

$$
O = concat(O_1, O_2,...,O_h) \in R^{n \times d_{dim}}
$$

$$
Output = O \times W_O \in R^{n \times d}
$$

$W_O \in R^{d_{dim} \times d}$ is a learned parameter that maps the concatenated output back to the original dimension ($d$). In the code below, $d = d_{dim}$, both equal to `embed_dim`.

![image](https://hackmd.io/_uploads/rk1wd7Ypkl.png)

```
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        """
        :param embed_dim: embedding dimension for Query, Key and Value
        :param num_heads: number of heads
        """
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        assert self.head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads"

        # Linear projections for Query, Key and Value
        self.query_linear = nn.Linear(embed_dim, embed_dim)
        self.key_linear = nn.Linear(embed_dim, embed_dim)
        self.value_linear = nn.Linear(embed_dim, embed_dim)

        # Linear layer for the output (W_O)
        self.out = nn.Linear(embed_dim, embed_dim)

        # sqrt(d_k) with d_k = head_dim, kept as a plain float so it works on any device
        self.scale = math.sqrt(self.head_dim)

    def forward(self, x):
        """
        :param x: input data, Shape: (batch_size, seq_len, embed_dim)
        :return: output, Shape: (batch_size, seq_len, embed_dim),
                 and attention weights, Shape: (batch_size, num_heads, seq_len, seq_len)
        """
        batch_size = x.shape[0]

        # (1) project to Query, Key, Value
        Q = self.query_linear(x)  # [batch_size, seq_len, embed_dim]
        K = self.key_linear(x)    # [batch_size, seq_len, embed_dim]
        V = self.value_linear(x)  # [batch_size, seq_len, embed_dim]

        # (2) split into heads: [batch_size, num_heads, seq_len, head_dim]
        Q = Q.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # (3) matmul and scale: [batch_size, num_heads, seq_len, seq_len]
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        # softmax over the key dimension: [batch_size, num_heads, seq_len, seq_len]
        attention_weights = F.softmax(attention_scores, dim=-1)

        # weighted sum of Values: [batch_size, num_heads, seq_len, head_dim]
        output = torch.matmul(attention_weights, V)

        # (4) concat the heads: [batch_size, seq_len, embed_dim]
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_dim)

        # linear projection for the output: [batch_size, seq_len, embed_dim]
        output = self.out(output)

        return output, attention_weights


batch_size = 1
seq_len = 5
embed_dim = 4
num_heads = 2

# generate data
x = torch.randn(batch_size, seq_len, embed_dim)

# init
attention = MultiHeadAttention(embed_dim, num_heads)

# forward
output, attention_weights = attention(x)

print("Output:\n", output)
print("Output shape:\n", output.shape)
print("Attention Weights:\n", attention_weights)
print("Attention Weights shape:\n", attention_weights.shape)
```
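
Both classes above divide the attention scores by `self.scale`, the square root of the per-head dimension, as described in step (3) of the single-head section. As a rough, dependency-free illustration of why, the sketch below samples random Query and Key vectors and measures the spread of their dot product with and without the scaling. The function name `dot_product_std`, the sample count, and the assumption that every component is drawn independently from a standard normal distribution are mine and not part of the implementation above. Real Q and K values in a trained model need not follow that distribution.

```python
import math
import random
import statistics


def dot_product_std(d_k, num_samples=1000, scale=False, seed=0):
    """Empirical standard deviation of q . k for random vectors of size d_k.

    Assumption for this sketch: every component of q and k is drawn
    independently from a standard normal distribution (mean 0, variance 1).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        score = sum(qi * ki for qi, ki in zip(q, k))
        if scale:
            # the same scaling as in the attention formula: divide by sqrt(d_k)
            score /= math.sqrt(d_k)
        samples.append(score)
    return statistics.pstdev(samples)


for d_k in (4, 16, 64):
    raw = dot_product_std(d_k)
    scaled = dot_product_std(d_k, scale=True)
    print(f"d_k={d_k:3d}  raw std={raw:6.2f}  scaled std={scaled:5.2f}")
```

Under this assumption the raw standard deviation should grow roughly like $\sqrt{d_k}$, while the scaled one should stay near 1 for every $d_k$, which is what keeps softmax away from its saturated region.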