llm.c Notes (1)

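# llm.c Notes (1)

---

1. encoder

wte
![image](https://hackmd.io/_uploads/S197giFlR.png)

wpe
![image](https://hackmd.io/_uploads/S1YIloFxC.png)

```
encoder_forward turns every token in the input, together with its position in the
sequence, into a vector of numbers. It produces one C-dimensional output vector per
position (b, t), not a single vector for the whole sentence or text.

Specifically, the function does three things:

1. Looks up the token embedding. Each token has an integer representation (its token ID),
   and that ID selects one row of a learned lookup table, wte. The row is a vector of
   fixed dimension, called the embedding.
2. Looks up the position embedding. Besides the token itself, where it sits in the
   sequence also matters, so the position index selects one row of a second table, wpe.
3. Adds the two. For each position, the token embedding and the position embedding are
   summed element by element. The result carries both token and position information
   and is what the later layers of the network consume (for language modelling, text
   classification, and similar tasks).

In short, encoder_forward converts token IDs into the numeric form the network can
process, while keeping the information later layers need.
```

```
encoder_forward

Purpose: compute the output vectors of the encoder layer.

Parameters:
  float* out    output vectors, shape (B, T, C)
  int* inp      integer array holding the token ID at each position, shape (B, T)
  float* wte    token embeddings, shape (V, C)
  float* wpe    position embeddings, shape (maxT, C)
  int B, T, C   B is the batch size, T the sequence length, C the embedding dimension

Procedure, for every position in every batch:
  1. Use the token ID from inp to find the matching token embedding in wte.
  2. Use the position t to find the matching position embedding in wpe.
  3. Add the two vectors and store the result at the matching position in out.
```

How should the encoder function be designed?
Besides the input and the output, the encoder also needs wte/wpe.

B (batch size): 3
T (sequence length): 4
C (embedding dimension): 5

```
void encoder_forward(float* out, int* inp, float* wte, float* wpe, int B, int T, int C)
```

```
void encoder_forward(float* out,
                     int* inp, float* wte, float* wpe,
                     int B, int T, int C) {
    // out is (B,T,C). At each position (b,t), a C-dimensional vector summarizing token & position
    // inp is (B,T) of integers, holding the token ids at each (b,t) position
    // wte is (V,C) of token embeddings, short for "weight token embeddings"
    // wpe is (maxT,C) of position embeddings, short for "weight positional embedding"
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // seek to the output position in out[b,t,:]
            float* out_bt = out + b * T * C + t * C;
            // get the index of the token at inp[b, t]
            int ix = inp[b * T + t];
            // seek to the position in wte corresponding to the token
            float* wte_ix = wte + ix * C;
            // seek to the position in wpe corresponding to the position
            float* wpe_t = wpe + t * C;
            // add the two vectors and store the result in out[b,t,:]
            for (int i = 0; i < C; i++) {
                out_bt[i] = wte_ix[i] + wpe_t[i];
            }
        }
    }
}

void encoder_backward(float* dwte, float* dwpe,
                      float* dout, int* inp,
                      int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            // seek to the incoming gradient at dout[b,t,:]
            float* dout_bt = dout + b * T * C + t * C;
            int ix = inp[b * T + t];
            // gradient rows for this token and this position
            float* dwte_ix = dwte + ix * C;
            float* dwpe_t = dwpe + t * C;
            // the forward pass is a sum, so the same gradient flows to both rows;
            // += accumulates, so dwte and dwpe must start from zero
            for (int i = 0; i < C; i++) {
                float d = dout_bt[i];
                dwte_ix[i] += d;
                dwpe_t[i] += d;
            }
        }
    }
}
```

```
How should we read the pointer arithmetic? Work through a numeric example.

Setup
  B (batch size): 3
  T (sequence length): 4
  C (embedding dimension): 5

The goal is to reach the data at batch index b = 1 and position index t = 2. Indices
start at 0, so this is the second batch and the third sequence position.

Explanation and calculation
  The three-dimensional array out is stored in order of batch, sequence position, and
  embedding dimension. Its total size is B * T * C = 3 * 4 * 5 = 60 floats.

Batch offset (b × T × C)
  Each batch holds T × C = 20 floats (4 sequence positions of 5 floats each).
  With b = 1 the batch offset is 1 × 4 × 5 = 20: starting from the beginning of the
  data, we skip the first 20 floats (all of batch 0) to reach the start of batch 1.

Position offset (t × C)
  Each sequence position holds C = 5 floats.
  With t = 2 the position offset is 2 × 5 = 10: inside the current batch, we move 10
  more floats forward from the start of the batch to reach position 2.

Total offset
  The total offset is 20 + 10 = 30, so out_bt points 30 floats past the start of out.

Using the data
  out_bt now points to the start of out[1, 2, :]. From here we can read every embedding
  dimension of that position, as the following code does:
```

```
float* out_bt = out + 30; // points to the start of out[1, 2, :]
float features[5];        // C = 5 embedding dimensions per sequence position
for (int i = 0; i < 5; i++) {
    features[i] = out_bt[i]; // copy the embedding at out[1, 2, :] into features
}
```

How do we confirm the implementation is correct?
1. Work through a numeric example.
2. Test it and compare both values and speed against PyTorch.

```
Setup
  Assume the following inputs and parameters:
    B (batch size): 1
    T (sequence length): 2
    C (embedding dimension): 3
    V (vocabulary size): 4
    maxT (maximum sequence length, used for the position embeddings): 5

  And the following data:
    inp: [2, 3]   // the IDs of two tokens, so the sequence has length 2

    wte (token embeddings):
      Token 0: [0.1, 0.2, 0.3]
      Token 1: [0.4, 0.5, 0.6]
      Token 2: [0.7, 0.8, 0.9]
      Token 3: [1.0, 1.1, 1.2]

    wpe (position embeddings):
      Position 0: [0.01, 0.02, 0.03]
      Position 1: [0.04, 0.05, 0.06]
      Position 2: [0.07, 0.08, 0.09]
      Position 3: [0.10, 0.11, 0.12]
      Position 4: [0.13, 0.14, 0.15]

Steps
  Initialize the output out:
    With T = 2 we prepare two output vectors, each of dimension C = 3,
    initialized to [0.0, 0.0, 0.0].

  Loop over every batch and every sequence position:
    For b = 0, t = 0:
      Read the token ID: inp[0] = 2
      Token 2 embedding: wte[2] = [0.7, 0.8, 0.9]
      Position 0 embedding: wpe[0] = [0.01, 0.02, 0.03]
      Sum and store: out[0, 0, :] = [0.7 + 0.01, 0.8 + 0.02, 0.9 + 0.03] = [0.71, 0.82, 0.93]
    For b = 0, t = 1:
      Read the token ID: inp[1] = 3
      Token 3 embedding: wte[3] = [1.0, 1.1, 1.2]
      Position 1 embedding: wpe[1] = [0.04, 0.05, 0.06]
      Sum and store: out[0, 1, :] = [1.0 + 0.04, 1.1 + 0.05, 1.2 + 0.06] = [1.04, 1.15, 1.26]

Result
  The final out holds one embedding vector per sequence position:
    First position:  [0.71, 0.82, 0.93]
    Second position: [1.04, 1.15, 1.26]

This shows how token embeddings and position embeddings are combined to produce the
output vector for each sequence position.
```

```
To make encoder_backward concrete, here is a numeric example of how it accumulates the
weight gradients during backpropagation. It uses the same parameters as the forward
example above.

Setup
  B (batch size): 1
  T (sequence length): 2
  C (embedding dimension): 3
  inp: [2, 3]   // the IDs of two tokens

  Assume the following gradients, passed back from the next layer of the network or
  from the loss function:
    dout (incoming gradient):
      Position 0: [0.5, 0.6, 0.7]
      Position 1: [0.8, 0.9, 1.0]

Steps
  The function accumulates into the gradients of the token embeddings and the position
  embeddings.

  Initialize the gradient buffers dwte and dwpe, all zeros:
    dwte: [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    dwpe: [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

  Loop over every batch and every sequence position:
    For b = 0, t = 0:
      Gradient from dout at position 0: [0.5, 0.6, 0.7]
      Token ID is 2, so row 2 of dwte becomes [0.5, 0.6, 0.7]
      Row 0 of dwpe becomes [0.5, 0.6, 0.7]
    For b = 0, t = 1:
      Gradient from dout at position 1: [0.8, 0.9, 1.0]
      Token ID is 3, so row 3 of dwte becomes [0.8, 0.9, 1.0]
      Row 1 of dwpe becomes [0.8, 0.9, 1.0]

Result
  The final gradient matrices are:
    dwte:
      [0.0, 0.0, 0.0]
      [0.0, 0.0, 0.0]
      [0.5, 0.6, 0.7]   // gradient for token ID 2
      [0.8, 0.9, 1.0]   // gradient for token ID 3
    dwpe:
      [0.5, 0.6, 0.7]   // gradient for position 0
      [0.8, 0.9, 1.0]   // gradient for position 1
      [0.0, 0.0, 0.0]
      [0.0, 0.0, 0.0]
      [0.0, 0.0, 0.0]

Each token ID and each position occurs only once in this example, so every updated row
equals a single dout row. If a token ID repeated, or if B were larger than 1, the +=
would sum several dout rows into the same row of dwte or dwpe.

This example shows how encoder_backward uses the gradient coming from the layer above
to build the weight gradients, which is a core part of backpropagation when training a
neural network.
```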
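The numeric examples above can also be checked mechanically. The following is a minimal sketch in C, not part of llm.c itself: `encoder_forward` and `encoder_backward` repeat the loops from these notes, while `bt_offset`, `all_close`, and `check_worked_examples` are hypothetical helpers introduced only for this illustration. The tolerance of `1e-5` is an assumption chosen to absorb float rounding.

```c
#include <stdio.h>

// Same loops as in the notes: out[b,t,:] = wte[inp[b,t],:] + wpe[t,:]
void encoder_forward(float* out, int* inp, float* wte, float* wpe, int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + b * T * C + t * C;
            int ix = inp[b * T + t];
            float* wte_ix = wte + ix * C;
            float* wpe_t = wpe + t * C;
            for (int i = 0; i < C; i++) {
                out_bt[i] = wte_ix[i] + wpe_t[i];
            }
        }
    }
}

// Same loops as in the notes: accumulate dout[b,t,:] into dwte[inp[b,t],:] and dwpe[t,:]
void encoder_backward(float* dwte, float* dwpe, float* dout, int* inp, int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* dout_bt = dout + b * T * C + t * C;
            int ix = inp[b * T + t];
            float* dwte_ix = dwte + ix * C;
            float* dwpe_t = dwpe + t * C;
            for (int i = 0; i < C; i++) {
                dwte_ix[i] += dout_bt[i];
                dwpe_t[i] += dout_bt[i];
            }
        }
    }
}

// Hypothetical helper: flat index of out[b, t, 0] in a row-major (B, T, C) array.
int bt_offset(int b, int t, int T, int C) {
    return b * T * C + t * C;
}

// Hypothetical helper: 1 if all n elements of a and b differ by at most tol, else 0.
static int all_close(const float* a, const float* b, int n, float tol) {
    for (int i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0) d = -d;
        if (d > tol) return 0;
    }
    return 1;
}

// Hypothetical driver: runs both worked examples and returns the number of mismatches.
int check_worked_examples(void) {
    enum { B = 1, T = 2, C = 3, V = 4, MAXT = 5 };
    int failures = 0;
    int inp[B * T] = {2, 3};
    float wte[V * C] = {0.1f, 0.2f, 0.3f,  0.4f, 0.5f, 0.6f,
                        0.7f, 0.8f, 0.9f,  1.0f, 1.1f, 1.2f};
    float wpe[MAXT * C] = {0.01f, 0.02f, 0.03f,  0.04f, 0.05f, 0.06f,
                           0.07f, 0.08f, 0.09f,  0.10f, 0.11f, 0.12f,
                           0.13f, 0.14f, 0.15f};

    // Forward example: expect [0.71, 0.82, 0.93] and [1.04, 1.15, 1.26].
    float out[B * T * C] = {0};
    const float want_out[B * T * C] = {0.71f, 0.82f, 0.93f, 1.04f, 1.15f, 1.26f};
    encoder_forward(out, inp, wte, wpe, B, T, C);
    if (!all_close(out, want_out, B * T * C, 1e-5f)) { printf("forward mismatch\n"); failures++; }

    // Backward example: gradient buffers start at zero because the function accumulates.
    float dout[B * T * C] = {0.5f, 0.6f, 0.7f, 0.8f, 0.9f, 1.0f};
    float dwte[V * C] = {0};
    float dwpe[MAXT * C] = {0};
    const float want_dwte[V * C] = {0, 0, 0,  0, 0, 0,  0.5f, 0.6f, 0.7f,  0.8f, 0.9f, 1.0f};
    const float want_dwpe[MAXT * C] = {0.5f, 0.6f, 0.7f,  0.8f, 0.9f, 1.0f,
                                       0, 0, 0,  0, 0, 0,  0, 0, 0};
    encoder_backward(dwte, dwpe, dout, inp, B, T, C);
    if (!all_close(dwte, want_dwte, V * C, 1e-5f)) { printf("dwte mismatch\n"); failures++; }
    if (!all_close(dwpe, want_dwpe, MAXT * C, 1e-5f)) { printf("dwpe mismatch\n"); failures++; }
    return failures;
}
```

Calling `check_worked_examples()` from a `main` function and using its return value as the exit code gives a quick pass/fail check, and `bt_offset(1, 2, 4, 5)` reproduces the offset of 30 from the indexing example. Comparing against PyTorch would follow the same pattern, with the expected arrays exported from a reference run instead of written by hand.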