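# Train LLM

# Types of training

- **Training from scratch**
  Building specialized LLMs entirely from domain-specific data. Some studies have also trained a large language model from scratch on a mix of general and domain-specific data.
- **Continual pretraining**
  Further pretraining an existing LLM on domain-specific data. This adapts the model to a new domain while keeping what it has already learned; in other words, it strengthens domain knowledge on top of the model's existing capabilities. Typically 10B to 100B tokens of data are used for continued pretraining. The biggest risk during continual pretraining is that the model loses old knowledge while learning new knowledge, the so-called forgetting problem.
- **Fine-tuning a base model**
  Instruction tuning (SFT) on top of a general-purpose model. The advantage is that good results appear quickly, but building the training dataset still takes considerable time.
- **General-purpose LLM + vector knowledge base**
  A domain knowledge base combined with a general-purpose LLM. For domains the general-purpose LLM has seen little of, a vector database (or a similar method) retrieves the content relevant to the question from the domain knowledge base, and the general-purpose model's strong summarization and question-answering abilities are then used to generate the response.

## Pretraining

The model starts out completely blank: it has no knowledge of the world and cannot even form English words. It is pretrained with next-token prediction on a large amount of scattered text, usually "unlabeled" data scraped from the web. This is a form of self-supervised learning.

After pretraining, the model has learned:

- language structure
- basic common sense

Fine-tuning comes after pretraining has provided basic language ability: labeled data is used to train the model's expertise on specific tasks. Pretraining yields general foundational ability; fine-tuning specializes the model.

# pretraining stage

-> text continuation (predict the next token)

1. Prepare a large corpus, in any form.
2. Tokenization: split sentences into small pieces (the smallest input units). A simple character-level encoding works, for example:

```python
# `text` is assumed to hold the full training corpus, already loaded as one string
chars = sorted(set(text))  # vocabulary: every distinct character in the corpus
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer ID
itos = {i: ch for i, ch in enumerate(chars)}  # integer ID -> character
encode = lambda s: [stoi[c] for c in s]  # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))
```

output (the exact IDs depend on the vocabulary of the corpus)

```
[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
```

   Alternatively, Google uses [sentencepiece](https://github.com/google/sentencepiece) and OpenAI uses [tiktoken](https://github.com/openai/tiktoken).
3. Build the training/validation datasets for next-token prediction. The whole dataset is not fed into the model at once; it is cut into blocks, which can be thought of as paragraphs.
    - `batch_size`: how many independent sequences are processed in parallel
    - `block_size`: the maximum context length for predictions
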
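
To make `batch_size` and `block_size` concrete, here is a minimal sketch of drawing one training batch. It uses plain Python lists rather than tensors, and the `get_batch` helper, the toy `data` list, and the fixed seed are assumptions made for illustration only; they are not taken from this note.

```python
import random

def get_batch(data, batch_size, block_size, rng):
    """Sample batch_size random windows; targets are the inputs shifted by one."""
    xs, ys = [], []
    for _ in range(batch_size):
        # Choose a start index so that both the input and target windows fit.
        i = rng.randrange(len(data) - block_size)
        xs.append(data[i:i + block_size])          # input tokens
        ys.append(data[i + 1:i + block_size + 1])  # next-token targets
    return xs, ys

data = list(range(100))   # hypothetical token IDs standing in for a real corpus
rng = random.Random(0)    # fixed seed so the sketch is reproducible
xb, yb = get_batch(data, batch_size=4, block_size=8, rng=rng)
print(len(xb), len(xb[0]))  # number of sequences, tokens per sequence
```

Each row of `xb` is one independent sequence, and the matching row of `yb` holds the token that follows each position, so a single batch carries `batch_size * block_size` prediction targets.
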
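
A single block of length `block_size` contains `block_size` training examples, one for every prefix of the block:

```python
# train_data is assumed to be a 1-D tensor of token IDs (the encoded corpus)
block_size = 8  # matches the 8 examples printed below

x = train_data[:block_size]       # inputs
y = train_data[1:block_size + 1]  # targets: the inputs shifted by one position
for t in range(block_size):
    context = x[:t + 1]  # everything seen so far
    target = y[t]        # the token that comes next
    print(f"when input is {context} the target: {target}")
```

```
when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58, 1]) the target: 15
when input is tensor([18, 47, 56, 57, 58, 1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58, 1, 15, 47]) the target: 58
```

4. Use a model that can generate sequences: Transformer, RNN, LSTM, Mamba ...
5. Compute the classification loss: softmax over the vocabulary, followed by cross-entropy against the true next token.
6. Choose an optimizer.
7. Backpropagate the gradients and update the parameters.

# fine tuning stage

Use QA pairs.

# Tokenization

Splitting text into tokens.

# Note

1. `<SOS>`, `<BOS>`, `<GO>`: mark the start of a sequence.
2. `<EOS>`: marks the end of a sequence and serves as the signal to stop.
3. `<MASK>`: used to mask some of the words in a sentence.
4. `<UNK>`: unknown token; stands for words that are not in the vocabulary.
5. `<SEP>`: separates two input sentences. For example, given input sentences A and B, a `<SEP>` token is added after sentence A and after sentence B.
6. `<CLS>`: placed at the very beginning of the sentence to mark its start. The name stands for "classification", and it typically appears in models such as BERT.
7. `<PAD>`: padding token. For example, to bring sentences to a fixed length, `<PAD>` is added before or after the sentence.

# reference

- [Customized large language models (LLM): continual pre-training for a specific domain](https://medium.com/@albertchen3389/%E5%AE%A2%E8%A3%BD%E5%8C%96%E7%9A%84%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B-llm-%E9%87%9D%E5%B0%8D%E7%89%B9%E5%AE%9A%E9%A0%98%E5%9F%9F%E5%81%9A-continual-pre-training-0a961a0161b9) (Medium, in Chinese)
- [Video tutorial by "the master" (大神)](https://www.youtube.com/watch?v=kCc8FmEb1nY)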
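
# example: classification loss

For the classification loss named in the pretraining stage, the following is a minimal sketch of softmax followed by cross-entropy for one prediction. It is written in plain Python for readability; the `softmax` and `cross_entropy` helpers and the example `logits` values are assumptions for illustration, and a real training loop would normally rely on a framework's built-in loss instead.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 0.5, -1.0, 0.0]  # hypothetical scores over a 4-token vocabulary
print(cross_entropy(logits, 0))  # low loss: token 0 has the highest score
print(cross_entropy(logits, 2))  # higher loss: token 2 has the lowest score

uniform = [0.0, 0.0, 0.0, 0.0]   # an untrained model that favors no token
print(cross_entropy(uniform, 2))  # equals ln(4), the log of the vocabulary size
```

A model that spreads probability evenly over a vocabulary of size V has a loss of ln V, which makes a handy sanity check for the loss at the very first training step.
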