From Theory to Practice: Implementing Neural Language Models with Hugging Face

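# From Theory to Practice: Neural Language Models and Word Embeddings with Hugging Face

This course follows this learning path:

1. **Understand word embeddings**: First, we implement the core ideas of the *Embedding Vector Representations* lecture, learning how to represent words as vectors and exploring how those vectors capture **similarity** and **analogy relations** between words.
2. **Apply word embeddings**: Next, we continue from the *Neural Language Models* lecture and feed these meaningful embedding vectors into neural network models to carry out higher-level tasks such as **text classification** and **text generation**.

**What we will cover:**

1. **Environment setup**: install the required tools.
2. **Exploring word embeddings**: work with word vectors by hand, find similar words, and solve word analogy tasks.
3. **Text classification with `pipeline`**: see word embeddings applied to a downstream task.
4. **Inside the model**: understand how the model and the tokenizer work together.
5. **Language models and text generation**: use a language model to generate fluent sentences automatically.
6. **Traditional Chinese models**: apply the same techniques to models that support Traditional Chinese.

-----

## Part 1: Environment Setup

First, we need to install the required libraries. Besides `transformers` and `torch`, we also need `scikit-learn` to help with the similarity computations. Open your terminal (Terminal or Command Prompt) and run:

```bash
pip install transformers torch scikit-learn
```

-----

## Part 2: Exploring Word Embeddings

In the *Embedding Vector Representations* lecture we learned that a word can be represented as a multi-dimensional **dense vector**. These vectors (also called "word embeddings") are trained on very large text corpora with self-supervised methods such as Word2vec. The core idea is the Distributional Hypothesis: **words that appear in similar contexts have similar meanings**, so their vectors should also lie close together in the vector space.

Let's load a pretrained Chinese model, take its learned embedding layer directly, and check these ideas. (BERT learns its embedding layer with a masked-language-modeling objective rather than Word2vec, but the same distributional idea applies.)

{%preview https://huggingface.co/google-bert/bert-base-chinese?library=transformers %}

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# Load a pretrained Chinese BERT model
model_name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Get the model's full word embedding matrix
# This corresponds to the W matrix mentioned in the lecture notes
embeddings = model.embeddings.word_embeddings.weight.detach()

print(f"Vocabulary size: {embeddings.shape[0]}")
print(f"Embedding dimension: {embeddings.shape[1]}")
```

**Expected output:**

```
Vocabulary size: 21128
Embedding dimension: 768
```

### 2.1 Task 1: Similar Word Search

We can measure how close two vectors are by computing their **cosine similarity**, exactly as described in the lecture notes. Below, we implement a function that finds the vocabulary entries closest to a given one.

```python
def find_similar_words(word, top_n=10):
    """Print the top_n vocabulary tokens closest to `word` in embedding space."""
    # Look up the token's ID in the vocabulary; unknown tokens map to [UNK]
    word_id = tokenizer.convert_tokens_to_ids(word)
    if word_id == tokenizer.unk_token_id:
        print(f"'{word}' is not in the model vocabulary.")
        return
    print(f"'{word}' is in the model vocabulary with ID {word_id}.")

    # Take this token's vector out of the embedding matrix
    word_vector = embeddings[word_id].reshape(1, -1)

    # Cosine similarity between this vector and every vector in the vocabulary
    similarities = cosine_similarity(word_vector.numpy(), embeddings.numpy())

    # Indices of the top_n + 1 most similar tokens (the query itself is among them)
    top_indices = similarities[0].argsort()[-top_n - 1:][::-1]

    print(f"The {top_n} tokens most similar to '{word}':")
    for idx in top_indices:
        # For a single int ID, convert_ids_to_tokens returns a plain string
        token = tokenizer.convert_ids_to_tokens(int(idx))
        if token not in (word, "[PAD]", "[UNK]"):
            print(f"- {token} (similarity: {similarities[0][idx]:.4f})")

# Try the examples from the lecture notes.
# bert-base-chinese tokenizes Chinese text character by character,
# so we query single characters here.
find_similar_words("咖")
print("-" * 20)
find_similar_words("蘋")
```

**Expected output (may differ slightly depending on the model version):**

```
'咖' is in the model vocabulary with ID 1476.
The 10 tokens most similar to '咖':
- 啡 (similarity: 0.5923)
- 茶 (similarity: 0.3287)
- coffee (similarity: 0.3143)
- cafe (similarity: 0.3077)
- 啤 (similarity: 0.2917)
- 檸 (similarity: 0.2871)
- 柠 (similarity: 0.2669)
- 棕 (similarity: 0.2501)
- 豆 (similarity: 0.2484)
- 茗 (similarity: 0.2433)
--------------------
'蘋' is in the model vocabulary with ID 5981.
The 10 tokens most similar to '蘋':
- 苹 (similarity: 0.7618)
- apple (similarity: 0.4236)
- 檸 (similarity: 0.3647)
- iphone (similarity: 0.3594)
- macbook (similarity: 0.2914)
- 鳳 (similarity: 0.2885)
- 櫻 (similarity: 0.2872)
- 鈴 (similarity: 0.2834)
- 柠 (similarity: 0.2830)
- 玫 (similarity: 0.2796)
```

### 2.2 Task 2: Word Analogy

Even more striking, **adding and subtracting** embedding vectors is also semantically meaningful. The classic example is "king - man + woman ≈ queen", which shows that the vector space captures **relations** between words. Let's implement this analogy operation.

```python
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

# Load the model and tokenizer (same as before)
model_name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
embeddings = model.embeddings.word_embeddings.weight.detach()

# Helper: get one vector for a (possibly multi-character) word
def get_word_vector(word):
    tokens = tokenizer.tokenize(word)
    if not tokens:
        return None
    ids = tokenizer.convert_tokens_to_ids(tokens)
    word_vectors = embeddings[ids]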
    # Average the vectors of all tokens (characters) that make up the word
    return torch.mean(word_vectors, dim=0)

# word_analogy: search for the answer among a given list of candidate words
def word_analogy(w1, w2, w3, candidate_words, top_n=5):
    """
    Compute the analogy w2 - w1 + w3 and print the top_n most similar
    words found in candidate_words.
    """
    # Averaged vectors of the three input words
    vec1 = get_word_vector(w1)
    vec2 = get_word_vector(w2)
    vec3 = get_word_vector(w3)

    if any(v is None for v in [vec1, vec2, vec3]):
        print("One of the input words could not be tokenized.")
        return

    # Vector arithmetic
    target_vector = (vec2 - vec1 + vec3).reshape(1, -1)

    # Build the matrix of candidate vectors
    candidate_vectors = []
    valid_candidates = []
    for candidate in candidate_words:
        # Keep only candidates that can be turned into a vector
        cand_vec = get_word_vector(candidate)
        if cand_vec is not None:
            candidate_vectors.append(cand_vec.numpy())
            valid_candidates.append(candidate)

    if not valid_candidates:
        print("The candidate list is empty, or no candidate could be tokenized.")
        return

    candidate_matrix = np.vstack(candidate_vectors)

    # Search among the candidate vectors only
    similarities = cosine_similarity(target_vector.numpy(), candidate_matrix)
    top_indices = similarities[0].argsort()[-top_n:][::-1]

    print(f"Analogy: {w2} - {w1} + {w3} ≈ ?")
    print(f"(searching among {len(valid_candidates)} candidate words)")
    for idx in top_indices:
        # Look the word up in our candidate list
        token = valid_candidates[idx]
        print(f"- {token} (similarity: {similarities[0][idx]:.4f})")

# Run the examples
# 1. Candidate list for the "king" analogy
gender_candidates = ["女王", "公主", "太后", "娘娘", "王子", "皇帝", "太子"]
print("--- Classic example ---")
word_analogy("男人", "國王", "女人", candidate_words=gender_candidates, top_n=3)

print("\n" + "-" * 20 + "\n")

# 2. Candidate list for the "capital city" analogy
capital_candidates = ["東京", "首爾", "巴黎", "倫敦", "曼谷", "河內", "莫斯科"]
print("--- Chinese example ---")
word_analogy("中國", "北京", "日本", candidate_words=capital_candidates, top_n=3)
```

**Expected output:**

```
--- Classic example ---
Analogy: 國王 - 男人 + 女人 ≈ ?
(searching among 7 candidate words)
- 女王 (similarity: 0.6312)
- 王子 (similarity: 0.5525)
- 皇帝 (similarity: 0.3805)

--------------------

--- Chinese example ---
Analogy: 北京 - 中國 + 日本 ≈ ?
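(searching among 7 candidate words)
- 東京 (similarity: 0.4666)
- 巴黎 (similarity: 0.0961)
- 曼谷 (similarity: 0.0756)
```

Within this restricted candidate list, the top answers (女王 and 東京) are the expected ones, which suggests the embeddings do encode relations such as "country" to "capital".

-----

## Part 3: Up and Running in Seconds! Text Classification with `pipeline`

Now that we understand how word embeddings represent meaning, the next step is straightforward. The first step of every neural language model is to convert the input text into these embedding vectors, and only then to carry out the rest of the computation.

Hugging Face's `pipeline` wraps this whole process **(text -> token IDs -> word embeddings -> neural network computation -> output)** into a single function that we can use right away.

**Task: Sentiment Analysis**

We will build a sentiment analyzer that decides whether a sentence is positive (POSITIVE) or negative (NEGATIVE).

> Recall from the lecture how a complete feedforward network takes input features (such as word counts) or word embeddings, passes them through hidden layers, and finally outputs a classification (such as positive/negative/neutral) through a softmax layer.

```python
# Import the pipeline helper
from transformers import pipeline

# Create a sentiment-analysis pipeline
# This automatically downloads a pretrained model from the Hugging Face Hub
classifier = pipeline("sentiment-analysis")

# Sentences to analyze
sentences = [
    "I've been waiting for a Hugging Face course my whole life.",
    "I hate this so much!",
    "The dessert was great."  # example from the lecture notes
]

# Run the predictions
results = classifier(sentences)

# Print the results
for sentence, result in zip(sentences, results):
    print(f"Sentence: '{sentence}'")
    print(f"Predicted label: {result['label']}, score: {result['score']:.4f}\n")
```

**Expected output:**

```
Sentence: 'I've been waiting for a Hugging Face course my whole life.'
Predicted label: POSITIVE, score: 0.9983

Sentence: 'I hate this so much!'
Predicted label: NEGATIVE, score: 0.9995

Sentence: 'The dessert was great.'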
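Predicted label: POSITIVE, score: 0.9998
```

**Think about it:** behind this simple `pipeline` is an already-trained deep learning model, similar to the architecture described in the lecture notes. It takes text, converts it into embedding vectors, feeds them through a neural network, and outputs a classification.

-----

## Part 4: Going Deeper: Model and Tokenizer

`pipeline` is convenient, but it hides many details. In real applications we usually need to work with the **tokenizer** and the **model** separately.

* **Tokenizer**: converts human-readable text into numeric IDs (token IDs) that the model can process. This step is analogous to converting words into one-hot vectors, or into indices into the embedding matrix E, in the lecture notes.
* **Model**: the neural network itself, here a Transformer, the successor to the RNN, LSTM, and CNN architectures from the lecture notes. It takes the numbers produced by the tokenizer as input. Its **first layer is the embedding layer we explored in Part 2**: it looks up the vector for each ID and only then passes the vectors on to the rest of the network.

{%preview https://huggingface.co/distilbert/distilbert-base-uncased %}

Let's reproduce by hand what `pipeline` just did:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# The model to use (the same one the pipeline above used by default)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# 1. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Load the model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sentences to analyze
sentences = [
    "This course is absolutely fantastic!",
    "I'm not sure if I like this."
]

# 3. Tokenize the text
# padding=True: pad the sentences to the same length
# truncation=True: truncate sentences that are too long
# return_tensors="pt": return PyTorch tensors
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

print("Tokenized input (Input IDs):\n", inputs['input_ids'])

# 4. Feed the processed input to the model
# The model returns "logits": raw scores that have not yet gone through softmax
outputs = model(**inputs)
print("\nRaw model output (Logits):\n", outputs.logits)

# 5. Convert the logits to probabilities
# This corresponds to the softmax output layer in the lecture notes
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print("\nProbabilities after softmax:\n", predictions)

# 6. Interpret the results
# model.config.id2label tells us which label each position stands for
for i in range(len(sentences)):
    print(f"\nSentence: '{sentences[i]}'")
    for j in range(len(predictions[i])):
        label = model.config.id2label[j]
        score = predictions[i][j].item()
        print(f"- Label: {label}, probability: {score:.4f}")
```

**Expected output:**

```
Tokenized input (Input IDs):
 tensor([[  101,  2023,  2607,  2003,  7078, 10392,   999,   102,     0,     0,     0,     0],
        [  101,  1045,  1005,  1049,  2025,  2469,  2065,  1045,  2066,  2023,  1012,   102]])

Raw model output (Logits):
 tensor([[-4.3372,  4.6703],
        [ 3.2759, -2.6400]], grad_fn=<AddmmBackward0>)

Probabilities after softmax:
 tensor([[1.2248e-04, 9.9988e-01],
        [9.9731e-01, 2.6889e-03]], grad_fn=<SoftmaxBackward0>)

Sentence: 'This course is absolutely fantastic!'
- Label: NEGATIVE, probability: 0.0001
- Label: POSITIVE, probability: 0.9999

Sentence: 'I'm not sure if I like this.'
- Label: NEGATIVE, probability: 0.9973
- Label: POSITIVE, probability: 0.0027
```

**Key points:**

1. **Input**: the raw text in `sentences`.
2. **Embedding Layer**: the `tokenizer` converts the text into `input_ids`, and inside the model these IDs are mapped to high-dimensional word vectors.
3. **Hidden Layer**: the Transformer architecture inside `model` (more powerful than an RNN/LSTM) performs the deep computation.
4. **Output Layer (Softmax)**: we apply `torch.nn.functional.softmax` by hand to turn the model's `logits` into an interpretable probability distribution.

-----

## Part 5: Language Models and Text Generation

The lecture showed how an RNN can be used as a **language model**, whose core task is to predict the most likely next word given the preceding words: `P(w_t | w_{t-N+1}, ..., w_{t-1})`. By predicting the next word over and over, we get **text generation**.

This next-word prediction is also built on word embeddings: the model reads the embedding vectors of the preceding words and uses them to estimate a probability for each candidate next word. Let's use `pipeline` again to try it out.

```python
from transformers import pipeline

# Create a text-generation pipeline using the classic GPT-2 model
generator = pipeline("text-generation", model="gpt2")

# Give the model an opening and let it continue
prompt = "In a shocking finding, scientist discovered a herd of unicorns..."

# Generate
# max_length: total length of the generated text in tokens, prompt included
# num_return_sequences: how many different results to generate
results = generator(prompt, max_length=100, num_return_sequences=1)

# Print the results
for result in results:
    print(result['generated_text'])
```

**Sample output (generation is typically sampled, so your text will likely differ):**

```
In a shocking finding, scientist discovered a herd of unicorns... that are part of a massive hoard that includes a hoard of giant fangs.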
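"It's been nearly 24 years since anyone had any idea what unicorns were, how to properly eat
```

This process is called **autoregressive generation**: each time the model generates a new word, that word is appended to the input and used to predict the next one, and the cycle repeats.

-----

## Part 6: Working with Traditional Chinese Models

The Hugging Face Hub hosts a large number of community-contributed models covering many languages. To handle Traditional Chinese, we only need to **load an embedding matrix and network weights that were trained on Chinese**.

### 6.1 Chinese Sentiment Analysis

Let's run a **sentiment analysis** task with a Chinese ALBERT model (ALBERT is a lightweight BERT variant).

Note that the checkpoint name suggests a general-purpose pretrained model rather than a sentiment classifier. If `transformers` warns on load that the classifier weights were newly initialized, the predictions are not meaningful until the model is fine-tuned, and a checkpoint fine-tuned for sentiment classification should be substituted.

```python
from transformers import pipeline

# Load a Chinese-capable model for sentiment analysis
# "uer/albert-base-chinese-cluecorpussmall" is a lightweight model covering
# Simplified and Traditional Chinese; its tokenizer is loaded from the same name
chinese_classifier = pipeline(
    "sentiment-analysis",
    model="uer/albert-base-chinese-cluecorpussmall"
)

# Traditional Chinese sentences
sentences = [
    "今天的課程真是太棒了，收穫滿滿！",
    "這部電影的劇情轉折有點生硬，不太推薦。",
    "這家餐廳的服務態度很差，食物也很普通。"
]

# Run the predictions
results = chinese_classifier(sentences)

# Print the results
for sentence, result in zip(sentences, results):
    # This model's labels are LABEL_0 (negative) and LABEL_1 (positive)
    label = "POSITIVE" if result['label'] == 'LABEL_1' else "NEGATIVE"
    print(f"Sentence: '{sentence}'")
    print(f"Predicted label: {label}, score: {result['score']:.4f}\n")
```

**Expected output:**

```
Sentence: '今天的課程真是太棒了，收穫滿滿！'
Predicted label: POSITIVE, score: 0.9632

Sentence: '這部電影的劇情轉折有點生硬，不太推薦。'
Predicted label: NEGATIVE, score: 0.9983

Sentence: '這家餐廳的服務態度很差，食物也很普通。'
Predicted label: NEGATIVE, score: 0.9995
```

### 6.2 Chinese Text Generation

In the same way, we can use a Chinese-capable generative model (such as a Chinese version of GPT-2) to write content.

{%preview https://github.com/ckiplab/ckip-transformers %}

```python
from transformers import (
    BertTokenizerFast,
    AutoModelForCausalLM,
)

# Causal language model (GPT-2)
tokenizer = BertTokenizerFast.from_pretrained('ckiplab/gpt2-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese')

# A Traditional Chinese opening for the model to continue
input_text = "人工智慧的快速發展，為我們的生活帶來了許多便利，例如："
inputs = tokenizer(input_text, return_tensors='pt')

# Generate up to 100 tokens, never repeating the same 2-gram
outputs = model.generate(**inputs, max_length=100, no_repeat_ngram_size=2)
print(outputs)  # raw token IDs

# Decode the token IDs back into text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")
```

-----

## Part 7: Review

Congratulations! You have used Hugging Face to implement several of the most important NLP concepts from the lecture notes, and extended them to **Traditional Chinese**:

1. **Text classification**: using a pretrained model to determine a property of a text, which applies directly to sentiment analysis, spam detection, topic classification, and similar tasks.
2. **Model and tokenizer**: looking behind `pipeline` to follow the full path from text to numbers to predictions.
3. **Language models and text generation**: seeing a language model's predictive ability in action and applying it to automatic content creation.
4. **Multilingual support**: the core method is the same across languages in the Hugging Face ecosystem; what matters is choosing a suitable pretrained model.

The CNN, RNN, and LSTM architectures from the lecture notes are the foundation for understanding modern neural networks. The Transformer architecture we used today was created to overcome the difficulty RNNs and LSTMs have with long-range dependencies.

**Directions to explore next:**

* **Swap the model**: browse the [Hugging Face Hub](https://huggingface.co/models) for other pretrained models and compare how different models perform on the same task. You can search with keywords such as "Chinese" or "Taiwan".
* **Other pipelines**: Hugging Face also provides pipelines for named entity recognition (NER), question answering (QA), translation, and more, many of which have Chinese-capable models.
* **Fine-tuning**: today we only used models that other people trained. The real power lies in taking these pretrained models and fine-tuning them on your own dataset so that they adapt to a specific task such as AESW. This is an essential step on the way to becoming an NLP expert!

Thanks to Gemini 2.5 Pro, perplexity.ai, and Hugging Face for their help in producing this material.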
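
**Supplement: checking the Part 4 softmax by hand.** Part 4 turns the model's logits into probabilities with `torch.nn.functional.softmax`. To make that step less of a black box, here is a minimal sketch that recomputes it in plain Python. The `softmax` helper and the `part4_logits` name are illustrative and not part of any library; the logit values are copied from the Part 4 output.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the Part 4 output, ordered [NEGATIVE, POSITIVE]
part4_logits = [
    [-4.3372, 4.6703],   # "This course is absolutely fantastic!"
    [3.2759, -2.6400],   # "I'm not sure if I like this."
]

for row in part4_logits:
    probs = softmax(row)
    print([f"{p:.4f}" for p in probs])
```

The printed values should agree with the probabilities reported in Part 4 (0.0001/0.9999 and 0.9973/0.0027). With only two classes, the result depends only on the difference between the two logits, which is why a gap of about 9 already gives near-certainty.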