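# Evaluation Results for Alias Model

## **alias_output**

### Aggregated Results:

- **Precision**: 0.75
- **Recall**: 0.49
- **F1 Score**: 0.59

### Per-Step Average:

- **Average Precision**: 0.84
- **Average Recall**: 0.63
- **Average F1 Score**: 0.68
- **Exact Match Accuracy**: 0.31
- **Total True Positives (TP)**: 116
- **Total False Positives (FP)**: 39
- **Total False Negatives (FN)**: 123
- **Error Count**: 0
- **Error Ratio**: 0.00

---

## **alias_baseline**

### Aggregated Results:

- **Precision**: 1.00
- **Recall**: 0.46
- **F1 Score**: 0.63

### Per-Step Average:

- **Average Precision**: 1.00
- **Average Recall**: 0.63
- **Average F1 Score**: 0.73
- **Exact Match Accuracy**: 0.38
- **Total True Positives (TP)**: 111
- **Total False Positives (FP)**: 0
- **Total False Negatives (FN)**: 128
- **Error Count**: 0
- **Error Ratio**: 0.00

---

# Possible Reasons for Issues

1. **Overfitting**:
   - The fine-tuned model may have overfit specific patterns during training, leading it to overpredict aliases (39 false positives versus 0 for the baseline).
2. **Insufficient Post-Processing**:
   - Without proper post-processing (e.g., deduplication, filtering), repeated or incorrect aliases remain in the `model_output`.
3. **Dataset Imbalance**:
   - If the training dataset disproportionately favors certain patterns, the fine-tuned model might overpredict some aliases and miss others.
4. **Baseline Simplicity**:
   - The baseline model outputs fewer aliases, potentially because its predictions are conservative. This keeps precision at 1.00 but limits recall (0.46 versus 0.49 for the fine-tuned model).

---

# Recommendations for Improvement

### **1. Enhance Post-Processing**

- Implement stricter deduplication and filtering for `model_output`.

### **2. Balanced Training Data**

- Ensure the training data includes a balanced representation of single-alias and multiple-alias cases.

### **3. Error Analysis**

- Focus on examples where false positives, duplicates, or false negatives occur, and adjust the model or preprocessing accordingly.

### **4. Fine-Tuning Strategy**

- Experiment with different loss functions or hyperparameters to better balance precision and recall.

## Improving the Model

Solving the problem at its root requires changes in both model training and generation.

### 1. Training Stage: Improve Model Learning

#### 1.1 Add a Repetition Penalty to the Loss Function

- During training, add a penalty term for generating repeated aliases so that the model learns to avoid repetition.
- Method:
  - For each batch of outputs, check whether repeated tokens exist and add a penalty to the loss:

```
def custom_loss_fn(output, target):
    # original_loss_fn and repetition_penalty are placeholders
    # that the training code must define.
    loss = original_loss_fn(output, target)
    penalty = repetition_penalty(output)  # custom repetition penalty
    return loss + penalty
```

#### 1.2 Increase Data Diversity

- The training data may be too uniform, causing the model to rely too heavily on specific patterns (such as generating repeated aliases).
- Solution:
  - Add more diverse data, for example more training samples that carry multiple aliases:
    Input: a person's profile / Output: ["杜甫", "子美", "少陵", "詩聖"]

#### 1.3 Contrastive Learning

- Train the model to distinguish correct alias sets from incorrect ones (such as sets that contain duplicates), which helps it generate aliases more precisely.
- Implementation idea:
  - For each sample, build one correct alias set (ground truth) and one set with duplicates, and let the model learn to tell them apart:
    Correct set: ["杜甫", "子美"] / Incorrect set: ["杜甫", "子美", "杜甫"]

---

### 2. Generation Stage: Introduce Decoding Strategies

#### 2.1 Use a Repetition Penalty

- At generation time, lower the probability of tokens that have already been generated. This is an effective fix that requires no retraining.
- Implementation:
  - In the transformers framework, use the `repetition_penalty` parameter:

```
outputs = model.generate(input_ids, repetition_penalty=2.0)
```

#### 2.2 Diversity in Beam Search

- Use a more flexible Beam Search strategy to avoid repetition in the generated results.
- Method:
  - Add a repetition check during Beam Search. With `no_repeat_ngram_size=2`, no 2-gram can appear twice in a generated sequence:

```
outputs = model.generate(input_ids, num_beams=5, no_repeat_ngram_size=2)
```

## Contrastive Learning

The data must be prepared and formatted to meet the requirements of contrastive learning, so that the model can learn to tell correct alias sets from incorrect ones. The sections below describe the data preparation in detail and suggest formats.

### 1. Core Logic of Contrastive Learning

The goal of contrastive learning is to teach the model to distinguish:

- Positive examples: the correct alias set that belongs to an entity.
- Negative examples: aliases that do not belong to the entity, or aliases that are semantically close to a positive example but incorrect.

### 2. Data Formats

#### (1) Input-Output Format

For each entity, provide data with the following fields:

- Input: the entity's primary name or a contextual description.
- Positive Aliases: the entity's correct alias set.
- Negative Aliases: a set of incorrect or repeated aliases (the negative examples for contrastive learning).

Example (JSON format):

```
[
  {
    "entity": "杜甫",
    "positive_aliases": ["杜甫", "子美", "少陵", "詩聖"],
    "negative_aliases": ["杜甫杜甫", "少陵子美", "杜甫123"]
  },
  {
    "entity": "李白",
    "positive_aliases": ["李白", "青蓮居士", "詩仙"],
    "negative_aliases": ["李白李", "詩仙123", "青蓮李"]
  }
]
```

#### (2) Sentence Pair Format

If the model needs pairwise data, pair the entity with each candidate alias and label whether the pair is a positive example.

- Input 1: the entity's primary name or context.
- Input 2: the candidate alias.
- Label: 1 for a positive example, 0 for a negative example.

Example (CSV format):

| Entity | Candidate Alias | Label |
|--------|-----------------|-------|
| 杜甫 | 杜甫 | 1 |
| 杜甫 | 子美 | 1 |
| 杜甫 | 杜甫杜甫 | 0 |
| 杜甫 | 少陵 | 1 |
| 杜甫 | 少陵子美 | 0 |

#### (3) Triplet Format

This is a common format in contrastive learning and has three parts:

- Anchor: the entity's primary name or context.
- Positive: a correct alias (positive example).
- Negative: an incorrect alias, or a semantically close but wrong one (negative example).

Example (JSON format):

```
[
  { "anchor": "杜甫", "positive": "少陵", "negative": "杜甫杜甫" },
  { "anchor": "杜甫", "positive": "詩聖", "negative": "少陵子美" }
]
```

### 3. Data Sources and Preparation

#### (1) Positive Examples

- Source: existing correct alias data (for example manual annotations or a knowledge base).
- Format: make sure each entity's positive set is complete and free of duplicates.

#### (2) Negative Examples

- Source:
  - Generate data that contains incorrect aliases (for example by inserting spelling errors, repetitions, or unrelated aliases into the positive examples).
  - Extract negative examples from the model's incorrect predictions.
- Format:
  - Negative examples should include aliases that are semantically close but incorrect.
  - Levenshtein distance or semantic similarity can be used to generate negative examples.

Strategies for generating negative examples:

- Spelling variants: insert an error, e.g. 杜甫 → 杜甫杜甫.
- Repeated phrases: e.g. 子美 → 少陵子美.
- Unrelated tokens: e.g. 杜甫 → 杜甫123.

### 4. Data Processing Workflow

1. Collect positive examples:
   - Extract each entity's correct alias set from the training data.
2. Generate negative examples:
   - Use rules or a model to generate aliases that are semantically close but incorrect.
3. Format the output:
   - Choose a suitable format (Input-Output, Sentence Pair, or Triplet).
4. Validate the data:
   - Make sure the positive examples are correct and the negative examples are semantically plausible and challenging.

### 5. Handoff to the Model Trainer

#### (1) Provide Data Files

- Provide the data in one of the formats above (for example JSON or CSV files).
- Include full details of both positive and negative examples.

#### (2) Define Clear Labeling Rules

- State what each label means, for example:
  - Label 1 means a positive example.
  - Label 0 means a negative example.

#### (3) Provide Data Documentation

- Explain how the positive and negative examples were generated.
- Document the rules and strategies used to generate negative examples, so that the model trainer can follow the logic of the data.

### 6. Notes on Data Preparation

1. Ensure data quality:
   - Positive examples should be complete and accurate; negative examples should be semantically plausible.
2. Diversify the negative examples:
   - Include several forms, such as spelling errors, repetitions, and unrelated tokens.
3. Balance the data volume:
   - Keep the ratio of positive to negative examples reasonable (for example 1:1 or 1:2).

### Conclusion

- Prepare the data in the formats described above and give the model trainer clear documentation.
- Recommended formats: Triplet or Sentence Pair, since both can be used directly for contrastive learning.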
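
The rule-based negative-sample strategies and the Triplet format described above could be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used for this evaluation: the function names `make_negatives` and `build_triplets`, the `"123"` suffix, and the fixed random seed are all assumptions made for the example.

```python
import json
import random


def make_negatives(alias, other_aliases, rng):
    """Return rule-based negative candidates for one correct alias.

    Hypothetical helper: the three rules mirror the strategies listed in
    the text (repetition, unrelated suffix, concatenated aliases).
    """
    negatives = [
        alias + alias,   # repetition, e.g. 杜甫 -> 杜甫杜甫
        alias + "123",   # unrelated suffix, e.g. 杜甫 -> 杜甫123
    ]
    if other_aliases:
        # Concatenate another correct alias in front, e.g. 子美 -> 少陵子美.
        negatives.append(rng.choice(other_aliases) + alias)
    return negatives


def build_triplets(entity, positive_aliases, seed=0):
    """Build anchor/positive/negative triplets for one entity."""
    rng = random.Random(seed)
    # Deduplicate the positives while keeping their original order.
    positives = list(dict.fromkeys(positive_aliases))
    positive_set = set(positives)
    triplets = []
    for alias in positives:
        others = [a for a in positives if a != alias]
        for negative in make_negatives(alias, others, rng):
            # Never emit a negative that is actually a correct alias.
            if negative in positive_set:
                continue
            triplets.append(
                {"anchor": entity, "positive": alias, "negative": negative}
            )
    return triplets


if __name__ == "__main__":
    triplets = build_triplets("杜甫", ["杜甫", "子美", "少陵", "詩聖"])
    print(json.dumps(triplets, ensure_ascii=False, indent=2))
```

Each triplet pairs one positive with one negative, so this sketch yields up to three triplets per correct alias. Negatives mined from the model's own incorrect predictions, as suggested above, could be appended to the same list in the same shape.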