# Training model results

## Now

Merged the original data with the data collected later, then filtered out the pure-Chinese entries for training.

### Improvements

* Fixed duplicate entries in the training data where **the same word carried different labels**
* Compared the predictions against the labels and corrected the wrong labels

----

* Training data: **word_labeled_zh_201104_0039CC.csv**
  ![](https://i.imgur.com/Rq93QZO.png)
* Test data: **word_labeled_zh_200929_test_4835.csv**
  ![](https://i.imgur.com/SyJNXkE.png)
* Model: **text_cnn_best_97.244_LR0.001_BATCH100_EPOCH100**
* Accuracy:
  ![](https://i.imgur.com/VBc6He9.png)
  ![](https://i.imgur.com/inO4OJp.png)

---

### Testing all data (word_labeled_zh_201030_1149CC.csv: 25478)

![](https://i.imgur.com/PPPkLsx.png)
![](https://i.imgur.com/oFw5qRf.png)

#### Exported words whose prediction differs from the label (predict_error.csv)

P: predicted answer; L: labeled answer

https://drive.google.com/file/d/1FOPGCCXpZXUrEbmT0hFs_r-p_mxZ2mZS/view?usp=sharing

---

## Results after correcting labels (11/11)

Added the previously mislabeled words and the homophone variants (same pronunciation, different characters) to the new version of the train data.

* Training data: **word_labeled_zh_201104_0039CC.csv**
  ![](https://i.imgur.com/46BFlC9.png)
* Model: **text_cnn_best_0.9968_LR0.001_BATCH100_EPOCH100**
* Testing all data (word_labeled_zh_201104_0039CC.csv: 25504)
  ![](https://i.imgur.com/oXW9HsA.png)
  ![](https://i.imgur.com/iipgaWt.png)

### Words whose prediction does not match the label

**(P: predicted answer; L: labeled answer)**

![](https://i.imgur.com/n9oyB0u.png)

---

## Update 2020/12/22

### Changes

* Added a dictionary to jieba to help it segment words correctly
* Training data
    + Added the Chinese censorship vocabulary
    + Removed long terms (replaced with shorter words, which textCNN also predicts better)
    + Added the segmented pieces as well, e.g., 內蒙古自治區 → 內蒙古, 自治區, and 內蒙古自治區 are all added

### Results

* Training data: **word_label_data_remove_long_word_201223.csv**
  ![](https://i.imgur.com/9767UQC.png)
* Test data: **word_label_data_remove_long_word_201223.csv**
* Model: **text_cnn_best_99.607_LR0.001_BATCH100_EPOCH100**
* Accuracy:
  ![](https://i.imgur.com/P5UWycf.png)

### Differences from the predictions

Most of the mismatches come from entries newly added from the "Chinese censorship vocabulary list"; whether they need a label can be discussed and decided later.

![](https://i.imgur.com/MBCltmU.png)
![](https://i.imgur.com/f3vBpaj.png)
![](https://i.imgur.com/XWayEIv.png)
![](https://i.imgur.com/yJkEbwR.png)

### To-do

- [x] 1.
Check whether this resource is already used as training data: https://wenxie1216.fandom.com/zh/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%AE%A1%E6%9F%A5%E8%AF%8D%E6%B1%87%E5%88%97%E8%A1%A8
- [x] 2. Collect more comprehensive lists of personal names and place names for 數美 to judge
- [ ] 3. Add the Wikipedia banned-word list to the training data (including "內蒙古")
- [x] 4. Confirm whether the full packaged build, which includes the pinyin banned words, can detect "你他媽" and "內蒙古"
- [x] 5. Check whether the earlier banned-word data contains longer entries. Should the banned-word data be segmented and labeled before it is added to the training data? (e.g., for 蒙古獨立, put all three terms 蒙古獨立, 蒙古, and 獨立 into the training data)

## Update 2021/1/04

### Changes

* Updated the training data (removed duplicates)
  https://docs.google.com/spreadsheets/d/1TR3YDQ4zT0wsk9QH4o1XeZmRjBkhtiWNra43V-tBRx4/edit?usp=sharing
* Added class weights to the loss (little difference in the results)

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Per-class sample counts (37216 in total); rarer classes get larger weights.
nSamples = [33853, 1795, 366, 136, 741, 325]
normedWeights = [1 - (x / sum(nSamples)) for x in nSamples]
normedWeights = torch.FloatTensor(normedWeights).to(device)
loss_func = torch.nn.CrossEntropyLoss(weight=normedWeights)
```

![](https://i.imgur.com/G3YVxIb.png)

### Results

* Training data: **word_label_data_remove_long_word_210104.csv**
* Testing data: **word_label_data_remove_long_word_210104.csv**
* Model (with weights): **text_cnn_best_99.74262734584451_LR0.001_BATCH100_EPOCH100**
* Accuracy:
  ![](https://i.imgur.com/h1FAQFM.png)

### With weights

![](https://i.imgur.com/WynydTJ.png)
![](https://i.imgur.com/pWE0GCo.png)

### Without weights

![](https://i.imgur.com/SBPmHRK.png)
![](https://i.imgur.com/CM1BRzC.png)

# Update 2021/02/22

## Changes

* Added ordinary personal names to the train data (37216 -> 38665 rows)
* Test data: Label.csv (contains both the 數美 labels and our own manual labels)

### Train data (original 37216 rows + ordinary personal names = 38665)

* Train data: **word_label_data_remove_long_word_210222_add normal name.csv**
* Train data information
  ![](https://i.imgur.com/14yhCgm.png)

### Test data

* Test data: **Label.csv**
* Test data information
  ![](https://i.imgur.com/pEzSv7p.png)

## Results

* Model: **text_cnn_best_99.7479892761394_LR0.001_BATCH100_EPOCH100**
  ![](https://i.imgur.com/EfrNt2x.png)

### 數美

* **數美 Accuracy: 0.7144766146993319**
  ![](https://i.imgur.com/VTPbbZV.png)

### Predict

* **Predict Accuracy: 0.7652561247216035**
  ![](https://i.imgur.com/eAiPKkj.png)

#### All data (2245 rows)
https://docs.google.com/spreadsheets/d/1vZT7tQDowUi4hFrCINKsKH24jLmUQb92qOo3Go45G6E/edit?usp=sharing

#### Rows where 數美 differs from the correct answer

https://docs.google.com/spreadsheets/d/1OsTZB4aZNQyzhmouyl0wGs9ZAUMSzyIcMzeCBP1kc7Y/edit?usp=sharing

#### Rows where predict differs from the correct answer

https://docs.google.com/spreadsheets/d/1T-LUR3DtY5Jtar5cWTuYrqNJSIqyO8Jx7LZmtyqEdMQ/edit?usp=sharing

## Cross Validation

### Method

Take the original train data (38665 rows) plus 80% of Label.csv (1796 of its 2245 rows, because this is 5-fold) => train_data: 40461 rows, test_data: 449 rows

### fold 0

Predict Accuracy: 0.8374164810690423
數美 Accuracy: 0.7305122494432071

![](https://i.imgur.com/zAMg1r5.png)
![](https://i.imgur.com/Q3L7pVM.png)

### fold 1

Predict Accuracy: 0.8195991091314031
數美 Accuracy: 0.7193763919821826

![](https://i.imgur.com/UeV7yxX.png)
![](https://i.imgur.com/lNVCP74.png)

### fold 2

Predict Accuracy: 0.8017817371937639
數美 Accuracy: 0.6948775055679287

![](https://i.imgur.com/O6kzVy1.png)
![](https://i.imgur.com/FRqVY3X.png)

### fold 3

Predict Accuracy: 0.7839643652561247
數美 Accuracy: 0.7216035634743875

![](https://i.imgur.com/AHR9N2M.png)
![](https://i.imgur.com/dRIFwFR.png)

### fold 4

Predict Accuracy: 0.8262806236080178
數美 Accuracy: 0.7060133630289532

![](https://i.imgur.com/XTkBZND.png)
![](https://i.imgur.com/VSPPGDm.png)

### Comparison

Accuracies rounded to three decimals.

| fold  | Predict     | 數美        |
| ----- | ----------- | ----------- |
| 0     | 0.837       | 0.731       |
| 1     | 0.820       | 0.719       |
| 2     | 0.802       | 0.695       |
| 3     | 0.784       | 0.722       |
| 4     | 0.826       | 0.706       |
| Range | 0.784~0.837 | 0.695~0.731 |

###### tags: `Progress Report`
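### Sketch: finding words with conflicting labels

The first improvement listed under "Now" fixes training rows where the same word carries different labels. The report does not show how those rows were found, so the following is only a minimal sketch of one way to do it with the standard library. The function names, the `(word, label)` row layout, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def find_label_conflicts(rows):
    """Return {word: sorted labels} for words that appear with more than one label."""
    labels_by_word = defaultdict(set)
    for word, label in rows:
        labels_by_word[word].add(label)
    return {w: sorted(ls) for w, ls in labels_by_word.items() if len(ls) > 1}


def drop_exact_duplicates(rows):
    """Remove repeated (word, label) pairs while keeping first-seen order."""
    return list(dict.fromkeys(rows))


# Hypothetical rows: (word, label). Real rows would come from the training CSV.
rows = [("word_a", 1), ("word_b", 0), ("word_a", 0), ("word_b", 0)]

conflicts = find_label_conflicts(rows)   # words that need a manual decision
deduped = drop_exact_duplicates(rows)    # exact repeats are safe to drop
```

Exact duplicates can be dropped automatically, while conflicting words still need someone to decide which label is right.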
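### Sketch: expanding a long term into its segments

The 2020/12/22 update adds both the segmented pieces and the full term to the training data (內蒙古自治區 becomes 內蒙古, 自治區, and 內蒙古自治區). The sketch below illustrates that expansion. It is not the report's code: to stay self-contained it uses a small forward-maximum-matching segmenter over a hypothetical `vocab` set as a stand-in for jieba, and the names `segment` and `expand_term` are invented for this example.

```python
def segment(term, vocab):
    """Forward maximum matching: take the longest vocab entry at each position."""
    pieces, i = [], 0
    max_len = max(map(len, vocab), default=1)
    while i < len(term):
        for size in range(min(max_len, len(term) - i), 0, -1):
            piece = term[i:i + size]
            # Fall back to a single character when nothing in vocab matches.
            if size == 1 or piece in vocab:
                pieces.append(piece)
                i += size
                break
    return pieces


def expand_term(term, vocab):
    """Return the segments followed by the original term, without duplicates."""
    return list(dict.fromkeys(segment(term, vocab) + [term]))


# Hypothetical dictionary entries, standing in for the jieba user dictionary.
vocab = {"內蒙古", "自治區", "蒙古", "獨立"}

expanded = expand_term("內蒙古自治區", vocab)
```

With jieba installed, `jieba.lcut(term)` (after loading the custom dictionary with `jieba.load_userdict`) could take the place of `segment`, although its output depends on the dictionary in use.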
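### Sketch: what the class weights look like

The 2021/1/04 update weights each class by `1 - (class share)` and reports little difference in the results. The sketch below only evaluates that formula for the six class counts given in the report, so the actual weight values are visible. The second list is an assumed alternative that this report does not use (inverse class frequency, the same formula scikit-learn documents for `class_weight="balanced"`), shown purely for comparison.

```python
# Per-class sample counts from the 2021/1/04 update.
n_samples = [33853, 1795, 366, 136, 741, 325]
total = sum(n_samples)

# Formula used in the report: 1 - (class share).
complement_weights = [1 - n / total for n in n_samples]

# Assumed alternative, NOT used in the report: total / (n_classes * class count).
inverse_weights = [total / (len(n_samples) * n) for n in n_samples]

for n, cw, iw in zip(n_samples, complement_weights, inverse_weights):
    print(f"count={n:>6}  1-share={cw:.4f}  inverse={iw:.4f}")
```

Under the `1 - share` formula the five minority classes all land between roughly 0.95 and 1.0 even though their counts differ by more than a factor of ten, which may be worth keeping in mind when interpreting the "little difference" result.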
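### Sketch: fold sizes and mean accuracy

The Cross Validation section gives the fold sizes and the per-fold accuracies but no average. The sketch below redoes the split arithmetic for a 5-fold split of the 2245-row Label.csv and averages the five reported accuracies. All numbers are copied from the report; the variable names are hypothetical.

```python
from statistics import mean

label_rows = 2245                           # rows in Label.csv
k = 5                                       # number of folds
test_rows = label_rows // k                 # rows held out per fold
extra_train_rows = label_rows - test_rows   # Label.csv rows added to training

# Per-fold accuracies as reported for fold 0 to fold 4.
predict_acc = [
    0.8374164810690423,
    0.8195991091314031,
    0.8017817371937639,
    0.7839643652561247,
    0.8262806236080178,
]
shumei_acc = [
    0.7305122494432071,
    0.7193763919821826,
    0.6948775055679287,
    0.7216035634743875,
    0.7060133630289532,
]

mean_predict = mean(predict_acc)
mean_shumei = mean(shumei_acc)
print(f"test rows per fold: {test_rows}, extra train rows: {extra_train_rows}")
print(f"mean Predict: {mean_predict:.4f}, mean 數美: {mean_shumei:.4f}")
```

Because the five folds are equal in size and together cover all of Label.csv, the 數美 mean should agree with the 數美 accuracy reported for the whole file (0.7144766146993319), which makes a handy consistency check.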