# Update 2021.06.13

## new model

* model: `text_cnn_best_99.93606557377049_LR0.001_BATCH100_EPOCH100`
* Train data: `new word dataset_add label_normal name_default_Ground truth_Corrected_prohibit_fixed_0611_remove long word.csv`

```
Training data: 60982
0    50636
1     4123
3     2828
4     2339
2      551
5      505
```

### result

| Category | Count | new model | Old model |
| ---- | ---- | ---------| --- |
| Political (涉政) | 1736 | **0.960** | 0.942 |
| Abuse (辱罵) | 299 | **0.903** | 0.903 |
| Prohibited (違禁) | 1181 | **0.980** | 0.959 |
| Pornography (色情) | 415 | **0.918** | 0.884 |
| Advertising (廣告) | 5786 | **0.997** | 0.997 |

#### Reasons for the small improvement

* Fixes in the new model
* The rule base no longer distinguishes between categories (previously, words such as `逼` and `操` could be mispredicted between labels 2 and 4)

##### Files: https://drive.google.com/drive/folders/10sNG5qjzdIMPj-stcVu_5bKBLp4JkVLQ?usp=sharing

```
涉政
spend time: 176.04912281036377
Predict   0     1  2   3   5   6   All
label
1        21  1626  2  36  10  41  1736
All      21  1626  2  36  10  41  1736
Accuracy:  0.9602534562211982
Error amount: 69 / 1736

dataset amount after removing official messages: 5287
dataset amount after removing duplicate messages: 299
辱罵
spend time: 31.82636284828186
Predict   0  1    2   4    6  All
label
2        13  6  125  10  145  299
All      13  6  125  10  145  299
Accuracy:  0.903010033444816
Error amount: 29 / 299

違禁
spend time: 123.72088527679443
Predict  0   1     3  4  5   6   All
label
3        7  12  1114  2  3  43  1181
All      7  12  1114  2  3  43  1181
Accuracy:  0.9796782387806944
Error amount: 24 / 1181

dataset amount after removing official messages: 415
dataset amount after removing duplicate messages: 415
色情
spend time: 43.15791177749634
Predict  0  2   3    4  5   6  All
label
4        6  2  22  305  4  76  415
All      6  2  22  305  4  76  415
Accuracy:  0.9180722891566265
Error amount: 34 / 415

dataset amount after removing official messages: 10000
dataset amount after removing duplicate messages: 5786
廣告
spend time: 23.311609268188477
Predict   0  1  4     5  6   All
label
5        17  1  1  5765  2  5786
All      17  1  1  5765  2  5786
Accuracy:  0.9967162115451089
Error amount: 19 / 5786
```

## nomal article_jieba

* file: nomal article_jieba.csv
* Composition: several news articles segmented into words with jieba (using a custom dictionary)

### result
https://docs.google.com/spreadsheets/d/1BvfvIo3h3BwOWMN7_sxTNujYIBJsFo11D_4-AkoxLX8/edit?usp=sharing

```
Predict     0    1  2   3   4   5   6   All
label
0        4270   95  1  32  13  20  10  4441
1           2    9  0   0   0   0   0    11
2           1    0  1   0   0   0   1     3
3           0    0  0   0   0   0   1     1
4           0    0  0   0   0   0   1     1
All      4273  104  2  32  13  20  13  4457
Similarity:  0.9632039488445142
difference amount: 164 / 4457
```

### Analysis

* The model's predictions are stricter than those of Shumei (數美)
* Cases where Shumei flags a violation but the model predicts normal are rare (three): `日:2` `周舵:1` `武汉肺炎:1`
* Some sentences are normal as a whole but contain flagged words, e.g. `('台湾', 1), ('境内', 0)` `('独立', 1), ('门户', 0)`

### Example

#### Shumei: violation; Model: normal

`日:2` `周舵:1` `武汉肺炎:1`

#### Shumei: normal; Model: violation

```
[('中国', 1)]
[('中央', 1)]
[('两国', 1)]
[('北京', 1)]
[('台湾', 1)]
[('和谐', 1)]
[('国家', 1)]
[('坦克', 1)]
[('封锁', 1)]
[('改革', 1)]
[('政府', 1)]
[('政治', 1)]
[('日本', 1)]
[('民主', 1)]
[('民意', 1)]
[('独立', 1)]
[('当局', 1)]
[('绝食', 1)]
[('总统', 1)]
[('肃杀', 1)]
[('自由', 1)]
[('军方', 1)]
[('镇压', 1)]
[('关枪', 1)]
[('香港', 1)]
[('陈建仁', 1)]
[('法西斯', 1)]
[('斗争', 1)]
[('乱射', 4)]
[('交易', 5)]
[('代价', 3)]
[('供货', 5)]
[('充分', 5)]
[('出售', 5)]
[('出卖', 5)]
[('啪啪', 4)]
[('子弹', 3)]
[('枪口', 3)]
[('枪声', 3)]
[('机枪', 3)]
[('武器', 3)]
[('病毒', 3)]
[('赌注', 3)]
[('开枪', 3)]
[('机枪', 3), ('声', 0)]
[('杀人狂', 3)]
[('监控', 3)]
[('空', 0), ('鸣枪', 3)]
```

#### Shumei: normal; Model: violation (model too strict)

```
[('博士', 3)]
[('收', 5)]
[('买', 5)]
[('卖', 5)]
[('办', 3)]
[('三级', 4)]
[('交易', 5)]
[('代价', 3)]
[('博士', 3)]
[('压住', 5)]
[('夹道', 4)]
[('学位', 3)]
[('床上', 4)]
[('底片', 4)]
[('按摩', 4)]
[('接著', 3)]
[('收手', 5)]
[('核酸', 3)]
[('流病', 4)]
[('耐操', 4)]
[('胸脯', 4)]
[('证书', 3)]
[('买张', 5)]
[('资格', 3)]
[('卖出', 5)]
[('销售', 5)]
[('开发', 3)]
[('鞭炮', 4)]
[('首选', 5)]
[('英雄式', 3)]
[('奥运', 0), ('资格', 3)]
```
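
### Recomputing Similarity

The `Similarity` and `difference amount` values reported for the article crosstab can be cross-checked by hand. The sketch below is not the original evaluation script: the helper name `agreement` and its `always_match` parameter are hypothetical. It assumes that a prediction of `6` is counted as agreeing with any label. These notes do not say what class `6` denotes, but that reading is the one under which the reported numbers (here and in the per-category accuracies) come out exactly.

```python
def agreement(table, always_match=(6,)):
    """Return (matched, total) for a {label: {predicted: count}} table.

    A cell counts as matched when predicted == label, or when the predicted
    class is in `always_match` (assumption: class 6 is scored as a match).
    """
    matched = 0
    total = 0
    for label, row in table.items():
        for predicted, count in row.items():
            total += count
            if predicted == label or predicted in always_match:
                matched += count
    return matched, total


# Non-zero cells copied from the article crosstab (rows: label, columns: Predict).
article_table = {
    0: {0: 4270, 1: 95, 2: 1, 3: 32, 4: 13, 5: 20, 6: 10},
    1: {0: 2, 1: 9},
    2: {0: 1, 2: 1, 6: 1},
    3: {6: 1},
    4: {6: 1},
}

matched, total = agreement(article_table)
print(matched, total, total - matched)  # 4293 4457 164
print(matched / total)                  # compare with the reported Similarity

# The same rule applied to the 涉政 (label 1) row of the new-model results.
political_table = {1: {0: 21, 1: 1626, 2: 2, 3: 36, 5: 10, 6: 41}}
p_matched, p_total = agreement(political_table)
print(p_total - p_matched, p_total)     # 69 1736
```

Counting only the exact diagonal (passing `always_match=()`) gives 4280 matches for the article table, which does not reproduce the reported 164 differences, so the class-6 assumption matters when comparing against these figures.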