# Update 2021.06.13
## new model
* model: `text_cnn_best_99.93606557377049_LR0.001_BATCH100_EPOCH100`
* Train data: `new word dataset_add label_normal name_default_Ground truth_Corrected_prohibit_fixed_0611_remove long word.csv`
```
Training data: 60982
0 50636
1 4123
3 2828
4 2339
2 551
5 505
```
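The counts above are heavily skewed toward label 0. A short sketch that computes each label's share from those counts; the variable names are illustrative, and the numbers are copied from the block above.

```python
# Label counts copied from the training-data summary above.
label_counts = {0: 50636, 1: 4123, 3: 2828, 4: 2339, 2: 551, 5: 505}

total = sum(label_counts.values())

# Share of each label, printed largest first.
shares = {label: count / total for label, count in label_counts.items()}
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"label {label}: {label_counts[label]:>6} ({share:.1%})")

# Ratio between the most and least frequent labels.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())
print(f"total: {total}, max/min ratio: {imbalance_ratio:.1f}")
```

Label 0 makes up roughly 83% of the rows, while labels 2 and 5 each account for under 1%.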
### result
| Category | Samples | New model (accuracy) | Old model (accuracy) |
| ---- | ---- | ---------| --- |
| Political (涉政) | 1736 | **0.960** | 0.942 |
| Abusive (辱罵) | 299 | **0.903** | 0.903 |
| Prohibited (違禁) | 1181 | **0.980** | 0.959 |
| Pornographic (色情) | 415 | **0.918** | 0.884 |
| Advertising (廣告) | 5786 | **0.997** | 0.997 |
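#### Reasons for the slight improvement
* Fixes in the new model
* The rule base no longer distinguishes between categories (previously, words such as `逼` and `操` could be mispredicted between labels 2 and 4)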
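One way to picture a rule base that does not distinguish categories is sketched below. This is an illustration only, not the project's code: the word list, the `RULE_BASE_LABEL` value of 6 (inferred from the `6` column in the logs), and the `dummy_model` stand-in are all assumptions.

```python
from typing import Callable

# Assumption: 6 is the label emitted for rule-base hits (inferred from the logs).
RULE_BASE_LABEL = 6

# Hypothetical rule-base entries; the two words are the examples named above.
RULE_BASE_WORDS = {"逼", "操"}


def classify(word: str, model_predict: Callable[[str], int]) -> int:
    """Return the rule-base label on a hit, otherwise defer to the model."""
    if word in RULE_BASE_WORDS:
        # No attempt is made to decide between label 2 and label 4 here.
        return RULE_BASE_LABEL
    return model_predict(word)


def dummy_model(word: str) -> int:
    """Stand-in for the real classifier; always predicts label 0."""
    return 0


print(classify("操", dummy_model))
print(classify("你好", dummy_model))
```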
##### Files: https://drive.google.com/drive/folders/10sNG5qjzdIMPj-stcVu_5bKBLp4JkVLQ?usp=sharing
```
涉政 spend time: 176.04912281036377
Predict   0     1  2   3   5   6   All
label
1        21  1626  2  36  10  41  1736
All      21  1626  2  36  10  41  1736
Accuracy: 0.9602534562211982
Error amount: 69 / 1736

dataset amount after removing official messages: 5287
dataset amount after removing duplicate messages: 299
辱罵 spend time: 31.82636284828186
Predict   0  1    2   4    6  All
label
2        13  6  125  10  145  299
All      13  6  125  10  145  299
Accuracy: 0.903010033444816
Error amount: 29 / 299

違禁 spend time: 123.72088527679443
Predict  0   1     3  4  5   6   All
label
3        7  12  1114  2  3  43  1181
All      7  12  1114  2  3  43  1181
Accuracy: 0.9796782387806944
Error amount: 24 / 1181

dataset amount after removing official messages: 415
dataset amount after removing duplicate messages: 415
色情 spend time: 43.15791177749634
Predict  0  2   3    4  5   6  All
label
4        6  2  22  305  4  76  415
All      6  2  22  305  4  76  415
Accuracy: 0.9180722891566265
Error amount: 34 / 415

dataset amount after removing official messages: 10000
dataset amount after removing duplicate messages: 5786
廣告 spend time: 23.311609268188477
Predict   0  1  4     5  6   All
label
5        17  1  1  5765  2  5786
All      17  1  1  5765  2  5786
Accuracy: 0.9967162115451089
Error amount: 19 / 5786
```
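The logged accuracies can be reproduced from the crosstab rows if a prediction counts as correct when it equals the true label or equals 6. That reading of label 6 (presumably a rule-base hit) is an inference from the numbers, not something the log states. A sketch using the rows above:

```python
# Prediction counts per category, copied from the log above:
# {true_label: {predicted_label: count}}
rows = {
    1: {0: 21, 1: 1626, 2: 2, 3: 36, 5: 10, 6: 41},  # 涉政
    2: {0: 13, 1: 6, 2: 125, 4: 10, 6: 145},         # 辱罵
    3: {0: 7, 1: 12, 3: 1114, 4: 2, 5: 3, 6: 43},    # 違禁
    4: {0: 6, 2: 2, 3: 22, 4: 305, 5: 4, 6: 76},     # 色情
    5: {0: 17, 1: 1, 4: 1, 5: 5765, 6: 2},           # 廣告
}

RULE_BASE_LABEL = 6  # assumption: counted as correct for every category


def accuracy(true_label, predictions):
    """Return (correct, total, accuracy) for one crosstab row."""
    total = sum(predictions.values())
    correct = predictions.get(true_label, 0) + predictions.get(RULE_BASE_LABEL, 0)
    return correct, total, correct / total


for label, preds in rows.items():
    correct, total, acc = accuracy(label, preds)
    print(f"label {label}: accuracy {acc:.4f}, errors {total - correct} / {total}")
```

Under this assumption the error counts come out as 69, 29, 24, 34 and 19, matching the `Error amount` lines in the log.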
## normal article_jieba
* file: `nomal article_jieba.csv`
* Composition: several news articles segmented into words with jieba (using a custom dictionary)
### result
https://docs.google.com/spreadsheets/d/1BvfvIo3h3BwOWMN7_sxTNujYIBJsFo11D_4-AkoxLX8/edit?usp=sharing
```
Predict     0    1  2   3   4   5   6   All
label
0        4270   95  1  32  13  20  10  4441
1           2    9  0   0   0   0   0    11
2           1    0  1   0   0   0   1     3
3           0    0  0   0   0   0   1     1
4           0    0  0   0   0   0   1     1
All      4273  104  2  32  13  20  13  4457
Similarity: 0.9632039488445142
difference amount: 164 / 4457
```
### Analysis
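* The model's predictions are stricter than Shumei's (數美)
* Cases where Shumei flags a violation but the model predicts normal are rare (three): `日:2` `周舵:1` `武汉肺炎:1`
* Some phrases are normal as a whole but contain a word the model flags, e.g. `('台湾', 1), ('境内', 0)` `('独立', 1), ('门户', 0)`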
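The first two points can be checked against the crosstab in the result block. A sketch that splits the disagreements by direction; it assumes label 0 means normal and that a prediction of 6 is treated as agreement, which is what reproduces the logged `difference amount` of 164.

```python
# Rows are Shumei labels, columns are model predictions 0..6 (from the crosstab).
crosstab = {
    0: [4270, 95, 1, 32, 13, 20, 10],
    1: [2, 9, 0, 0, 0, 0, 0],
    2: [1, 0, 1, 0, 0, 0, 1],
    3: [0, 0, 0, 0, 0, 0, 1],
    4: [0, 0, 0, 0, 0, 0, 1],
}
IGNORED = 6  # assumption: prediction 6 is not counted as a disagreement

model_stricter = 0  # Shumei: normal, model: violation
model_looser = 0    # Shumei: violation, model: normal
other = 0           # both flag a violation, but with different labels
total = 0

for label, counts in crosstab.items():
    for predicted, count in enumerate(counts):
        total += count
        if predicted == label or predicted == IGNORED:
            continue
        if label == 0:
            model_stricter += count
        elif predicted == 0:
            model_looser += count
        else:
            other += count

differences = model_stricter + model_looser + other
print(model_stricter, model_looser, other)  # 161 3 0
print(f"similarity: {1 - differences / total:.4f}")
```

Of the 164 disagreements, 161 are cases where the model flags something Shumei considers normal and only 3 go the other way.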
### Example
#### Shumei: violation; Model: normal
`日:2` `周舵:1` `武汉肺炎:1`
#### Shumei: normal; Model: violation
```
[('中国', 1)]
[('中央', 1)]
[('两国', 1)]
[('北京', 1)]
[('台湾', 1)]
[('和谐', 1)]
[('国家', 1)]
[('坦克', 1)]
[('封锁', 1)]
[('改革', 1)]
[('政府', 1)]
[('政治', 1)]
[('日本', 1)]
[('民主', 1)]
[('民意', 1)]
[('独立', 1)]
[('当局', 1)]
[('绝食', 1)]
[('总统', 1)]
[('肃杀', 1)]
[('自由', 1)]
[('军方', 1)]
[('镇压', 1)]
[('关枪', 1)]
[('香港', 1)]
[('陈建仁', 1)]
[('法西斯', 1)]
[('斗争', 1)]
[('乱射', 4)]
[('交易', 5)]
[('代价', 3)]
[('供货', 5)]
[('充分', 5)]
[('出售', 5)]
[('出卖', 5)]
[('啪啪', 4)]
[('子弹', 3)]
[('枪口', 3)]
[('枪声', 3)]
[('机枪', 3)]
[('武器', 3)]
[('病毒', 3)]
[('赌注', 3)]
[('开枪', 3)]
[('机枪', 3), ('声', 0)]
[('杀人狂', 3)]
[('监控', 3)]
[('空', 0), ('鸣枪', 3)]
```
#### Shumei: normal; Model: violation (model too strict)
```
[('博士', 3)]
[('收', 5)]
[('买', 5)]
[('卖', 5)]
[('办', 3)]
[('三级', 4)]
[('交易', 5)]
[('代价', 3)]
[('博士', 3)]
[('压住', 5)]
[('夹道', 4)]
[('学位', 3)]
[('床上', 4)]
[('底片', 4)]
[('按摩', 4)]
[('接著', 3)]
[('收手', 5)]
[('核酸', 3)]
[('流病', 4)]
[('耐操', 4)]
[('胸脯', 4)]
[('证书', 3)]
[('买张', 5)]
[('资格', 3)]
[('卖出', 5)]
[('销售', 5)]
[('开发', 3)]
[('鞭炮', 4)]
[('首选', 5)]
[('英雄式', 3)]
[('奥运', 0), ('资格', 3)]
```
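Each entry above is a Python literal: a list of `(token, label)` pairs for one segment. A minimal sketch for tallying which labels trigger most often, using a handful of lines copied from the list above. How the original entries were produced is not shown in these notes, so the parsing step is an assumption about the format only.

```python
import ast
from collections import Counter

# A few entries copied verbatim from the list above.
lines = """
[('博士', 3)]
[('收', 5)]
[('三级', 4)]
[('交易', 5)]
[('奥运', 0), ('资格', 3)]
""".strip().splitlines()

flagged = Counter()
for line in lines:
    pairs = ast.literal_eval(line)  # parses Python literals without executing code
    for token, label in pairs:
        if label != 0:              # assumption: label 0 means "normal"
            flagged[label] += 1

print(flagged.most_common())
```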