Update 2021.06.13

new model

  • model: text_cnn_best_99.93606557377049_LR0.001_BATCH100_EPOCH100
  • Train data: new word dataset_add label_normal name_default_Ground truth_Corrected_prohibit_fixed_0611_remove long word.csv
Training data: 60982 samples; count per label:
0    50636
1     4123
3     2828
4     2339
2      551
5      505
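
The label counts above are heavily skewed toward label 0. A minimal sketch, using only the counts copied from the list above, that prints each label's share of the training data:

```python
# Label counts copied from the training-data summary above.
label_counts = {0: 50636, 1: 4123, 3: 2828, 4: 2339, 2: 551, 5: 505}

total = sum(label_counts.values())

# Share of each label in the training data, largest first.
shares = {label: count / total for label, count in label_counts.items()}
for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"label {label}: {label_counts[label]:>6} samples ({share:.1%})")
```

Label 0 accounts for roughly 83% of the samples, so overall accuracy on this data would be dominated by that class; the per-category figures are the more informative ones.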

result

Category             Count   New model   Old model
Political (涉政)      1736    0.960       0.942
Abuse (辱罵)          299     0.903       0.903
Prohibited (違禁)     1181    0.980       0.959
Pornography (色情)    415     0.918       0.884
Advertising (廣告)    5786    0.997       0.997
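
To make the size of each change explicit, here is a small sketch that computes the accuracy difference per category. The numbers are copied from the table above; the English category names are translations added for readability.

```python
# (count, new-model accuracy, old-model accuracy), copied from the table above.
results = {
    "political (涉政)": (1736, 0.960, 0.942),
    "abuse (辱罵)": (299, 0.903, 0.903),
    "prohibited (違禁)": (1181, 0.980, 0.959),
    "pornography (色情)": (415, 0.918, 0.884),
    "advertising (廣告)": (5786, 0.997, 0.997),
}

# Improvement of the new model over the old one, in accuracy points.
deltas = {name: round(new - old, 3) for name, (_, new, old) in results.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```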

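Reasons for the slight improvement

  • Corrections in the new model
  • The rule base no longer distinguishes between classes (previously, some words may have been mispredicted as 2 or 4)
Files: https://drive.google.com/drive/folders/10sNG5qjzdIMPj-stcVu_5bKBLp4JkVLQ?usp=sharing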
Political (涉政)  spend time: 176.04912281036377
Predict   0     1  2   3   5   6   All
label
1        21  1626  2  36  10  41  1736
All      21  1626  2  36  10  41  1736

Accuracy: 0.9602534562211982
Error amount: 69 / 1736

dataset amount after removing official messages: 5287
dataset amount after removing duplicate messages: 299

Abuse (辱罵)  spend time: 31.82636284828186
Predict   0  1    2   4    6  All
label
2        13  6  125  10  145  299
All      13  6  125  10  145  299

Accuracy: 0.903010033444816
Error amount: 29 / 299


Prohibited (違禁)  spend time: 123.72088527679443
Predict  0   1     3  4  5   6   All
label
3        7  12  1114  2  3  43  1181
All      7  12  1114  2  3  43  1181

Accuracy: 0.9796782387806944
Error amount: 24 / 1181

dataset amount after removing official messages: 415
dataset amount after removing duplicate messages: 415

Pornography (色情)  spend time: 43.15791177749634
Predict  0  2   3    4  5   6  All
label
4        6  2  22  305  4  76  415
All      6  2  22  305  4  76  415

Accuracy: 0.9180722891566265
Error amount: 34 / 415

dataset amount after removing official messages: 10000
dataset amount after removing duplicate messages: 5786

Advertising (廣告)  spend time: 23.311609268188477
Predict   0  1  4     5  6   All
label
5        17  1  1  5765  2  5786
All      17  1  1  5765  2  5786

Accuracy: 0.9967162115451089
Error amount: 19 / 5786
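
The reported accuracies match the confusion rows only when a prediction of 6 is counted as correct alongside the true label. For label 1, for example, (1626 + 41) / 1736 = 0.9603 and 1736 - 1667 = 69 errors. These notes do not say what class 6 stands for, so treating it as an accepted hit (for instance a rule-base match) is an assumption. A minimal sketch that recomputes all five figures from the rows above:

```python
# Assumption: a prediction of 6 is scored as a hit; the notes do not define class 6.
ACCEPTED_EXTRA = 6

# One confusion row per true label: {predicted class: count}, copied from the logs above.
rows = {
    1: {0: 21, 1: 1626, 2: 2, 3: 36, 5: 10, 6: 41},
    2: {0: 13, 1: 6, 2: 125, 4: 10, 6: 145},
    3: {0: 7, 1: 12, 3: 1114, 4: 2, 5: 3, 6: 43},
    4: {0: 6, 2: 2, 3: 22, 4: 305, 5: 4, 6: 76},
    5: {0: 17, 1: 1, 4: 1, 5: 5765, 6: 2},
}


def score(label, predicted_counts):
    """Return (accuracy, errors, total) for one true label."""
    total = sum(predicted_counts.values())
    hits = predicted_counts.get(label, 0) + predicted_counts.get(ACCEPTED_EXTRA, 0)
    return hits / total, total - hits, total


scores = {label: score(label, counts) for label, counts in rows.items()}
for label, (accuracy, errors, total) in scores.items():
    print(f"label {label}: accuracy {accuracy:.4f}, errors {errors} / {total}")
```

Scoring only exact matches would give much lower numbers (for label 2, 125 / 299), so how class 6 is scored has a large effect on the reported figures.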

nomal article_jieba

  • file: nomal article_jieba.csv
  • Composition: several news articles segmented into words with jieba (using a custom dictionary)

result

https://docs.google.com/spreadsheets/d/1BvfvIo3h3BwOWMN7_sxTNujYIBJsFo11D_4-AkoxLX8/edit?usp=sharing

Predict     0    1  2   3   4   5   6   All
label
0        4270   95  1  32  13  20  10  4441
1           2    9  0   0   0   0   0    11
2           1    0  1   0   0   0   1     3
3           0    0  0   0   0   0   1     1
4           0    0  0   0   0   0   1     1
All      4273  104  2  32  13  20  13  4457

Similarity: 0.9632039488445142 
difference amount: 164 / 4457
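
The similarity figure can be reproduced from the table above if, as in the per-category results, a model prediction of 6 is counted as agreement; that scoring rule is inferred from the numbers and is not stated in the notes. The rows appear to be the 數美 (Shumei) labels and the columns the model's predictions. A minimal sketch:

```python
# Assumption: a prediction of 6 counts as agreement; the notes do not define class 6.
ACCEPTED_EXTRA = 6

# Rows: reference label; list index: predicted class 0..6. Copied from the table above.
matrix = {
    0: [4270, 95, 1, 32, 13, 20, 10],
    1: [2, 9, 0, 0, 0, 0, 0],
    2: [1, 0, 1, 0, 0, 0, 1],
    3: [0, 0, 0, 0, 0, 0, 1],
    4: [0, 0, 0, 0, 0, 0, 1],
}

total = sum(sum(row) for row in matrix.values())
agree = sum(row[label] + row[ACCEPTED_EXTRA] for label, row in matrix.items())
differences = total - agree
similarity = agree / total

# Direction of the disagreements (predictions of 6 excluded).
model_stricter = sum(matrix[0][1:ACCEPTED_EXTRA])        # reference normal, model flags
model_looser = sum(matrix[label][0] for label in (1, 2, 3, 4))  # reference flags, model normal

print(f"Similarity: {similarity:.4f}, differences: {differences} / {total}")
print(f"model stricter: {model_stricter}, model looser: {model_looser}")
```

Of the 164 differences, 161 are items labeled 0 that the model assigns to classes 1 to 5, and 3 are items with a violation label that the model predicts as 0.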

Analysis

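  • The model's predictions are stricter than those of Shumei (數美)
  • Cases that Shumei marks as a violation but the model judges normal are rare (three, shown as token:label, matching the three label 1 and 2 items predicted as 0 in the table): 日:2 周舵:1 武汉肺炎:1
  • The sentence as a whole is normal but contains flagged words, e.g. ('台湾', 1), ('境内', 0); ('独立', 1), ('门户', 0)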
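
The examples in these notes are lists of (token, predicted label) pairs. As an illustration only, the sketch below pulls out the tokens with a non-zero label from such a list. The helper name `flagged_tokens` is hypothetical, and reading label 0 as "normal" is an assumption based on how the notes use it.

```python
def flagged_tokens(token_predictions):
    """Return the (token, label) pairs whose label is not 0 (assumed to mean normal)."""
    return [(token, label) for token, label in token_predictions if label != 0]


# Token-level predictions copied from the analysis bullet above.
examples = [
    [("台湾", 1), ("境内", 0)],
    [("独立", 1), ("门户", 0)],
]

for tokens in examples:
    print(flagged_tokens(tokens))
```

A single flagged token is enough for the item to show up as a disagreement, even when the surrounding tokens are predicted as 0.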

Example

Shumei: violation; Model: normal

日:2 周舵:1 武汉肺炎:1

Shumei: normal; Model: violation

[('中国', 1)]
[('中央', 1)]
[('两国', 1)]
[('北京', 1)]
[('台湾', 1)]
[('和谐', 1)]
[('国家', 1)]
[('坦克', 1)]
[('封锁', 1)]
[('改革', 1)]
[('政府', 1)]
[('政治', 1)]
[('日本', 1)]
[('民主', 1)]
[('民意', 1)]
[('独立', 1)]
[('当局', 1)]
[('绝食', 1)]
[('总统', 1)]
[('肃杀', 1)]
[('自由', 1)]
[('军方', 1)]
[('镇压', 1)]
[('关枪', 1)]
[('香港', 1)]
[('陈建仁', 1)]
[('法西斯', 1)]
[('斗争', 1)]

[('乱射', 4)]
[('交易', 5)]
[('代价', 3)]
[('供货', 5)]
[('充分', 5)]
[('出售', 5)]
[('出卖', 5)]

[('啪啪', 4)]
[('子弹', 3)]
[('枪口', 3)]
[('枪声', 3)]
[('机枪', 3)]
[('武器', 3)]
[('病毒', 3)]
[('赌注', 3)]
[('开枪', 3)]
[('机枪', 3), ('声', 0)]
[('杀人狂', 3)]
[('监控', 3)]
[('空', 0), ('鸣枪', 3)]

Shumei: normal; Model: violation (model too strict)

[('博士', 3)]
[('收', 5)]
[('买', 5)]
[('卖', 5)]
[('办', 3)]
[('三级', 4)]
[('交易', 5)]
[('代价', 3)]
[('博士', 3)]
[('压住', 5)]
[('夹道', 4)]
[('学位', 3)]
[('床上', 4)]
[('底片', 4)]
[('按摩', 4)]
[('接著', 3)]
[('收手', 5)]
[('核酸', 3)]
[('流病', 4)]
[('耐操', 4)]
[('胸脯', 4)]
[('证书', 3)]
[('买张', 5)]
[('资格', 3)]
[('卖出', 5)]
[('销售', 5)]
[('开发', 3)]
[('鞭炮', 4)]
[('首选', 5)]
[('英雄式', 3)]
[('奥运', 0), ('资格', 3)]