# Update 2021.06.05
## To-do
1. Long-word test
2. Cross validation
## Long-word Test
* file: `long worf.csv`
* model: `zh_model_210529.model`
### Result
```
Predict   0   1   2   3   4   5  All
label
0        18   4   1   2   0   0   25
1         6  45   1   1   0   2   55
3         2   4   0  60   0   6   72
4         0   0   0   1   8   0    9
5         0   0   0   3   0  12   15
All      26  53   2  67   8  20  176
Accuracy: 0.8125
Error amount: 33 / 176
```
The prediction errors are all acceptable results (e.g., the word was split apart by segmentation, or its characters appear in a rule-violating order).
* Prediction results:
https://docs.google.com/spreadsheets/d/1a3UH-gUjlGPIOS0Cpu9RWtnvvkSV9wnCE9l-H4KRfEc/edit?usp=sharing
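The accuracy figure above can be reproduced directly from the confusion matrix: correct predictions are the diagonal entries (label 2 has no true samples in this test, so it has no row). A minimal plain-Python sketch, with the counts copied from the result block:

```python
# Rows of the confusion matrix above: true label -> counts for predicted 0..5.
# Label 2 has no true samples in this test, so it has no row.
matrix = {
    0: [18, 4, 1, 2, 0, 0],
    1: [6, 45, 1, 1, 0, 2],
    3: [2, 4, 0, 60, 0, 6],
    4: [0, 0, 0, 1, 8, 0],
    5: [0, 0, 0, 3, 0, 12],
}
total = sum(sum(row) for row in matrix.values())        # all test samples
correct = sum(row[lab] for lab, row in matrix.items())  # diagonal hits
print(total, correct, correct / total)  # 176 143 0.8125
```

The error amount is simply `total - correct`, which matches the 33 / 176 reported above.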
## Cross Validation
### Method
1. Remove `url`/`account`/`phone number` entries from `all word test.csv`
2. Shuffle randomly, then split into 5 equal folds for cross validation
3. The 20% test fold is guaranteed to be unseen by the training data
4. The remaining 80% of the data is added back into the training data
5. Train the model & test
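The split procedure above can be sketched as follows; the function name and the fixed seed are illustrative assumptions (the real pipeline reads `all word test.csv` after the filtering step):

```python
import random

def five_fold_splits(rows, seed=0):
    """Shuffle rows, then split into 5 equal folds.

    For each fold: 20% is the held-out test set (never seen in training),
    and the remaining 80% is added back into the training data.
    Sketch only -- file names and columns in the real pipeline differ.
    """
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # step 2: random shuffle
    fold_size = len(rows) // 5
    for k in range(5):
        held_out = rows[k * fold_size:(k + 1) * fold_size]              # 20%, unseen
        added_back = rows[:k * fold_size] + rows[(k + 1) * fold_size:]  # 80%, merged back
        yield k, held_out, added_back

folds = list(five_fold_splits(list(range(100))))
print(len(folds), len(folds[0][1]), len(folds[0][2]))  # 5 20 80
```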
### Result
| fold / Accuracy | Segmented 20% | Segmented 80% | Unsegmented 20% | Unsegmented 80% |
| --------------- | ----- | ----- | --- | --- |
| fold 0 | 0.814 | 0.950 | 0.786 | 0.969 |
| fold 1 | 0.776 | 0.957 | 0.744 | 0.974 |
| fold 2 | 0.771 | 0.952 | 0.738 | 0.967 |
| fold 3 | 0.772 | 0.953 | 0.769 | 0.969 |
| fold 4 | 0.782 | 0.952 | 0.754 | 0.973 |
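Averaging the fold accuracies (values copied from the table above) summarizes the four conditions and shows how tight the spread is across folds:

```python
# Fold accuracies copied from the cross-validation table above.
seg_20   = [0.814, 0.776, 0.771, 0.772, 0.782]
seg_80   = [0.950, 0.957, 0.952, 0.953, 0.952]
unseg_20 = [0.786, 0.744, 0.738, 0.769, 0.754]
unseg_80 = [0.969, 0.974, 0.967, 0.969, 0.973]

def mean(xs):
    return sum(xs) / len(xs)

for name, xs in [("segmented 20%", seg_20), ("segmented 80%", seg_80),
                 ("unsegmented 20%", unseg_20), ("unsegmented 80%", unseg_80)]:
    # spread = max - min across the five folds
    print(f"{name}: mean={mean(xs):.4f} spread={max(xs) - min(xs):.3f}")
```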
#### Analysis
1. Removed vocabulary cannot be caught after word segmentation.
   e.g. `[('超', 0), ('拽', 0), ('的', 0), ('黑道', 0), ('少爷', 0)]`: `黑道` appears in the test data, so it was removed from the training data, and the model can no longer detect `黑道`.
2. Religion / Falun Gong-related terms account for about 1/3 of the errors.
3. Words that Jieba's dictionary segments out, but that were removed from the training data, cannot be recognized.
4. Without word segmentation (80%), long words are mispredicted less often (or those predictions could even be counted as correct).
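Points 1 and 3 can be illustrated with a toy lookup. This is NOT the real model (the project uses a text CNN); the vocabulary and the default label here are assumptions purely for illustration:

```python
# Toy illustration: tokens missing from the training vocabulary fall back to
# label 0, so a removed word like 黑道 can never be predicted as its true class.
# Vocabulary and labels are illustrative assumptions, not the project's data.
train_vocab = {"超": 0, "拽": 0, "的": 0, "少爷": 0}  # 黑道 was removed

def label_tokens(tokens):
    # Unknown tokens default to 0 ("no violation").
    return [(tok, train_vocab.get(tok, 0)) for tok in tokens]

print(label_tokens(["超", "拽", "的", "黑道", "少爷"]))
# [('超', 0), ('拽', 0), ('的', 0), ('黑道', 0), ('少爷', 0)]
```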
#### Conclusion
* Prediction accuracy is fairly stable (unseen data: ~0.77, seen data: 0.95 and above).
* It is acceptable that some politics-related terms were never seen.
* Adding the long words split off by the dictionary back into the training data should push accuracy even higher.
### With Word Segmentation & Rule-base Scan
https://drive.google.com/drive/folders/19Nh3AXjw8fzOStjUxcpNy0DZrb4Uk5pf?usp=sharing
#### fold 0
```
fold 0 model: text_cnn_best_99.7172859450727_LR0.001_BATCH100_EPOCH100
Test data(20%)
test fold data: 1196
0.0 507
1.0 290
3.0 256
4.0 84
2.0 57
5.0 2
Predict 0 1 2 3 4 5 6 All
label
0.0 495 6 1 3 0 1 1 507
1.0 109 163 0 8 5 1 4 290
2.0 7 3 16 1 0 0 30 57
3.0 33 8 0 202 1 2 10 256
4.0 18 7 3 5 31 0 20 84
5.0 1 0 0 0 0 0 1 2
All 663 187 20 219 37 4 66 1196
fold 0 20%_Accuracy 0.8135451505016722
fold 0 20%_Error amount: 223 / 1196
Test data(80%)
test fold data: 4782
0.0 1814
1.0 1293
3.0 1100
4.0 329
2.0 242
5.0 3
Name: label, dtype: int64
Predict 0 1 2 3 4 5 6 All
label
0.0 1793 15 0 3 0 1 2 1814
1.0 38 1146 1 74 0 3 31 1293
2.0 22 3 92 0 10 0 115 242
3.0 14 13 0 1033 1 4 35 1100
4.0 8 1 2 21 243 0 54 329
5.0 2 0 0 0 0 0 1 3
All 1877 1178 95 1131 254 8 238 4781
fold 0 80%_Accuracy 0.9504391468005019
fold 0 80%_Error amount: 237 / 4782
```
#### fold 1
```
add train amount: 1696
remove amount: 806
train_merge fold data: 61872
0.0 51956
1.0 3523
3.0 2952
4.0 2270
2.0 662
5.0 509
fold 1 model:text_cnn_best_99.80775444264944_LR0.001_BATCH100_EPOCH100
Test data(20%)
test fold data: 1196
0.0 447
1.0 339
3.0 260
4.0 90
2.0 58
5.0 1
Predict 0 1 2 3 4 5 6 All
label
0.0 441 4 0 2 0 0 0 447
1.0 132 179 0 15 1 4 8 339
2.0 12 2 13 0 0 0 31 58
3.0 43 5 3 193 0 5 11 260
4.0 26 1 5 6 35 0 17 90
5.0 1 0 0 0 0 0 0 1
All 655 191 21 216 36 9 67 1195
fold 1 20%_Accuracy 0.7759197324414716
fold 1 20%_Error amount: 268 / 1196
Test data(80%)
test fold data: 4782
0.0 1874
1.0 1244
3.0 1096
4.0 323
2.0 241
5.0 4
Predict 0 1 2 3 4 5 6 All
label
0.0 1854 12 0 3 0 2 3 1874
1.0 27 1112 1 73 0 4 27 1244
2.0 13 4 103 0 7 0 114 241
3.0 11 10 0 1031 2 8 34 1096
4.0 7 0 1 18 240 0 57 323
5.0 2 0 0 0 0 0 2 4
All 1914 1138 105 1125 249 14 237 4782
fold 1 80%_Accuracy 0.9571309075700544
fold 1 80%_Error amount: 205 / 4782
```
#### fold 2
```
add train amount: 1657
remove amount: 773
train_merge fold data: 61866
0.0 51930
1.0 3523
3.0 2947
4.0 2299
2.0 658
5.0 508
dtype: int64
fold 2 model :text_cnn_best_99.65589660743134_LR0.001_BATCH100_EPOCH100
Test data(20%)
test fold data: 1196
0.0 471
1.0 328
3.0 275
2.0 62
4.0 60
Predict 0 1 2 3 4 5 6 All
label
0.0 459 5 0 4 2 1 0 471
1.0 140 150 3 12 15 2 6 328
2.0 8 0 19 0 3 0 32 62
3.0 33 5 3 203 12 7 12 275
4.0 13 1 0 5 31 0 10 60
All 653 161 25 224 63 10 60 1196
fold 2 20%_Accuracy 0.7709030100334449
fold 2 20%_Error amount: 274 / 1196
Test data(80%)
test fold data: 4782
0.0 1850
1.0 1255
3.0 1081
4.0 353
2.0 237
5.0 5
Predict 0 1 2 3 4 5 6 All
label
0.0 1824 15 0 6 0 2 3 1850
1.0 30 1111 0 75 5 5 29 1255
2.0 11 6 99 0 7 1 113 237
3.0 13 12 0 1015 4 4 33 1081
4.0 7 0 2 21 259 0 64 353
5.0 3 0 0 0 0 0 2 5
All 1888 1144 101 1117 275 12 244 4781
fold 2 80%_Accuracy 0.9519029694688415
fold 2 80%_Error amount: 230 / 4782
```
#### fold 3
```
add train amount: 1678
remove amount: 799
train_merge fold data: 61861
0.0 51946
1.0 3543
3.0 2928
4.0 2277
2.0 657
5.0 509
fold 3 model :text_cnn_best_99.78352180936996_LR0.001_BATCH100_EPOCH100
Test data(20%)
test fold data: 1195
0.0 453
1.0 311
3.0 284
4.0 83
2.0 63
5.0 1
Predict 0 1 2 3 4 5 6 All
label
0.0 441 9 0 0 1 2 0 453
1.0 135 144 2 17 5 1 7 311
2.0 9 1 23 0 3 0 27 63
3.0 41 8 0 225 0 5 5 284
4.0 20 2 2 9 37 0 13 83
5.0 1 0 0 0 0 0 0 1
All 647 164 27 251 46 8 52 1195
fold 3 20%_Accuracy 0.7715481171548118
fold 3 20%_Error amount: 273 / 1195
Test data(80%)
test fold data: 4783
0.0 1868
1.0 1272
3.0 1072
4.0 330
2.0 236
5.0 4
Predict 0 1 2 3 4 5 6 All
label
0.0 1836 21 0 7 0 1 3 1868
1.0 30 1132 1 75 0 6 28 1272
2.0 17 5 89 0 7 0 118 236
3.0 8 13 0 1006 2 3 40 1072
4.0 7 0 1 16 245 0 61 330
5.0 2 0 0 0 0 0 2 4
All 1900 1171 91 1104 254 10 252 4782
fold 3 80%_Accuracy 0.9533765419192975
fold 3 80%_Error amount: 223 / 4783
```
#### fold 4
```
add train amount: 1677
remove amount: 794
train_merge fold data: 61865
0.0 51956
1.0 3535
3.0 2937
4.0 2261
2.0 666
5.0 509
fold 4 model :text_cnn_best_99.77705977382875_LR0.001_BATCH100_EPOCH100
Test data(20%)
test fold data: 1195
0.0 443
1.0 315
3.0 281
4.0 96
2.0 59
5.0 1
Predict 0 1 2 3 4 5 6 All
label
0.0 431 7 0 3 0 0 2 443
1.0 129 162 0 12 1 1 10 315
2.0 9 0 23 0 2 0 25 59
3.0 44 7 0 219 1 3 7 281
4.0 24 3 3 10 41 1 14 96
5.0 0 0 0 0 0 0 1 1
All 637 179 26 244 45 5 59 1195
fold 4 20%_Accuracy 0.7824267782426778
fold 4 20%_Error amount: 260 / 1195
Test data(80%)
test fold data: 4783
0.0 1878
1.0 1268
3.0 1075
4.0 317
2.0 240
5.0 4
Predict 0 1 2 3 4 5 6 All
label
0.0 1858 12 0 4 0 3 1 1878
1.0 43 1123 1 71 1 4 25 1268
2.0 17 6 90 0 7 0 120 240
3.0 14 12 0 1003 1 7 38 1075
4.0 6 1 2 15 233 0 60 317
5.0 3 0 0 0 0 0 1 4
All 1941 1154 93 1093 242 14 245 4782
fold 4 80%_Accuracy 0.9517039514948777
fold 4 80%_Error amount: 231 / 4783
```
### Without Word Segmentation
https://drive.google.com/drive/folders/1M2nSbbiHa-1RuUGA2-rE2WEZbVCbJ9sA?usp=sharing
#### fold 0
```
Test data(20%)
Predict 0 1 2 3 4 5 All
label
0.0 493 7 1 4 1 1 507
1.0 112 159 0 12 5 0 288
2.0 18 3 28 2 5 1 57
3.0 34 7 1 211 3 0 256
4.0 18 8 3 7 48 0 84
5.0 2 0 0 0 0 0 2
All 677 184 33 236 62 2 1194
fold 0 20%_Accuracy: 0.7864321608040201
fold 0 20%_Error amount: 255 / 1194
Test data(80%)
Predict 0 1 2 3 4 5 All
label
0.0 1809 4 0 1 0 0 1814
1.0 8 1192 1 83 0 3 1287
2.0 1 0 231 0 10 0 242
3.0 0 8 0 1086 1 1 1096
4.0 1 1 2 22 303 0 329
5.0 0 0 0 0 0 3 3
All 1819 1205 234 1192 314 7 4771
fold 0 80%_Accuracy 0.9689857502095558
fold 0 80%_Error amount: 148 / 4772
```
#### fold 1
```
Test data(20%)
Predict 0 1 2 3 4 5 All
label
0.0 436 8 0 2 1 0 447
1.0 133 181 0 19 1 5 339
2.0 19 1 28 3 6 1 58
3.0 47 5 3 196 0 6 257
4.0 27 2 8 6 47 0 90
5.0 0 1 0 0 0 0 1
All 662 198 39 226 55 12 1192
fold 1 20%_Accuracy: 0.7443419949706622
fold 1 20%_Error amount: 305 / 1193
Test data(80%)
Predict 0 1 2 3 4 5 All
label
0.0 1870 2 0 2 0 0 1874
1.0 3 1154 1 75 0 3 1236
2.0 2 0 229 0 10 0 241
3.0 1 5 0 1087 2 0 1095
4.0 0 0 1 19 303 0 323
5.0 0 0 0 0 0 4 4
All 1876 1161 231 1183 315 7 4773
fold 1 80%_Accuracy 0.9736015084852294
fold 1 80%_Error amount: 126 / 4773
```
#### fold 2
```
Test data(20%)
Predict 0 1 2 3 4 5 All
label
0.0 458 3 0 7 2 1 471
1.0 140 148 4 13 17 2 324
2.0 21 0 27 5 7 2 62
3.0 34 4 5 210 14 7 274
4.0 14 1 0 8 37 0 60
All 667 156 36 243 77 12 1191
fold 2 20%_Accuracy: 0.7388748950461796
fold 2 20%_Error amount: 311 / 1191
Test data(80%)
Predict 0 1 2 3 4 5 All
label
0.0 1843 1 0 4 1 1 1850
1.0 7 1157 0 78 6 3 1251
2.0 1 1 223 1 10 1 237
3.0 4 6 0 1062 5 1 1078
4.0 1 0 2 21 329 0 353
5.0 0 0 0 0 0 5 5
All 1856 1165 225 1166 351 11 4774
fold 2 80%_Accuracy 0.9673298429319371
fold 2 80%_Error amount: 156 / 4775
```
#### fold 3
```
Test data(20%)
Predict 0 1 2 3 4 5 All
label
0.0 444 5 0 0 2 2 453
1.0 133 148 2 20 5 1 309
2.0 13 0 43 2 5 0 63
3.0 40 6 1 235 0 2 284
4.0 23 2 2 9 46 1 83
5.0 0 0 0 0 0 1 1
All 653 161 48 266 58 7 1193
fold 3 20%_Accuracy: 0.768650461022632
fold 3 20%_Error amount: 276 / 1193
Test data(80%)
Predict 0 1 2 3 4 5 All
label
0.0 1855 8 0 5 0 0 1868
1.0 6 1175 1 79 0 5 1266
2.0 1 0 225 0 10 0 236
3.0 0 7 0 1059 2 0 1068
4.0 1 0 1 18 310 0 330
5.0 0 0 0 2 0 2 4
All 1863 1190 227 1163 322 7 4772
fold 3 80%_Accuracy 0.9692017598994344
fold 3 80%_Error amount: 147 / 4773
```
#### fold 4
```
Test data(20%)
Predict 0 1 2 3 4 5 All
label
0.0 434 6 1 2 0 0 443
1.0 136 161 0 15 1 2 315
2.0 22 0 29 0 6 2 59
3.0 44 5 0 226 2 4 281
4.0 22 3 5 13 51 2 96
5.0 0 0 0 1 0 0 1
All 658 175 35 257 60 10 1195
fold 4 20%_Accuracy: 0.7539748953974895
fold 4 20%_Error amount: 294 / 1195
Test data(80%)
Predict 0 1 2 3 4 5 All
label
0.0 1872 2 0 3 0 1 1878
1.0 4 1169 1 82 1 3 1260
2.0 1 0 232 0 7 0 240
3.0 1 7 0 1062 1 0 1071
4.0 0 0 2 14 301 0 317
5.0 0 0 0 0 0 4 4
All 1878 1178 235 1161 310 8 4770
fold 4 80%_Accuracy 0.9725424439320897
fold 4 80%_Error amount: 131 / 4771
```