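# Nickname Test Results
## File information
* Test data: **NicknameTestDataWithLabel.csv**
### Direct test
* Raw data information:

#### Results
The data is mostly normal and ad nicknames, but many entries are overly long or non-Chinese (Thai, English, etc.), so prediction quality is poor.
* Accuracy:

* Confusion matrix:

* Prediction errors: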


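### Improved pipeline
1. Filter for Chinese nicknames with langid (45090 -> 26444)
2. Segment each nickname into words, feed the words to the model, and take the most frequent label as the nickname's label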
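
The voting step can be sketched as follows. This is a minimal illustration, not the project's code: `nickname_label` and `toy_model` are hypothetical names, the token list is assumed to come from a segmenter that has already run, and the tie-breaking rule (the label seen first wins) is an assumption because the notes do not state one.

```python
from collections import Counter
from typing import Callable, Iterable, List


def nickname_label(tokens: Iterable[str], predict_token: Callable[[str], int]) -> int:
    """Label a nickname with the most frequent label among its tokens."""
    labels: List[int] = [predict_token(token) for token in tokens]
    if not labels:
        return 0  # assumption: an empty nickname falls back to normal (0)
    # most_common(1) returns the label with the highest count; on a tie it
    # returns the label that was encountered first.
    return Counter(labels).most_common(1)[0][0]


# Hypothetical stand-in for the trained model: a tiny lookup table.
toy_model = {"加": 5, "微信": 5, "你好": 0}
label = nickname_label(["你好", "加", "微信"], lambda t: toy_model.get(t, 0))
print(label)
```

Plain majority voting lets several normal tokens outvote a single abnormal one, which is consistent with the error cases listed later in these notes.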
* Information:

#### Results
* result:
**nickname_labeled_zh_210112_0055CC_210113_0732.csv**

https://docs.google.com/spreadsheets/d/1rC6c4M62UBgJ9Ktcj7-crKyB9P8ADBj0dBIMoZ6Nz_o/edit?usp=sharing
* Accuracy:

* Confusion matrix

* Results
https://docs.google.com/spreadsheets/d/1ze0Rx4T9w1YK1TT9mxKqwWcs9iFlR8r_4F5845q-e7w/edit?usp=sharing
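* Causes of prediction errors
+ Personal names are all classified as 1 [e.g., rows 12, 13, 14]
+ Shumei (數美) mislabeled the entry (or our model is stricter) [e.g., rows 34, 94]
+ After digits are filtered out of an ad account, only normal words remain [e.g., row 1114]
### Possible improvements & open questions
1. Treat overly long digit strings as a criterion for ads
Q: All personal names are classified as political (an earlier requirement), so a user who wants to use their own name as a nickname is also classified as political.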
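# Update
* Fixing ad accounts
* Overview after removing duplicate Chinese nicknames
* Distance between 蒙古, 內蒙古, and 蒙古國
* Results after removing manually assigned serial numbers from accounts (元豪)
## Fixing ad accounts
#### Before
Because digits were filtered out, ad nicknames were reduced to nouns only, and the model classified them as normal or political (personal names).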

#### After
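Use a regular expression to find accounts that contain an 11-digit number, mark them with 1 in a newly added `phone` column, and label them directly as ads.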
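
A minimal sketch of this rule, using only the standard `re` module. The exact pattern used in the project is not shown in these notes, so the pattern here is an assumption: it treats "an 11-digit number" as a run of exactly 11 consecutive digits. `flag_phone` and `AD_LABEL` are hypothetical names; label 5 is the ads class used elsewhere in these notes.

```python
import re

# Assumption: exactly 11 consecutive digits, not part of a longer digit run.
PHONE_PATTERN = re.compile(r"(?<!\d)\d{11}(?!\d)")
AD_LABEL = 5  # ads class


def flag_phone(nickname: str) -> dict:
    """Return the nickname with a phone flag and, if flagged, the ads label."""
    has_phone = 1 if PHONE_PATTERN.search(nickname) else 0
    return {
        "nickname": nickname,
        "phone": has_phone,
        # Flagged nicknames are labeled as ads directly; others go to the model.
        "label": AD_LABEL if has_phone else None,
    }


print(flag_phone("加我13812345678"))
print(flag_phone("abc123"))
```

Running the check before digits are filtered out keeps the strongest ad signal from being discarded.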


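## Overview after removing duplicate Chinese nicknames
* Test data: **NicknameTestDataWithLabel.csv**
##### After dropping duplicates from the original test data, 5112 rows remain

##### Examples of duplicate nicknames

### Results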
https://docs.google.com/spreadsheets/d/1Pju1ax6ebGnehCgzFwv-rryPn0-QZ_E6or5tCZ_CFBA/edit?usp=sharing


#### Examples of prediction errors

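## Distance: 蒙古 (Mongolia) vs 內蒙古 (Inner Mongolia) vs 蒙古國 (State of Mongolia)
XLNet embedding produces an n × 768 matrix for each word, where n is the word's sequence length. If n differs between two words, they cannot be compared directly with `cosine_similarity`.
(Before entering textCNN, inputs are padded to a length of 15.)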
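
A rough sketch of one way to compare two embeddings with different n: pad both to a common length, flatten, and take the cosine similarity. Zero-padding and flattening are assumptions (the notes do not state the padding value or how the matrices are reduced to vectors), the function names are hypothetical, and the toy matrices use 2 columns instead of 768 to stay readable.

```python
import math
from typing import List

Matrix = List[List[float]]


def pad_rows(matrix: Matrix, length: int) -> Matrix:
    """Zero-pad an n x d matrix to `length` rows (assumed padding value: 0)."""
    if length < len(matrix):
        raise ValueError("target length is shorter than the matrix")
    dim = len(matrix[0])
    return matrix + [[0.0] * dim for _ in range(length - len(matrix))]


def cosine_similarity_padded(a: Matrix, b: Matrix, length: int) -> float:
    """Cosine similarity of two matrices after padding and flattening."""
    flat_a = [x for row in pad_rows(a, length) for x in row]
    flat_b = [x for row in pad_rows(b, length) for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b)


two_tokens = [[1.0, 0.0], [0.0, 1.0]]                # n = 2
three_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3
similarity = cosine_similarity_padded(two_tokens, three_tokens, length=3)
print(similarity)
```

With zero-padding, the extra rows add nothing to the dot product and only enlarge the longer word's norm, so scores between words of different lengths are pulled down. That is one reason such scores are not directly comparable with scores between words of equal length.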
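#### Same n (n = 3)

#### Same n (n = 2)

#### Padding to the length of the longer word

#### Padding to length 15

#### Padding to length 5

### Summary
Although padding affects similarity comparisons between words of different lengths, overall
`蒙古 vs 蒙古國` has the highest similarity and the shortest distance (because the dimensions are the same),
and
`內蒙古 vs 蒙古國` and `內蒙古 vs 蒙古` are also more similar than other word pairs of different lengths.
In addition,
words that are both place names also show higher similarity.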
# Update 2021.02.03
Test data: **NicknameLabelCheck.csv**
Test data information:

### Shumei (數美) labels (labelNo) vs our manual labels (Correct)
#### Accuracy

#### Confusion matrix

#### Shumei labeling errors
https://docs.google.com/spreadsheets/d/1Ast7xHIRY_6Sl1FqA1RGWZRVvxYYR_tM1l3uVBaMsFc/edit?usp=sharing
### Model predictions vs our manual labels (Correct)
#### Accuracy

#### Confusion matrix

#### Model prediction errors
https://docs.google.com/spreadsheets/d/173lbLwJpgXbRU0Y9CBOYYslDGmtSeRraZ1QYb87awEc/edit?usp=sharing
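* The model still predicts every personal name as 1 (normal personal names have not yet been added to the training data)

* Words that segmentation splits apart cannot be predicted correctly



* The model cannot correctly predict monikers as political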


# Update 2021.03.10
## Cross Validation
Take the previously hand-labeled
**NicknameLabelCheck.csv** (1720 rows),
add
**word_label_data_remove_long_word_210222_add normal name_and_Label.csv** (40910 rows),
and run 5-fold cross validation.
```
word_label_data_remove_long_word_210222_add normal name_and_Label.csv
total: 40910
0.0 36509
1.0 2662
4.0 816
2.0 378
5.0 326
3.0 218
```
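
The per-fold sizes in the logs of this section (42286 training rows, 344 test rows) are consistent with splitting only the 1720 hand-labeled rows into 5 folds and always keeping the 40910 extra rows in training. A minimal sketch under that assumption follows; `five_fold_with_extra` is a hypothetical name, and shuffling and stratification are left out because the notes do not describe them.

```python
from typing import List, Sequence, Tuple


def five_fold_with_extra(
    labeled: Sequence, extra: Sequence, k: int = 5
) -> List[Tuple[list, list]]:
    """Split `labeled` into k test folds; `extra` is always part of training."""
    fold_size = len(labeled) // k  # any remainder rows are only ever trained on
    folds = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        test = list(labeled[start:stop])
        train = list(extra) + list(labeled[:start]) + list(labeled[stop:])
        folds.append((train, test))
    return folds


# Row indices stand in for the real rows.
folds = five_fold_with_extra(range(1720), range(40910))
print([(len(train), len(test)) for train, test in folds])
```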
### fold 0
```
train fold data: 42286
0.0 37038
1.0 2929
4.0 937
2.0 696
5.0 410
3.0 275
test fold data: 344
0 128
2 79
1 72
4 27
5 26
3 12
```
```
predict Accuracy: 0.6366279069767442
數美 Accuracy: 0.7325581395348837
predict 0 1 2 3 4 5
correct
0 101 8 1 2 9 3
1 42 25 0 0 4 0
2 18 0 49 0 8 2
3 8 1 0 3 0 0
4 14 0 4 0 7 2
5 1 0 0 0 1 21
```
Predict Error:
https://docs.google.com/spreadsheets/d/1JQKdDzpLQsqMjJddF3zy4zqHKX-NHMqlzhcG8wJukcs/edit?usp=sharing
### fold 1
```
train fold data: 42286
0.0 37043
1.0 2934
4.0 935
2.0 687
5.0 416
3.0 270
test fold data: 344
0 123
2 88
1 67
4 29
5 20
3 17
```
```
predict Accuracy: 0.6104651162790697
數美 Accuracy: 0.7877906976744186
predict 0 1 2 3 4 5
correct
0 98 12 4 0 6 2
1 43 23 0 0 0 1
2 14 1 48 3 13 8
3 9 2 0 4 1 1
4 12 0 5 0 12 0
5 1 2 2 1 1 12
```
Predict Error:
https://docs.google.com/spreadsheets/d/1d6QZUC2mAQVHCaiarc4zadE634SUAkOOSlyp_sfohYE/edit?usp=sharing
### fold 2
```
train fold data: 42286
0.0 37026
1.0 2949
4.0 932
2.0 689
5.0 409
3.0 280
test fold data: 344
0 140
2 86
1 52
4 32
5 27
3 7
```
```
predict Accuracy: 0.6308139534883721
數美 Accuracy: 0.7965116279069767
predict 0 1 2 3 4 5
correct
0 102 22 2 7 5 0
1 23 25 1 3 0 0
2 20 3 48 5 7 1
3 1 3 0 3 0 0
4 10 2 5 2 13 0
5 5 1 1 1 2 17
```
Predict Error:
https://docs.google.com/spreadsheets/d/12X3sNAakyvpvuZlrzTaX7reFNdFTH_aLBBA5BNoNfRE/edit?usp=sharing
### fold 3
```
train fold data: 42286
0.0 37019
1.0 2930
4.0 940
2.0 703
5.0 421
3.0 272
test fold data: 344
0 147
2 72
1 71
4 24
5 15
3 15
```
```
predict Accuracy: 0.6337209302325582
數美 Accuracy: 0.7674418604651163
predict 0 1 2 3 4 5
correct
0 105 26 6 2 5 1
1 28 39 0 3 0 0
2 22 0 36 5 9 0
3 9 2 0 4 0 0
4 10 0 4 0 9 1
5 3 2 1 0 1 8
```
Predict Error:
https://docs.google.com/spreadsheets/d/1V8Mc3pxJXhiTxe-G2A9NHeup0NqJg4JTIBehm4QLuCM/edit?usp=sharing
### fold 4
```
train fold data: 42286
0.0 37047
1.0 2924
4.0 928
2.0 703
5.0 414
3.0 269
test fold data: 344
0 119
1 77
2 72
4 36
5 22
3 18
```
```
predict Accuracy: 0.6075581395348837
數美 Accuracy: 0.7936046511627907
predict 0 1 2 3 4 5
correct
0 90 13 8 1 4 1
1 40 29 3 3 1 0
2 14 1 47 1 8 1
3 7 3 1 5 2 0
4 21 1 5 0 9 0
5 1 0 0 0 0 20
```
Predict Error:
https://docs.google.com/spreadsheets/d/1-7CEAtDhbR24-j6jEIO9vcNPl0FyXSDyqkGup94wStA/edit?usp=sharing
### Comparison
| fold | Predict | Shumei (數美) |
| ---- | ------- | ----- |
| 0 | 0.637 | 0.733 |
| 1 | 0.610 | 0.788 |
| 2 | 0.631 | 0.797 |
| 3 | 0.634 | 0.767 |
| 4 | 0.608 | 0.794 |
| Range | 0.61 ~ 0.64 | 0.73 ~ 0.80 |
# All nicknames (unsegmented) as test data (not seen in the training data)
```
Training data: 40910
0.0 36509
1.0 2662
4.0 816
2.0 378
5.0 326
3.0 218
NicknameLabelCheck.csv
Testing data: 1720
0 657
2 397
1 339
4 148
5 110
3 69
```
```
predict correct number: 854
predict Accuracy: 0.5035377358490566
predict 0 1 2 3 4 5
label
0 528 69 11 9 25 4
1 186 136 2 9 0 3
2 186 4 131 21 45 5
3 43 9 0 14 3 0
4 82 3 19 5 38 1
5 20 55 1 19 3 7
```
Predict Error:
https://docs.google.com/spreadsheets/d/1RZJ_vQxLpkXBOC6ClfS_igldZ3n0-XyEHF8eVYb5PQU/edit?usp=sharing
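### Conclusion
* Nicknames still need to be segmented: often only part of a nickname is abnormal (classes 1~5), and without segmentation the model cannot catch it
* Adding the 80% of similar data to training has limited effect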
# Update 2021.03.17
## Cross Validation
0. Data: **NicknameTestDataWithLabel.csv** (45090 rows)
1. Filter for Chinese and remove duplicates (45090 -> 5112)
2. Add **word_label_data_remove_long_word_210222_add normal name_and_Label.csv** (40910 rows)
3. Run 5-fold cross validation `Note: words that are too long (length over 13) are skipped`
### Results
| fold | Predict |
| ---- | ------- |
| 0 | 0.916 |
| 1 | 0.887 |
| 2 | 0.782 |
| 3 | 0.910 |
| 4 | 0.918 |
| Range | 0.78 ~ 0.92 |
### fold 0
```
fold 0
train fold data: 44998
0.0 39685
1.0 2725
5.0 1102
4.0 840
2.0 423
3.0 223
test fold data: 1023
0 781
5 211
1 12
4 11
2 8
```
```
Train RuntimeError_count: 159
Test RuntimeError_count: 39
Predict Accuracy: 0.9159
predict 0 1 2 4 5
label
0 711 49 1 5 10
1 8 4 0 0 0
2 4 0 4 0 0
4 9 0 1 1 0
5 4 0 0 0 173
```
### fold 1
```
train fold data: 44998
0.0 39685
1.0 2717
5.0 1114
4.0 845
2.0 415
3.0 222
test fold data: 1023
0 781
5 199
1 20
2 16
4 6
3 1
```
```
Train RuntimeError_count: 158
Test RuntimeError_count: 40
Predict Accuracy: 0.8866
predict 0 1 2 3 4 5
label
0 721 35 2 1 10 4
1 11 8 0 0 1 0
2 12 0 2 0 1 0
3 0 0 0 1 0 0
4 4 0 1 1 0 0
5 13 2 0 0 1 152
```
### fold 2
```
train fold data: 44999
0.0 39657
1.0 2724
5.0 1131
4.0 847
2.0 419
3.0 221
test fold data: 1022
0 809
5 182
1 13
2 12
4 4
3 2
```
```
Train RuntimeError_count: 156
Test RuntimeError_count: 42
Predict Accuracy: 0.7818
predict 0
label
0 799
1 13
2 11
3 2
4 4
5 151
```
### fold 3
```
train fold data: 44999
0.0 39669
1.0 2722
5.0 1121
4.0 843
2.0 423
3.0 221
test fold data: 1022
0 797
5 192
1 15
4 8
2 8
3 2
```
```
Train RuntimeError_count: 159
Test RuntimeError_count: 39
Predict Accuracy: 0.9099
predict 0 1 2 4 5
label
0 744 30 4 6 10
1 6 8 0 0 1
2 4 0 3 0 1
3 2 0 0 0 0
4 5 0 1 2 0
5 8 0 0 0 148
```
### fold 4
```
train fold data: 44999
0.0 39677
1.0 2722
5.0 1110
4.0 845
2.0 422
3.0 223
test fold data: 1022
0 789
5 203
1 15
2 9
4 6
```
```
Train RuntimeError_count: 162
Test RuntimeError_count: 36
Predict Accuracy: 0.9178
predict 0 1 2 3 4 5
label
0 736 21 3 4 6 12
1 7 8 0 0 0 0
2 6 1 1 0 0 1
4 2 0 2 1 1 0
5 6 0 0 0 0 168
```
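### Analysis
* This dataset is overwhelmingly normal/ads, so it offers little basis for analyzing the other classes
* The training data labels will be adjusted based on the prediction errors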
# Update 2021.03.21
## Filtering non-Chinese nicknames & removing duplicates
* Train data: **word_label_data_remove_long_word_210222_add normal name_and_Label.csv (40910 rows)**
* Test data : **NicknameTestDataWithLabel.csv**

| Description | File name | Rows |
| --------| -------------------------------------- | -------- |
| Original | NicknameTestDataWithLabel.csv |45090 |
| Chinese only | nickname_labeled_zh_210112_0055CC.csv |26444 |
| Duplicates removed | Using `drop_duplicates` (see below) |5112 |

```python=
import pandas as pd
# Load the Chinese-only nicknames and drop fully duplicated rows in place
test_data = pd.read_csv('nickname_labeled_zh_210112_0055CC.csv')
test_data.drop_duplicates(inplace=True)
```
##### Examples of duplicate nicknames

### Results
| Setting | Accuracy |
| -------- | -------- |
| (Previous) with segmentation & account detection | 0.8057 |
| (This time) direct predict | 0.742 |

* Ads are mostly not detected
predict error: https://docs.google.com/spreadsheets/d/1LUCwbK_zQ39hcHL0X0HYk7kH5shAyckiDmbl_I-xktg/edit?usp=sharing
# Update 2021.04.06
## Training data
* File: `new word dataset_add label_normal name_default.csv`
* Composition:
```
word_label_data_remove_long_word_210106.csv (37217 rows)
+ 正常人名_簡_去掉非中文.csv (1450 rows)
+ Label.csv (2246 rows)
+ nickname_wechat_jieba_labeled.csv (2688 rows)
+ nickname_default_jieba_labeled (6173 rows)
1. Remove emoji & meaningless punctuation & WeChat IDs (all digits)
2. drop_duplicates
3. dropna
4. Correct the labels of past prediction errors
```
* information:
```
Training data: 44786
0 40956
1 2039
4 830
2 406
5 342
3 213
```
## Testing data
* File: `NicknameLabelCheck.csv`
* information:
```
Testing data: 1720
0 657
2 397
1 339
4 148
5 110
3 69
```
* Note: the file has columns for the Shumei (數美) labels (label) and our manual labels (Correct)
## Model
* **text_cnn_best_99.47767857142857_LR0.001_BATCH100_EPOCH100**
## Accuracy
| | [數美](#Baseline-數美)| [Search](#搜尋模式-Search) | [Full](#全模式-Full-Best) | [Precise](#精確模式-Precise) |
| -------- | ---- | ---- | ---- | ---- |
| Accuracy | 0.82558 | 0.71453 | 0.71163 | 0.70581 |
| Error | 300 | 491 | 496 | 506 |
### Baseline: 數美
```
數美 0 1 2 3 4 5 All
Correct
0 477 107 3 7 62 1 657
1 18 320 0 1 0 0 339
2 1 3 338 0 55 0 397
3 0 25 0 44 0 0 69
4 2 0 0 0 146 0 148
5 1 11 0 0 3 95 110
All 499 466 341 52 266 96 1720
Accuracy: 0.8255813953488372
Error amount: 300 / 1720
Type 0 Accuracy: 477 / 657 = 0.726027
Type 1 Accuracy: 320 / 339 = 0.943953
Type 2 Accuracy: 338 / 397 = 0.851385
Type 3 Accuracy: 44 / 69 = 0.637681
Type 4 Accuracy: 146 / 148 = 0.986486
Type 5 Accuracy: 95 / 110 = 0.863636
```
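
The figures in this block can be reproduced from the confusion matrix itself. The sketch below is illustrative (the function names are hypothetical); it reads rows as the manual labels (Correct) and columns as the Shumei labels. Note that each "Type k Accuracy" is the diagonal cell divided by the row total, i.e. per-class recall.

```python
from typing import List

# Rows: manual label (Correct) 0-5; columns: Shumei label 0-5.
shumei_matrix: List[List[int]] = [
    [477, 107, 3, 7, 62, 1],
    [18, 320, 0, 1, 0, 0],
    [1, 3, 338, 0, 55, 0],
    [0, 25, 0, 44, 0, 0],
    [2, 0, 0, 0, 146, 0],
    [1, 11, 0, 0, 3, 95],
]


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Sum of the diagonal divided by the total number of samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_accuracy(matrix: List[List[int]]) -> List[float]:
    """Diagonal cell over row total for each class (per-class recall)."""
    return [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]


total = sum(sum(row) for row in shumei_matrix)
errors = total - sum(shumei_matrix[i][i] for i in range(6))
print(overall_accuracy(shumei_matrix), errors, total)
print([round(value, 6) for value in per_class_accuracy(shumei_matrix)])
```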
### 搜尋模式-Search
* Result Spreadsheet:
https://docs.google.com/spreadsheets/d/16kwsiDSAawzeR6cyqUD72IVelgC1KbykniqJDThpzjU/edit?usp=sharing
```
Predict 0 1 2 3 4 5 All
Correct
0 589 38 10 3 13 4 657
1 188 146 0 3 1 1 339
2 90 1 274 0 32 0 397
3 16 10 0 43 0 0 69
4 49 1 15 0 82 1 148
5 9 2 0 2 2 95 110
All 941 198 299 51 130 101 1720
Accuracy: 0.7145348837209302
Error amount: 491 / 1720
Type 0 Accuracy: 589 / 657 = 0.896499
Type 1 Accuracy: 146 / 339 = 0.430678
Type 2 Accuracy: 274 / 397 = 0.690176
Type 3 Accuracy: 43 / 69 = 0.623188
Type 4 Accuracy: 82 / 148 = 0.554054
Type 5 Accuracy: 95 / 110 = 0.863636
```
### 全模式-Full-Best
* Result Spreadsheet:
https://docs.google.com/spreadsheets/d/1SW-nucHjmcdANstnvaVBaPpA4HsediQu8f2Ko059ffI/edit?usp=sharing
```
Predict 0 1 2 3 4 5 All
Correct
0 596 29 10 4 14 4 657
1 211 122 0 4 1 1 339
2 68 1 290 0 37 1 397
3 17 9 0 43 0 0 69
4 51 1 17 0 78 1 148
5 9 3 0 1 2 95 110
All 952 165 317 52 132 102 1720
Accuracy: 0.7116279069767442
Error amount: 496 / 1720
Type 0 Accuracy: 596 / 657 = 0.907154
Type 1 Accuracy: 122 / 339 = 0.359882
Type 2 Accuracy: 290 / 397 = 0.730479
Type 3 Accuracy: 43 / 69 = 0.623188
Type 4 Accuracy: 78 / 148 = 0.527027
Type 5 Accuracy: 95 / 110 = 0.863636
```
### 精確模式-Precise
* Result Spreadsheet:
https://docs.google.com/spreadsheets/d/1Tg3Qo9Ztkp9GCWVIHOav8butgVzQWIK7fcBefs20lZU/edit?usp=sharing
```
Predict 0 1 2 3 4 5 All
Correct
0 590 38 10 3 13 3 657
1 189 145 0 3 1 1 339
2 93 2 271 0 31 0 397
3 17 7 0 45 0 0 69
4 49 1 14 0 83 1 148
5 24 2 0 2 2 80 110
All 962 195 295 53 130 85 1720
Accuracy: 0.7058139534883721
Error amount: 506 / 1720
Type 0 Accuracy: 590 / 657 = 0.898021
Type 1 Accuracy: 145 / 339 = 0.427729
Type 2 Accuracy: 271 / 397 = 0.682620
Type 3 Accuracy: 45 / 69 = 0.652174
Type 4 Accuracy: 83 / 148 = 0.560811
Type 5 Accuracy: 80 / 110 = 0.727273
```
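## Causes of prediction errors
### Why the model misses political (1)
1. The model does not recognize the relevant political figures (Shumei judged these correctly) `e.g., 林建荣、杨添福`
2. The model cannot recognize monikers `e.g., 習平、金三胖`
### Why the model misses abuse (2)
1. After segmentation, the model cannot tell from the individual words that the nickname is abusive
2. After segmentation, the model detects a political term `e.g., 台湾`
3. The model misclassifies the nickname as porn `e.g., 逼逼、骚货`
### Why the model misses prohibited (3)
1. Religion was only later defined as prohibited, so the labels may not be fully updated yet `e.g., 释迦、佛陀`
2. The model has not learned prohibited (3) terms `e.g., 沙漠之鷹、掌心雷`
### Why the model misses/misclassifies porn (4)
1. After segmentation, the model cannot tell from the individual words that the nickname is porn `e.g., 加v看我喷水`
2. Abuse often mentions sexual organs and sexual acts, so it overlaps with porn
### Why the model misses ads (5)
1. Segmentation is inaccurate `e.g., [('上', 0), ('下分加', 0), ('薇', 0), ('f', 0), ('89', 0), ('w98', 0)]`
2. Only the whole sentence reveals an ad; the individual words are all normal `e.g., 新美达健身游泳报名负责人文仔`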
# Update 2021.04.12
## Training data
* File: `new word dataset_add label_normal name_Censor.csv`
* Composition:
```
new word dataset_add label_normal name_Censor.csv
(44786 rows)
+ CensorWordsCorrected (35838 rows)
1. drop_duplicates
2. dropna
```
### Information
```
CensorWordsCorrected: 35838
0 16192
5 14341
3 2220
1 1925
4 1041
2 119
new word dataset_add label_normal name_Censor: 66290
0 57148
1 3964
3 2433
4 1871
2 525
5 349
```
## Testing data
* File: `NicknameLabelCheck.csv`
* information:
```
Testing data: 1720
0 657
2 397
1 339
4 148
5 110
3 69
```
## Model
* **text_cnn_best_99.47767857142857_LR0.001_BATCH100_EPOCH100**
## Accuracy
| | [數美](#Baseline-數美)| [Search](#搜尋模式-Search) | [Full](#全模式-Full-Best) | [Precise](#精確模式-Precise) |
| -------- | ---- | ---- | ---- | ---- |
| (0406) Accuracy | 0.82558 | 0.71453 | 0.71163 | 0.70581 |
| Error | 300 | 491 | 496 | 506 |
| (add Censor) Accuracy | 0.82558 | 0.68837 | 0.68198 | 0.68837 |
| Error | 300 | 536 | 547 | 536 |
| (add Censor without url) Accuracy | 0.82558 | 0.68953 | 0.68488 | 0.675 |
| Error | 300 | 534 | 542 | 559 |
### 搜尋模式-Search
```
Predict 0 1 2 3 4 5 All
Correct
0 582 43 8 2 20 2 657
1 186 148 0 3 1 1 339
2 86 2 238 1 70 0 397
3 16 9 0 44 0 0 69
4 54 2 13 1 77 1 148
5 7 6 0 0 2 95 110
All 931 210 259 51 170 99 1720
Accuracy: 0.6883720930232559
Error amount: 536 / 1720
```
### 全模式-full
```
Predict 0 1 2 3 4 5 All
Correct
0 568 55 10 2 20 2 657
1 191 143 0 4 0 1 339
2 63 3 249 1 81 0 397
3 17 8 0 44 0 0 69
4 52 3 17 1 74 1 148
5 6 7 0 0 2 95 110
All 897 219 276 52 177 99 1720
Accuracy: 0.6819767441860465
Error amount: 547 / 1720
```
### 精確模式-precise
```
Predict 0 1 2 3 4 5 All
Correct
0 584 42 8 2 19 2 657
1 188 146 0 3 1 1 339
2 86 3 235 1 72 0 397
3 17 6 0 46 0 0 69
4 54 2 12 1 78 1 148
5 7 6 0 0 2 95 110
All 936 205 255 53 172 99 1720
Accuracy: 0.6883720930232559
Error amount: 536 / 1720
```
# Update 2021.04.18
## Training data
| Date | File name | Rows | Notes |
| --------| -------- | --------|---- |
| 4/7 | CensorWords_Drop Duplates_labeled | 35840 | Labeled by Shumei (數美) |
| 4/11 | CensorWordsCorrected | 35840 | Corrected by a senior labmate |
| 4/14 | CensorWords_predict | 14497 | Entries labeled 0 by Shumei, predicted by the model |
| 4/15 | CensorWords_predict_checked | 14497 | Entries labeled 0 by Shumei, predicted by the model, then corrected by a senior labmate |
| 4/7 | CensorWords_Ground Truth | 19534 | See the description below |
### CensorWords_Ground Truth
* CensorWordsCorrected with URLs removed
* CensorWordsCorrected (classes 1~5) + CensorWords_predict_checked
```
CensorWords_Ground Truth: 19534
0 12895
1 2654
3 2388
4 1256
5 183
2 158
```
## Accuracy
### Shumei (數美) vs Ground Truth
```
數美 Accuracy : 0.8237432169550527
數美
correct 0 1 2 3 4 5
label
0 12894 1116 59 1125 801 179
1 1 1536 1 96 0 0
2 0 0 91 0 27 0
3 0 2 0 1146 6 0
4 0 0 7 20 421 1
5 0 0 0 1 1 3
```
### Model vs Ground Truth
```
Predict Accuracy : 0.829118460120815
Model
correct 0 1 2 3 4 5
predict
0 12877 692 34 1337 432 85
1 8 1923 2 166 14 0
2 2 2 108 12 55 0
3 1 31 0 446 5 0
4 6 5 12 123 746 2
5 1 1 2 304 4 96
```
## Nickname Test
### Model
* text_cnn_best_99.04503105590062_LR0.001_BATCH100_EPOCH100
### Accuracy
| Class | Rows | Accuracy | Notes |
| --- | -------- | -------- |------- |
| Political (涉政) | 1736 | 0.91782 | |
| Abuse (辱罵) | 302 | 0.48013 | |
| Abuse (辱罵) | 302 | 0.5298 | Predicted after segmentation |
| Prohibited (違禁) | 1182 | 0.97453 | |
| Porn (色情) | 415 | 0.91566 | |
| Ads (廣告) | 5529 | 0.00217 | |
| Ads (廣告) | 5529 | 0.066712 | Predicted after segmentation |
```
predict 0 1 2 3 5
label
1 60 1586 1 80 1
```
```
predict 0 1 2 3 4 5
label
2 97 9 145 11 39 1
[Segmented]
Predict 0 1 2 4 5
label
2 108 4 160 28 2
```
```
predict 0 1 3 4 5
label
3 15 6 1148 6 3
```
```
predict 0 3 4 5
label
4 14 19 380 2
```
```
predict 0 1 2 3 4 5
label
5 4903 201 156 107 150 12
[Segmented]
Predict 0 4 5
label
5 5398 2 38
```
## Abuse & Ads Analysis
### Political (涉政)
https://docs.google.com/spreadsheets/d/14QEuSDEXL258ZHzNhah4p8sTRmZgGjCC_PkNqLZGzow/edit?usp=sharing
### Abuse (辱罵)
* Classified as porn
* Contains both abuse and political content (political takes priority)
https://docs.google.com/spreadsheets/d/1_AzDL6ZWVoEijxby7IdgoSut5eqRI9Robi5t3Q32K98/edit?usp=sharing
### Prohibited (違禁)
https://docs.google.com/spreadsheets/d/1J1KnJgg9ZVeGXOqntyb-DKJAV6yIelnPjjn_bjw2yXk/edit?usp=sharing
### Porn (色情)
https://docs.google.com/spreadsheets/d/19kPWaJXN5xqxlZgZ1pXcTNck_c451wS6Qjihk23-Hl8/edit?usp=sharing
### Ads (廣告)
* URLs are not detected (even after adding URLs to the training data)
https://docs.google.com/spreadsheets/d/10ZFL1HyrPF6WiVK1vGD7oRvrBK1wneMP4SDcoZV1Fws/edit?usp=sharing
# Update 2021.04.26
## Method
Segment **CensorWords_Ground Truth.csv** and run predict on the result
## Results
**CensorWords_Ground Truth_precise_210419_1316.csv**
```
Before segmentation
Predict Accuracy : 0.829118460120815
Model
correct 0 1 2 3 4 5
predict
0 12877 692 34 1337 432 85
1 8 1923 2 166 14 0
2 2 2 108 12 55 0
3 1 31 0 446 5 0
4 6 5 12 123 746 2
5 1 1 2 304 4 96
```
```
CensorWords_Ground Truth: 19534
0 12895
1 2654
3 2388
4 1256
5 183
2 158
Label Accuracy : 0.8237432169550527
數美
correct 0 1 2 3 4 5
label
0 12894 1116 59 1125 801 179
1 1 1536 1 96 0 0
2 0 0 91 0 27 0
3 0 2 0 1146 6 0
4 0 0 7 20 421 1
5 0 0 0 1 1 3
Predict Accuracy : 0.8374116924337053
Model
correct 0 1 2 3 4 5
predict
0 12753 911 65 987 450 111
1 92 1703 2 130 14 1
2 3 1 79 0 41 0
3 29 35 1 1020 14 0
4 14 3 9 29 734 2
5 4 1 2 222 3 69
```
https://docs.google.com/spreadsheets/d/1E_jU-_UOfK1xrCzh0k7mrVtD28-bhn544JED-BsMdWU/edit?usp=sharing
## URL
### Method

https://www.notion.so/0ea0785783454d2fb05b7afc2f0d7401?v=4ac201766ff1426eb514b67750643558&p=b6ffd73c1c1a4404bfe56b539033c99a
### Results
* Test data: ads
* Number of URLs: about 5345
#### regular expression + urlextract
`dataset containing URL: 5342 / 5786`
#### only regular expression
`dataset containing URL: 5349 / 5786`
#### only urlextract
`dataset containing URL: 5342 / 5786`
https://docs.google.com/spreadsheets/d/1au6tyiSB7nAK2QMJUOSyPBKeguWtzQOZY40qXlJihIA/edit?usp=sharing
# Update 2021.04.28
## model
**text_cnn_best_99.04885057471265_LR0.001_BATCH100_EPOCH100**
### training data
* File: `new word dataset_add label_normal name_default_Ground truth_Corrected_2.csv`
* Composition:
```
new word dataset_add label_normal name_default.csv
+ Ground truth
+ Common prohibited words (e.g., 代开、代开发票、咨询电话、刺刀、代办、身分证、户口本、刻章)
```
* information:
```
Training data: 69536
0 53851
1 6566
3 4795
4 3102
2 677
5 545
```
## Result
```
Predict Accuracy : 0.8963857888809256
Model
correct 0 1 2 3 4 5
predict
0 12684 360 27 601 235 73
1 112 2258 3 152 16 1
2 0 0 110 0 13 0
3 44 31 0 1380 16 0
4 44 5 16 53 974 5
5 11 0 2 202 2 104
```
https://docs.google.com/spreadsheets/d/1lk9wwRFJniaAQwPWpwdsAKz5yTd6SFxms9wYCROCwWc/edit?usp=sharing
# Update 2021.04.30
## Model
`text_cnn_best_98.63338788870703_LR0.001_BATCH100_EPOCH100`
### Training data
* file: `new word dataset_add label_normal name_default_Ground truth_Corrected__prohibit`
* Composition
```
new word dataset_add label_normal name_default_Ground truth_Corrected_2.csv
+ 違禁_jieba_corrected
* drop_duplicates()
```
* Information
```
Training data: 61055 (before dropping duplicates: 71013 rows)
0.0 51339
1.0 3967
3.0 2940
4.0 1780
5.0 517
2.0 509
違禁_jieba_corrected: 1461
0 773
3 577
1 85
5 14
4 11
2 1
Corrected similarity : 0.6386036960985626
```
## Result
### CensorWords_Ground Truth.csv
```
Predict Accuracy : 0.8995085491962732
Model
correct 0 1 2 3 4 5
predict
0 12679 483 31 472 322 72
1 57 2125 2 72 9 0
2 1 0 106 0 19 0
3 132 40 3 1701 38 14
4 14 5 14 36 865 2
5 12 1 2 107 3 95
```
https://docs.google.com/spreadsheets/d/15ERx7iORf7hVp2oSz8cU_Ap0EVlAsNqe8JT9eMuWXcs/edit?usp=sharing
### Vendor-provided political, abuse, prohibited, porn, and ads data
#### Pipeline
#### Results
| Class | Rows | Previous | A. Segmented | B. Unsegmented |
| --- | -----| -------|-------|-----|
| Political (涉政) | 1736 | **0.917** |0.729 |0.894|
| Abuse (辱罵) | 302 | **0.480** |0.470 |0.417|
| Prohibited (違禁) | 1182 | 0.974 |0.799 |**0.974**|
| Porn (色情) | 415 | 0.915 |0.534 |**0.920**|
| Ads (廣告) | 5529 | 0.002 |0.991 |**0.991**|
```
[Segmented]
涉政_210502_1610.csv
Predict 0 1 3 4 5 All
label
1 370 1265 90 3 8 1736
All 370 1265 90 3 8 1736
Accuracy: 0.7286866359447005
Error amount: 471 / 1736
辱罵_210502_1611.csv
Predict 0 1 2 4 5 All
label
2 113 3 142 43 1 302
All 113 3 142 43 1 302
Accuracy: 0.47019867549668876
Error amount: 160 / 302
違禁_210502_1611.csv
Predict 0 1 3 4 5 All
label
3 144 28 945 5 60 1182
All 144 28 945 5 60 1182
Accuracy: 0.799492385786802
Error amount: 237 / 1182
色情_210502_1612.csv
Predict 0 1 2 3 4 5 All
label
4 157 6 4 20 222 6 415
All 157 6 4 20 222 6 415
Accuracy: 0.5349397590361445
Error amount: 193 / 415
廣告_210502_1612.csv
Predict 0 1 4 5 All
label
5 47 1 2 5736 5786
All 47 1 2 5736 5786
Accuracy: 0.9913584514344971
Error amount: 50 / 5786
```
```
[Unsegmented]
涉政_210502_1801.csv
Predict 0 1 3 4 5 All
label
1 71 1552 105 2 6 1736
All 71 1552 105 2 6 1736
Accuracy: 0.8940092165898618
Error amount: 184 / 1736
辱罵_210502_1801.csv
Predict 0 1 2 3 4 All
label
2 113 6 126 8 49 302
All 113 6 126 8 49 302
Accuracy: 0.41721854304635764
Error amount: 176 / 302
違禁_210502_1802.csv
Predict 0 1 3 4 5 All
label
3 18 3 1151 8 2 1182
All 18 3 1151 8 2 1182
Accuracy: 0.9737732656514383
Error amount: 31 / 1182
色情_210502_1802.csv
Predict 0 3 4 5 All
label
4 7 21 382 5 415
All 7 21 382 5 415
Accuracy: 0.9204819277108434
Error amount: 33 / 415
廣告_210502_1802.csv
Predict 0 1 3 4 5 All
label
5 37 1 12 4 5732 5786
All 37 1 12 4 5732 5786
Accuracy: 0.9906671275492568
Error amount: 54 / 5786
```
https://docs.google.com/spreadsheets/d/1Mz0YOHm6lrIYmUfH6-M1eSOOMq9G3lP-R0y0r9Nqukg/edit?usp=sharing
# Update 2021.05.10
## URL timing comparison (package vs regular expression)
* Package: `urlextract`
* regular expression: `[a-zA-Z0-9]+\.[a-zA-Z]+`
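
A minimal sketch of the regular-expression check, using the pattern listed above with the standard `re` module. `contains_url` and `count_with_url` are hypothetical helper names, and the sample nicknames are made up for illustration. The `urlextract` package is not used here.

```python
import re
from typing import Iterable

# Pattern copied from the list above: alphanumerics, a dot, then letters.
URL_PATTERN = re.compile(r"[a-zA-Z0-9]+\.[a-zA-Z]+")


def contains_url(nickname: str) -> bool:
    """True if the nickname contains something that looks like a URL."""
    return URL_PATTERN.search(nickname) is not None


def count_with_url(nicknames: Iterable[str]) -> int:
    """Number of nicknames containing at least one URL-like match."""
    return sum(1 for nickname in nicknames if contains_url(nickname))


samples = ["看片 abc123.com", "你好", "x1.cn 加我"]
print(count_with_url(samples), "/", len(samples))
```

The pattern is deliberately loose: any `letters.letters` sequence matches, so strings such as `v1.beta` also count as URLs. That trades some precision for speed and recall.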
### result
* Test data: ads (URLs: about 5345/5786)
* Environment: Colab (CPU)
* Time unit: seconds

| Method | Time | URLs | Scan rule base first, then segment and predict (total) |
| ------------------ | ----- | -----| --- |
| regular expression | 6.004 | 5349 | 59.410 |
| urlextract | 7.903 | 5342 | 71.159 |
| Both | 7.737 | 5342 | 71.32 |
##### Note: exact timings will be re-measured with the version of the program that is packaged for the vendor
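## Rule-base scan: time and accuracy
* Environment: Colab (CPU)
* Observation: first check whether any rule-base term appears in the nickname, and send only the nicknames with no rule-base hit on to segmentation & predict. This takes less time (not every nickname has to be segmented and embedded) and also gives higher accuracy
### Pipeline
* Method before 5/10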
```
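1. Find accounts/phone numbers/URLs => label 5 directly
2. Jieba segmentation / no segmentation at all
3. Compare against the rule base; exact matches get the corresponding label
nickname: "你這個蠢蛋"
rule base = ['白癡', '蠢蛋' ]
[A. Segmented:]
case(1) " 你 這個 蠢蛋": "蠢蛋" is identical to "蠢蛋" in the rule base => label 2
case(2) "你 這個 蠢 蛋": no token is identical to "蠢蛋" => model predict
[B. Unsegmented]
Compare "你這個蠢蛋" against the rule base; no identical entry => model predict
4. model predict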
```
* New method adopted on 5/10
```
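1. Find accounts/phone numbers/URLs => label 5 directly
2. Check whether any rule-base term appears in the nickname
rule base = ['白癡', '蠢蛋' ]
nickname_1 = "你這個蠢蛋"
nickname_2 = "你這個帥哥"
Check whether "白癡" or "蠢蛋" occurs in the nickname
=> solves the case where segmentation splits a term so the rule base cannot find it
3. model predict
[C. Predict without segmentation]
「你這個蠢蛋: 2」
[D. Predict after segmentation]
「你 這個 帥哥」 =>「你:0 這個:0 帥哥:0」 => 「final label:0」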
```
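
The difference between the two methods can be sketched as follows. This is an illustration under stated assumptions, not the packaged program: `RULE_BASE`, `scan_rule_base`, and `classify` are hypothetical names, both rule-base terms are assumed to map to label 2, the account/phone/URL step is omitted, and the model is replaced by a stub that always returns 0.

```python
from typing import Callable, Dict, List, Optional

# Assumption: both example terms belong to the abuse class (2).
RULE_BASE: Dict[str, int] = {"白癡": 2, "蠢蛋": 2}


def scan_rule_base(nickname: str, rule_base: Dict[str, int]) -> Optional[int]:
    """Return the label of the first rule-base term found inside the nickname."""
    for term, label in rule_base.items():
        if term in nickname:  # substring test, independent of segmentation
            return label
    return None


def classify(nickname: str, rule_base: Dict[str, int],
             predict: Callable[[str], int]) -> int:
    """Scan the rule base first; fall back to the model only on no hit."""
    hit = scan_rule_base(nickname, rule_base)
    return hit if hit is not None else predict(nickname)


def stub_model(nickname: str) -> int:
    return 0  # placeholder for the real model


# Old method, case (2): segmentation split the term, so exact matching fails.
bad_tokens: List[str] = ["你", "這個", "蠢", "蛋"]
exact_hit = any(token in RULE_BASE for token in bad_tokens)

print(exact_hit)
print(classify("你這個蠢蛋", RULE_BASE, stub_model))
print(classify("你這個帥哥", RULE_BASE, stub_model))
```

Because the substring scan runs on the raw nickname, a segmentation error can no longer hide a rule-base term, and nicknames with a hit skip segmentation and embedding entirely.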
### Time
| Class | Rows | A. Segment, then check rule base | B. Check rule base without segmentation | C. Scan rule base first, predict without segmentation | D. Scan rule base first, then segment and predict |
| ---- | ---- | ---------------------- | --- | ------------------------------- | ------------------------------- |
| Political (涉政) | 1736 | 360.492 | 188.859 | 263.931 | 483.819 |
| Abuse (辱罵) | 302 | 88.320 | 31.945 | 24.005 | 62.088 |
| Prohibited (違禁) | 1182 | 391.988 | 133.854 | 185.525 | 567.888 |
| Porn (色情) | 415 | 112.990 | 45.805 | 65.774 | 161.557 |
| Ads (廣告) | 5529 | 23.158 | 16.497 | 23.525 | 74.786 |
### Accuracy
| Class | Rows | A. Segment, then check rule base | B. Check rule base without segmentation | C. Scan rule base first, predict without segmentation | D. Scan rule base first, then segment and predict |
| ---- | ---- | ------| --- | -------- | ------ |
| Political (涉政) | 1736 | 0.729 | 0.894 | **0.895** | 0.680 |
| Abuse (辱罵) | 302 | 0.470 | 0.417 | 0.708 | **0.745** |
| Prohibited (違禁) | 1182 | 0.799 | **0.974** | 0.934 | 0.759 |
| Porn (色情) | 415 | 0.534 | **0.920** | 0.889 | 0.460 |
| Ads (廣告) | 5529 | 0.991 | 0.991 | 0.991 | 0.991 |
```
[Legend]
INFO:root:dataset containing rule_politics: contains political rule-base terms
INFO:root:dataset containing rule_abuse: contains abuse rule-base terms
INFO:root:dataset containing rule_porn: contains porn rule-base terms
INFO:root:dataset containing rule: contains ads rule-base terms
```
```
[Political]
INFO:root:dataset containing rule_politics: 23 / 1736
INFO:root:dataset containing rule_abuse: 20 / 1736
INFO:root:dataset containing rule_porn: 0 / 1736
INFO:root:dataset containing rule: 7 / 1736
[Abuse]
INFO:root:dataset containing rule_politics: 0 / 1182
INFO:root:dataset containing rule_abuse: 27 / 1182
INFO:root:dataset containing rule_porn: 0 / 1182
INFO:root:dataset containing rule: 20 / 1182
[Prohibited]
INFO:root:dataset need decode: 0 / 415
INFO:root:dataset containing account: 0 / 415
INFO:root:dataset containing phone number: 0 / 415
INFO:root:dataset containing URL: 4 / 415
[Porn]
INFO:root:dataset containing rule_politics: 5 / 415
INFO:root:dataset containing rule_abuse: 10 / 415
INFO:root:dataset containing rule_porn: 0 / 415
INFO:root:dataset containing rule: 4 / 415
[Ads]
INFO:root:dataset need decode: 0 / 5786
INFO:root:dataset containing account: 26 / 5786
INFO:root:dataset containing phone number: 362 / 5786
INFO:root:dataset containing URL: 5349 / 5786
```
* Unsegmented abuse predict result: https://docs.google.com/spreadsheets/d/1QNbVK3weAca4HHQm69f-XLih8fzyC7wQGXnlQY2Lc-4/edit?usp=sharing
## Segmentation threshold
### Method
Segment the nickname only if its length exceeds the threshold
```
[Longest nickname in each file]
Political (涉政): 25
Abuse (辱罵): 13
Prohibited (違禁): 17
Porn (色情): 13
Ads (廣告): 20
```
| Class/Threshold | 0 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | No segmentation |
| -------------- | ---- | ----- | ----- | --------- | ----- | ----- | ----- | ---------| --- |
| Political (涉政) | 0.68 | 0.77 | 0.84 | 0.87 | 0.892 | 0.895 | 0.896 | **0.896** |0.895|
| Abuse (辱罵) | 0.74 | 0.768 | 0.768 | **0.768** | 0.764 | 0.74 | 0.725 | 0.721 |0.708|
| Prohibited (違禁) | 0.75 | 0.79 | 0.82 | 0.84 | 0.86 | 0.89 | 0.920 | **0.927** |0.934|
| Porn (色情) | 0.46 | 0.52 | 0.65 | 0.72 | 0.77 | 0.81 | 0.86 | **0.88** |0.889|
| Ads (廣告) | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | **0.99** |0.991|
* Ads are almost all URLs, so they are detected by the regular expression rather than the model, which is why the accuracy is the same across thresholds
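
The threshold rule in this section can be sketched as follows. It is a minimal illustration: `tokens_for_prediction` is a hypothetical name, length is assumed to be counted in characters, and the built-in `list` (one token per character) stands in for the real Jieba segmenter.

```python
from typing import Callable, List, Optional


def tokens_for_prediction(
    nickname: str,
    threshold: Optional[int],
    segment: Callable[[str], List[str]],
) -> List[str]:
    """Segment only nicknames longer than `threshold`.

    threshold=None means never segment (the "No segmentation" column);
    threshold=0 means every non-empty nickname is segmented.
    """
    if threshold is not None and len(nickname) > threshold:
        return segment(nickname)
    return [nickname]  # short nicknames go to the model as a single unit


nickname = "你這個蠢蛋"  # 5 characters
print(tokens_for_prediction(nickname, 4, list))     # longer than 4: segmented
print(tokens_for_prediction(nickname, 5, list))     # not longer than 5: kept whole
print(tokens_for_prediction(nickname, None, list))  # never segmented
```

Short nicknames stay whole, so the model sees them in full, while long nicknames are still segmented so that an abnormal part can be caught.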