# Nickname jieba mode difference

## jieba mode

e.g. 內蒙古自治區獨立

```
--- Search mode --- # search
蒙古 / 內蒙古 / 自治 / 區獨立
--- Full mode --- # full
內蒙古 / 蒙古 / 自治 / 區 / 獨 / 立
--- Precise mode --- # precise
內蒙古 / 自治 / 區獨立
--- All three modes merged --- # all
區獨立 / 區 / 自治 / 內蒙古 / 獨 / 蒙古 / 立
--- Three modes plus nickname --- # all+
區獨立 / 區 / 自治 / 內蒙古自治區獨立 / 內蒙古 / 獨 / 蒙古 / 立
```

## Dataset

The test set for these experiments is `NicknameLabelCheck`, which the team labeled by hand together last time.

https://docs.google.com/spreadsheets/d/1GyuGdOMjm62wbaKIXjVMDqoGyg-05QTTmNYDsriSc3Q/edit?usp=sharing

```
drop duplicate: 1720
0    657
1    339
2    397
3     69
4    148
5    110
```



## Baseline: 數美



```
數美           0     1     2     3     4     5   All
Correct
0          477   107     3     7    62     1   657
1           18   320     0     1     0     0   339
2            1     3   338     0    55     0   397
3            0    25     0    44     0     0    69
4            2     0     0     0   146     0   148
5            1    11     0     0     3    95   110
All        499   466   341    52   266    96  1720

Accuracy: 0.8255813953488372
Error amount: 300 / 1720
Type 0 Accuracy: 477 / 657 = 0.726027
Type 1 Accuracy: 320 / 339 = 0.943953
Type 2 Accuracy: 338 / 397 = 0.851385
Type 3 Accuracy: 44 / 69 = 0.637681
Type 4 Accuracy: 146 / 148 = 0.986486
Type 5 Accuracy: 95 / 110 = 0.863636
```

## Experiments 210315

Added new tokenization modes and fixed the function that filters Chinese characters.

```python
# The original function classified these as non-Chinese tokens
examples = ["看a片", "傻b", "c級片"]
```

|          | [數美](#Baseline-數美) | [Search](#Search-mode) | [Full](#Full-mode-Best) | [Precise](#Precise-mode) | [All](#All-three-modes-merged-All) | [None](#No-tokenization-None) | [All+](#Three-modes-plus-nickname-All) |
| -------- | ---------------------- | ---------------------- | ----------------------- | ------------------------ | ---------------------------------- | ----------------------------- | -------------------------------------- |
| Accuracy | 0.82558                | 0.67034                | **0.68255**             | 0.66860                  | 0.67732                            | 0.52906                       | 0.66569                                |
| Error    | 300                    | 567                    | **546**                 | 570                      | 555                                | 810                           | 575                                    |

### Search mode

```
./NicknameLabelCheck_search_210315_0232.csv
Predict      0     1     2     3     4     5   All
Correct
0          529    96    11     3    17     1   657
1          117   215     0     3     1     3   339
2          116    13   241     1    26     0   397
3           20    16     0    33     0     0    69
4           55    13    13     0    66     1   148
5           31     6     0     2     2    69   110
All        868   359   265    42   112    74  1720

Accuracy: 0.6703488372093023
Error amount: 567 / 1720
Type 0 Accuracy: 529 / 657 = 0.805175
Type 1 Accuracy: 215 / 339 = 0.634218
Type 2 Accuracy: 241 / 397 = 0.607053
Type 3 Accuracy: 33 / 69 = 0.478261
Type 4 Accuracy: 66 / 148 = 0.445946
Type 5 Accuracy: 69 / 110 = 0.627273
```

### Full mode (Best)

```
./NicknameLabelCheck_full_210315_0233.csv
Predict      0     1     2     3     4     5   All
Correct
0          534    89    11     4    18     1   657
1          123   211     0     4     0     1   339
2           95     6   258     1    37     0   397
3           21    15     0    33     0     0    69
4           57    11    16     0    63     1   148
5           24     8     0     1     2    75   110
All        854   340   285    43   120    78  1720

Accuracy: 0.6825581395348838
Error amount: 546 / 1720
Type 0 Accuracy: 534 / 657 = 0.812785
Type 1 Accuracy: 211 / 339 = 0.622419
Type 2 Accuracy: 258 / 397 = 0.649874
Type 3 Accuracy: 33 / 69 = 0.478261
Type 4 Accuracy: 63 / 148 = 0.425676
Type 5 Accuracy: 75 / 110 = 0.681818
```

### Precise mode

```
./NicknameLabelCheck_precise_210315_0233.csv
Predict      0     1     2     3     4     5   All
Correct
0          531    95    11     3    16     1   657
1          118   214     0     3     1     3   339
2          119    16   236     1    25     0   397
3           21    15     0    33     0     0    69
4           55    13    12     0    67     1   148
5           31     6     0     2     2    69   110
All        875   359   259    42   111    74  1720

Accuracy: 0.6686046511627907
Error amount: 570 / 1720
Type 0 Accuracy: 531 / 657 = 0.808219
Type 1 Accuracy: 214 / 339 = 0.631268
Type 2 Accuracy: 236 / 397 = 0.594458
Type 3 Accuracy: 33 / 69 = 0.478261
Type 4 Accuracy: 67 / 148 = 0.452703
Type 5 Accuracy: 69 / 110 = 0.627273
```

### All three modes merged (All)

```
./NicknameLabelCheck_all_210315_0234.csv
Predict      0     1     2     3     4     5   All
Correct
0          497   125    11     4    19     1   657
1           99   234     0     4     1     1   339
2           89    13   259     1    35     0   397
3           20    16     0    33     0     0    69
4           52    13    15     0    67     1   148
5           24     8     0     1     2    75   110
All        781   409   285    43   124    78  1720

Accuracy: 0.6773255813953488
Error amount: 555 / 1720
Type 0 Accuracy: 497 / 657 = 0.756469
Type 1 Accuracy: 234 / 339 = 0.690265
Type 2 Accuracy: 259 / 397 = 0.652393
Type 3 Accuracy: 33 / 69 = 0.478261
Type 4 Accuracy: 67 / 148 = 0.452703
Type 5 Accuracy: 75 / 110 = 0.681818
```

### No tokenization (None)

```
./NicknameLabelCheck_None_210315_0235.csv
Predict      0     1     2     3     4     5   All
Correct
0          524    95    11     3    21     3   657
1          166   163     2     2     4     2   339
2          179    10   152    12    29    15   397
3           39    12     0    15     3     0    69
4           96     7    15     2    24     4   148
5           57    10     0     8     3    32   110
All       1061   297   180    42    84    56  1720

Accuracy: 0.5290697674418605
Error amount: 810 / 1720
Type 0 Accuracy: 524 / 657 = 0.797565
Type 1 Accuracy: 163 / 339 = 0.480826
Type 2 Accuracy: 152 / 397 = 0.382872
Type 3 Accuracy: 15 / 69 = 0.217391
Type 4 Accuracy: 24 / 148 = 0.162162
Type 5 Accuracy: 32 / 110 = 0.290909
```

### Three modes plus nickname (All+)

```
./NicknameLabelCheck_all+_210315_0240.csv
Predict      0     1     2     3     4     5   All
Correct
0          458   143    18     5    30     3   657
1           91   240     2     4     1     1   339
2           61    18   281     1    33     3   397
3           18    16     0    33     2     0    69
4           45    17    20     2    63     1   148
5           20    13     0     2     5    70   110
All        693   447   321    47   134    78  1720

Accuracy: 0.6656976744186046
Error amount: 575 / 1720
Type 0 Accuracy: 458 / 657 = 0.697108
Type 1 Accuracy: 240 / 339 = 0.707965
Type 2 Accuracy: 281 / 397 = 0.707809
Type 3 Accuracy: 33 / 69 = 0.478261
Type 4 Accuracy: 63 / 148 = 0.425676
Type 5 Accuracy: 70 / 110 = 0.636364
```

## Experiments 210307

In the classification pipeline, if a phone number or an account is detected, the nickname is classified directly as an advertisement (5), skipping the tokenization-based processing that would follow.

Accuracy is examined separately for each of the tokenization modes described in [jieba mode](#jieba-mode) above.

Model: `text_cnn_best_99.4219512195122_LR0.001_BATCH100_EPOCH100` (Label.csv, 2245 rows, added to the training data)

Result Spreadsheet: https://drive.google.com/drive/folders/1Ltoh4lA2srSzjApLRGrcFOtDxs22Istf?usp=sharing

|              | [Dataset](#Dataset)                  | [數美](#Baseline-數美)               | [Search](#Search-mode)               | [Full](#Full-mode-Best)              | [Precise](#Precise-mode)             |
| ------------ | ------------------------------------ | ------------------------------------ | ------------------------------------ | ------------------------------------ | ------------------------------------ |
| Distribution |  |  |  |                                      |  |
| Accuracy     |                                      | 0.82558                              | 0.66860                              | **0.68139**                          | 0.66686                              |
| Error        |                                      | 300                                  | 570                                  | **548**                              | 573                                  |

### Baseline: 數美



```
數美           0     1     2     3     4     5   All
Correct
0          477   107     3     7    62     1   657
1           18   320     0     1     0     0   339
2            1     3   338     0    55     0   397
3            0    25     0    44     0     0    69
4            2     0     0     0   146     0   148
5            1    11     0     0     3    95   110
All        499   466   341    52   266    96  1720

Accuracy: 0.8255813953488372
Error amount: 300 / 1720
Type 0 Accuracy: 477 / 657 = 0.726027
Type 1 Accuracy: 320 / 339 = 0.943953
Type 2 Accuracy: 338 / 397 = 0.851385
Type 3 Accuracy: 44 / 69 = 0.637681
Type 4 Accuracy: 146 / 148 = 0.986486
Type 5 Accuracy: 95 / 110 = 0.863636
```

### Search mode



```
Predict      0     1     2     3     4     5   All
Correct
0          529    96    11     3    17     1   657
1          117   215     0     3     1     3   339
2          118    13   239     1    26     0   397
3           20    16     0    33     0     0    69
4           55    13    13     0    65     2   148
5           31     6     0     2     2    69   110
All        870   359   263    42   111    75  1720

Accuracy: 0.6686046511627907
Error amount: 570 / 1720
Type 0 Accuracy: 529 / 657 = 0.805175
Type 1 Accuracy: 215 / 339 = 0.634218
Type 2 Accuracy: 239 / 397 = 0.602015
Type 3 Accuracy: 33 / 69 = 0.478261
Type 4 Accuracy: 65 / 148 = 0.439189
Type 5 Accuracy: 69 / 110 = 0.627273
```

### Full mode (Best)



```
Predict      0     1     2     3     4     5   All
Correct
0          534    89    11     4    18     1   657
1          123   211     0     4     0     1   339
2           96     6   257     1    37     0   397
3           21    15     0    33     0     0    69
4           57    11    16     0    62     2   148
5           24     8     0     1     2    75   110
All        855   340   284    43   119    79  1720

Accuracy: 0.6813953488372093
Error amount: 548 / 1720
Type 0 Accuracy: 534 / 657 = 0.812785
Type 1 Accuracy: 211 / 339 = 0.622419
Type 2 Accuracy: 257 / 397 = 0.647355
Type 3 Accuracy: 33 / 69 = 0.478261
Type 4 Accuracy: 62 / 148 = 0.418919
Type 5 Accuracy: 75 / 110 = 0.681818
```

### Precise mode



```
Predict      0     1     2     3     4     5   All
Correct
0          531    95    11     3    16     1   657
1          118   214     0     3     1     3   339
2          121    16   234     1    25     0   397
3           21    15     0    33     0     0    69
4           55    13    12     0    66     2   148
5           31     6     0     2     2    69   110
All        877   359   257    42   110    75  1720

Accuracy: 0.666860465116279
Error amount: 573 / 1720
Type 0 Accuracy: 531 / 657 = 0.808219
Type 1 Accuracy: 214 / 339 = 0.631268
Type 2 Accuracy: 234 / 397 = 0.589421
Type 3 Accuracy: 33 / 69 = 0.478261
Type 4 Accuracy: 66 / 148 = 0.445946
Type 5 Accuracy: 69 / 110 = 0.627273
```

## Experiments 210302

In the classification pipeline, if a phone number or an account is detected, the nickname is classified directly as an advertisement (5), skipping the tokenization-based processing that would follow.

Accuracy is examined separately for each of the tokenization modes described in [jieba mode](#jieba-mode) above.

|              | [Dataset](#Dataset)                  | [數美](#Baseline-數美)               | [Search](#Search-mode)               | [Full](#Full-mode-Best)              | [Precise](#Precise-mode)             |
| ------------ | ------------------------------------ | ------------------------------------ | ------------------------------------ | ------------------------------------ | ------------------------------------ |
| Distribution |  |  |  |  |  |
| Accuracy     |                                      | 0.82558                              | 0.64593                              | **0.65581**                          | 0.64651                              |
| Error        |                                      | 300                                  | 609                                  | **592**                              | 608                                  |

### Baseline: 數美



```
數美           0     1     2     3     4     5   All
Correct
0          477   107     3     7    62     1   657
1           18   320     0     1     0     0   339
2            1     3   338     0    55     0   397
3            0    25     0    44     0     0    69
4            2     0     0     0   146     0   148
5            1    11     0     0     3    95   110
All        499   466   341    52   266    96  1720

Accuracy: 0.8255813953488372
Error amount: 300 / 1720
```

### Search mode

[Result Spreadsheet](https://docs.google.com/spreadsheets/d/1Gin_vldbTeEqt_lahFfCkPBL1fsQ-eAx-1uz8DT3kRY/edit?usp=sharing)



```
Predict      0     1     2     3     4     5   All
Correct
0          563    67     8     1    16     2   657
1          179   155     0     3     1     1   339
2          112     4   250     1    30     0   397
3           25    15     0     9    20     0    69
4           57     6    16     0    67     2   148
5           35     6     0     0     2    67   110
All        971   253   274    14   136    72  1720

Accuracy: 0.6459302325581395
Error amount: 609 / 1720
```

### Full mode (Best)

[Result Spreadsheet](https://docs.google.com/spreadsheets/d/1CkLtHTcePGkBDR-ZoyRPsIljxsZryXFBmVZgCmYwUNw/edit?usp=sharing)



```
Predict      0     1     2     3     4     5   All
Correct
0          585    39     9     2    19     3   657
1          194   142     0     2     0     1   339
2          105     3   252     1    36     0   397
3           26    14     0     9    20     0    69
4           57     7    17     0    65     2   148
5           27     6     0     0     2    75   110
All        994   211   278    14   142    81  1720

Accuracy: 0.6558139534883721
Error amount: 592 / 1720
```

### Precise mode

[Result Spreadsheet](https://docs.google.com/spreadsheets/d/1HBawn_wZyB-7CqaYGd8AZywdZ9CqNGbjwydIdxQBZWM/edit?usp=sharing)



```
Predict      0     1     2     3     4     5   All
Correct
0          565    66     8     1    15     2   657
1          181   153     0     3     1     1   339
2          112     4   250     1    30     0   397
3           26    14     0     9    20     0    69
4           57     6    15     0    68     2   148
5           36     5     0     0     2    67   110
All        977   248   273    14   136    72  1720

Accuracy: 0.6465116279069767
Error amount: 608 / 1720
```
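
The "all" and "all+" outputs listed under jieba mode are the union of the tokens from the three modes, with all+ also adding the full nickname as one extra token. A minimal sketch of that merge follows. It uses the token lists from the 內蒙古自治區獨立 example as hard-coded input instead of calling jieba, and `merge_modes` is a hypothetical helper name, not the project's own function.

```python
def merge_modes(search, full, precise, nickname=None):
    """Union the tokens of the three modes; optionally add the whole nickname (all+)."""
    merged = set(search) | set(full) | set(precise)
    if nickname is not None:
        merged.add(nickname)
    return merged


# Token lists copied from the jieba mode example in this note.
search = ["蒙古", "內蒙古", "自治", "區獨立"]
full = ["內蒙古", "蒙古", "自治", "區", "獨", "立"]
precise = ["內蒙古", "自治", "區獨立"]

all_tokens = merge_modes(search, full, precise)
all_plus = merge_modes(search, full, precise, nickname="內蒙古自治區獨立")

print(" / ".join(sorted(all_tokens)))
print(" / ".join(sorted(all_plus)))
```

With jieba itself, the three input lists would typically come from `jieba.cut_for_search(text)`, `jieba.cut(text, cut_all=True)`, and `jieba.cut(text)` respectively. A set is used because the merged output in the example is unordered and has no repeated tokens.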
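
Every result block in this note reports the same three things: overall accuracy, the error count, and per-type accuracy. All of them can be recomputed from the confusion matrix alone. The sketch below does this for the 數美 baseline matrix; `summarize` is a hypothetical helper, and the "All" margins are left out because they are just row and column sums.

```python
def summarize(matrix):
    """Return (accuracy, error_count, per_type) for a square confusion matrix.

    matrix[i][j] is the number of samples whose correct label is i
    and whose predicted label is j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    # For each correct label: (hits on the diagonal, row total).
    per_type = [(row[i], sum(row)) for i, row in enumerate(matrix)]
    return correct / total, total - correct, per_type


# 數美 baseline, rows = correct label 0..5, columns = predicted label 0..5.
baseline = [
    [477, 107, 3, 7, 62, 1],
    [18, 320, 0, 1, 0, 0],
    [1, 3, 338, 0, 55, 0],
    [0, 25, 0, 44, 0, 0],
    [2, 0, 0, 0, 146, 0],
    [1, 11, 0, 0, 3, 95],
]

accuracy, errors, per_type = summarize(baseline)
print(f"Accuracy: {accuracy}")
print(f"Error amount: {errors} / {sum(n for _, n in per_type)}")
for label, (hit, n) in enumerate(per_type):
    print(f"Type {label} Accuracy: {hit} / {n} = {hit / n:.6f}")
```

Running the same function over each mode's matrix is a quick way to cross-check the Accuracy and Error rows in the summary tables.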
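
The 210315 note says the original character filter treated mixed tokens such as 看a片 as non-Chinese. The note does not show either version of the function, so the following is only a guess at the kind of change involved: a strict check that requires every character to be a CJK ideograph, compared with a relaxed check that accepts any token containing at least one. Both function names and the `\u4e00-\u9fff` range are assumptions.

```python
import re

# CJK Unified Ideographs block; an assumed range, the real filter may differ.
CJK = r"\u4e00-\u9fff"


def is_all_chinese(token):
    """Strict check: every character must be a CJK ideograph."""
    return re.fullmatch(f"[{CJK}]+", token) is not None


def contains_chinese(token):
    """Relaxed check: at least one CJK ideograph anywhere in the token."""
    return re.search(f"[{CJK}]", token) is not None


for token in ["看a片", "傻b", "c級片"]:
    print(token, is_all_chinese(token), contains_chinese(token))
```

Under the strict check the three example tokens are rejected because each contains a Latin letter. Under the relaxed check they are kept.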
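
Both the 210307 and 210302 runs classify a nickname as an advertisement (5) as soon as a phone number or an account is detected, before any tokenization happens. A rough sketch of that control flow is given below. The regular expressions and the `classify_tokens` callback are placeholders made up for illustration, since this note gives neither the detection rules nor the model interface.

```python
import re

AD_LABEL = 5

# Hypothetical detectors: a run of 8 or more digits, or a messaging-app
# keyword followed by an ID. The real rules are not given in this note.
PHONE_RE = re.compile(r"\d{8,}")
ACCOUNT_RE = re.compile(r"(?i)(line|wechat|qq)\s*[:：]?\s*[a-z0-9_]{4,}")


def classify(nickname, classify_tokens):
    """Return AD_LABEL on a phone/account hit, else defer to the model step."""
    if PHONE_RE.search(nickname) or ACCOUNT_RE.search(nickname):
        return AD_LABEL
    return classify_tokens(nickname)


def fallback(nickname):
    """Stand-in for the tokenization + text-CNN step; always predicts 0 here."""
    return 0


print(classify("加我0912345678", fallback))
print(classify("內蒙古自治區獨立", fallback))
```

Putting the rule first means the tokenizer and the model are never invoked for nicknames that match, so the choice of jieba mode cannot affect those predictions.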