# 210622 Execution Speed Test Results

[TOC]

## Test Purpose
+ Measure how enabling or disabling `url_detect`, `account_detect`, and `remove_duplicate` in **`batch_analysis`** affects execution time.

### Code
+ The focus is the time spent in `model.batch_analysis()`.

```python=
from datetime import datetime

# `model` and `preprocess_from_csv_to_list` come from the project under test.
times = 0
msg_file_list = ['2019_Oct_Data']
for msg_file in msg_file_list:
    with open(msg_file + '.csv', 'r', encoding="utf-8") as f:
        csv_list = preprocess_from_csv_to_list(f)
    i = 0
    start = 0
    window = 1000
    size = len(csv_list)
    while start < size or start < 3000:
        start = i * window
        end = (i + 1) * window
        print(datetime.now().strftime("%Y/%m/%d %H:%M:%S") + ', ' + msg_file + ' ' + f'{i:04d}')
        # Feed 1000 records per call to simulate continuous small-batch message analysis.
        # Note: `start`/`end` mark the window, but the call below passes `csv_list` unsliced.
        csv_df = model.batch_analysis(csv_list, pinyin_mode=True, url_detect=True, account_detect=True)
        # csv_df = model.batch_analysis(csv_list, pinyin_mode=True, url_detect=False, account_detect=False)
        save_path = 'data/' + msg_file + f'{i:04d}' + '_result.csv'
        csv_df.to_csv(save_path, index=False)
        i += 1
        times += 1
```

**`batch_analysis`** is split into the following steps:

```
[1_pinyin_translate]
check pinyin available
translate to zh-cn and urldecode and convert special num
[2_account_phone]
account or phone number detect
[3_url_detect]
url detect
[4_language_rulebase]
language classify rule base detect
[5_segment_duplicate]
create segment words list
remove duplicate words
[6_classZHEN]
classification zh and en words
[7_predictZH][8_predictEN]
predict words label
[9_result]
result list
```

## Test Machine
+ **CPU:** i7-6700
+ **RAM:** 16 GB
+ **GPU:** None

## Result
### Log Format
```
start_time, end_time, second, description
```

### Case 1. Unchanged (default: `url_detect` off, `account_detect` on, `remove_duplicate` run)
+ **`csv_df = model.batch_analysis(csv_list, pinyin_mode=True)`**
+ `remove_duplicate` enabled

Each call feeds 1000 records into `batch_analysis` to simulate continuous small-batch message analysis. Results for several batches are listed below.

```
2021/06/22 20:40:18, 2019_Oct_Data 0000
2021/06/22 20:40:18, 2021/06/22 20:40:19, 0.224617, 1_pinyin_translate
2021/06/22 20:40:19, 2021/06/22 20:40:19, 0.040005, 2_account_phone
2021/06/22 20:40:19, 2021/06/22 20:40:19, 0.000497, 3_url_detect
2021/06/22 20:40:19, 2021/06/22 20:40:28, 9.479768, 4_language_rulebase
2021/06/22 20:41:02, 2021/06/22 20:41:02, 0.338362, 5_segment_duplicate
2021/06/22 20:41:02, 2021/06/22 20:41:02, 0.008508, 6_classZHEN
2021/06/22 20:41:09, 2021/06/22 20:43:41, 152.221834, 7_predictZH
2021/06/22 20:43:48, 2021/06/22 20:45:25, 97.14799, 8_predictEN
2021/06/22 20:45:25, 2021/06/22 20:45:25, 0.124075, 9_result
2021/06/22 20:45:25, 2019_Oct_Data 0001
2021/06/22 20:45:25, 2021/06/22 20:45:25, 0.21634, 1_pinyin_translate
2021/06/22 20:45:25, 2021/06/22 20:45:25, 0.018332, 2_account_phone
2021/06/22 20:45:25, 2021/06/22 20:45:25, 0.000658, 3_url_detect
2021/06/22 20:45:25, 2021/06/22 20:45:33, 7.879221, 4_language_rulebase
2021/06/22 20:46:07, 2021/06/22 20:46:07, 0.342503, 5_segment_duplicate
2021/06/22 20:46:07, 2021/06/22 20:46:07, 0.00773, 6_classZHEN
2021/06/22 20:46:15, 2021/06/22 20:48:43, 148.295625, 7_predictZH
2021/06/22 20:48:50, 2021/06/22 20:50:24, 94.240869, 8_predictEN
2021/06/22 20:50:24, 2021/06/22 20:50:24, 0.134321, 9_result
2021/06/22 20:50:24, 2019_Oct_Data 0002
2021/06/22 20:50:24, 2021/06/22 20:50:24, 0.216787, 1_pinyin_translate
2021/06/22 20:50:24, 2021/06/22 20:50:24, 0.017886, 2_account_phone
2021/06/22 20:50:24, 2021/06/22 20:50:24, 0.00064, 3_url_detect
2021/06/22 20:50:24, 2021/06/22 20:50:33, 8.484351, 4_language_rulebase
2021/06/22 20:51:07, 2021/06/22 20:51:07, 0.336325, 5_segment_duplicate
2021/06/22 20:51:07, 2021/06/22 20:51:07, 0.007933, 6_classZHEN
2021/06/22 20:51:16, 2021/06/22 20:53:46, 149.921058, 7_predictZH
2021/06/22 20:53:53, 2021/06/22 20:55:28, 95.08858, 8_predictEN
2021/06/22 20:55:28, 2021/06/22 20:55:28, 0.134846, 9_result
2021/06/22 20:55:28, 2019_Oct_Data 0003
2021/06/22 20:55:28, 2021/06/22 20:55:28, 0.216355, 1_pinyin_translate
2021/06/22 20:55:28, 2021/06/22 20:55:28, 0.018108, 2_account_phone
2021/06/22 20:55:28, 2021/06/22 20:55:28, 0.000569, 3_url_detect
2021/06/22 20:55:28, 2021/06/22 20:55:37, 8.500501, 4_language_rulebase
2021/06/22 20:56:10, 2021/06/22 20:56:11, 0.336066, 5_segment_duplicate
2021/06/22 20:56:11, 2021/06/22 20:56:11, 0.007613, 6_classZHEN
2021/06/22 20:56:19, 2021/06/22 20:58:48, 148.918653, 7_predictZH
2021/06/22 20:58:55, 2021/06/22 21:00:31, 96.162245, 8_predictEN
2021/06/22 21:00:31, 2021/06/22 21:00:31, 0.135948, 9_result

Total second: 1230
```

### Case 2. `url_detect` on, `account_detect` off, `remove_duplicate` not run
+ `model.batch_analysis(csv_list, pinyin_mode=True, url_detect='urlextract', account_detect=False)`
+ The `remove_duplicate` part is commented out

Each call feeds 1000 records into `batch_analysis` to simulate continuous small-batch message analysis. Results for several batches are listed below.

==The `7_predictZH` and `8_predictEN` rows of each batch show that Case 2 takes somewhat longer: about 157 to 162 s versus 148 to 152 s for `7_predictZH`, and about 101 to 105 s versus 94 to 97 s for `8_predictEN`. The total rises from 1230 s to 1292 s, roughly 5%.==

```
2021/06/22 21:08:40, 2019_Oct_Data 0000
2021/06/22 21:08:40, 2021/06/22 21:08:40, 0.226093, 1_pinyin_translate
2021/06/22 21:08:40, 2021/06/22 21:08:40, 0.000605, 2_account_phone
2021/06/22 21:08:40, 2021/06/22 21:08:40, 0.037508, 3_url_detect
2021/06/22 21:08:40, 2021/06/22 21:08:50, 9.685325, 4_language_rulebase
2021/06/22 21:09:24, 2021/06/22 21:09:25, 0.43166, 5_segment_duplicate
2021/06/22 21:09:25, 2021/06/22 21:09:25, 0.14668, 6_classZHEN
2021/06/22 21:09:31, 2021/06/22 21:12:14, 162.493483, 7_predictZH
2021/06/22 21:12:20, 2021/06/22 21:14:05, 104.640442, 8_predictEN
2021/06/22 21:14:05, 2021/06/22 21:14:05, 0.134122, 9_result
2021/06/22 21:14:05, 2019_Oct_Data 0001
2021/06/22 21:14:05, 2021/06/22 21:14:05, 0.216142, 1_pinyin_translate
2021/06/22 21:14:05, 2021/06/22 21:14:05, 0.000631, 2_account_phone
2021/06/22 21:14:05, 2021/06/22 21:14:05, 0.016253, 3_url_detect
2021/06/22 21:14:05, 2021/06/22 21:14:14, 8.188204, 4_language_rulebase
2021/06/22 21:14:47, 2021/06/22 21:14:47, 0.426519, 5_segment_duplicate
2021/06/22 21:14:47, 2021/06/22 21:14:48, 0.147275, 6_classZHEN
2021/06/22 21:14:57, 2021/06/22 21:17:33, 156.788268, 7_predictZH
2021/06/22 21:17:41, 2021/06/22 21:19:22, 100.853545, 8_predictEN
2021/06/22 21:19:22, 2021/06/22 21:19:22, 0.143096, 9_result
2021/06/22 21:19:22, 2019_Oct_Data 0002
2021/06/22 21:19:22, 2021/06/22 21:19:22, 0.216124, 1_pinyin_translate
2021/06/22 21:19:22, 2021/06/22 21:19:22, 0.000636, 2_account_phone
2021/06/22 21:19:22, 2021/06/22 21:19:22, 0.015805, 3_url_detect
2021/06/22 21:19:22, 2021/06/22 21:19:30, 7.798961, 4_language_rulebase
2021/06/22 21:20:03, 2021/06/22 21:20:04, 0.416748, 5_segment_duplicate
2021/06/22 21:20:04, 2021/06/22 21:20:04, 0.144886, 6_classZHEN
2021/06/22 21:20:12, 2021/06/22 21:22:50, 158.546694, 7_predictZH
2021/06/22 21:22:58, 2021/06/22 21:24:39, 101.023627, 8_predictEN
2021/06/22 21:24:39, 2021/06/22 21:24:39, 0.147923, 9_result
2021/06/22 21:24:39, 2019_Oct_Data 0003
2021/06/22 21:24:39, 2021/06/22 21:24:39, 0.215923, 1_pinyin_translate
2021/06/22 21:24:39, 2021/06/22 21:24:39, 0.000541, 2_account_phone
2021/06/22 21:24:39, 2021/06/22 21:24:39, 0.015432, 3_url_detect
2021/06/22 21:24:39, 2021/06/22 21:24:48, 8.265825, 4_language_rulebase
2021/06/22 21:25:21, 2021/06/22 21:25:21, 0.418908, 5_segment_duplicate
2021/06/22 21:25:21, 2021/06/22 21:25:22, 0.142114, 6_classZHEN
2021/06/22 21:25:29, 2021/06/22 21:28:07, 157.430169, 7_predictZH
2021/06/22 21:28:14, 2021/06/22 21:29:55, 101.167407, 8_predictEN
2021/06/22 21:29:55, 2021/06/22 21:29:55, 0.14544, 9_result

Total second: 1292
```
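
To compare cases step by step rather than by eye, the log rows can be summed per step. The following is a minimal sketch, assuming the log follows the `start_time, end_time, second, description` format described under Result; the function name `sum_seconds_by_step` is hypothetical and not part of the project, and the sample rows are copied from the first two batches of Case 1.

```python
from collections import defaultdict


def sum_seconds_by_step(log_text):
    """Sum the `second` column for each step description in a timing log."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Timing rows have exactly four fields; batch headers have two
        # and the "Total second" line has one, so both are skipped.
        if len(fields) != 4:
            continue
        _start, _end, seconds, description = fields
        totals[description] += float(seconds)
    return dict(totals)


# Rows taken from batches 0000 and 0001 of Case 1.
sample = """\
2021/06/22 20:40:18, 2019_Oct_Data 0000
2021/06/22 20:41:09, 2021/06/22 20:43:41, 152.221834, 7_predictZH
2021/06/22 20:43:48, 2021/06/22 20:45:25, 97.14799, 8_predictEN
2021/06/22 20:45:25, 2019_Oct_Data 0001
2021/06/22 20:46:15, 2021/06/22 20:48:43, 148.295625, 7_predictZH
2021/06/22 20:48:50, 2021/06/22 20:50:24, 94.240869, 8_predictEN
"""

totals = sum_seconds_by_step(sample)
for step, seconds in sorted(totals.items()):
    print(f"{step}: {seconds:.2f} s")
```

Running the same function over the full Case 1 and Case 2 logs and subtracting the per-step totals would show which steps account for the difference between the two totals.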