# CSV 大檔案資料提取 & 匯入 Ep.1 > 2023/11/17 Fri. ###### tags: `shell` `懶人包` `SQL` `資料清潔` `公開筆記` ## 案例: 篩選 E-mail 並查詢對應的 decoded User ID > 需求: 透過小鈴鐺發送系統通知訊息給活躍會員 ### Step 1 ``` $ wc -l some_large_illed_file.csv 158372 some_large_illed_file.csv ``` 使用指令 `wc` 確認行數。看到 15 萬行!就請不要想嘗試用 Sublime Text 打開檔案了!!! ### Step 2 ``` $ head some_large_illed_file.csv pid,Last Action Time,user_id,Email ENe5EK,11/17/2023 05:22:22,inXXXqj,wyXYZXX3@gmaXX.com GRENXQ,11/16/2023 17:58:30,ijXXXgm,evXYZXX85@gmXXl.com Bo754,11/16/2023 17:58:30,cikXXXb,luoXYZXX113@gXXil.com 4wBxo5,11/16/2023 17:58:29,92XXX27,alXYZXXas200XXyahoo.com.tw EfCqgN,11/16/2023 17:58:29,79XXX3a,abXYZXX000@gXXil.com Fmpskt,11/16/2023 17:58:28,eaXXX68,ivXYZXX874@gXXil.com 9DvmvB,11/16/2023 17:58:27,jcXXXhj,waXYZXXc35@gXXil.com JBu975,11/16/2023 17:58:26,mrXXXhg,emXYZXXo@hotXXil.com 3uKcxu,11/16/2023 17:58:26,5fXXXee,g9XYZXX61@gmXXl.com ``` 使用 `head` 指令列出頭幾行,預覽欄位及資料格式長相。確認真正需要的資料是哪幾個欄位。 ### Step 3 ``` $ awk -F "\"*,\"*" '{print $4}' some_large_illed_file.csv wyXYZXX3@gmaXX.com evXYZXX85@gmXXl.com luoXYZXX113@gXXil.com alXYZXXas200XXyahoo.com.tw abXYZXX000@gXXil.com ivXYZXX874@gXXil.com waXYZXXc35@gXXil.com emXYZXXo@hotXXil.com g9XYZXX61@gmXXl.com ``` 善用 `awk` 指令,取出需要的欄位就好。 ``` $ awk -F "\"*,\"*" '{print $4}' some_large_illed_file.csv >> import.csv ``` ### Step 4 ```sql -- temp CREATE TABLE `temp_AXXXS_USXR_LIST` ( `Email` varchar(255) DEFAULT '', `user_id` int(11) NOT NULL DEFAULT 0, KEY `Email` (`Email`), KEY `user_id` (`user_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci; -- 以 'import.csv' 匯入處理清單 -- preview SELECT temp.*, users.`email`, users.`id` FROM `temp_AXXXS_USXR_LIST` AS temp JOIN `UXXRS` AS users ON temp.`Email` = users.`email`; -- cleaning UPDATE `temp_AXXXS_USXR_LIST` AS temp JOIN `USERS` AS users ON temp.`Email` = users.`email` SET temp.`user_id` = users.`id`; -- 對應得到會員帳號 e-mail 則填上 user_id -- check: 對應不到帳號的資料筆數 SELECT temp.`user_id` FROM `temp_AXXXS_USXR_LIST` AS temp WHERE temp.`user_id` = 0 ``` ### Step 5 ```sql SELECT DISTINCT temp.`user_id` FROM `temp_AXXXS_USXR_LIST` AS temp WHERE temp.`user_id` > 0 ORDER BY temp.`user_id` LIMIT 0, 30000; ``` 分批撈取名單。(不要挑戰小鈴鐺極限 & 請撈不重複資料) ```shell= #!/bin/bash uid=( 123 6666 9487 # ... decoded uid ) for id in ${uid[*]}; do echo "$id, \c" done ``` (分批輸出成需求格式字串,設定小鈴鐺通知) ### Step 6 ``` $ sh uid_concat.sh >> 20231117-notification-uid-part-6.txt ``` ``` $ cat 20231117-notification-uid-part-N.txt ......, XXXXXX673, XXXXXX674, XXXXXX694, XXXXXX700, ``` (留意逗號後面是否留一個半形空格 & 結尾的逗號記得要再去掉)