# CSV 大檔案資料提取 & 匯入 Ep.1
> 2023/11/17 Fri.
###### tags: `shell` `懶人包` `SQL` `資料清潔` `公開筆記`
## 案例: 篩選 E-mail 並查詢對應的 decoded User ID
> 需求: 透過小鈴鐺發送系統通知訊息給活躍會員
### Step 1
```
$ wc -l some_large_illed_file.csv
158372 some_large_illed_file.csv
```
使用指令 `wc` 確認行數。看到 15 萬行!就請不要想嘗試用 Sublime Text 打開檔案了!!!
### Step 2
```
$ head some_large_illed_file.csv
pid,Last Action Time,user_id,Email
ENe5EK,11/17/2023 05:22:22,inXXXqj,wyXYZXX3@gmaXX.com
GRENXQ,11/16/2023 17:58:30,ijXXXgm,evXYZXX85@gmXXl.com
Bo754,11/16/2023 17:58:30,cikXXXb,luoXYZXX113@gXXil.com
4wBxo5,11/16/2023 17:58:29,92XXX27,alXYZXXas200XXyahoo.com.tw
EfCqgN,11/16/2023 17:58:29,79XXX3a,abXYZXX000@gXXil.com
Fmpskt,11/16/2023 17:58:28,eaXXX68,ivXYZXX874@gXXil.com
9DvmvB,11/16/2023 17:58:27,jcXXXhj,waXYZXXc35@gXXil.com
JBu975,11/16/2023 17:58:26,mrXXXhg,emXYZXXo@hotXXil.com
3uKcxu,11/16/2023 17:58:26,5fXXXee,g9XYZXX61@gmXXl.com
```
使用 `head` 指令列出頭幾行,預覽欄位及資料格式長相。確認真正需要的資料是哪幾個欄位。
### Step 3
```
$ awk -F "\"*,\"*" '{print $4}' some_large_illed_file.csv
wyXYZXX3@gmaXX.com
evXYZXX85@gmXXl.com
luoXYZXX113@gXXil.com
alXYZXXas200XXyahoo.com.tw
abXYZXX000@gXXil.com
ivXYZXX874@gXXil.com
waXYZXXc35@gXXil.com
emXYZXXo@hotXXil.com
g9XYZXX61@gmXXl.com
```
善用 `awk` 指令,取出需要的欄位就好。
```
$ awk -F "\"*,\"*" '{print $4}' some_large_illed_file.csv >> import.csv
```
### Step 4
```sql
-- temp
CREATE TABLE `temp_AXXXS_USXR_LIST` (
`Email` varchar(255) DEFAULT '',
`user_id` int(11) NOT NULL DEFAULT 0,
KEY `Email` (`Email`),
KEY `user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
-- 以 'import.csv' 匯入處理清單
-- preview
SELECT
temp.*,
users.`email`,
users.`id`
FROM `temp_AXXXS_USXR_LIST` AS temp
JOIN `UXXRS` AS users ON temp.`Email` = users.`email`;
-- cleaning
UPDATE `temp_AXXXS_USXR_LIST` AS temp
JOIN `USERS` AS users ON temp.`Email` = users.`email`
SET temp.`user_id` = users.`id`;
-- 對應得到會員帳號 e-mail 則填上 user_id
-- check: 對應不到帳號的資料筆數
SELECT
temp.`user_id`
FROM `temp_AXXXS_USXR_LIST` AS temp
WHERE temp.`user_id` = 0
```
### Step 5
```sql
SELECT DISTINCT
temp.`user_id`
FROM `temp_AXXXS_USXR_LIST` AS temp
WHERE temp.`user_id` > 0
ORDER BY temp.`user_id`
LIMIT 0, 30000;
```
分批撈取名單。(不要挑戰小鈴鐺極限 & 請撈不重複資料)
```shell=
#!/bin/bash
uid=(
123
6666
9487
# ... decoded uid
)
for id in ${uid[*]};
do
echo "$id, \c"
done
```
(分批輸出成需求格式字串,設定小鈴鐺通知)
### Step 6
```
$ sh uid_concat.sh >> 20231117-notification-uid-part-6.txt
```
```
$ cat 20231117-notification-uid-part-N.txt
......, XXXXXX673, XXXXXX674, XXXXXX694, XXXXXX700,
```
(留意逗號後面是否留一個半形空格 & 結尾的逗號記得要再去掉)