* * * <h1 id="test">練習題匯總</h1> <h2 id="one">第一週練習</h2> 載入資料 ``` data = LOAD '/dataset/pig04/twmask.csv' USING PigStorage(',') AS( code : chararray, name : chararray, address : chararray, tel : chararray, adult_mask : int, child_mask : int, update_time : chararray ); ``` - 請找出桃園市的藥局 - bash > cat twmask.csv | grep -e '桃園市' ``` .. 5932124265,弘宗藥局,桃園市觀音區中山路15號,(03)4735659,2600,66,2020/10/31 00:47:59 5932124345,甘泉藥局,桃園市觀音區中正路242號,(03)4738659,27,0,2020/10/31 00:47:59 5932124372,福星藥局,桃園市觀音區大觀路二段213號1樓,(03)4389138,2970,58,2020/10/31 00:47:59 5932124381,日昇藥局,桃園市觀音區成功路一段781號,(03)4834212,2538,460,2020/10/31 00:47:59 5932130049,德怡藥局,桃園市復興區澤仁里忠孝路34號,(03)3821686,624,638,2020/10/31 00:47:59 ``` - pig >> gp_counties = GROUP data BY SUBSTRING($2,0,3); >> filter_data = FILTER data BY SUBSTRING($2,0,3) == '桃園市'; >> dump filter_data ``` .. (5932124265,弘宗藥局,桃園市觀音區中山路15號,(03)4735659,2600,66,2020/10/31 00:47:59) (5932124345,甘泉藥局,桃園市觀音區中正路242號,(03)4738659,27,0,2020/10/31 00:47:59) (5932124372,福星藥局,桃園市觀音區大觀路二段213號1樓,(03)4389138,2970,58,2020/10/31 00:47:59) (5932124381,日昇藥局,桃園市觀音區成功路一段781號,(03)4834212,2538,460,2020/10/31 00:47:59) (5932130049,德怡藥局,桃園市復興區澤仁里忠孝路34號,(03)3821686,624,638,2020/10/31 00:47:59) ``` - 請問臺北市與新北市的藥局哪個多, 分別有幾間? - bash > cat twmask.csv | grep -e '臺北市' | wc -l > cat twmask.csv | grep -e '新北市' | wc -l - pig >> taipei = FOREACH data GENERATE address; >> taipei1 = FILTER taipei BY $0 matches '臺北市.*' or $0 matches '新北市.*'; >> taipei2 = FOREACH taipei1 GENERATE $0, SUBSTRING($0,0,3); >> taipei3 = GROUP taipei2 BY $1; >> taipei4 = FOREACH taipei3 GENERATE COUNT($1),$0; >> dump taipei4 ``` (970,新北市) (614,臺北市) ``` - 請問臺北市大安區的藥局總共有幾間 ? - bash > cat twmask.csv | grep -e '臺北市大安區' | wc -l ``` 74 ``` - pig >> taipei = FOREACH data GENERATE address; >> taipei1 = FILTER taipei BY $0 matches '臺北市大安區.*'; >> taipei2 = FOREACH taipei1 GENERATE $0, SUBSTRING($0,0,3); >> taipei3 = GROUP taipei2 BY $1; >> taipei4 = FOREACH taipei3 GENERATE COUNT($1),$0; >> dump taipei4 ``` (74,臺北市) ``` - 請找出藥局的名稱,縣市,地區與聯絡電話 - bash > cat twmask.csv | cut -d , -f 2,3,4,8 ``` 大森藥局,金門縣金城鎮民生路28、30號1、2樓,(82)325100 百泰藥局,金門縣金城鎮西海路1段1號,(082)312832 大金藥局,金門縣金沙鎮汶沙里五福街1號,(082)355382 仁愛復興藥局,金門縣金湖鎮新市里復興路40號,(082)332368 大山藥局,金門縣金湖鎮新市里中正路2號、2-2號,(082)333290 ``` - pig >> dump data ``` (5990010631,大森藥局,金門縣金城鎮民生路28、30號1、2樓,(82)325100,3132,800,2020/10/31 00:47:59) (5990010668,百泰藥局,金門縣金城鎮西海路1段1號,(082)312832,2706,860,2020/10/31 00:47:59) (5990020020,大金藥局,金門縣金沙鎮汶沙里五福街1號,(082)355382,1618,1100,2020/10/31 00:47:59) (5990030044,仁愛復興藥局,金門縣金湖鎮新市里復興路40號,(082)332368,1234,248,2020/10/31 00:47:59) (5990030062,大山藥局,金門縣金湖鎮新市里中正路2號、2-2號,(082)333290,3083,279,2020/10/31 00:47:59) ``` - 接續上題, 將結果儲存至 ~/wk/data 目錄中的 mydata.csv中 - bash > mkdir wk > cd wk/ > mkdir data > cd .. > mv twmask.csv wk/data/mydata.csv > cd wk/data > ll ``` total 648 drwxrwxr-x 2 user29 user29 4096 十一 4 16:43 ./ drwxrwxr-x 3 user29 user29 4096 十一 4 16:43 ../ -rw-r--r-- 1 user29 user29 655255 十一 2 02:25 mydata.csv ``` * * * <h2 id="two">第二週練習</h2> 載入資料 ``` data = LOAD '/dataset/pig04/twmask.csv' USING PigStorage(',') AS( code : chararray, name : chararray, address : chararray, tel : chararray, adult_mask : int, child_mask : int, update_time : chararray ); ``` - 找出臺北市兒童剩餘口罩剩餘最多藥局的前五個藥局 - bash > cat twmask.csv | sort -t',' -k6 | cut -d',' -f2 | head -n 5 ``` 新竹縣新豐鄉衛生所 宜蘭縣頭城鎮衛生所 苗栗縣公館鄉衛生所 松田藥局 阿里藥局 ``` - 續上題,請問兒童口套數量第50名的藥局名稱? - bash > cat twmask.csv | sort -t',' -k6 | cut -d',' -f2 | head -n 50 | cat -n ``` .. 46 大賀藥局 47 幸一藥局 48 京元藥局 49 楊梅丁丁藥局 50 大吉藥局 ``` - 找出臺北市藥局的成人口罩數量並存成新的檔案 - bash > cat twmask.csv | sort -r -t',' -k5 | cut -d',' -f2,3,5 > twmasknew.csv > ll ``` .. -rw-r--r-- 1 user29 user29 655255 十一 2 02:25 twmask.csv -rw-rw-r-- 1 user29 user29 380811 十一 2 23:06 twmasknew.csv ``` - 請問哪一間藥局擁有最多的兒童口罩 ? 藥局代號與名稱 ? 擁有多少兒童口罩 ? 地址與聯絡方式 ? - pig >> child = FOREACH data GENERATE $0,$1,$2,$3,$5; >> child_desc = ORDER child BY $4 DESC; >> child_desc_1 = LIMIT child_desc 1; >> Dump child_desc_1; ``` (5935011998,康安新美藥局,苗栗縣苗栗市中正路739號1樓,(037)373089,3820) ``` - 請將桃園縣也替換為桃園市, 並且再次統計全台灣各縣市的醫事機構 - pig >> data =load '/dataset/pig04/twmask.csv' USING PigStorage (',') AS (code: chararray,name:chararray,address:chararray,tel:chararray,smask:int,rmask:int,time:chararray); >> counties = FOREACH data GENERATE $1,SUBSTRING($2,0,3),$4; >> counties_replace = FOREACH counties GENERATE $0,REPLACE($1, '桃園縣','桃園市') ,$2; >> counties_gp = GROUP counties_replace BY $1; >> counties_count = FOREACH counties_gp GENERATE $0,COUNT($1); >> DUMP counties_count; ``` (南投縣,104) (嘉義市,83) (嘉義縣,132) (基隆市,99) (宜蘭縣,122) (屏東縣,219) (彰化縣,318) (新北市,970) (新竹市,87) (新竹縣,95) (桃園市,472) (澎湖縣,24) (臺中市,705) (臺中縣,2) (臺北市,614) (臺南市,516) (臺東縣,53) (花蓮縣,92) (苗栗縣,128) (連江縣,5) (金門縣,14) (雲林縣,214) (高雄市,598) ``` * * * <h2 id="three">第三週練習</h2> 載入資料 ``` data = LOAD '/dataset/pig04/twmask.csv' USING PigStorage(',') AS( code : chararray, name : chararray, address : chararray, tel : chararray, adult_mask : int, child_mask : int, update_time : chararray ); ``` - 找出臺北市的兒童口罩數量前三名 - 使用 address (地址) 找出縣市名稱並以此來分組 >> gp_counties = GROUP data BY SUBSTRING($2,0,3); - 使用 SUM 函數來統計每個群組(縣市)中的兒童口罩數量 >> sum_child = FOREACH gp_counties GENERATE $0,SUM($1.$5); - 使用 ORDER 來排序口罩數量 (降冪,倒排序,由大到小) >> order_child = ORDER sum_child BY $1 DESC; - 使用 LIMIT 列出前三名 >> top3_child = LIMIT order_child 3; >> dump top3_child; ``` (新北市,566872) (臺中市,424988) (高雄市,351554) ``` - 找出臺北市的健康服務中心的成人口罩數量前三名 - 使用 FILTER 來找出台北市與健康服務中心 >> filter_data = FILTER data BY SUBSTRING($2,0,3) == '臺北市' AND $1 MATCHES '.*健康服務中心'; - 使用 ORDER 來排序成人口罩數量 >> order_adult = ORDER filter_data BY $4 DESC; - 使用 LIMIT 列出前三名 >> top3_adult = LIMIT order_adult 3; >> dump top3_child; ``` (新北市,566872) (臺中市,424988) (高雄市,351554) ``` - 找出臺北市各區域的兒童口罩數量前三名 - 使用 FILTER 找出台北市的藥局資料 >> filter_data = FILTER data BY SUBSTRING($2,0,3) == '臺北市'; - 使用 address 找出區域並以此來分組 >> gp_area = GROUP filter_data BY SUBSTRING($2,3,6); - 使用 SUM 函數來統計每個群組(區域)中的兒童口罩數量 >> sum_child = FOREACH gp_area GENERATE $0,SUM($1.$5); - 使用 ORDER 來排序兒童口罩數量 >> order_child = ORDER sum_child BY $1 DESC; - 使用 LIMIT 列出前三名 >> top3_child = LIMIT order_child 3; >> dump top3_child; ``` (大安區,42303) (中山區,41101) (士林區,37932) ``` * * * <h2 id="four">第四週練習</h2> 載入資料 ``` data = LOAD '/dataset/pig04/twmask.csv' USING PigStorage(',') AS( code : chararray, name : chararray, address : chararray, tel : chararray, adult_mask : int, child_mask : int, update_time : chararray ); ``` - 臺北市地區的成人與兒童口罩的總和最多的前三個, 各幾個 ? - pig >> data2 = FOREACH data GENERATE $1, SUBSTRING($2,0,6), $4, $5; >> total_mask = FOREACH data2 GENERATE $0, $1, $2+$3; >> filter_mask = Filter total_mask BY $1 matches '臺北市.*'; >> group_mask = GROUP filter_mask BY $1; >> sum_mask = FOREACH group_mask GENERATE $0, SUM($1.$2); >> order_total = ORDER sum_mask BY $1 DESC; >> top3_child = LIMIT order_total 3; >> dump top3_child; ``` (臺北市大安區,216921) (臺北市中山區,215776) (臺北市士林區,198239) ``` - 找出臺北市的兒童口罩平均最多的前三個區域, 各幾個 ? - pig >> data2 = FOREACH data GENERATE $1, SUBSTRING($2,0,6), $5; >> filter_mask = Filter total_mask BY $1 matches '臺北市.*'; >> group_mask = GROUP filter_mask BY $1; >> avg_mask = FOREACH group_mask GENERATE $0, AVG($1.$2); >> order_total = ORDER avg_mask BY $1 DESC; >> top3_child = LIMIT order_total 3; >> dump top3_child; ``` (臺北市中正區,3478.25) (臺北市內湖區,3460.04) (臺北市萬華區,3278.6666666666665) ``` - 請找出全台灣人口最多的縣市 - pig >> pop_data = LOAD '/dataset/pig04/population.csv' USING PigStorage(',') AS (Area:chararray, People:int, Land_area:double, Density:double); >> pop_data2 = FOREACH pop_data GENERATE SUBSTRING($0,0,3),$1; >> gp_pop_data2 = GROUP pop_data2 BY$0; >> counties_people = FOREACH gp_pop_data2 GENERATE $0, SUM($1.$1); >> desc_counties = ORDER counties_people BY $1 DESC; >> limit_counties = LIMIT desc_counties 1; >> dump limit_counties; ``` (新北市,4018696) ``` - 請找出臺北市各區域的口罩與人口的比例 - pig >> pop_data = LOAD '/dataset/pig04/population.csv' USING PigStorage(',') AS (Area:chararray, People:int, Land_area:double, Density:double); >> pop_data2 = FOREACH pop_data GENERATE SUBSTRING($0,0,6),$1; >> filter_pop_data2 = FILTER pop_data2 BY $0 matches '臺北市.*'; >> gp_pop_data2 = GROUP filter_pop_data2 BY $0; >> counties_people = FOREACH gp_pop_data2 GENERATE $0, SUM($1.$1); >> data =load '/dataset/pig04/twmask.csv' USING PigStorage (',') AS (code: chararray,name:chararray,address:chararray,tel:chararray,adult_mask:int,child_mask:int,update_time:chararray); >> adult_child = FOREACH data GENERATE SUBSTRING(address,0,6) AS area, adult_mask + child_mask AS adult_child_mask; >> filter_adult_child = FILTER adult_child BY $0 matches '臺北市.*'; >> gp_adult_child = GROUP filter_adult_child BY area; >> sum_adult_child = FOREACH gp_adult_child GENERATE $0,SUM($1.$1); >> join1 = JOIN counties_people BY $0, sum_adult_child BY $0; >> pop_mask = FOREACH join1 GENERATE $0 AS counties ,$1 AS people ,$3 AS mask; >> mask_per_people = FOREACH pop_mask GENERATE counties, (double)mask / (double)people; >> mask_per_people_round = FOREACH mask_per_people GENERATE $0,ROUND_TO($1,2); >> dump mask_per_people_round; ``` (臺北市中山區,0.95) (臺北市中正區,1.06) (臺北市信義區,0.78) (臺北市內湖區,0.61) (臺北市北投區,0.56) (臺北市南港區,0.7) (臺北市士林區,0.7) (臺北市大同區,0.64) (臺北市大安區,0.71) (臺北市文山區,0.56) (臺北市松山區,0.86) (臺北市萬華區,0.74) ``` - 請問口罩分配最公平的前三個縣市是 ? - pig >> pop_mask = FOREACH join1 GENERATE $0 AS counties ,$1 AS people ,$3 AS mask; >> data =load '/dataset/pig04/twmask.csv' USING PigStorage (',') AS (code: chararray,name:chararray,address:chararray,tel:chararray,adult_mask:int,child_mask:int,update_time:chararray); >> pop_data2 = FOREACH pop_data GENERATE SUBSTRING($0,0,6),$1; >> filter_pop_data2 = FILTER pop_data2 BY $0 matches '臺北市.*'; >> gp_pop_data2 = GROUP filter_pop_data2 BY $0; >> counties_people = FOREACH gp_pop_data2 GENERATE $0, SUM($1.$1); >> adult_child = FOREACH data GENERATE SUBSTRING(address,0,6) AS area, adult_mask + child_mask AS adult_child_mask; >> filter_adult_child = FILTER adult_child BY $0 matches '臺北市.*'; >> gp_adult_child = GROUP filter_adult_child BY area; >> sum_adult_child = FOREACH gp_adult_child GENERATE $0,SUM($1.$1); >> join1 = JOIN counties_people BY $0, sum_adult_child BY $0; >> pop_mask = FOREACH join1 GENERATE $0 AS counties ,$1 AS people ,$3 AS mask;mask_per_people = FOREACH pop_mask GENERATE counties, (double)mask / (double)people; >> mask_per_people_round = FOREACH mask_per_people GENERATE $0,ROUND_TO($1,2); >> dump mask_per_people_round; ``` (臺北市中山區,0.95) (臺北市中正區,1.06) (臺北市信義區,0.78) (臺北市內湖區,0.61) (臺北市北投區,0.56) (臺北市南港區,0.7) (臺北市士林區,0.7) (臺北市大同區,0.64) (臺北市大安區,0.71) (臺北市文山區,0.56) (臺北市松山區,0.86) (臺北市萬華區,0.74) ``` * * *