# pig
# 20445 -- 替代役役男訓練人數統計表
- HDFS 完整路徑 `/dataset/20445/20445.csv`
```
欄位名稱
年度 year:int
梯次 phase:chararray
替代役役男訓練人數 trainees:int
原始資料集內容
year phase trainees
100,91,1079
100,92,969
100,92_2,122
```
- 題目 1-1. 哪一年的總替代役人數最多?
- 輸出格式: 年分, 人數
- 答案範例: 107, 87
>> data = load '/dataset/20445/20445.csv' USING PigStorage (',') AS (year:int,phase:chararray,trainees:int);
>> a1 = GROUP data BY year;
>> a2 = FOREACH a1 GENERATE group, SUM(data.trainees);
>> a3 = ORDER a2 BY $1 DESC;
>> a4 = LIMIT a3 1;
>> dump a4;
```
(103,37210)
```
- 題目 1-2. 哪一年的梯次最多?
- 輸出格式: 年分, 數量
- 答案範例: 107, 87
>> data = load '/dataset/20445/20445.csv' USING PigStorage (',') AS (year:int,phase:chararray,trainees:int);
>> a1 = GROUP data BY year;
>> a2 = FOREACH a1 GENERATE group, COUNT(data);
>> a3 = ORDER a2 BY $1 DESC;
>> a4 = LIMIT a3 1;
>> dump a4;
```
(104,23)
```
---
# 5958 -- 各縣(市)警察(分)局暨所屬分駐(派出)所地址資料
- HDFS 完整路徑 `/dataset/5958/5958.csv`
```
欄位名稱
名稱 name:chararray
郵遞區號 zip_code:chararray
地址 address:chararray
原始資料集內容
name zip_code address
臺北市政府警察局,100,臺北市中正區延平南路96號
中山分局,104,臺北市中山區中山北路2段1號
中山一派出所,104,臺北市中山區中山北路1段110號
```
- 題目 2-1. 最多派出所的郵遞區號是幾號?
- 輸出格式: 郵遞區號, 數量
- 答案範例: 100, 87
>> data = load '/dataset/5958/5958.csv' USING PigStorage (',') AS (name:chararray,code:chararray,address:chararray);
>> a1 = FILTER data BY name MATCHES '.*派出所';
>> a2 = GROUP a1 BY zip_code;
>> a3 = FOREACH a2 GENERATE group, COUNT($1);
>> a4 = ORDER a3 BY $1 DESC;
>> a5 = LIMIT a4 1;
>> dump a5;
```
(546,21)
```
- 題目 2-2. 哪個縣市有最多派出所?
>> data = load '/dataset/5958/5958.csv' USING PigStorage (',') AS (name:chararray,code:chararray,address:chararray);
>> a0 = FILTER data BY name MATCHES '.*派出所';
>> a1 = GROUP a0 BY SUBSTRING($2,0,3);
>> a2 = FOREACH a1 GENERATE group, COUNT($1);
>> a3 = ORDER a2 BY $1 DESC;
>> a4 = LIMIT a3 1;
>> dump a4;
```
(新北市,145)
```
---
# 6086 -- 107 年幼兒園名錄
- HDFS 完整路徑 `/dataset/6086/6086.csv`
```
欄位名稱
代碼 code:chararray
學校名稱 school:chararray
縣市名稱 city:chararray
地址 address:chararray
電話 phone:chararray
原始資料集內容
code school city address phone
011K02,新北市私立溫特爾幼兒園,新北市,新北市三峽區龍埔里5鄰三樹路336號1、2、3樓,(02)26718181
011K03,新北市私立新榮富新莊幼兒園,新北市,新北市新莊區中信里13鄰中和街204巷1、3號,218巷2、6號,中信里12鄰中和街1、3、5、7、11號,(02)29922174
011K04,新北市私立崇儒幼兒園,新北市,新北市中和區安樂里16鄰宜安路56之1號1樓,(02)29435789
```
- 題目 3-1. 幼兒園數量最少的是哪個縣市?
- 輸出格式: 縣市, 數量
- 答案範例: 臺北市, 87
> data = load '/dataset/6086/6086.csv' USING PigStorage (',') AS (code:chararray,school:chararray,city:chararray,address:chararray,phone:chararray);
> a1 = GROUP data BY city;
> a2 = FOREACH a1 GENERATE group, COUNT(data);
> a3 = ORDER a2 BY $1 ASC;
> a4 = LIMIT a3 1;
> dump a4;
```
(連江縣,5)
```
- 題目 3-2. 臺北市與新北市的幼兒園各有多少間? 由數量多到少排序。
- 輸出格式: 縣市, 數量
- 答案範例: 臺北市, 87
> data = load '/dataset/6086/6086.csv' USING PigStorage (',') AS (code:chararray,school:chararray,city:chararray,address:chararray,phone:chararray);
> a0 = FOREACH data GENERATE city;
> a1 = FILTER a0 BY $0 MATCHES '臺北市.*' OR $0 MATCHES '新北市.*';
> a2 = FOREACH a1 GENERATE $0, SUBSTRING($0,0,3);
> a3 = GROUP a2 BY $1;
> a4 = FOREACH a3 GENERATE COUNT($1),$0;
> dump a4
```
(新北市,1134)
(臺北市,689)
```
---
# 6087 -- 107年國民小學名錄
- HDFS 完整路徑 `/dataset/6087/6087.csv`
```
欄位名稱
代碼 code:chararray
學校名稱 school:chararray
縣市名稱 city:chararray
地址 address:chararray
電話 phone:chararray
網址 url:chararray
原始資料集內容
code school city address phone url
011601,私立育才國小,新北市,新北市永和區福和路125巷20號,(02)29214630,http://www.ytes.ntpc.edu.tw
011602,私立聖心國小,新北市,新北市八里區龍米路一段261號,(02)26182330,http://lshes.com/school/
011603,私立及人國小,新北市,新北市永和區文化路172號,(02)29212145,http://www.cjps.ntpc.edu.tw
```
- 題目 4-1. 新北市有多少間私立國小?
- 輸出格式: 數量
- 答案範例: 87
> data = LOAD '/dataset/6087/6087.csv' USING PigStorage(',') AS(code:chararray, school:chararray, city:chararray, address:chararray, phone:chararray, url:chararray
);
> a0 = FILTER data BY $1 MATCHES '私立.*' AND $2 MATCHES '新北市.*';
> a1 = FOREACH a0 GENERATE city;
> a2 = FOREACH a1 GENERATE $0, SUBSTRING($0,0,3);
> a3 = GROUP a2 BY $1;
> a4 = FOREACH a3 GENERATE $0, COUNT($1);
> dump a4
```
(新北市,5)
```
---
# 999003 -- 107年6月行政區分齡兒童及少年性別人口統計_縣市
- HDFS 完整路徑 `/dataset/999003/999003.csv`
```
欄位名稱
縣市代碼 country_id:int
縣市名稱 country:chararray
0-5歲兒童人口數 a0a5_cnt:int
6-11歲兒童人口數 a6a11_cnt:int
12-17歲少年人口數 a12a17_cnt:int
原始資料集內容
country_id country a0a5_cnt a6a11_cnt a12a17_cnt
10002,宜蘭縣,20907,21718,27474
10015,花蓮縣,15380,15338,19695
09020,金門縣,6373,4437,5451
```
- 題目 5-1. 新北市的0~17歲兒童總數有多少個?
- 輸出格式: 數量
- 答案範例: 87
> data = LOAD '/dataset/999003/999003.csv' USING PigStorage(',') AS(country_id:int, country:chararray, a0a5_cnt:int, a6a11_cnt:int, a12a17_cnt:int);
> a1 = FILTER data BY country == '新北市';
> a2 = FOREACH a1 GENERATE $1,$2+$3+$4;
> dump a2
```
(新北市,620357)
```
---
# 12197_A1 -- A1 類交通事故資料
- HDFS 完整路徑 `/dataset/12197_A1/12197_A1.csv`
```
欄位名稱
發生時間 time:chararray
死亡人數 dead:chararray
受傷人數 injured:chararray
車種 vehicle:chararray
原始資料集內容
ctime dead injured vehicle
106年01月01日 02時35分00秒,死亡1,受傷0,普通重型-機車
106年01月01日 02時43分00秒,死亡1,受傷0,普通重型-機車
106年01月01日 03時35分00秒,死亡2,受傷0,普通重型-機車
```
- 題目 6-1. 造成事故最多的車種是哪種車? (造成死亡或受傷皆算事故)
- 輸出格式: 車種, 次數
- 答案範例: 普通重型-機車, 87
> data = LOAD '/dataset/12197_A1/12197_A1.csv' USING PigStorage(',') AS(time:chararray, dead:chararray, injured:chararray, vehicle:chararray);
a1 = FOREACH data GENERATE vehicle, (INT) SUBSTRING($1,2,3), (INT) SUBSTRING($2,2,3);
> a2 = FOREACH a1 GENERATE $0,$1+$2;
> a3 = GROUP a2 BY $0;
> a4 = FOREACH a3 GENERATE $0,COUNT($1.$1);
> a5 = ORDER a4 BY $1 DESC;
> a6 = LIMIT a5 1;
> dump a6
```
(普通重型-機車,513)
```
- 題目 6-2. 發生事故最多的是哪年哪月哪日? (造成死亡或受傷皆算事故)
- 輸出格式: 日期, 次數
- 答案範例: 106年01月01日, 87
> data = LOAD '/dataset/12197_A1/12197_A1.csv' USING PigStorage(',') AS(time:chararray, dead:chararray, injured:chararray, vehicle:chararray);
a1 = FOREACH data GENERATE vehicle, (INT) SUBSTRING($1,2,3), (INT) SUBSTRING($2,2,3);
> a1 = FOREACH data GENERATE $3, SUBSTRING($0,0,10);
> b1 = GROUP a1 BY $1;
> c1 = FOREACH b1 GENERATE $0,COUNT($1.$1);
> 1 = ORDER c1 BY $1 DESC;
> dump d1
```
(106年08月13日,10)
```
- 題目 6-3. 造成死亡總數最高的是哪個車種?
- 輸出格式: 車種, 次數
- 答案範例: 普通重型-機車, 87
> data = LOAD '/dataset/12197_A1/12197_A1.csv' USING PigStorage(',') AS(time:chararray, dead:chararray, injured:chararray, vehicle:chararray);
a1 = FOREACH data GENERATE vehicle, (INT) SUBSTRING($1,2,3), (INT) SUBSTRING($2,2,3);
> a1 = FOREACH data GENERATE $3,(int)SUBSTRING($1,2,3);
> b1 = GROUP a1 BY $0;
> c1 = FOREACH b1 GENERATE $0,SUM($1.$1);
> d1 = ORDER c1 BY $1 DESC;
> dump d1
```
(普通重型-機車,517)
```
---
# 12197_A2 – A2 類交通事故資料
- HDFS 完整路徑 `/dataset/12197_A2/12197_A2.csv`
```
欄位名稱
發生時間 time:chararray
死亡人數 dead:chararray
受傷人數 injured:chararray
車種 vehicle:chararray
原始資料集內容
ctime dead injured vehicle
106年01月01日 00時00分00秒,死亡0,受傷1,自用-小客車
106年01月01日 00時00分00秒,死亡0,受傷2,自用-小客車
106年01月01日 00時03分00秒,死亡0,受傷1,自用-小客車
```
- 題目 7-1. 平均造成受傷數最高的是哪個車種? (造成受傷人數總數/事故總數) 平均一樣時依車種字母大到小排序。
- 輸出格式: 車種, 次數
- 答案範例: 普通重型-機車, 87
> data = load '/dataset/12197_A2/12197_A2.csv' USING PigStorage (',') AS (time:chararray,dead:chararray,injured:chararray,vehicle:chararray);
> a1 = FOREACH data GENERATE $3,(int)SUBSTRING($2,2,3);
> a11 = GROUP a1 BY $0;
> b1 = FOREACH a11 GENERATE $0, COUNT($1.$1),SUM($1.$1);
> b2 = FOREACH b1 GENERATE $0,(float)$2/$1;
> c1 = ORDER b2 BY $1 DESC;
> dump c1
>
```
(普通重型-特種車,2.0)
(大客車-軍車,2.0)
```
---
# 24333 -- 「ATM位置」查詢一覽表
- HDFS 完整路徑 ``/dataset/24333/24333.csv`
```
欄位名稱
裝設金融機構代號 code:chararray
裝設金融機構名稱 name:chararray
裝設地點 location:chararray
裝設縣市 city:chararray
裝設地址 address:chararray
原始資料集內容
code name location city address
004,臺灣銀行,士林分行,台北市,台北市士林區中山北路六段197號
004,臺灣銀行,大同公司,台北市,台北市中山區中山北路三段22號
004,臺灣銀行,大安分行,台北市,台北市大安區敦化南路二段69號1樓
```
- 題目 8-1. 哪間銀行的ATM在台中市有最多台?
- 輸出格式: 銀行名稱, 數量
- 答案範例: 元大商業銀行, 87
> data = load '/dataset/24333/24333.csv' USING PigStorage (',');
> a1 = FILTER data BY $3 == '台中市';
> b2 = GROUP a1 BY $1;
> c1 = FOREACH b2 GENERATE $0,COUNT($1.$1);
> d1 = ORDER c1 BY $1 DESC;
> dump d1
```
A: (中國信託商業銀行,642)
```
- 題目 8-2. 哪個縣市擁有最多"臺灣銀行"的ATM? [注] "台"跟"臺"
- 輸出格式: 縣市, 數量
- 答案範例: 台北市, 87
> data = load '/dataset/24333/24333.csv' USING PigStorage (',');
> a2 = FILTER data BY $1 == '臺灣銀行';
> a3 = FOREACH a2 GENERATE $1,SUBSTRING($4,0,3);
> a1 = GROUP a3 BY $1;
> b2 = FOREACH a1 GENERATE $0,COUNT($1.$1);
> c1 = ORDER b2 BY $1 DESC;
> dump c1
```
A: (台北市,95)
```
---
# 8066 -- 農產品交易行情
- HDFS 完整路徑 ``/dataset/8066/8066.csv`
```
欄位名稱
交易日期 date:chararray
作物代號 crop_code:chararray
作物名稱 crop_name:chararray
市場代號 market_code:chararray
市場名稱 market_name:chararray
上價 upper_price:float
中價 middle_price:float
下價 lower_price:float
平均價 average_price:float
交易量 trading_volume:int
原始資料集內容
cdata crop_code crop_name market_code market_name upper_price middle_price lower_price average_price trading_volume
107.09.11,11,椰子,104,台北二,31,20,15.7,21.3,849
107.09.11,129,椰子-進口剝殼,104,台北二,88.3,72.8,60,73.3,60
107.09.11,31,釋迦,104,台北二,114,73.5,46.8,76.3,3581
```
- 題目 9-1. 全台農產品交易量最高的是哪個市場?
- 輸出格式: 市場名稱, 交易量
- 答案範例: 台北二, 87
data = load '/dataset/8066/8066.csv' USING PigStorage (',')AS
(date:chararray,crop_code:chararray,crop_name:chararray,market_code:chararray,market_name:chararray,upper_price:float,middle_price:float,lower_price:float,average_price:float,trading_volume:int);
a1 = GROUP data BY $4;
b1 = FOREACH a1 GENERATE $0,SUM($1.$9);
c1 = ORDER b1 BY $1 DESC;
dump c1
```
A: (台北一,1676906)
```
- 題目 9-2. 全台交易總量最高的是哪種農產品? </br> ("百香果-其他"與"百香果-改良種"皆算是"百香果",以此類推其他農產品名稱)
- 輸出格式: 農產品名, 交易量
- 答案範例: 椰子, 87
data = load '/dataset/8066/8066.csv' USING PigStorage (',')AS
(date:chararray,crop_code:chararray,crop_name:chararray,market_code:chararray,market_name:chararray,upper_price:float,middle_price:float,lower_price:float,average_price:float,trading_volume:int);
a1 = GROUP data BY $2;
b1 = FOREACH a1 GENERATE $0,SUM($1.$9);
c1 = ORDER b1 BY $1 DESC;
d1 = LIMIT c1 1;
dump d1
```
A: (文旦柚,589354)
```
- 題目 9-3. 平均上價與平均下價平均價差最高的是哪項農產品? </br> ("百香果-其他"與"百香果-改良種"算兩項不同的農產品,以此類推其他農產品名稱)
- 輸出格式: 農產品名, 交易量
- 答案範例: 椰子, 87
data = load '/dataset/8066/8066.csv' USING PigStorage (',')AS
(date:chararray,crop_code:chararray,crop_name:chararray,market_code:chararray,market_name:chararray,upper_price:float,middle_price:float,lower_price:float,average_price:float,trading_volume:int);
a1 = GROUP data BY $2;
b1 = FOREACH a1 GENERATE $0,AVG($1.$5),AVG($1.$7);
c1 = FOREACH b1 GENERATE $0,$1-$2;
c2 = FOREACH c1 GENERATE $0,ROUND_TO()
d1 = ORDER c1 BY $1 DESC;
e1 = LIMIT d1 1;
dump e1
```
A: (葡萄-進口,334.6)
```
---
# 63029 -- 十六縣持卡人前十大國外消費金額及筆數(依簽帳筆數排名)
- HDFS 完整路徑 ``/dataset/63029/63029.csv`
```
欄位名稱
年月 year_month:chararray
國別 country:chararray
城市 tw_city:chararray
金額 amt:int
筆數 count:int
原始資料集內容
year_month country tw_city amt count
2014-01,英國,基隆市,20933333,30063
2014-01,盧森堡,基隆市,2575891,5276
2014-01,日本,基隆市,20238233,3895
```
- 題目 10-1. 2016到2018年國外消費總金額最高的是哪個縣市?
- 輸出格式: 縣市, 金額
- 答案範例: 基隆市, 87
> data = load '/dataset/63029/63029.csv' USING PigStorage (',')AS
(year_month:chararray,country:chararray,tw_city:chararray,amt:int,count:int);
> a0 = FILTER data BY $0 matches '2018.*' or $0 matches '2017.*' or $0 matches '2016.*';
> a1 = GROUP a0 BY $2;
> b1 = FOREACH a1 GENERATE $0,SUM($1.$3);
> c1 = ORDER b1 BY $1 DESC;
> d1 = LIMIT c1 1;
> dump d1;
```
A: (新竹市,10005781918)
```
- 題目 10-2. 哪一個月的國外消費平均金額最高?
- 輸出格式: 月份, 金額
- 答案範例: 一月, 87
> data = load '/dataset/63029/63029.csv' USING PigStorage (',')AS
(year_month:chararray,country:chararray,tw_city:chararray,amt:int,count:int);
> a0 = FOREACH data GENERATE SUBSTRING($0,5,7),$3;
> a1 = GROUP a0 BY $0;
> b1 = FOREACH a1 GENERATE $0,SUM($1.$1)/COUNT($1.$1);
> c1 = ORDER b1 BY $1 DESC;
> d1 = LIMIT c1 1;
> dump d1
```
A: (六月,11158768)
```
[注] 結果為06,我自己將答案改成'六月'
- 題目 10-3. 新竹市2018年在哪個國家的總消費金額最高?
- 輸出格式: 國家, 金額
- 答案範例: 英國, 87
> data = load '/dataset/63029/63029.csv' USING PigStorage (',')AS
(year_month:chararray,country:chararray,tw_city:chararray,amt:int,count:int);
> a1 = FILTER data BY $0 matches '2018.*' and $2 == '新竹市';
> b1 = GROUP a1 BY $1;
> c1 = FOREACH b1 GENERATE $0,SUM($1.$3);
> d1 = ORDER c1 BY $1 DESC;
> e1 = LIMIT d1 1;
> dump e1
```
A: (日本,548333518)
```
---
# 38315 -- 十六縣居民跨縣市消費樣態
- HDFS 完整路徑 ``/dataset/38315/38315.csv`
```
欄位名稱
年月 year:chararray
地區 area:chararray
類別 type:chararray
卡數 cards:int
總交易筆數 total_trans:int
總交易金額[新台幣] total_amt:biginteger
跨縣市交易筆數 inter_city_trans:int
跨縣市交易金額[新台幣] inter_city_amt:biginteger
原始資料集內容
year area type cards total_trans total_amt inter_city_trans inter_city_amt
2014-01,基隆市,ALL,241213,942118,2265105458,714100,1803194168
2014-02,基隆市,ALL,229540,813048,1842401862,641756,1533509566
2014-03,基隆市,ALL,239793,918036,1992896656,722421,1658553554
```
- 題目 11-1. 2018年哪一個縣市的跨縣市交易金額平均最高?
- 輸出格式: 縣市, 金額
- 答案範例: 基隆市, 87
> data = load '/dataset/38315/38315.csv' USING PigStorage (',')AS
(year:chararray,area:chararray,type:chararray,cards:int,total_trans:int,total_amt:biginteger,inter_city_trans:int,inter_city_amt:biginteger);
> a1 = FILTER data BY $0 matches '2018.*';
> b1 = GROUP a1 BY $1;
> c1 = FOREACH b1 GENERATE $0,SUM($1.$7)/SUM($1.$6);
> d1 = ORDER c1 BY $1 DESC;
> e1 = LIMIT d1 5;
> dump e1
```
A: (新竹市,4164)
```
- 題目 11-2. 總交易筆數中平均單筆消費金額最高的是哪個縣市?
- 輸出格式: 縣市, 金額
- 答案範例: 基隆市, 87
> data = load '/dataset/38315/38315.csv' USING PigStorage (',')AS
(year:chararray,area:chararray,type:chararray,cards:int,total_trans:int,total_amt:biginteger,inter_city_trans:int,inter_city_amt:biginteger);
> a1 = GROUP data BY $1;
> b1 = FOREACH a1 GENERATE $0,SUM($1.$5)/SUM($1.$4);
> c1 = ORDER b1 BY $1 DESC;
> d1 = LIMIT c1 5;
> dump d1
```
A: (新竹市,2651)
```
---
# 999001 -- 連線紀錄
- HDFS完整路徑 ``/dataset/999001/999001.tsv`
```
欄位名稱
連線編號 connection:int
連線形式 type:chararray
通訊協定 protocol:chararray
服務程式 service:chararray
時間戳記 timestamp:int
本機位址 local_host:chararray
本機通訊埠 local_port:int
遠端位址 remote_host:chararray
遠端通訊埠 remote_port:int
原始資料集內容
connection type protocol service ctimestamp local _host local _port remote_ host remote_ port
3 accept tcp mssqld 1502196257 192.168.131.6 1433 222.174.114.42 56803
4 accept tcp mssqld 1502196259 192.168.131.6 1433 222.174.114.42 56889
5 accept tcp smbd 1502196284 192.168.131.4 445 218.81.136.101 2671
```
- 題目 12-1. 哪一個本機位址提供的服務程式種類最多?
- 輸出格式: IP, 種類數
- 答案範例: 192.168.131.6, 87
> data = load '/dataset/999001/999001.tsv' USING PigStorage ('\t')AS
(connection:int,type:chararray,protocol:chararray,service:chararray,timestamp:int,local_host:chararray,local_port:int,remote_host:chararray,remote_port:int);
> a1 = FOREACH data GENERATE $5,$3;
> a11 = DISTINCT a1;
> a2 = GROUP a11 BY $0;
> a3 = FOREACH a2 GENERATE $0,COUNT($1.$1);
> a4 = ORDER a3 BY $1 DESC;
> b1 = LIMIT a4 5;
> dump b1
```
A: (192.168.130.1,19)
```
- 題目 12-2. 哪一個服務程式的IP數最多?
- 輸出格式: 服務程式, IP數
- 答案範例: mssqld, 87
> data = load '/dataset/999001/999001.tsv' USING PigStorage ('\t')AS
(connection:int,type:chararray,protocol:chararray,service:chararray,timestamp:int,local_host:chararray,local_port:int,remote_host:chararray,remote_port:int);
> a1 = FOREACH data GENERATE $3,$5;
> b1 = DISTINCT a1;
> c1 = GROUP b1 BY $0;
> d1 = FOREACH c1 GENERATE $0,COUNT($1.$1);
> e1 = ORDER d1 BY $1 DESC;
> dump e1
```
A:
(mysqld,10)
(mssqld,10)
(RtpUdpStream,10)
(SipSession,10)
(Blackhole,10)
(smbd,10)
(Memcache,10)
(httpd,10)
(upnpd,10)
(SipCall,10)
```
- 題目 12-3. 哪一個遠端位址連線次數最多?
- 輸出格式: IP, 次數
- 答案範例: 222.174.114.42, 87
> data = load '/dataset/999001/999001.tsv' USING PigStorage ('\t')AS
(connection:int,type:chararray,protocol:chararray,service:chararray,timestamp:int,local_host:chararray,local_port:int,remote_host:chararray,remote_port:int);
> a1 = FOREACH data GENERATE $7,$4;
> b1 = GROUP a1 BY $0;
> c1 = FOREACH b1 GENERATE $0,COUNT($1.$1);
> d1 = ORDER c1 BY $1 DESC;
> e1 = LIMIT d1 5;
> dump e1
```
A: (35.185.109.141,31564)
```
---
# 999001 -- 連線紀錄
- 題目 13-1. 哪一個網址的 IP:port 被下載檔案最多次?
- 輸出格式: IP:port, 次數
- 答案範例: 118.89.159.61:12347, 87
> data = load '/dataset/999002/999002.tsv' USING PigStorage ('\t')AS
(download:int,connection:int,download_url:chararray,download_md5_hash:chararray);
> a1 = FOREACH data GENERATE $2;
> b1 = FOREACH a1 GENERATE STRSPLIT ($0,':',3);
> c1 = FOREACH b1 GENERATE STRSPLIT ($0.$1,'//',2),STRSPLIT ($0.$2,'/',2);
> d1 = GROUP c1 BY $0;
> e1 = FOREACH d1 GENERATE $0,COUNT($1.$1);
> f1 = ORDER e1 BY $1 DESC;
> g1 = LIMIT f1 5;
> dump g1
```
A: (203.189.234.149:8080,282)
```
---
# JOIN 以下題目資料集 999001 跟 999002 均會使用到
- 題目 14-1. 哪一個遠端位址下載檔案次數最多? </br> (資料集: 999001 & 999002)
- 輸出格式: IP, 次數
- 答案範例: 222.174.114.42, 87
> data1 = load '/dataset/999001/999001.tsv' USING PigStorage ('\t')AS
(connection:int,type:chararray,protocol:chararray,service:chararray,timestamp:int,local_host:chararray,local_port:int,remote_host:chararray,remote_port:int);
data2 = load '/dataset/999002/999002.tsv' USING PigStorage ('\t')AS
(download:int,connection:int,download_url:chararray,download_md5_hash:chararray);
> data2 = JOIN data1 BY $0,data2 BY $1;
> a1 = GROUP data2 BY $7;
> b1 = FOREACH a1 GENERATE $0,COUNT($1.$1);
> c1 = ORDER b1 BY $1 DESC;
> d1 = LIMIT c1 5;
> dump d1;
```
A: (222.186.21.168,50)
```
- 題目 14-2. 哪一個本機位置被下載檔案次數第三多? </br> (資料集: 999001、 999002)
- 輸出格式: IP, 次數
- 答案範例: 222.174.114.42, 87
> data1 = load '/dataset/999001/999001.tsv' USING PigStorage ('\t')AS
(connection:int,type:chararray,protocol:chararray,service:chararray,timestamp:int,local_host:chararray,local_port:int,remote_host:chararray,remote_port:int);
> data2 = load '/dataset/999002/999002.tsv' USING PigStorage ('\t')AS
(download:int,connection:int,download_url:chararray,download_md5_hash:chararray);
> data3 = JOIN data1 BY $0,data2 BY $1;
> a1 = GROUP data3 BY $5;
> b1 = FOREACH a1 GENERATE $0,COUNT($1.$1);
> c1 = ORDER b1 BY $1 DESC;
> d1 = LIMIT c1 5;
> dump d1;
```
A: (192.168.130.3, 87)
```
---
# JOIN 以下題目資料集 5989 跟 24333 均會使用到
- 題目15-1. 哪個縣市的警察局與ATM數量差距最大? </br> [注] "台"跟"臺" </br> (資料集: 5958 & 24333)
- 輸出格式: 縣市, 數量
- 答案範例: 台北市, 87
> data1 = load '/dataset/24333/24333.csv' USING PigStorage (',')AS
(code:chararray,name:chararray,location:chararray,city:chararray,address:chararray);
> data11 = GROUP data1 BY $3;
> data12 = FOREACH data11 GENERATE $0,COUNT($1.$1);
> data13 = FOREACH data12 GENERATE REPLACE($0,'台北市','臺北市'),$1;
> data2 = load '/dataset/5958/5958.csv' USING PigStorage (',')AS
(name:chararray,zip_code:chararray,address:chararray);
> data21 = FOREACH data2 GENERATE SUBSTRING($2,0,3),$0;
> data22 = GROUP data21 BY $0;
> data23 = FOREACH data22 GENERATE $0,COUNT($1.$1);
> data3 = JOIN data13 BY $0,data23 BY $0;
> b1 = FOREACH data3 GENERATE $0,$1/$3;
> c1 = ORDER b1 BY $1 DESC;
> dump c1
```
A: (台北市,37)
```
- 題目 15-2.依內政部行事曆,2018年有最多節日的月份,其節日有哪些? (節日以名稱大到小排序) </br> 十六縣居民該月總消費金額平均是12個月中最高的嗎? (平均為每月平均) </br> 資料集: 26557 & 38315
- 輸出格式: 月份, 節日(不一定一個), 是/否, 金額
- 答案範例: 一月, 是, 87
> data1 = load '/dataset/38315/38315.csv' USING PigStorage (',')AS
(year:chararray,area:chararray,type:chararray,cards:int,total_trans:int,total_amt:biginteger,inter_city_trans:int,inter_city_amt:biginteger);
> e1 = FILTER data1 BY $0 MATCHES '2018-.*';
> f1 = FOREACH e1 GENERATE SUBSTRING($0,5,7),$5;
> g1 = GROUP f1 BY $0;
> h1 = FOREACH g1 GENERATE $0,AVG($1.$1);
> data2 = load '/dataset/26557/26557.csv' USING PigStorage (',')AS
(year:chararray,a:chararray,b:int,c:chararray,d:chararray);
> a1 = FILTER data2 BY $0 MATCHES '2018-.*';
> b1 = FOREACH a1 GENERATE SUBSTRING($0,5,7),$1;
> b2 = DISTINCT b1;
> b3 = ORDER b2 BY $1;
> c1 = GROUP b3 BY $0;
> d1 = FOREACH c1 GENERATE $0,COUNT($1.$1),$1.$1;
> dataJ = JOIN d1 BY $0,h1 BY $0;
> i1 = ORDER dataJ BY $1 DESC;
> dump i1
```
A: (02,3,和平紀念日,農曆除夕,春節,02,1932543150.25)
```
- [注] </br> 原始答案為 (02,3,{(02,和平紀念日),(02,農曆除夕),(02,春節)},02,1932543150.25) 手動刪除不必要的符號與內容