# 實驗二:NYC計程車大數據分析
### 主講人:黃夙賢
---
## NYC計程車大數據
- 請根據NYC計程車數據,進行大數據資料分析

---
## 資料來源
- [https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
)

---
## Apache PARQUET
- 單欄式檔案格式,可提供優化來加速查詢
- 比CSV更有效率的檔案格式
- 可透過python pandas套件進行格式轉換成csv

---
## 請回答下列問題
- Q1: 共有幾筆資料?
- Q2: 最熱門的載客地點為何?
- Q3: 最熱門下車地點為何?
- Q4: 幾點最容易搭到計程車?
- Q5: 距離和小費有否有關係?
- Q6: 平均在計程車上花的時間?
---
## NYC計程車資料檔
- wget https://github.com/shhuangmust/bigdata/raw/main/taxi.sh
- sh taxi.sh
- hadoop fs –put yellow202206.csv /user/mapr
- hadoop fs –ls
---



---
## 新增表格
```
CREATE TABLE taxi (
line STRING,
VendorID STRING,
tpep_pickup_datetime TIMESTAMP,
tpep_dropoff_datetime TIMESTAMP,
passenger_count SMALLINT,
trip_distance DOUBLE,
RatecodeID SMALLINT,
store_and_fwd_flag STRING,
PULocationID STRING,
DOLocationID STRING,
payment_type STRING,
fare_amount DOUBLE,
extra DOUBLE,
mta_tax DOUBLE,
tip_amount DOUBLE,
tolls_amount DOUBLE,
improvement_surcharge DOUBLE,
total_amount DOUBLE,
congestion_surcharge DOUBLE,
airport_fee DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
tblproperties("skip.header.line.count"="2");
```
---

---
## 讀入資料
- LOAD DATA INPATH '/user/mapr/yellow202206.csv' OVERWRITE INTO TABLE taxi;

---
## Q1:共有幾筆資料
- select count(*) from taxi;
- Ans: 3558123筆資料

---
## Q2:最熱門的載客地點為何?
- select pulocationid, count(pulocationid) as pucount from taxi
group by pulocationid
order by pucount desc;
- Ans: 最熱門的載客地點為237(Manhattan區Upper East Side South)
- (上下車地點代號對應位於 https://s3.amazonaws.com/nyc-tlc/misc/taxi+\_zone\_lookup.csv)
---

---
## Q3: 最熱門下車地點為何?
- select dolocationid, count(dolocationid) as docount from taxi
group by dolocationid
order by docount desc;
- Ans: 最熱門下車地點為236 (Manhattan區Upper East Side North)
---

---
## Q4: 幾點最容易搭到計程車?
- select puhour, count(*) avg0 from (select hour(taxi.tpep\_pickup\_datetime) puhour, * from taxi ) x
group by puhour
order by avg0;
- Ans:早上4點搭車人數最少,較有可能搭到計程車
---

---
## Q5: 距離和小費有否有關係?
- select trip\_distance, avg(tip\_amount) avg0 from taxi
group by trip_distance
order by avg0 desc;
- Ans: 最高小費1400.16,距離為29.92英里
- 距離與小費似乎無關
---

---
## Q6: 平均在計程車上花的時間?
- select avg(unix\_timestamp(taxi.tpep\_dropoff\_datetime)-unix\_timestamp(taxi.tpep\_pickup\_datetime))/3600 from taxi;
- Ans: 7.02分鐘
---

{"metaMigratedAt":"2023-06-17T10:19:44.159Z","metaMigratedFrom":"YAML","title":"實驗二:NYC計程車大數據分析","breaks":true,"contributors":"[{\"id\":\"ef0225b9-6c2a-4012-82c9-fa1031d2c4db\",\"add\":5265,\"del\":2023}]","description":"請根據NYC計程車數據,進行大數據資料分析"}