# 實驗二:NYC計程車大數據分析 ### 主講人:黃夙賢 --- ## NYC計程車大數據 - 請根據NYC計程車數據,進行大數據資料分析 ![](https://i.imgur.com/XD6Ljo5.png =800x) --- ## 資料來源 - [https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page ](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page ) ![](https://i.imgur.com/O3ix5nZ.png =360x) --- ## Apache PARQUET - 單欄式檔案格式,可提供優化來加速查詢 - 比CSV更有效率的檔案格式 - 可透過python pandas套件進行格式轉換成csv ![](https://i.imgur.com/duYDD0w.png) --- ## 請回答下列問題 - Q1: 共有幾筆資料? - Q2: 最熱門的載客地點為何? - Q3: 最熱門下車地點為何? - Q4: 幾點最容易搭到計程車? - Q5: 距離和小費有否有關係? - Q6: 平均在計程車上花的時間? --- ## NYC計程車資料檔 - wget https://github.com/shhuangmust/bigdata/raw/main/taxi.sh - sh taxi.sh - hadoop fs –put yellow202206.csv /user/mapr - hadoop fs –ls --- ![](https://i.imgur.com/FsiohmG.png) ![](https://i.imgur.com/oLYJLp4.png) ![](https://i.imgur.com/TuhFAOh.png) --- ## 新增表格 ``` CREATE TABLE taxi ( line STRING, VendorID STRING, tpep_pickup_datetime TIMESTAMP, tpep_dropoff_datetime TIMESTAMP, passenger_count SMALLINT, trip_distance DOUBLE, RatecodeID SMALLINT, store_and_fwd_flag STRING, PULocationID STRING, DOLocationID STRING, payment_type STRING, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, congestion_surcharge DOUBLE, airport_fee DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' tblproperties("skip.header.line.count"="2"); ``` --- ![](https://i.imgur.com/NGUB8JN.png) --- ## 讀入資料 - LOAD DATA INPATH '/user/mapr/yellow202206.csv' OVERWRITE INTO TABLE taxi; ![](https://i.imgur.com/c4Auuoi.png) --- ## Q1:共有幾筆資料 - select count(*) from taxi; - Ans: 3558123筆資料 ![](https://i.imgur.com/nyXPxM4.png) --- ## Q2:最熱門的載客地點為何? - select pulocationid, count(pulocationid) as pucount from taxi group by pulocationid order by pucount desc; - Ans: 最熱門的載客地點為237(Manhattan區Upper East Side South) - (上下車地點代號對應位於 https://s3.amazonaws.com/nyc-tlc/misc/taxi+\_zone\_lookup.csv) --- ![](https://i.imgur.com/1RWBty2.png) --- ## Q3: 最熱門下車地點為何? - select dolocationid, count(dolocationid) as docount from taxi group by dolocationid order by docount desc; - Ans: 最熱門下車地點為236 (Manhattan區Upper East Side North) --- ![](https://i.imgur.com/ZbQONUo.png) --- ## Q4: 幾點最容易搭到計程車? - select puhour, count(*) avg0 from (select hour(taxi.tpep\_pickup\_datetime) puhour, * from taxi ) x group by puhour order by avg0; - Ans:早上4點搭車人數最少,較有可能搭到計程車 --- ![](https://i.imgur.com/07MBFbL.png) --- ## Q5: 距離和小費有否有關係? - select trip\_distance, avg(tip\_amount) avg0 from taxi group by trip_distance order by avg0 desc; - Ans: 最高小費1400.16,距離為29.92英里 - 距離與小費似乎無關 --- ![](https://i.imgur.com/vb7qfOD.png) --- ## Q6: 平均在計程車上花的時間? - select avg(unix\_timestamp(taxi.tpep\_dropoff\_datetime)-unix\_timestamp(taxi.tpep\_pickup\_datetime))/3600 from taxi; - Ans: 7.02分鐘 --- ![](https://i.imgur.com/jb1KtVL.png)
{"metaMigratedAt":"2023-06-17T10:19:44.159Z","metaMigratedFrom":"YAML","title":"實驗二:NYC計程車大數據分析","breaks":true,"contributors":"[{\"id\":\"ef0225b9-6c2a-4012-82c9-fa1031d2c4db\",\"add\":5265,\"del\":2023}]","description":"請根據NYC計程車數據,進行大數據資料分析"}
    315 views