# 使用BigQuery機器學習進行計程車費預測 **內容說明:** 本篇文章是筆者的Google Cloud Self-Pace Lab學習筆記, 所使用的數據與模型由Google Cloud Platform 提供。 ## 分析目標 - 目前,大多數叫車的應用程式都能夠透過機器學習精確估算計程車費用。這是由於其背後採用機器學習模型進行訓練,使得在乘客叫車時能夠預測計程車費用。本篇文章將介紹如何使用BigQuery機器學習模型進行計程車費用的預測。 ## 分析步驟 ### 一、資料探索 1. 進到BigQuery,使用公開資料中的The New York City Yellow Cab資料集。 ![1](https://hackmd.io/_uploads/By2o0HVD6.png) 此資料記錄了2015年紐約計程車的資料,首先,從資料集了解計程車司機在2015年每個月出車多少次? ```sql SELECT TIMESTAMP_TRUNC(pickup_datetime, MONTH) month, COUNT(*) trips FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015` GROUP BY 1 ORDER BY 1 ``` 從以下結果能得知,2015年每個月紐約的計程車出車都超過 1000 萬次,是非常多。 ![2](https://hackmd.io/_uploads/BkDckINwp.png) 2. 接著,看一下計程車的平均速度是多少? ```sql SELECT EXTRACT(HOUR FROM pickup_datetime) hour, ROUND(AVG(trip_distance / TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, SECOND))*3600, 1) speed FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015` WHERE trip_distance > 0 AND fare_amount/trip_distance BETWEEN 2 AND 10 AND dropoff_datetime > pickup_datetime GROUP BY 1 ORDER BY 1 ``` 以下結果顯示在白天,計程車的平均速度約 11-12 英里/小時, 但到了凌晨 5點,平均速度幾乎翻倍,達到 21 英里/小時。 這可能是因為凌晨5點道路上的交通量減少. ![3](https://hackmd.io/_uploads/Hy_lgLEwa.png) ### 二、進行預測計程車費用模型訓練 1. 從資料中抽取特徵值: 選擇以下六個變數作為模型預測的特徵: Tolls Amount(高速公路費)、Fare Amount(里程金額)、Hour of Day (時段)、Pick up address(上車地址)、Drop off address(下車地址)、Number of passengers(乘客人數)。 選擇N/1000作為訓練數據集,在這次練習中,因為資源有限,只能使用1/1000筆資料來訓練模型。 ```sql WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN ) SELECT * FROM taxitrips ``` 結果顯示如下,Total_fare為高速公路費和里程金額加總的計程車費用,也是目標值。 ![4](https://hackmd.io/_uploads/ry4zOINDp.png) 2. 有了訓練資料之後,在BigQuery上創造資料集來存取資料 ![5](https://hackmd.io/_uploads/Skb0O8VDp.png) 3. 開始進行訓練模型,但首先需要選擇適合的模型進行訓練,在模型選擇上,會考慮以下兩點: - 訓練模型的目的是預測計程車費用 - 變數為連續型變數 因此採用線性迴歸進行模型訓練,以下會使用BigQuery 的linear_reg模型,訓練好的模型命名為 taxifare_model ```sql CREATE or REPLACE MODEL taxi.taxifare_model OPTIONS (model_type='linear_reg', labels=['total_fare']) AS WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN ) SELECT * FROM taxitrips ``` ![6](https://hackmd.io/_uploads/HykU_wNva.png) ### 三、評估模型與進行預測 1. 評估模型 評估線性迴歸模型可以使用Root Mean Square Error (RMSE),RMSE是用來評估誤差(即預測值與真實值)的標準差,因此RMSE 越低越好。 ```sql SELECT SQRT(mean_squared_error) AS rmse FROM ML.EVALUATE(MODEL taxi.taxifare_model, ( WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL ) SELECT * FROM taxitrips )) ``` 以下的結果顯示,RMSE(均方根誤差)為+-9.47美元,表示預測值與真實值之間的差異大致在9.47美元的範圍內。為了評估RMSE是否在可接受的範圍,需要與真實值的基準進行比較。 ![15](https://hackmd.io/_uploads/rJzIUDVD6.png) 2. 接下來用訓練好的模型進行預測 ```sql SELECT * FROM ml.PREDICT(MODEL `taxi.taxifare_model`, ( WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat, passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount > 0 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL ) SELECT * FROM taxitrips )); ``` 從以下結果發現,預測值(predicted_total_fare)與真實值(total_fare)的相差較大,例如第一筆,真實值為5美元,但預測值高達11.64美元。因此,需要進行模型調教。 ![8](https://hackmd.io/_uploads/r1ew2IVDp.png) ### 四、模型調教 1.檢查數據集看是否有奇怪的值? ```sql SELECT COUNT(fare_amount) AS num_fares, MIN(fare_amount) AS low_fare, MAX(fare_amount) AS high_fare, AVG(fare_amount) AS avg_fare, STDDEV(fare_amount) AS stddev FROM `nyc-tlc.yellow.trips` ``` 從結果發現竟然有負的計程車費,最高的計程車費竟然到$50,3325,這顯然不合理,可能是在輸入資料時有錯漏,因此把它們視為離群值。 ![9](https://hackmd.io/_uploads/HJ3txPVP6.png) 接下來,為了讓訓練的資料更精準,把車費金額設定在6$-200$之間,並限定車程地區在紐約市範圍,重新抽取訓練資料。 ```sql SELECT COUNT(fare_amount) AS num_fares, MIN(fare_amount) AS low_fare, MAX(fare_amount) AS high_fare, AVG(fare_amount) AS avg_fare, STDDEV(fare_amount) AS stddev FROM `nyc-tlc.yellow.trips` WHERE trip_distance > 0 AND fare_amount BETWEEN 6 and 200 AND pickup_longitude > -75 AND pickup_longitude < -73 AND dropoff_longitude > -75 AND dropoff_longitude < -73 AND pickup_latitude > 40 AND pickup_latitude < 42 AND dropoff_latitude > 40 AND dropoff_latitude < 42 ``` ![10](https://hackmd.io/_uploads/SJMjUPEPT.png) 2.從新訓練模型,並命名新模型為taxifare_model_2 ```sql CREATE OR REPLACE MODEL taxi.taxifare_model_2 OPTIONS (model_type='linear_reg', labels=['total_fare']) AS WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, SQRT(POW((pickup_longitude - dropoff_longitude),2) + POW(( pickup_latitude - dropoff_latitude), 2)) as dist, #Euclidean distance between pickup and drop off SQRT(POW((pickup_longitude - dropoff_longitude),2)) as longitude, #Euclidean distance between pickup and drop off in longitude SQRT(POW((pickup_latitude - dropoff_latitude), 2)) as latitude, #Euclidean distance between pickup and drop off in latitude passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount BETWEEN 6 and 200 AND pickup_longitude > -75 #limiting of the distance the taxis travel out AND pickup_longitude < -73 AND dropoff_longitude > -75 AND dropoff_longitude < -73 AND pickup_latitude > 40 AND pickup_latitude < 42 AND dropoff_latitude > 40 AND dropoff_latitude < 42 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN ) SELECT * FROM taxitrips ``` 3. 用RMSE評估模型 ```sql SELECT SQRT(mean_squared_error) AS rmse FROM ML.EVALUATE(MODEL taxi.taxifare_model_2, ( WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, SQRT(POW((pickup_longitude - dropoff_longitude),2) + POW(( pickup_latitude - dropoff_latitude), 2)) as dist, #Euclidean distance between pickup and drop off SQRT(POW((pickup_longitude - dropoff_longitude),2)) as longitude, #Euclidean distance between pickup and drop off in longitude SQRT(POW((pickup_latitude - dropoff_latitude), 2)) as latitude, #Euclidean distance between pickup and drop off in latitude passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount BETWEEN 6 and 200 AND pickup_longitude > -75 #limiting of the distance the taxis travel out AND pickup_longitude < -73 AND dropoff_longitude > -75 AND dropoff_longitude < -73 AND pickup_latitude > 40 AND pickup_latitude < 42 AND dropoff_latitude > 40 AND dropoff_latitude < 42 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.EVAL ) SELECT * FROM taxitrips )) ``` 以下結果顯示,從新訓練的模型的RMSE為+-5.12,比之前的模型低很多。 ![14](https://hackmd.io/_uploads/B17pUD4va.png) ## 預測結果 以下使用新模型taxifare_model_2進行預測。 ```sql SELECT * FROM ml.PREDICT(MODEL `taxi.taxifare_model_2`, ( WITH params AS ( SELECT 1 AS TRAIN, 2 AS EVAL ), daynames AS (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek), taxitrips AS ( SELECT (tolls_amount + fare_amount) AS total_fare, daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek, EXTRACT(HOUR FROM pickup_datetime) AS hourofday, SQRT(POW((pickup_longitude - dropoff_longitude),2) + POW(( pickup_latitude - dropoff_latitude), 2)) as dist, #Euclidean distance between pickup and drop off SQRT(POW((pickup_longitude - dropoff_longitude),2)) as longitude, #Euclidean distance between pickup and drop off in longitude SQRT(POW((pickup_latitude - dropoff_latitude), 2)) as latitude, #Euclidean distance between pickup and drop off in latitude passenger_count AS passengers FROM `nyc-tlc.yellow.trips`, daynames, params WHERE trip_distance > 0 AND fare_amount BETWEEN 6 and 200 AND pickup_longitude > -75 #limiting of the distance the taxis travel out AND pickup_longitude < -73 AND dropoff_longitude > -75 AND dropoff_longitude < -73 AND pickup_latitude > 40 AND pickup_latitude < 42 AND dropoff_latitude > 40 AND dropoff_latitude < 42 AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))),1000) = params.TRAIN ) SELECT * FROM taxitrips )); ``` 以下為預測結果,顯示比之前的模型準確很多,並且與真實值的誤差也相對小,更接近真實值。 ![13](https://hackmd.io/_uploads/Sktn7PNPa.png) 以上是使用BigQuery實現機器學習的分享,BigQuery平台上還提供多種不同的機器學習模型和資料可供學習。