Automate crawling job with Google Cloud Function
===

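We write a crawler in Python that periodically fetches the YouBike 2.0 Taipei City public bicycle real-time information feed, and deploy it on Google Cloud Platform (GCP). The GCP services used are Cloud Function, Cloud Scheduler, and Cloud Storage.

## Writing the crawler

The crawler uses the requests and pandas packages to fetch and parse the [YouBike 2.0 real-time bicycle information](https://data.gov.tw/dataset/137993), which reports the usage status of every YouBike station in Taipei City. pandas then converts the parsed data into a JSON file that pandas can read back, and the file is saved to Cloud Storage under a name built from the current time.

```python
import datetime
import json

import pandas as pd
import requests
from google.cloud import storage

# Target URL: YouBike 2.0 real-time station feed
URL = 'https://tcgbusfs.blob.core.windows.net/dotapp/youbike/v2/youbike_immediate.json'

def my_requests(url):
    """Fetch the URL and return the response body as text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of storing an error page
    return response.text

def to_json(source):
    """Convert the raw feed into the JSON layout that pandas can read back."""
    data = json.loads(source)
    df = pd.DataFrame(data)
    return df.to_json(orient='columns')

def time():
    """Return the current Taiwan time (UTC+8) as a string for the file name."""
    taiwan = datetime.timezone(datetime.timedelta(hours=8))
    return datetime.datetime.now(taiwan).strftime('%Y%m%d_%H:%M')

def upload_file_to_bucket():
    """Upload the crawled data to Cloud Storage."""
    client = storage.Client()
    bucket = client.get_bucket('ubike-2')
    blob = bucket.blob('ubike test/' + time() + '_ubike2.json')
    blob.upload_from_string(to_json(my_requests(URL)))

def hello_world(request):
    """HTTP entry point of the Cloud Function."""
    upload_file_to_bucket()
    return 'OK'  # an HTTP function must return a response
```

## Creating a GCP Cloud Function

GCP (Google Cloud Platform) is the cloud computing service offered by Google. It runs on the same infrastructure that Google uses for its own products, such as Google Search and YouTube, and provides infrastructure services, platform services, and serverless computing environments. Cloud Functions is the serverless compute service in GCP: it lets you attach simple, single-purpose functions to cloud infrastructure or to emitted events. When an event fires, the Cloud Function is triggered and your code runs in a fully managed environment, so you do not need to provision any infrastructure or manage any servers.

1. Go to cloud.google.com, register an account, and sign in.
2. Select Cloud Function.
   ![](https://i.imgur.com/cvYWltC.png)
3. Create a new Function.
   ![](https://i.imgur.com/94z70hp.png)
4. Configure the Cloud Function.
   ![](https://i.imgur.com/xO9XlR7.png)
5. Paste the program written above into the Function source.
   ![](https://i.imgur.com/7EFg7Pa.png)
6. Set up requirements.txt to list the packages the program needs.
   ![](https://i.imgur.com/q2JjfZg.png)
7. Test whether the Cloud Function runs successfully.
   ![](https://i.imgur.com/rpCxrsR.png)
   ![](https://i.imgur.com/1cLH8Ci.png)
8. Check whether the data has been stored in Cloud Storage.
   ![](https://i.imgur.com/CviBZ07.png)

---

## Creating GCP Service Accounts

A service account is an account bound to an application rather than to a real user. When an external service accesses a GCP service, or when GCP services communicate with each other, permissions are granted through a service account. We therefore use a service account to give Cloud Scheduler permission to call the Cloud Function.

1. Select Service Accounts.
   ![](https://i.imgur.com/gZ0WlI5.png)
2. Configure the service account.
   ![](https://i.imgur.com/ZG1U6jD.png)
3. Set the role of the service account.
   ![](https://i.imgur.com/xu6Oh5n.png)
4. Copy the email address of the service account.
   ![](https://i.imgur.com/nAgUgMg.png)
   * Both Cloud Function and Cloud Scheduler will need this account.

---

## Creating a GCP Cloud Scheduler job

Cloud Scheduler is a managed cron-as-a-service, not a compute service. It lets you schedule jobs with cron syntax, but it only replaces the scheduling component of cron: a job can only send an HTTP request or publish a Pub/Sub message. We use Cloud Scheduler to run the crawler on a schedule of our choosing.

1. Select Cloud Scheduler and create a new Job.
   ![](https://i.imgur.com/OC5Wg8L.png)
2. Set the schedule on which Cloud Scheduler fires.
   ![](https://i.imgur.com/H0AM4ox.png)
   * For help with cron expressions, see [crontab guru](https://crontab.guru/).
3. Set the target to trigger.
   * First, find the URL of the Cloud Function.
     ![](https://i.imgur.com/XqOo4di.png)
     ![](https://i.imgur.com/vdaB7qY.png)
   * Paste in the Cloud Function URL and the service account email.
     ![](https://i.imgur.com/EqHCJSM.png)
   * Test-run the Cloud Scheduler job.
     ![](https://i.imgur.com/Irx8cBA.png)
   * Check whether the data has been stored in Cloud Storage.

---

## Summary

In this section we briefly showed how to use the Python requests and pandas packages to fetch web data, convert it into a JSON format that pandas can read, and store it in GCP Cloud Storage. Finally, we created a serverless GCP Cloud Function and used Cloud Scheduler to run the crawler on a schedule.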
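
The crawler names each upload after the Taiwan-time (UTC+8) moment of the crawl. Below is a minimal sketch of that naming step on its own, so it can be checked locally without touching Cloud Storage. The helper names `build_blob_name` and `parse_blob_time` are hypothetical and not part of the original function. The prefix, suffix, and time format are copied from the crawler code, and the sketch assumes the caller passes a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone

# Taiwan is UTC+8 all year round (no daylight saving time).
TAIWAN = timezone(timedelta(hours=8))
TIME_FORMAT = "%Y%m%d_%H:%M"


def build_blob_name(now=None, prefix="ubike test/", suffix="_ubike2.json"):
    """Return the object name for a crawl taken at `now`.

    `now` should be timezone-aware; it defaults to the current UTC time.
    (Hypothetical helper, for illustration only.)
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.astimezone(TAIWAN).strftime(TIME_FORMAT)
    return prefix + stamp + suffix


def parse_blob_time(name, prefix="ubike test/", suffix="_ubike2.json"):
    """Recover the Taiwan-time timestamp from a name built by build_blob_name."""
    stamp = name[len(prefix):-len(suffix)]
    return datetime.strptime(stamp, TIME_FORMAT).replace(tzinfo=TAIWAN)
```

Because the time is passed in as an argument, the name for a given crawl is reproducible. A crawl at 04:30 UTC on 2021-05-01, for example, maps to `ubike test/20210501_12:30_ubike2.json`. Using a timezone-aware datetime also means the result does not depend on the clock settings of the runtime.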