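# Web Scraping in Practice - COVID-19 Outbreak Data as an Example

###### tags: `光復高中`, `web scraping`, `Python`

## Step 1: Online environment

Google Colaboratory
- https://colab.research.google.com/notebooks/

Tutorials:
1. Learning Python for machine learning and other scientific computing with Google Colaboratory
    - https://medium.com/@ericsk/透過-google-colaboratory-學習使用-python-做機器學習等科學計算-9f92c7bb1f50
2. Colab basic operation notes
    - https://mattwang44.github.io/en/articles/colab/
3. Google Colab Free GPU Tutorial
    - https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d

## Step 2: Create a notebook

File > New Notebook, then name it `COVID_19.ipynb`.

## Step 3: Check the notebook environment

```python=
!python -V
!pip list
```

## Step 4: Yahoo COVID-19 page URL

https://news.campaign.yahoo.com.tw/2019-nCoV/index.php

![](https://i.imgur.com/FRMZmFB.png)

## Step 5: Fetch the full contents of the Yahoo COVID-19 page

```python=
import requests

# Download the page
r = requests.get("https://news.campaign.yahoo.com.tw/2019-nCoV/index.php")

# Continue only if the download succeeded
if r.status_code == requests.codes.ok:
    text = r.text
    print(text)
```

## Step 6: Find Taiwan's outbreak data

This step uses fixed character positions: it slices between a hard-coded start and end offset to read Taiwan's confirmed COVID-19 case count directly. The approach is fragile, because any change to the page shifts the offsets and the slice then returns the wrong text.

![](https://i.imgur.com/Br1UpXP.png)

```python=
import requests

# Download the page
r = requests.get("https://news.campaign.yahoo.com.tw/2019-nCoV/index.php")

# Continue only if the download succeeded
if r.status_code == requests.codes.ok:
    text = r.text
    # Hard-coded offsets: valid only while the page layout stays unchanged
    confirmed = text[6740:6743]
    print(confirmed)
```

## Step 7: Locate the Taiwan section dynamically

Find a keyword or HTML tag that is unique on the page, and use it to cut out the segment that contains Taiwan's confirmed case count.

![](https://i.imgur.com/lypO7nL.png)

```python=
import requests

# Download the page
r = requests.get("https://news.campaign.yahoo.com.tw/2019-nCoV/index.php")

# Continue only if the download succeeded
if r.status_code == requests.codes.ok:
    text = r.text
    # Keep everything from "台灣" (Taiwan) up to "美國" (United States)
    text = text[text.find("台灣"):text.find("美國")]
    print(text)
```

## Step 8: Narrow the range

![](https://i.imgur.com/lBlTTeY.png)

```python=
import requests

# Download the page
r = requests.get("https://news.campaign.yahoo.com.tw/2019-nCoV/index.php")

# Continue only if the download succeeded
if r.status_code == requests.codes.ok:
    text = r.text
    # Keep everything from "台灣" (Taiwan) up to "美國" (United States)
    text = text[text.find("台灣"):text.find("美國")]
    # Skip ahead to the tag holding the count (the same p tag and class that Step 10 selects)
    text = text[text.find('<p class="current"'):]
    print(text)
```

## Step 9: Extract the exact confirmed case count for Taiwan

![](https://i.imgur.com/mpRTOkw.png)

```python=
import requests

# Download the page
r = requests.get("https://news.campaign.yahoo.com.tw/2019-nCoV/index.php")

# Continue only if the download succeeded
if r.status_code == requests.codes.ok:
    text = r.text
    # Keep everything from "台灣" (Taiwan) up to "美國" (United States)
    text = text[text.find("台灣"):text.find("美國")]
    # Skip ahead to the tag holding the count (the same p tag and class that Step 10 selects)
    text = text[text.find('<p class="current"'):]
    # Keep only the text between the end of the opening tag and the closing tag
    text = text[text.find(">") + 1:text.find("</p>")]
    print(text)
```

## Step 10: Use BeautifulSoup to extract Taiwan's confirmed case count quickly and precisely

```python=
import requests
from bs4 import BeautifulSoup

# Download the page
r = requests.get("https://news.campaign.yahoo.com.tw/2019-nCoV/index.php")

# Continue only if the download succeeded
if r.status_code == requests.codes.ok:
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(r.text, 'html.parser')

    # Select each country's confirmed case count by its p tag and class
    currents = soup.find_all('p', class_='current')

    # Taiwan's count is the first item in the list
    print(currents[0].text)
```

## Step 11: Use PyQuery to extract Taiwan's confirmed case count quickly and precisely

```python=
# Install PyQuery
!pip install --upgrade pyquery
```

```python=
from pyquery import PyQuery as pq

# Download and parse the page
doc = pq(url="https://news.campaign.yahoo.com.tw/2019-nCoV/index.php")

# Select each country's confirmed case count by its p tag and class
currents = doc("p.current")

# Taiwan's count is the first item in the result
print(currents[0].text)
```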
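
The marker-based slicing in Steps 7 to 9 depends on `str.find`, which returns -1 when a marker is missing, so a changed page silently yields a wrong slice rather than an error. Below is a minimal sketch of a stricter helper. The names `extract_between` and `SAMPLE_HTML`, the sample markup, and the count 443 are all hypothetical, invented for illustration; they do not reproduce the real Yahoo page.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between start_marker and the next end_marker.

    Raises ValueError when either marker is missing, instead of
    slicing with the -1 that str.find returns in that case.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    start += len(start_marker)
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end]


# Hypothetical markup, made up for this example only
SAMPLE_HTML = (
    '<div><h3>台灣</h3><p class="current">443</p></div>'
    '<div><h3>美國</h3><p class="current">1000</p></div>'
)

# Same stages as Steps 7 to 9: the country section first, then the tag's text
section = extract_between(SAMPLE_HTML, "台灣", "美國")
taiwan_count = extract_between(section, '<p class="current">', "</p>")
print(taiwan_count)
```

Searching for the end marker only after the start position also avoids an empty or reversed slice when the end marker happens to appear earlier in the page.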
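
Steps 10 and 11 assume Taiwan is the first `p.current` element, which stops holding if the page reorders its countries. One possible alternative is to pair each count with the country label that precedes it. The sketch below swaps in the standard library's `html.parser.HTMLParser` in place of BeautifulSoup or PyQuery, so it runs without extra installs. The class `CountParser`, the `h3` label tag, and the sample markup are assumptions for illustration, not the structure of the real page.

```python
from html.parser import HTMLParser


class CountParser(HTMLParser):
    """Collect {country label: count text} pairs from simplified markup.

    Assumes each country name sits in an <h3> and its count in the
    following <p class="current">; the real page may differ.
    """

    def __init__(self):
        super().__init__()
        self.counts = {}
        self._capture = None     # "label" or "count" while inside a tag of interest
        self._last_label = None  # most recent country label seen

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._capture = "label"
        elif tag == "p" and "current" in (dict(attrs).get("class") or "").split():
            self._capture = "count"

    def handle_endtag(self, tag):
        if tag in ("h3", "p"):
            self._capture = None

    def handle_data(self, data):
        if self._capture == "label":
            self._last_label = data.strip()
        elif self._capture == "count" and self._last_label is not None:
            self.counts[self._last_label] = data.strip()


# Hypothetical markup, made up for this example only
SAMPLE_PAGE = (
    '<div><h3>台灣</h3><p class="current">443</p></div>'
    '<div><h3>美國</h3><p class="current">1000</p></div>'
)

parser = CountParser()
parser.feed(SAMPLE_PAGE)
parser.close()

# Look the count up by country name, not by list position
print(parser.counts["台灣"])
```

Looking up by label means a missing country raises a `KeyError`, where a positional index would quietly return another country's number.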