# Python Web Crawling
A web crawler is a program that fetches data automatically.
It can pull the information you need from the web quickly and in bulk, saving time.
With a crawler, data collection and analysis become much more efficient.
## Table of Contents
- Requests
	- Setup
	- Fetching Your First Page
	- HTTP Methods in Requests
	- Attributes and Methods of the `Response` Object
	- Avoiding Timeouts
	- Checking HTTP Status Codes
	- Avoiding Garbled Text
	- Supplementary Notes
- Beautiful Soup
	- Setup
	- Parsing a Page
	- Pretty-Printing
	- Attributes and Methods of Tag Objects
	- Finding Tags
- Examples
## Requests
Requests is an external Python library
that lets you send HTTP requests with ease.
It is simple to pick up and suits all kinds of network tasks, such as web crawling and interacting with APIs.
---
### Setup
Run the following command to install the library (use pip, pip3, or pipenv depending on your environment).
```
pip install requests
```
---
### Fetching Your First Page
First, import the Requests library.
```python
import requests
```
Next, send a request to a page. We will use the CKEFGISC 27th-generation club website as an example.
```python
url = "https://27th.ckefgisc.org"
response = requests.get(url=url)
```
At this point, the call returns a `Response` object, which is stored in the `response` variable.
Let's try printing out what the `Response` object contains.
```python
print(response.text)
```
```python
# Output: <!DOCTYPE html><html lang="zh-TW"><head><meta charset="utf-8"/>...
```
---
### HTTP Methods in Requests
| Method  | Description                                                 |
|:-------:|:-----------------------------------------------------------:|
| GET     | Send a request (parameters are visible in the URL)          |
| POST    | Send a request (parameters are hidden)                      |
| PUT     | Supply updated content for a resource                       |
| DELETE  | Delete the specified resource                               |
| HEAD    | Request the response headers only (no body)                 |
| OPTIONS | Request the list of supported methods (in the Allow header) |

> [!Tip]
> You can test all of these HTTP methods at https://httpbin.org.
---
### Attributes and Methods of the `Response` Object
| Attribute   | Description                           |
|:-----------:|:-------------------------------------:|
| text        | Body of the response (string)         |
| content     | Body of the response (bytes)          |
| raw         | Streamed body of the response (bytes) |
| encoding    | Character encoding                    |
| status_code | HTTP status code                      |

| Method | Description             |
|:------:|:-----------------------:|
| json() | Decode the body as JSON |
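
For instance, a minimal sketch of `json()`, assuming https://httpbin.org/get as the endpoint (it simply echoes the request back as JSON):
```python
import requests

url = "https://httpbin.org/get"  # echoes the request back as JSON
response = requests.get(url=url)
data = response.json()           # decode the JSON body into a dict
print(data["url"])
# Output: https://httpbin.org/get
```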
---
### Avoiding Timeouts
Adding a timeout when sending a request keeps it from waiting too long.
```python
response = requests.get(url=url, timeout=3)
```
The code raises an exception once the timeout is exceeded.
```python
# Output:
# Traceback (most recent call last):
# ...
# requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='httpbin.org', port=443): Read timed out. (read timeout=3)
```
> [!Note]
> The timeout is not a limit on how long the whole response takes to download; rather, if the server does not respond within the timeout (usually measured up to the first byte of the response), an exception is raised.
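
If you want the program to keep running instead of crashing, you can catch the exception yourself. A minimal sketch, assuming httpbin's delay endpoint (it waits 10 seconds before responding):
```python
import requests

url = "https://httpbin.org/delay/10"  # responds only after 10 seconds
try:
	response = requests.get(url=url, timeout=3)
except requests.exceptions.Timeout:  # covers both connect and read timeouts
	print("The request timed out.")
```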
---
### Checking HTTP Status Codes
```python
response = requests.get(url=url)
print(response.status_code)
```
> [!Note]
> 1xx: the request was received and is still being processed.
> 2xx: the request was received and completed successfully.
> 3xx: the request was received, but redirection is required.
> 4xx: client-side error.
> 5xx: server-side error.

If an error occurs (any non-2xx code), you can call `Response.raise_for_status()` to make the code raise an exception.
```python
response = requests.get(url=url)
response.raise_for_status()
```
---
### Avoiding Garbled Text
If the scraped content comes out garbled, the cause is usually an encoding mismatch; simply set the encoding to match the page's.
Most pages use UTF-8, though a few older Chinese sites use Big5.
```python
response = requests.get(url=url)
response.encoding = "big5"
```
> [!Note]
> Right-click the page and choose "Inspect",
> then open the "Elements" panel;
> you can find the page's encoding there,
> declared inside the page's head.
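
If you would rather not dig through the HTML, Requests can also guess the encoding from the response body itself via `Response.apparent_encoding`. A minimal sketch, reusing the Big5 page from Example 1-5 below:
```python
import requests

url = "http://fengshan.itgo.com/8-14.htm"
response = requests.get(url=url)
response.encoding = response.apparent_encoding  # guessed from the raw bytes
print(response.text)  # the title now renders correctly
```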
---
### Supplementary Notes
* HTTP: HyperText Transfer Protocol.
  The client sends a request to the server, and the server returns a response.
  A message usually contains:
  1. the HTTP version (request)
  2. a URL (request)
  3. an HTTP status code (response)
  4. an HTTP method
  5. HTTP headers
  6. an HTTP body
* HTTP method: sometimes called an HTTP verb; it states the action the request expects the server to perform.
* HTTP headers: metadata about the request or response, such as browser information (see the sketch below).
* HTTP body: the content of the message.
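
You can inspect both of these directly on a `Response` object. A minimal sketch that prints the headers Requests sent and the headers the server returned, assuming https://httpbin.org/anything as the endpoint:
```python
import requests

url = "https://httpbin.org/anything"
response = requests.get(url=url)
print(response.request.headers)  # headers sent with the request
print(response.headers)          # headers returned by the server
```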
---
## Beautiful Soup
Beautiful Soup is an external Python library.
It extracts content from HTML / XML files and lets you search and modify them quickly.
---
### Setup
Run the following command to install the library (use pip, pip3, or pipenv depending on your environment).
```
pip install beautifulsoup4
```
---
### Parsing a Page
Continuing with the code from the previous chapter, import the Beautiful Soup library as well.
```python
import requests
from bs4 import BeautifulSoup
url = "https://27th.ckefgisc.org"
response = requests.get(url=url)
```
Convert the fetched HTML document into a tag tree.
```python
soup = BeautifulSoup(response.text, "html.parser")
```
Let's try printing something.
```python
print(soup.title)
```
```python
# Output: <title>建北電資 | CKEFGISC</title>
```
---
### Pretty-Printing
Use `soup.prettify()` to make the output's layout easier to read.
```python
print(soup.prettify())
```
```python
# Output:
# <!DOCTYPE html>
# <html lang="zh-TW">
# <head>
# <meta charset="utf-8"/>
# <meta content="width=device-width, initial-scale=1" name="viewport"/>
# <meta content="建中電研,北一資研,電研社,資訊社,建北電資,社團" name="keywords">
# ...
# </html>
```
---
### Attributes and Methods of Tag Objects
```html
<h1 id="hello"> Hello World </h1>
name  attribute (id)  content
```
```python
from bs4 import BeautifulSoup
html = '<h1 id="hello"> Hello World </h1>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.h1
print("Name:", tag.name)
print("Attributes:", tag.attrs)
print("ID:", tag["id"])
print("Content:", tag.get_text())
```
```python
# Output:
# Name: h1
# Attributes: {'id': 'hello'}
# ID: hello
# Content: Hello World
```
---
### Finding Tags
`find()`: finds the first matching tag
`find_all()`: finds all matching tags
`select_one()`: finds the first matching tag using a CSS selector
`select()`: finds all matching tags using a CSS selector
```python
from bs4 import BeautifulSoup
html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a"))
print(soup.find_all("a"))
print(soup.find("a", id="link3"))
```
```python
# Output:
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
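A common follow-up is to pull an attribute out of every tag found, for example collecting all the href links. A minimal sketch reusing the three `<a>` tags above:
```python
from bs4 import BeautifulSoup
html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("a"):
	print(tag["href"])  # read the href attribute of each tag
# Output:
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
```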
---
## Examples
### 1-1 HTTP Methods in Requests
```python
import requests
### GET, style 1
# The parameters are included in the url
url = "https://httpbin.org/anything?key1=value1&key2=value2"
response = requests.get(url=url)
# --------------------------------------------------
### GET, style 2
url = "https://httpbin.org/anything"
payload = {
	"key1": "value1",
	"key2": "value2"
}
response = requests.get(url=url, params=payload)
# --------------------------------------------------
### POST
# The parameters are not included in the url
url = "https://httpbin.org/anything"
payload = {
	"key1": "value1",
	"key2": "value2"
}
response = requests.post(url=url, data=payload)
# --------------------------------------------------
### PUT
url = "https://httpbin.org/anything"
payload = {
	"key1": "value1",
	"key2": "value2"
}
response = requests.put(url=url, data=payload)
# --------------------------------------------------
### HEAD
url = "https://httpbin.org/anything"
response = requests.head(url=url)
# print(response.text) would print nothing (HEAD responses have no body)
# --------------------------------------------------
### OPTIONS
url = "https://httpbin.org/anything"
response = requests.options(url=url)
# print(response.headers) prints headers that include Allow
```
### 1-2 Attributes and Methods of the Response Object
```python
import shutil
import requests
### content
url = "https://httpbin.org/image/jpeg"
response = requests.get(url=url)
with open("講義程式碼/1-2 image.jpeg", "wb") as file:
	file.write(response.content)
# --------------------------------------------------
### raw
url = "https://httpbin.org/image/jpeg"
response = requests.get(url=url, stream=True)
with open("講義程式碼/1-2 image2.jpeg", "wb") as file:
	shutil.copyfileobj(response.raw, file)
# --------------------------------------------------
### Using raw to download a large file (one piece at a time)
url = "https://httpbin.org/image/jpeg"
response = requests.get(url=url, stream=True)
with open("講義程式碼/1-2 image2.jpeg", "wb") as file:
	for chunk in response:  # iterating a Response yields 128-byte chunks
		file.write(chunk)
```
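The 128-byte chunks above are quite small for big downloads; `Response.iter_content()` lets you pick the chunk size yourself. A minimal sketch (the 8192-byte size and the output filename are arbitrary choices):
```python
import requests

url = "https://httpbin.org/image/jpeg"
response = requests.get(url=url, stream=True)
with open("chunked_image.jpeg", "wb") as file:  # hypothetical output path
	for chunk in response.iter_content(chunk_size=8192):  # 8 KiB per chunk
		file.write(chunk)
```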

### 1-3 Avoiding Timeouts
```python
import time
import requests
url = "https://httpbin.org/delay/10"
response = requests.get(url=url, timeout=3)
# Output:
# Traceback (most recent call last):
# ...
# requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='httpbin.org', port=443): Read timed out. (read timeout=3)
# --------------------------------------------------
# Downloading a large file
url = "https://live.staticflickr.com/65535/52259221868_3d2963c1fe_o_d.png"
start = time.time()  # start the clock
response = requests.get(url=url, timeout=3, stream=True)
with open("講義程式碼/1-3 large_image.png", "wb") as file:
	for chunk in response:
		file.write(chunk)
end = time.time()  # stop the clock
print(end - start)  # print the elapsed time
# Output: 10.603063106536865
```
### 1-4 Checking HTTP Status Codes
```python
import requests
url = "https://httpbin.org/status/200"
response = requests.get(url=url)
print(response.status_code)
# Output: 200
# --------------------------------------------------
url = "https://httpbin.org/status/404"
response = requests.get(url=url)
print(response.status_code)
# Output: 404
response.raise_for_status()
# Output:
# Traceback (most recent call last):
# ...
# requests.exceptions.HTTPError: 404 Client Error: NOT FOUND for url: https://httpbin.org/status/404
```
### 1-5 Avoiding Garbled Text
```python
import requests
url = "http://fengshan.itgo.com/8-14.htm"
response = requests.get(url=url)
print(response.text)
# Output: ... <title>¨«Åª¥xÆW-°ª¶¯¿¤¬F©²</title> ...
# --------------------------------------------------
url = "http://fengshan.itgo.com/8-14.htm"
response = requests.get(url=url)
response.encoding = "big5"
print(response.text)
# Output: ... <title>走讀台灣-高雄縣政府</title> ...
```
### 2-1 Finding Tags
```python
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link2">
Tillie
</a>
; and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find(id="link2"))
# 輸出:<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
print(soup.find(class_="sister"))
# 輸出:<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find_all("b"))
# 輸出:[<b>The Dormouse's story</b>]
print(soup.find_all(["b", "a"]))
# 輸出:[
# <b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
# ]
print(soup.find_all("a", class_="sister"))
# 輸出:[
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
# ]
print(soup.find_all("a", class_="sister", limit=2))
# 輸出:[
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ]
print(soup.select_one("title"))
# 輸出:<title>The Dormouse's story</title>
print(soup.select("body a"))
# 輸出:[
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
# ]
print(soup.select("a#link1"))
# 輸出:[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select_one("a.sister"))
# 輸出:<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```