# Python Web Crawling
A web crawler is a program that fetches data automatically.
It can pull the information you need from the web quickly and in bulk, saving time.
With a crawler, data collection and analysis become much more efficient.
## Table of Contents
- Requests
	- Setup
	- Fetching Your First Page
	- HTTP Methods in Requests
	- Attributes and Methods of the `Response` Object
	- Avoiding Timeouts
	- Checking HTTP Status Codes
	- Avoiding Garbled Text
	- Supplementary Notes
- Beautiful Soup
	- Setup
	- Parsing a Page
	- Pretty-Printing
	- Attributes and Methods of Tag Objects
	- Finding Tags
- Examples
## Requests
Requests is an external Python library
that lets you send HTTP requests with ease.
It is simple to pick up and suits all kinds of network tasks, such as web crawling and interacting with APIs.
---
### Setup
Run the following command to install the library (use pip, pip3, or pipenv depending on your environment).
```
pip install requests
```
---
### Fetching Your First Page
First, import the Requests library.
```python
import requests
```
Next, send a request to a page. We will use the CKEFGISC 27th-generation club website as an example.
```python
url = "https://27th.ckefgisc.org"
response = requests.get(url=url)
```
At this point, the call returns a `Response` object, which is stored in the `response` variable.
Let's try printing out what the `Response` object contains.
```python
print(response.text)
```
```python
# Output: <!DOCTYPE html><html lang="zh-TW"><head><meta charset="utf-8"/>...
```
---
### HTTP Methods in Requests
| Method  | Description                                                 |
|:-------:|:-----------------------------------------------------------:|
| GET     | Send a request (parameters are visible in the URL)          |
| POST    | Send a request (parameters are hidden)                      |
| PUT     | Supply updated content for a resource                       |
| DELETE  | Delete the specified resource                               |
| HEAD    | Request the response headers only (no body)                 |
| OPTIONS | Request the list of supported methods (in the Allow header) |

> [!Tip]
> You can test all of these HTTP methods at https://httpbin.org.
---
### Attributes and Methods of the `Response` Object
| Attribute   | Description                           |
|:-----------:|:-------------------------------------:|
| text        | Body of the response (string)         |
| content     | Body of the response (bytes)          |
| raw         | Streamed body of the response (bytes) |
| encoding    | Character encoding                    |
| status_code | HTTP status code                      |

| Method | Description             |
|:------:|:-----------------------:|
| json() | Decode the body as JSON |
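
For instance, a minimal sketch of `json()`, assuming https://httpbin.org/get as the endpoint (it simply echoes the request back as JSON):
```python
import requests

url = "https://httpbin.org/get"  # echoes the request back as JSON
response = requests.get(url=url)
data = response.json()           # decode the JSON body into a dict
print(data["url"])
# Output: https://httpbin.org/get
```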
---
### Avoiding Timeouts
Adding a timeout when sending a request keeps it from waiting too long.
```python
response = requests.get(url=url, timeout=3)
```
The code raises an exception once the timeout is exceeded.
```python
# Output:
# Traceback (most recent call last):
# ...
# requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='httpbin.org', port=443): Read timed out. (read timeout=3)
```
> [!Note]
> The timeout is not a limit on how long the whole response takes to download; rather, if the server does not respond within the timeout (usually measured up to the first byte of the response), an exception is raised.
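
If you want the program to keep running instead of crashing, you can catch the exception yourself. A minimal sketch, assuming httpbin's delay endpoint (it waits 10 seconds before responding):
```python
import requests

url = "https://httpbin.org/delay/10"  # responds only after 10 seconds
try:
	response = requests.get(url=url, timeout=3)
except requests.exceptions.Timeout:  # covers both connect and read timeouts
	print("The request timed out.")
```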
---
### Checking HTTP Status Codes
```python
response = requests.get(url=url)
print(response.status_code)
```
> [!Note]
> 1xx: the request was received and is still being processed.
> 2xx: the request was received and completed successfully.
> 3xx: the request was received, but redirection is required.
> 4xx: client-side error.
> 5xx: server-side error.

If an error occurs (any non-2xx code), you can call `Response.raise_for_status()` to make the code raise an exception.
```python
response = requests.get(url=url)
response.raise_for_status()
```
---
### Avoiding Garbled Text
If the scraped content comes out garbled, the cause is usually an encoding mismatch; simply set the encoding to match the page's.
Most pages use UTF-8, though a few older Chinese sites use Big5.
```python
response = requests.get(url=url)
response.encoding = "big5"
```
> [!Note]
> Right-click the page and choose "Inspect",
> then open the "Elements" panel;
> you can find the page's encoding there,
> declared inside the page's head.
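
If you would rather not dig through the HTML, Requests can also guess the encoding from the response body itself via `Response.apparent_encoding`. A minimal sketch, reusing the Big5 page from Example 1-5 below:
```python
import requests

url = "http://fengshan.itgo.com/8-14.htm"
response = requests.get(url=url)
response.encoding = response.apparent_encoding  # guessed from the raw bytes
print(response.text)  # the title now renders correctly
```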
---
### Supplementary Notes
* HTTP: HyperText Transfer Protocol.
  The client sends a request to the server, and the server returns a response.
  A message usually contains:
  1. the HTTP version (request)
  2. a URL (request)
  3. an HTTP status code (response)
  4. an HTTP method
  5. HTTP headers
  6. an HTTP body
* HTTP method: sometimes called an HTTP verb; it states the action the request expects the server to perform.
* HTTP headers: metadata about the request or response, such as browser information (see the sketch below).
* HTTP body: the content of the message.
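
You can inspect both of these directly on a `Response` object. A minimal sketch that prints the headers Requests sent and the headers the server returned, assuming https://httpbin.org/anything as the endpoint:
```python
import requests

url = "https://httpbin.org/anything"
response = requests.get(url=url)
print(response.request.headers)  # headers sent with the request
print(response.headers)          # headers returned by the server
```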
---
## Beautiful Soup
Beautiful Soup is an external Python library.
It extracts content from HTML / XML files and lets you search and modify them quickly.
---
### Setup
Run the following command to install the library (use pip, pip3, or pipenv depending on your environment).
```
pip install beautifulsoup4
```
---
### Parsing a Page
Continuing with the code from the previous chapter, import the Beautiful Soup library as well.
```python
import requests
from bs4 import BeautifulSoup
url = "https://27th.ckefgisc.org"
response = requests.get(url=url)
```
Convert the fetched HTML document into a tag tree.
```python
soup = BeautifulSoup(response.text, "html.parser")
```
Let's try printing something.
```python
print(soup.title)
```
```python
# Output: <title>建北電資 | CKEFGISC</title>
```
---
### Pretty-Printing
Use `soup.prettify()` to make the output's layout easier to read.
```python
print(soup.prettify())
```
```python
# Output:
# <!DOCTYPE html>
# <html lang="zh-TW">
# <head>
# <meta charset="utf-8"/>
# <meta content="width=device-width, initial-scale=1" name="viewport"/>
# <meta content="建中電研,北一資研,電研社,資訊社,建北電資,社團" name="keywords">
# ...
# </html>
```
---
### Attributes and Methods of Tag Objects
```html
<h1 id="hello"> Hello World </h1>
name  attribute (id)  content
```
```python
from bs4 import BeautifulSoup
html = '<h1 id="hello"> Hello World </h1>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.h1
print("Name:", tag.name)
print("Attributes:", tag.attrs)
print("ID:", tag["id"])
print("Content:", tag.get_text())
```
```python
# Output:
# Name: h1
# Attributes: {'id': 'hello'}
# ID: hello
# Content: Hello World
```
---
### Finding Tags
`find()`: finds the first matching tag
`find_all()`: finds all matching tags
`select_one()`: finds the first matching tag using a CSS selector
`select()`: finds all matching tags using a CSS selector
```python
from bs4 import BeautifulSoup
html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a"))
print(soup.find_all("a"))
print(soup.find("a", id="link3"))
```
```python
# Output:
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
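A common follow-up is to pull an attribute out of every tag found, for example collecting all the href links. A minimal sketch reusing the three `<a>` tags above:
```python
from bs4 import BeautifulSoup
html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("a"):
	print(tag["href"])  # read the href attribute of each tag
# Output:
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
```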
---
## Examples
### 1-1 HTTP Methods in Requests
```python
import requests
### GET, style 1
# The parameters are included in the url
url = "https://httpbin.org/anything?key1=value1&key2=value2"
response = requests.get(url=url)
# --------------------------------------------------
### GET, style 2
url = "https://httpbin.org/anything"
payload = {
	"key1": "value1",
	"key2": "value2"
}
response = requests.get(url=url, params=payload)
# --------------------------------------------------
### POST
# The parameters are not included in the url
url = "https://httpbin.org/anything"
payload = {
	"key1": "value1",
	"key2": "value2"
}
response = requests.post(url=url, data=payload)
# --------------------------------------------------
### PUT
url = "https://httpbin.org/anything"
payload = {
	"key1": "value1",
	"key2": "value2"
}
response = requests.put(url=url, data=payload)
# --------------------------------------------------
### HEAD
url = "https://httpbin.org/anything"
response = requests.head(url=url)
# print(response.text) would print nothing (HEAD responses have no body)
# --------------------------------------------------
### OPTIONS
url = "https://httpbin.org/anything"
response = requests.options(url=url)
# print(response.headers) prints headers that include Allow
```
### 1-2 Attributes and Methods of the Response Object
```python
import shutil
import requests
### content
url = "https://httpbin.org/image/jpeg"
response = requests.get(url=url)
with open("講義程式碼/1-2 image.jpeg", "wb") as file:
	file.write(response.content)
# --------------------------------------------------
### raw
url = "https://httpbin.org/image/jpeg"
response = requests.get(url=url, stream=True)
with open("講義程式碼/1-2 image2.jpeg", "wb") as file:
	shutil.copyfileobj(response.raw, file)
# --------------------------------------------------
### Using raw to download a large file (one piece at a time)
url = "https://httpbin.org/image/jpeg"
response = requests.get(url=url, stream=True)
with open("講義程式碼/1-2 image2.jpeg", "wb") as file:
	for chunk in response:  # iterating a Response yields 128-byte chunks
		file.write(chunk)
```
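The 128-byte chunks above are quite small for big downloads; `Response.iter_content()` lets you pick the chunk size yourself. A minimal sketch (the 8192-byte size and the output filename are arbitrary choices):
```python
import requests

url = "https://httpbin.org/image/jpeg"
response = requests.get(url=url, stream=True)
with open("chunked_image.jpeg", "wb") as file:  # hypothetical output path
	for chunk in response.iter_content(chunk_size=8192):  # 8 KiB per chunk
		file.write(chunk)
```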

### 1-3 Avoiding Timeouts
```python
import time
import requests
url = "https://httpbin.org/delay/10"
response = requests.get(url=url, timeout=3)
# Output:
# Traceback (most recent call last):
# ...
# requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='httpbin.org', port=443): Read timed out. (read timeout=3)
# --------------------------------------------------
# Downloading a large file
url = "https://live.staticflickr.com/65535/52259221868_3d2963c1fe_o_d.png"
start = time.time()  # start the clock
response = requests.get(url=url, timeout=3, stream=True)
with open("講義程式碼/1-3 large_image.png", "wb") as file:
	for chunk in response:
		file.write(chunk)
end = time.time()  # stop the clock
print(end - start)  # print the elapsed time
# Output: 10.603063106536865
```
### 1-4 Checking HTTP Status Codes
```python
import requests
url = "https://httpbin.org/status/200"
response = requests.get(url=url)
print(response.status_code)
# Output: 200
# --------------------------------------------------
url = "https://httpbin.org/status/404"
response = requests.get(url=url)
print(response.status_code)
# Output: 404
response.raise_for_status()
# Output:
# Traceback (most recent call last):
# ...
# requests.exceptions.HTTPError: 404 Client Error: NOT FOUND for url: https://httpbin.org/status/404
```
### 1-5 Avoiding Garbled Text
```python
import requests
url = "http://fengshan.itgo.com/8-14.htm"
response = requests.get(url=url)
print(response.text)
# Output: ... <title>¨«Åª¥xÆW-°ª¶¯¿¤¬F©²</title> ...
# --------------------------------------------------
url = "http://fengshan.itgo.com/8-14.htm"
response = requests.get(url=url)
response.encoding = "big5"
print(response.text)
# Output: ... <title>走讀台灣-高雄縣政府</title> ...
```
### 2-1 Finding Tags
```python
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link2">
Tillie
</a>
; and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find(id="link2"))
# 輸出:<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
print(soup.find(class_="sister"))
# 輸出:<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find_all("b"))
# 輸出:[<b>The Dormouse's story</b>]
print(soup.find_all(["b", "a"]))
# 輸出:[
# <b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
# ]
print(soup.find_all("a", class_="sister"))
# 輸出:[
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
# ]
print(soup.find_all("a", class_="sister", limit=2))
# 輸出:[
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ]
print(soup.select_one("title"))
# 輸出:<title>The Dormouse's story</title>
print(soup.select("body a"))
# 輸出:[
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
# ]
print(soup.select("a#link1"))
# 輸出:[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select_one("a.sister"))
# 輸出:<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```