## Fetching Data
```python=
import urllib.request as req

url = "target URL"  # e.g. "https://www.ptt.cc/bbs/Gossiping/index.html"
request = req.Request(url, headers={
    # PTT's over-18 confirmation cookie
    "cookie": "over18=1",
    # copy from DevTools: Network > the html request > Headers > Request Headers
    "User-Agent": "your browser's User-Agent string"
})
with req.urlopen(request) as res:  # pass the Request object so the custom headers are sent
    data = res.read().decode("utf-8")
print(data)
```
## Parsing Data
### HTML
Use the BeautifulSoup (4) package:
```bash
pip install beautifulsoup4
```
```python=
import bs4
root = bs4.BeautifulSoup(data, "html.parser")
# print(root.title.string)
titles = root.find_all("div", class_="title")  # every block whose class is "title"
for title in titles:
    if title.a is not None:  # skip entries that have no <a> link (e.g. deleted posts)
        print(title.a.string)
```
## Try It Yourself: A Scraper Built with GPT-4 (Web Browsing)
Using the pttweb site as the example: `url="https://www.pttweb.cc/bbs/Gossiping"`
1. `urllib.request` does not pick up dynamically set cookies here, so switch to the more automated `requests` package.
Its usage differs slightly, for example (a fuller sketch of the session's cookie handling follows the snippet below):
```python
import requests as req

headers = { ... }  # copied from DevTools > Request Headers
session = req.Session()
response = session.get(url, headers=headers)
```
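A minimal sketch of the cookie handling a `Session` gives you (the `User-Agent` value below is a placeholder; copy a real one from DevTools): cookies the server sets are stored on the session and sent back automatically on later requests.
```python
import requests as req

headers = {"User-Agent": "your browser's User-Agent string"}  # placeholder
session = req.Session()

# the first request lets the server set its cookies on the session
res = session.get("https://www.pttweb.cc/bbs/Gossiping", headers=headers)
print(session.cookies.get_dict())  # whatever cookies were set dynamically

# later requests in the same session send those cookies back automatically
res = session.get("https://www.pttweb.cc/bbs/Gossiping", headers=headers)
```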
2. Note that in the `Headers` tab, the `Request Headers` entries prefixed with ":" are HTTP/2 pseudo-headers; client libraries did not necessarily support them (~SEP 2021), and they are not necessarily related to scraping access either (a small filtering sketch follows the listing below).
```text
:Authority:www.ptt.cc
:Method:GET
:Path:/bbs/Gossiping/index.html
:Scheme:https
```
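If you copy the whole `Request Headers` block out of DevTools, one simple way to handle these is to drop every key that starts with `:` before handing the dict to `requests`; a sketch (the `raw_headers` dict below is just an illustration):
```python
raw_headers = {
    ":authority": "www.ptt.cc",      # HTTP/2 pseudo-headers, not real header fields
    ":method": "GET",
    ":path": "/bbs/Gossiping/index.html",
    ":scheme": "https",
    "Cookie": "over18=1",
    "User-Agent": "your browser's User-Agent string",
}
# keep only the ordinary header fields
headers = {k: v for k, v in raw_headers.items() if not k.startswith(":")}
```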
3. Instead of using
```python
with req.urlopen(url) as res:
    data = res.read().decode("utf-8")
```
`requests` fetches through a `session` (a short note on `res.text` follows the snippet):
```python
res = session.get(url, headers=headers)
data = res.content.decode('utf-8')
```
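Note that `requests` can also handle the decoding for you: `res.text` returns the body as a string using the encoding `requests` detects, so `res.content.decode('utf-8')` is only needed when you want to force a particular encoding.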
4. Observe the data format: the targets are `span` elements carrying a particular class (selected with `class_`), and the surrounding whitespace is removed with `.strip()`:
```python
titles = root.find_all("span", class_="e7-show-if-device-is-not-xs")
for title in titles:
    print(title.text.strip())
```
5. Postscript: to match tags like `<div data-xs>`, use `attrs` (a small sketch of value matching follows the snippet):
```python
titles = root.find_all("div", attrs={"data_xs": True})
```
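The same `attrs` parameter also accepts concrete values rather than just `True` (which only checks that the attribute is present); a small sketch, where the value `"1"` is purely hypothetical:
```python
# match <div data-xs> no matter what value the attribute has
root.find_all("div", attrs={"data-xs": True})
# match only <div data-xs="1"> (the value "1" is made up for illustration)
root.find_all("div", attrs={"data-xs": "1"})
```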
6. Postscript 2 (a small sketch of the difference follows the two points below):
- `.text`: returns the text of the tag and all of its descendant tags (also available as `.get_text()`).
- `.string`: returns content only when the tag has exactly one `NavigableString`-type child node; otherwise it returns `None` (e.g. when the tag also contains children such as `<b>`).
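A minimal sketch of the difference, using a made-up fragment:
```python
import bs4

soup = bs4.BeautifulSoup("<div>hello <b>world</b></div>", "html.parser")
print(soup.div.text)    # "hello world" -- text of the tag and all descendants
print(soup.div.string)  # None -- the div has more than one child node
print(soup.b.string)    # "world" -- single NavigableString child, so .string works
```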
7. Postscript 3: at one point the response was mistakenly assumed not to be plain text/html but Brotli-compressed (`br`), so the decompression package `brotli` was tried (and `zlib` for other compression encodings), but decompression failed because the content was never compressed in the first place (a header check that would have caught this follows the snippets below).
```python
import brotli

if response.headers.get('Content-Encoding') == 'br':
    decompressed_data = brotli.decompress(response.content)
    text = decompressed_data.decode('utf-8')
    print(text)
else:
    print(response.text)
```
Or with zlib:
```python
import zlib
decompressed_data = zlib.decompress(response.content)
```
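A quick check of the response headers would have shown that no decompression was needed; a sketch against the `response` from the snippet in step 1 (actual header values may differ):
```python
print(response.headers.get("Content-Type"))      # e.g. "text/html; charset=utf-8"
print(response.headers.get("Content-Encoding"))  # None when the body is not compressed
# requests already decompresses gzip/deflate transparently,
# so for a plain HTML page response.text is usually all you need
```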
8. Complete code:
```python=
import requests as req
import bs4
url="https://www.pttweb.cc/bbs/Gossiping"
headers = {
# "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
# "Accept-Encoding":"gzip, deflate, br",
# "Accept-Language":"zh-TW,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,und;q=0.5,zh-CN;q=0.4,ja;q=0.3",
# "Cache-Control":"no-cache",
# "Cookie":"_ga=GA1.1.12609530.1687623116; PTTweb_v2_guestId_persistent=561355565; PTTweb_v2_authKey_persistent=y81n71zmu3d9dfnofiofckqbfy; PTTweb_v2_guestId=561355565; PTTweb_v2_authKey=y81n71zmu3d9dfnofiofckqbfy; _ga_F0HJ7JBSPD=GS1.1.1687627721.2.0.1687627721.0.0.0",
# "Pragma":"no-cache",
# "Sec-Fetch-Dest":"document",
# "Sec-Fetch-Mode":"navigate",
# "Sec-Fetch-Site":"none",
# "Sec-Fetch-User":"?1",
# "Upgrade-Insecure-Requests":"1",
# "User-Agent":"Mozilla/5.0 (iPad; CPU OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/87.0.4280.77 Mobile/15E148 Safari/604.1 Edg/114.0.0.0"
}
session = req.Session()
response = session.get(url, headers=headers)
data = response.content.decode('utf-8')
# parse the text/html with beautifulsoup4
root = bs4.BeautifulSoup(data, "html.parser")
titles = root.find_all("span", class_="e7-show-if-device-is-not-xs")
for title in titles:
    print(title.text.strip())
```