
Fetching Data

```python
import urllib.request as req

url = "..."  # the target URL
request = req.Request(url, headers={
    # cookie used by PTT (bypasses the over-18 confirmation)
    "cookie": "over18=1",
    # copy from the site's Network tab: the HTML request's Headers > Request Headers
    "User-Agent": "..."
})
with req.urlopen(request) as res:  # pass the Request object, not the bare url
    data = res.read().decode("utf-8")
print(data)
```
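A `Request` object can be inspected offline before any network call, which is a handy way to confirm the headers were actually attached (the URL and User-Agent below are placeholders):

```python
import urllib.request as req

# placeholder URL and headers, mirroring the snippet above
request = req.Request(
    "https://www.ptt.cc/bbs/Gossiping/index.html",
    headers={"cookie": "over18=1", "User-Agent": "Mozilla/5.0"},
)

# no network traffic yet: Request only builds the message
print(request.full_url)              # the URL that would be fetched
print(request.get_header("Cookie"))  # 'over18=1' (urllib capitalizes header names)
```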

Parsing Data

HTML

Using the BeautifulSoup (4) package

```
pip install beautifulsoup4
```

```python
import bs4

root = bs4.BeautifulSoup(data, "html.parser")
# print(root.title.string)  # the page title
titles = root.find_all("div", class_="title")
for title in titles:
    # deleted posts keep the .title div but have no <a> tag, so guard against None
    if title.a is not None:
        print(title.a.string)
```
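The `None` guard can be checked with a minimal, self-contained sketch; the inline HTML and post titles below are made up to mimic PTT's listing markup:

```python
import bs4

# hypothetical markup imitating PTT's board listing: a deleted post
# keeps its .title div but loses the <a> link inside it
html = """
<div class="title"><a href="/p/1">First post</a></div>
<div class="title"></div>
<div class="title"><a href="/p/2">Second post</a></div>
"""
root = bs4.BeautifulSoup(html, "html.parser")
for title in root.find_all("div", class_="title"):
    if title.a is not None:  # skip entries without a link
        print(title.a.string)
```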

Try it yourself: building a scraper with GPT-4 (Web Browsing)

Using the pttweb site as an example: `url="https://www.pttweb.cc/bbs/Gossiping"`

  1. `urllib.request` cannot pick up dynamic cookies, so switch to the more automated `requests` library. Its usage differs slightly, e.g.:

```python
import requests as req

headers = { ... }  # headers copied from the browser's Network tab
session = req.Session()
response = session.get(url, headers=headers)
```
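A `Session` carries a persistent cookie jar, so a cookie set once (or received from a response) rides along on every later request. A minimal offline sketch — no request is actually sent, and the cookie name simply mirrors PTT's age check:

```python
import requests

session = requests.Session()
session.cookies.set("over18", "1")  # preset a cookie, as PTT's age check expects
session.headers.update({"User-Agent": "Mozilla/5.0"})  # placeholder UA string
# the jar persists across every later session.get() call
print(session.cookies.get("over18"))
```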
  1. Note that the entries in the Headers tab's Request Headers with a ":" prefix are HTTP/2 pseudo-headers. Client libraries do not necessarily support them (~Sep 2021), and they are not necessarily related to scraping permissions:

```
:authority: www.ptt.cc
:method: GET
:path: /bbs/Gossiping/index.html
:scheme: https
```
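Since those ":"-prefixed pseudo-headers cannot be sent as ordinary request headers, one simple workaround is to filter them out of whatever was copied from DevTools before handing the dict to the library (the header values below are illustrative):

```python
# headers copied from DevTools, including HTTP/2 pseudo-headers
raw_headers = {
    ":authority": "www.ptt.cc",
    ":method": "GET",
    ":path": "/bbs/Gossiping/index.html",
    ":scheme": "https",
    "cookie": "over18=1",
    "User-Agent": "Mozilla/5.0",  # placeholder
}

# keep only real headers: drop any key starting with ':'
headers = {k: v for k, v in raw_headers.items() if not k.startswith(":")}
print(sorted(headers))  # ['User-Agent', 'cookie']
```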
  1. Instead of

```python
with req.urlopen(url) as res:
    data = res.read().decode("utf-8")
```

requests fetches dynamically through the session:

```python
res = session.get(url, headers=headers)
data = res.content.decode('utf-8')
```
  1. Observing the data format: the targets are `span` elements carrying a particular class (matched via `class_`), and `.strip()` removes the surrounding whitespace:

```python
titles = root.find_all("span", class_="e7-show-if-device-is-not-xs")

for title in titles:
    print(title.text.strip())
```
  1. Postscript: to match tags such as `<div data-xs>`, use `attrs` instead:

```python
titles = root.find_all("div", attrs={"data-xs": True})
```
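A quick sanity check of `attrs` matching on a bare (valueless) attribute, with made-up inline HTML:

```python
import bs4

html = '<div data-xs>compact</div><div>regular</div>'
root = bs4.BeautifulSoup(html, "html.parser")
# attrs={"data-xs": True} matches any tag that merely carries the attribute,
# regardless of its value
matches = root.find_all("div", attrs={"data-xs": True})
print([m.text for m in matches])
```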
  1. Postscript 2:
  • `.text`: returns the text content of the tag and all of its descendant tags (also available as `.get_text()`).
  • `.string`: returns content only when the tag has exactly one child node of type NavigableString; otherwise it returns None (including when the tag contains child tags such as `<b>`).
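The difference is easy to see on a tag with a nested child (made-up snippet):

```python
import bs4

root = bs4.BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = root.p
print(p.text)         # 'Hello world' -- all descendant text
print(p.string)       # None -- <p> has more than one child node
print(root.b.string)  # 'world' -- exactly one NavigableString child
```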
  1. Postscript 3: the response was once mistaken for Brotli (`br`)-compressed data rather than plain text/html, so the decompression library `brotli` was tried, and `zlib` for other compression schemes, but decompression failed because the content was not compressed at all:

```python
import brotli

if response.headers.get('Content-Encoding') == 'br':
    decompressed_data = brotli.decompress(response.content)
    text = decompressed_data.decode('utf-8')
    print(text)
else:
    print(response.text)
```

or with zlib:

```python
import zlib

decompressed_data = zlib.decompress(response.content)
```
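That failure is easy to reproduce offline: `zlib.decompress` only succeeds on data that was actually deflate-compressed, and raises `zlib.error` on plain bytes (the payload below is made up):

```python
import zlib

payload = b"<html>not compressed</html>"

# round-trip on genuinely compressed data works
assert zlib.decompress(zlib.compress(payload)) == payload

# decompressing uncompressed bytes fails, matching the observation above
try:
    zlib.decompress(payload)
except zlib.error:
    print("zlib.error: payload was never compressed")
```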
  1. Complete code:

```python
import requests as req
import bs4

url = "https://www.pttweb.cc/bbs/Gossiping"
headers = {
    # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    # "Accept-Encoding": "gzip, deflate, br",
    # "Accept-Language": "zh-TW,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,und;q=0.5,zh-CN;q=0.4,ja;q=0.3",
    # "Cache-Control": "no-cache",
    # "Cookie": "_ga=GA1.1.12609530.1687623116; PTTweb_v2_guestId_persistent=561355565; PTTweb_v2_authKey_persistent=y81n71zmu3d9dfnofiofckqbfy; PTTweb_v2_guestId=561355565; PTTweb_v2_authKey=y81n71zmu3d9dfnofiofckqbfy; _ga_F0HJ7JBSPD=GS1.1.1687627721.2.0.1687627721.0.0.0",
    # "Pragma": "no-cache",
    # "Sec-Fetch-Dest": "document",
    # "Sec-Fetch-Mode": "navigate",
    # "Sec-Fetch-Site": "none",
    # "Sec-Fetch-User": "?1",
    # "Upgrade-Insecure-Requests": "1",
    # "User-Agent": "Mozilla/5.0 (iPad; CPU OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/87.0.4280.77 Mobile/15E148 Safari/604.1 Edg/114.0.0.0"
}
session = req.Session()
response = session.get(url, headers=headers)
data = response.content.decode('utf-8')

# BeautifulSoup4: parse the text/html response into a tree
root = bs4.BeautifulSoup(data, "html.parser")
titles = root.find_all("span", class_="e7-show-if-device-is-not-xs")
for title in titles:
    print(title.text.strip())
```