
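# Basic Web Scraping

Blah blah, this looks pretty fun, so let's learn it.

[TOC]

## Install the required packages (if not already installed):

1. Open VS Code.
2. Open the terminal:
    * Windows/Linux: Ctrl + \` (that is, Ctrl + backtick), or use the menu [View] → [Terminal].
    * macOS: Ctrl + \` as well (the Control key, not Cmd).
3. In the terminal, enter the following command (copy and paste works):

```
pip install requests beautifulsoup4
```

----

### If you see: 'pip' is not recognized as an internal or external command, operable program or batch file.

That means **"Add Python to PATH"** was not checked during installation!

![image](https://hackmd.io/_uploads/SJSJS22xxx.png)

Uninstall Python and reinstall it, this time with the box checked.

## Importing the libraries

```python=
import requests
from bs4 import BeautifulSoup
```

## Basic scraper example

```python=
import requests
from bs4 import BeautifulSoup

# 1. Send the request (GET)
url = 'https://example.com'
response = requests.get(url)

# 2. Check that the request succeeded
if response.status_code == 200:
    html = response.text  # the raw HTML source

    # 3. Parse the HTML
    soup = BeautifulSoup(html, 'html.parser')

    # 4. Find specific tags
    title = soup.find('h1')  # first <h1> tag, or None if the page has none
    if title is not None:
        print("Title:", title.text)

    # Find all hyperlinks
    links = soup.find_all('a')
    for link in links:
        print("Link text:", link.text)
        # .get() returns None instead of raising KeyError when <a> has no href
        print("Link URL:", link.get('href'))
else:
    print("Request failed, status code:", response.status_code)
```

## Syntax overview

| Name              | What it does                                                        |
| ----------------- | ------------------------------------------------------------------- |
| `requests.get()`  | Sends a GET request                                                 |
| `response.text`   | Returns the page source as text                                     |
| `BeautifulSoup()` | The tool that parses the HTML                                       |
| `soup.find()`     | Finds the first tag that matches, or `None` if nothing matches      |
| `soup.find_all()` | Finds all tags that match; returns a list                           |
| `tag.text`        | The plain text inside a tag                                         |
| `tag['href']`     | Gets the link attribute (URL) of an `<a>` tag; raises `KeyError` if the attribute is missing |

## Notes

Some sites block scrapers, so you may need to add headers to imitate a normal user:

```python=
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
```

Some data is not in the HTML itself but is loaded dynamically by JavaScript. In that case, use Selenium or fetch the data from the site's API.

## Common parsing methods (BeautifulSoup)

### 1. Using .find() and .find_all()

```python=
soup.find('div')                    # first <div>
soup.find_all('div')                # all <div> tags
soup.find_all('a', class_='link')   # all <a> tags with class "link"
```

### 2. Using .select() with CSS selectors (powerful)

```python=
soup.select('div.content > h2')  # every <h2> that is a direct child of <div class="content">
soup.select('.title')            # elements with class "title"
soup.select('#main')             # element with id "main"
```

| Method / attribute     | Description                                        |
| ---------------------- | -------------------------------------------------- |
| `.text`                | Gets the plain text                                |
| `.get('href')`         | Gets the value of an attribute, such as a link URL |
| `.attrs`               | Gets all attributes as a dictionary                |
| `.parent`              | Gets the parent node                               |
| `.children`            | Returns an iterator over the child nodes           |
| `.find_next_sibling()` | The next sibling node                              |

## Adding headers, parameters, and cookies

### Imitating a normal browser

```python=
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://example.com'
}
response = requests.get(url, headers=headers)
```

### Sending parameters

```python=
params = {'q': 'python'}
requests.get('https://www.google.com/search', params=params)
```

### Simulating a login and keeping cookies

```python=
session = requests.Session()
response = session.get(url)
# Later requests made through this session send the cookies automatically
```

## Adding waits and delays (to avoid getting blocked)

```python=
import time
import random

time.sleep(random.uniform(1, 3))  # random delay of 1 to 3 seconds
```

## Common error handling

```python=
try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
```

## Extra tools

| Tool          | Purpose                                                    |
| ------------- | ---------------------------------------------------------- |
| `lxml`        | A faster HTML/XML parser                                   |
| `re`          | Regular expressions, used with `.text` for text analysis   |
| `pandas`      | Stores scraped data as tables and analyzes it              |
| `csv`, `json` | Save data to files                                         |

## Example: saving to a CSV file

```python=
import csv

with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Link'])  # header row
    writer.writerow(['Python tutorial', 'https://example.com'])
```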
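
The CSV example writes two hard-coded rows. Here is a minimal sketch of how a whole batch of scraped results might be saved at once, assuming they have already been collected as (title, link) pairs. The names `save_links_to_csv`, `sample_rows`, and `links_demo.csv` are hypothetical and used only for illustration, and the sample data stands in for real scraped output.

```python
import csv
import os
import tempfile


def save_links_to_csv(rows, path):
    """Write (title, link) pairs to a CSV file with a header row.

    Hypothetical helper for illustration; returns the number of data rows written.
    """
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'link'])  # header row
        writer.writerows(rows)              # one CSV line per pair
    return len(rows)


# Made-up sample data standing in for scraped (link text, href) pairs
sample_rows = [
    ('Python tutorial', 'https://example.com'),
    ('Contact, about', 'https://example.com/about'),  # comma gets quoted by csv
]

# Write to the system temp directory so the demo does not clutter the project
out_path = os.path.join(tempfile.gettempdir(), 'links_demo.csv')
count = save_links_to_csv(sample_rows, out_path)

# Read the file back to confirm the rows round-trip unchanged
with open(out_path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(count, "rows written to", out_path)
```

Using `csv.writer` instead of joining strings by hand means that titles containing commas or quotes are escaped automatically, which is common in scraped text.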