
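Crawler Restrictions
===
###### tags: `爬蟲`, `限制`, `wget`, `--user-agent`

<br>

[TOC]

<br>

## Restrictions
### 403 Forbidden
- ### 2025/03/01
    - [「雙失青年」用GPT寫出3款APPs　對編程一竅不通　卻年賺850萬美元](https://wealth.hket.com/article/3908135)
    - **403 Forbidden**
        ```
        $ wget https://wealth.hket.com/article/3908135
        --2025-11-24 10:16:36--  https://wealth.hket.com/article/3908135
        Resolving wealth.hket.com (wealth.hket.com)... 3.169.121.20, 3.169.121.89, 3.169.121.102, ...
        Connecting to wealth.hket.com (wealth.hket.com)|3.169.121.20|:443... connected.
        HTTP request sent, awaiting response... 403 Forbidden
        2025-11-24 10:16:36 ERROR 403: Forbidden.
        ```
    - **Adding user agent information**
        ```
        $ wget \
            --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36" \
            "https://wealth.hket.com/article/3908135"
        --2025-11-24 10:22:04--  https://wealth.hket.com/article/3908135
        Resolving wealth.hket.com (wealth.hket.com)... 3.169.121.20, 3.169.121.43, 3.169.121.89, ...
        Connecting to wealth.hket.com (wealth.hket.com)|3.169.121.20|:443... connected.
        HTTP request sent, awaiting response... 200 OK
        Length: unspecified [text/html]
        Saving to: ‘3908135.1’

        3908135.1        [ <=>        ] 130.95K  --.-KB/s    in 0.1s

        2025-11-24 10:22:05 (1.21 MB/s) - ‘3908135.1’ saved [134093]
        ```
    - **Why can I read the article normally in my browser?**
        > by gpt-5.1 (2025/11/24)

        Because "you using a browser" and "you using `wget`" are **two completely different things** in the eyes of the website. The server uses many signals to decide whether you are a "normal user", and `wget` is easily classified as a bot and rejected with a 403. When I fetched this URL with a tool on my side, I also got a 403, which indicates the site really does apply this kind of anti-bot control and that you did not mistype the command. The differences are roughly as follows:

        ---

        ### Why does the browser work while wget gets a 403?

        #### 1. The User-Agent is treated as a bot

        By default, `wget` sends something like:

        ```http
        User-Agent: Wget/1.21.4 (linux-gnu)
        ```

        Many news and finance sites block this kind of UA outright with a rule, because it usually indicates:

        * automated crawlers
        * batch downloads
        * traffic that may affect server load or violate content licensing

        A browser sends:

        ```http
        User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36
        ```

        This looks like a browser used by an ordinary person, so it is let through.

        ---

        #### 2. Browsers send many headers and cookies; wget does not

        A browser normally also sends:

        * `Accept`, `Accept-Language`, `Accept-Encoding`
        * `Referer` (for example, when you click through from Google)
        * `Cookie` (login information or tracking IDs left from earlier visits)

        Some sites apply rules such as:

        * certain cookies are missing → treat the source as suspicious
        * no normal `Accept` / `Accept-Language` → treat the client as a bot
        * `Referer` is not on the allowlist → block immediately

        By default, `wget` sends only a generic `Accept: */*` and no `Accept-Language`, `Referer`, or cookies, so it is easily rejected with a 403.

        ---

        #### 3. Bot detection by a WAF / CDN (CloudFront / Cloudflare style)

        Judging from the DNS/IP information, news and finance sites like this one usually sit behind a CDN or WAF, which may:

        * check whether the IP belongs to a data center (cloud hosting, VPN exit node)
        * check whether the request rate, frequency, and UA look like a crawler
        * return 403 immediately if the client "does not look human"

        You in a browser on a home network, behaving like an ordinary user → allowed.
        You in a shell running `wget` (especially from a server in a data center) → quite likely judged to be a bot → 403.

        ---

        ### How can this be verified? (technical angle)

        If you only want to test the cause, not crawl data over the long term, you can try:

        ```bash
        wget \
          --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36" \
          "https://wealth.hket.com/article/3908135"
        ```

        Going one step further, add the `Accept` family of headers:

        ```bash
        wget \
          --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36" \
          --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
          --header="Accept-Language: zh-TW,zh;q=0.9,en;q=0.8" \
          "https://wealth.hket.com/article/3908135"
        ```

        If the download suddenly succeeds, you can be fairly sure the problem is a **User-Agent / header restriction**.

        > ⚠️ Strictly speaking, bypassing the check this way may violate the site's terms of use or robots.txt.
        > A one-off test is fine, but it is safer not to use it for bulk crawling or commercial purposes.

        ---

        ### One-sentence summary

        > When you use a browser, the site sees "a normal human with a normal browser";
        > when you use `wget`, the site sees "a program that looks like a crawler",
        > so it opens the door for the browser and returns 403 Forbidden to `wget`.

        If you like, I can also write a better-behaved article-fetching tool for you (one that handles retries, throttling, and respects robots.txt).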
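
The verification step described above (same URL, once with the default `wget` User-Agent and once with a browser-like one) can be sketched as a small script. This is only a minimal sketch: the function names `last_status` and `fetch_status` are made up for illustration, and it assumes a wget 1.x build whose `--server-response` output prints status lines such as `  HTTP/1.1 403 Forbidden`.

```bash
#!/usr/bin/env bash
# Sketch: compare the HTTP status wget receives with its default User-Agent
# versus a browser-like one. Function names are illustrative, not a standard tool.

# Browser-like User-Agent string copied from the note above.
BROWSER_UA="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36"

# Read `wget --server-response` output on stdin and print the code from the
# last HTTP status line (the last one is what counts if redirects are followed).
last_status() {
  awk '$1 ~ /^HTTP\// { code = $2 } END { if (code != "") print code }'
}

# Fetch URL $1, discard the body, and print the final status code.
# Any extra arguments are passed through to wget.
fetch_status() {
  local url="$1"; shift
  wget --server-response --output-document=/dev/null "$@" "$url" 2>&1 | last_status
}

# Run the comparison only when a URL is given on the command line.
if [ "$#" -gt 0 ]; then
  echo "default UA: $(fetch_status "$1")"
  sleep 1   # short pause between the two requests
  echo "browser UA: $(fetch_status "$1" --user-agent="$BROWSER_UA")"
fi
```

Saved under a hypothetical name such as `ua_check.sh`, it could be run as `bash ua_check.sh "https://wealth.hket.com/article/3908135"`. A `403` on the first line and a `200` on the second would point to User-Agent filtering, while two `403`s would suggest the other checks (headers, cookies, IP reputation) are involved.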