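# Basic Web Scraping with Python

[![](https://img.shields.io/badge/dynamic/json?color=orange&label=總觀看人數&query=%24.viewcount&url=https://hackmd.io/@AndyChiang/StaticCrawler%2Finfo)]()

> [name=AndyChiang][time=Fri, Feb 5, 2021 9:59 AM][color=#00CDAF]

###### tags: `Python` `爬蟲`

## Installing Beautiful Soup and requests

These two packages are the workhorses of web scraping. Both are easy to use and well suited to static websites (commonly scraped targets include PTT, Dcard, and Yahoo).

### Beautiful Soup

Beautiful Soup is used to parse the HTML you have downloaded. Install it with pip:

```
pip install beautifulsoup4
```

### Requests

Requests is used to download the HTML source. It is also installed with pip:

```
pip install requests
```

## Importing

These two modules are enough for most scrapers; a few need other modules (os, json, and so on).

```
import requests
from bs4 import BeautifulSoup
```

## Fetching and parsing a page

To fetch a page, call `requests.get()` with the URL you want to scrape. The example here is the C_Chat board on PTT.

```
response = requests.get("https://www.ptt.cc/bbs/C_Chat/index.html")
```

`response.text` holds the page as one raw HTML string, which is awkward to search directly, so pass it to BeautifulSoup to parse it into a tree of nodes. The parsed HTML is not nicely indented by default; `soup.prettify()` returns a neatly indented version.

```
response = requests.get("https://www.ptt.cc/bbs/C_Chat/index.html")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())
```

## Searching for nodes

Searching is a central part of scraping: the downloaded page is full of scattered information, and searching is how you locate the parts you want.

### find()

Searches for an HTML node that matches the conditions. The argument is the tag name, and the return value is the first matching node (or `None` if nothing matches).

```
result = soup.find("h2")
```

### find_all()

Searches for all HTML nodes that match the conditions and returns them as a list.

```
result = soup.find_all("h3")
```

Keyword arguments restrict the search to nodes with a given attribute value, and the `limit` argument caps the number of results.

```
result = soup.find_all("h3", itemprop="headline", limit=3)
```

The return value behaves like a list of matching nodes, so it can be indexed.

```
print(result[0]) # print the first matching node
```

To search for several HTML tags at once, pass the tag names as a list.

```
result = soup.find_all(["h3", "p"], limit=2)
```

To search by CSS class, pass it as a keyword argument. Because `class` is a reserved word in Python, the keyword is `class_`.

```
titles = soup.find_all("p", class_="summary", limit=3)
```

In this situation you can also use the `{ key : value }` form.

```
titles = soup.find_all("p", {"class" : "summary"}, limit=3)
```

The `string` argument searches for nodes with the given text content.

```
soup.find_all("p", string="Hello World")
```

### select_one()

`select_one()` and `select()` search with CSS selectors, much like jQuery. `select_one()` searches beneath a node and returns the first matching descendant.

```
result.select_one("a") # result is a single node, e.g. from soup.find("h2")
```

### select()

Searches beneath a node and returns a list of all matching descendants. The `limit` argument caps the number of results.

```
result.select("a", limit=3)
```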
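To see how the search methods above behave without depending on a live website, here is a minimal sketch that runs them against a small inline HTML string. The HTML, its class names, and the variable names are made up for illustration and are not taken from PTT or any other real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example only
html = """
<div class="post">
  <h2 class="summary"><a href="/post/1">First post</a></h2>
  <p class="summary">Hello World</p>
</div>
<div class="post">
  <h2 class="summary"><a href="/post/2">Second post</a></h2>
  <p class="summary">Another summary</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find(): the first <h2> in the document
first_heading = soup.find("h2")

# find_all(): every <p> whose class is "summary"
summaries = soup.find_all("p", class_="summary")

# find_all() with string=: only <p> nodes whose text is exactly "Hello World"
exact = soup.find_all("p", string="Hello World")

# select_one(): the first <a> beneath the node found above
first_link = first_heading.select_one("a")

# select() with a CSS selector, capped at one result
links = soup.select("h2.summary a", limit=1)

print(first_link.get("href"))
print(len(summaries), len(exact), len(links))
```

Working on an inline string like this is a convenient way to try out a selector before pointing the scraper at a real page.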
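In practice, you usually call find() to reach the tag you care about, then select() to search beneath it. To match on CSS classes or other attributes, use CSS selector syntax (similar to Emmet abbreviations). For example, to find the a nodes beneath an h2 with class=summary:

```
links = soup.select("h2.summary a")
```

### find_parent() and find_parents()

Search upward from the current node for ancestors that match the conditions. find_parent() returns a single node; find_parents() returns all matches. For example, to find the h2 ancestors of the a node with itemprop="url":

```
result = soup.find("a", itemprop="url")
parents = result.find_parents("h2")
```

### find_previous_sibling()

Searches backward among the nodes at the same level and returns the nearest preceding sibling that matches. For example, to find the a node that precedes the h2 with itemprop="headline":

```
result = soup.find("h2", itemprop="headline")
previous_node = result.find_previous_sibling("a")
```

### find_next_sibling()

Searches forward among the nodes at the same level and returns the nearest following sibling that matches. For example, to find the p node that follows the h2 with itemprop="headline":

```
result = soup.find("h2", itemprop="headline")
next_node = result.find_next_sibling("p")
```

## Extracting data

Once you have found the node you want, the next step is to pull the data out of it.

### get()

Reads an attribute of a node; the argument is the attribute name. For example, to read the href (the link URL) of an a node:

```
title.select_one("a").get("href")
# equivalent when the attribute exists...
title.select_one("a")["href"]
```

The two forms differ when the attribute is missing: `get()` returns `None`, while the bracket form raises a `KeyError`.

### getText()

Reads the text inside a node (`getText()` is an alias of `get_text()`). For example, to read the text of an a node:

```
title.select_one("a").getText()
# equivalent to...
title.select_one("a").text
# same result when the node contains a single piece of text...
title.select_one("a").string
```

Note that `.string` returns `None` when the node has more than one child, so `getText()` or `.text` is the safer choice for nodes with nested tags.

## Downloading images

Still downloading images one at a time? Let a scraper do it for you.

1. Find the node of the image, read the image's source URL (src), and download it with `requests.get()`. (Note that no image file exists on your computer yet at this point.)

```
link = result.get("src")
img = requests.get(link)
```

2. Open a new file with the open() function. Its arguments are the file name and the mode (images are written as binary data, so the mode is wb). This example saves only one image; when saving several, remember to give each one a different file name.

```
file = open("img1.jpg","wb")
```

3. Write the downloaded image content into the new file, then close the file (do not skip this), and you are done.

```
file.write(img.content)
file.close()
```

4. Recommended: save all images in one dedicated folder, otherwise the downloads will be mixed in with your own files.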
Before opening the file, check whether the folder exists, and create it if it does not.

```
import os

if not os.path.exists("images"):
    os.mkdir("images") # create the folder
```

**Complete code:**

```
import os
import requests

link = result.get("src") # read the image's src (result is the image node found earlier)
img = requests.get(link) # download the image
if not os.path.exists("images"):
    os.mkdir("images") # create the folder
file = open(os.path.join("images", "img1.jpg"), "wb") # open a new file
file.write(img.content) # write the image bytes
file.close() # close the file
```

## 210209 Addendum - The Requests package (advanced)

So far, we have only written something like this:

```
# import the modules
import requests
from bs4 import BeautifulSoup

# download a regular web page with GET
response = requests.get("https://www.ptt.cc/bbs/C_Chat/index.html")
soup = BeautifulSoup(response.text, "html.parser")
```

But Requests can do much more.

### Status code

The `status_code` attribute of the response gives the HTTP status code the server returned.

```
import requests

response = requests.get("https://www.ptt.cc/bbs/C_Chat/index.html")
print(response.status_code) # 200
```

A quick note on what status codes are. Some common ones:

* 200: Everything went fine and the result has been returned.
* 301: The server is redirecting the client to another address. This can happen when a site changes its domain name or its routes.
* 400: Bad request; the request is malformed.
* 401: The request did not pass the server's authentication. This happens when the request is sent without valid credentials.
* 403: The server understood the request but refuses to carry it out; any credentials sent with the request do not grant access.
* 404: The target was not found.

More status codes: [MDN - HTTP status codes](https://developer.mozilla.org/zh-TW/docs/Web/HTTP/Status)

### GET requests

#### Adding URL query parameters

Some URLs carry key-value pairs in a query string. The `params` argument builds that query string for you.

```
import requests

my_params = {
    "key1": "value1",
    "key2": "value2"
}
response = requests.get(
    "https://www.ptt.cc/bbs/C_Chat/index.html",
    params=my_params)
print(response.url) # https://www.ptt.cc/bbs/C_Chat/index.html?key1=value1&key2=value2
```

#### Adding headers

Use the `headers` argument; it is most often used to set fields such as user-agent or cookie.

```
# custom headers
my_headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
# add the custom headers to the GET request
response = requests.get(
    "https://www.ptt.cc/bbs/C_Chat/index.html",
    headers = my_headers)
```

#### Cookies

To see which cookies a page has set, index the response's `cookies` attribute with the cookie name.

```
# a response that sets cookies
response = requests.get("http://my.server.com/has/cookies")
# read a cookie
print(response.cookies['my_cookie_name'])
```

To send your own cookies with a GET request, pass them through the `cookies` argument.

```
# define the cookies
my_cookies = {"my_cookie_name": "content"}
# add the cookies to the GET request
r = requests.get("http://httpbin.org/cookies", cookies = my_cookies)
```

### POST requests

Everything so far has used GET, but POST is also fairly common. It can be used to send data or upload files.

#### Sending data

```
# the data
my_data = {'key1': 'value1', 'key2': 'value2'}
# add the data to the POST request
r = requests.post('http://httpbin.org/post', data = my_data)
```

#### Uploading files

```
# the file to upload
my_files = {'my_filename': open('my_file.docx', 'rb')}
# add the file to the POST request
r = requests.post('http://httpbin.org/post', files = my_files)
```

* Reference: [Python 使用 requests 模組產生 HTTP 請求，下載網頁資料教學](https://blog.gtwang.org/programming/python-requests-module-tutorial/)

## Related articles

* [Python 動態網頁爬蟲](/Cp1938RtSZ6yu7DHfJvUDQ) (dynamic web scraping with Python)

## References

* [開發Python網頁爬蟲前需要知道的五個基本觀念](https://www.learncodewithmike.com/2020/10/python-web-scraping.html)
* [7個Python使用BeautifulSoup開發網頁爬蟲的實用技巧](https://www.learncodewithmike.com/2020/02/python-beautifulsoup-web-scraper.html) (the one I mainly followed)
* [有效利用Python網頁爬蟲幫你自動化下載圖片](https://www.learncodewithmike.com/2020/09/download-images-using-python.html)
* [Python 使用 Beautiful Soup 抓取與解析網頁資料，開發網路爬蟲教學](https://blog.gtwang.org/programming/python-beautiful-soup-module-scrape-web-pages-tutorial/)
* [IT鐵人賽-Python爬蟲小人生](https://ithelp.ithome.com.tw/articles/10202121)
