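# Python Crawler - PTT

:::info
- Class notes: 240815, instructor 維晏
- Goal: crawl the content of articles on a PTT board
- Some of these notes come from ChatGPT
:::

## 1. Install the packages

```python=
!pip install beautifulsoup4
!pip install requests
!pip install pandas
```

- Python and its packages change quickly, so check the official documentation to confirm current usage
    - [Beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
    - [requests](https://requests.readthedocs.io/en/latest/)
    - [pandas](https://pandas.pydata.org/docs/)

## 2. Parameter settings

### 2-1. Target site

```markdown
Target site: https://www.ptt.cc/bbs/marvel/index.html
Previous page: https://www.ptt.cc/bbs/marvel/index2726.html
Page before that: https://www.ptt.cc/bbs/marvel/index2725.html
```

```python=
# Base URL
target_url = 'https://www.ptt.cc/bbs/'
# Target board
target_board = 'marvel/'
# Target page
target_page = 'index'
# Page number (empty string = latest page)
page_num = ""
# Page file extension
page_ext = ".html"

target = target_url + target_board + target_page + page_num + page_ext
```

### 2-2. Prepare the headers for requests

```python=
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "cookie": "_gid=GA1.2.1894547549.1723702817; _gat=1; _ga_DZ6Y3BY9GW=GS1.1.1723702816.2.1.1723703948.0.0.0; _ga=GA1.1.634190837.1721617370",
}
```

- How to look up your user-agent and cookie
    - The user-agent string includes the browser version, the operating system, the tools in use, and so on

    ![image](https://hackmd.io/_uploads/By05r8rpC.png =600x)
- The more complete the request headers, the less likely the server is to block the crawler
    - Servers usually block requests with empty headers first
    - Filling in both the user-agent and the cookie is recommended

## 3. Download the target page's source with requests

```python=
import requests

data = requests.get(target, headers=headers)  # the result is stored in a Response object
```

- requests.get()
    `requests.get()` is a function in Python's `requests` module. It sends an HTTP GET request to the given URL and returns the response. It is mainly used to fetch data from a web server, typically for crawling or for calling an API.

- :::spoiler Common parameters
    1. **`url` (required)**
        - The target URL, that is, the site or API resource you want to fetch.
        ```python=
        requests.get('https://example.com')
        ```
    2. **`headers` (optional)**
        - Sets the HTTP headers, for example to imitate a browser request or to authenticate with an API key.
        - One of the most common headers is `User-Agent`, which tells the server what kind of browser or client sent the request.
        ```python=
        headers = {'User-Agent': 'Mozilla/5.0'}
        requests.get(url, headers=headers)
        ```
    3. **`params` (optional)**
        - Passes URL query parameters, the part of the URL after the question mark `?`.
        ```python=
        params = {'search': 'python', 'page': 2}
        requests.get(url, params=params)
        ```
    4. **`timeout` (optional)**
        - Sets the request timeout in seconds. If the server takes too long to respond, a timeout error is raised.
        ```python=
        requests.get(url, timeout=5)
        ```
    :::

- :::spoiler Response attributes
    `requests.get()` returns a `Response` object. Commonly used attributes include:
    1. **`status_code`**: the HTTP status code (for example, 200 means success and 404 means the page was not found).
        ```python=
        response = requests.get(target)
        print(response.status_code)  # 200
        ```
    2. **`content`**: the response body as raw bytes, suitable for images or file downloads.
        ```python=
        print(response.content)
        ```
    3. **`text`**: the response body as text, suitable for HTML or JSON strings.
        ```python=
        print(response.text)
        ```
    4. **`json()`**: parses the response body as JSON, suitable for API responses.
        ```python=
        json_data = response.json()
        ```
    :::

## 4. Parse the page and read HTML tags with BeautifulSoup

### 4-1. Parse with BeautifulSoup

```python=
from bs4 import BeautifulSoup

# Parse the HTML in `data.content` into a BeautifulSoup object,
# which can then be searched by tag like an HTML document.
# Naming the parser explicitly avoids the "no parser was specified" warning.
html_code = BeautifulSoup(data.content, 'html.parser')
```

- BeautifulSoup
    `BeautifulSoup` is the main class of the `bs4` (Beautiful Soup 4) library, which parses HTML or XML documents and extracts data from them. After you fetch a page's HTML, `BeautifulSoup` turns it into an object structure that is easy to work with, so you can search for and extract the elements you need.
    - Main features
        1. Parse HTML/XML documents
           `BeautifulSoup()` turns an HTML string into a ==BeautifulSoup object==. The object represents the structure of the whole document, and each element can be accessed in a DOM-like way.
        2. Extract tags, text, and attributes
            - `.find()` and `.find_all()` look up specific HTML tags or elements by tag name, class, id, or other attributes.
            ```python=
            title = html_code.find('title')  # find the <title> tag
            links = html_code.find_all('a')
            for link in links:
                print(link.get('href'))  # print each <a> tag's href (None if it has no href)
            ```
        3. Modify the HTML structure (for example, insert or delete elements)
- html_code.prettify()
    - Shows the source with HTML-style indentation
    - It is only a preview: long output may still be truncated in the notebook, so use it to confirm that the download and parsing worked
- Ways to look up tags on a BeautifulSoup object
    1. Find the first tag with the given name (for example, the first `<div>`)
        ```python=
        html_code.find('tag_name')
        ```
    2. Find the tag with the given id
        ```python=
        html_code.find('tag_name', id='id_name')
        ```
    3. Find the first tag on the page with the given class
        ```python=
        html_code.find('tag_name', class_='class_name')
        ```
    4. Find all tags on the page with the given class
        ```python=
        html_code.find_all('tag_name', class_='class_name')
        ```

### 4-2. Read HTML tag data

```python=
html_code.find('div', class_='title')
div_list = html_code.find_all('div', class_='title')
a_tag = div_list[0].find('a')
a_url = a_tag.attrs['href']
a_title = a_tag.contents[0]
```

- What the code does
    1. Find the first `<div>` tag whose class is `title`
    2. Find all `<div>` tags whose class is `title` and store the result in `div_list`
    3. Find the `<a>` tag inside the first item of `div_list` and store it in `a_tag`
    4. Read the `href` attribute of `a_tag` and store it in `a_url`
    5. Read the first child of the `<a>` tag (the title text) and store it in `a_title`
- How to read a tag's data
    - First use `.find()` to locate the tag
    - Then use `.attrs[]` to read an attribute
    - Finally use `.contents[]` to read the tag's children (here, the link text)
- Wrapping this whole step in `try-except` is recommended

```python=
urls = []
for div_ in div_list:
    try:
        a_tag = div_.find('a')
        url_ = {
            'url': a_tag.attrs['href'],
            'title': a_tag.contents[0]
        }
    except (AttributeError, IndexError):
        # No usable <a> tag inside the <div>: the article has been deleted
        url_ = {'url': None, 'title': '(article deleted)'}
    urls.append(url_)
```

- The loop collects the url of every article on the board page and stores it in `urls`
- If a `<div>` has no `<a>` tag (the article was deleted), `div_.find('a')` returns `None` and the `except` block runs

## 5. Loop over the article links and read each article

### 5-1. Test a single url first (optional)

```python=
# urls[14] is just a sample; pick an entry whose 'url' is not None
article_url = 'https://www.ptt.cc' + urls[14]['url']
# Download the page with requests
page_data = requests.get(article_url, headers=headers)
# Parse the response
page_html_code = BeautifulSoup(page_data.content, 'html.parser')
# Find the article meta info (index 1 is the board name, which is skipped)
article_data = page_html_code.find_all('span', class_='article-meta-value')
article_author = article_data[0].contents[0]
article_title = article_data[2].contents[0]
article_time = article_data[3].contents[0]
# Article body: the fifth child of main-content, right after the four meta lines
article_body = page_html_code.find('div', id='main-content').contents[4]

article_row = {
    'author': article_author,
    'title': article_title,
    'time': article_time,
    'body': article_body
}
print(article_row)
```

### 5-2. For Loop

```python=
import time

ptt_data = []
for url_ in urls:
    try:
        article_url = 'https://www.ptt.cc' + url_['url']
        page_data = requests.get(article_url, headers=headers)
        page_html_code = BeautifulSoup(page_data.content, 'html.parser')
        article_data = page_html_code.find_all('span', class_='article-meta-value')
        article_author = article_data[0].contents[0]
        article_title = article_data[2].contents[0]
        article_time = article_data[3].contents[0]
        article_body = page_html_code.find('div', id='main-content').contents[4]
        article_row = {
            'url': article_url,
            'title': article_title,
            'author': article_author,
            'time': article_time,
            'content': article_body
        }
        print(article_row)
        ptt_data.append(article_row)
    except Exception as err:
        # Deleted articles (url is None) and pages with an unexpected layout end up here
        print('error', err)
    time.sleep(1)  # wait 1 second before the next article
```

- The content scraped from each article (`article_row`) is appended to `ptt_data`
- `time.sleep(1)` makes the program wait 1 second after each article before it fetches the next one

## 6. Save all the articles

### 6-1. Convert to a DataFrame, then to CSV

```python=
import pandas

ptt_df = pandas.DataFrame(ptt_data)
ptt_df.to_csv('ptt.csv')  # save as a CSV file; without a path, to_csv() only returns the CSV text
```

- The data is converted to a DataFrame because pandas DataFrames are designed for structured, table-like data. A DataFrame also provides `to_csv()`, so writing the data to a CSV file takes a single call.
- With a relative file name such as `ptt.csv`, the file is saved in Colab's working directory (/content) and disappears when the runtime disconnects. Use the method below to save it to Google Drive instead.

### 6-2. Save to Google Drive

```python=
from google.colab import drive

drive.mount('/content/gdrive')  # mount Google Drive in Colab
ptt_df.to_csv('/content/gdrive/My Drive/AI/爬蟲/ptt.csv')  # change this to your own path
```

- With the mount point above, the root of your Google Drive is always `/content/gdrive/My Drive`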
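
One thing that can go wrong in step 6-2 is that `to_csv()` raises an error when the target folder (for example `AI/爬蟲`) does not exist yet in Google Drive. A minimal sketch of one way to guard against that is shown below. The helper name `prepare_csv_path` and its default file name are assumptions made for illustration; they are not part of the class notes or of pandas.

```python
from pathlib import Path


def prepare_csv_path(folder, filename="ptt.csv"):
    """Create `folder` (and any missing parents) and return the CSV path inside it.

    Hypothetical helper for illustration only; not part of pandas or Colab.
    """
    folder_path = Path(folder)
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe
    folder_path.mkdir(parents=True, exist_ok=True)
    return folder_path / filename


# Assumed usage in Colab, after drive.mount('/content/gdrive'):
# csv_path = prepare_csv_path('/content/gdrive/My Drive/AI/爬蟲')
# ptt_df.to_csv(csv_path, index=False, encoding='utf-8-sig')
```

The two extra `to_csv()` arguments in the usage comment are optional: `index=False` leaves out the row-number column, and `encoding='utf-8-sig'` writes a byte order mark, which usually helps Excel display Chinese text correctly.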