PTT Web Scraping: A Beginner's Notes

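For a personal research project I needed to collect data from the web, so I taught myself to write a scraper in Python 3 using Jupyter Notebook. I have some modest results to show for it now, thanks to everyone who gave me pointers along the way, the experts online, and ChatGPT. Below are my amateur notes for anyone who might find them useful. The complete code is at the end of the notes, and corrections from more experienced people are very welcome!

❗ These notes focus on parsing, one by one, a list of URLs that has **already been scraped and cleaned up**.

---

# Importing modules

Before scraping, we need to import a few modules that help us grab the data we want. Here is a brief introduction to what each one does. Before importing them, make sure they are actually installed in your Python environment!

1. Pandas: builds a dataset (a DataFrame) to store the scraped data.
2. re / time: re helps us pull the information we need out of text containing a large number of characters. / time provides time-related functions such as timing, pausing, and date and time handling.
3. Selenium: uses WebDriver to simulate a user's actions in the browser, controlling the browser and retrieving page content so the work can be automated.
4. Beautiful Soup: parses the HTML and XML of a web page.
5. By: helps locate elements on the page.

Code 👇

```
import re
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
```

---

# Setting cookies

I scraped PTT's Gossiping board, which only lets users aged 18 or over browse it, so an over-18 cookie has to be set. For other boards you can skip this step.

Code 👇

```
# Open the browser
driver = webdriver.Chrome()

# Go to the index page of the Gossiping board
driver.get("https://www.ptt.cc/bbs/Gossiping/index.html")

# Find the "I am over 18" button and click it
enter_button = driver.find_element(By.XPATH, "/html/body/div[2]/form/div[1]/button")
enter_button.click()
time.sleep(2)

# Set the over-18 cookie. Selenium expects the keys 'name' and 'value';
# on PTT the cookie is named 'over18' and its value is '1'.
over_18_cookie = {'name': 'over18', 'value': '1'}
driver.add_cookie(over_18_cookie)

# Refresh the page so the cookie takes effect
driver.refresh()
```

This step is not strictly required, but without the cookie the program has to click the over-18 button every time it visits a link later on, which takes extra time~

# Visiting links and scraping article content

The code performs the following steps:

1. Open the target file and visit the links in it one by one
2. Loop over each link and parse the page with BeautifulSoup
3. Grab the basic post information (author, board, title, date), the article content, and the IP address
4. Append the scraped data to a list, using try/except to skip URLs that cannot be read

Code 👇

```
# Create a list to store the data
data_list = []

# Read the links line by line
with open('url.txt', 'r') as file:  # replace url.txt with your own file
    for url in file:
        try:
            url = url.strip()
            driver.get(url)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            main_tag = soup.find('div', id='main-content')

            # 1 Grab the author, board, title and date
            meta_value_tags = main_tag.find_all('span', class_='article-meta-value')

            # 2 Name each piece of basic information
            author = meta_value_tags[0].text
            board = meta_value_tags[1].text
            title = meta_value_tags[2].text
            date = meta_value_tags[3].text

            # 3 Grab the plain-text content (everything before the first '--')
            content_text = main_tag.get_text(separator='\n').split('--')[0].strip()

            # 4 Use a regular expression to find the IP address
            # ('來自' is the literal "from" label in PTT's footer line)
            ip_location_match = re.search(r'來自: (.+)', str(soup))
            # Fall back to an empty string when the page has no such line
            ip_location = ip_location_match.group(1) if ip_location_match else ''

            # Append the data to the list
            data_list.append({
                'author': author,
                'board': board,
                'title': title,
                'date': date,
                'content': content_text,
                'ip_location': ip_location
            })
        except Exception as e:
            # Skip URLs that could not be processed
            print(f"Could not read URL {url}; error: {e}")
```

📝 **Using a regular expression to find the IP address**: I spent quite a while on this when I first tried to grab the IP, because sadly it does not come along with the rest of the article (the content is cut at the first `--`, and the IP line comes after it), so let me explain this part a little.

![ip](https://hackmd.io/_uploads/HJ0RsChVp.png)

What we mainly need is the string of digits after `來自:` ("from"). So we first find `來自:` in each article, then take what follows it to get the information we need. Note that `(.+)` captures everything up to the end of that line, so anything that follows the digits on the same line is captured as well. (I grab the IP in order to convert it into latitude and longitude data.)

# Saving the data

Write the scraped article information and content into an Excel file and you are done! 👏 (Writing `.xlsx` files with `to_excel` requires the `openpyxl` package to be installed.)

```
# Build a DataFrame from the list
df = pd.DataFrame(data_list)

# Write the DataFrame to an Excel file
df.to_excel('output.xlsx', index=False)
```

# 📌 Complete code

```
import re
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open the browser
driver = webdriver.Chrome()

# Go to the index page of the Gossiping board
driver.get("https://www.ptt.cc/bbs/Gossiping/index.html")

# Find the "I am over 18" button and click it
enter_button = driver.find_element(By.XPATH, "/html/body/div[2]/form/div[1]/button")
enter_button.click()
time.sleep(2)

# Set the over-18 cookie. Selenium expects the keys 'name' and 'value';
# on PTT the cookie is named 'over18' and its value is '1'.
over_18_cookie = {'name': 'over18', 'value': '1'}
driver.add_cookie(over_18_cookie)

# Refresh the page so the cookie takes effect
driver.refresh()

# Create a list to store the data
data_list = []

# Read the links line by line
with open('url.txt', 'r') as file:  # replace url.txt with your own file
    for url in file:
        try:
            url = url.strip()
            driver.get(url)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            main_tag = soup.find('div', id='main-content')

            # 1 Grab the author, board, title and date
            meta_value_tags = main_tag.find_all('span', class_='article-meta-value')

            # 2 Name each piece of basic information
            author = meta_value_tags[0].text
            board = meta_value_tags[1].text
            title = meta_value_tags[2].text
            date = meta_value_tags[3].text

            # 3 Grab the plain-text content (everything before the first '--')
            content_text = main_tag.get_text(separator='\n').split('--')[0].strip()

            # 4 Use a regular expression to find the IP address
            # ('來自' is the literal "from" label in PTT's footer line)
            ip_location_match = re.search(r'來自: (.+)', str(soup))
            # Fall back to an empty string when the page has no such line
            ip_location = ip_location_match.group(1) if ip_location_match else ''

            # Append the data to the list
            data_list.append({
                'author': author,
                'board': board,
                'title': title,
                'date': date,
                'content': content_text,
                'ip_location': ip_location
            })
        except Exception as e:
            # Skip URLs that could not be processed
            print(f"Could not read URL {url}; error: {e}")

# Build a DataFrame from the list
df = pd.DataFrame(data_list)

# Write the DataFrame to an Excel file
df.to_excel('output.xlsx', index=False)

# Close the browser
driver.quit()
```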
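
Going back to the IP step: if only the digits are wanted (for example, before converting them to latitude and longitude), a stricter pattern may help. The following is a minimal sketch, not part of the scraper above. It assumes the footer line contains `來自:` followed by an IPv4 address, optionally followed by other text. The helper name `extract_ip`, the constant `IP_PATTERN`, and the sample lines are all hypothetical and made up for illustration.

```python
import re

# Hypothetical helper, not part of the original notes.
# Matches the '來自:' label followed by a dotted IPv4-looking string.
IP_PATTERN = re.compile(r'來自:\s*(\d{1,3}(?:\.\d{1,3}){3})')


def extract_ip(text):
    """Return the first IPv4-looking string after '來自:', or '' if none is found."""
    match = IP_PATTERN.search(text)
    return match.group(1) if match else ''


# Made-up sample lines (the IP is from a documentation-only address range)
sample_with_location = '※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 203.0.113.7 (臺灣)'
sample_without_ip = '※ 文章網址: https://www.ptt.cc/bbs/Gossiping/index.html'

print(extract_ip(sample_with_location))
print(repr(extract_ip(sample_without_ip)))
```

In the loop, something like `ip_location = extract_ip(str(soup))` could stand in for the `re.search` lines, assuming the footer really has this shape on the pages you scrape. The pattern only checks the shape of the string (four groups of one to three digits), not whether each number is a valid IP octet.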