Python Web Scraping Notes: Selenium Library and a HackMD Article-Scraping Project - Part 4
===

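Table of Contents:

[TOC]

---

Thanks for reading! I'm LukeTseng, an independent creator who loves computing. My university recently opened a course on programming for big data analysis, and its coverage of web scraping sparked my interest, so I put together this series of notes.

Disclaimer: these notes are for personal study only; use them as a reference at your own discretion.

These notes use Jupyter Notebook inside an Anaconda virtual environment. Anaconda can be downloaded from https://www.anaconda.com/download

## Installing the Selenium module

Install it with the following command:

```
pip install selenium
```

In Jupyter Notebook (remember to restart the kernel after installing so the module becomes available):

```
!pip install selenium
```

## What is Selenium?

Selenium is an open-source browser automation tool. It lets code simulate what a user does in a browser, which makes it useful for automated testing and web scraping.

Selenium is one of the most important tools for scraping dynamic pages.

Selenium uses WebDriver to drive and control the browser, and each browser has its own driver: Chrome uses ChromeDriver, Firefox uses GeckoDriver, Safari uses SafariDriver, and so on.

In short, WebDriver is the part of Selenium you will use most, because it can simulate many user actions such as clicking elements, filling in forms, and scrolling the page.

These notes use the Chrome browser. ChromeDriver can be downloaded from https://sites.google.com/chromium.org/driver/downloads

Make sure to download the version that matches the Chrome you are currently using. The screenshots below show how to check your Chrome version.

![image](https://hackmd.io/_uploads/Byqc2ZRTlg.png)

![image](https://hackmd.io/_uploads/H1t-TZCplx.png)

You can also skip the manual download, because another module can download the driver automatically when the program runs. It is covered in the next section.

## Installing WebDriver automatically (the lazy way)

First, install webdriver-manager:

```
pip install webdriver-manager
```

Then use it in your program:

```python=
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Automatically download and install the matching ChromeDriver version
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
```

## Your first Selenium program

The small example below uses Selenium to open a browser and navigate to a given site.

```python=
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Create the WebDriver object (downloads the matching ChromeDriver automatically)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Open the Google home page
driver.get('https://www.google.com')

# Pause for 5 seconds
time.sleep(5)

# Close the browser
driver.quit()
```

The browser opens as soon as the WebDriver object is created. While the program runs, Chrome shows a banner saying it is being controlled by automated test software. This is normal and means the program is working. Because of the 5-second pause, the browser closes after 5 seconds.

## Locating elements

You can locate page elements with the `.find_element()`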
method. Its first argument has the form `By.XXX`, where `XXX` is the locator strategy, as the following example shows:

```python=
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.google.com')

# 1. Locate by ID (fastest and most precise)
search_box = driver.find_element(By.ID, "APjFqb")

# 2. Locate by the NAME attribute
search_box = driver.find_element(By.NAME, "q")

# 3. Locate by CLASS_NAME
element = driver.find_element(By.CLASS_NAME, "gLFyf")

# 4. Locate by CSS selector (very flexible)
search_box = driver.find_element(By.CSS_SELECTOR, "textarea[name='q']")

# 5. Locate by XPath (most powerful)
search_box = driver.find_element(By.XPATH, "//textarea[@name='q']")
```

An XPath is easy to obtain: right-click the element in the browser's developer tools and choose Copy -> Copy XPath.

![image](https://hackmd.io/_uploads/BJuMZf0Tgx.png)

I also recommend using XPath whenever you need to locate elements.

## Common operations

After locating an element, the next step is to interact with it, for example by clicking or typing:

- `.send_keys()`: type text
- `.submit()`: submit a form (similar to pressing Enter)
- `.click()`: click with the left mouse button

For the full list of methods, see the official Selenium site: https://www.selenium.dev/documentation/webdriver/elements/interactions/

The following page can be used to practice and test Selenium: https://example.oxxostudio.tw/python/selenium/demo.html

Here is a small example:

```python=
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://example.oxxostudio.tw/python/selenium/demo.html')
time.sleep(5)  # Wait for the page's JavaScript to finish

# Type into the text box
text_box = driver.find_element(By.XPATH, '//*[@id="show"]')
text_box.send_keys("Hello World!")

# Click the second button on the page
B = driver.find_element(By.XPATH, '/html/body/button[2]')
B.click()

# Click the element whose id is "add"
add_number = driver.find_element(By.XPATH, '//*[@id="add"]')
add_number.click()

time.sleep(10)
driver.quit()
```

## Getting element content

| Method | Description | Example |
| :-- | :-- | :-- |
| `.text` | The element's plain text content | `element.text` |
| `.get_attribute('attribute_name')` | The value of the specified HTML attribute | `element.get_attribute('href')` |
| `.id` | The element's id (Selenium's internal element ID, not the HTML `id` attribute) | `element.id` |
| `.tag_name` | The element's tag name | `element.tag_name` |
| `.size` | The element's size (width and height) | `element.size` |
| `.is_displayed()` | Whether the element is displayed on the page | `element.is_displayed()` |
| `.is_enabled()` | Whether the element is enabled | `element.is_enabled()` |
| `.is_selected()` | Whether the element is selected (e.g. a checkbox) | `element.is_selected()` |
| `.parent` | The WebDriver instance the element was found from (not the parent DOM element) | `element.parent` |
| `.screenshot('filename.png')` | Saves a screenshot of the element as an image | `element.screenshot('test.png')` |

From [Selenium 函式庫 - Python 網路爬蟲教學 | STEAM 教育學習網](https://steam.oxxostudio.tw/category/python/spider/selenium.html)

The following example runs against the test page (adapted from [Selenium 函式庫 - Python 網路爬蟲教學 | STEAM 教育學習網](https://steam.oxxostudio.tw/category/python/spider/selenium.html)):

```python=
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Start the browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://example.oxxostudio.tw/python/selenium/demo.html')

# Element whose id is "a"
a = driver.find_element(By.ID, 'a')
# Element whose class is "btn"
b = driver.find_element(By.CLASS_NAME, 'btn')
# Element whose class is "test"
c = driver.find_element(By.CSS_SELECTOR, '.test')
# Element whose name is "dog"
d = driver.find_element(By.NAME, 'dog')
# Element whose tag is h1
h1 = driver.find_element(By.TAG_NAME, 'h1')
# Link with exactly this text (the string must match the page, so it stays in Chinese)
link1 = driver.find_element(By.LINK_TEXT, '我是超連結，點擊會開啟 Google 網站')
# Link whose text contains "Google"
link2 = driver.find_element(By.PARTIAL_LINK_TEXT, 'Google')

# Read element content
print("a.id:", a.id)                                    # Selenium's internal element id
print("b.text:", b.text)                                # element text
print("c.tag_name:", c.tag_name)                        # tag name
print("d.size:", d.size)                                # element size
print("link1.href:", link1.get_attribute('href'))       # attribute value
print("link2.target:", link2.get_attribute('target'))   # attribute value
print("h1.is_displayed:", h1.is_displayed())            # displayed or not
print("d.is_enabled:", d.is_enabled())                  # enabled or not
print("a.is_selected:", a.is_selected())                # selected or not

# Take a screenshot of the body element
body = driver.find_element(By.TAG_NAME, 'body')
body.screenshot('./test.png')

driver.quit()
```

The output is shown below:

![image](https://hackmd.io/_uploads/rkseDfCalx.png)

A screenshot file, test.png, is also created in the working directory.

![image](https://hackmd.io/_uploads/ryJGDzCalx.png)

## Project: scraping all article information from a HackMD author page

The following scrapes my own HackMD page. The first step is to scrape every article title on the first page.

```python=
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time

# Open the browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://hackmd.io/@LukeTseng')

# Wait for the page's JavaScript to finish
time.sleep(3)

# find_elements() (note the trailing "s") returns a list of all matching elements
titles = driver.find_elements(By.CSS_SELECTOR, 'span.line-clamp-1.flex-1.text-lg.font-semibold')

for title in titles:
    print(title.text)

driver.quit()
```

Output:

![image](https://hackmd.io/_uploads/SJX-IRJRgx.png)

What if you want to scrape every page? HackMD has no "next page" button, so switching pages has to be implemented by hand.

![image](https://hackmd.io/_uploads/rklUwAy0lg.png)

Open Inspect, as in the screenshot below, to analyze how the pagination is structured. In my testing, every page button is built from the same kind of element. I also looked at each button's XPath and found a regular pattern at the end:

- `//*[@id="hackmd-app"]/section/main/div/div[2]/div/div/div[2]/div/ul/li[1]/a`
- `//*[@id="hackmd-app"]/section/main/div/div[2]/div/div/div[2]/div/ul/li[2]/a`
- `//*[@id="hackmd-app"]/section/main/div/div[2]/div/div/div[2]/div/ul/li[3]/a`

These are the elements for pages 1, 2, and 3. Nothing changes except the final `/li[1]`, where the number in the brackets varies, so string formatting can be used to generate each XPath.

![image](https://hackmd.io/_uploads/rJMPD0k0lx.png)

Here is the code:

```python=
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time

# Open the browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://hackmd.io/@LukeTseng')

# Scrape pages 1 through 14
for page_num in range(1, 15):
    print(f"Scraping page {page_num}...")

    # Wait for the title elements to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'span.line-clamp-1.flex-1.text-lg.font-semibold'))
    )

    # Scrape every title on the current page
    titles = driver.find_elements(By.CSS_SELECTOR, 'span.line-clamp-1.flex-1.text-lg.font-semibold')

    for title in titles:
        print(f" - {title.text}")

    # If this is not the last page, click through to the next one
    if page_num < 14:
        try:
            # Find the button for the next page (page number = current page + 1)
            next_page_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, f'//*[@id="hackmd-app"]/section/main/div/div[2]/div/div/div[2]/div/ul/li[{page_num + 1}]/a'))
            )
            next_page_button.click()
            time.sleep(2)  # Wait for the page to load
        except Exception as e:
            print(f"Could not click page {page_num + 1}: {e}")
            break

print("All pages scraped!")
driver.quit()
```

Output:

![image](https://hackmd.io/_uploads/Bkqo_AJRgl.png)

The snippet below uses Selenium's explicit wait mechanism, which pauses the program until a given condition is met. Here it waits for the title elements to load before the program continues.

The second argument of `WebDriverWait`, 10, is the maximum number of seconds to wait.

`until` keeps checking a condition until it holds or the timeout expires.

`EC.presence_of_all_elements_located(...)` is the condition being waited on: it succeeds once at least one element matching the locator is present in the page's DOM, and it returns all matching elements.

```python=
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'span.line-clamp-1.flex-1.text-lg.font-semibold'))
)
```

With the titles scraped, the next step is to scrape each article's date and view count, and even compute the total views across all articles:

```python=
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time

# Open the browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://hackmd.io/@LukeTseng')

all_views = 0

# Scrape pages 1 through 14
for page_num in range(1, 15):
    print(f"Scraping page {page_num}...")

    # Wait for the title elements to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'span.line-clamp-1.flex-1.text-lg.font-semibold'))
    )

    # Wait for the date elements to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, 'div.text-text-subtle > span:first-child')
        )
    )

    # Wait for the view-count elements to load.
    # How the XPath below works:
    #   //a                            matches every <a> element
    #   contains(@class, 'ph-eye')     checks whether the class attribute contains 'ph-eye'
    #   i[contains(@class, 'ph-eye')]  requires the <a> to have an <i> child whose
    #                                  class attribute contains 'ph-eye'
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located(
            (By.XPATH, "//a[i[contains(@class, 'ph-eye')]]")
        )
    )

    # Scrape every title on the current page
    titles = driver.find_elements(By.CSS_SELECTOR, 'span.line-clamp-1.flex-1.text-lg.font-semibold')

    # Scrape every date on the current page
    dates = driver.find_elements(By.CSS_SELECTOR, 'div.text-text-subtle > span:first-child')

    # Scrape every view count on the current page
    views = driver.find_elements(By.XPATH, "//a[i[contains(@class, 'ph-eye')]]")

    for title, date, view in zip(titles, dates, views):
        print(f" - {title.text}\n   Date: {date.text}\n   Views: {view.text}\n")
        # Remove the thousands separators so the text can be converted to an integer
        all_views += int(view.text.replace(',', ''))

    # If this is not the last page, click through to the next one
    if page_num < 14:
        try:
            # Find the button for the next page (page number = current page + 1)
            next_page_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, f'//*[@id="hackmd-app"]/section/main/div/div[2]/div/div/div[2]/div/ul/li[{page_num + 1}]/a'))
            )
            next_page_button.click()
            time.sleep(2)  # Wait for the page to load
        except Exception as e:
            print(f"Could not click page {page_num + 1}: {e}")
            break

print(f"Total views across all articles: {all_views}")
print("All pages scraped!")
driver.quit()
```

Output:

![image](https://hackmd.io/_uploads/Hk_kkJl0lx.png)

## References

- [Selenium 函式庫 - Python 網路爬蟲教學 | STEAM 教育學習網](https://steam.oxxostudio.tw/category/python/spider/selenium.html)
- [瀏覽器自動化工具Selenium介紹 – CH.Tseng](https://chtseng.wordpress.com/2023/07/14/%E7%80%8F%E8%A6%BD%E5%99%A8%E8%87%AA%E5%8B%95%E5%8C%96%E5%B7%A5%E5%85%B7selenium%E4%BB%8B%E7%B4%B9/)
- [[DAY7]Selenium簡介 | iT 邦幫忙::一起幫忙解決難題，拯救 IT 人的一天](https://ithelp.ithome.com.tw/m/articles/10319363)
- [認識 Selenium · GitBook](https://alincode.github.io/learngeb/intro/selenium.html)
- [Day 15: Selenium －基本概念和操作 - iT 邦幫忙::一起幫忙解決難題，拯救 IT 人的一天](https://ithelp.ithome.com.tw/articles/10320979)
- [python一招完美搞定Chromedriver的自动更新 - NewJune - 博客园](https://www.cnblogs.com/new-june/p/16698204.html)
- [Python Selenium 教學筆記 - HackMD](https://hackmd.io/@FortesHuang/S1V6jrvet)
- [動態網頁爬蟲第一道鎖 - Selenium教學：如何使用Webdriver、send_keys(附Python 程式碼) - 臺灣行銷研究](https://tmrmds.co/uncategorized/16513/)
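
A supplementary note on the total-view calculation in the project above: `int(view.text.replace(',', ''))` raises `ValueError` if an element's text is empty or not purely numeric. The following is a minimal sketch of a more tolerant conversion. The helper name `parse_view_count` and the choice to count unparsable text as 0 are my own assumptions for illustration; they are not part of Selenium or HackMD.

```python
def parse_view_count(text: str) -> int:
    """Convert a view-count string such as '1,234' to an int.

    Hypothetical helper: returns 0 when the text is empty or not a plain number.
    """
    # Drop surrounding whitespace and thousands separators
    cleaned = text.strip().replace(',', '')
    # Accept only plain ASCII digits; anything else is counted as 0
    if cleaned.isascii() and cleaned.isdigit():
        return int(cleaned)
    return 0


# Sample strings standing in for the view.text values read from the page
sample_texts = ['1,234', ' 56 ', '', 'N/A']
total_views = sum(parse_view_count(t) for t in sample_texts)
print(total_views)
```

In the scraping loop, `all_views += parse_view_count(view.text)` could take the place of the direct `int(...)` call, assuming that skipping unparsable counts is acceptable for your purposes.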