
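# Web crawler
###### tags: `Introduction to Python Applications`

## Introduction
Selenium is a package that automates browser actions, including scrolling and loading content that JavaScript generates dynamically. Combined with the BeautifulSoup package, it can be used to crawl data from the web. This exercise targets the university website, so that students can use a crawler to fetch the latest campus news.

### Note: for any crawler, consider the copyright and related rights of the site being crawled. This exercise uses the [university's test site](https://chtcdn.test.nycu.edu.tw/), with the consent of the people responsible for it.

## Installation
### Installing Selenium
```
pip3 install selenium
```
Then download the [WebDriver](https://selenium-python.readthedocs.io/api.html?highlight=selenium.webdriver) that matches your browser:
* [Chrome](https://chromedriver.chromium.org/downloads)
* [Firefox](https://github.com/mozilla/geckodriver)
* [Edge](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)
* [Safari](https://webkit.org/blog/6900/webdriver-support-in-safari-10/)

Check your browser version and download the matching driver.
![](https://i.imgur.com/UZG0fuq.png)

### Installing Beautiful Soup
```
pip install beautifulsoup4
```

## Hello World
First, let's test opening a website with Selenium. The examples use the Selenium 4 API, where the driver path is passed through a `Service` object.
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# './chromedriver' is the WebDriver location.
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get("https://chtcdn.test.nycu.edu.tw/")
```
Running the code above opens the [test home page](https://chtcdn.test.nycu.edu.tw) in an automated browser window:
![](https://i.imgur.com/1F9Wpky.jpg)

macOS users may run into a security prompt and need to go to `Settings -> Security & Privacy` to explicitly allow chromedriver (or another [webdriver](https://selenium-python.readthedocs.io/api.html?highlight=selenium.webdriver)).
![](https://i.imgur.com/kiEXSGT.png)

## Basic operations
Here are a few simple examples; for more, see the [API Document](https://selenium-python.readthedocs.io/api.html).

### Taking a screenshot
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# './chromedriver' is the WebDriver location.
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get("https://chtcdn.test.nycu.edu.tw/")
driver.get_screenshot_as_file('test.png')
```
Use `get_screenshot_as_file()` to save a screenshot of the current page.

### Filling in a field and clicking a button
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# './chromedriver' is the WebDriver location.
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get("https://chtcdn.test.nycu.edu.tw/")
# Selenium 4 locator style; the find_element_by_* helpers were removed in Selenium 4.3.
element = driver.find_element(By.ID, "homepageSearch")
element.send_keys("表揚")
button = driver.find_element(By.CLASS_NAME, "elementor-search-form__submit")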
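button.click()
```
This uses the search box at the bottom of the [test home page](https://chtcdn.test.nycu.edu.tw) to enter `表揚` ("commendation") and press the search button.
![](https://i.imgur.com/i9zhw4T.jpg)

Result:
![](https://i.imgur.com/RpgV3ux.jpg)

#### Entering the search string and clicking the search button
```python
from selenium.webdriver.common.by import By

element = driver.find_element(By.ID, "homepageSearch")
element.send_keys("表揚")
button = driver.find_element(By.CLASS_NAME, "elementor-search-form__submit")
button.click()
```
First, use Chrome's developer tools (F12) to find the HTML of the element you want to control.
1. Click this tool.
    ![](https://i.imgur.com/SprWLMH.png)
2. Move the cursor over the block you want to grab; the matching HTML is highlighted in the panel below.
    ![](https://i.imgur.com/TIDJw62.jpg)
3. In this example, the search input field has the id `homepageSearch`, and the button that submits the query has the class name `elementor-search-form__submit`.
    ![](https://i.imgur.com/YQzmqrM.png)
4. Use `find_element()` with `By.ID` and `By.CLASS_NAME` to locate the elements and act on them. (The older `find_element_by_id` and `find_element_by_class_name` helpers were removed in Selenium 4.3.)

Use `find_element(By.ID, ...)` to locate the search input field whose id is `homepageSearch`, then use `send_keys()` to type the text.
```python
element = driver.find_element(By.ID, "homepageSearch")
element.send_keys("表揚")
```
Use `find_element(By.CLASS_NAME, ...)` to locate the button whose class name is `elementor-search-form__submit`, then use `click()` to press it.
```python
button = driver.find_element(By.CLASS_NAME, "elementor-search-form__submit")
button.click()
```
For other ways to locate elements, see the [Document](https://selenium-python.readthedocs.io/locating-elements.html).

## Fetching all announcements
> This section was written on Windows 10 with the Firefox and Edge browsers.

Similar objects on a page usually share similar characteristics, so we start by inspecting a single announcement. Right-click the announcement you picked:
![](https://i.imgur.com/XIDOzNo.png)

The browser locates the element for that item:
![](https://i.imgur.com/unXt9Kd.png)

All of the announcements turn out to be a list made of `li` elements, and their `id` attributes share an obvious pattern: they all have the form `su-post-xxxx`. Each `li` tag contains an `a` tag. An `a` tag provides a hyperlink, and its `href` attribute holds the link's URL.
```python
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from bs4 import BeautifulSoup
import re

# 'msedgedriver.exe' is the WebDriver location.
driver = webdriver.Edge(service=Service('msedgedriver.exe'))
driver.get("https://chtcdn.test.nycu.edu.tw/news-network/")
# Parse the HTML of the loaded page.
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Find every li tag whose id starts with su-post-
posts = soup.find_all('li', id=re.compile('^su-post-'))
for post in posts:
    # Print the title and link of each post.
    # post.a['href'] reads the href attribute of the a tag inside the li tag.
    print(f"{post.text}-{post.a['href']}")
```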
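To see what the `id=re.compile('^su-post-')` filter selects without launching a browser, here is a minimal sketch that uses only the standard library's `html.parser` in place of BeautifulSoup. The HTML fragment, the `PostLinkParser` class, and the `example.com` URLs are made up for illustration and are not taken from the test site.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment that mimics the list structure described above.
SAMPLE_HTML = """
<ul>
  <li id="su-post-1665"><a href="https://example.com/news/1665/">First post</a></li>
  <li id="menu-item-12"><a href="https://example.com/about/">About</a></li>
  <li id="su-post-1655"><a href="https://example.com/news/1655/">Second post</a></li>
</ul>
"""

POST_ID = re.compile(r"^su-post-")


class PostLinkParser(HTMLParser):
    """Collect (text, href) for links inside <li> elements whose id matches POST_ID."""

    def __init__(self):
        super().__init__()
        self.posts = []          # finished (text, href) pairs
        self._in_post = False    # True while inside a matching <li>
        self._href = None        # href of the <a> currently being read
        self._text = []          # text chunks of that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self._in_post = bool(POST_ID.match(attrs.get("id") or ""))
        elif tag == "a" and self._in_post and self._href is None:
            self._href = attrs.get("href")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.posts.append(("".join(self._text).strip(), self._href))
            self._href, self._text = None, []
        elif tag == "li":
            self._in_post = False


parser = PostLinkParser()
parser.feed(SAMPLE_HTML)
for title, href in parser.posts:
    print(f"{title}-{href}")
```

Only the two `li` elements whose id starts with `su-post-` contribute a line; the `menu-item-12` entry is skipped, which is the same selection the regular expression performs in `soup.find_all`.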
If the crawler works correctly, its output should look similar to this:
```
歡慶交大日傑出校友獲表揚-https://chtcdn.test.nycu.edu.tw/news/1665/
國立陽明交通大學109年度傑出校友-https://chtcdn.test.nycu.edu.tw/news/1655/
.
. (some lines omitted)
.
歡慶交大日傑出校友獲表揚-https://chtcdn.test.nycu.edu.tw/news/1665/
高中生如何面試上陽明交大？最新Podcast揭秘-https://chtcdn.test.nycu.edu.tw/news/1651/
```
Continuing from the code above, let's print the body of the first post.
```python
# Open the first post and print its body text.
driver.get(posts[0].a['href'])
post_soup = BeautifulSoup(driver.page_source, 'html.parser')
context = post_soup.find('div', 'entry-content')
for child in context:
    # The article contains both text and images; only the text is handled here.
    # You could also try sending the images to a Telegram bot.
    if child.name == 'p':
        print(child.text)
```
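The sample listing earlier in this section contains the same announcement (`/news/1665/`) twice. If each announcement should be handled only once, one option is to drop repeated URLs while keeping the original order. The following is a minimal sketch; `unique_by_url` and the sample tuples are hypothetical and stand in for the `(title, href)` pairs collected from `posts`.

```python
def unique_by_url(pairs):
    """Return (title, url) pairs with repeated URLs removed, keeping first occurrences in order."""
    seen = set()
    result = []
    for title, url in pairs:
        if url not in seen:
            seen.add(url)
            result.append((title, url))
    return result


# Hypothetical data shaped like the listing printed by the crawler.
collected = [
    ("Post A", "https://example.com/news/1665/"),
    ("Post B", "https://example.com/news/1655/"),
    ("Post A", "https://example.com/news/1665/"),
    ("Post C", "https://example.com/news/1651/"),
]

for title, url in unique_by_url(collected):
    print(f"{title}-{url}")
```

With the crawler, the input could be built as `[(post.text, post.a['href']) for post in posts]` before calling the helper.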