[Python Web Scraping Notes] BeautifulSoup Library - Part 3
===

Table of Contents:

[TOC]

---

Thanks for opening this post! I'm LukeTseng, an unknown creator who loves computing. My university recently offered a course on programming for big data analytics that covered web scraping, which sparked my interest and led to this series of notes.

Disclaimer: these notes are for personal study only, so use them at your own discretion. They were written in VSCode, and some modules (libraries) have to be installed yourself.

## Installing the BeautifulSoup module

If you use Google Colab or Anaconda, no installation is needed.

Command:

```
pip install beautifulsoup4
```

## Importing the BeautifulSoup module

```
from bs4 import BeautifulSoup
```

## Why use BeautifulSoup?

BeautifulSoup parses HTML and XML, turning page content into a structured tree that a program can work with. Parsing and extracting web page data is its main use.

In web scraping, the two indispensable modules are `requests` and BeautifulSoup: `requests` downloads the page, and BeautifulSoup lets you extract and analyze the information you want from it.

For example, you could compute the total view count of every post on a personal blog by visiting each post listed in the sitemap, extracting its view count, and summing the results.

## A first BeautifulSoup program

Using my blog as an example: https://luketsengtw.github.io/

```python
import requests
from bs4 import BeautifulSoup

url = 'https://luketsengtw.github.io/'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.title)
```

Output:

```
<title>Yaoの程式小窩 - 只想好好學程式</title>
```

To drop the surrounding `<title></title>` tag, use the `.string` attribute.

```python
import requests
from bs4 import BeautifulSoup

url = 'https://luketsengtw.github.io/'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.title.string)  # .string returns only the text inside the tag
```

Output:

```
Yaoの程式小窩 - 只想好好學程式
```

## Parsers

The parser is the second argument passed to `BeautifulSoup`. It converts the HTML source into a tag tree that the program can operate on.

Python's built-in HTML parser is `html.parser`. Other parsers must be installed separately; the common ones are `lxml` and `html5lib`.

To install them, run: `pip install lxml html5lib`

The table below gives a quick comparison of these parsers:

| Parser | Speed | Accuracy | Error tolerance |
| -------- | -------- | -------- | -------- |
| html.parser | Medium | Lowest | Lowest |
| lxml | Fastest | High | High |
| html5lib | Slowest | Highest | Highest |

`lxml` is the usual choice. While you are still learning and would rather not install anything extra, `html.parser` is enough for everything that follows.

## Common BeautifulSoup methods

### Search methods

`find()` and `find_all()` are the most frequently used methods in BeautifulSoup, so they come first.

Both search the tree for matching tags. When several tags match, `find()` returns only the first one, while `find_all()` returns all of them.

Here is an example:

```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
    <div class="container">
        <h2>標題</h2>
        <p class="content">第一段內容</p>
        <p class="content">第二段內容</p>
        <a
href="https://example.com">連結一</a>
        <a href="https://google.com">連結二</a>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Find the first <p> element whose class is 'content'
first_p = soup.find('p', class_='content')
print(first_p.text)

# Find every <p> element whose class is 'content'
all_p = soup.find_all('p', class_='content')
for p in all_p:
    print(p.text)

# Find every <a> element that has an href attribute
links = soup.find_all('a', href=True)
for link in links:
    print(f"連結文字: {link.text}, 網址: {link['href']}")
```

Output:

```
第一段內容
第一段內容
第二段內容
連結文字: 連結一, 網址: https://example.com
連結文字: 連結二, 網址: https://google.com
```

### CSS Selector

CSS selectors are used through `select()`.

Here is the test data:

```python
from bs4 import BeautifulSoup

html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>CSS 選擇器練習範例</title>
</head>
<body>
    <header id="main-header" class="site-header">
        <h1 class="title">網站標題</h1>
        <nav class="navigation">
            <a href="/home" class="nav-link active">首頁</a>
            <a href="/about" class="nav-link">關於我們</a>
            <a href="/contact" class="nav-link">聯絡我們</a>
        </nav>
    </header>

    <main class="container">
        <article id="article-1" class="post featured">
            <h2 class="post-title">第一篇文章</h2>
            <p class="post-content">這是第一篇文章的內容。</p>
            <div class="meta">
                <span class="author" data-name="A">作者：A</span>
                <span class="date" data-published="2025-01-01">日期：2025-01-01</span>
            </div>
        </article>

        <article id="article-2" class="post">
            <h2 class="post-title">第二篇文章</h2>
            <p class="post-content highlight">這是第二篇文章的重點內容。</p>
            <div class="meta">
                <span class="author" data-name="B">作者：B</span>
                <span class="date" data-published="2025-01-02">日期：2025-01-02</span>
            </div>
        </article>

        <aside class="sidebar">
            <div class="widget recent-posts">
                <h3>最新文章</h3>
                <ul>
                    <li><a href="/post1">文章一</a></li>
                    <li><a href="/post2">文章二</a></li>
                    <li><a href="/post3">文章三</a></li>
                </ul>
            </div>
        </aside>
    </main>

    <footer id="main-footer" class="site-footer">
        <p>© 2025 練習網站</p>
    </footer>
</body>
</html>
"""

soup = BeautifulSoup(html_content, 'lxml')
```

**Tag selectors**: the example below selects every `h2` tag and every `p` paragraph tag in the HTML.

Note: `select()` returns a list-like object (a `ResultSet`), so the following
`for h2 in h2_tags:` loop iterates over the elements of that list.

```python
# Select every h2 tag
h2_tags = soup.select('h2')
print("所有 h2 標籤:")
for h2 in h2_tags:
    print(f" {h2.text}")

# Select every paragraph
paragraphs = soup.select('p')
print(f"\n找到 {len(paragraphs)} 個段落")
```

Output:

```
所有 h2 標籤:
 第一篇文章
 第二篇文章

找到 3 個段落
```

**ID selectors**: taking `<header id="main-header" class="site-header">` as an example, the `<header>` tag can be selected by its id.

To select a specific ID, prefix it with a `#` (hash sign).

```python
# Select the element with a specific ID
header = soup.select('#main-header')
print(f"\n標頭內容: {header[0].find('h1').text}")

# Select a specific article by ID
article1 = soup.select('#article-1')
print(f"第一篇文章標題: {article1[0].find('h2').text}")
```

Output:

```

標頭內容: 網站標題
第一篇文章標題: 第一篇文章
```

**Class selectors**: as the name suggests, these select by the value of `class`.

Unlike ID selectors, class selectors use a `.` (a half-width period) as the prefix.

Note: `.get('href')` in the code below is a method on BeautifulSoup tags that returns the value of an attribute.

An attribute can also be read directly with `link['href']`. The difference is that this form is less safe: it raises a `KeyError` when the attribute is missing, whereas `.get()` returns `None`.

```python
# Select every element that has the post class
posts = soup.select('.post')
print(f"\n找到 {len(posts)} 篇文章:")
for post in posts:
    title = post.find('h2').text
    print(f" {title}")

# Select the navigation links
nav_links = soup.select('.nav-link')
print(f"\n找到 {len(nav_links)} 個導航連結:")
for link in nav_links:
    print(f" {link.text} -> {link.get('href')}")  # .get() reads the attribute value
```

Output:

```

找到 2 篇文章:
 第一篇文章
 第二篇文章

找到 3 個導航連結:
 首頁 -> /home
 關於我們 -> /about
 聯絡我們 -> /contact
```

**Compound selectors**: several selectors can be written together with no space between them, and an element must then match all of them.

For instance, an element can be required to have two classes at once, as in the example:

```python
# Tag + class selector
post_titles = soup.select('h2.post-title')
print("\n文章標題:")
for title in post_titles:
    print(f" {title.text}")

# Multiple class selectors (the element must have both classes)
featured_posts = soup.select('.post.featured')
print(f"\n精選文章數量: {len(featured_posts)}")
```

Output:

```

文章標題:
 第一篇文章
 第二篇文章

精選文章數量: 1
```

**Descendant selectors**: these select tags nested anywhere inside another tag. To find the `<p>` tags inside `<article>`, for example, this selector returns every `<p>` contained in an `<article>`.

To write one, put a single space between the two selectors.

Here is an example:

```python
# Select every span tag inside main
meta_spans = soup.select('main span')
print("\n文章元資訊:")
for span in meta_spans:
    print(f" {span.text}")

# Select the p tags inside article
article_paragraphs = soup.select('article p')
print(f"\n文章段落數: {len(article_paragraphs)}")
```
Output:

```

文章元資訊:
 作者：A
 日期：2025-01-01
 作者：B
 日期：2025-01-02

文章段落數: 2
```

## Extracting plain text with BeautifulSoup

`.get_text()` returns the text inside a tag rather than the tag itself (for example, `123` from `<p>123</p>`).

```python
from bs4 import BeautifulSoup

html = """
<div class="article">
<h1>文章標題</h1>
<p>這是第一段落。</p>
<p>這是<strong>重要</strong>的第二段落。</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Extract the plain text (including the text of all child elements)
article = soup.find('div', class_='article')
print(article.get_text())
```

Output (leading and trailing blank lines omitted):

```
文章標題
這是第一段落。
這是重要的第二段落。
```

## A small application: BeautifulSoup with Requests

Site to scrape: https://www.nptu.edu.tw/p/412-1000-2972.php?Lang=zh-tw

The goal is to list every academic unit of National Pingtung University, that is, the colleges and the departments under them.

Note: this scraper is for learning purposes only and has no other use.

Colab or Jupyter Notebook is recommended for this exercise; running it in a local environment may hit SSL certificate problems.

You can also right-click anywhere on the site and choose Inspect to open the developer tools. The arrow icon at the top left lets you pick any element on the page, and the panel then jumps to the matching line of HTML source.

![image](https://hackmd.io/_uploads/rJU4rmOngl.png)

![image](https://hackmd.io/_uploads/BJySH7d2lg.png)

![image](https://hackmd.io/_uploads/r1_HBXunlx.png)

Example code:

```python
import requests
from bs4 import BeautifulSoup

url = "https://www.nptu.edu.tw/p/412-1000-2972.php?Lang=zh-tw"  # target URL
html = requests.get(url)  # send a GET request to the target URL
html.encoding = "utf-8"  # set the correct encoding
soup = BeautifulSoup(html.text, "lxml")  # build the BeautifulSoup object with the lxml parser

main_content = soup.find("div", class_="main")  # main content area

if main_content:
    # Get all the text and split it into non-empty lines
    full_text = main_content.get_text()
    lines = [line.strip() for line in full_text.split('\n') if line.strip()]

    # Find where the list of academic units starts
    start_idx = -1
    for i, line in enumerate(lines):
        if "學術單位列表" in line:
            start_idx = i
            break

    if start_idx != -1:
        relevant_lines = lines[start_idx+1:]  # skip the "學術單位列表" line itself

        current_college = None
        departments = []

        for line in relevant_lines:
            # A line containing "學院" starts a new college
            if "學院" in line and not line.endswith("系"):
                # Print the previous college first, if there is one
                if current_college and departments:
                    print(f"{current_college}:")
                    for dept in departments:
                        print(f" - {dept}")
                    print()  # blank line as a separator

                # Start a new college
                current_college = line
                departments = []

            # Lines containing "學系", "研究所", "中心" or "學程" are departments
            elif any(keyword in line for keyword in ["學系", "研究所", "中心", "學程"]):
                if
current_college:  # make sure a college is currently open
                    departments.append(line)

            # Stop when the campus information is reached
            elif "校區" in line:
                break

        # Handle the last college
        if current_college and departments:
            print(f"{current_college}:")
            for dept in departments:
                print(f" - {dept}")
```

Output:

```
管理學院:
 - 商業大數據學系(含碩士班)
 - 行銷與流通管理學系(含碩士班)
 - 休閒事業經營學系(含碩士班)
 - 不動產經營學系(含碩士班)
 - 企業管理學系(含碩士班)
 - 國際經營與貿易學系(含碩士班)
 - 財務金融學系(含碩士班)
 - 會計學系
 - 大數據商務應用學士學位學程(113學年停招)

資訊學院:
 - 電腦與通訊學系
 - 資訊工程學系(含碩士班)
 - 電腦科學與人工智慧學系(含碩士班)
 - 資訊管理學系(含碩士班)
 - 智慧機器人學系
 - 國際資訊科技與應用碩士學位學程

教育學院:
 - 教育行政研究所(含博碩士班)
 - 教育心理與輔導學系(含碩士班)
 - 教育學系(含碩士班)
 - 特殊教育學系(含碩士班)
 - 幼兒教育學系(含碩士班)
 - STEM教育國際碩士學位學程
 - 特殊教育中心
 - 社區諮商中心
 - 文教事業經營碩士在職學位學程(110學年停招)

人文社會學院:
 - 視覺藝術學系(含碩士班)
 - 音樂學系(含碩士班)
 - 文化創意產業學系(含碩士班)
 - 社會發展學系(含碩士班)
 - 中國語文學系(含碩士班)
 - 應用日語學系
 - 應用英語學系
 - 英語學系(含碩士班)
 - 文化發展學士學位學程原住民專班
 - 文化事業發展碩士學位學程原住民專班
 - 客家文化產業碩士學位學程
 - 客家研究中心
 - 原住民族教育研究中心
 - 藝文中心

理學院:
 - 科學傳播學系(含科學傳播暨教育碩士班)
 - 應用化學系(含碩士班)
 - 應用數學系(含碩士班)
 - 體育學系(含碩士班)

國際學院:
 - 東南亞發展中心
 - 華語教學中心

大武山學院:
 - 共同教育中心
 - 博雅教育中心
 - 跨領域學程中心
 - EMI發展中心
 - 大武山社會實踐暨永續發展中心
 - 新媒體創意應用碩士學位學程
 - 大武山跨領域學士學位學程
 - 師資培育中心
 - 師資培育中心 - 教育學程組
```

## Summary

BeautifulSoup is a Python library for parsing HTML and XML. It converts page content into a tree structure that a program can work with, and is mainly used for web scraping and web data extraction.

### First BeautifulSoup example

```python
import requests
from bs4 import BeautifulSoup

url = 'https://luketsengtw.github.io/'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.title)
```

`from bs4 import BeautifulSoup` imports BeautifulSoup for page analysis and data extraction.

`soup = BeautifulSoup(html.text, 'html.parser')` creates the BeautifulSoup object.

### Parsers

| Parser | Speed | Accuracy | Error tolerance |
| -------- | -------- | -------- | -------- |
| html.parser | Medium | Lowest | Lowest |
| lxml | Fastest | High | High |
| html5lib | Slowest | Highest | Highest |

### Common BeautifulSoup methods

* find() - finds the first matching element
* find_all() - finds all matching elements

The code below finds the first `<p>` tag with class `content`, and then all such `<p>` tags.

```python
first_p = soup.find('p', class_='content')
all_p = soup.find_all('p', class_='content')
```

#### CSS selectors

Use the `select()` method:

* Plain tag: `soup.select('h2')`
* ID selector (with `#`): `soup.select('#main-header')`
* Class selector (with `.`): `soup.select('.nav-link')`
*
Compound selector: `soup.select('h2.post-title')`
* Descendant selector (space): `soup.select('main span')`

#### Text extraction

* `soup.title.string` - gets the text inside a tag
* `element.get_text()` - gets the plain text content (tags removed)
* `link['href']` or `link.get('href')` - gets an attribute value

### Basic scraper template

```python
import requests
from bs4 import BeautifulSoup

url = "目標網址"  # placeholder: replace with the target URL
html = requests.get(url)
html.encoding = "utf-8"  # set the encoding
soup = BeautifulSoup(html.text, "lxml")

# Find the main content area (it may be missing, so check before using it)
main_content = soup.find("div", class_="main")
if main_content:
    text = main_content.get_text()
```

## References

* [Beautiful Soup 函式庫 - Python 網路爬蟲教學 | STEAM 教育學習網](https://steam.oxxostudio.tw/category/python/spider/beautiful-soup.html)
* [Beautiful Soup Documentation (Beautiful Soup 4.13.0)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Implementing Web Scraping in Python with BeautifulSoup - GeeksforGeeks](https://www.geeksforgeeks.org/python/implementing-web-scraping-python-beautiful-soup/)
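
## Addendum: `[]` vs `.get()` for attribute values

The Class selector note says that `link['href']` fails outright when the attribute is missing, while `.get('href')` returns `None`. The sketch below is one way to see that difference without any network access. The markup and the helper names `safe_href` and `strict_href` are hypothetical, introduced only for this illustration; they are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical test markup: the second link deliberately has no href.
html = '<a href="/home">首頁</a><a>沒有連結</a>'
soup = BeautifulSoup(html, "html.parser")


def safe_href(tag, default=None):
    """Return the tag's href, or `default` when the attribute is missing."""
    return tag.get("href", default)


def strict_href(tag):
    """Return the tag's href; raises KeyError when the attribute is missing."""
    return tag["href"]


links = soup.find_all("a")

# .get() never raises for a missing attribute
for link in links:
    print(link.text, "->", safe_href(link))

# Square-bracket access raises KeyError for the link without href
try:
    strict_href(links[1])
except KeyError as exc:
    print("KeyError:", exc)
```

When scraping real pages, where some tags may lack the attribute you expect, `.get()` with a default is usually the more forgiving choice; square brackets are reasonable when a missing attribute should be treated as an error.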