Python | Web Scraping Notes for Beginners

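### What is Python?

Python is a widely used programming language for web applications, software development, data science, and machine learning. Many developers choose Python because it is efficient to develop in, easy to learn, and runs on many different platforms.

### Key features of Python

* **Indentation**: indentation is not just layout; it is how Python defines code blocks, which keeps Python code easy to read and maintain.
* **Interpreted language**: Python code runs without a separate compile step, so you can write and test quickly during development.
* **Dynamic typing**: variables are not declared with a type; the type comes from the value assigned at run time. This speeds up writing code, but type errors only surface when the code runs. Python is still strongly typed: it does not silently convert between unrelated types, so `1 + "1"` raises a `TypeError`.
* **Rich standard library and third-party libraries**: these help engineers develop quickly. The standard library covers operating system access, network programming, multithreading, data processing, and more; third-party libraries cover databases, scientific computing, machine learning, and more.

### Comparison with other languages

|          | Python | JavaScript | C# | SQL |
| -------- | ------ | ---------- | -- | --- |
| Layout   | Indentation, colons | Braces, semicolons | Braces, semicolons | None |
| Typing   | Dynamic, strong | Dynamic, weak | Static, strong | Declared column types (strictness varies by database) |
| Paradigm | Imperative | Imperative | Imperative | Declarative |
| Uses     | Web, data analysis, automation, games, application development | Web front end and back end | All kinds of apps: websites, games, mobile apps, desktop apps | Databases |

> What are imperative programming and declarative programming?
>
> They are two different programming paradigms, that is, two different styles of writing programs.
>
> * Imperative programming: describes the actions the computer must perform. Examples: Python, JavaScript, C#.
> * Declarative programming: expresses the logic of a computation without describing its control flow. Examples: SQL, HTML, regular expressions.
>
> Put simply, an imperative language often needs a sizable block of code to implement a feature that SQL can express in a single statement. For a vivid description of the difference, see the [reference article](https://ithelp.ithome.com.tw/articles/10288311?sc=iThelpR).

### What is a web scraper, and what concepts does it require?

A web scraper simulates a person browsing web pages and saves the data for later use. Typical applications include collecting weather information, stock prices, and exchange rates, and building website indexes. The following background knowledge speeds up learning to write one:

* Web basics: knowing the basic syntax and structure of HTML, CSS, and JavaScript helps you understand and analyze web pages.
* Network protocols: knowing HTTP and HTTPS helps you understand how web resources are requested.
* A programming language: being familiar with at least one language, such as Python, Java, or JavaScript, helps you write the scraper itself.

### Code example

*Target site and preview image: to be added.*

The scraper can be divided into the following five parts:

1. Import the required modules: bring in the modules that handle the HTTP request and create and process the data.
    * `urllib.request` fetches resources from the network.
    * `bs4` parses HTML.
    * `csv` reads and writes CSV files.
    * `os` handles file path operations.
    * `urllib.parse.quote()` percent-encodes special characters in a URL.

**TODO: write up these installation and import steps.**

```shell=
# Install Beautiful Soup 4 (the script uses urllib.request, so it needs Python 3)
pip install beautifulsoup4
# If pip does not point to your Python 3 installation, use pip3 instead
pip3 install beautifulsoup4
```

```python=
import urllib.request as req
import bs4
import os
import csv
from urllib.parse import quote
```

2. Fetch the target site's HTML: send an HTTP request with `req` (the alias for `urllib.request`) by passing the URL of the page to `urlopen()`. `read()` reads the response body, and `decode()` converts the UTF-8 encoded bytes into a string; without it, Chinese text comes out garbled.

Once the content is retrieved, parse it with BeautifulSoup: `BeautifulSoup()` creates a BeautifulSoup object, here with the `html.parser` parser. The next step is to pull out specific DOM elements, so first inspect the site's HTML structure:

*Page structure in the browser developer tools: image to be added.*

Every news item sits inside a `div` whose class is `accordion-item`, so `find_all('div', 'accordion-item')` finds them.

```python=
url = "https://.../"
# Open the URL, read the raw bytes, and decode them as UTF-8
with req.urlopen(url) as response:
    data = response.read().decode("utf-8")
soup = bs4.BeautifulSoup(data, 'html.parser')
# Every news item is a <div class="accordion-item">
news_items = soup.find_all('div', 'accordion-item')
```

3. Create the folder for the output files: if the directory does not exist, create it. `os.path.exists()` checks whether the directory exists, and `os.makedirs()` creates it.

```python=
save_dir = './爬蟲資料夾/'
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
```

4. Write the data to a CSV file: open a CSV file and write the header row; `open()` is Python's built-in function for opening files. A `for` loop then extracts each item's title (標題), content (內文), content HTML (內文HTML), attachment names (附件名稱), and attachment HTML (附件HTML). To extract them, first inspect the structure of the page:

*Diagram of the page structure: image to be added.*

The DOM elements to look for are the title (`button` tag, class `accordion-button`), the content (`div` tag, class `container`), and the attachment names (`ul` tag, class `list-group`). An item may have no attachments or more than one, so `news_file_list` is built behind a condition: when attachments exist, a loop collects their names.

`text` returns the text inside an HTML tag, and `strip()` removes leading and trailing whitespace (including newlines) so that the text carries no stray spaces at either end.

```python=
with open(save_dir + '內容.csv', 'w', encoding='utf-8-sig', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['標題', '內文', '內文HTML', '附件名稱', '附件HTML'])

    for news_item in news_items:
        news_title = news_item.find('button', 'accordion-button').text.strip()
        news_content_html = news_item.find('div', 'container')
        news_content = news_content_html.text.strip()

        # find_all() returns an empty list when the item has no attachments,
        # so news_files_html is always defined (never left over from a previous item)
        news_files = news_item.find_all('ul', 'list-group')
        news_files_html = news_files
        file_names = []
        if news_files:
            for news_file in news_files:
                file_names.append(news_file.find('h6').text.strip())
        # Join with commas so there is no trailing comma
        news_file_list = ','.join(file_names)

        csv_writer.writerow(
            [news_title, news_content, news_content_html, news_file_list, news_files_html])
```

5. Download the attachments: take each attachment's name and link, create a folder named after the news item it belongs to, and save the file there. Because the URLs contain Chinese characters, they must be percent-encoded into ASCII (American Standard Code for Information Interchange) before use, which `quote()` does. By default `quote()` also encodes some reserved URL characters, so a `safe_chars` variable is passed to `quote()` to mark the characters that must not be encoded.

```python=
# This loop runs inside the `for news_item in news_items` loop from step 4
for news_file in news_files:
    file_name = news_file.find('h6').text.strip()
    save_file_dir = save_dir + news_title
    if not os.path.exists(save_file_dir):
        os.makedirs(save_file_dir)
    # Reserved URL characters that quote() should leave as they are
    safe_chars = "!#$&'()*+,/:;=?@[]"
    file_href = 'https://...' + \
        quote(news_file.find('a')['href'], safe=safe_chars)
    with req.urlopen(file_href) as response:
        content = response.read()
    with open(save_file_dir + '/' + file_name, "wb") as file:
        file.write(content)
```

#### Result

*Result image: to be added.*

#### Full code with comments

```python=
# ==== Import the required modules ====
# Fetches resources from the network
import urllib.request as req
# Parses HTML
import bs4
# Reads and writes CSV files
import csv
# File path operations
import os
# Percent-encodes special characters in a URL
from urllib.parse import quote

# ==== Fetch the HTML from the target site ====
# Set the site URL
url = "https://.../"
# urlopen() opens the given URL
with req.urlopen(url) as response:
    # read() reads the page content; decode() converts the UTF-8 bytes to a string
    # (required for Chinese text to display correctly)
    data = response.read().decode("utf-8")

# Parse the HTML: BeautifulSoup() creates a BeautifulSoup object using html.parser
soup = bs4.BeautifulSoup(data, 'html.parser')

# Find every news item; find_all() looks up the given HTML tag and class
news_items = soup.find_all('div', 'accordion-item')

# ==== Create the folder for the output files ====
save_dir = './爬蟲資料夾/'
# os.path.exists() checks whether the directory exists
if not os.path.exists(save_dir):
    # os.makedirs() creates the directory, including any missing parents
    os.makedirs(save_dir)

# ==== Write the news data to a CSV file with the csv module ====
# open() is Python's built-in function for opening files
with open(save_dir + '內容.csv', 'w', encoding='utf-8-sig', newline='') as csv_file:
    # Create the CSV writer and write the header row
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['標題', '內文', '內文HTML', '附件', '附件HTML'])

    # Loop over every news item and extract its data
    for news_item in news_items:
        # Title: text returns the text inside the tag;
        # strip() removes leading and trailing whitespace, including newlines
        news_title = news_item.find('button', 'accordion-button').text.strip()

        # Content HTML and content text
        news_content_html = news_item.find('div', 'container')
        news_content = news_content_html.text.strip()

        # Attachments: find_all() returns an empty list when there are none,
        # so news_files_html is always defined for the current item
        news_files = news_item.find_all('ul', 'list-group')
        news_files_html = news_files
        file_names = []
        if news_files:
            for news_file in news_files:
                file_names.append(news_file.find('h6').text.strip())
        news_file_list = ','.join(file_names)

        # Write title, content, content HTML, attachments, and attachment HTML
        csv_writer.writerow(
            [news_title, news_content, news_content_html, news_file_list, news_files_html])

        # ==== Download the attachments ====
        for news_file in news_files:
            file_name = news_file.find('h6').text.strip()
            # One folder per news item, inside the output folder
            save_file_dir = save_dir + news_title
            if not os.path.exists(save_file_dir):
                os.makedirs(save_file_dir)
            # Reserved URL characters that quote() should leave as they are
            safe_chars = "!#$&'()*+,/:;=?@[]"
            file_href = 'https://www.sfea.org.tw' + \
                quote(news_file.find('a')['href'], safe=safe_chars)
            # Download the attachment
            with req.urlopen(file_href) as response:
                content = response.read()
            # "w" opens the file for writing and "b" selects binary mode,
            # so "wb" writes the raw bytes unchanged
            with open(save_file_dir + '/' + file_name, "wb") as file:
                file.write(content)

# Print a completion message
print("＝＝＝＝＝＝＝＝＝＝下載完成＝＝＝＝＝＝＝＝＝＝")
```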
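
The download step can be tried offline. The sketch below is a minimal illustration, not part of the original script: it shows what `quote()` does to a path containing Chinese characters when the same `safe` characters are passed, and how an attachment could be saved under a per-item folder. The names `build_file_url`, `safe_folder_name`, and `save_attachment` are hypothetical helpers, `https://example.com` and the sample path are placeholders, and replacing characters such as `/` in the title is an added assumption (the original script uses the title as the folder name unchanged).

```python
import os
import tempfile
from urllib.parse import quote

# Same reserved characters the script passes to quote() as `safe`
SAFE_CHARS = "!#$&'()*+,/:;=?@[]"


def build_file_url(base_url, href):
    """Percent-encode non-ASCII characters and spaces in href, then prepend base_url."""
    return base_url + quote(href, safe=SAFE_CHARS)


def safe_folder_name(title):
    """Hypothetical helper: replace characters that are not allowed in folder names."""
    invalid = '\\/:*?"<>|'
    cleaned = ''.join('_' if ch in invalid else ch for ch in title)
    return cleaned.strip() or 'untitled'


def save_attachment(root_dir, title, file_name, content):
    """Write content (bytes) to root_dir/<title>/<file_name> and return the path."""
    folder = os.path.join(root_dir, safe_folder_name(title))
    # exist_ok=True replaces the separate os.path.exists() check
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name)
    with open(path, 'wb') as file:
        file.write(content)
    return path


if __name__ == '__main__':
    # Placeholder host and path; the Chinese characters become %XX sequences
    print(build_file_url('https://example.com', '/files/公告 1.pdf'))
    # → https://example.com/files/%E5%85%AC%E5%91%8A%201.pdf
    with tempfile.TemporaryDirectory() as tmp:
        print(save_attachment(tmp, '2023/05 notice', 'notice.pdf', b'demo'))
```

Keeping `/` and `:` in `safe` leaves the path separators intact, while the Chinese characters and the space are encoded. A title containing `/` would otherwise be read as a nested path by `os.makedirs()`, which is why the sketch replaces it.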