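# Scraping Bank Exchange Rates

###### tags: `python` `crawler`

## Overview

### Goal

Scrape the Bank of Taiwan exchange-rate page (https://rate.bot.com.tw/xrt?Lang=zh-TW) and save the rates for every currency to a CSV file.

### Specification

#### Steps

1. Confirm that the page can be reached.
2. View the page source and confirm that it contains the data we want.
3. Download the page with requests and parse the whole document with BeautifulSoup.
4. Work out how the page is structured, then use BeautifulSoup methods such as `find()` and `find_all()` to extract the target data.

#### Modules used

| Module   | Description                                          |
| -------- | ---------------------------------------------------- |
| bs4      | BeautifulSoup, the HTML parsing module               |
| lxml     | HTML parser used by BeautifulSoup                    |
| requests | HTTP module, used to fetch the page over the network |
| csv      | Reads and writes CSV files                           |
| time     | Gets the current time                                |

## Code walkthrough

#### 1.

```
# The text attribute holds the HTML source of the page
html_doc = response.text
# Build a BeautifulSoup object, using lxml as the parser
soup = BeautifulSoup(html_doc, "lxml")
```

##### Explanation

The `text` attribute of the `response` object is the page source we need. Here it is stored in a new variable, `html_doc`, and then handed to `BeautifulSoup` for parsing. If you do not need the extra variable, the two statements can be combined into one line:

```
# Build a BeautifulSoup object, using lxml as the parser
soup = BeautifulSoup(response.text, "lxml")
```

#### 2.

```
rate_table = soup.find('table').find('tbody')
```

##### Explanation

This takes the `tbody` tag inside the `table` tag of the exchange-rate page. The first `table` on the page is the rate table, and `find()` returns the first match, so a single `find()` call is enough to locate it.

#### 3.

```
rate_table_rows = rate_table.find_all('tr')
```

##### Explanation

Each currency's rates are wrapped in one `tr` tag, so `find_all()` is used to collect every `tr` tag.

#### 4.

```
if c.attrs['data-table'] == '幣別':
```

##### Explanation

Every cell (column) carries a `data-table` attribute that names the column it belongs to (for example `幣別`, "currency"), so this attribute can be used to handle each column differently.

> `attrs` is a dictionary available on every tag. Looking up an attribute name in it returns the corresponding attribute value.

#### 5.

```
last_div = None
divs = c.find_all('div')
# Get the last div tag
for last_div in divs:
    pass
```

##### Explanation

This uses a small trick. The currency `td` cell contains several `div` tags, most of which hold data we do not need; the last `div` happens to contain the currency name we want (for example 美金USD, the US dollar). A `for-in` loop rebinds its loop variable on every iteration, so once a loop whose body does nothing has finished, the pre-declared variable `last_div` holds the last `div` that was iterated. Because `find_all()` returns a list-like result, `divs[-1]` yields the same tag more directly whenever at least one `div` was found.

#### 6.

```
# Currency name
data.append(last_div.string.strip())
...
# Rate value
data.append(c.getText().strip())
```

##### Explanation

The text obtained through the `string` attribute (and through `getText()`) has extra newline and space characters before and after it, so the `strip()` method of Python's built-in `str` type is used to remove them.

#### 7.
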
```
# Build the file name from the current time
now = time.localtime()
file_name = time.strftime('%Y%m%d_%H%M%S.csv', now)
```

##### Explanation

The `time` module provides the `strftime()` function for formatting a date and time with a custom pattern. The directives used here are:

| Directive | Description                     |
| --------- | ------------------------------- |
| %Y        | Four-digit year (e.g. 2024)     |
| %m        | Month (01-12)                   |
| %d        | Day of the month (01-31)        |
| %H        | Hour on a 24-hour clock (00-23) |
| %M        | Minute (00-59)                  |
| %S        | Second (00-59)                  |

#### 8.

```
# Open the output CSV file
with open(file_name, 'w', encoding='utf-8', newline='') as csvfile:
```

##### Explanation

When a file is opened with `with`, the open file object is bound to the `csvfile` variable and is usable only inside the `with` block. On leaving the block the file is closed automatically and its resources are released, even if an exception was raised. This avoids the problems caused by forgetting to call `close()`: resources that are never released, or a file that cannot be opened again.

#### 9.

```
writer.writerow(['幣別', '現金買入', '現金賣出', '即期買入', '即期賣出'])
writer.writerows(result)
```

##### Explanation

The `writer` object from Python's `csv` module writes a single row with `writerow()`, whose argument is a `list` of strings. Here that row is the header: currency, cash buying, cash selling, spot buying, spot selling. `writerows()` (note the extra `s` at the end of the method name) takes a `list` of rows, where each row is itself a `list` (or, as in this script, a `tuple`) representing one line of data.

## Complete code

```
from bs4 import BeautifulSoup
import requests
import csv
import time

# Holds the exchange-rate data for every currency
result = []

# Bank of Taiwan exchange-rate URL
url = 'https://rate.bot.com.tw/xrt?Lang=zh-TW'

# Download the page with requests.get()
response = requests.get(url)

# The text attribute holds the HTML source of the page
html_doc = response.text

# Build a BeautifulSoup object, using lxml as the parser
soup = BeautifulSoup(html_doc, "lxml")

# Locate the body of the exchange-rate table
rate_table = soup.find('table').find('tbody')
rate_table_rows = rate_table.find_all('tr')

for row in rate_table_rows:
    # Parse the cells of each row
    columns = row.find_all('td')
    # Holds the parsed values for this row
    data = []
    for c in columns:
        if c.attrs['data-table'] == '幣別':
            last_div = None
            divs = c.find_all('div')
            # Get the last div tag
            for last_div in divs:
                pass
            # Currency name
            data.append(last_div.string.strip())
        elif c.getText().find('查詢') != 0 and str(c).find('print_width') > 0:
            # Rate value
            data.append(c.getText().strip())
    # Store all parsed rates for this currency
    result.append(tuple(data))

print(result)

# Build the file name from the current time
now = time.localtime()
file_name = time.strftime('%Y%m%d_%H%M%S.csv', now)
print('Output file:', file_name)

# Open the output CSV file
with open(file_name, 'w', encoding='utf-8', newline='') as csvfile:
    # Create the CSV writer
    writer = csv.writer(csvfile)
    writer.writerow(['幣別', '現金買入', '現金賣出', '即期買入', '即期賣出'])
    writer.writerows(result)
```

## Using CSS selectors

```python
from bs4 import BeautifulSoup
import requests  # Module used to download the page
import csv
import time

url = 'https://rate.bot.com.tw/xrt?Lang=zh-TW'  # Exchange-rate URL

# Download the HTML source of the Bank of Taiwan exchange-rate page
response = requests.get(url)
html_doc = response.text

# Write to a local file to confirm that the download is complete
# with open('rate.html', 'w', encoding='utf-8') as rate:
#     rate.write(html_doc)

soup = BeautifulSoup(html_doc, 'html.parser')
# print(soup.prettify())  # Pretty-print the downloaded page

tbody = soup.find('tbody')
all_rates_rows = tbody.find_all('tr')

all_rate_data = []
for row in all_rates_rows:
    rate_data = []
    # Parse the cells of each row
    rate_data.append(row.select_one('.hidden-phone.print_show').string.strip())
    rate_data.append(row.select_one('td[data-table="本行現金買入"].print_width').string.strip())
    rate_data.append(row.select_one('td[data-table="本行現金賣出"].print_width').string.strip())
    rate_data.append(row.select_one('td[data-table="本行即期買入"].print_width').string.strip())
    rate_data.append(row.select_one('td[data-table="本行即期賣出"].print_width').string.strip())
    all_rate_data.append(rate_data)

# All data
print(all_rate_data)

# Build the file name from the current time
now = time.localtime()
file_name = time.strftime('%Y%m%d_%H%M%S.csv', now)
print('Output file name:', file_name)

with open(file_name, 'w', encoding='utf-8', newline='') as csvfile:
    csvfile.write('\ufeff')  # UTF-8 BOM
    writer = csv.writer(csvfile)  # Create the CSV writer
    writer.writerow(['幣別', '現金買入', '現金賣出', '即期買入', '即期賣出'])
    writer.writerows(all_rate_data)
```
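
The file-naming and CSV-writing steps (7 to 9) can be tried on their own, without network access. The following is a minimal sketch rather than part of the original scripts: the helper names `make_file_name` and `rates_to_csv` are hypothetical, the sample rows are made up and are not real rates, and the output goes to an in-memory `io.StringIO` buffer instead of a file on disk.

```python
import csv
import io
import time

# Same header row as in the scripts above
HEADER = ['幣別', '現金買入', '現金賣出', '即期買入', '即期賣出']


def make_file_name(now):
    """Build a timestamped CSV file name from a time.struct_time (hypothetical helper)."""
    return time.strftime('%Y%m%d_%H%M%S.csv', now)


def rates_to_csv(rows):
    """Return CSV text containing the header plus the given rows (hypothetical helper)."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(HEADER)  # one row: a list of strings
    writer.writerows(rows)   # many rows: a list of lists or tuples
    return buffer.getvalue()


# Made-up sample rows for illustration only, not real exchange rates
sample_rows = [
    ('美金 (USD)', '1.0', '2.0', '3.0', '4.0'),
    ('日圓 (JPY)', '0.1', '0.2', '0.3', '0.4'),
]

# A fixed time keeps the file name reproducible; the scripts use time.localtime()
fixed_time = time.strptime('2024-01-02 03:04:05', '%Y-%m-%d %H:%M:%S')
file_name = make_file_name(fixed_time)
csv_text = rates_to_csv(sample_rows)

print(file_name)
print(csv_text)
```

Writing to a buffer makes it easy to inspect the exact text that `csv.writer` produces before committing it to a file; swapping the buffer for `open(file_name, 'w', encoding='utf-8', newline='')` gives the behaviour of the scripts above.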