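Reservoir Data Web Crawler
===

###### tags: project, web crawler

**March 2019**

Project: use web scraping to pull reservoir data out of a web page and chart how reservoir water levels have changed over the past several years. Going further, the data could be combined with weather data to relate it to water levels, together with a summary of other related information.

Environment: Jupyter Notebook
Language: Python 3.7
Packages: **requests**, **BeautifulSoup4** (install from CMD with `pip install`)
Other tools: [Postman](https://www.getpostman.com/downloads/)
Data source: [Reservoir statistics table](http://fhy.wra.gov.tw/ReservoirPage_2011/StorageCapacity.aspx)

:::info
This is a similar page that I could not scrape data from:
**http://fhy.wra.gov.tw/ReservoirPage_2011/Statistics.aspx**
(though I don't actually know what the difference between Statistics and StorageCapacity is...)
:::

## Crawling Notes

Reference: [大數學堂](https://www.largitdata.com/course_list/1)

- Viewing the page source:
```
- Chrome: More tools → Developer tools
- Right-click → View page source
```
- Other Jupyter Notebook tricks:
```python=3.7
%pylab inline        # draw plots inline in Jupyter Notebook
%ls | grep test      # (Linux command) list files in the current directory whose names contain "test"
%run test.py         # run test.py

from IPython.display import display, Math, Latex
display(Math(r'c = \sqrt{a^2+b^2}'))  # renders c = sqrt(a^2 + b^2) as typeset math
```

<hr/>

### Fetching page content with GET

Right-click the page → Inspect → Network → reload → open the topmost entry and find the URL under Headers.

```python=
import requests

res = requests.get("url")  # fetch the page content with requests
print(res.text)            # print the content
```

### Fetching page content with POST

Some pages need input before they return a query result (the reservoir data, for example, needs a date and time). In that case the crawler has to use POST.

```python=
import requests

# form fields and values that define the search
payload = {
    'title_1': "value_1",
    'title_2': "value_2",
    'title_3': "value_3",
}

# send the query conditions to the page with requests.post
res = requests.post("url", data=payload)
print(res.text)  # print the content
```

So far I have found that this approach does not work. When setting the parameters for the government reservoir site (the bold black `__VIEWSTATE` in the screenshot), I ran into a thorny problem with a random-looking value, which also seems to be related to some kind of verification code. I tried many things and none of them worked... so I went looking for another method QAQ

![](https://i.imgur.com/bdDxBSn.png)

==Update 2019/04/05: I now think the problem is probably not the POST method but the site itself.==

<hr/>

### Scraping data by passing `__VIEWSTATE` validation

Thanks to [this post](https://www.ptt.cc/bbs/R_Language/M.1496796579.A.2FB.html) for the pointers, and to finally switching to a different URL to parse the data from. That is what made today's success possible~

- What is `__VIEWSTATE`?
  `__VIEWSTATE` is a hidden form field that ASP.NET pages use to carry page state between requests; in practice it behaves roughly like a verification token for the page. It has to be included when sending POST data. If it is wrong, the page does not return the correct data and responds with `0|500|Error` instead. [[ More details ]](https://blog.51cto.com/slliang/1783837)
- How do you find `__VIEWSTATE`?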
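  Every time the page is loaded it generates a new `__VIEWSTATE` value, so we only need to GET the page first, pull the `__VIEWSTATE` value out of it, and pass it along as a POST parameter.

```python=
import re

# Find the value of a given form field, such as "__VIEWSTATE"
def find_value(name, web):
    # re.escape keeps characters like "$" in field names from being read as regex syntax
    reg = 'name="' + re.escape(name) + '".+value="(.*)" />'
    pattern = re.compile(reg)
    result = pattern.findall(web.text)
    try:
        return result[0]
    except IndexError:
        # field not found on the page
        return ""
```
[[ Code reference ]](https://www.finlab.tw/%E7%94%A8python%E7%8D%B2%E5%8F%96%E6%8C%81%E8%82%A1%E6%90%8D%E7%9B%8A%E8%A1%A8/)

- Besides `__VIEWSTATE`, what else is needed?
  The query form offers four selectable values (reservoir type, year, month, day). Each one can be set through the identifier of its input element in the HTML source, which is the form field name that gets posted. For example, the year field is `ctl00$cphMain$ucDate$cboYear`, so setting it selects which year's data to request. The parameters sent are:

```python=
import requests
from bs4 import BeautifulSoup

# GET the page first so the hidden form fields can be read from it
content = requests.get("url")

date_list = [year, month, date]  # query date, supplied by the caller
load_list = [find_value("__EVENTTARGET", content), find_value("__VIEWSTATE", content)]
payload = {
    #"ctl00$ctl02": find_value("ctl00$ctl02", d),
    #"ctl00_ctl02_HiddenField": find_value("ctl00_ctl02_HiddenFiel", d),
    "__EVENTTARGET": load_list[0],
    #"__EVENTARGUMENT": "",
    "__VIEWSTATE": load_list[1],
    #"__VIEWSTATEGENERATOR": find_value('__VIEWSTATEGENERATOR', d),
    #"__EVENTVALIDATION": find_value('__EVENTVALIDATION', d),
    'ctl00$cphMain$cboSearch': "所有水庫",  # "all reservoirs"
    'ctl00$cphMain$ucDate$cboYear': date_list[0],
    'ctl00$cphMain$ucDate$cboMonth': date_list[1],
    'ctl00$cphMain$ucDate$cboDay': date_list[2],
    #'ctl00$cphMain$ucDate$cboHour': para[3],
    #'ctl00$cphMain$ucDate$cboMinute': para[4],
    #'__ASYNCPOST': 'true'
}
res = requests.post("url", data=payload)  # send the query
# keep every <tr> after the first two (presumably header rows)
rows = BeautifulSoup(res.text, "lxml").find_all("tr")[2:]
```

After parsing the page with BeautifulSoup and extracting the data (string processing to pull out the **storage percentage**), the data for each reservoir can be collected. Parsing the site every time the data is needed takes too long, so I write the results to a CSV file and read from that file later when doing the data visualization. [[ Dataset download ]](https://github.com/hsiaoping-zhang/Reservoir_DataVisiualization/tree/master/data)

<hr>

### (Extra) Introduction to Selenium

Selenium is a package that lets Python run automated browser tests. You can drive operations on a page from code, so large data searches no longer have to be done by hand. Because it gets a real web browser to do the work for you, you also need to download a web driver. [[ web driver download (geckodriver, the driver for Firefox) ]](https://github.com/mozilla/geckodriver/releases/download/v0.20.1/geckodriver-v0.20.1-win64.zip)
(Remember to put the exe file in the same folder as Python so that it runs correctly.)

- First, install the package from the command line
```
pip install selenium
```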