新人資訊
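Tech - Backend - Azure Function #2: Plugging In a Web Scraper
Having finished the Azure Functions hello world last time, the obvious move is to press on and add my own code so that it does what I tell it to. I figured I would start with a web scraper. Scraping is a technique I find genuinely exciting: looking things up by hand means clicking around and hunting through a dense sea of web data for the piece you need, which is slow, tedious, and draining. Once it is written as a scraper, a single click gets it done in an instant, and that feels great. Besides, once we can parse messy HTML and pull data out of it, consuming the JSON or CSV returned by the countless well-specified APIs around the world is even less of a problem.
The first line of code to add is import pandas. It is a very basic package for data handling, so the import has to succeed first. A squiggly underline appears beneath pandas, which is the IDE telling us something is wrong. Running it anyway confirms this: the detailed error message says the module cannot be found.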
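The fix is to install it first: open a terminal and type "pip install pandas". pip is the package manager, install is a command that pip accepts, and pandas is the argument passed to install; for more on pip, run pip --help. Also note that the prompt must start with (.venv), which indicates the Python virtual environment currently in use. VS Code created a separate environment for this project so that the package versions installed here do not affect the default environment and, through it, other projects. Virtual environments matter a great deal in development work, especially now that packages release new versions so quickly: every project needs its own set of package versions, so keep them separate and do not develop any project in the default environment. Here VS Code has already taken care of this for us. Once the installation succeeds, the squiggle under pandas disappears.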
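Besides looking for the (.venv) prefix on the prompt, the same thing can be checked from inside Python. The sketch below is not part of the original project; the helper name in_virtualenv is made up for illustration, and it assumes an environment created with the standard venv module, as VS Code normally does.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter in use:", sys.executable)
```

If this prints False, pip install would most likely be writing into the default environment rather than the project's own.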
In the same way, install and import the following two packages, which the scraper needs.
First run pip install requests and pip install bs4 in the console, then add to the code:
import requests
from bs4 import BeautifulSoup as bs
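Next, pick the data to scrape. The following URL has the data I want:
https://www.taifex.com.tw/cht/3/largeTraderFutQry
For a given day, the number of contracts bought and the number sold in the open TAIEX futures positions of the top ten traders. Just those two numbers.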
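Chrome's F12 developer tools are a developer's good friend. On the Network page, the Headers tab shows the address that the request is sent to on the back end, and the Payload tab shows the parameters and values sent along with it. For example, the queryDate parameter has the value 2022/03/02. This format matters: the scraper must follow it to get a correct response.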
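Because the date format is so easy to get wrong, it may help to normalize whatever date string comes in before it goes into the payload. This is only a sketch: the helper names normalize_query_date and build_payload are hypothetical, and the only field assumed is the queryDate parameter seen in the Payload tab.

```python
from datetime import date


def normalize_query_date(text: str) -> str:
    """Turn inputs such as '2022-3-2' or '2022/03/02' into 'YYYY/MM/DD'."""
    year, month, day = (int(part) for part in text.strip().replace("-", "/").split("/"))
    # date() validates the values, so an impossible date raises ValueError here.
    return date(year, month, day).strftime("%Y/%m/%d")


def build_payload(text: str) -> dict:
    """Build the form data for the POST request."""
    return {"queryDate": normalize_query_date(text)}


if __name__ == "__main__":
    print(build_payload("2022-3-2"))
```

Going through date() also catches impossible dates such as 2022/02/30 before any request is sent.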
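I added three lines that POST the date to the site; the function's original name parameter is reused here to pass in the date. First I return the POST response as-is to confirm that this first step works.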
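Those lines might look roughly like the sketch below. It is a reconstruction under assumptions, not the original code: it assumes the POST goes to the same address as the page itself and that queryDate is the only field needed. The fetch_raw_html name and its post parameter are hypothetical; the parameter exists only so the call can be swapped out when testing.

```python
import requests

# Assumption: the form posts back to the same address as the page itself.
URL = "https://www.taifex.com.tw/cht/3/largeTraderFutQry"


def fetch_raw_html(query_date: str, post=requests.post) -> str:
    """POST the date to the site and return the raw HTML of the response."""
    response = post(URL, data={"queryDate": query_date}, timeout=30)
    # Fail loudly on 4xx/5xx instead of trying to parse an error page.
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    html = fetch_raw_html("2022/03/02")
    print(html[:500])  # print only the beginning to confirm something came back
```

Inside the Azure Function, the value of name would be passed as query_date and the returned text handed back as the HTTP response body.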
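The first screen shows the function running successfully. The second screen tests passing in a parameter, which sends the program flow into the scraping section, and it flopped!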
The error message shows that the system does not even recognize Response, which means the module import did not succeed.
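Looking closely at the debug output, one line the system executes gives it away: at run time it builds yet another virtual environment on the fly, so the required packages must be listed in requirements.txt. Installing the packages earlier to make the squiggle disappear only made things correct at design time. This setting is also critical when deploying to Azure.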
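A quick way to catch this before running is to compare what the code imports with what requirements.txt lists. The sketch below is an illustration only: the REQUIRED set is an assumption based on the packages installed in this post plus azure-functions from the default template, and both helper names are made up.

```python
import re

# Assumption: the packages this post installs, plus the template's own entry.
REQUIRED = {"azure-functions", "pandas", "requests", "bs4"}


def listed_packages(requirements_text: str) -> set:
    """Collect the package names found in the text of a requirements.txt."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # Keep only the name, cutting off version pins, extras and markers.
        name = re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0]
        names.add(name.lower())
    return names


def missing_packages(requirements_text: str, required=REQUIRED) -> list:
    """Return the required packages that the requirements text does not list."""
    return sorted(set(required) - listed_packages(requirements_text))


if __name__ == "__main__":
    with open("requirements.txt", encoding="utf-8") as handle:
        print("missing:", missing_packages(handle.read()))
```

An empty list means every package in REQUIRED is listed; it says nothing about indirect dependencies.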
At last, a first success: the raw HTML, in all its sprawling glory, has been fetched.
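Next comes parsing. The first step is to use bs4 to parse the text into a tree-structured document object. Again I return it first to take a look, and it is indeed more structured now.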
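As a small illustration of that step, the sketch below parses a made-up HTML string rather than the real page. It uses Python's built-in "html.parser" backend, which is an assumption; the original code may have chosen a different parser.

```python
from bs4 import BeautifulSoup as bs

# Invented sample markup standing in for the raw text returned by the POST.
sample_html = "<html><body><h1>Title</h1><table><tr><td>1</td></tr></table></body></html>"

# Parse the flat text into a navigable tree of tags.
soup = bs(sample_html, "html.parser")

# prettify() re-indents the tree, which makes the structure much easier to read.
print(soup.prettify())
```

Returning soup.prettify() instead of the raw text is what makes the structure visible in the browser.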
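Then it takes some exploring and trial and error to find where the data sits in the document. Chrome F12 is still our good friend: once it shows that the data is inside a table element, bs4's search functions can easily pull it out. Getting closer and closer to the goal!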
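The search itself can be sketched as follows. The markup in SAMPLE_HTML is invented for illustration and does not reproduce the real page layout, and table_rows and its table_index parameter are hypothetical names; on the real page, finding the right index is the trial-and-error part.

```python
from bs4 import BeautifulSoup as bs

# Invented sample markup; the real page layout is not reproduced here.
SAMPLE_HTML = """
<div>
  <table class="layout"><tr><td>menu</td></tr></table>
  <table class="data">
    <tr><th>Contract</th><th>Buy</th><th>Sell</th></tr>
    <tr><td>TX</td><td>43,207</td><td>46,678</td></tr>
  </table>
</div>
"""


def table_rows(html: str, table_index: int = 0) -> list:
    """Return the cell texts of one table, row by row."""
    soup = bs(html, "html.parser")
    tables = soup.find_all("table")
    if table_index >= len(tables):
        return []
    rows = []
    for tr in tables[table_index].find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    # Dump every table with its index to see which one holds the data.
    for index in range(2):
        print(index, table_rows(SAMPLE_HTML, index))
```

Printing each table with its index is a cheap way to narrow down which one to keep.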
After repeated trial and error I finally closed in on the target. Touchdown! 43207 and 46678, checked against the page and correct.
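One detail that tends to come up at this last step is turning cell text into real numbers. The helper below is a hypothetical sketch; it assumes a cell may contain thousands separators or surrounding whitespace, which may or may not match the actual page.

```python
import re


def first_int(text: str) -> int:
    """Extract the first integer from a table cell, ignoring thousands separators."""
    match = re.search(r"-?\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group().replace(",", ""))


if __name__ == "__main__":
    print(first_int(" 43,207 "), first_int("46678"))
```

With integers in hand, the two values can be compared, summed, or stored without further cleanup.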
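It is too early to celebrate, because one last and crucial step remains: deploy to Azure! If it cannot go live, all the development is for nothing. I struggled miserably for a long time before the success screen finally appeared. After searching online and trying things at random, I added two packages to requirements.txt. I do not know why the local environment worked without them; I was tired and did not dig any further.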
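With that, my own scraper is now packaged into an Azure Function. Looking back at all the test runs, repeatedly opening a browser to check the result was still a bit of a hassle. When developing something complex, it is better to first write most of it in a more interactive tool such as Jupyter Notebook and then port it over. That is today's takeaway.
The next step will probably be writing to a database. To be continued.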
By Newman Chen 2022/3/2