
---
tags: Code
---
# Web Scraping

## Imports
```python=
import requests as req
from bs4 import BeautifulSoup
```

## Certificates (skip for now)
```python=
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
```
This turns off HTTPS certificate verification for the standard library's HTTPS clients (such as `urllib`), so use it only for practice. `requests` does its own verification and is not affected by this line.

## Fetching the page
```python=
data = req.get(url)  # url: address of the page to fetch
```

# Searching

## Parsing
First, use `BeautifulSoup` to parse the downloaded HTML.
```python=
root = BeautifulSoup(data.text, "html.parser")
```
The second argument selects the parser. Several are available (for example `html.parser`, `lxml`, `html5lib`); pick one.

![](https://i.imgur.com/GWwpFkp.png)
(https://www.jianshu.com/p/424e037c5dd8)

## Searching
There are basically two ways to search:
1. `find()`: returns the first matching tag, or `None` if nothing matches
2. `find_all()`: returns a list of every matching tag

```python=
first = root.find('div', class_="")
for search in root.find_all('div', class_=""):
    ...
```

## In class (3/8)
```python=
import requests as req
from bs4 import BeautifulSoup
data = req.get("https://dandanjudge.fdhs.tyc.edu.tw/ShowProblem?problemid=a001")
#print(data.text)
root = BeautifulSoup(data.text, 'html.parser')
#print(root)

#<div class="container">
root = root.find('div', class_="container")

#<div class="panel panel-default">
ProblemData = root.find_all('div', class_="panel panel-default")
#[ ..., ..., ...., ...]
print(ProblemData, len(ProblemData))

# string.replace(old, new)
# " ", "\n" (newline), "\t" (tab)
for PerdataId in range(len(ProblemData)):
    ProblemData[PerdataId] = ProblemData[PerdataId].text  # convert to a string

for Perdata in ProblemData:
    Perdata = Perdata.replace(" ", "")
    Perdata = Perdata.replace("\n", "")
    Perdata = Perdata.replace("\t", "")
    Perdata = Perdata.replace("\r", "")
    print(Perdata)
```

## Full version (once you are comfortable)
```python=
import requests as req
from bs4 import BeautifulSoup
import json
import os
import time

def WriteData(DataForWrite: str, file) -> None:
    # Writes a string to a file
    with open(file, 'w', encoding="utf-8") as f:
        f.write(DataForWrite)

def ReplaceStrings(Strings: str, Replace: list[str], ReplaceTo: str) -> str:
    # `Strings: str` is a type hint: the argument is expected to be a string
    # `def ... -> str` says the function is expected to return a string
    # Python does not enforce hints at run time, but they document intent
    # and let editors and type checkers catch mistakes
    for i in Replace:
        Strings = Strings.replace(i, ReplaceTo)
    # Same idea as the in-class code, with the replacements folded into one for loop
    return Strings

def ReplaceString(String: str, Replace: str, ReplaceTo: str) -> str:
    # A single replace; not very useful by itself, but a small example of how functions work
    return String.replace(Replace, ReplaceTo)

def GetProblemData(ProblemID: str, Tofile: str) -> None:
    """ 3/8 class, lines 3-10 """
    data = req.get(f"https://dandanjudge.fdhs.tyc.edu.tw/ShowProblem?problemid={ProblemID}")
    # The f prefix lets you put variables inside a string with {}
    webData = BeautifulSoup(data.text, "html.parser")
    # Every section we want shares the class "panel panel-default"
    webData = webData.find("body").find("div", class_="container")

    """ 3/8 class, lines 12-20 """
    webData = [i.text for i in webData.find_all("div", class_="panel panel-default")]
    # This one line is equivalent to lines 12-20 from class. It is fine if it looks
    # unfamiliar; it only shows how the code shrinks once you are comfortable.

    """ 3/8 class, lines 21-26 """
    #print(ReplaceStrings("h\t\te\n\n", ["\t", "\n"], ""))
    for i in range(len(webData)):
        webData[i] = ReplaceStrings(webData[i], ["\n", "\r", "\t", " "], "")
        #print(webData[i])

    """ Not covered in the 3/8 class """
    if len(webData) == 0:
        print("No Data")
        return

    # If the output folder does not exist, create it
    if not os.path.exists(Tofile):  # check whether the folder exists
        os.mkdir(Tofile)  # create a new folder

    WriteData(json.dumps(webData, indent=4, ensure_ascii=False), file=f"{Tofile}/{ProblemID}.json")
    # json.dumps converts a list, dict, etc. into a string
    # WriteData is the function defined above; it writes that string to a file
    # The f"{}" here builds the output file name

if __name__ == "__main__":
    for i in range(1, 10):
        time.sleep(0.05)
        print(f"Get Problem a{i:03}")
        GetProblemData(f"a{i:03}", "Data")
```

# Conclusion
That basically covers web scraping. Scraping is fairly formulaic; the basic flow is roughly as follows:

```graphviz
digraph {
    st[label="Observe the page"]
    sub1[label="Find what the targets have in common"]
    sub2[label="Fetch the source"]
    sub3[label="Parse the source and\nextract the needed information"]
    sub4[label="Tidy up the information"]
    end[label="Save it"]
    st -> sub1
    sub1 -> sub2
    sub2 -> sub3
    sub3 -> sub4
    sub4 -> end
}
```

The code usually does not get very long unless you are scraping a very large site, such as the Cambridge Dictionary. If you do want to scrape a site like that, you will need a bit of disguise, or you will quickly get kicked out. Finally, scrape **in moderation** and **legally**: going overboard is not only disrespectful to the site, in serious cases it may even be **against the law**.
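
The two closing points, a bit of disguise and scraping in moderation, can be sketched roughly as below. This is only an illustration under stated assumptions: `crawl_politely`, `HEADERS`, `URL_TEMPLATE`, the `User-Agent` string, and the one-second default delay are all made up for this example, and whether a given site checks the `User-Agent` at all is a guess. The download is passed in as a `fetch` function so the pacing logic can be tried without touching the network.

```python
import time

# Assumption: an illustrative User-Agent string, not something the site requires.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; study-crawler)"}
URL_TEMPLATE = "https://dandanjudge.fdhs.tyc.edu.tw/ShowProblem?problemid={}"


def crawl_politely(problem_ids, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each problem page in turn, pausing `delay` seconds between requests.

    `fetch(url, headers)` performs the actual download and returns the page text.
    """
    pages = {}
    for index, problem_id in enumerate(problem_ids):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages[problem_id] = fetch(URL_TEMPLATE.format(problem_id), HEADERS)
    return pages
```

With `requests`, the `fetch` argument could be something like `lambda url, headers: req.get(url, headers=headers, timeout=10).text`. The returned dictionary maps each problem ID to its page text, which can then go through the same `BeautifulSoup` steps as above.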