Introduction to Web Scraping with Python

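---
title: Introduction to Web Scraping with Python
description: NCHU Information Technology Club, themed club class of the programming sharing sessions, semester 1101
tags: Python
---

# Python Web Scraping
> [name=Tatara][time=110,12,23]

## What is a web crawler? 🤔
When you open a web page in a browser, the browser actually sends a **request (`request`)** to the site's server. The server sends back a **response (`response`)**, and the browser parses and renders that data into the website you are used to seeing. A web crawler extracts the specific data we want from the server's response and automates the whole process.

## Request & Response
![](https://i.imgur.com/CsBcWU9.png)
![](https://i.imgur.com/rbP8KNM.png)

## HTTP & HTTPS
HTTP stands for HyperText Transfer Protocol. It is the standard that defines how a client makes requests and how a server responds; the data itself is carried over TCP.
The S in HTTPS stands for Secure: the same protocol sent over an encrypted connection.

## About HTML
<table><tr><td><b>H</b>yper<b>T</b>ext <b>M</b>arkup <b>L</b>anguage, abbreviated HTML, is a standard <b>markup language</b> for creating web pages.<br>Browsers can read HTML files and render them into <b>visual</b> web pages.<br><h6>HTML describes the structure of a website semantically along with cues for presentation, making it a markup language rather than <em>a programming language</em>.</h6></td><td><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/HTML.svg/800px-HTML.svg.png'></td></tr></table>

The data we fetch looks like the HTML file in the image on the right. Our goal is to find which tag holds the data we need.

#### Not sure how HTTP and HTML differ?
https://www.geeksforgeeks.org/difference-between-html-and-http/

## Request `HTTP Method`
HTTP defines [nine](https://developer.mozilla.org/zh-TW/docs/Web/HTTP/Methods) request methods.
The most commonly used are **`GET`** and **`POST`**.

### `GET` Method
Requests data from a specified resource, similar to a query operation.

Take Google Search as an example. First open the Google Search page (this already makes one request):
https://www.google.com/search

Press `F12` to see the `GET` request we sent (you need to reload the page).

`GET` parameters are appended to the URL:
`https://URL?parameter=value`

A `GET` request can carry [headers](https://zh.wikipedia.org/wiki/HTTP%E5%A4%B4%E5%AD%97%E6%AE%B5), [cookies](https://zh.wikipedia.org/wiki/Cookie), and [params](https://en.wikipedia.org/wiki/Query_string).

A `get` with `params`:
https://www.google.com/search?q=klaire_kriel
###### As for why the URL needs `search`: Google's server has a page named `search` that serves `get` requests for searches. When a `get` request carries no `params`, it automatically redirects to another page named `webhp`.

### `POST` Method
Submits data to be processed, similar to an update operation. When the data being submitted is sensitive, the `POST` method is used so that the params are wrapped inside the request instead of appearing in the URL.

A `POST` request can carry headers, [cookies](https://zh.wikipedia.org/wiki/Cookie), and [data](https://en.wikipedia.org/wiki/POST_(HTTP)#Use_for_submitting_web_forms).

## Python library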
[`requests`](https://requests.readthedocs.io/zh_CN/latest/api.html)

Use the Python requests library to send requests to a server.

### Installation
```
pip install requests
```

### Syntax
[More methods and syntax](https://www.w3schools.com/python/module_requests.asp)

#### GET
```python
requests.get(url[,headers,cookies,params,...])
```

#### POST
```python
requests.post(url[,headers,cookies,data,...])
```
`[ ]`: optional

To save some typing, you can also write it this way:
```python
from requests import request
request("get",url[,headers,cookies,params,...])
```

## Fetching data from a website
#### A plain `GET`
Open Colab and fetch the [PTT hot boards page](https://www.ptt.cc/bbs/).
```python
import requests

# request() can be called directly; the first argument selects the method (get, post)
response = requests.request("GET", url='https://www.ptt.cc/bbs/')
print(response.text)   # print the data, i.e. the web page as text
print(type(response))  # look at its type
print(vars(response))  # look at its attributes
```

#### A `get` with `params`
```python
import requests

url = 'https://www.google.com/search'
payload = {'q': 'klaire_kriel'}  # dict
response = requests.request("GET", url=url, params=payload)  # keyword arguments
print(response.text)
```

#### Checking the status code returned by the server
[HTTP status codes](https://zh.wikipedia.org/zh-tw/HTTP%E7%8A%B6%E6%80%81%E7%A0%81)
```python
print(response.status_code)  # 200 ok
```

#### Checking the server status before reading the data
The Python requests library names the status attribute of a Response `status_code`.
```python
if response.status_code == requests.codes.ok:
    print(response.text)
```

Is the printed text long and ugly?

## Fetching and parsing web pages with Beautiful Soup
Beautiful Soup is a Python library that quickly parses a page's HTML so we can extract the data we care about and discard the rest. Install it first as before, although Colab already has it installed.
```
pip install beautifulsoup4
```

#### Beautiful Soup basics
Let's start with a simple HTML document to see what bs4 can do.
```python
# import the Beautiful Soup module
from bs4 import BeautifulSoup as bs

# raw HTML; assume we already have the file
html_doc = """
<html><head><title>前進吧！高捷少女</title></head>
<body><h2>K.R.T. 
GIRLS</h2>
<p>小穹</p>
<p>艾米莉亞</p>
<p>婕兒</p>
<p>耐耐</p>
<a id="link1" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E5%B0%8F%E7%A9%B9">Link 1</a>
<a id="link2" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E8%89%BE%E7%B1%B3%E8%8E%89%E4%BA%9E">Link 2</a>
<a id="link3" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E5%A9%95%E5%85%92">Link 3</a>
<a id="link4" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E8%80%90%E8%80%90">Link 4</a>
</body></html>
"""

# parse the HTML with Beautiful Soup
# first argument: the HTML text; second argument: which parser to use
soup = bs(html_doc, 'html.parser')
print(soup)
```

#### Finding elements in a web page
`Ctrl + Shift + I`
![](https://i.imgur.com/haX51a0.png)

## Getting a node's text content
#### Getting a tag's content
Using a tag name as an attribute, for example `soup.title`, returns the first tag with that name.
```python
web_title = soup.title   # get the page title
print(web_title)
print(web_title.string)  # read its text as a string
```

#### Searching for elements
Select every element (tag node) that matches the condition: [find_all()](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#find-all)
```python
my_girls = soup.find_all('p')  # <p></p> tags
print(my_girls)
```
Select the first element (tag node) that matches the condition: [find](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#find)
```python
my_girls = soup.find('p')
```

#### Keyword arguments
Return the element whose attribute matches the keyword.
```python
my_link = soup.find(id='link1')
```

:::spoiler Exercise 1
Add one line to the html_doc example above so that it prints all elements that have a link.
```python
soup = soup.find_all(id=True)  # every link in this example has an id
```
:::

```python
from bs4 import BeautifulSoup as bs

html_doc = """
<html><head><title>前進吧！高捷少女</title></head>
<body><h2>K.R.T. 
GIRLS</h2>
<p>小穹</p>
<p>艾米莉亞</p>
<p>婕兒</p>
<p>耐耐</p>
<a id="link1" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E5%B0%8F%E7%A9%B9">Link 1</a>
<a id="link2" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E8%89%BE%E7%B1%B3%E8%8E%89%E4%BA%9E">Link 2</a>
<a id="link3" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E5%A9%95%E5%85%92">Link 3</a>
<a id="link4" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E8%80%90%E8%80%90">Link 4</a>
</body></html>
"""
soup = bs(html_doc, 'html.parser')
# add your line here
print(soup)
```

[BeautifulSoup official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)

:::spoiler Exercise 2
Find the hyperlinks on the Google home page and print them.
`Hint: the get() method in the docs` https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/index.html?highlight=get
```python
from requests import get
from bs4 import BeautifulSoup as bs

response = get('https://www.google.com/')
# .text reads the text attribute of the response, i.e. our Hyper'Text'
soup = bs(response.text, 'html.parser')
# find_all returns every matching tag as a list
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
```
:::

## CSS selectors
###### *[See BS Doc - CSS selectors (Simplified Chinese)](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id42)*<br>*[See w3schools - CSS selectors](https://www.w3schools.com/css/css_selectors.asp)*

#### Want it prettier?
```python
print(soup.prettify())
```

## Scraping images from PTT's Beauty board
https://www.ptt.cc/bbs/Beauty/index.html
![](https://i.imgur.com/GzgXFTE.png)

Think about it: how can a crawler get past the OVER 18 check?
Open F12 and see what happens.

What if we just `get` the page directly?
```python
import requests
from bs4 import BeautifulSoup as bs

u = 'https://www.ptt.cc/bbs/Beauty/index.html'
r = requests.get(u)
soup = bs(r.text, 'html.parser')
print(soup)
```
You will find that we are not yet 18.

:::spoiler Exercise 3
Fetch the page by passing the `cookies` parameter.
```python
import requests
from bs4 import BeautifulSoup as bs

u = 'https://www.ptt.cc/bbs/Beauty/index.html'
d = {"over18": '1'}
r = requests.get(url=u, cookies=d)
soup = bs(r.text, 'html.parser')
print(soup)
```
:::

Good, now we are over 18. Next, grab the URLs of the images in the first post on the Beauty board.

:::spoiler Exercise 4
```python
import requests
from bs4 import BeautifulSoup as bs

u = 'https://www.ptt.cc/bbs/Beauty/M.1640085446.A.E38.html'
d = {"over18": '1'}
r = requests.get(u, cookies=d)  # reading a post is a GET, same as in Exercise 3
soup = bs(r.text, 'html.parser')
img = soup.find_all('img')
for link in img:
    print(link.get('src'))
```
:::

## Supplement
### File I/O (taught later): downloading the images
```python
import requests
from bs4 import BeautifulSoup as bs

urllist = []
u = 'https://www.ptt.cc/bbs/Beauty/M.1640085446.A.E38.html'
d = {"over18": '1'}
r = requests.get(u, cookies=d)
soup = bs(r.text, 'html.parser')
img = soup.find_all('img')
for link in img:
    urllist.append(link.get('src'))  # collect every image URL
for i, url in enumerate(urllist):
    with open(f'{i}.jpg', 'wb') as f:  # save as 0.jpg, 1.jpg, ...
        f.write(requests.get(url).content)
```

## Homework: analyzing the course selection pages
https://reurl.cc/35dnvV

Download the archive at the URL, unzip it, and practice parsing the pages:

- 通識加選-選擇.html : build a dictionary of the form "選課號碼":["v_click":"XXX","課程名稱":"XXXX","授課教師":"XXX","上課時間":"XXX","可選餘額":"XX"]
- 通識加選-確認.html : build a dictionary of the form "選課號碼":["v_click":"XXX","課程名稱":"XXXX","授課教師":"XXX","上課時間":"XXX",**"可選餘額":"XX"**], where 可選餘額 **must be computed yourself**
- 通識加選-完成.html : build a dictionary of the form "選課號碼":["v_click":"XXX","課程名稱":"XXXX","授課教師":"XXX","上課時間":"XXX","可選餘額":"XX",**"結果說明":"XXX"**], where 可選餘額 **must be computed yourself**

Key meanings: 選課號碼 = course number, 課程名稱 = course name, 授課教師 = instructor, 上課時間 = class time, 可選餘額 = seats remaining, 結果說明 = result description.

The URLs are given in a comment at the top of each HTML file.
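
As a starting point for the homework, here is a minimal sketch of turning table rows into a dictionary keyed by course number. The table layout, the column order, the English key names, and the `parse_courses` helper are all assumptions made up for illustration; the real homework files will be structured differently, so inspect them with F12 first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch; not the real course page.
sample_html = """
<table id="courses">
  <tr><th>No.</th><th>Name</th><th>Teacher</th><th>Time</th><th>Capacity</th><th>Enrolled</th></tr>
  <tr><td>0001</td><td>Intro to Music</td><td>Wang</td><td>Mon 3-4</td><td>60</td><td>58</td></tr>
  <tr><td>0002</td><td>World History</td><td>Lin</td><td>Tue 5-6</td><td>50</td><td>50</td></tr>
</table>
"""

def parse_courses(html):
    """Return {course number: {field: value}} for every data row in the table."""
    soup = BeautifulSoup(html, "html.parser")
    courses = {}
    for row in soup.find_all("tr"):
        # The header row contains only <th> cells, so it yields an empty list.
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != 6:
            continue
        number, name, teacher, time, capacity, enrolled = cells
        courses[number] = {
            "name": name,
            "teacher": teacher,
            "time": time,
            # Seats remaining is not in the table, so compute it.
            "remaining": int(capacity) - int(enrolled),
        }
    return courses

courses = parse_courses(sample_html)
print(courses)
```

The same loop works on a page fetched with `requests`: pass `response.text` instead of `sample_html`, and adjust the cell count and field names to match the real columns.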