中興大學資訊社
---
title: Introduction to Python Web Crawling
description: NCHU Information Research Club, 1101 semester programming study-group topic
tags: Python
---

###### [Python Tutorials/](/@NCHUIT/py)

# Python Web Crawling

> [name=Tatara][time=110,12,23]

## What is a crawler? 🤔

When you open a web page in a browser, the browser actually sends a **request** to the server; the server returns a **response**, which the browser parses and renders into the familiar page. A web crawler extracts the specific data we want from the server's response, and automates the whole process.

## Request & Response

![](https://i.imgur.com/CsBcWU9.png)
![](https://i.imgur.com/rbP8KNM.png)

## HTTP & HTTPS

HTTP stands for HyperText Transfer Protocol. It standardizes how a client's requests and a server's responses are formed, and it uses TCP as the underlying transport. The S in HTTPS stands for Secure.

## About HTML

<table><tr><td><b>H</b>yper<b>T</b>ext <b>M</b>arkup <b>L</b>anguage (HTML) is the standard <b>markup language</b> for building web pages.<br>A browser can read an HTML file and render it into a <b>visual</b> page.<br><h6>HTML describes the structure and semantics of a site, which makes it a markup language rather than a <em>programming language</em>.</h6></td><td><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/HTML.svg/800px-HTML.svg.png'></td></tr></table>

The data we fetch looks like the HTML file on the right; our goal is to find which tags contain the data we need.

#### Not sure how HTTP differs from HTML?
https://www.geeksforgeeks.org/difference-between-html-and-http/

## Request: `HTTP Method`

HTTP defines [nine request methods](https://developer.mozilla.org/zh-TW/docs/Web/HTTP/Methods); the two most common are **`GET`** and **`POST`**.

### `GET` Method

Requests data from a specified resource, similar to a query operation.

Take Google Search as an example. First open the search page (this already performs one request):
https://www.google.com/search

Press `F12` to see the `GET` request we sent (reload the page first).
`GET` parameters are appended after the URL:
`https://host?param=value`

[headers](https://zh.wikipedia.org/wiki/HTTP%E5%A4%B4%E5%AD%97%E6%AE%B5), [cookies](https://zh.wikipedia.org/wiki/Cookie), [params (EN)](https://en.wikipedia.org/wiki/Query_string)

A `GET` carrying `params`:
https://www.google.com/search?q=klaire_kriel

###### Why the `search` path? Google's server has a page named `search` dedicated to serving search `GET` requests; when a `GET` request arrives without `params`, it redirects to another page named `webhp`.

### `POST` Method

Submits data to be processed, similar to an update operation. When the data to send is sensitive, `POST` is used so the params are wrapped in the request body instead of the URL.

`POST` params: headers, [cookies](https://zh.wikipedia.org/wiki/Cookie), [data (EN)](https://en.wikipedia.org/wiki/POST_(HTTP)#Use_for_submitting_web_forms)

## Python Libraries
[`requests`](https://requests.readthedocs.io/zh_CN/latest/api.html)

Use the Python `requests` library to send requests to a server.

### Installation

```
pip install requests
```

### Syntax

[More methods and syntax (EN)](https://www.w3schools.com/python/module_requests.asp)

#### GET

```python=
requests.get(url[,headers,cookies,params,...])
```

#### POST

```python=
requests.post(url[,headers,cookies,data,...])
```

`[ ]`: optional

To save some effort, you can also write:

```python=
from requests import request
request("get",url[,headers,cookies,params,...])
```

## Fetching site data

#### A plain `GET`

Open Colab and fetch the [PTT popular boards page](https://www.ptt.cc/bbs/):

```python=
import requests
response = requests.request("GET", url='https://www.ptt.cc/bbs/') # the request function lets you pick the method (get, post) as an argument
print(response.text)   # print the data, i.e. the page as text
print(type(response))  # inspect its type
print(vars(response))  # inspect its attributes
```

#### A `GET` carrying `params`

```python=
import requests
url = 'https://www.google.com/search'
payload = { 'q':'klaire_kriel' } # dict
response = requests.request("GET", url=url, params=payload) # keyword arguments
print(response.text)
```

#### Checking the status code returned by the server

[HTTP status codes](https://zh.wikipedia.org/zh-tw/HTTP%E7%8A%B6%E6%80%81%E7%A0%81)

```python=
print(response.status_code) # 200 ok
```

#### Fetch only after checking the server status

In the `requests` library, the status of a `Response` is stored in the attribute named `status_code`.

```python=
if response.status_code == requests.codes.ok:
    print(response.text)
```

Is the printed text long and ugly?

## Fetching and parsing pages with Beautiful Soup

Beautiful Soup is a Python library that quickly parses a page's HTML, letting us extract the data we are interested in and discard the rest.

Install it first (Colab already has it preinstalled):

```
pip install bs4
```

#### Beautiful Soup basics

Let's look at bs4's features with a simple HTML file:

```python=
# import the Beautiful Soup module
from bs4 import BeautifulSoup as bs

# raw HTML; assume we already have the file
html_doc = """
<html><head><title>前進吧!高捷少女</title></head>
<body><h2>K.R.T. GIRLS</h2>
<p>小穹</p>
<p>艾米莉亞</p>
<p>婕兒</p>
<p>耐耐</p>
<a id="link1" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E5%B0%8F%E7%A9%B9">Link 1</a>
<a id="link2" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E8%89%BE%E7%B1%B3%E8%8E%89%E4%BA%9E">Link 2</a>
<a id="link3" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E5%A9%95%E5%85%92">Link 3</a>
<a id="link4" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E8%80%90%E8%80%90">Link 4</a>
</body></html>
"""

# parse the HTML with Beautiful Soup:
# the first argument (html_doc) is the HTML text,
# the second ('html.parser') picks the parser used to analyze it
soup = bs(html_doc, 'html.parser')
print(soup)
```

#### Finding elements on a page

`Ctrl + Shift + I`

![](https://i.imgur.com/haX51a0.png)

## Getting a node's text content

#### Getting tag content

The `name` approach looks up all tags named `name`.

```python=
web_title = soup.title  # get the page title
print(web_title)
print(web_title.string) # access it as a string
```

#### Finding elements

Select every element (tag node) matching a condition: [find_all()](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#find-all)

```python=
my_girls = soup.find_all('p') # <p></p> tags
print(my_girls)
```

Select the first element (tag node) matching a condition: [find](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#find)

```python=
my_girls = soup.find('p')
```

#### keyword arguments

Return the elements that have the given keyword:

```python=
my_link = soup.find(id='link1')
```

:::spoiler Exercise 1
Add one line to the html_doc example above that prints every element that has a link.

```python=
soup = soup.find_all(href=True) # every tag carrying an href, i.e. the links
```
:::

```python=
from bs4 import BeautifulSoup as bs
html_doc = """
<html><head><title>前進吧!高捷少女</title></head>
<body><h2>K.R.T. GIRLS</h2>
<p>小穹</p>
<p>艾米莉亞</p>
<p>婕兒</p>
<p>耐耐</p>
<a id="link1" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E5%B0%8F%E7%A9%B9">Link 1</a>
<a id="link2" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E8%89%BE%E7%B1%B3%E8%8E%89%E4%BA%9E">Link 2</a>
<a id="link3" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E5%A9%95%E5%85%92">Link 3</a>
<a id="link4" href="https://zh.wikipedia.org/wiki/%E9%AB%98%E6%8D%B7%E5%B0%91%E5%A5%B3#%E8%80%90%E8%80%90">Link 4</a>
</body></html>
"""
soup = bs(html_doc, 'html.parser')
# add it here
print(soup)
```

[BeautifulSoup official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)

:::spoiler Exercise 2
Find the hyperlink targets on the Google home page and print them. `Hint: the get() method in the docs` https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/index.html?highlight=get

```python=
from requests import get
from bs4 import BeautifulSoup as bs
response = get('https://www.google.com/')
soup = bs(response.text, 'html.parser') # .text is the text attribute of the response, i.e. our Hyper'Text'
links = soup.find_all('a') # find_all returns every matching tag as a list
for link in links:
    print(link.get('href'))
```
:::

## CSS selectors

###### *[See BS Doc - CSS selectors (zh-CN)](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id42)*<br>*[See w3schools - CSS selectors (EN)](https://www.w3schools.com/css/css_selectors.asp)*

#### Prettier output?

```python=
print(soup.prettify())
```

## Scraping images from PTT's Beauty board

https://www.ptt.cc/bbs/Beauty/index.html

![](https://i.imgur.com/GzgXFTE.png)

Think about it: how do we get past the OVER 18 check from inside a crawler?
Open F12 and see what happens.
What if we just `GET` it directly?
```python=
import requests
from bs4 import BeautifulSoup as bs
u = 'https://www.ptt.cc/bbs/Beauty/index.html'
r = requests.get(u)
soup = bs(r.text, 'html.parser')
print(soup)
```

It turns out we are not 18 yet.

:::spoiler Exercise 3
Fetch the page with a `cookies` argument:

```python=
import requests
from bs4 import BeautifulSoup as bs
u = 'https://www.ptt.cc/bbs/Beauty/index.html'
d = {"over18": '1'}
r = requests.get(url=u, cookies=d)
soup = bs(r.text, 'html.parser')
print(soup)
```
:::

Good, now we are 18. Next, grab the image URLs from the first article on the Beauty board.

:::spoiler Exercise 4
```python=
import requests
from bs4 import BeautifulSoup as bs
u = 'https://www.ptt.cc/bbs/Beauty/M.1640085446.A.E38.html'
d = {"over18": '1'}
r = requests.get(u, cookies=d)
soup = bs(r.text, 'html.parser')
img = soup.find_all('img')
for link in img:
    print(link.get('src'))
```
:::

## Extras

### File I/O (taught later): downloading the images

```python=
import requests
from bs4 import BeautifulSoup as bs

urllist = []
u = 'https://www.ptt.cc/bbs/Beauty/M.1640085446.A.E38.html'
d = {"over18": '1'}
r = requests.get(u, cookies=d)
soup = bs(r.text, 'html.parser')
img = soup.find_all('img')
for link in img:
    urllist.append(link.get('src'))
for i, url in enumerate(urllist):
    with open(f'{i}.jpg', 'wb') as f:
        f.write(requests.get(url).content)
```

## Homework: course-selection page analysis

https://reurl.cc/35dnvV

Download the archive at the URL, extract it, and practice parsing the pages:

通識加選-選擇.html: build a dict of the form 「"選課號碼":["v_click":"XXX","課程名稱":"XXXX","授課教師":"XXX","上課時間":"XXX","可選餘額":"XX"]」

通識加選-確認.html: build a dict of the form 「"選課號碼":["v_click":"XXX","課程名稱":"XXXX","授課教師":"XXX","上課時間":"XXX",**"可選餘額":"XX"**]」, where the remaining-seat count (可選餘額) **must be computed yourself**

通識加選-完成.html: build a dict of the form 「"選課號碼":["v_click":"XXX","課程名稱":"XXXX","授課教師":"XXX","上課時間":"XXX","可選餘額":"XX",**"結果說明":"XXX"**]」, where 可選餘額 **must be computed yourself**

The URLs are annotated in a comment at the top of each HTML file.
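The homework boils down to walking a table's rows and turning each row's cells into a dict entry keyed by course ID. Here is a minimal sketch of that pattern under an assumed layout: the table below is made up (the real 通識加選 pages have their own columns, and the 可選餘額 computation is left to you), so treat the cell indices and column names as hypothetical and adjust them to the actual files.

```python
from bs4 import BeautifulSoup

# Hypothetical course table -- stands in for the real 通識加選 pages
html_doc = """
<table>
<tr><th>選課號碼</th><th>課程名稱</th><th>授課教師</th><th>可選餘額</th></tr>
<tr><td>1001</td><td>程式設計</td><td>王老師</td><td>12</td></tr>
<tr><td>1002</td><td>網路概論</td><td>林老師</td><td>3</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
courses = {}
for row in soup.find_all('tr')[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    courses[cells[0]] = {'課程名稱': cells[1],
                         '授課教師': cells[2],
                         '可選餘額': cells[3]}

print(courses['1001']['課程名稱'])  # 程式設計
```

The same `find_all('tr')` / `find_all('td')` walk applies to the homework files once you have matched the cell indices to their actual column order.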
