Web Crawlers

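![](https://i.imgur.com/APEmwRI.png =30%x)

# Web Crawlers

## Web Crawling

---

## What Is a Crawler?

- A program that automatically and repeatedly browses web pages and extracts data from websites
- Applications: search engines, stock market analysis, ticket-grabbing bots, etc.

![](https://i.imgur.com/66ASF52.png =100x100)![](https://i.imgur.com/BzBifvi.png =100x100)![](https://i.imgur.com/2iQPTe9.png =100x100)![](https://i.imgur.com/W0QdreT.png =100x100)![](https://i.imgur.com/riOcXIh.png =100x100)![](https://i.imgur.com/FwTaymK.png =100x100)![](https://i.imgur.com/WyZmeFD.png =100x100)![](https://i.imgur.com/Y7sT3kY.png =100x100)

Note: Have you heard of crawlers? Web crawling is the process of using a program to automatically fetch data from websites. In this age of information overload, collecting data is important work, but collecting website data by hand is inefficient and takes a lot of time. A web crawler can help with collecting and organizing data: once we define the rules, the crawler follows them to collect and extract data and arranges it into the format we need, such as an Excel spreadsheet, a CSV file, or a database. In short, a crawler does the tedious things you don't want to do, such as going to PTT and fetching articles for you one by one. Crawlers have a huge number of applications, and today's class only scratches the surface. If you're interested, you can search for more on Google.

----

## Introduction to Crawling

- Go to the [PTT movie board](https://www.ptt.cc/bbs/movie/index.html)![](https://i.imgur.com/q4dJeQ4.png =200x110)
- Extract data from the page source
- Right-click >> Inspect (the structure and skeleton of the page)

Note: Now let's look directly at what the source of the PTT movie board looks like. First, please open it in Chrome so that the interface matches ours. Then, to extract data from the page source, right-click and choose Inspect to see the page's structure and skeleton.

----

## Right-click >> Inspect

![](https://i.imgur.com/8ap9cnb.png =400x500)

Note: Don't worry, we'll demo this live in a moment.

----

### The code under the mouse pointer corresponds to a block on the page

![](https://i.imgur.com/l6gStkP.png)

----

### The code under the mouse pointer corresponds to a block on the page

![](https://i.imgur.com/Jiiy4id.png)

----

## Page Navigation

![](https://i.imgur.com/wBCIZUJ.png)

----

### Clicking the left side and the right side jumps to the same URL

![](https://i.imgur.com/ocXAGHk.png)

----

![](https://i.imgur.com/FGBihpS.png)

----

![](https://i.imgur.com/FQcd2DE.png)

note: After the demo, team counselors and team members try it themselves for 10 minutes.

----

## What Is HTML?

- HyperText Markup Language
- A markup language, not a programming language in the usual sense
- Tells the browser how to present your page![](https://i.imgur.com/jEsvLRu.png =50x50)
- Made up of "tags" and "content" ![](https://i.imgur.com/RIVK8r9.png =50x50)

Note: HTML (HyperText Markup Language) is the basic language for writing web pages. It is usually used together with CSS and JavaScript to build web applications, but today's topic is not web development. The goal is only to understand the content you are crawling, so knowing HTML is enough, and CSS and JavaScript will not be covered. HTML is a markup language rather than the kind of programming language you may be more familiar with. Web browsers such as Chrome, IE, Safari, and Firefox know how to read an HTML file and render the content it describes on the page.

----

For example, look at this sentence: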
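```htmlembedded=
I love coding very much
```

----

Wrap it in paragraph tags and it becomes a paragraph element:

```htmlembedded=
<p>I love coding very much</p>
```

----

## Anatomy of an HTML Element

```htmlembedded=
   ⬇️ content
<p>I love coding very much</p> ⬅️ closing tag
⬆️ opening tag
```

Note: Opening tag: the element name goes inside <>, marking where the element starts. Closing tag: the element name goes inside </>, marking where the element ends. "I love coding very much" is the content we want to show on the page. It sits between the opening and closing tags and is the element's content. An element consists of an opening tag, a closing tag, and content.

----

## Get a taste!

----

![](https://i.imgur.com/bmgOWxj.png =600x500)

Note: You can see that this simple page has a heading, an image, a list, and a sentence containing a hyperlink. Let's look at the source right away.

----

## Structure of an HTML Document

![](https://i.imgur.com/0Kel2mu.png =450x550)

Note: As you can see, an HTML document is wrapped layer by layer. This is a nested structure, which is covered in more detail later. For now you only need to understand this diagram. An HTML document has two main parts: head and body. Information describing the page goes in the head tag, and the content shown to the user goes in the body tag.

----

## Structure of an HTML Document

```htmlembedded=
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>My test page</title>
  </head>
  <body>
    <h1>What are you going to eat for breakfast?</h1>
    <img src="https://ppt.cc/fXP2Vx@.jpg" alt="A strange creature combining cat and bao-zi." />
    Nice and comfortable breakfast should consist
    <ul>
      <li>cat</li>
      <li>bao-zi</li>
      <li>cream</li>
    </ul>
    Read the <a href="https://zh.wikipedia.org/wiki/%E7%A5%9E%E9%AD%94%E4%B9%8B%E5%A1%94">Breakfast Manifesto</a> to learn even more about the values and principles that guide the pursuit of our mission.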
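  </body>
</html>
```

note: First, the whole HTML file is wrapped in the html tag, which contains the entire document structure. The "head" is delimited by the head tag and mostly holds information about the page as a whole, which is usually not shown directly on the page. In this example the head also contains title, which gives the page title. The title "My test page" appears in the text at the top of the browser window, and also in the label people see when they bookmark the page. The "body" is delimited by the \<body> and \</body> tags and contains everything displayed directly in the browser window. Headings, lists, and hyperlinks are covered later.

----

## Common tag names:

| Tag | Description |
| --------------- | -------------------------- |
| \<head>\</head> | Basic information about the page |
| \<body>\</body> | Contains the whole content of the page |
| \<div>\</div> | Contains a whole block (division) |
| \<p>\</p> | Displays a paragraph of text (paragraph) |
| \<a>\</a> | Displays a link (anchor) |
| \<img> | Displays an image |

----

## Headings

```htmlembedded=
<h1>Watermelon</h1>
<h2>Pineapple</h2>
<h3>Apple</h3>
<h4>Cherry</h4>
```

----

HTML has up to six heading levels, \<h1> to \<h6>. We usually use only three or four. The example has four levels: Watermelon, Pineapple, Apple, and Cherry.

----

<h1>Watermelon</h1>
<h2>Pineapple</h2>
<h3>Apple</h3>
<h4>Cherry</h4>

![](https://i.imgur.com/qSQBKMI.png =100x100)![](https://i.imgur.com/KOFarzp.png =100x100)![](https://i.imgur.com/H5De3Ef.png =100x100)![](https://i.imgur.com/xofXGRI.png =100x100)

----

## HTML attributes

```htmlembedded=
I love coding very much
⬆️ attribute ⬆️ attribute value
```

Note: There is a space between the element name and the attribute. The attribute name is followed by an equals sign "=", and the attribute value is wrapped in double quotes. Attributes sit inside the opening tag, as the example shows, and they are a syntax that lets an HTML element pass information to other code more easily. HTML attributes are usually written as attribute-name="attribute-value". If you learned Python dictionaries yesterday, or already know C++ map, the key = value idea should be familiar, but for beginners the way attributes are written can be hard to grasp. Put simply, attributes are like smaller parts on an HTML element: if the element is a car tire, the attributes are the tread and the bolts. These small parts have their own names, but you still decide what style to use.

----

## An HTML Analogy

![](https://i.imgur.com/w6TbYli.png)

----

## An HTML Analogy

|HTML|Analogy|
|---|---|
|Tag name|Kind of box (p, div, a, etc.)|
|Attribute name, attribute value|Features of the box|
|Opening tag|Lid of the box|
|Closing tag|Bottom of the box|
|Tag content|Contents of the box|

----

## Common attributes (features of the box):

|Attribute|Description|
|---|---|
|id|Identifier|
|class|Category|
|href|URL the link points to|
|src|Path of the image|

Note: Every HTML element can have id and class attributes. The id attribute names the element, and an id should not be repeated within a page. The class attribute assigns the element to a particular category. Usually many elements belong to the same category, meaning they share the same class value.

----

## Empty elements

```htmlembedded=
<img src="https://ppt.cc/fXP2Vx@.jpg" alt="A strange creature combining cat and bao-zi."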
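>
```

Note: A normal HTML element has an opening tag paired with a closing tag, with some content between them. An empty element is special: it has no closing tag and no content, because everything it needs is already in the opening tag.

----

<img src="https://ppt.cc/fXP2Vx@.jpg" alt="A strange creature combining cat and bao-zi." >

----

## Common empty elements

| Tag | Description |
| -------- | -------------------------------------------------------- |
| \<br> | Forced line break |
| \<img> | Image |
| \<input> | Form input element |
| \<link> | Specifies the relationship between an external resource and the current document |
| \<meta> | Provides metadata about the page |
| \<hr> | Horizontal rule |

----

## When an image can't be displayed

----

<img src="???" alt="A strange creature combining cat and bao-zi." />

----

## The "alt" attribute

----

- Many visually impaired visitors use tools called screen readers, which rely on the alt text to convey what an image shows.
- When something goes wrong and the image fails to load, the alt text is shown in its place.

```htmlembedded=
<img src="???" alt="A strange creature combining cat and bao-zi." />
```

----

![](https://i.imgur.com/bmgOWxj.png =600x500)

Note: Now that we've covered headings and images, let's keep going.

----

## Lists

- Unordered lists: the order of the items doesn't matter, for example a shopping list. Items are wrapped in \<ul>.
- Ordered lists: the order of the items is meaningful, for example the steps in a recipe. Items are wrapped in \<ol>.

![](https://i.imgur.com/sdXOhfZ.png)

Note: These are the two most common kinds of list. A list is made up of at least two elements: the list element itself, with each item placed in its own \<li> (list item) element.

----

## Turn the following text into a list:

```htmlembedded=
Nice and comfortable breakfast should consist cat, bao-zi, and cream.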
```

🔽🔽🔽

```htmlembedded=
Nice and comfortable breakfast should consist
<ul>
  <li>cat</li>
  <li>bao-zi</li>
  <li>cream</li>
</ul>
```

----

## Lists

Nice and comfortable breakfast should consist
<ul>
  <li>cat</li>
  <li>bao-zi</li>
  <li>cream</li>
</ul>

![](https://i.imgur.com/JxxsQ9F.png =70x140)![](https://i.imgur.com/O6Haykb.png =70x140)![](https://i.imgur.com/CHrWmFc.png =70x140)![](https://i.imgur.com/2oG5tny.png =70x140)

----

## Tables

- The HTML \<table> element represents tabular data
- Presents information in a two-dimensional table![](https://i.imgur.com/N7zDWdh.png =100x100)
- Early on, tables were used for both page layout and data presentation. Once div and newer HTML arrived, they gradually returned to being used for presenting data.
- **The table is here to stay; only the way we use it has changed.**

----

```htmlembedded=
<table>
  <tr>
    <th>Department</th>
    <th>Student Name</th>
    <th>Grade</th>
  </tr>
  <tr>
    <td align='right'>Computer Science</td>
    <td align='right'>Mao Bao Zi</td>
    <td align='right'>100</td>
  </tr>
  <tr>
    <td align='right'>Electronic Engineering</td>
    <td align='right'>Yi Ke Qiu Qiu</td>
    <td align='right'>59</td>
  </tr>
</table>
```

----

<table> <tr> <th>Department</th> <th>Student Name</th> <th>Grade</th> </tr> <tr> <td align='right'>Computer Science</td> <td align='right'>Mao Bao Zi</td> <td align='right'>100</td> </tr> <tr> <td align='right'>Electronic Engineering</td> <td align='right'>Yi Ke Qiu Qiu</td> <td align='right'>59</td> </tr> </table>

![](https://i.imgur.com/ox6DwWe.png =100x100)![](https://i.imgur.com/0KRDHhT.png =100x100)![](https://i.imgur.com/L1yFHvZ.png =100x100)

----

# Questions?![](https://i.imgur.com/o95v7Pq.png)

----

## Links

1. Pick some text, for example "Breakfast Manifesto"
2. Wrap it in an \<a> element

```htmlembedded=
<a>Breakfast Manifesto</a>
```

3. Add the href attribute to \<a>

```htmlembedded=
<a href="">Breakfast Manifesto</a>
```

4. 
The attribute value is the URL you want to link to

```htmlembedded=
<a href="https://zh.wikipedia.org/wiki/%E7%A5%9E%E9%AD%94%E4%B9%8B%E5%A1%94">Breakfast Manifesto</a>
```

Note: Links are very important on the web. To add a link we use the \<a> element ("a" stands for "anchor"). The steps for turning text into a link are the ones listed on this slide.

----

<a href="https://zh.wikipedia.org/wiki/%E7%A5%9E%E9%AD%94%E4%B9%8B%E5%A1%94">Breakfast Manifesto</a>

----

![](https://i.imgur.com/bmgOWxj.png =600x500)

----

## Nested elements

![](https://i.imgur.com/OXXdfL5.png)

Note: Most HTML elements can be nested, meaning they can contain other HTML elements. An HTML document is built from nested HTML elements.

----

## Nested elements

```htmlembedded=
I love coding very much
```

I love coding very much

----

```htmlembedded=
Become smaller and smaller
```

Become smaller and smaller

![](https://i.imgur.com/KB4dFex.png =180x100)![](https://i.imgur.com/KB4dFex.png =180x100)![](https://i.imgur.com/KB4dFex.png =180x100)![](https://i.imgur.com/KB4dFex.png =180x100)

----

## Nested elements

```htmlembedded=
I wear a green hat
```

I wear a green hat

![](https://i.imgur.com/ijBLscv.png =100x100)![](https://i.imgur.com/G9Ouiqm.png =100x100)![](https://i.imgur.com/42BAMn1.png =100x100)

----

## ![](https://i.imgur.com/ZmlJbIG.png =80x80)Which of the following child elements does not appear?

----

## I love you in every universe

<table> <tr> <td>A</td> <td>&lt;strong&gt;&lt;/strong&gt;</td> </tr> <tr> <td>B</td> <td>&lt;font size="1"&gt;&lt;/font&gt;</td> </tr> <tr> <td>C</td> <td>&lt;em&gt;&lt;/em&gt;</td> </tr> <tr> <td>D</td> <td>&lt;font color="#555cc"&gt;&lt;/font&gt;</td> </tr> </table>

----

## I love you in every universe

<table> <tr> <td>A</td> <td>&lt;strong&gt;&lt;/strong&gt;</td> </tr> <tr> <td>B</td> <td>&lt;font size="1"&gt;&lt;/font&gt;</td> </tr> <tr> <td>C</td> <td>&lt;em&gt;&lt;/em&gt; (italic)</td> </tr> <tr> <td>D</td> <td>&lt;font color="#555cc"&gt;&lt;/font&gt;</td> </tr> </table>

----

# Questions

![](https://i.imgur.com/GKtrsLx.png =200x200)

----

## Introduction to BeautifulSoup:

- A popular and practical scraping tool ![](https://i.imgur.com/uzv3IEp.png =150x150)
- Extracts page content based on tags, attributes, and the relationships between tags (whole tags, the text inside them, etc.)
- Think of it as a toolbox full of tools (functions) that we can use while scraping
- [BeautifulSoup
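Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) ^^ All the methods and functions are there. Take a look if you're interested. It also links to a Chinese version.

Note: After analyzing how a page is put together, how do we get the HTML elements into a program so we can record or analyze them? Here we introduce a popular library, BeautifulSoup, which people usually use to grab tags and attributes. We'll only cover a few basic functions here. If you're interested, follow the link to find out more.

----

## Parsing HTML with BeautifulSoup:

```python=
from bs4 import BeautifulSoup

# Deliberately malformed HTML: none of the tags are closed
soup = BeautifulSoup("<p>Some<b>bad<i>HTML", "lxml")
print(soup.prettify())
```

![](https://i.imgur.com/juVbw7m.png =100x100)![](https://i.imgur.com/w3QKb5n.png =100x100)![](https://i.imgur.com/j1Skojd.png =100x100)![](https://i.imgur.com/d9UakaS.png =100x100)![](https://i.imgur.com/XjearhQ.png =100x100)

Note: Here's BeautifulSoup's simplest feature first: it takes the HTML of a page and parses the whole thing, even when the HTML is badly formed.

----

## Printed result:

```htmlembedded=
<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
```

Note: prettify indents the output automatically so the layout is easier to read, which later makes it easier to see which content we need to extract.

----

## Extracting specific content:

|What you want|Syntax|
|---|---|
|Tag|e.g. soup.b|
|Content|e.g. soup.text|
|Attribute|e.g. soup.b['id']|

Note: Here are three basic ways to extract things. Accessing by tag name gets the whole HTML element. If you only need the content or a particular attribute value, add ".text" or "[attribute]". Real examples follow.

----

## Extracting a specific tag:

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>Extremely bold</b>' +
                     '<div>Extremely div</div>', 'html.parser')
print(soup.b)
```

Note: When we only want the box whose tag is b, soup.b is all it takes.

----

## Result:

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>Extremely bold</b>' +
                     '<div>Extremely div</div>', 'html.parser')
print(soup.b)
```

```htmlembedded=
<b>Extremely bold</b>
```

![](https://i.imgur.com/apdZqQJ.png =100x)

----

## If you only want the content:

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>Extremely bold</b>' +
                     '<div>Extremely div</div>', 'html.parser')
print(soup.b.text)
```

Note: If the opening and closing tags in the result get in the way, add .text at the end and the result will be just Extremely bold.

----

## Result:

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>Extremely bold</b>' +
                     '<div>Extremely div</div>', 'html.parser')
print(soup.b.text)
```

```htmlembedded=
Extremely bold
```

![](https://i.imgur.com/azWpIwG.png)

----

## Want a specific attribute?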
```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/lacie" ' +
                     'class="sister" id="link2">Lacie</a>', 'html.parser')
```

Note: Do you remember what I said earlier about how to get a particular attribute value out of the HTML?

----

## Want a specific attribute?

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/lacie" ' +
                     'class="sister" id="link2">Lacie</a>', 'html.parser')
print(soup.a['href'])
print(soup.a['id'])
```

Note: That's right: put the attribute you want inside square brackets. In this code, to find the element's href and id, just add ['href'] and ['id'] after it. Note that this uses square brackets, not a dot.

----

## Result:

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/lacie" ' +
                     'class="sister" id="link2">Lacie</a>', 'html.parser')
print(soup.a['href'])
print(soup.a['id'])
```

```htmlembedded=
http://example.com/lacie
link2
```

![](https://i.imgur.com/Ckpi1yr.png =100x100)![](https://i.imgur.com/MrRWWwi.png =100x100)![](https://i.imgur.com/Ckpi1yr.png =100x100)![](https://i.imgur.com/MrRWWwi.png =100x100)![](https://i.imgur.com/Ckpi1yr.png =100x100)

Note: The result looks like this.

----

## find() :

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
print(soup.b)
print(soup.find('b'))
```

Note: Here is another function, find(), which is the one people use most often to pull information out of a page. The two do basically the same thing, but find() is more flexible and more convenient when you need something specific.

----

## Same result

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
print(soup.b)
print(soup.find('b'))
```

```htmlembedded=
<b class="story" id="bold">Extremely bold</b>
<b class="story" id="bold">Extremely bold</b>
```

![](https://i.imgur.com/AXOoGa3.png =100x180)![](https://i.imgur.com/hpyte37.png =100x180)

Note: As you can see, the two results are the same.

----

## find() :

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
print(soup.find('b', 'story'))
print(soup.find('b', id='bold'))
```

Note: In this example we go a step further and filter by attribute value. Note that when filtering by class you don't write class=... (pass the class value as the second argument, or use class_=...), whereas other attributes do need the keyword form.

----

## Result:

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
print(soup.find('b', 'story'))
print(soup.find('b', id='bold'))
```

```htmlembedded=
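<b class="story" id="bold">Extremely bold</b>
<b class="story" id="bold">Extremely bold</b>
```

![](https://i.imgur.com/rW8Jaiq.jpg =300x)

Note: As you can see, both calls return the first element.

----

## Exercise notebook (.ipynb)

- [COLAB](https://colab.research.google.com/drive/1tyatJ0yADwPiEVRPWXBT8o__jw5AdfJ_?usp=sharing)

note: Let's do a short exercise. Open the Colab we provided and choose File -> Save a copy in Drive.

----

## ![](https://i.imgur.com/ZmlJbIG.png =80x80)Get an attribute value (Brain Teaser 1)

### (4 minutes)

```htmlembedded=
bold
```

Note: We just said that find() does basically the same thing as using .tag directly, and you'll remember that we can read attribute values. Open Colab, find the Brain Teaser 1 (智多星1) block, and try printing the bold from id='bold'.

----

## Answer:

```python=
# Brain Teaser 1
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
answer = soup.find('b')['id']
print(answer)
```

```htmlembedded=
bold
```

----

## find() summary:

|Syntax|Correct?|
|---|---|
|.find('tag', 'class value') | V |
|.find('tag', class = 'value') | X |
|.find('tag', id = 'value') | V |

Note: Here's a summary of the part of find() that is easiest to get wrong. To filter results by class value, just put the class value after the comma. The second form is wrong because class is a reserved word in Python, so please take note.

----

## Incorrect usage

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
answer = soup.find('b', class='story')  # SyntaxError: class is a reserved word
print(answer)
```

----

## The result is a syntax error

![](https://i.imgur.com/1YlGpBH.png =800x300)

----

## Correct usage

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
answer = soup.find('b', 'story')
print(answer)
```

----

## Result:

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
answer = soup.find('b', 'story')
print(answer)
```

```htmlembedded=
<b class="story" id="bold">Extremely bold</b>
```

----

## find_all() :

- Used exactly like find()
- Returns every matching item

Note: You may have noticed that in the example just now, even though both elements match, find() still returns only the first matching element. Suppose we want the post titles on PTT: their tags and attributes are essentially identical, so to get all of them we use find_all. It works almost exactly like find, except that it returns every match instead of only the first.

----

## find_all() :

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
print(soup.b)
print(soup.find_all('b'))
```

Note: Here is find_all on the earlier example. It returns both matching elements rather than only the first.

----

## Result:

```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')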
print(soup.b)
print(soup.find_all('b'))
```

```htmlembedded=
<b class="story" id="bold">Extremely bold</b>
[<b class="story" id="bold">Extremely bold</b>, <b class="story">Extremely div</b>]
```

----

## ![](https://i.imgur.com/ZmlJbIG.png =80x80)Get the text (Brain Teaser 2)

### (4 minutes)

```htmlembedded=
Extremely bold
Extremely div
```

note: It's Brain Teaser time again. Open Colab and find Brain Teaser 2 (智多星2). This time, print the content of every element that find_all returns.

----

## Answer:

```python=
# Brain Teaser 2
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="story" id="bold">Extremely bold</b>' +
                     '<b class="story">Extremely div</b>', 'html.parser')
for i in soup.find_all('b'):
    answer = i.text
    print(answer)
```

```htmlembedded=
Extremely bold
Extremely div
```

----

# Questions

![](https://i.imgur.com/NSwdRtV.png =200x200)

----

## Crawler setup:

```python=
import requests
from bs4 import BeautifulSoup

# Each piece ends with a space so the joined User-Agent string stays valid
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 " +
           "Safari/537.36 Edg/102.0.1245.33"}
target = 'https://www.ptt.cc/bbs/movie/index.html'
req = requests.get(url=target, headers=headers)
soup = BeautifulSoup(req.text, 'html.parser')
```

note: You can see this code at the top of the Q1 block.

----

## headers ?
```python=
# Each piece ends with a space so the joined User-Agent string stays valid
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 " +
           "Safari/537.36 Edg/102.0.1245.33"}
```

note: There's a header at the top. Why is that?

----

## Disguising as a browser:

![](https://i.imgur.com/X12RkS8.jpg =960x500)

note: Many servers try to stop programs, rather than real users in a browser, from sending them requests, because that both burdens the server and degrades the experience for real users. To avoid this, servers usually have protective mechanisms in place.

----

### Disguising as a browser:

- Click Inspect, open the Network tab, then reload the page

![](https://i.imgur.com/oGZx1Uy.png =600x500)

note: So how do we pose as a user in a browser? In the browser, right-click -> Inspect -> Network.

----

### Disguising as a user in a browser:

- Click index.html, then click Headers and scroll to the bottom

![](https://i.imgur.com/ZapnIOr.jpg =960x550)

note: Click index.html and scroll to the bottom to see the information sent to the server when we use a browser.

----

## Fetching the page source:

```python=
req = requests.get(url=target, headers=headers)
soup = BeautifulSoup(req.text, 'html.parser')
```

----

## Crawler setup:

```python=
import requests
from bs4 import BeautifulSoup
import string
import os
import shutil

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36 Edg/102.0.1245.33"}
target = 'https://www.ptt.cc/bbs/movie/index.html'
req = requests.get(url=target, headers=headers)
soup = BeautifulSoup(req.text, 'html.parser')
print(soup)
```

Try running this code.

----

## Result:

![](https://i.imgur.com/YfWEDIq.png)

----

## Q1: Find all the boxes that contain post URL information (7 minutes)

![](https://i.imgur.com/yAFfrbt.jpg)

----

## Q1 result: it will be a list

![](https://i.imgur.com/gZztoNT.jpg =1000x500)

----

## A1: Find all the boxes that contain post URL information

```python=
title = soup.find_all('div', 'title')
print(title)
```

----

## Q2: Extract the URLs from the information obtained in step one (7 minutes)

![](https://i.imgur.com/wCCY592.jpg =1000x500)

----

![](https://i.imgur.com/MONc7wh.jpg)

----

## Q2 result

![](https://i.imgur.com/R0cZ6mN.png)

----

## A2: Extract the URLs from the information obtained in step one

```python=
for t in title:
    link = t.find('a')
    if link is not None:  # skip boxes that contain no <a> tag
        href = 'https://www.ptt.cc' + link['href']
        print(href)
```

----

This result changes with the posts on that page and how many there are.

----

## Q3: Open a post and scrape its content (7 minutes)

![](https://i.imgur.com/qyIQzPt.png)

----

## Q3 result

![](https://i.imgur.com/5ZhzG7g.jpg)

----

## A3: Open a post and scrape its content
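```python=
# Assumption: fetch the post page first, reusing href and headers from the earlier steps
req = requests.get(url=href, headers=headers)
soup = BeautifulSoup(req.text, 'html.parser')
content = soup.find('div', id="main-container").text
print(content)
```

----

## Practical tips:

1. Observe the page
2. Compare it with the page source to find where the data lives
3. Identify the tags that match what you need
4. Use BeautifulSoup functions to get the data

---

{%youtube 11GWxA4PiJI %}

----

## Selenium

![](https://i.imgur.com/XYrUMyg.png)

- Drives a browser directly, simulating a real user operating the site

note: Next we introduce another library. It drives a browser directly to simulate a user operating the site, so there's no need for a separate header to fool the server.

----

## Xpath

- A query language for locating nodes in a document

note: We just used BeautifulSoup. Now here's another way to search: XPath. It treats the document as a tree, which is a kind of data structure. An example is coming up to make this clearer.

----

## Example HTML

```htmlembedded=
<html>
  <head>
    <meta charset='utf-8'>
  </head>
  <body>
    Plain Text
    Italic
    Bold
    Strong
    Emphasized
    This shows superscript and subscript!
  </body>
</html>
```

----

## HTML Tree![](https://i.imgur.com/F0mJxIX.png =100x100)

![](https://i.imgur.com/Qpo4fCc.jpg =700x400)

note: The example above can be drawn as this tree. Each circle is a node. Each node is linked to a node above it and may have nodes below it, but exactly one node has no link above it. That node is the root node, like the base of a tree, which is why the structure is called a tree.

----

### Xpath syntax

Using the HTML source we are working with:

| Query path | Result |
| -------- | -------- |
| /html | Selects the html tag (a leading / means start from the root node) |
| /html/body/a | Selects a tags directly under body (no results, because a is under p) |
| //a | Selects all a tags |
| /html/body//a | Selects all a tags anywhere under body |
| //a/@href | Selects the href attribute of all a tags |

----

### Xpath syntax

| Predicate | Result |
|-------|-----|
|/html/body//a[1] | Selects a tags under body that are the first a within their parent |
|/html/body//a[last() - 1] | Selects a tags under body that are second to last within their parent (a[last()] gives the last one) |
|//p[@class]| Selects p tags that define a class attribute |
|//p[@id='name']| Selects p tags whose id attribute is name |

note: Predicates are wrapped in [ ] and add extra constraints to a query. Likewise, XPath can use [ ] to match on attribute values, as the last row does.

----

## ![](https://i.imgur.com/ZmlJbIG.png =80x80)Find the ids on PTT (Brain Teaser 3)

### (4 minutes)

```
topbar-container
topbar
logo
main-container
action-bar-container
search-bar
```

note: Now use XPath to find the ids on the PTT page from before.

----

## Answer:

```python=
from lxml import etree

# webdrive is assumed to be an already-created Selenium WebDriver instance
webdrive.get("https://www.ptt.cc/bbs/movie/index.html")
html = etree.HTML(webdrive.page_source)
links = html.xpath("/html//@id")
print(links)
```

----

## Exercise 2: [逐夢營 website](https://cscamp.nctucsunion.me/)

![](https://i.imgur.com/IrXRDVH.png)

----

## Disguising as a user in a browser:

![](https://i.imgur.com/fJFOo6R.png)

----

## Q4: Extract each team's URL from the URL information obtained (7 minutes)

![](https://i.imgur.com/5yjRc4T.jpg)

----

## Q4 result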
![](https://i.imgur.com/gqI69Iu.png =600x300)

----

## A4: Extract each team's URL from the URL information obtained

```python=
from lxml import etree

# webdrive is assumed to be an already-created Selenium WebDriver instance
webdrive.get("https://cscamp.nctucsunion.me/team/")
html = etree.HTML(webdrive.page_source)
links = html.xpath("/html//div[@id='team-intro']//a/@href")
for link in links:
    print(link)
```

----

## Q5: From the URLs obtained, find the photo links and click them to display the photos (7 minutes)

![](https://i.imgur.com/61GtDv0.jpg)

----

## Q5: The printed photos

![](https://i.imgur.com/up1ZPcP.png)

----

## A5: From the URLs obtained, find the photo links and click them to display the photos

```python=
import urllib.parse

webdrive.get("https://cscamp.nctucsunion.me/team/")
html = etree.HTML(webdrive.page_source)
links = html.xpath("/html//div[@id='team-intro']//a/@href")
for link in links:
    # print(link)
    webdrive.get(link)
    html = etree.HTML(webdrive.page_source)
    imgs = html.xpath("/html//img/@src")
    for i in imgs:
        # Percent-encode non-ASCII characters while keeping URL delimiters intact
        i = urllib.parse.quote(i, safe=";/?:@&=+$,", encoding='utf-8')
        print(i)
```

----

## A5: Incorrect result

![](https://i.imgur.com/BPRkMPU.png)

----

## A5: Result

![](https://i.imgur.com/9A3qV8k.png)

```
hint : urllib.parse.quote(i, safe=";/?:@&=+$,", encoding='utf-8')
```

----

## Q6: Display the images with plt (7 minutes)

![](https://i.imgur.com/ao7KQvd.png)

----

## Q6: Display the images with plt

```python=
for i in imgs:
    img_src = urllib.parse.quote(i, safe=";/?:@&=+$,", encoding='utf-8')
    print(img_src)
    img = io.imread(img_src)
    imgplot = plt.imshow(img)
    # 1. plt.popup()  2. plt.display()
    # 3. plt.show()   4. plt.showmaker()
```

----

## A6: Display the images with plt

```python=
import urllib.parse
import matplotlib.pyplot as plt
from skimage import io  # assumption: io here is skimage.io, which can read from a URL

for i in imgs:
    img_src = urllib.parse.quote(i, safe=";/?:@&=+$,", encoding='utf-8')
    print(img_src)
    img = io.imread(img_src)
    imgplot = plt.imshow(img)
    plt.show()
```

----

## Result

![](https://i.imgur.com/d7QZJGQ.png =400x600)

---

![](https://i.imgur.com/GcqzM3F.gif)
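
----

## XPath recap sketch

The XPath queries in the tables above can be tried without a browser. This is only a minimal sketch under stated assumptions: the `SAMPLE` page is made up (it is not the real camp site), and it swaps in Python's standard-library `xml.etree.ElementTree` in place of the `lxml` `etree` used in the answers. `ElementTree` supports only a subset of XPath and needs well-formed markup, so attribute values are read with `.get()` where `lxml` would accept `/@href`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not the real camp site).
SAMPLE = """
<html>
  <body>
    <div id="team-intro">
      <p class="team"><a href="/team/1">Team 1</a></p>
      <p class="team"><a href="/team/2">Team 2</a></p>
      <p><a href="/team/3">Team 3</a></p>
    </div>
    <div id="footer"><a href="/about">About</a></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Like //a : every <a> anywhere below the root.
all_links = root.findall(".//a")

# Like //div[@id='team-intro']//a : a predicate on an attribute value.
team_links = root.findall(".//div[@id='team-intro']//a")

# ElementTree returns elements, so read the attribute with .get().
team_hrefs = [a.get("href") for a in team_links]

# Like //p[@class] : only <p> elements that define a class attribute.
classed_paragraphs = root.findall(".//p[@class]")

print(len(all_links))           # 4
print(team_hrefs)               # ['/team/1', '/team/2', '/team/3']
print(len(classed_paragraphs))  # 2
```

With `lxml`, the same page could be queried as `html.xpath("//div[@id='team-intro']//a/@href")`, which returns the attribute values directly.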