Sprout 0616 Web Scraping
===
## Setup
Install these packages:
```
pip install beautifulsoup4
pip install lxml
pip install requests
```
Or upgrade existing installations:
```
python -m pip install --upgrade beautifulsoup4
python -m pip install --upgrade lxml
python -m pip install --upgrade requests
```
| beautifulsoup4 | lxml | requests |
| :--------: | :--------: | :--------: |
| Parses web page content | Parses HTML documents | Requests page content from the server |
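A quick way to confirm the installs worked is to import all three packages and print their versions; a minimal sketch:
```python
# If any of these imports fails, the corresponding package is missing
import bs4
import lxml.etree
import requests
print(bs4.__version__)
print(requests.__version__)
print(lxml.etree.LXML_VERSION)   # lxml exposes its version as a tuple here
```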
---
## Parsing HTML
### Pretty-printing
```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
# Build a BeautifulSoup object named soup, taking html_doc as input
# and lxml as the parser. BeautifulSoup also supports other parsers
# such as html5lib and html.parser.
print(soup.prettify())
# prettify() re-indents the HTML so it is easier to read.
```
```html
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
```
### Finding tags by name
1. Find the first title tag.
2. Find the first p tag.
3. Get the text inside the first p tag.
4. soup.title is a bs4.element.Tag object.
5. The text of the p tag is a plain string.
```python
print(soup.title)
# <title>The Dormouse's story</title>
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>
print(soup.p.text)
# The Dormouse's story
print(type(soup.title))
# <class 'bs4.element.Tag'>
print(type(soup.p.text))
# <class 'str'>
```
### Finding tags by attribute
1. Find the first tag whose class is title.
2. Find the first a tag whose id is link1.
```python
print(soup.find(class_='title'))
# <p class="title"><b>The Dormouse's story</b></p>
print(soup.find('a', id='link1'))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
### Extracting attribute values
The syntax works much like a dict.
1. Get the class of the first p tag.
2. Get the hyperlink of the first a tag.
```python
print(soup.p['class'])
# ['title']
print(soup.a['href'])
# http://example.com/elsie
```
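Like a dict, a tag also supports .get(), which returns None instead of raising a KeyError when the attribute is missing; a small sketch:
```python
print(soup.a.get('href'))      # http://example.com/elsie
print(soup.a.get('download'))  # None: the a tag has no download attribute
```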
### Parent and contents
1. Find the first a tag whose id is link1.
2. Get the a tag's parent tag.
3. Print every child of parent_tag.
```python
atag = soup.find('a', id='link1')
print(atag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
parent_tag = atag.parent
print(parent_tag)
'''
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''
for item in parent_tag.contents:
    print(item)
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ;
# and they lived at the bottom of a well.
```
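If you only need to loop once, .children is the iterator counterpart of .contents; a small sketch:
```python
# .children yields the same items as .contents, one at a time
for item in parent_tag.children:
    print(item)
```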
### Finding all tags
Find all the a tags.
```python
print(soup.find_all('a'))
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''
```
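find_all() accepts the same filters as find(), plus a limit argument to cap the number of results; a small sketch:
```python
# Only the first two a tags whose class is sister
print(soup.find_all('a', class_='sister', limit=2))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```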
### Exercise 1
Print every hyperlink URL in html_doc.
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
for link in soup.find_all('a'):
    print(link['href'])
```
---
## Fetching HTML from a web page
```python
from bs4 import BeautifulSoup
import requests
url = 'http://www.scp-wiki.net/scp-002'
html = requests.get(url).text
print(html)
```
Output:
```
(earlier output omitted)
<div class="top-bar">
<ul>
<li>SCP Series
<ul>
<li><a href="/scp-series-5">Series V</a></li>
<li><a href="/scp-series-4">Series IV</a></li>
<li><a href="/scp-series-4-tales-edition">» Series IV Tales</a></li>
<li><a href="/scp-series-3">Series III</a></li>
<li><a href="/scp-series-3-tales-edition">» Series III Tales</a></li>
<li><a href="/scp-series-2">Series II</a></li>
<li><a href="/scp-series-2-tales-edition">» Series II Tales</a></li>
<li><a href="/scp-series">Series I</a></li>
<li><a href="/scp-series-1-tales-edition">» Series I Tales</a></li>
</ul>
(remaining output omitted)
```
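requests.get() returns a Response object, so before parsing it is worth checking that the request actually succeeded; a minimal sketch:
```python
import requests
response = requests.get('http://www.scp-wiki.net/scp-002')
print(response.status_code)   # 200 means the page was fetched successfully
html = response.text
```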
## Brewing the soup
```python
from bs4 import BeautifulSoup
import requests
url = 'http://www.scp-wiki.net/scp-002'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
```
Output:
```
(earlier output omitted)
<div class="top-bar">
 <ul>
  <li>
   SCP Series
   <ul>
    <li>
     <a href="/scp-series-5">
      Series V
     </a>
    </li>
    <li>
     <a href="/scp-series-4">
      Series IV
     </a>
    </li>
    <li>
     <a href="/scp-series-4-tales-edition">
      » Series IV Tales
     </a>
    </li>
    <li>
     <a href="/scp-series-3">
      Series III
     </a>
    </li>
    <li>
     <a href="/scp-series-3-tales-edition">
      » Series III Tales
     </a>
    </li>
    <li>
     <a href="/scp-series-2">
      Series II
     </a>
    </li>
    <li>
     <a href="/scp-series-2-tales-edition">
      » Series II Tales
     </a>
    </li>
    <li>
     <a href="/scp-series">
      Series I
     </a>
    </li>
    <li>
     <a href="/scp-series-1-tales-edition">
      » Series I Tales
     </a>
    </li>
   </ul>
(remaining output omitted)
```
---
# Goal: Build an SCP Catalog!
### About SCP
The SCP Foundation is an international organization responsible for locating and containing individuals, places, and objects with anomalous properties (collectively called SCPs). Based on how dangerous they are to contain, SCPs are divided into five object classes: Safe, Euclid, Keter, Thaumiel, and Neutralized.
Next we will analyze the SCP Foundation website: from each SCP's page, extract its object class and its image, and save the image into the matching folder.
[A video you will watch again tomorrow](https://www.youtube.com/watch?v=odc0ajxCq2c)
### Things we'll use soon that may be unfamiliar
Each of these gets a short demo right after this list.
- f-string
- os
- string.strip(), string.find()
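A minimal sketch of each (the sample values are invented for illustration):
```python
import os

# f-string: embed expressions directly in a string; :03 zero-pads to width 3
n = 2
print(f'scp-{n:03}')                            # scp-002

# os: interact with the file system
print(os.path.exists('.'))                      # True (the current directory exists)

# str.strip(): remove leading/trailing characters (whitespace by default)
print('  - Euclid  '.strip(' -'))               # Euclid

# str.find(): index of the first occurrence of a substring, -1 if not found
print('Object Class: Euclid'.find('Euclid'))    # 14
```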
## Finding an SCP's object class
[The SCP object classes](http://www.scp-wiki.net/object-classes)
An HTML fragment containing the object class:
```html
<p>
 <strong>
  Object Class:
 </strong>
 Euclid
</p>
```
Code:
```python
def classify(article):
    # Scan every <strong> tag until we hit the 'Object Class:' label
    for item in article.find_all('strong'):
        if item.text == 'Object Class:':
            # The class name is the last piece of text in the parent <p> tag
            ptag = item.parent
            return ptag.contents[-1].strip()
    return None
# Returns 'Euclid' for SCP-002
```
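A hypothetical usage, assuming soup holds a parsed SCP page as in the earlier fetch:
```python
# page-content is the div that wraps the article body on SCP pages
article = soup.find('div', id='page-content')
print(classify(article))
# Euclid
```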
### Exercise 2
Print the text of the Description section on an SCP page.
```python
from bs4 import BeautifulSoup
import requests
url = 'http://www.scp-wiki.net/scp-002'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
for item in soup.find_all('strong'):
    if item.text == 'Description:':
        ptag = item.parent
        print(ptag.contents[-1])
        print()
        sibling = ptag.find_next_sibling()
        print(sibling.text)
```
## Finding an SCP's image
An HTML fragment containing the image:
```html
<div class="scp-image-block block-right" style="width:300px;">
<img src="http://scp-wiki.wdfiles.com/local--files/scp-002/800px-SCP002.jpg" style="width:300px;" alt="800px-SCP002.jpg" class="image" />
<div class="scp-image-caption" style="width:300px;">
<p>
SCP-002 in its containment area
</p>
</div>
</div>
```
Code:
```python
def find_image(article):
    try:
        # First find the div tag that wraps the image
        div = article.find('div', class_='scp-image-block block-right')
        # Pull the image URL out of the img tag
        link = div.img['src']
        # Download the raw bytes of the image
        image = requests.get(link).content
    except:
        # try-except keeps the function from crashing when no image is found
        image = None
    return image
```
## Downloading the image
```python
file_path = 'scp-002.jpg'
# 'wb' opens the file for writing in binary mode, since image is raw bytes
with open(file_path, 'wb') as f:
    f.write(image)
```
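Note the distinction from the earlier fetches: .text decodes the response into a str (for HTML), while .content keeps the raw bytes (needed for images); a small sketch:
```python
import requests
r = requests.get('http://www.scp-wiki.net/scp-002')
print(type(r.text))      # <class 'str'>   - decoded text
print(type(r.content))   # <class 'bytes'> - raw bytes
```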
## Putting it all together
```python
from bs4 import BeautifulSoup
import requests

def classify(article):
    for item in article.find_all('strong'):
        if item.text == 'Object Class:':
            ptag = item.parent
            return ptag.contents[-1].strip()
    return None

def find_image(article):
    try:
        div = article.find('div', class_='scp-image-block block-right')
        link = div.img['src']
        image = requests.get(link).content
    except:
        image = None
    return image

scp_name = 'scp-002'
base_url = 'http://www.scp-wiki.net/'
url = base_url + scp_name
# Fetch the scp-002 page
src = requests.get(url).text
soup = BeautifulSoup(src, 'lxml')
article = soup.find('div', id='page-content')
# Classify the SCP, or print an error message
scp_type = classify(article)
if scp_type is None:
    print(f'{scp_name} is top secret!')
# Find and save the image, or print an error message
file_path = f'{scp_name}.jpg'
image = find_image(article)
if image is not None:
    with open(file_path, 'wb') as f:
        f.write(image)
else:
    print(f'{scp_name} image not found!')
```
---
## Creating the corresponding folders
```python
import os
def check_dir(scp_type):
    # Check whether the folder already exists
    if not os.path.exists(scp_type):
        # If it does not, create it
        os.mkdir(scp_type)
```
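An equivalent one-liner: os.makedirs() with exist_ok=True does the existence check for you.
```python
import os
# Does nothing if the folder already exists, instead of raising an error
os.makedirs('Euclid', exist_ok=True)
```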
## Sorting the SCPs into folders
```python
from bs4 import BeautifulSoup
import requests
import os

def check_dir(scp_type):
    if not os.path.exists(scp_type):
        os.mkdir(scp_type)

def classify(article):
    for item in article.find_all('strong'):
        if item.text == 'Object Class:':
            ptag = item.parent
            return ptag.contents[-1].strip()
    return None

def find_image(article):
    try:
        div = article.find('div', class_='scp-image-block block-right')
        link = div.img['src']
        image = requests.get(link).content
    except:
        image = None
    return image

base_url = 'http://www.scp-wiki.net/'
for i in range(1, 51):
    # Build the SCP's number, zero-padded to three digits
    scp_name = f'scp-{i:03}'
    url = base_url + scp_name
    src = requests.get(url).text
    soup = BeautifulSoup(src, 'lxml')
    article = soup.find('div', id='page-content')
    # Classify the SCP
    scp_type = classify(article)
    # If the class cannot be found, move on to the next SCP
    if scp_type is None:
        print(f'{scp_name} is top secret!')
        continue
    image = find_image(article)
    # Use the class as the folder name and the number as the file name
    file_path = f'{scp_type}/{scp_name}.jpg'
    # Check whether an image was found
    if image is not None:
        # Make sure the folder exists
        check_dir(scp_type)
        with open(file_path, 'wb') as f:
            f.write(image)
    else:
        print(f'{scp_name} image not found!')
```
### Exercise 3
Rename the image files after the SCPs' actual names.
```python
from bs4 import BeautifulSoup
import requests
import re

def find_name():
    scp_name = []
    url = 'http://www.scp-wiki.net/scp-series'
    src = requests.get(url).text
    soup = BeautifulSoup(src, 'lxml')
    article = soup.find('div', id='page-content')
    for i in range(1, 51):
        scp_id = f'SCP-{i:03}'
        for atag in article.find_all('a'):
            if atag.text == scp_id:
                # The display name follows the link text inside the same li tag
                litag = atag.parent
                scp_name.append(litag.contents[-1].strip(' -'))
    return scp_name

def preprocess(scp_names):
    # Remove characters that are not allowed in file names
    for i in range(len(scp_names)):
        scp_names[i] = re.sub(r'[<>:"/\\|?*]', '', scp_names[i])
```
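A hypothetical way to combine the two helpers, assuming every SCP in the range was found so the names line up with the numbers:
```python
names = find_name()
preprocess(names)
for i, name in enumerate(names, start=1):
    # e.g. Euclid/scp-002.jpg could be renamed using `name`
    print(f'scp-{i:03} -> {name}.jpg')
```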