Sprout 0616 Web Scraping
===
## Setup
Install these packages:
```
pip install beautifulsoup4
pip install lxml
pip install requests
```
Or upgrade existing installations:
```
python -m pip install --upgrade beautifulsoup4
python -m pip install --upgrade lxml
python -m pip install --upgrade requests
```
| beautifulsoup4 | lxml | requests |
| :--------: | :--------: | :--------: |
| Parses web page content | Parses HTML documents | Requests page content from the server |
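A quick way to confirm the installs worked is to import all three packages and print their versions; a minimal sketch:
```python
# If any of these imports fails, the corresponding package is missing
import bs4
import lxml.etree
import requests
print(bs4.__version__)
print(requests.__version__)
print(lxml.etree.LXML_VERSION)   # lxml exposes its version as a tuple here
```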
---
## Parsing HTML
### Pretty-printing
```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
# Build a BeautifulSoup object named soup, taking html_doc as input
# and lxml as the parser. BeautifulSoup also supports other parsers
# such as html5lib and html.parser.
print(soup.prettify())
# prettify() re-indents the HTML so it is easier to read.
```
```html
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
```
### Finding tags by name
1. Find the first title tag.
2. Find the first p tag.
3. Get the text inside the first p tag.
4. soup.title is a bs4.element.Tag object.
5. The text of the p tag is a plain string.
```python
print(soup.title)
# <title>The Dormouse's story</title>
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>
print(soup.p.text)
# The Dormouse's story
print(type(soup.title))
# <class 'bs4.element.Tag'>
print(type(soup.p.text))
# <class 'str'>
```
### Finding tags by attribute
1. Find the first tag whose class is title.
2. Find the first a tag whose id is link1.
```python
print(soup.find(class_='title'))
# <p class="title"><b>The Dormouse's story</b></p>
print(soup.find('a', id='link1'))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
### Extracting attribute values
The syntax works much like a dict.
1. Get the class of the first p tag.
2. Get the hyperlink of the first a tag.
```python
print(soup.p['class'])
# ['title']
print(soup.a['href'])
# http://example.com/elsie
```
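Like a dict, a tag also supports .get(), which returns None instead of raising a KeyError when the attribute is missing; a small sketch:
```python
print(soup.a.get('href'))      # http://example.com/elsie
print(soup.a.get('download'))  # None: the a tag has no download attribute
```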
### Parent and contents
1. Find the first a tag whose id is link1.
2. Get the a tag's parent tag.
3. Print every child of parent_tag.
```python
atag = soup.find('a', id='link1')
print(atag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
parent_tag = atag.parent
print(parent_tag)
'''
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''
for item in parent_tag.contents:
    print(item)
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ;
# and they lived at the bottom of a well.
```
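If you only need to loop once, .children is the iterator counterpart of .contents; a small sketch:
```python
# .children yields the same items as .contents, one at a time
for item in parent_tag.children:
    print(item)
```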
### Finding all tags
Find all the a tags.
```python
print(soup.find_all('a'))
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''
```
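find_all() accepts the same filters as find(), plus a limit argument to cap the number of results; a small sketch:
```python
# Only the first two a tags whose class is sister
print(soup.find_all('a', class_='sister', limit=2))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```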
### Exercise 1
Print every hyperlink URL in html_doc.
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
for link in soup.find_all('a'):
    print(link['href'])
```
---
## Fetching HTML from a web page
```python
from bs4 import BeautifulSoup
import requests
url = 'http://www.scp-wiki.net/scp-002'
html = requests.get(url).text
print(html)
```
Output:
```
(earlier output omitted)
<div class="top-bar">
<ul>
<li>SCP Series
<ul>
<li><a href="/scp-series-5">Series V</a></li>
<li><a href="/scp-series-4">Series IV</a></li>
<li><a href="/scp-series-4-tales-edition">» Series IV Tales</a></li>
<li><a href="/scp-series-3">Series III</a></li>
<li><a href="/scp-series-3-tales-edition">» Series III Tales</a></li>
<li><a href="/scp-series-2">Series II</a></li>
<li><a href="/scp-series-2-tales-edition">» Series II Tales</a></li>
<li><a href="/scp-series">Series I</a></li>
<li><a href="/scp-series-1-tales-edition">» Series I Tales</a></li>
</ul>
(remaining output omitted)
```
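requests.get() returns a Response object, so before parsing it is worth checking that the request actually succeeded; a minimal sketch:
```python
import requests
response = requests.get('http://www.scp-wiki.net/scp-002')
print(response.status_code)   # 200 means the page was fetched successfully
html = response.text
```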
## Brewing the soup
```python
from bs4 import BeautifulSoup
import requests
url = 'http://www.scp-wiki.net/scp-002'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
```
Output:
```
(earlier output omitted)
<div class="top-bar">
 <ul>
  <li>
   SCP Series
   <ul>
    <li>
     <a href="/scp-series-5">
      Series V
     </a>
    </li>
    <li>
     <a href="/scp-series-4">
      Series IV
     </a>
    </li>
    <li>
     <a href="/scp-series-4-tales-edition">
      » Series IV Tales
     </a>
    </li>
    <li>
     <a href="/scp-series-3">
      Series III
     </a>
    </li>
    <li>
     <a href="/scp-series-3-tales-edition">
      » Series III Tales
     </a>
    </li>
    <li>
     <a href="/scp-series-2">
      Series II
     </a>
    </li>
    <li>
     <a href="/scp-series-2-tales-edition">
      » Series II Tales
     </a>
    </li>
    <li>
     <a href="/scp-series">
      Series I
     </a>
    </li>
    <li>
     <a href="/scp-series-1-tales-edition">
      » Series I Tales
     </a>
    </li>
   </ul>
(remaining output omitted)
```
---
# Goal: Build an SCP Catalog!
### About SCP
The SCP Foundation is an international organization responsible for locating and containing individuals, places, and objects with anomalous properties (collectively called SCPs). Based on how dangerous they are to contain, SCPs are divided into five object classes: Safe, Euclid, Keter, Thaumiel, and Neutralized.
Next we will analyze the SCP Foundation website: from each SCP's page, extract its object class and its image, and save the image into the matching folder.
[A video you will watch again tomorrow](https://www.youtube.com/watch?v=odc0ajxCq2c)
### Things we'll use soon that may be unfamiliar
Each of these gets a short demo right after this list.
- f-string
- os
- string.strip(), string.find()
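A minimal sketch of each (the sample values are invented for illustration):
```python
import os

# f-string: embed expressions directly in a string; :03 zero-pads to width 3
n = 2
print(f'scp-{n:03}')                            # scp-002

# os: interact with the file system
print(os.path.exists('.'))                      # True (the current directory exists)

# str.strip(): remove leading/trailing characters (whitespace by default)
print('  - Euclid  '.strip(' -'))               # Euclid

# str.find(): index of the first occurrence of a substring, -1 if not found
print('Object Class: Euclid'.find('Euclid'))    # 14
```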
## Finding an SCP's object class
[The SCP object classes](http://www.scp-wiki.net/object-classes)
An HTML fragment containing the object class:
```html
<p>
 <strong>
  Object Class:
 </strong>
 Euclid
</p>
```
Code:
```python
def classify(article):
    # Scan every <strong> tag until we hit the 'Object Class:' label
    for item in article.find_all('strong'):
        if item.text == 'Object Class:':
            # The class name is the last piece of text in the parent <p> tag
            ptag = item.parent
            return ptag.contents[-1].strip()
    return None
# Returns 'Euclid' for SCP-002
```
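A hypothetical usage, assuming soup holds a parsed SCP page as in the earlier fetch:
```python
# page-content is the div that wraps the article body on SCP pages
article = soup.find('div', id='page-content')
print(classify(article))
# Euclid
```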
### Exercise 2
Print the text of the Description section on an SCP page.
```python
from bs4 import BeautifulSoup
import requests
url = 'http://www.scp-wiki.net/scp-002'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
for item in soup.find_all('strong'):
    if item.text == 'Description:':
        ptag = item.parent
        print(ptag.contents[-1])
        print()
        sibling = ptag.find_next_sibling()
        print(sibling.text)
```
## Finding an SCP's image
An HTML fragment containing the image:
```html
<div class="scp-image-block block-right" style="width:300px;">
<img src="http://scp-wiki.wdfiles.com/local--files/scp-002/800px-SCP002.jpg" style="width:300px;" alt="800px-SCP002.jpg" class="image" />
<div class="scp-image-caption" style="width:300px;">
<p>
SCP-002 in its containment area
</p>
</div>
</div>
```
Code:
```python
def find_image(article):
    try:
        # First find the div tag that wraps the image
        div = article.find('div', class_='scp-image-block block-right')
        # Pull the image URL out of the img tag
        link = div.img['src']
        # Download the raw bytes of the image
        image = requests.get(link).content
    except:
        # try-except keeps the function from crashing when no image is found
        image = None
    return image
```
## Downloading the image
```python
file_path = 'scp-002.jpg'
# 'wb' opens the file for writing in binary mode, since image is raw bytes
with open(file_path, 'wb') as f:
    f.write(image)
```
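Note the distinction from the earlier fetches: .text decodes the response into a str (for HTML), while .content keeps the raw bytes (needed for images); a small sketch:
```python
import requests
r = requests.get('http://www.scp-wiki.net/scp-002')
print(type(r.text))      # <class 'str'>   - decoded text
print(type(r.content))   # <class 'bytes'> - raw bytes
```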
## Putting it all together
```python
from bs4 import BeautifulSoup
import requests

def classify(article):
    for item in article.find_all('strong'):
        if item.text == 'Object Class:':
            ptag = item.parent
            return ptag.contents[-1].strip()
    return None

def find_image(article):
    try:
        div = article.find('div', class_='scp-image-block block-right')
        link = div.img['src']
        image = requests.get(link).content
    except:
        image = None
    return image

scp_name = 'scp-002'
base_url = 'http://www.scp-wiki.net/'
url = base_url + scp_name
# Fetch the scp-002 page
src = requests.get(url).text
soup = BeautifulSoup(src, 'lxml')
article = soup.find('div', id='page-content')
# Classify the SCP, or print an error message
scp_type = classify(article)
if scp_type is None:
    print(f'{scp_name} is top secret!')
# Find and save the image, or print an error message
file_path = f'{scp_name}.jpg'
image = find_image(article)
if image is not None:
    with open(file_path, 'wb') as f:
        f.write(image)
else:
    print(f'{scp_name} image not found!')
```
---
## Creating the corresponding folders
```python
import os
def check_dir(scp_type):
    # Check whether the folder already exists
    if not os.path.exists(scp_type):
        # If it does not, create it
        os.mkdir(scp_type)
```
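An equivalent one-liner: os.makedirs() with exist_ok=True does the existence check for you.
```python
import os
# Does nothing if the folder already exists, instead of raising an error
os.makedirs('Euclid', exist_ok=True)
```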
## Sorting the SCPs into folders
```python
from bs4 import BeautifulSoup
import requests
import os

def check_dir(scp_type):
    if not os.path.exists(scp_type):
        os.mkdir(scp_type)

def classify(article):
    for item in article.find_all('strong'):
        if item.text == 'Object Class:':
            ptag = item.parent
            return ptag.contents[-1].strip()
    return None

def find_image(article):
    try:
        div = article.find('div', class_='scp-image-block block-right')
        link = div.img['src']
        image = requests.get(link).content
    except:
        image = None
    return image

base_url = 'http://www.scp-wiki.net/'
for i in range(1, 51):
    # Build the SCP's number, zero-padded to three digits
    scp_name = f'scp-{i:03}'
    url = base_url + scp_name
    src = requests.get(url).text
    soup = BeautifulSoup(src, 'lxml')
    article = soup.find('div', id='page-content')
    # Classify the SCP
    scp_type = classify(article)
    # If the class cannot be found, move on to the next SCP
    if scp_type is None:
        print(f'{scp_name} is top secret!')
        continue
    image = find_image(article)
    # Use the class as the folder name and the number as the file name
    file_path = f'{scp_type}/{scp_name}.jpg'
    # Check whether an image was found
    if image is not None:
        # Make sure the folder exists
        check_dir(scp_type)
        with open(file_path, 'wb') as f:
            f.write(image)
    else:
        print(f'{scp_name} image not found!')
```
### Exercise 3
Rename the image files after the SCPs' actual names.
```python
from bs4 import BeautifulSoup
import requests
import re

def find_name():
    scp_name = []
    url = 'http://www.scp-wiki.net/scp-series'
    src = requests.get(url).text
    soup = BeautifulSoup(src, 'lxml')
    article = soup.find('div', id='page-content')
    for i in range(1, 51):
        scp_id = f'SCP-{i:03}'
        for atag in article.find_all('a'):
            if atag.text == scp_id:
                # The display name follows the link text inside the same li tag
                litag = atag.parent
                scp_name.append(litag.contents[-1].strip(' -'))
    return scp_name

def preprocess(scp_names):
    # Remove characters that are not allowed in file names
    for i in range(len(scp_names)):
        scp_names[i] = re.sub(r'[<>:"/\\|?*]', '', scp_names[i])
```
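A hypothetical way to combine the two helpers, assuming every SCP in the range was found so the names line up with the numbers:
```python
names = find_name()
preprocess(names)
for i, name in enumerate(names, start=1):
    # e.g. Euclid/scp-002.jpg could be renamed using `name`
    print(f'scp-{i:03} -> {name}.jpg')
```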