---
title: Python Crawler
tags: Python, Crawler, Web Scraping
---
# Python Crawler --- Chia
> Table of Contents
> [TOC]
>
> Reference website:
> 1. [認識網路爬蟲:解放複製貼上的時間](https://pala.tw/what-is-web-crawler/)
> 2. [影片:開始使用PYTHON撰寫網路爬蟲 (CRAWLER)](http://www.you2repeat.com/watch/?v=woJ2ZpQ1Q9I)
> 3. [爬蟲 .content 和 .text 的用法區別](https://www.smwenku.com/a/5b8fc98d2b71776722158feb)
> 4. [简单爬虫的通用步骤](https://zhuanlan.zhihu.com/p/29017712)
> 5. [爬蟲 / PTT - 4](https://eugene87222.github.io/2018/02/15/PTT-crawler-4/)
> 6. [Python enumerate() 函数](http://www.runoob.com/python/python-func-enumerate.html)
> 7. [Python 爬蟲練習紀錄(二)](https://ivanjo39191.pixnet.net/blog/post/51107259-python-%E7%88%AC%E8%9F%B2%E7%B7%B4%E7%BF%92%E7%B4%80%E9%8C%84%28%E4%BA%8C%29)
> 8. [Python 使用 Beautiful Soup 抓取與解析網頁資料,開發網路爬蟲教學](https://blog.gtwang.org/programming/python-beautiful-soup-module-scrape-web-pages-tutorial/2/)
> 9. [Slides: Python Crawler](https://slides.com/bessy/python_crawler)
> 10. [Python 圖片隱寫](https://hackmd.io/@NIghTcAt/ByKr8jxGH)
---
# What Is a Web Crawler?
* A program that automatically fetches web page content
* Concept breakdown:
    1. Visit the web page
    2. Copy the fields you need
    3. Paste them into Word / Excel
    4. Repeat
* Tool --- Python
* Benefit: reading files, crawling, and writing files in one smooth flow!
* Package overview:
    1. requests: sends requests to the target page's server
    2. BeautifulSoup: parses HTML
    3. pandas: scrapes tables
    4. selenium: a browser-automation/testing tool for pages that lean heavily on JavaScript
    5. re: regular expressions, for extracting harder-to-target text passages
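The steps above map onto a tiny requests + BeautifulSoup pipeline. A minimal sketch; the HTML string is inlined here (standing in for a fetched `response.text`) so it runs without network access:

```python
from bs4 import BeautifulSoup

# stand-in for response.text from requests.get(url)
html = '<html><head><title>Demo</title></head><body><a href="/a">A</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')  # step 1: "enter" the page
title = soup.find('title').text            # step 2: copy the field we need
print(title)                               # step 3: write it wherever we like
```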
# Installing the Required Packages
* On Windows, open Command Prompt as administrator and run:
```shell
pip3 install requests
pip3 install beautifulsoup4
```
# Checking the Status Code a Site Returns
* Goal: confirm how the page responded
```python=
import requests
url = 'https://www.reddit.com/'
response = requests.get(url)
print(response)
#<Response [200]>
```
> Try another site: [天下雜誌](https://www.cw.com.tw/today)
* Common [HTTP status codes](https://blog.miniasp.com/post/2009/01/16/Web-developer-should-know-about-HTTP-Status-Code.aspx):
    * 200 - the client's request succeeded.
    * 403 - Forbidden (the page is blocking crawlers).
    * 404 - Not Found.
* Workaround for 403: disguise the request as a browser, then access the page.
```python=
import requests
import urllib.request
url = 'https://www.cw.com.tw/today'
# press F12 → Network → pick a GET request and copy its user-agent
fake_browser = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
request = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(request)
print(request, response.getcode())
#<urllib.request.Request object at 0x104aa9b20> 200
```
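The disguise can be inspected offline: a urllib `Request` object stores the headers it will send, so we can confirm the User-Agent is attached without contacting any site (the URL below is just a placeholder):

```python
import urllib.request

fake_browser = {'user-agent': 'Mozilla/5.0'}
req = urllib.request.Request('https://example.com/', headers=fake_browser)
# urllib normalizes header names to Capitalized-with-dashes form
print(req.get_header('User-agent'))  # → Mozilla/5.0
```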
# Text Scraping
## Scraping methods:
1. Copy the entire page
2. Search for elements in the page (HTML elements)
3. Search for elements in the page (CSS selectors)
4. Extract text from the page (regular expressions)
## Copying the entire page
```python=
import requests
url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)
file_name = 'gamer.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.text)
# equivalent, without a with-block:
# f = open(file_name, 'w', encoding='utf-8')
# f.write(response.text)
print('Success!')
```
```python=
import requests
import urllib.request
url = 'https://www.cw.com.tw/today'
fake_browser = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
request = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(request)
file_name = 'CommonWealth_Magazine.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.read().decode('utf-8'))
```
## Searching for elements in the page (HTML elements)
```python=
import requests
# the BeautifulSoup package
from bs4 import BeautifulSoup
url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)
# convert the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# now we can use it to search the page contents
title = soup.find('title').text
print(title)
```
```python=
import requests
import urllib.request
from bs4 import BeautifulSoup
url = 'https://www.cw.com.tw/today'
fake_browser = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
req = urllib.request.Request(url, headers = fake_browser)
response = urllib.request.urlopen(req)
# convert the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')
# now we can use it to search the page contents
title = soup.find('title').text
print(title)
#今日最新-天下雜誌
```
### Lab 1
* Business Weekly (商業週刊): https://www.businessweekly.com.tw/newlist.aspx
1. Check the returned status code and print it
```python=
#1
import requests
url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
print(response)
#<Response [200]>
```
2. Copy the entire page and save it as news.html
```python=
#2
import requests
url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
file_name = 'news.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.text)
print('Success!')
#Success!
```
3. Search for elements in the page (HTML elements --- grab the ```<title>``` tag)
```python=
#3
import requests
from bs4 import BeautifulSoup
url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title').text
print(title)
#今日最新文章 - 商業周刊 - 商周.com
```
## Searching for elements in the page (CSS selectors)
```python=
import requests
from bs4 import BeautifulSoup
url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)
soup = BeautifulSoup(response.text, 'html.parser')
# or use a CSS selector
side_titles = soup.select('.BA-left li a')
for title in side_titles:
    print(title.text)
```
```python=
import requests
import urllib.request
from bs4 import BeautifulSoup
url = 'https://www.cw.com.tw/today'
fake_browser = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
req = urllib.request.Request(url, headers = fake_browser)
response = urllib.request.urlopen(req)
# convert the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')
# or use a CSS selector
side_titles = soup.select('#item1 > div:nth-child(1) > section:nth-child(3) > div.caption > p')
for title in side_titles:
    print(title.text)
```
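`soup.select` takes standard CSS selectors. A small offline example (HTML inlined, class names invented) of a class + descendant selector in the same shape as the `.BA-left li a` selector used above:

```python
from bs4 import BeautifulSoup

# made-up HTML mimicking a sidebar list
html = '''
<div class="BA-left">
  <ul>
    <li><a href="/news/1">News 1</a></li>
    <li><a href="/news/2">News 2</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# ".BA-left li a" = every <a> inside an <li> inside class="BA-left"
titles = [a.text for a in soup.select('.BA-left li a')]
print(titles)  # → ['News 1', 'News 2']
```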
## Extracting text from the page (regular expressions)
* re: regular expressions, for extracting harder-to-target text passages
```python=
import requests
import urllib.request
from bs4 import BeautifulSoup
import re
url = 'https://www.cw.com.tw/today'
fake_browser = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
request = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(request)
# convert the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')
# or use a regular expression
response_crawling = soup.find_all('a', href=re.compile('article'))
for a in response_crawling:
    print(a.text)  # prints the text without the HTML tags
    print(a)       # prints the element together with its HTML tags
```
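The `href=re.compile('article')` filter keeps only links whose href contains that substring. The same matching can be seen with plain `re` on a few sample hrefs (the paths below are made up):

```python
import re

pattern = re.compile('article')  # same pattern passed to find_all above
hrefs = ['/article/5100001', '/today', '/article/5100002', '/about']
# pattern.search returns a match object (truthy) when 'article' appears anywhere
matched = [h for h in hrefs if pattern.search(h)]
print(matched)  # → ['/article/5100001', '/article/5100002']
```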
## Saving the page text to a file
* If the file already exists it is overwritten; otherwise a new file is created.
```python=
...
file_name = 'Lab2_text.txt'
# Method 1 (response_crawling is a list of elements, so write them one by one)
f = open(file_name, 'w', encoding='utf-8')
for a in response_crawling:
    f.write(a.text + '\n')
f.close()
# Method 2: the with-block closes the file automatically
with open(file_name, 'w', encoding='utf-8') as f:
    for a in response_crawling:
        f.write(a.text + '\n')
```
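The overwrite-or-create behaviour comes from mode `'w'`; mode `'a'` appends instead. A quick check using a throwaway temp file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w', encoding='utf-8') as f:  # file doesn't exist yet: created
    f.write('first\n')
with open(path, 'w', encoding='utf-8') as f:  # 'w' truncates: 'first' is gone
    f.write('second\n')
with open(path, 'a', encoding='utf-8') as f:  # 'a' appends to the end
    f.write('third\n')
with open(path, encoding='utf-8') as f:
    print(f.read())  # → second\nthird\n
```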
### Lab 2
1. Search for elements in the page (CSS selectors) --- scrape the headline text inside class = channel_cnt
```python=
import requests
from bs4 import BeautifulSoup
url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
side_titles = soup.select('#Scroll-panel-all a')
for title in side_titles:
    print(title.text)
```
2. Save the page text to a file --- continuing from the previous step, save it as Lab.txt
```python=
import requests
from bs4 import BeautifulSoup
url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
side_titles = soup.select('#Scroll-panel-all a')
file_name = 'Lab.txt'
file = open(file_name, 'w', encoding='utf8')
for title in side_titles:
    file.write(title.text + '\n')
    print(title.text)
file.close()
```
## Lab 3 --- Downloading a Novel (save the chapter titles + content as novel.txt)
* Novel (芸汐傳.天才小毒妃): https://www.ck101.org/293/293983/51812965.html
* Hint:
    1. First check what the page returns
    2. Use a CSS selector --- grab the text inside class yuedu_zhengwen
    3. Write the text to novel.txt and save it
```python=
# Method 1
import requests
from bs4 import BeautifulSoup
url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)  # a browser header could be added here to avoid being blocked
#print(response)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')
file_name = './novel.txt'
file = open(file_name, 'w', encoding = 'utf8')
file.write(title + '\n' + '\n')
for i in items:
    file.write(i.text + '\n')
    print(i.text + '\n')
file.close()
# Method 2
import requests
from bs4 import BeautifulSoup
import os
word_first = 2965  # word_last = 4330; 4330 - 2965 + 1 = 1366 chapters
url = 'https://www.ck101.org/293/293983/5181'+ str(word_first) +'.html'
response = requests.get(url)  # a browser header could be added here to avoid being blocked
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.select('.yuedu_zhengwen')
items_string = str(items).replace('<br/>','').replace('</div>','').replace('[<div','').replace('class="yuedu_zhengwen"','').replace('id="content">','')
items_string_split = items_string.split()
print(items_string_split)
folder_path ='./novel/'
if not os.path.exists(folder_path):  # check whether the folder exists
    os.makedirs(folder_path)  # create the folder
file_name = './novel/Lab.txt'
file = open(file_name, 'w', encoding = 'utf8')
for items_string in items_string_split:
    file.write(items_string + '\n')
    # print(items_string + '\n')
file.close()
print('Done!')
```
### Extension 1 (remove unwanted junk text, e.g. 小÷說◎網 】,♂小÷說◎網 】,)
```python=
import requests
from bs4 import BeautifulSoup
url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')
file_name = './Lab.txt'
file = open(file_name, 'w', encoding = 'utf8')
file.write(title + '\n' + '\n')
for i in items:
    # strip unwanted junk text, e.g. 小÷說◎網 】,♂小÷說◎網 】,
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,','').replace('<br/>','').replace('<div class="yuedu_zhengwen" id="content">','').replace('</div>','')
    file.write(i + '\n')
    print(i + '\n')
file.close()
```
### Extension 2 (check whether the novel folder exists, then create Lab.txt inside it and save)
```python=
import requests
from bs4 import BeautifulSoup
import os
url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')
# check whether the folder exists
folder_path ='./novel/'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)  # create the folder
# create Lab.txt inside the novel folder
file_name = './novel/Lab.txt'
file = open(file_name, 'w', encoding = 'utf8')
file.write(title + '\n' + '\n')
for i in items:
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,','').replace('<br/>','').replace('<div class="yuedu_zhengwen" id="content">','').replace('</div>','')
    file.write(i + '\n')
    print(i + '\n')
file.close()
```
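The exists-check-then-makedirs pattern above can be collapsed: `os.makedirs` accepts `exist_ok=True`, which creates the folder only when needed and never raises if it is already there (a temp directory is used here so the sketch leaves no litter):

```python
import os
import tempfile

folder_path = os.path.join(tempfile.mkdtemp(), 'novel')
os.makedirs(folder_path, exist_ok=True)  # creates the folder
os.makedirs(folder_path, exist_ok=True)  # second call is a harmless no-op
print(os.path.isdir(folder_path))  # → True
```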
### Extension 3 (automatically crawl multiple chapters and save each as its own .txt file)
```python=
# Method 1
import requests
from bs4 import BeautifulSoup
import os
index = 0
# check whether the folder exists
folder_path ='./novel/'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)  # create the folder
def get_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find('title').text
    items = soup.select('.yuedu_zhengwen')
    file_write(items, title)
def file_write(items, title):
    global index
    file_name = './novel/Lab' + str(index + 1) + '.txt'
    f = open(file_name, 'w', encoding='utf-8')
    f.write(title + '\n' + '\n')
    for i in items:
        i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,','').replace('<br/>','').replace('<div class="yuedu_zhengwen" id="content">','').replace('</div>','')
        f.write(i + '\n')
        # print(i + '\n')
    f.close()  # close the file
    index += 1
    print('Done!')
# automatically crawl multiple chapters and save them
url = ['https://www.ck101.org/293/293983/5181{}.html'.format(str(i)) for i in range(2965, 4331)]  # last page is 51814330, so the stop value is 4331
for u in url:
    get_content(u)
# Method 2
import requests
from bs4 import BeautifulSoup
import os
index = 0
url = ['https://www.ck101.org/293/293983/5181{}.html'.format(str(i)) for i in range(2965, 4331)]
def get_content(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)  # use a header to avoid being blocked
    soup = BeautifulSoup(response.content, 'html.parser')
    items = soup.select('.yuedu_zhengwen')
    items_string = str(items).replace('<br/>','').replace('</div>]','').replace('[<div','').replace('class="yuedu_zhengwen"','').replace('id="content">','')
    items_string_split = items_string.split()
    print(items_string_split)
    file_write(items_string_split, items)
def file_write(items_string_split, items):
    global index
    a = ''
    for items_string in items_string_split:
        a = a + items_string + '\n'
    print(a)
    novel_name = './novel' + str(index + 1) + '.txt'
    with open(novel_name, 'w', encoding='utf-8') as f:  # write the chapter text
        f.write(a)
    index += 1
    print('Done!')
for titles in url:
    get_content(titles)
```
# Image Scraping
## Scraping and saving images
* Open the Google image-search results page you want to scrape
```python=
import requests
from bs4 import BeautifulSoup
import os
url = 'https://www.google.com/search?rlz=1C2CAFB_enTW617TW617&biw=1600&bih=762&tbm=isch&sa=1&ei=Z3BUXLqNOZmk-QbW_KaYDw&q=%E7%8B%97&oq=%E7%8B%97&gs_l=img.3..0l10.18328.18868..20040...0.0..0.52.143.3......1....1..gws-wiz-img.......0i24.5zgXwVAqY4U'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('img')
```
* Create a photo folder to store the downloaded images
```python=
folder_path ='./photo/'
if not os.path.exists(folder_path):  # check whether the folder exists
    os.makedirs(folder_path)  # create the folder
```
* Set how many images to download (photolimit), grab each image's src attribute, and save the images into the photo folder
```python=
photolimit = 10
for index, item in enumerate(items):
    if item and index < photolimit:
        # use .get('src') for the image link, then requests.get to fetch it
        html = requests.get(item.get('src'))
        img_name = folder_path + str(index + 1) + '.png'
        with open(img_name, 'wb') as f:  # write the image data as bytes
            f.write(html.content)
        print('Image %d' % (index + 1))
print('Done')
```
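The `photolimit` check above still walks every `<img>` tag even after the limit is hit; `itertools.islice` stops the iteration after the first N items. A sketch with placeholder strings standing in for the scraped tags:

```python
from itertools import islice

items = ['img1', 'img2', 'img3', 'img4', 'img5']  # stand-ins for <img> tags
photolimit = 3
saved = []
# islice cuts the enumerate stream off after photolimit items
for index, item in islice(enumerate(items), photolimit):
    saved.append((index + 1, item))  # same 1-based numbering as the download loop
print(saved)  # → [(1, 'img1'), (2, 'img2'), (3, 'img3')]
```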
## Scraping and saving images (search by keyword)
```python=
import requests
import urllib.request
from bs4 import BeautifulSoup
import os
import time
word = input('Input key word: ')
url = ('https://www.google.com/search?rlz=1C2CAFB_enTW617TW617&biw=1600&bih=762&tbm=isch&sa=1&ei=n3JUXIWIJNatoAT87a-4Cw&q='
       + word + '&oq=' + word + '&gs_l=img.3..35i39l2j0l8.40071.45943..46702...1.0..2.56.625.13......3....1..gws-wiz-img.....0..0i24.9fotvswIauk')
photolimit = 10
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)  # use a header to avoid being blocked
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('img')
folder_path ='./photo/'
if not os.path.exists(folder_path):  # check whether the folder exists
    os.makedirs(folder_path)  # create the folder
for index, item in enumerate(items):
    if item and index < photolimit:
        html = requests.get(item.get('src'))  # fetch the image via its src link
        img_name = folder_path + str(index + 1) + '.png'
        with open(img_name, 'wb') as file:  # write the image data as bytes
            file.write(html.content)
        print('Image %d' % (index + 1))
        time.sleep(1)
print('Done')
```
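Concatenating the raw keyword into the URL only works because requests percent-encodes it on the way out; building the query string explicitly with `urllib.parse.urlencode` is safer for non-ASCII keywords (the parameters below are a trimmed-down stand-in for the long Google query):

```python
from urllib.parse import urlencode

word = '狗'  # keyword typed by the user
# urlencode percent-encodes each value with UTF-8
query = urlencode({'q': word, 'tbm': 'isch'})
url = 'https://www.google.com/search?' + query
print(query)  # → q=%E7%8B%97&tbm=isch
```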
## LAB - shutterstock
* https://www.shutterstock.com/
* Scrape and save images (search by keyword)
Hint:
1. Let the user type a keyword to search for images
2. Do "not" limit the number of images downloaded (no photolimit)
3. Download the images into the photo_sheep folder
```python=
import requests
from bs4 import BeautifulSoup
import os
#import time
word = input('Keyword: ')
url = 'https://www.shutterstock.com/search?search_source=base_landing_page&language=zh-Hant&searchterm=' + word + '&image_type=all'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
all_img = soup.find_all('img')
folder_path ='./photo_sheep/'
if not os.path.exists(folder_path):  # check whether the folder exists
    os.makedirs(folder_path)  # create the folder
for index, img in enumerate(all_img):
    if img.get('src'):  # skip tags without a src (e.g. lazy-loaded images)
        html = requests.get(img.get('src'))
        img_name = folder_path + str(index + 1) + '.png'
        with open(img_name, 'wb') as f:
            f.write(html.content)
        print(index + 1)
        # time.sleep(1)
print('done ~ ')
```
# Appendix: Scraping Tables
* pandas, the table-scraping package
```import pandas```
## Scraping in table format (pandas - 1)
```python=
''' Method 1 '''
import pandas
pandas.set_option('display.max_columns', 200)
pandas.set_option('display.max_rows', 200)
url = 'https://course.ttu.edu.tw/u9/main/listcourse.php'
table_data = pandas.read_html(url)
file_name = 'crawl_table_byPandas.txt'
file = open(file_name, 'w', encoding='utf8')
file.write(str(table_data))
print('-- File Writing Ending --')
file.close()
# summary statistics
for data in table_data:
    print(data.describe())
''' Method 2 '''
import pandas
# read the HTML tables on the page (returns a list of DataFrames)
table = pandas.read_html('https://course.ttu.edu.tw/u9/main/listcourse.php')
# to promote the first row to column headers if needed:
# table[i].columns = table[i].iloc[0]
# table[i] = table[i].reindex(table[i].index.drop(0))
file_name = 'crawled_table.txt'
file = open(file_name, 'w', encoding='utf8')
for i in range(len(table)):
    file.write(str(table[i]))
print('-- File Writing Ending --')
file.close()
```
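Writing `str(table_data)` dumps pandas' console rendering into the text file; `DataFrame.to_csv` keeps the table structure instead. A sketch with a hand-built DataFrame standing in for one entry of the list `read_html` returns (column names are made up):

```python
import pandas

# stand-in for one DataFrame out of the list pandas.read_html(url) returns
df = pandas.DataFrame({'Course': ['Calculus', 'Physics'], 'Credits': [3, 2]})
# to_csv without a path argument returns the CSV text as a string
csv_text = df.to_csv(index=False)
print(csv_text)
```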
## Scraping item by item (CSS selector)
```python=
import requests
from bs4 import BeautifulSoup
response = requests.get("https://course.ttu.edu.tw/u9/main/listcourse.php")
soup = BeautifulSoup(response.text, "lxml")
tag = ".mistab"
with open('./crawl_table_byCSS.txt', 'w', encoding='utf8') as f:
    for course in soup.select(tag):
        print(course.get_text())
        f.write(course.get_text() + '\n')  # f is the open txt file object
print('-- File Writing Ending --')
```
# Appendix: Word Segmentation & Speech-to-Text
```python=
#http://www.chiehfuchan.com/%E7%B0%A1%E5%96%AE%E5%88%A9%E7%94%A8-python-%E5%A5%97%E4%BB%B6-speechrecognition-%E9%80%B2%E8%A1%8C%E8%AA%9E%E9%9F%B3%E8%BE%A8%E8%AD%98/
#https://ithelp.ithome.com.tw/articles/10196577
#https://zhuanlan.zhihu.com/p/50677236
import speech_recognition as sr
r = sr.Recognizer()
with sr.AudioFile("C:\\Users\\pcsh1\\Documents\\錄音\\2.wav") as source:  # path to your .wav file
    r.adjust_for_ambient_noise(source)  # compensate for ambient noise
    audio = r.listen(source)  # or: audio = r.record(source, duration=100)
en_simplechinese = r.recognize_google(audio, language='zh-TW')  # one language tag
print(en_simplechinese)
# with sr.Microphone() as source:
#     audio = r.listen(source)
#============================================================
### Convert to Traditional Chinese ###
from hanziconv import HanziConv
tra_chinese = HanziConv.toTraditional(en_simplechinese)
print(tra_chinese)
#============================================================
### jieba word segmentation ###
import jieba
import jieba.analyse
f = open('test.txt', 'w', encoding='utf8')  # 'w' empties the file first ('a' would append)
f.write(tra_chinese)  # write to the file
f.close()  # close the file
f = open('test.txt', 'r', encoding='utf8')
article = f.read()
tags = jieba.analyse.extract_tags(article, 10)
print('Top keywords:', tags)
f.close()
#============================================================
'''
### jieba word segmentation (top 100 keywords) ###
import jieba
import jieba.analyse
f = open('test.txt', 'r', encoding='utf8')
article = f.read()
tags = jieba.analyse.extract_tags(article, 100)
print('Top keywords:', tags)
'''
```
# Appendix: Logging In, Then Scraping Grades
```python=
import requests
from bs4 import BeautifulSoup as b
# replace these placeholders with your own student ID and password before running
payload = {'mail_id': 'your_student_id', 'mail_pwd': 'your_password'}
rs = requests.session()
res = rs.post('http://stu.fju.edu.tw/stusql/SingleSignOn/StuScore/SSO_stu_login.asp', data=payload)
res2 = rs.get('http://stu.fju.edu.tw/stusql/SingleSignOn/StuScore/stu_scoreter.asp')
# print(res2.content)
soup = b(res2.content, 'html.parser')
# the grade table uses three td alignments; scrape each group in turn
for align in ('left', 'center', 'right'):
    cells = soup.find_all('td', {'align': align, 'valign': None})
    for obj in cells:
        print(obj.contents[0].string)
    if align != 'right':
        print('===============')
```