Python Crawler - Chia

Reference websites:

  1. Understanding web crawlers: freeing up copy-and-paste time
  2. Video: Getting started with writing a web crawler in Python
  3. The difference between .content and .text in a crawler
  4. General steps of a simple crawler
  5. Crawler / PTT - 4
  6. The Python enumerate() function
  7. Python crawler practice notes (part 2)
  8. Using Beautiful Soup in Python to scrape and parse web data: a web crawler tutorial
  9. Slides: Python Crawler
  10. Python image steganography

What is a web crawler?

  • A program that automatically fetches web page content
  • Breaking down the concept:
    1. Open the web page
    2. Copy the fields you need
    3. Paste them into Word / Excel
    4. Repeat
  • Tool: Python
    • Advantage: reading files, crawling, and writing files all in one go!
  • Package overview (a minimal sketch combining the first two follows this list):
    1. requests: sends requests to the target page's server
    2. BeautifulSoup: parses HTML
    3. pandas: scrapes tables
    4. selenium: a browser automation / web testing tool for handling trickier JavaScript
    5. re: regular expressions, for extracting harder-to-reach text passages
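
A minimal sketch of how the first two packages work together: requests fetches the page, BeautifulSoup parses it. The URL is just one of the example sites used later in these notes; the same pattern recurs throughout.

import requests
from bs4 import BeautifulSoup

url = 'https://www.gamer.com.tw/'                    # example URL; any public page works
response = requests.get(url)                          # requests fetches the raw HTML
soup = BeautifulSoup(response.text, 'html.parser')    # BeautifulSoup parses it into a tree
print(soup.find('title').text)                        # search the parsed tree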

Installing the required packages

  • On Windows, open the Command Prompt as administrator and enter the commands below (optional packages used in the later appendices are listed right after them):
    pip3 install requests
    pip3 install beautifulsoup4
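
The appendix sections also rely on pandas (table scraping) and the lxml parser, and selenium is mentioned in the package overview. If you plan to follow those parts, they can be installed the same way (assuming pip3 points at your Python 3 environment):

    pip3 install pandas
    pip3 install lxml
    pip3 install selenium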

Checking the status code a site returns

  • Purpose: confirm what the page returns
import requests

url = 'https://www.reddit.com/'
response = requests.get(url)
print(response)  # <Response [200]>
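
If you want the numeric code itself rather than the printed Response object, the requests Response also exposes it directly; a small sketch using the same response as above:

print(response.status_code)  # e.g. 200
print(response.ok)           # True for any status code below 400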

Try another site: CommonWealth Magazine (天下雜誌)

  • Common HTTP status codes
    • 200 - the client request succeeded.
    • 403 - Forbidden (the site is blocking crawlers).
    • 404 - Not Found.
  • Fix for 403: disguise the request as a regular browser before accessing the page.
import urllib.request

url = 'https://www.cw.com.tw/today'
# Press F12 - Network tab - open the GET request to copy your browser's User-Agent string
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
request = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(request)
print(request, response.getcode())
# <urllib.request.Request object at 0x104aa9b20> 200
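
The same disguise also works with requests by passing a headers keyword, which is the approach the later image-scraping examples take; a small sketch reusing the same fake_browser dict:

import requests

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
response = requests.get(url, headers=fake_browser)  # pretend to be a browser
print(response.status_code)                          # 200 if the disguise is accepted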

Scraping text

Methods for scraping text:

  1. Copy the entire page
  2. Search for elements in the page (HTML elements)
  3. Search for elements in the page (CSS selectors)
  4. Scrape text from the page (regular expressions)

Copying the entire page

import requests

url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)

file_name = 'gamer.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.text)
# Equivalent without the with statement:
# f = open(file_name, 'w', encoding='utf-8')
# f.write(response.text)
print('Success!')
import urllib.request

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
request = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(request)

file_name = 'CommonWealth_Magazine.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.read().decode('utf-8'))

Searching for elements in the page (HTML elements)

import requests
# BeautifulSoup package
from bs4 import BeautifulSoup

url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)

# Turn the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# Now it can be searched
title = soup.find('title').text
print(title)
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
req = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(req)

# Turn the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')
# Now it can be searched
title = soup.find('title').text
print(title)  # 今日最新-天下雜誌

Lab 1

  1. Check the returned status code and print it
# 1
import requests

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
print(response)  # <Response [200]>
  2. Copy the entire page and save it as news.html
# 2
import requests

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)

file_name = 'news.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.text)
print('Success!')  # Success!
  3. Search for an element in the page (HTML element: grab the <title>)
# 3
import requests
from bs4 import BeautifulSoup

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title').text
print(title)  # 今日最新文章 - 商業周刊 - 商周.com

Searching for elements in the page (CSS selectors)

import requests
from bs4 import BeautifulSoup

url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)
soup = BeautifulSoup(response.text, 'html.parser')

# Alternatively, use a CSS selector
side_titles = soup.select('.BA-left li a')
for title in side_titles:
    print(title.text)
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
req = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(req)

# Turn the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')
# Alternatively, use a CSS selector
side_titles = soup.select('#item1 > div:nth-child(1) > section:nth-child(3) > div.caption > p')
for title in side_titles:
    print(title.text)
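
When a selector is expected to match exactly one node, BeautifulSoup's select_one() returns just the first match (or None) instead of a list, which avoids the loop; a small sketch reusing the soup object and the same selector from above, purely for illustration:

first_caption = soup.select_one('#item1 > div:nth-child(1) > section:nth-child(3) > div.caption > p')
if first_caption is not None:  # select_one returns None when nothing matches
    print(first_caption.text)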

Scraping text from the page (regular expressions)

  • re: regular expressions, for extracting harder-to-reach text passages
import urllib.request
from bs4 import BeautifulSoup
import re

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
request = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(request)

# Turn the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')
# Alternatively, use a regular expression: find <a> tags whose href contains "article"
response_crawling = soup.find_all('a', href=re.compile('article'))
for a in response_crawling:
    print(a.text)  # prints only the text, without the HTML tags
    print(a)       # prints the whole tag, HTML markup included
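
The example above only uses re inside find_all; the re module can also be applied to the raw HTML string on its own. A small sketch with an illustrative pattern and a tiny stand-in string (in practice the page source would be the decoded response):

import re

html = '<title>今日最新-天下雜誌</title>'        # stand-in string for illustration
titles = re.findall(r'<title>(.*?)</title>', html)  # capture everything between the tags
print(titles)  # ['今日最新-天下雜誌']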

Saving the scraped text to a file

  • If the file already exists it is overwritten; if not, a new file is created.
...  # continuing from the regular-expression example above
file_name = 'Lab2_text.txt'

# Method 1
f = open(file_name, 'w', encoding='utf-8')
for a in response_crawling:
    f.write(a.text + '\n')
f.close()

# Method 2: the with statement closes the file automatically
with open(file_name, 'w', encoding='utf-8') as f:
    for a in response_crawling:
        f.write(a.text + '\n')
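
Opening a file with mode 'w' always starts from an empty file; to keep adding to an existing file instead, open it in append mode 'a'. A small sketch using the same file name as above:

with open('Lab2_text.txt', 'a', encoding='utf-8') as f:  # 'a' appends instead of overwriting
    f.write('one more line\n')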

Lab 2

  1. Search for elements in the page (CSS selectors): scrape the headline text inside class = channel_cnt
import requests
from bs4 import BeautifulSoup

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')

side_titles = soup.select('#Scroll-panel-all a')
for title in side_titles:
    print(title.text)
  2. Save the scraped text to a file: continuing from the previous task, save it as Lab.txt
import requests
from bs4 import BeautifulSoup

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
side_titles = soup.select('#Scroll-panel-all a')

file_name = 'Lab.txt'
file = open(file_name, 'w', encoding='utf8')
for title in side_titles:
    file.write(title.text + '\n')
    print(title.text)
file.close()

Lab 3 - Downloading a novel (save the chapter title + content as novel.txt)

  1. First confirm what the page returns
  2. Use a CSS selector to scrape the text inside the element whose class is yuedu_zhengwen
  3. Write the text to novel.txt and save the file
# Method 1
import requests
from bs4 import BeautifulSoup

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)  # add a header here if access is restricted
# print(response)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')

file_name = './novel.txt'
file = open(file_name, 'w', encoding='utf8')
file.write(title + '\n' + '\n')
for i in items:
    file.write(i.text + '\n')
    print(i.text + '\n')
file.close()

# Method 2
import requests
from bs4 import BeautifulSoup
import os

word_first = 2965
# word_last = 4330  # 4330 - 2965 + 1 = 1366 chapters
url = 'https://www.ck101.org/293/293983/5181' + str(word_first) + '.html'
response = requests.get(url)  # add a header here if access is restricted
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.select('.yuedu_zhengwen')

# Strip the leftover HTML markup and split the text into pieces
items_string = str(items).replace('<br/>', '').replace('</div>', '').replace('[<div', '') \
                         .replace('class="yuedu_zhengwen"', '').replace('id="content">', '')
items_string_split = items_string.split()
print(items_string_split)

folder_path = './novel/'
if os.path.exists(folder_path) == False:  # check whether the folder exists
    os.makedirs(folder_path)              # create the folder

file_name = './novel/Lab.txt'
file = open(file_name, 'w', encoding='utf8')
for items_string in items_string_split:
    file.write(items_string + '\n')
    # print(items_string + '\n')
file.close()
print('Done!')

Extension 1 (strip out unwanted text such as 小÷說◎網 】,♂小÷說◎網 】,)

import requests
from bs4 import BeautifulSoup

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')

file_name = './Lab.txt'
file = open(file_name, 'w', encoding='utf8')
file.write(title + '\n' + '\n')
for i in items:
    # Strip out unwanted text such as 小÷說◎網 】,♂小÷說◎網 】, as well as leftover HTML tags
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,', '').replace('<br/>', '') \
              .replace('<div class="yuedu_zhengwen" id="content">', '').replace('</div>', '')
    file.write(i + '\n')
    print(i + '\n')
file.close()

Extension 2 (check whether the novel folder exists, then create Lab.txt inside it and save.)

import requests
from bs4 import BeautifulSoup
import os

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')

# Check whether the folder exists
folder_path = './novel/'
if os.path.exists(folder_path) == False:
    os.makedirs(folder_path)  # create the folder

# Create Lab.txt inside the novel folder
file_name = './novel/Lab.txt'
file = open(file_name, 'w', encoding='utf8')
file.write(title + '\n' + '\n')
for i in items:
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,', '').replace('<br/>', '') \
              .replace('<div class="yuedu_zhengwen" id="content">', '').replace('</div>', '')
    file.write(i + '\n')
    print(i + '\n')
file.close()
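
A side note: the existence check and the folder creation can be collapsed into one call, since os.makedirs accepts an exist_ok flag; a one-line sketch:

import os

os.makedirs('./novel/', exist_ok=True)  # creates the folder only if it does not already exist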

Extension 3 (automatically crawl multiple chapters and save each one as its own .txt file.)

# Method 1
import requests
from bs4 import BeautifulSoup
import os

index = 0

# Check whether the folder exists
folder_path = './novel/'
if os.path.exists(folder_path) == False:
    os.makedirs(folder_path)  # create the folder

def get_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find('title').text
    items = soup.select('.yuedu_zhengwen')
    file_write(items, title)

def file_write(items, title):
    global index
    file_name = './novel/Lab' + str(index + 1) + '.txt'
    f = open(file_name, 'w', encoding='utf-8')
    f.write(title + '\n' + '\n')
    for i in items:
        i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,', '').replace('<br/>', '') \
                  .replace('<div class="yuedu_zhengwen" id="content">', '').replace('</div>', '')
        f.write(i + '\n')
        # print(i + '\n')
    f.close()  # close the file
    index += 1
    print('Done!')

# Automatically crawl multiple chapters and save each one
url = ['https://www.ck101.org/293/293983/5181{}.html'.format(str(i)) for i in range(2965, 4330)]
for u in url:
    get_content(u)

# Method 2
import requests
from bs4 import BeautifulSoup

index = 0
url = ['https://www.ck101.org/293/293983/5181{}.html'.format(str(i)) for i in range(2965, 4330)]

def get_content(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)  # use a header to avoid being blocked
    soup = BeautifulSoup(response.content, 'html.parser')
    items = soup.select('.yuedu_zhengwen')
    items_string = str(items).replace('<br/>', '').replace('</div>]', '').replace('[<div', '') \
                             .replace('class="yuedu_zhengwen"', '').replace('id="content">', '')
    items_string_split = items_string.split()
    print(items_string_split)
    file_write(items_string_split, items)

def file_write(items_string_split, items):
    global index
    a = ''
    for items_string in items_string_split:
        a = a + items_string + '\n'
    print(a)
    novel_name = './novel' + str(index + 1) + '.txt'
    with open(novel_name, 'w', encoding='big5') as f:  # write the assembled chapter text to file
        f.write(a)
    index += 1
    print('Done!')

for titles in url:
    get_content(titles)
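
Both methods fire more than a thousand requests back-to-back, which many sites will throttle or block; the image-scraping example later in these notes pauses between requests with time.sleep, and the same idea can be applied here. A sketch of the download loop (url and get_content come from the examples above):

import time

for u in url:
    get_content(u)
    time.sleep(1)  # pause one second between chapters to stay polite to the server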

Scraping images

Scraping and saving images

  • Go to the Google Images results page for the term you want to search
import requests
from bs4 import BeautifulSoup
import os

url = 'https://www.google.com/search?rlz=1C2CAFB_enTW617TW617&biw=1600&bih=762&tbm=isch&sa=1&ei=Z3BUXLqNOZmk-QbW_KaYDw&q=%E7%8B%97&oq=%E7%8B%97&gs_l=img.3..0l10.18328.18868..20040...0.0..0.52.143.3......1....1..gws-wiz-img.......0i24.5zgXwVAqY4U'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('img')
  • Create the photo folder where the images will be saved
folder_path = './photo/'
if os.path.exists(folder_path) == False:  # check whether the folder exists
    os.makedirs(folder_path)              # create the folder
  • Set the number of images to download (photolimit), grab each image's src attribute, and save the downloaded images into the photo folder
photolimit = 10
for index, item in enumerate(items):
    if item and index < photolimit:
        # use get() to read the image's src link, then send a request for that file
        html = requests.get(item.get('src'))
        img_name = folder_path + str(index + 1) + '.png'
        with open(img_name, 'wb') as f:  # write the image data as bytes
            f.write(html.content)
            f.flush()
        print('Image %d' % (index + 1))
print('Done')
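
Depending on how Google serves the results page, some <img> tags may carry a base64 data: URI instead of an http link, and requests.get() can only fetch the latter. A hedged sketch that handles both cases, reusing items and folder_path from above:

import base64
import requests

for index, item in enumerate(items):
    src = item.get('src', '')
    if src.startswith('data:image'):
        # the image bytes are embedded directly in the page
        img_bytes = base64.b64decode(src.split(',', 1)[1])
    elif src.startswith('http'):
        img_bytes = requests.get(src).content
    else:
        continue  # skip empty or relative src values
    with open(folder_path + str(index + 1) + '.png', 'wb') as f:
        f.write(img_bytes)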

Scraping and saving images (search by entering a keyword)

import requests
from bs4 import BeautifulSoup
import os
import time

word = input('Input key word: ')
url = 'https://www.google.com/search?rlz=1C2CAFB_enTW617TW617&biw=1600&bih=762&tbm=isch&sa=1&ei=n3JUXIWIJNatoAT87a-4Cw&q=' + word + '&oq=' + word + '&gs_l=img.3..35i39l2j0l8.40071.45943..46702...1.0..2.56.625.13......3....1..gws-wiz-img.....0..0i24.9fotvswIauk'
photolimit = 10

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)  # use a header to avoid being blocked
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('img')

folder_path = './photo/'
if os.path.exists(folder_path) == False:  # check whether the folder exists
    os.makedirs(folder_path)              # create the folder

for index, item in enumerate(items):
    if item and index < photolimit:
        # use get() to read the image's src link, then send a request for that file
        html = requests.get(item.get('src'))
        img_name = folder_path + str(index + 1) + '.png'
        with open(img_name, 'wb') as file:  # write the image data as bytes
            file.write(html.content)
            file.flush()
        print('Image %d' % (index + 1))
        time.sleep(1)
print('Done')

LAB - shutterstock

Hint:

  1. Allow searching for images by entering a keyword
  2. Do not limit the number of images to download (no photolimit)
  3. Download the images and save them into the photo_sheep folder
import requests
from bs4 import BeautifulSoup
import os
# import time

word = input('Keyword: ')
url = 'https://www.shutterstock.com/search?search_source=base_landing_page&language=zh-Hant&searchterm=' + word + '&image_type=all'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
all_img = soup.find_all('img')

folder_path = './photo_sheep/'
if os.path.exists(folder_path) == False:  # check whether the folder exists
    os.makedirs(folder_path)              # create the folder

for index, img in enumerate(all_img):
    if img:
        html = requests.get(img.get('src'))
        img_name = folder_path + str(index + 1) + '.png'
        with open(img_name, 'wb') as f:
            f.write(html.content)
            f.flush()
        print(index + 1)
        # time.sleep(1)
print('done ~ ')

Appendix: Scraping tables

  • pandas is the package for scraping tables
    import pandas

Scraping in table format (pandas)

''' Method 1 '''
import pandas

pandas.set_option('display.max_columns', 200)
pandas.set_option('display.max_rows', 200)

url = 'https://course.ttu.edu.tw/u9/main/listcourse.php'
table_data = pandas.read_html(url)  # returns a list of DataFrames, one per <table>

file_name = "crawl_table_byPandas.txt"
file = open(file_name, 'w', encoding='utf8')
file.write(str(table_data))
print("-- File Writing Ending --")
file.close()

# Summary statistics for each table
for data in table_data:
    print(data.describe())

''' Method 2 '''
import pandas

# Read the HTML tables on the page
table = pandas.read_html("https://course.ttu.edu.tw/u9/main/listcourse.php")
# Use the first row as the column titles
# table.columns = table.iloc[0]
# table.reindex(table.index.drop(1))

file_name = 'crawled_table.txt'
file = open(file_name, 'w', encoding='utf8')
for i in range(len(table)):
    file.write(str(table[i]))
print("-- File Writing Ending --")
file.close()
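
Since read_html returns a list of DataFrames, each table can also be written straight to CSV instead of dumping str() output into a text file; a small sketch reusing table_data from Method 1 (the file-name pattern is just an example):

for i, data in enumerate(table_data):
    # one CSV per scraped table, without the DataFrame index column
    data.to_csv('crawl_table_{}.csv'.format(i + 1), index=False, encoding='utf8')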

Scraping item by item (CSS selector)

import requests
from bs4 import BeautifulSoup

response = requests.get("https://course.ttu.edu.tw/u9/main/listcourse.php")
soup = BeautifulSoup(response.text, "lxml")

tag = ".mistab"
with open('./crawl_table_byCSS.txt', 'w', encoding='utf8') as f:
    for course in soup.select(tag):
        print(course.get_text())
        f.write(course.get_text())
        '''
        results = course.get_text()
        print(results)
        f.write(results + '\n')  # f is the open text-file object
        '''
print("-- File Writing Ending --")

Appendix: Word segmentation & speech-to-text

# References:
# http://www.chiehfuchan.com/%E7%B0%A1%E5%96%AE%E5%88%A9%E7%94%A8-python-%E5%A5%97%E4%BB%B6-speechrecognition-%E9%80%B2%E8%A1%8C%E8%AA%9E%E9%9F%B3%E8%BE%A8%E8%AD%98/
# https://ithelp.ithome.com.tw/articles/10196577
# https://zhuanlan.zhihu.com/p/50677236
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("C:\\Users\\pcsh1\\Documents\\錄音\\2.wav") as source:  # any .wav file
    r.adjust_for_ambient_noise(source)  # compensate for ambient noise
    audio = r.listen(source)
    # audio = r.record(source, duration=100)
# language should be a single tag, e.g. 'zh-TW' or 'en-US'
en_simplechinese = r.recognize_google(audio, language='zh-TW')
print(en_simplechinese)

# with sr.Microphone() as source:
#     audio = r.listen(source)

# ============================================================
### Convert Simplified Chinese to Traditional Chinese ###
from hanziconv import HanziConv

tra_chinese = HanziConv.toTraditional(en_simplechinese)
print(tra_chinese)

# ============================================================
### jieba word segmentation ###
import jieba
import jieba.analyse

f = open('test.txt', 'w', encoding='utf8')  # 'w' clears any existing content; use 'a' to append instead
f.write(tra_chinese)  # write the transcript to the file
f.close()             # close the file

f = open('test.txt', 'r', encoding='utf8')
article = f.read()
tags = jieba.analyse.extract_tags(article, 10)
print('Top keywords:', tags)
f.close()

# ============================================================
'''
### jieba word segmentation ###
import jieba
import jieba.analyse

f = open('test.txt', 'r', encoding='utf8')
article = f.read()
tags = jieba.analyse.extract_tags(article, 100)
print('Top keywords:', tags)
'''

Appendix: Log in with an account & password, then scrape grades

import requests
from bs4 import BeautifulSoup as b

# Replace with your own student ID and password
payload = {'mail_id': 'your_student_id', 'mail_pwd': 'your_password'}

rs = requests.session()
res = rs.post('http://stu.fju.edu.tw/stusql/SingleSignOn/StuScore/SSO_stu_login.asp', data=payload)
res2 = rs.get('http://stu.fju.edu.tw/stusql/SingleSignOn/StuScore/stu_scoreter.asp')
# print(res2.content)
soup = b(res2.content, "html.parser")

all_td1 = soup.find_all('td', {'align': 'left', 'valign': None})
list1 = []
for obj in all_td1:
    list1.append(obj.contents[0])
    # print(obj)
for obj in list1:
    print(obj.string)

print("===============")
all_td2 = soup.find_all('td', {'align': 'center', 'valign': None})
list2 = []
for obj in all_td2:
    list2.append(obj.contents[0])
for obj in list2:
    print(obj.string)

print("===============")
all_td3 = soup.find_all('td', {'align': 'right', 'valign': None})
list3 = []
for obj in all_td3:
    list3.append(obj.contents[0])
for obj in list3:
    print(obj.string)