---
title: Python Crawler
tags: Python, Crawler, 爬蟲
---

# Python Crawler
--- Chia

> 【Table of Contents】
> [TOC]
>
> Reference websites:
> 1. [認識網路爬蟲:解放複製貼上的時間](https://pala.tw/what-is-web-crawler/)
> 2. [影片:開始使用PYTHON撰寫網路爬蟲 (CRAWLER)](http://waww.you2repeat.com/watch/?v=woJ2ZpQ1Q9I)
> 3. [爬蟲 .content 和 .text 的用法區別](https://www.smwenku.com/a/5b8fc98d2b71776722158feb)
> 4. [简单爬虫的通用步骤](https://zhuanlan.zhihu.com/p/29017712)
> 5. [爬蟲 / PTT - 4](https://eugene87222.github.io/2018/02/15/PTT-crawler-4/)
> 6. [Python enumerate() 函数](http://www.runoob.com/python/python-func-enumerate.html)
> 7. [Python 爬蟲練習紀錄(二)](https://ivanjo39191.pixnet.net/blog/post/51107259-python-%E7%88%AC%E8%9F%B2%E7%B7%B4%E7%BF%92%E7%B4%80%E9%8C%84%28%E4%BA%8C%29)
> 8. [Python 使用 Beautiful Soup 抓取與解析網頁資料,開發網路爬蟲教學](https://blog.gtwang.org/programming/python-beautiful-soup-module-scrape-web-pages-tutorial/2/)
> 9. [Slides: Python Crawler](https://slides.com/bessy/python_crawler)
> 10. [Python 圖片隱寫](https://hackmd.io/@NIghTcAt/ByKr8jxGH)

---

# What is a Web Crawler?
* A program that automatically fetches the content of web pages.
* Concept breakdown (概念拆解)
    1. Open the target web page
    2. Copy the fields you need
    3. Paste them into Word / Excel
    4. Repeat
* Tool --- Python
    * Advantage: reading files, crawling, and writing files can all be done in one go!
* Package overview
    1. requests: sends requests to the target site's server
    2. BeautifulSoup: parses HTML
    3. pandas: scrapes tables
    4. selenium: a browser automation / testing tool, for pages that rely heavily on JavaScript
    5. re: regular expressions, for extracting text passages that need more precise matching

# Installing the required packages
* On Windows, open the Command Prompt as administrator and run:

```pip3 install requests```
```pip3 install beautifulsoup4```

# Checking the status code a site returns
* Purpose: confirm how the page responds to our request.
```python=
import requests

url = 'https://www.reddit.com/'
response = requests.get(url)
print(response)  #<Response [200]>
```
> Try another site: [天下雜誌](https://www.cw.com.tw/today)

* Common [HTTP status codes](https://blog.miniasp.com/post/2009/01/16/Web-developer-should-know-about-HTTP-Status-Code.aspx):
    * 200 - the client's request succeeded.
    * 403 - forbidden (the site is blocking crawlers).
    * 404 - not found.
* Workaround for 403: disguise the request as a browser before fetching the page.
```python=
import urllib.request

url = 'https://www.cw.com.tw/today'

# Press F12 - Network - copy the user-agent from a GET request
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
request = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(request)
print(request, response.getcode())
#<urllib.request.Request object at 0x104aa9b20> 200
```
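The same browser disguise also works with `requests` alone by passing the header dictionary to `requests.get`. A minimal sketch reusing the `fake_browser` headers above (an alternative to the urllib version, not part of the original notes):
```python=
import requests

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}

# requests sends our custom User-Agent, so the site treats the request like a normal browser
response = requests.get(url, headers=fake_browser)
print(response.status_code)  # 200 when the request is accepted
```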
# Scraping text
## Ways to scrape text
1. Copy the entire page
2. Search for elements in the page (HTML elements)
3. Search for elements in the page (CSS selectors)
4. Extract text from the page (regular expressions)

## Copying the entire page
```python=
import requests

url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)

file_name = 'gamer.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.text)
# Equivalent without the context manager:
# f = open(file_name, 'w', encoding='utf-8')
# f.write(response.text)
# f.close()
print('Success!')
```
```python=
import urllib.request

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
request = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(request)

file_name = 'CommonWealth_Magazine.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.read().decode('utf-8'))
```

## Searching for elements in the page (HTML elements)
```python=
import requests
# the BeautifulSoup package
from bs4 import BeautifulSoup

url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)

# Turn the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# Now we can search its contents
title = soup.find('title').text
print(title)
```
```python=
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
req = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(req)

# Turn the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')
# Now we can search its contents
title = soup.find('title').text
print(title)  #今日最新-天下雜誌
```
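`find` returns only the first matching tag, while `find_all` returns every match, and attributes such as `href` can then be read with `.get()`. A minimal sketch on the same gamer.com.tw page (purely illustrative, not part of the original lab):
```python=
import requests
from bs4 import BeautifulSoup

url = 'https://www.gamer.com.tw/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# find() gives one tag; find_all() gives a list of every matching tag
for link in soup.find_all('a'):
    text = link.text.strip()
    href = link.get('href')  # attribute access; may be None if the tag has no href
    if text and href:
        print(text, href)
```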
### Lab 1
* Business Weekly (商業週刊): https://www.businessweekly.com.tw/newlist.aspx
1. Check the status code the site returns and print it
```python=
#1
import requests

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
print(response)  #<Response [200]>
```
2. Copy the entire page and save it as news.html
```python=
#2
import requests

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)

file_name = 'news.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.text)
print('Success!')  #Success!
```
3. Search for elements in the page (HTML element --- grab the ```<title>```)
```python=
#3
import requests
from bs4 import BeautifulSoup

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title').text
print(title)  #今日最新文章 - 商業周刊 - 商周.com
```

## Searching for elements in the page (CSS selectors)
```python=
import requests
from bs4 import BeautifulSoup

url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)
soup = BeautifulSoup(response.text, 'html.parser')

# Alternatively, use a CSS selector
side_titles = soup.select('.BA-left li a')
for title in side_titles:
    print(title.text)
```
```python=
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
req = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(req)

# Turn the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')

# Alternatively, use a CSS selector
side_titles = soup.select('#item1 > div:nth-child(1) > section:nth-child(3) > div.caption > p')
for title in side_titles:
    print(title.text)
```

## Extracting text from the page (regular expressions)
* re: regular expressions, for extracting text passages that need more precise matching
```python=
import urllib.request
from bs4 import BeautifulSoup
import re

url = 'https://www.cw.com.tw/today'
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
}
request = urllib.request.Request(url, headers=fake_browser)
response = urllib.request.urlopen(request)

# Turn the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')

# Alternatively, filter tags with a regular expression
response_crawling = soup.find_all('a', href=re.compile('article'))
for a in response_crawling:
    print(a.text)  # text only, without the HTML tags
    print(a)       # the whole tag, HTML markup included
```

## Saving the scraped text to a file
* If the file already exists it is overwritten; otherwise a new file is created.
```python=
# ...continuing from the regular-expression example above
file_name = 'Lab2_text.txt'

# Method 1
f = open(file_name, 'w', encoding='utf-8')
for a in response_crawling:
    f.write(a.text + '\n')
f.close()

# Method 2
with open(file_name, 'w', encoding='utf-8') as f:
    for a in response_crawling:
        f.write(a.text + '\n')
```
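Opening a file with mode `'w'` overwrites it on every run; to keep adding lines to the same file across runs, mode `'a'` appends instead. A minimal sketch (the file name is just an example):
```python=
# 'a' appends to Lab2_text.txt instead of overwriting it
with open('Lab2_text.txt', 'a', encoding='utf-8') as f:
    f.write('one more line\n')
```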
### Lab 2
1. Search for elements in the page (CSS selector) --- scrape the headline text inside class = channel_cnt
```python=
import requests
from bs4 import BeautifulSoup

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
print(response)

soup = BeautifulSoup(response.text, 'html.parser')
side_titles = soup.select('#Scroll-panel-all a')
for title in side_titles:
    print(title.text)
```
2. Save the scraped text to a file --- continuing from the previous step, save it as Lab.txt
```python=
import requests
from bs4 import BeautifulSoup

url = 'https://www.businessweekly.com.tw/newlist.aspx'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
side_titles = soup.select('#Scroll-panel-all a')

file_name = 'Lab.txt'
file = open(file_name, 'w', encoding='utf8')
for title in side_titles:
    file.write(title.text + '\n')
    print(title.text)
file.close()
```

## Lab 3 --- Downloading a novel (save the chapter titles + content as novel.txt)
* Novel (芸汐傳.天才小毒妃): https://www.ck101.org/293/293983/51812965.html
* Hint:
    1. First check what the page returns
    2. Use a CSS selector --- scrape the text inside the class yuedu_zhengwen
    3. Write the text into novel.txt and save the file
```python=
# Method 1
import requests
from bs4 import BeautifulSoup

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)  # add a header here if access is blocked
#print(response)

soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')

file_name = './novel.txt'
file = open(file_name, 'w', encoding='utf8')
file.write(title + '\n' + '\n')
for i in items:
    file.write(i.text + '\n')
    print(i.text + '\n')
file.close()

# Method 2
import requests
from bs4 import BeautifulSoup
import os

word_first = 2965
#word_last = 4330  # 4330-2965+1=1366

url = 'https://www.ck101.org/293/293983/5181' + str(word_first) + '.html'
response = requests.get(url)  # add a header here if access is blocked
print(response)

soup = BeautifulSoup(response.content, 'html.parser')
items = soup.select('.yuedu_zhengwen')

# Strip the markup by plain string replacement
items_string = str(items).replace('<br/>', '').replace('</div>', '').replace('[<div', '').replace('class="yuedu_zhengwen"', '').replace('id="content">', '')
items_string_split = items_string.split()
print(items_string_split)

folder_path = './novel/'
if os.path.exists(folder_path) == False:  # check whether the folder exists
    os.makedirs(folder_path)              # create the folder

file_name = './novel/Lab.txt'
file = open(file_name, 'w', encoding='utf8')
for items_string in items_string_split:
    file.write(items_string + '\n')
    #print(items_string + '\n')
file.close()
print('Done!')
```

### Extension 1 (remove unwanted text such as 小÷說◎網 】,♂小÷說◎網 】,)
```python=
import requests
from bs4 import BeautifulSoup

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')

file_name = './Lab.txt'
file = open(file_name, 'w', encoding='utf8')
file.write(title + '\n' + '\n')
for i in items:
    # Remove unwanted text such as 小÷說◎網 】,♂小÷說◎網 】, plus leftover tags
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,', '').replace('<br/>', '').replace('<div class="yuedu_zhengwen" id="content">', '').replace('</div>', '')
    file.write(i + '\n')
    print(i + '\n')
file.close()
```

### Extension 2 (check whether the novel folder exists, then create Lab.txt inside it and save)
```python=
import requests
from bs4 import BeautifulSoup
import os

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')

# Check whether the folder exists
folder_path = './novel/'
if os.path.exists(folder_path) == False:
    os.makedirs(folder_path)  # create the folder

# Create Lab.txt inside the novel folder
file_name = './novel/Lab.txt'
file = open(file_name, 'w', encoding='utf8')
file.write(title + '\n' + '\n')
for i in items:
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,', '').replace('<br/>', '').replace('<div class="yuedu_zhengwen" id="content">', '').replace('</div>', '')
    file.write(i + '\n')
    print(i + '\n')
file.close()
```
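Chaining `.replace()` calls works, but BeautifulSoup can strip the markup by itself: every selected element has a `get_text()` method, so only the watermark string needs manual removal. A minimal sketch of that alternative (not the lab's official solution):
```python=
import requests
from bs4 import BeautifulSoup

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

for block in soup.select('.yuedu_zhengwen'):
    # get_text() drops all tags; separator/strip keep the paragraphs readable
    text = block.get_text(separator='\n', strip=True)
    text = text.replace('小÷說◎網 】,♂小÷說◎網 】,', '')  # remove the site watermark
    print(text)
```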
### Extension 3 (automatically crawl multiple chapters and save each one as its own .txt file)
```python=
# Method 1
import requests
from bs4 import BeautifulSoup
import os

index = 0

# Check whether the folder exists
folder_path = './novel/'
if os.path.exists(folder_path) == False:
    os.makedirs(folder_path)  # create the folder

def get_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find('title').text
    items = soup.select('.yuedu_zhengwen')
    file_write(items, title)

def file_write(items, title):
    global index
    file_name = './novel/Lab' + str(index + 1) + '.txt'
    f = open(file_name, 'w', encoding='utf-8')
    f.write(title + '\n' + '\n')
    for i in items:
        i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,', '').replace('<br/>', '').replace('<div class="yuedu_zhengwen" id="content">', '').replace('</div>', '')
        f.write(i + '\n')
        #print(i + '\n')
    f.close()  # close the file
    index += 1
    print('Done!')

# Crawl each chapter automatically and save it
url = ['https://www.ck101.org/293/293983/5181{}.html'.format(str(i)) for i in range(2965, 4330)]
for u in url:
    get_content(u)

# Method 2
import requests
from bs4 import BeautifulSoup
import os

index = 0
url = ['https://www.ck101.org/293/293983/5181{}.html'.format(str(i)) for i in range(2965, 4330)]

def get_content(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    items = soup.select('.yuedu_zhengwen')
    items_string = str(items).replace('<br/>', '').replace('</div>]', '').replace('[<div', '').replace('class="yuedu_zhengwen"', '').replace('id="content">', '')
    items_string_split = items_string.split()
    print(items_string_split)
    file_write(items_string_split, items)

def file_write(items_string_split, items):
    global index
    a = ''
    for items_string in items_string_split:
        a = a + items_string + '\n'
    print(a)
    novel_name = './novel' + str(index + 1) + '.txt'
    with open(novel_name, 'w', encoding='utf-8') as f:  # write the chapter text into its own file
        f.write(a)
    index += 1
    print('Done!')

for titles in url:
    get_content(titles)
```

# Scraping images
## Scraping and saving images
* Go to the Google image-search results page for the images you want
```python=
import requests
from bs4 import BeautifulSoup
import os

url = 'https://www.google.com/search?rlz=1C2CAFB_enTW617TW617&biw=1600&bih=762&tbm=isch&sa=1&ei=Z3BUXLqNOZmk-QbW_KaYDw&q=%E7%8B%97&oq=%E7%8B%97&gs_l=img.3..0l10.18328.18868..20040...0.0..0.52.143.3......1....1..gws-wiz-img.......0i24.5zgXwVAqY4U'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('img')
```
* Create the photo folder the images will be saved into
```python=
folder_path = './photo/'
if os.path.exists(folder_path) == False:  # check whether the folder exists
    os.makedirs(folder_path)              # create the folder
```
* Set how many images to download (photolimit), grab each image's src attribute, and save the images into the photo folder
```python=
photolimit = 10

for index, item in enumerate(items):
    if item and index < photolimit:
        # use get() to read the photo's src link, then send a request for it
        html = requests.get(item.get('src'))
        img_name = folder_path + str(index + 1) + '.png'
        with open(img_name, 'wb') as f:  # write the image data as bytes
            f.write(html.content)
            f.flush()
        print('Image %d' % (index + 1))
print('Done')
```

## Scraping and saving images (search by keyword)
```python=
import requests
from bs4 import BeautifulSoup
import os
import time

word = input('Input key word: ')
url = 'https://www.google.com/search?rlz=1C2CAFB_enTW617TW617&biw=1600&bih=762&tbm=isch&sa=1&ei=n3JUXIWIJNatoAT87a-4Cw&q=' + word + '&oq=' + word + '&gs_l=img.3..35i39l2j0l8.40071.45943..46702...1.0..2.56.625.13......3....1..gws-wiz-img.....0..0i24.9fotvswIauk'
photolimit = 10

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)  # use a header to avoid being blocked
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('img')

folder_path = './photo/'
if os.path.exists(folder_path) == False:  # check whether the folder exists
    os.makedirs(folder_path)              # create the folder

for index, item in enumerate(items):
    if item and index < photolimit:
        # use get() to read the photo's src link, then send a request for it
        html = requests.get(item.get('src'))
        img_name = folder_path + str(index + 1) + '.png'
        with open(img_name, 'wb') as file:  # write the image data as bytes
            file.write(html.content)
            file.flush()
        print('Image %d' % (index + 1))
        time.sleep(1)
print('Done')
```
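On a real Google results page, some `<img>` tags carry base64 `data:` URIs or relative paths rather than full http links, and `requests.get` cannot fetch those. A small guard skips such entries; this sketch reuses the `items` list and `folder_path` from the example above:
```python=
# Continuation of the download loop above: items comes from soup.find_all('img')
for index, item in enumerate(items):
    src = item.get('src')
    # Only absolute http(s) links can be fetched with requests.get;
    # skip data: URIs and relative paths
    if not src or not src.startswith('http'):
        continue
    html = requests.get(src)
    with open(folder_path + str(index + 1) + '.png', 'wb') as f:
        f.write(html.content)
```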
## LAB - shutterstock
* https://www.shutterstock.com/
* Scrape and save images (search by keyword)
* Hint:
    1. Let the user search for images by entering a keyword
    2. Do **not** limit how many images are downloaded (no photolimit)
    3. Download the images and save them into the photo_sheep folder
```python=
import requests
from bs4 import BeautifulSoup
import os
#import time

word = input('Keyword: ')
url = 'https://www.shutterstock.com/search?search_source=base_landing_page&language=zh-Hant&searchterm=' + word + '&image_type=all'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
all_img = soup.find_all('img')

folder_path = './photo_sheep/'
if os.path.exists(folder_path) == False:  # check whether the folder exists
    os.makedirs(folder_path)              # create the folder

for index, img in enumerate(all_img):
    if img:
        html = requests.get(img.get('src'))
        img_name = folder_path + str(index + 1) + '.png'
        with open(img_name, 'wb') as f:
            f.write(html.content)
            f.flush()
        print(index + 1)
        #time.sleep(1)
print('done ~ ')
```

# Appendix: scraping tables
* pandas is the package used for scraping tables
```import pandas```

## Scraping in table form (pandas)
```python=
'''Method 1'''
import pandas

pandas.set_option('display.max_columns', 200)
pandas.set_option('display.max_rows', 200)

url = 'https://course.ttu.edu.tw/u9/main/listcourse.php'
table_data = pandas.read_html(url)

file_name = 'crawl_table_byPandas.txt'
file = open(file_name, 'w', encoding='utf8')
file.write(str(table_data))
print('-- File Writing Ending --')
file.close()

# Summary statistics for each table
for data in table_data:
    print(data.describe())

'''Method 2'''
import pandas

# Read the html of the table
table = pandas.read_html('https://course.ttu.edu.tw/u9/main/listcourse.php')

# Use the first row as the column titles
# table.columns = table.iloc[0]
# table.reindex(table.index.drop(1))

file_name = 'crawled_table.txt'
file = open(file_name, 'w', encoding='utf8')
for i in range(len(table)):
    file.write(str(table[i]))
print('-- File Writing Ending --')
file.close()
```

## Scraping item by item (CSS selector)
```python=
import requests
from bs4 import BeautifulSoup

response = requests.get('https://course.ttu.edu.tw/u9/main/listcourse.php')
soup = BeautifulSoup(response.text, 'lxml')

tag = '.mistab'
with open('./crawl_table_byCSS.txt', 'w', encoding='utf8') as f:
    for course in soup.select(tag):
        print(course.get_text())
        f.write(course.get_text())
        '''
        results = course.get_text()
        print(results)
        f.write(results + '\n')  # f is the opened txt file object
        '''
print('-- File Writing Ending --')
```
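Back in the pandas approach, `pandas.read_html` returns a list of DataFrames, so instead of dumping `str(table_data)` into a .txt file, each table can be written out as a CSV that keeps its rows and columns. A minimal sketch (the output file names are just examples):
```python=
import pandas

url = 'https://course.ttu.edu.tw/u9/main/listcourse.php'
tables = pandas.read_html(url)  # one DataFrame per <table> on the page

for i, df in enumerate(tables):
    # to_csv preserves the table structure, unlike writing str(df) to a txt file
    df.to_csv('course_table_{}.csv'.format(i), index=False, encoding='utf-8-sig')
print('Saved %d tables' % len(tables))
```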
# Appendix: word segmentation & speech-to-text
```python=
# References:
# http://www.chiehfuchan.com/%E7%B0%A1%E5%96%AE%E5%88%A9%E7%94%A8-python-%E5%A5%97%E4%BB%B6-speechrecognition-%E9%80%B2%E8%A1%8C%E8%AA%9E%E9%9F%B3%E8%BE%A8%E8%AD%98/
# https://ithelp.ithome.com.tw/articles/10196577
# https://zhuanlan.zhihu.com/p/50677236
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("C:\\Users\\pcsh1\\Documents\\錄音\\2.wav") as source:  # file.wav
    r.adjust_for_ambient_noise(source)  # compensate for background noise
    audio = r.listen(source)
    # audio = r.record(source, duration=100)

en_simplechinese = r.recognize_google(audio, language='zh-TW ; en-US')
print(en_simplechinese)

#with sr.Microphone() as source:
    #audio = r.listen(source)

#============================================================
### Convert to Traditional Chinese ###
from hanziconv import HanziConv

tra_chinese = HanziConv.toTraditional(en_simplechinese)
print(tra_chinese)

#============================================================
### jieba word segmentation ###
import jieba
import jieba.analyse

f = open('test.txt', 'w', encoding='utf8')  # 'w' truncates the file ('a' would append)
f.write(tra_chinese)  # write the recognized text to the file
f.close()             # close the file

f = open('test.txt', 'r', encoding='utf8')
article = f.read()
tags = jieba.analyse.extract_tags(article, 10)
# tags = jieba.analyse.extract_tags(article, 100)  # extract more keywords instead
print('Top keywords:', tags)
f.close()
```

# Appendix: log in with an account, then scrape grades
```python=
import requests
from bs4 import BeautifulSoup as b

# Fill in your own student ID and password here
payload = {'mail_id': 'your_student_id', 'mail_pwd': 'your_password'}

rs = requests.session()
res = rs.post('http://stu.fju.edu.tw/stusql/SingleSignOn/StuScore/SSO_stu_login.asp', data=payload)
res2 = rs.get('http://stu.fju.edu.tw/stusql/SingleSignOn/StuScore/stu_scoreter.asp')
#print(res2.content)

soup = b(res2.content, 'html.parser')

# <td> cells aligned left
all_td1 = soup.find_all('td', {'align': 'left', 'valign': None})
list1 = []
for obj in all_td1:
    list1.append(obj.contents[0])
    #print(obj)
for obj in list1:
    print(obj.string)

print('===============')

# <td> cells aligned center
all_td2 = soup.find_all('td', {'align': 'center', 'valign': None})
list2 = []
for obj in all_td2:
    list2.append(obj.contents[0])
    #print(obj)
for obj in list2:
    print(obj.string)

print('===============')

# <td> cells aligned right
all_td3 = soup.find_all('td', {'align': 'right', 'valign': None})
list3 = []
for obj in all_td3:
    list3.append(obj.contents[0])
    #print(obj)
for obj in list3:
    print(obj.string)
```
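Since the grade page is itself an HTML table, the `pandas.read_html` approach from the table appendix could parse it in one step instead of collecting `<td>` cells by alignment. A minimal sketch, assuming the logged-in `res2` response above and that the page contains at least one well-formed `<table>`:
```python=
import pandas

# res2 is the logged-in response fetched by the session above
tables = pandas.read_html(res2.text)
for df in tables:
    print(df)
```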