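---
tags: Python
---

# Web Crawler Menu

### Why crawl

Whenever we browse the web or look something up, we keep tweaking the request and then carefully picking the key content out of the result. Repeating those steps by hand is tedious, so to automate them (and sit back in the office chair with a coffee) we hand the work to a web crawler, a bot that browses the web automatically. Its job consists of:

1. Sending a request to the server, made up of the method, the URL, the headers and the form data
2. Reading the response; common formats are HTML, JSON, XML and images
3. Parsing the content, for example pulling the needed strings out of the DOM (document object model) of an HTML page
4. Storing what was extracted, for example saving a web table as an Excel file or putting downloaded images in a folder
5. Scheduling, so the program runs the fetch at fixed times

### Sending the request

Open the target page in Chrome, right-click and choose Inspect (or press F12) to open the developer tools. The row of tabs at the top includes:

1. Elements: the HTML and CSS of the current page
2. Console: the page's JavaScript console
3. Sources: the files loaded from the origin server
4. Network: how each resource was fetched and what the server returned

For crawling, the two tabs used most are Elements and Network. Elements shows which elements the HTML DOM contains and helps us find the content we need quickly. Network records every request in detail together with what came back, such as the request payload, JS files and images. Going deeper requires some front-end knowledge first. This time we will fetch example sentences for an English word.

```python
# Import the Request class from the request module of Python's built-in urllib package
from urllib.request import Request

# The word to look up
vocabulary = 'crawler'
# Use an f-string to put the word into the URL
url = f'https://sentence.yourdictionary.com/{vocabulary}'
# Form data to submit; this is a GET request, so there is none
form_data = None
# Request headers identify the client; this one claims a Mozilla/5.0 browser on 64-bit Windows
# Some sites check the headers to block malicious requests; adding a User-Agent usually gets through
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
# Build the request object
request = Request(url, form_data, headers, method='GET')
```

(Advanced, can be skipped for now) Many e-commerce sites have anti-crawling measures that use cookies to verify who is talking to the server. To track product information on Shopee, for example, we need the more advanced selenium (browser automation) to obtain the csrftoken (a defence against cross-site request forgery). First download the Chrome webdriver and make sure its version matches your own browser; open the menu in the top-right corner of Chrome and choose Help to see the current version. Download link: https://sites.google.com/a/chromium.org/chromedriver/downloads

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object;
# Selenium 3 accepted it directly as webdriver.Chrome('path')
service = Service('C:/Users/<username>/Downloads/chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.get('https://shopee.tw/')  # opens a new browser window
cookies = driver.get_cookies()    # a list of dicts
csrftoken = [item['value'] for item in cookies if item['name'] == 'csrftoken'][0]
```

### Reading the response

Once the request is built, we can test the waters and look at what comes back.

```python
from urllib.request import urlopen, urlretrieve

# urlopen sends the request and returns a response object
# (it raises HTTPError for 4xx and 5xx responses)
response = urlopen(request)
# Check the status; only continue when it starts with 2
print(response.status)
# read() returns bytes, decode() turns the bytes into a string
res_str = response.read().decode('utf-8')
# urlretrieve downloads a resource to disk; it can be HTML, PNG, JSON and so on
# It takes a URL string rather than a Request object, so no custom headers are sent
urlretrieve(url, 'download.html')
```

### Parsing the content

If the response is an HTML file, bring out the Chrome developer tools and dig up where the target data lives. Right-click, choose Inspect, and in the Elements tab expand the nested nodes one by one until you find the parent element that wraps the example sentences. This is what the expanded view looks like in Chrome:

![](https://i.imgur.com/DgfdO1G.jpg)

Found it! The example sentences sit in `li` elements under a `ul`. Each `li` has the class `voting_li`, a serial number as its id, and a `div` and a `b` inside. Handling all of that as raw strings would be painful. Luckily the HTML parsing package bs4 offers a `select` function that picks DOM elements the way `document.querySelectorAll` does in the front end, plus a `get_text()` function that pulls out the text cleanly even when other tags are mixed in.

```python
from bs4 import BeautifulSoup

# Parse the decoded HTML string from the previous step
# (the response object itself has already been consumed by read())
soup = BeautifulSoup(res_str, 'html.parser')
# Select every li element whose class is voting_li
# In CSS selectors a class is written with "." and an id with "#"
for li in soup.select('li.voting_li'):
    print(li.prettify())  # prettify() pretty-prints a single element
# get_text() extracts all the text inside the tags
sentences = [li.get_text() for li in soup.select('li.voting_li')]
```

Knowing regular expressions is of course a good thing. Here is a cheat sheet and a simple example.

| Position | Character classes | Quantifiers |
| -------- | -------- | -------- |
| `^` start | `.` any character except a newline | `*` zero or more times |
| `$` end | `[]` set of candidate characters | `+` one or more times |
| `()` capture group | `\d` digit, same as `[0-9]`; `\D` non-digit | `?` zero or one time |
| | `\w` word character, `[A-Za-z0-9_]` for ASCII text; `\W` non-word character | `{n}` exactly n times |
| | `\s` any whitespace, same as `[ \f\n\r\t\v]`; `\S` non-whitespace | `{n,m}` n to m times |

```python
import re

string = '<div>Some Text Here</div>'
# Search for everything wrapped in <div></div> with (.*); group(0) returns the whole match
re.search(r'<div>(.*)</div>', string).group(0)  # '<div>Some Text Here</div>'
# group(1) returns only what the parentheses captured
re.search(r'<div>(.*)</div>', string).group(1)  # 'Some Text Here'
```

If all you want is a table inside an HTML page, the pandas package is the easiest way.

```python
import pandas as pd

# read_html returns one DataFrame per <table> found on the page
dataframe_list = pd.read_html('URL of the page')
dataframe_list[0].to_excel('super_easy.xlsx', index=False)
```

### Saving the content

Turn the sentence-fetching code into a function, run it in a loop, and write the result to a JSON file that can serve as a checkpoint for next time.

```python
import json
import time
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

def get_sentences(vocabulary):
    url = f'https://sentence.yourdictionary.com/{vocabulary}'
    req = Request(url)
    res = urlopen(req)
    soup = BeautifulSoup(res, 'html.parser')
    sentences = [li.get_text() for li in soup.select('li.voting_li')]
    return sentences

results = dict()
for vocabulary in ['better', 'faster', 'stronger']:
    results[vocabulary] = get_sentences(vocabulary)
    time.sleep(1)  # some sites watch how often an IP sends requests

with open('sentences.json', 'w', encoding='utf-8') as file:
    json.dump(results, file)
# Change 'w' to 'r' and dump(results, file) to load(file) to read the file back
```

### Scheduling

> Windows users

1. Search for and open Task Scheduler (everyone should have it)
2. In the Actions pane on the right, click Create Task
3. (General) Give the task a unique name and change the security option to "Run whether user is logged on or not"
4. (Triggers) Add a trigger that says when the program should run, and make sure Enabled in the bottom-left corner is checked
5. (Actions) Put the path of python.exe in "Program/script" and the path of the .py file in "Add arguments"
6. After clicking OK you will be asked for your login password

> macOS and Linux users (cron in the terminal)

|Minute|Hour|Day|Month|Weekday|User|Command|
|-|-|-|-|-|-|-|
|0-59|0-23|1-31|1-12|0 (Sun) to 6 (Sat)|root|command and its arguments|

|Symbol|Meaning|
|-|-|
|`*/5`|every five units|
|`1-5`|from unit 1 through 5|
|`*`|any value|

```shell=
# Open the crontab file
$ nano /etc/crontab
# Run once every day at 06:00
0 6 * * * root /usr/bin/python /home/root/crawler.py
# Run every 15 minutes
*/15 * * * * root /usr/bin/python /home/root/crawler.py
# Run every hour, Monday to Friday
0 * * * 1-5 root /usr/bin/python /home/root/crawler.py
# Start the cron service
$ service cron start
```

## selenium

> webdriver

* [Chrome](https://sites.google.com/a/chromium.org/chromedriver/downloads)
* [Edge](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)
* [Firefox](https://github.com/mozilla/geckodriver/releases)
* [Safari](https://webkit.org/blog/6900/webdriver-support-in-safari-10/)

```python=
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('http://www.google.com')

# Class names copied from Google's markup when this note was written; they may have changed
# find_element(By..., ...) replaces the find_element_by_* helpers removed in Selenium 4
el = driver.find_element(By.CSS_SELECTOR, '.gLFyf.gsfi')  # search box
el.send_keys('hi')               # type text
el.send_keys(Keys.CONTROL, 'c')  # send a key combination (Ctrl+C)
el.clear()                       # empty the input

btn = driver.find_element(By.CLASS_NAME, 'gNO89b')  # search button
btn.click()

driver.back()     # go back one page in the history
driver.forward()  # go forward one page
driver.close()    # close the window once everything is done
```

## puppeteer + cheerio

```javascript=
const puppeteer = require('puppeteer');
const fs = require('fs');

// Render a local HTML file and save it as an A4 PDF
async function printPDF() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // A local file has to be opened through a file:// URL
  await page.goto('file:///C:/Users/Bill/Desktop/example.html', { waitUntil: 'networkidle0' });
  const pdf = await page.pdf({ format: 'A4' });
  await browser.close();
  fs.writeFileSync('example.pdf', pdf);
  return pdf;
}

printPDF().catch(console.error);
```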
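Coming back to the JSON file written in the sentence-saving step: the note says it can serve as a checkpoint, and the sketch below shows one way that could look. It is only a minimal illustration using the standard library. The names `load_checkpoint` and `crawl_with_checkpoint`, and the `fetch` and `path` parameters, are hypothetical and not part of the original code; `fetch` stands for any function that maps a word to a list of sentences, such as `get_sentences`.

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return the previously saved results, or an empty dict if no file exists."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open('r', encoding='utf-8') as file:
        return json.load(file)


def crawl_with_checkpoint(words, fetch, path):
    """Fetch only the words missing from the checkpoint, then save it again.

    `fetch` is a hypothetical stand-in for any callable that takes a word
    and returns a list of sentences.
    """
    results = load_checkpoint(path)
    for word in words:
        if word in results:
            continue  # already fetched in an earlier run, skip the request
        results[word] = fetch(word)
    with Path(path).open('w', encoding='utf-8') as file:
        json.dump(results, file, ensure_ascii=False, indent=2)
    return results
```

Calling `crawl_with_checkpoint(['better', 'faster', 'stronger'], get_sentences, 'sentences.json')` a second time would then send no requests for words that are already in the file, which also keeps the load on the target site down.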