# Scraping Workflow: Crawling a Zhihu Topic as the Running Example

Goal: to document the trial-and-error journey of fighting a site's anti-scraping defenses.

> Target site: https://www.zhihu.com/topic/19722451/hot

---

```
import pandas as pd
import numpy as np
import re
import json
import os
import sys
import base64  # used later to decode the CAPTCHA image grabbed via JavaScript
import time
from time import sleep
import random
from random import randint
import urllib
from urllib import parse

import requests
from bs4 import BeautifulSoup

from selenium import webdriver, common
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException

import winsound  # alert beep; Windows only
import googlemaps
```

---

## 1. Try to find the request URL

> Open the target page, press F12, switch to the Network tab, click "All", find the entry that carries the page's results, then copy its request URL into the browser to inspect it.

![](https://hackmd.io/_uploads/BkuNIZKeT.png)

I clicked through every entry and none of them was usable. Time to cut losses and try the next trick.

### Another example: the People's Daily database of Xi Jinping's important speeches

> A site where this approach works: the People's Daily (人民網) database of Xi Jinping's important speeches.
> http://jhsjk.people.cn/result?searchArea=0&keywords=&isFuzzy=0

![](https://hackmd.io/_uploads/H1VnwWFe6.png)

The request URL returns JSON, the laziest scraping scenario of all: the data arrives already structured, no cleanup needed.

![](https://hackmd.io/_uploads/Byn4ubtgT.png)

```
Avgle = {"title": [], "pubtime": [], "url": [], "source": []}
# http://jhsjk.people.cn/testnew/result?keywords=&isFuzzy=0&searchArea=0&year=0&form=0&type=0&page=2&origin=%E5%85%A8%E9%83%A8&source=2
session_requests = requests.session()
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36'}
for i in range(0, 2):
    url = f"http://jhsjk.people.cn/testnew/result?keywords=%E4%B9%A0%E8%BF%91%E5%B9%B3&isFuzzy=0&searchArea=0&year=0&form=0&type=0&page={i}&origin=%E5%85%A8%E9%83%A8&source=2"
    sleep(randint(1, 5))  # random pause so the requests look less mechanical
    GetP = session_requests.get(url, headers=header)
    gj = GetP.json()
    for e in range(0, 10):  # each page carries 10 articles
        Avgle["title"].append(gj["list"][e]["title"])
        Avgle["pubtime"].append(gj["list"][e]["input_date"])
        Avgle["url"].append(
            "http://jhsjk.people.cn/article/" + gj["list"][e]["article_id"])
        Avgle["source"].append(gj["list"][e]["origin_name"])
xnxx = pd.DataFrame(Avgle)
```

![](https://hackmd.io/_uploads/SyluwbYgp.png)

---

## 2. Look for a ready-made free API

> Community discussion: https://zhuanlan.zhihu.com/p/87029765

![](https://hackmd.io/_uploads/Byu9AbKeT.png)

So much for getting something for nothing OWO

### Another example: Google Maps

> Scraping Google Maps for the restaurants around an MRT station; an official API with key-based rate limiting is available.

```
gmaps = googlemaps.Client(key='go apply for your own API key, hehe')
geocode_result = gmaps.geocode("動物園捷運站")  # Taipei Zoo MRT station

results = []
loc = geocode_result[0]['geometry']['location']
query_result = gmaps.places_nearby(keyword="餐廳", location=loc, radius=500)  # "餐廳" = restaurant
results.extend(query_result['results'])
while query_result.get('next_page_token'):
    time.sleep(2)  # the next_page_token takes a moment to become valid
    query_result = gmaps.places_nearby(page_token=query_result['next_page_token'])
    results.extend(query_result['results'])

# number of restaurants within a 500 m radius of the station
# (the Places API returns at most 60 results)
print("動物園捷運站為中心半徑500公尺的餐廳店家數量: " + str(len(results)))
for place in results:
    print(place['place_id'])
    print(gmaps.place(place_id=place['place_id'], language='zh-TW')['result']['name'])
    print(gmaps.place(place_id=place['place_id'], language='zh-TW')['result']['geometry']['location'])
```

![](https://hackmd.io/_uploads/SyxFXMFx6.png)

---

## 3. Fetch the page HTML

![](https://hackmd.io/_uploads/S1x7T-teT.png)

```
redtube = "https://www.zhihu.com/topic/19722451/hot?utm_id=0"
url_get = requests.get(redtube)
url_get
```

![](https://hackmd.io/_uploads/rJpZ2ZFxT.png)
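Before parsing, it is worth a quick look at what actually came back. A minimal sanity check, using nothing beyond the standard `requests` response attributes:

```
# a quick sanity check on the response before parsing
print(url_get.status_code)   # expect 200; an obvious bot often gets 403 instead
print(url_get.encoding)      # the encoding requests guessed from the headers
print(url_get.text[:200])    # peek at the start of the raw HTML
```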
```
url_decode = url_get.content.decode(
    "utf-8", "ignore").encode("gb2312", "ignore")  # round-trip through gb2312 silently drops characters it cannot represent (e.g. emoji)
url_soup = BeautifulSoup(url_decode, 'html.parser')
print(url_soup)
```

![](https://hackmd.io/_uploads/SyCf2WFxp.png)

```
url_meta = url_soup.find_all(attrs={"class": "ContentItem-title"})
for title in url_meta:
    print(title.getText())
```

![](https://hackmd.io/_uploads/H1D42Ztea.png)

Just as I was celebrating that data was coming through, I noticed it was far from complete. It turns out Zhihu is an infinite-scroll page: more posts only load as you keep scrolling down. Another dead end, so on to the next trick.

### Another example: the People's Daily database of Xi Jinping's important speeches

> A site where this approach works: the speech database again, this time scraping the article body behind each link collected in the step 1 example.
> http://jhsjk.people.cn/article/40088445

```
df = xnxx  # the DataFrame of titles and URLs built in step 1
df['Content'] = pd.Series()
session_requests = requests.session()
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36'}
for w in range(0, 20):
    url = df["url"][w]
    url_get = session_requests.get(url, headers=header, timeout=100)
    url_decode = url_get.content.decode(
        "utf-8", "ignore").encode("gb2312", "ignore")
    url_soup = BeautifulSoup(url_decode, 'html.parser')
    url_meta = url_soup.find_all(attrs={"class": "d2txt_con clearfix"})
    for title in url_meta:
        df["Content"][w] = title.getText()
    print(w, 'Finished!', df["title"][w])
```

![](https://hackmd.io/_uploads/BJTshbYgT.png)

---

## 4. Selenium, the usual way: launching a fresh automated browser

```
url = 'https://www.zhihu.com/topic/19722451/hot'
s = Service(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver = webdriver.Chrome(service=s)
driver.get(url)
```

Scroll down until we hit the very bottom of the page.

> Reference: https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python

```
# On a slow connection the scrolling stalls; pause long enough for each batch to
# load before the next scroll, otherwise the loop exits before the real bottom.
SCROLL_PAUSE_TIME = 3

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
```

Partway through the scrolling, the URL suddenly jumped elsewhere; after several retries it worked only intermittently.

![](https://hackmd.io/_uploads/r1_HtMtxT.png)

Some sites' anti-bot defenses behave exactly like this. OpenAI, for example:

![](https://hackmd.io/_uploads/BkslqGKxa.png)

---

## 5. Selenium taking over an existing browser

> Reference:
> https://blog.csdn.net/xue_11/article/details/125269232
> macOS:
> https://blog.csdn.net/qq_46601555/article/details/131559972?spm=1001.2101.3001.6650.2&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EYuanLiJiHua%7EPosition-2-131559972-blog-130727783.235%5Ev38%5Epc_relevant_sort_base2&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EYuanLiJiHua%7EPosition-2-131559972-blog-130727783.235%5Ev38%5Epc_relevant_sort_base2&utm_relevant_index=5

```
# First start Chrome from a terminal with remote debugging enabled:
# cd C:\"Program Files (x86)"\Google\Chrome\Application
# .\chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"
```

![](https://hackmd.io/_uploads/SJpMoftxp.png)

```
chrome_options = Options()
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
# the old Selenium 3 style:
# chrome_driver = "C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
# driver = webdriver.Chrome(chrome_driver, options=chrome_options)
s = Service(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver = webdriver.Chrome(service=s, options=chrome_options)
```
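Rather than retyping those terminal commands every session, the launch can be scripted from Python as well. A minimal sketch, to be run before the attach code above; it assumes the same Windows paths used throughout this walkthrough:

```
import subprocess

# start Chrome with remote debugging enabled, same flags as the terminal commands above
subprocess.Popen([
    r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe",
    "--remote-debugging-port=9222",
    r"--user-data-dir=C:\selenum\AutomationProfile",
])
time.sleep(2)  # give Chrome a moment to start before attaching the driver
```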
```
driver.get('https://www.zhihu.com/topic/19722451/hot')
```

```
SCROLL_PAUSE_TIME = 3

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
```

This time it scrolled cleanly all the way to the bottom.

![](https://hackmd.io/_uploads/HyLfhfKlT.png)

```
swag = {
    'title': [],
    'url': []
}
# Check the nth-child index in the CSS selector of the bottom-most post to set the
# loop range. The bottom-most post's selector looked like this:
# #TopicMain > div.ListShortcut > div.Card.TopicFeedList > div > div > div:nth-child(114) > div > div > h2 > div > a
for i in range(1, 115):  # nth-child counts from 1
    try:
        a = driver.find_element(
            By.CSS_SELECTOR,
            f"#TopicMain > div.ListShortcut > div > div > div > div:nth-child({i}) > div > div > h2 > span > a")
        print(a.text)
        swag['title'].append(a.text)
        print(a.get_attribute('href'))
        swag['url'].append(a.get_attribute('href'))
    except NoSuchElementException:
        # Some posts are videos with no body text, and their markup differs. This
        # script does not scrape videos, so the except clause simply skips them.
        pass
xvideos = pd.DataFrame(swag)
```

![](https://hackmd.io/_uploads/HJQEaGFea.png)

```
jvid = {
    'title': [],
    'url': [],
    'Name': [],
    'Pubtime': [],
    'Content': []
}
for k in range(0, 5):
    driver.get(xvideos['url'][k])
    # On a slow connection the page loads incompletely and elements go missing,
    # so sleep a moment to let it finish loading.
    time.sleep(3)
    e = driver.find_elements(By.CLASS_NAME, "Post-RichTextContainer")
    n = driver.find_elements(By.CLASS_NAME, "AuthorInfo-head")
    t = driver.find_elements(By.CLASS_NAME, "ContentItem-time")
    # all = driver.find_elements(By.CLASS_NAME, "List-item")
    for i in n:
        print(i.text)
        jvid['title'].append(xvideos['title'][k])
        jvid['url'].append(xvideos['url'][k])
        jvid['Name'].append(i.text)
    for c in e:  # renamed from k so it no longer shadows the outer loop variable
        print(c.text)
        jvid['Content'].append(c.text)
    for r in t:
        print(r.text)
        jvid['Pubtime'].append(r.text)
```

![](https://hackmd.io/_uploads/SJkqCMtgT.png)

```
# check the lists are all the same length so the dict converts cleanly to a DataFrame
print(len(jvid['title']))
print(len(jvid['url']))
print(len(jvid['Name']))
print(len(jvid['Content']))
print(len(jvid['Pubtime']))
```

![](https://hackmd.io/_uploads/SyKBkXFxT.png)

```
pornhub = pd.DataFrame(jvid)
pornhub.to_csv("20231001_外賣知乎文_掃黃.csv", index=False)
pornhub.to_excel("20231001_外賣知乎文_掃黃.xlsx", index=False)
```

![](https://hackmd.io/_uploads/rJu0J7Fep.png)

---

## 6. Selenium vs. recurring CAPTCHAs: the school's People's Daily database as an example

> Reference:
> https://www.learncodewithmike.com/2021/08/python-selenium-bypass-captcha.html

```
driver.get(
    'https://www-oriprobe-com.proxyone.lib.nccu.edu.tw:8443/peoplesdaily_tc.shtml')
```
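One thing worth knowing before running the search loop below: the gigantic `qs` parameter in its URL is nothing but URL-encoded JSON describing the search conditions. Here is a sketch of rebuilding it programmatically so the keyword is easy to swap; the field names come straight from decoding that URL, while the server-side meaning of each flag is my guess:

```
# rebuild the qs search descriptor instead of hardcoding the encoded blob;
# "掃黃" is searched in the title, subTitle and introTitle fields
query = {
    "cds": [{"cdr": "AND", "cds": [
        {"fld": "title",      "cdr": "OR", "hlt": "true",  "vlr": "AND", "qtp": "DEF", "val": "掃黃"},
        {"fld": "subTitle",   "cdr": "OR", "hlt": "false", "vlr": "AND", "qtp": "DEF", "val": "掃黃"},
        {"fld": "introTitle", "cdr": "OR", "hlt": "false", "vlr": "AND", "qtp": "DEF", "val": "掃黃"},
    ]}],
    "obs": [{"fld": "dataTime", "drt": "DESC"}],  # sort by date, newest first
}
qs = urllib.parse.quote(json.dumps(query, separators=(',', ':'), ensure_ascii=False))
search_url = f"http://data.people.com.cn/rmrb/s?qs={qs}&tr=A&ss=1&pageNo=1&pageSize=20"
```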
f"http://data.people.com.cn/rmrb/s?qs=%7B%22cds%22%3A%5B%7B%22cdr%22%3A%22AND%22%2C%22cds%22%3A%5B%7B%22fld%22%3A%22title%22%2C%22cdr%22%3A%22OR%22%2C%22hlt%22%3A%22true%22%2C%22vlr%22%3A%22AND%22%2C%22qtp%22%3A%22DEF%22%2C%22val%22%3A%22%E6%8E%83%E9%BB%83%22%7D%2C%7B%22fld%22%3A%22subTitle%22%2C%22cdr%22%3A%22OR%22%2C%22hlt%22%3A%22false%22%2C%22vlr%22%3A%22AND%22%2C%22qtp%22%3A%22DEF%22%2C%22val%22%3A%22%E6%8E%83%E9%BB%83%22%7D%2C%7B%22fld%22%3A%22introTitle%22%2C%22cdr%22%3A%22OR%22%2C%22hlt%22%3A%22false%22%2C%22vlr%22%3A%22AND%22%2C%22qtp%22%3A%22DEF%22%2C%22val%22%3A%22%E6%8E%83%E9%BB%83%22%7D%5D%7D%5D%2C%22obs%22%3A%5B%7B%22fld%22%3A%22dataTime%22%2C%22drt%22%3A%22DESC%22%7D%5D%7D&tr=A&ss=1&pageNo={j}&pageSize=20") title = driver.find_elements(By.CLASS_NAME, "open_detail_link") for i in title: print(i.text) Bukeyisese['title'].append(i.text) listinfo = driver.find_elements(By.CLASS_NAME, "listinfo") for k in listinfo: print(k.text) Bukeyisese['listinfo'].append(k.text) listinfo = driver.find_elements(By.CLASS_NAME, "listinfo") links = [h.get_attribute('href') for h in title] for j in links: print(j) Bukeyisese['link'].append(j) time.sleep(5) sleep(randint(1, 5)) sleep(randint(0, 2)) ``` ![](https://hackmd.io/_uploads/H1TQPQYxa.png) ``` Keyisese = pd.DataFrame(Bukeyisese) Keyisese['Content'] = pd.Series() ``` ``` for j in range(0,40): try: driver.get(Keyisese['link'][j]) time.sleep(2) sleep(randint(1, 3)) Con = driver.find_element(By.CLASS_NAME, "detail_con") Keyisese['Content'][j] = Con.text print(Keyisese['title'][j]) sleep(randint(5, 10)) sleep(randint(4, 8)) except NoSuchElementException: #處理驗證碼 sleep(randint(3, 8)) img_base64 = driver.execute_script(""" var ele = arguments[0]; var cnv = document.createElement('canvas'); cnv.width = ele.width; cnv.height = ele.height; cnv.getContext('2d').drawImage(ele, 0, 0); return cnv.toDataURL('image/jpeg').substring(22); """, driver.find_element(By.XPATH, '//*[@id="login_form"]/table/tbody/tr[1]/td[2]/img')) with open("captcha_login.png", 'wb') as image: image.write(base64.b64decode(img_base64)) # 下載下來的一般驗證碼(Normal Captcha)圖片 file = {'file': open('captcha_login.png', 'rb')} api_key = '自己去申請嘻嘻' data = { 'key': api_key, 'method': 'post' } response = requests.post('http://2captcha.com/in.php', files=file, data=data) print(f'response:{response.text}') if response.ok and response.text.find('OK') > -1: captcha_id = response.text.split('|')[1] # 擷取驗證碼ID for i in range(10): response = requests.get( f'http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}') print(f'response:{response.text}') if response.text.find('CAPCHA_NOT_READY') > -1: # 尚未辨識完成 time.sleep(3) elif response.text.find('OK') > -1: captcha_text = response.text.split('|')[1] # 擷取辨識結果 break else: print('取得驗證碼發生錯誤!') winsound.Beep(1000, 10000) else: print('提交驗證碼發生錯誤!') winsound.Beep(1000, 10000) time.sleep(2) driver.find_element(By.ID, "validateCode").send_keys(captcha_text) driver.find_element(By.CLASS_NAME, "login_sub").click() Con = driver.find_element(By.CLASS_NAME, "detail_con") Keyisese['Content'][j] = Con.text print(Keyisese['title'][j]) sleep(randint(5, 10)) ``` ![](https://hackmd.io/_uploads/HJjmIXKxp.png) ![](https://hackmd.io/_uploads/Hy8ML7tx6.png) 最後結果: ![](https://hackmd.io/_uploads/B1K5d7YeT.png) --- ### 備註連結: 1. selenium語法差異: https://stackoverflow.com/questions/72854116/selenium-attributeerror-webdriver-object-has-no-attribute-find-element-by-cs 2. bs4: https://blog.gtwang.org/programming/python-beautiful-soup-module-scrape-web-pages-tutorial/ 3. 
---

### Reference links:

1. Selenium locator syntax differences (old vs. new): https://stackoverflow.com/questions/72854116/selenium-attributeerror-webdriver-object-has-no-attribute-find-element-by-cs
2. bs4: https://blog.gtwang.org/programming/python-beautiful-soup-module-scrape-web-pages-tutorial/
3. requests + bs4: https://yanwei-liu.medium.com/python%E7%88%AC%E8%9F%B2%E5%AD%B8%E7%BF%92%E7%AD%86%E8%A8%98-%E4%B8%80-beautifulsoup-1ee011df8768