Web spider - HackMD

# **Install packages** * 安裝套件 ``` pip install requests pip install beautifulsoup4 ``` * 嘗試爬取 https://www.ptt.cc/bbs/graduate/index.html ``` import requests url="https://www.ptt.cc/bbs/graduate/index.html" #todo 躲掉反爬蟲 headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'} response = requests.get(url, headers=headers) #todo 偽裝成 user if response.status_code == 200: #todo check HTTP status_code print("successfully") with open('output.html','w',encoding='utf-8') as f: #todo write into html file f.write(response.text) else: print("failed") ``` * 因為編碼問題可能會報錯 ``` UnicodeEncodeError: 'cp950' codec can't encode character '\u22ef' in position 262: illegal multibyte sequence ``` * 加入這幾行, 更改編碼類型為 utf-8 就能解決了 ``` import sys import codecs sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()) ``` # **使用 beautifulsoup 擷取資料** ``` import requests from bs4 import BeautifulSoup import sys import codecs sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()) url="https://www.ptt.cc/bbs/graduate/index.html" headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'} response = requests.get(url, headers=headers) #todo 偽裝成 user soup = BeautifulSoup(response.text, "html.parser") articles = soup.find_all("div", class_="r-ent") for item in articles: title = item.find("div", class_="title") #todo 擷取標題 if title and title.a: title = title.a.text else: title = "no title" popular = item.find("div", class_="nrec") #todo 擷取人氣 if popular and popular.span: popular = popular.span.text else: popular = "N" author = item.find("div", class_="author") #todo 擷取作者 if author: author = author.text else: author = "N" date = item.find("div", class_="date") #todo 擷取日期 if date: date = date.text else: date = "N" print(f"title:{title}\n popular:{popular}\n author:{author}\n date:{date}\n") #todo print ``` * 資料顯示, 大概就是長這樣 ![sp](https://hackmd.io/_uploads/SJSBSM3C6.png) # **儲存格式** * 先將資料儲存成 data_list 格式, 並用字典將資料填入正確的 index ``` data_list = [] #todo 資料串列 for item in articles: data = {} #字典 title = item.find("div", class_="title") #todo 擷取標題 if title and title.a: title = title.a.text else: title = "no title" data["title"] = title popular = item.find("div", class_="nrec") #todo 擷取人氣 if popular and popular.span: popular = popular.span.text else: popular = "N" data["popular"]=popular author = item.find("div", class_="author") #todo 擷取作者 if author: author = author.text else: author = "N" data["author"]=author date = item.find("div", class_="date") #todo 擷取日期 if date: date = date.text else: date = "N" data["date"]=date data_list.append(data) ``` * 因為不需要 print 出資料 ## 儲存成 json 格式 ``` import json #todo 引入 json 套件 #// Crawl data section // with open("data.json", "w", encoding="utf-8")as file: #todo 存入 data.json 檔案 json.dump(data_list, file, ensure_ascii=False, indent=4) ``` * 因為沒使用 print 顯示資料, 不會有 UnicodeEncodeError 的問題, 不需要在程式碼中更改編碼格式 ## 儲存成 excel 格式 * install package ``` pip install pandas pip install openpyxl ``` * 存入 excel 檔案 ``` import pandas #// Crwal data section // df = pd.DataFrame(data_list) #todo 存入 data.excel 檔案 df.to_excel("data.xlsx", index=False, engine="openpyxl") ``` # **防止 Ban ip** * 如果請求太多次, 很可能被伺服器ban-ip, 禁止訪問網站 * 解決辦法：使用 proxy(代理伺服器) * 這邊有一大堆可以用：https://proxyscrape.com/free-proxy-list ``` proxies=[ {'http':'http://129.153.42.81:3128'}, {'http':'http://209.121.164.50:31147'}, {'http':'http://175.139.233.76:80'}, ] url="https://www.sinyi.com.tw/rent/" headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"} response = requests.get(url, headers=headers, proxies=proxies[2]) ``` * proxies[] 可以一直更換新的 ip, 這樣就不怕被 ban 了 * 也可以用超多 threads, 每隻 thread 的 ip 都不一樣 # **disable javascript** * 檢視原始碼, 按 Ctrl+Shift+P * 執行 disable JavaScript ![image](https://hackmd.io/_uploads/rJqG3oTCp.png) * 可以用來確認網站是否有使用 js 來呈現 * js 會根據使用者的行為, 與網站進行互動 * 單純 requests 的方法不再有用, 需要 Selenium 套件來貼近使用者行為, 繞過擋爬蟲的網站 # **Selenium** * 安裝 Selenium ``` pip install selenium ``` * 安裝 web diver, 每個瀏覽器使用的 diver 不一樣 * 確認瀏覽器版本, 盡量更新到最新版本 (設定/關於) * edge : https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/?form=MA13LH * Chrome : https://googlechromelabs.github.io/chrome-for-testing/ * driver path 要使用 "/" 來當作路徑 *** * 使用 edge ``` # import 需要的套件 from selenium import webdriver from selenium.webdriver.edge.service import Service from selenium.webdriver.edge.options import Options import time # 編碼改為 utf-8 import sys import codecs sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()) # 設置 driver 及 brower path = "C:/Users/USER/Desktop/web_driver/edgedriver_win64/msedgedriver.exe" service = Service(path) options = Options() driver = webdriver.Edge(service=service, options=options) #做你想做的事 driver.get("https://www.google.com") #web address print(driver.title) #print title driver.quit() ``` * 使用 chrome ``` # import 需要的套件 from selenium import webdriver from selenium.webdriver.chrome.service import Service from selenium.webdriver.chrome.options import Options import time # 編碼改為 utf-8 import sys import codecs sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()) # 設置 driver 及 brower path = "C:/Users/USER/Desktop/web_driver/chromedriver-win64/chromedriver.exe" service = Service(path) options = Options() driver = webdriver.Chrome(service=service, options=options) #做你想做的事 driver.get("https://www.google.com") #web address print(driver.title) #print title driver.quit() ``` * 能跑出網頁就代表成功了 * 如果有 error 就全部貼 chatgpt (雖然不是很靠普, 但多少能解決一些問題) * selenium 官方文件 : https://www.selenium.dev/documentation/webdriver/elements/ # **Selenium 基本操作** * 需要的套件, 改編碼, 設置 driver 記得先貼上去 * 善用 try, exception, 處理找不到物件的狀況, 可以有效避免程式 shutdown * import 新套件 ``` from selenium.webdriver.common.by import By # 查找資料 from selenium.webdriver.common.keys import Keys # 控制鍵盤 ``` ## 到 https://www.ptt.cc/bbs/graduate/index.html, 爬取第一篇文章的標題 ``` #做你想做的事 driver.get("https://www.ptt.cc/bbs/graduate/index.html") driver.implicitly_wait(5) #等待網頁載入完成 element = driver.find_element(By.CLASS_NAME, 'r-ent').find_element(By.CLASS_NAME, 'title') print(element.text) driver.close() ``` ## 到 https://rent.591.com.tw/, 爬取所有租屋標題 ``` driver.get("https://rent.591.com.tw/") driver.implicitly_wait(5) element = driver.find_elements(By.CLASS_NAME, 'vue-list-rent-item') for i in element: print(i.find_element(By.CLASS_NAME, 'rent-item-right').find_element(By.CLASS_NAME, 'item-title').text) ``` ## 到 https://www.google.com.tw/, 搜尋 "cute cat", 並點入"圖片" ``` driver.get("https://www.google.com.tw/") driver.implicitly_wait(3) search = driver.find_element(By.CLASS_NAME, 'gLFyf') search.send_keys("cute cat") search.send_keys(Keys.RETURN) driver.implicitly_wait(2) search = driver.find_element(By.LINK_TEXT, '圖片').click() time.sleep(5) driver.quit() ``` ## 到 https://www.591.com.tw/, 搜尋 "台北市", 並點擊"搜尋" ``` driver.get("https://rent.591.com.tw/") driver.implicitly_wait(2) inputbox = driver.find_element(By.CLASS_NAME, 'form-control') inputbox.send_keys("台北市") driver.implicitly_wait(2) try: search = driver.find_element(By.CLASS_NAME, 'searchBtn') print("Element found!") search.click() except Exception as e: print("Element not found!") time.sleep(10) driver.quit() ``` # **Selenium 偵測關鍵詞 (important)** * 透過偵測關鍵字詞判斷網頁是否載入完成 * 不需要使用 driver.implicitly_wait() 來強迫等待時間 * import 套件 ``` from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC ``` ## (改良版) 到 https://www.591.com.tw/, 搜尋 "台北市", 並點擊"搜尋" ``` driver.get("https://rent.591.com.tw/") inputbox = WebDriverWait(driver, 2).until( EC.presence_of_element_located((By.CLASS_NAME, 'form-control')) ) inputbox.clear() inputbox.send_keys("台北市") search = WebDriverWait(driver, 2).until( EC.presence_of_element_located((By.CLASS_NAME, 'searchBtn')) ) search.click() time.sleep(10) driver.quit() ``` # **載入資料** ## 抓取資料 * 目標網站 ``` https://rent.591.com.tw/?section=1,6,3,8,2&searchtype=1 ``` ![h](https://hackmd.io/_uploads/S1t2H0GyC.png) * 爬取中山區, 大安區, 中正區, 萬華區,大同區的所有出租房屋資訊 (挑需要的列出來就行), 最後儲存到 json file * 確保網路通暢, 但網站回應時間過長, 仍然可能引發 Timeoutexception, 加入 time.sleep() 延遲 * 如果某些動作很可能造成整個程式 shutdown (需要靠經驗), 就包在 try, except 裡面, 即使出錯也不會連累整個程式的運行 * 以 Currentpage 作為頁面識別, 每一次跳轉頁面都會重新偵測是否跳轉成功, 如果頁面為跳轉或是還停留在上一頁, 會延遲 1s 後再重新偵測, 直到 loading_times (偵測的上限次數), 如果超過 loading_times, 程式會停止執行 ``` # import 需要的套件 from selenium import webdriver from selenium.webdriver.chrome.service import Service from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC import time import json import pandas as pd # 編碼改為 utf-8 import sys import codecs sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()) # 設置 driver 及 brower path = "C:/Users/USER/Desktop/web_driver/chromedriver-win64/chromedriver.exe" service = Service(path) options = Options() driver = webdriver.Chrome(service=service, options=options) #做你想做的事 driver.get("https://rent.591.com.tw/?section=1,6,3,8,2&searchtype=1") #! 可依使用者需求配置參數 data_order = ['hid','url','title','price','address','traffic'] #todo 調整資料在 json 呈現的順序 data_limit = 10000 #todo 想要抓取的資料數量 loading_times = 50 #todo 幾秒內沒有 loading 成功會停止 data_list = [] page_pre = -1 def getdata(obj): #todo 傳入一整個 page 一次擷取多筆資料 global data_limit pagedata = 0 for i in obj: if data_limit <= 0: break data={} detail = i.find_element(By.CLASS_NAME, 'rent-item-right') #todo 大部分資訊集中在某個區塊 url = i.find_element(By.TAG_NAME, 'a').get_attribute('href') title = detail.find_element(By.CLASS_NAME, 'item-title').text price = detail.find_element(By.CLASS_NAME, 'item-price-text').find_element(By.TAG_NAME, 'span').text price = int(price.replace(",", "")) traffic = detail.find_element(By.CLASS_NAME, 'item-tip').text address = detail.find_element(By.CLASS_NAME, 'item-area').text hid = i.get_attribute('data-bind') #! //資料篩選 (有些資料跟租屋沒關係)// del_data = detail.find_element(By.CLASS_NAME, 'item-style').find_element(By.CLASS_NAME, 'is-kind').text if del_data == "車位": print(f'give out {hid} data, title：{title}, price：{price}') continue #todo 儲存資料進 data_list data_order_mapping = {'url': url, 'title': title, 'price': price, 'address': address, 'traffic': traffic, 'hid':hid} data = {key: data_order_mapping[key] for key in data_order} data_list.append(data) pagedata += 1 data_limit -= 1 page_number = int(driver.find_element(By.CLASS_NAME, 'pageCurrent').text) print(f"page {page_number} 共取得：{pagedata} 筆資料") while data_limit > 0: #todo 測試 page jump 是否成功 loading_times_checking = loading_times #todo 重新偵測的時間上限 while True: try: #todo 擷取當前 page number page_number = WebDriverWait(driver, 1).until( EC.presence_of_element_located((By.CLASS_NAME, 'pageCurrent')) ) page_number = int(page_number.text) if page_pre == page_number and loading_times_checking <= 0: data_limit = 0 print(f"Crawler try to get {page_number+1} page {loading_times} times, but it failed") elif page_pre == page_number: loading_times_checking -= 1 time.sleep(1) continue else: break except Exception as e: loading_times_checking -= 1 time.sleep(1) continue #todo 往下滑 driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") #todo 擷取 whole page obj = WebDriverWait(driver, 5).until( EC.presence_of_element_located((By.CLASS_NAME, 'switch-list-content')) and EC.presence_of_all_elements_located((By.CLASS_NAME, 'vue-list-rent-item')) ) getdata(obj) page_pre = page_number try: #todo Crawler over last_element = WebDriverWait(driver, 5).until( EC.presence_of_element_located((By.CSS_SELECTOR, '.pageNext.last')) ) print('Crawler over') break except Exception as e: #todo nextpage nextpage = driver.find_element(By.CLASS_NAME, 'page-limit').find_element(By.CLASS_NAME, 'pageNext') nextpage.click() #todo 存入 data.json 檔案 try: with open("D:/gold_house/data.json", "w", encoding="utf-8")as file: json.dump(data_list, file, ensure_ascii=False, indent=4) print('json success') except Exception as e: print('json fails') #todo 存入 data.excel 檔案 try: df = pd.DataFrame(data_list) df.to_excel("D:/gold_house/data.xlsx", index=False, engine="openpyxl") print('excel success') except Exception as e: print('excel fails') print(f"共爬取: {len(data_list)} 筆資料") time.sleep(10) driver.quit() ``` ## 資料篩選 * 資料不一定全是租屋, 有些是租車位的 * 資料篩選區篩出車位關鍵字的標籤 ``` del_data = detail.find_element(By.CLASS_NAME, 'item-style').find_element(By.CLASS_NAME, 'is-kind').text if del_data == "車位": print(f'give out {hid} data, title：{title}, price：{price}') continue ``` # 登入 FB * 安裝擴充套件 Cookie-Editor * 登入 FB, 以 json 格式儲存成 Cookie ![fb](https://hackmd.io/_uploads/SJKTJd_10.png) * 每一筆資料只留 name 和 value, 其他都刪掉 * 使用 for loop, 將 cookie 全部傳入, 最後 refresh page ``` with open('cookie.json') as f: cookies = json.load(f) driver = webdriver.Chrome() driver.get('https://www.facebook.com') for cookie in cookies: driver.add_cookie(cookie) driver.refresh() ``` * FB 爬蟲很容易被 ban 帳號, 盡量找官方提供的 API 比較好