# **Install packages**
* Install the packages
```
pip install requests
pip install beautifulsoup4
```
* Try crawling https://www.ptt.cc/bbs/graduate/index.html
```
import requests

url = "https://www.ptt.cc/bbs/graduate/index.html"
# dodge basic anti-scraping checks by pretending to be a regular browser user
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
response = requests.get(url, headers=headers)
if response.status_code == 200:  # check the HTTP status code
    print("successfully")
    with open('output.html', 'w', encoding='utf-8') as f:  # write the page into an HTML file
        f.write(response.text)
else:
    print("failed")
```
* You may hit an encoding error when printing:
```
UnicodeEncodeError: 'cp950' codec can't encode character '\u22ef' in position 262: illegal multibyte sequence
```
* Add these lines to change the output encoding to utf-8 and the error goes away
```
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
```
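* As a side note, on Python 3.7 and newer the same fix can be written more compactly with `reconfigure` (a minimal alternative sketch, not what the rest of these notes use):
```
import sys

# equivalent effect: have stdout emit utf-8 so the page text prints without errors
sys.stdout.reconfigure(encoding="utf-8")
```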
# **Extract data with BeautifulSoup**
```
import requests
from bs4 import BeautifulSoup
import sys
import codecs

# switch the output encoding to utf-8
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

url = "https://www.ptt.cc/bbs/graduate/index.html"
# pretend to be a regular browser user
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
articles = soup.find_all("div", class_="r-ent")
for item in articles:
    title = item.find("div", class_="title")  # extract the title
    if title and title.a:
        title = title.a.text
    else:
        title = "no title"
    popular = item.find("div", class_="nrec")  # extract the push count
    if popular and popular.span:
        popular = popular.span.text
    else:
        popular = "N"
    author = item.find("div", class_="author")  # extract the author
    if author:
        author = author.text
    else:
        author = "N"
    date = item.find("div", class_="date")  # extract the date
    if date:
        date = date.text
    else:
        date = "N"
    print(f"title:{title}\n popular:{popular}\n author:{author}\n date:{date}\n")
```
* The printed data looks roughly like this:
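* (Illustrative only) based on the f-string above, each record prints in roughly this shape, with placeholders instead of real values:
```
title:[article title]
 popular:[push count]
 author:[author id]
 date:[date]
```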

# **Storage formats**
* First collect the records into data_list, using a dictionary to put each field under the right key
```
data_list = []  # list that will hold one dict per article
for item in articles:
    data = {}  # dictionary for a single article
    title = item.find("div", class_="title")  # extract the title
    if title and title.a:
        title = title.a.text
    else:
        title = "no title"
    data["title"] = title
    popular = item.find("div", class_="nrec")  # extract the push count
    if popular and popular.span:
        popular = popular.span.text
    else:
        popular = "N"
    data["popular"] = popular
    author = item.find("div", class_="author")  # extract the author
    if author:
        author = author.text
    else:
        author = "N"
    data["author"] = author
    date = item.find("div", class_="date")  # extract the date
    if date:
        date = date.text
    else:
        date = "N"
    data["date"] = date
    data_list.append(data)
```
* The data no longer needs to be printed, so the print statements are dropped
## Save as JSON
```
import json  # bring in the json module

# // crawl-data section from above goes here //
with open("data.json", "w", encoding="utf-8") as file:  # write into data.json
    json.dump(data_list, file, ensure_ascii=False, indent=4)
```
* Since print is not used to display the data, there is no UnicodeEncodeError, so the encoding fix is not needed in this script; to sanity-check the file, you can load it back as shown below
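* A small verification sketch (assumes data.json sits in the working directory):
```
import json

# quick check: read data.json back and confirm the number of records
with open("data.json", encoding="utf-8") as file:
    records = json.load(file)
print(f"{len(records)} records saved")
```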
## Save as Excel
* Install the packages
```
pip install pandas
pip install openpyxl
```
* Write to an Excel file
```
import pandas as pd

# // crawl-data section from above goes here //
df = pd.DataFrame(data_list)  # build a DataFrame from data_list
df.to_excel("data.xlsx", index=False, engine="openpyxl")  # write into data.xlsx
```
# **Avoiding IP bans**
* If you send too many requests, the server may ban your IP and block you from the site
* Fix: use a proxy server
* There are plenty of free ones here: https://proxyscrape.com/free-proxy-list
```
import requests

proxies = [
    {'http': 'http://129.153.42.81:3128'},
    {'http': 'http://209.121.164.50:31147'},
    {'http': 'http://175.139.233.76:80'},
]
url = "https://www.sinyi.com.tw/rent/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}
response = requests.get(url, headers=headers, proxies=proxies[2])  # route the request through the third proxy
```
* Keep swapping in fresh IPs from proxies[] and you don't have to worry about getting banned
* You can also run many threads, each using a different IP; a rough sketch follows below
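* A minimal proxy-rotation sketch, assuming the proxy URLs below were copied from the free list above and may well be dead by the time you try them; the same `fetch()` could also be called from a thread pool so each worker picks up a different proxy:
```
import itertools
import requests

# cycle through the proxy list so each request can go out with a different IP
proxy_pool = itertools.cycle([
    'http://129.153.42.81:3128',
    'http://209.121.164.50:31147',
    'http://175.139.233.76:80',
])
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}

def fetch(url):
    # try up to three proxies; free proxies die often, so just move on when one fails
    for _ in range(3):
        proxy = next(proxy_pool)
        try:
            return requests.get(url, headers=headers,
                                proxies={'http': proxy, 'https': proxy},
                                timeout=10)
        except requests.RequestException:
            continue
    return None

response = fetch("https://www.sinyi.com.tw/rent/")
print(response.status_code if response is not None else "all proxies failed")
```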
# **Disable JavaScript**
* Open the browser DevTools on the page and press Ctrl+Shift+P
* Run the "Disable JavaScript" command

* This is an easy way to check whether the site relies on JavaScript to render its content
* JavaScript reacts to what the user does and lets the page interact with the site
* When it does, plain requests is no longer enough; you need Selenium to imitate real user behaviour and get past sites that block crawlers (a quick requests-only check is sketched below)
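* A rough check without DevTools: fetch the page with requests and see whether the element you care about already appears in the raw HTML; the class name 'vue-list-rent-item' is the one used in the Selenium examples later, so treat it as an assumption about the current 591 markup:
```
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}
response = requests.get("https://rent.591.com.tw/", headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# if the listings are missing from the raw HTML, the page is rendered by JavaScript
if soup.find_all(class_="vue-list-rent-item"):
    print("content is in the raw HTML, plain requests is enough")
else:
    print("content is missing, the page most likely needs JavaScript (use Selenium)")
```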
# **Selenium**
* Install Selenium
```
pip install selenium
```
* Install a WebDriver; each browser needs its own driver
* Check your browser version and update it to the latest if possible (Settings / About)
* Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/?form=MA13LH
* Chrome: https://googlechromelabs.github.io/chrome-for-testing/
* Use "/" as the separator in the driver path
***
* Using Edge
```
# import the packages we need
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options
import time
# switch the output encoding to utf-8
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
# set up the driver and browser
path = "C:/Users/USER/Desktop/web_driver/edgedriver_win64/msedgedriver.exe"
service = Service(path)
options = Options()
driver = webdriver.Edge(service=service, options=options)
# do whatever you want
driver.get("https://www.google.com")  # web address
print(driver.title)  # print the page title
driver.quit()
```
* Using Chrome
```
# import the packages we need
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time
# switch the output encoding to utf-8
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
# set up the driver and browser
path = "C:/Users/USER/Desktop/web_driver/chromedriver-win64/chromedriver.exe"
service = Service(path)
options = Options()
driver = webdriver.Chrome(service=service, options=options)
# do whatever you want
driver.get("https://www.google.com")  # web address
print(driver.title)  # print the page title
driver.quit()
```
* If the browser window opens and the title prints, it worked
* If you get an error, paste the whole thing into ChatGPT (not entirely reliable, but it can solve some problems)
* Selenium official docs: https://www.selenium.dev/documentation/webdriver/elements/
# **Selenium basics**
* Remember to paste in the imports, the encoding fix, and the driver setup from above first
* Make good use of try/except to handle elements that can't be found; it keeps the whole program from shutting down
* Import the new modules
```
from selenium.webdriver.common.by import By      # locate elements
from selenium.webdriver.common.keys import Keys  # control the keyboard
```
## Go to https://www.ptt.cc/bbs/graduate/index.html and scrape the title of the first article
```
# do whatever you want
driver.get("https://www.ptt.cc/bbs/graduate/index.html")
driver.implicitly_wait(5)  # wait for the page to finish loading
element = driver.find_element(By.CLASS_NAME, 'r-ent').find_element(By.CLASS_NAME, 'title')
print(element.text)
driver.close()
```
## Go to https://rent.591.com.tw/ and scrape every rental listing title
```
driver.get("https://rent.591.com.tw/")
driver.implicitly_wait(5)
elements = driver.find_elements(By.CLASS_NAME, 'vue-list-rent-item')
for i in elements:
    print(i.find_element(By.CLASS_NAME, 'rent-item-right').find_element(By.CLASS_NAME, 'item-title').text)
```
## Go to https://www.google.com.tw/, search for "cute cat", and click "圖片" (Images)
```
driver.get("https://www.google.com.tw/")
driver.implicitly_wait(3)
search = driver.find_element(By.CLASS_NAME, 'gLFyf')
search.send_keys("cute cat")
search.send_keys(Keys.RETURN)
driver.implicitly_wait(2)
driver.find_element(By.LINK_TEXT, '圖片').click()
time.sleep(5)
driver.quit()
```
## Go to https://www.591.com.tw/, type "台北市", and click the search button
```
driver.get("https://rent.591.com.tw/")
driver.implicitly_wait(2)
inputbox = driver.find_element(By.CLASS_NAME, 'form-control')
inputbox.send_keys("台北市")
driver.implicitly_wait(2)
try:
    search = driver.find_element(By.CLASS_NAME, 'searchBtn')
    print("Element found!")
    search.click()
except Exception as e:
    print("Element not found!")
time.sleep(10)
driver.quit()
```
# **Selenium: detecting key elements (important)**
* Detect a key element to decide whether the page has finished loading
* No need to force a fixed wait with driver.implicitly_wait()
* Import the modules
```
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
```
## (Improved) Go to https://www.591.com.tw/, type "台北市", and click the search button
```
driver.get("https://rent.591.com.tw/")
inputbox = WebDriverWait(driver, 2).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'form-control'))
)
inputbox.clear()
inputbox.send_keys("台北市")
search = WebDriverWait(driver, 2).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'searchBtn'))
)
search.click()
time.sleep(10)
driver.quit()
```
# **Loading the data**
## Fetching the data
* Target site
```
https://rent.591.com.tw/?section=1,6,3,8,2&searchtype=1
```

* Crawl every rental listing in Zhongshan, Da'an, Zhongzheng, Wanhua, and Datong districts (keep only the fields you need) and save the result to a JSON file
* Even with a stable connection, a slow server response can still raise a TimeoutException, so add time.sleep() delays
* If an action is likely to crash the whole program (you learn which ones from experience), wrap it in try/except so a failure doesn't take the rest of the run down with it
* The current page number (pageCurrent) identifies the page: after every jump the script re-checks whether the jump succeeded. If the page has not changed and it is still on the previous page, it waits 1 s and checks again, up to loading_times attempts; past that limit the program stops
```
# import the packages we need
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import json
import pandas as pd
# switch the output encoding to utf-8
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
# set up the driver and browser
path = "C:/Users/USER/Desktop/web_driver/chromedriver-win64/chromedriver.exe"
service = Service(path)
options = Options()
driver = webdriver.Chrome(service=service, options=options)
# start crawling
driver.get("https://rent.591.com.tw/?section=1,6,3,8,2&searchtype=1")
#! parameters you can tune
data_order = ['hid', 'url', 'title', 'price', 'address', 'traffic']  # field order in the JSON output
data_limit = 10000   # how many records to fetch
loading_times = 50   # stop after this many failed loading checks (one per second)
data_list = []
page_pre = -1

def getdata(obj):  # takes a whole page of listings and extracts them in one pass
    global data_limit
    pagedata = 0
    for i in obj:
        if data_limit <= 0:
            break
        data = {}
        detail = i.find_element(By.CLASS_NAME, 'rent-item-right')  # most fields live in this block
        url = i.find_element(By.TAG_NAME, 'a').get_attribute('href')
        title = detail.find_element(By.CLASS_NAME, 'item-title').text
        price = detail.find_element(By.CLASS_NAME, 'item-price-text').find_element(By.TAG_NAME, 'span').text
        price = int(price.replace(",", ""))
        traffic = detail.find_element(By.CLASS_NAME, 'item-tip').text
        address = detail.find_element(By.CLASS_NAME, 'item-area').text
        hid = i.get_attribute('data-bind')
        #! // data filtering (some listings have nothing to do with renting a home) //
        del_data = detail.find_element(By.CLASS_NAME, 'item-style').find_element(By.CLASS_NAME, 'is-kind').text
        if del_data == "車位":  # "車位" = parking space; skip it
            print(f'give out {hid} data, title:{title}, price:{price}')
            continue
        # store the record in data_list
        data_order_mapping = {'url': url, 'title': title, 'price': price, 'address': address, 'traffic': traffic, 'hid': hid}
        data = {key: data_order_mapping[key] for key in data_order}
        data_list.append(data)
        pagedata += 1
        data_limit -= 1
    page_number = int(driver.find_element(By.CLASS_NAME, 'pageCurrent').text)
    print(f"page {page_number}: got {pagedata} records")

while data_limit > 0:
    # check whether the page jump succeeded
    loading_times_checking = loading_times  # remaining retries
    while True:
        try:
            # read the current page number
            page_number = WebDriverWait(driver, 1).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'pageCurrent'))
            )
            page_number = int(page_number.text)
            if page_pre == page_number and loading_times_checking <= 0:
                data_limit = 0
                print(f"Crawler try to get {page_number+1} page {loading_times} times, but it failed")
                break  # give up: stop retrying and let the outer loop end
            elif page_pre == page_number:
                loading_times_checking -= 1
                time.sleep(1)
                continue
            else:
                break
        except Exception as e:
            loading_times_checking -= 1
            time.sleep(1)
            continue
    # scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # grab the whole page of listings
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'switch-list-content'))
    )
    obj = WebDriverWait(driver, 5).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'vue-list-rent-item'))
    )
    getdata(obj)
    page_pre = page_number
    try:  # crawling is over once the "next page" button is marked as the last page
        last_element = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '.pageNext.last'))
        )
        print('Crawler over')
        break
    except Exception as e:  # otherwise click through to the next page
        nextpage = driver.find_element(By.CLASS_NAME, 'page-limit').find_element(By.CLASS_NAME, 'pageNext')
        nextpage.click()

# write data.json
try:
    with open("D:/gold_house/data.json", "w", encoding="utf-8") as file:
        json.dump(data_list, file, ensure_ascii=False, indent=4)
    print('json success')
except Exception as e:
    print('json fails')
# write data.xlsx
try:
    df = pd.DataFrame(data_list)
    df.to_excel("D:/gold_house/data.xlsx", index=False, engine="openpyxl")
    print('excel success')
except Exception as e:
    print('excel fails')
print(f"Total records crawled: {len(data_list)}")
time.sleep(10)
driver.quit()
```
## Data filtering
* Not every listing is a rental; some are parking spaces
* The filtering block drops listings whose tag contains the parking-space keyword
```
del_data = detail.find_element(By.CLASS_NAME, 'item-style').find_element(By.CLASS_NAME, 'is-kind').text
if del_data == "車位":  # "車位" = parking space; skip it
    print(f'give out {hid} data, title:{title}, price:{price}')
    continue
```
# **Log in to FB**
* Install the Cookie-Editor browser extension
* Log in to FB and export the cookies as JSON

* Keep only name and value for each cookie entry; delete everything else
* Loop over the cookies to add them all to the driver, then refresh the page
```
import json
from selenium import webdriver

with open('cookie.json') as f:
    cookies = json.load(f)
driver = webdriver.Chrome()
driver.get('https://www.facebook.com')
for cookie in cookies:
    driver.add_cookie(cookie)
driver.refresh()
```
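* If you'd rather not delete the extra fields by hand, here is a small sketch that trims the Cookie-Editor export down to name and value (the file names are just placeholders):
```
import json

# read the raw export, keep only the name/value pairs, and write the trimmed file
with open('cookie_raw.json', encoding='utf-8') as f:
    raw_cookies = json.load(f)

cookies = [{'name': c['name'], 'value': c['value']} for c in raw_cookies]

with open('cookie.json', 'w', encoding='utf-8') as f:
    json.dump(cookies, f, ensure_ascii=False, indent=4)
```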
* FB scraping gets accounts banned very easily; whenever possible, use the official API instead