tags: `web crawler`,`Selenium`,`auto login`,`browser automation`

Introduction to Selenium and Its Application (Scraping PTT Data)

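Selenium is a set of tools designed for browser automation: it lets a program drive a browser directly to perform all kinds of operations on a website.
With Selenium you can fill in web forms, click buttons or links, and retrieve page content and verify it, which covers a large share of testing needs.
Selenium WebDriver is the API used to launch and control the browser; a program operates the browser directly by calling WebDriver.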
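Check your Chrome version
- Download the matching ChromeDriver https://sites.google.com/chromium.org/driver/
Put the downloaded ChromeDriver in the current working directory so it can be found, or specify its path explicitly
pip install selenium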
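
The version check can also be scripted. The sketch below is a minimal illustration, assuming that version strings look like `Google Chrome 114.0.5735.198` and `ChromeDriver 114.0.5735.90` and that the two should share a major version; the helper names and the sample strings are hypothetical, not taken from a real machine.

```python
import re
import shutil
import subprocess


def major_version(version_text):
    """Return the first major version number found in a version string."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_text, driver_text):
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_text) == major_version(driver_text)


def installed_driver_version():
    """Ask a chromedriver found on PATH for its version, or return None."""
    driver = shutil.which("chromedriver")
    if driver is None:
        return None
    result = subprocess.run([driver, "--version"], capture_output=True, text=True)
    return result.stdout.strip()


# Illustrative strings only; replace them with what your own machine reports.
print(versions_match("Google Chrome 114.0.5735.198", "ChromeDriver 114.0.5735.90"))
```

Calling `installed_driver_version()` returns `None` when no `chromedriver` is on `PATH`, which is a quick hint that the driver path has to be given explicitly.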

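XPath (XML Path Language)

A language for querying the elements of an XML/HTML document
Every element of an XML/HTML document is a node, and the whole document forms a tree
Selenium can locate page elements with XPath expressions; an element's XPath can be copied from the browser's developer tools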
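
To make the tree-and-path idea concrete without opening a browser, here is a small sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sample markup is invented for illustration and merely resembles a post list. With Selenium, a comparable expression would be passed to the XPath locator instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not copied from a real page)
SAMPLE = """
<html>
  <body>
    <div class="r-ent">
      <div class="title"><a href="/bbs/Test/M.1.html">First post</a></div>
      <div class="author">alice</div>
    </div>
    <div class="r-ent">
      <div class="title"><a href="/bbs/Test/M.2.html">Second post</a></div>
      <div class="author">bob</div>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# Walk the tree: any div with class r-ent, then its title div, then the link
link_path = ".//div[@class='r-ent']/div[@class='title']/a"
titles = [a.text for a in root.findall(link_path)]
links = [a.get("href") for a in root.findall(link_path)]
authors = [d.text for d in root.findall(".//div[@class='author']")]

print(titles)   # ['First post', 'Second post']
print(links)    # ['/bbs/Test/M.1.html', '/bbs/Test/M.2.html']
print(authors)  # ['alice', 'bob']
```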

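Dynamic web pages

A dynamic web page is one whose content the website decides based on user interaction, time, or other parameters
Static pages are usually scraped with [requests] + [BeautifulSoup]
Dynamic pages are scraped with the [WebDriver] module of the [Selenium] package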
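
The practical difference shows up when the list you want is filled in by JavaScript. The sketch below parses an invented HTML snippet with the standard `html.parser` module: the raw markup contains an empty `<ul>` plus a script, so a parser that does not execute JavaScript never sees any `<li>` element. This is an illustration of the idea under that assumption, not a measurement of any real site.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML as a plain HTTP client would receive it
RAW_HTML = """
<ul id="posts"></ul>
<script>
  document.getElementById("posts").innerHTML = "<li>Hello</li>";
</script>
"""


class TagCollector(HTMLParser):
    """Record the name of every start tag found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


parser = TagCollector()
parser.feed(RAW_HTML)

# The <li> exists only as text inside the script, so it is never reported
print(parser.tags)          # ['ul', 'script']
print("li" in parser.tags)  # False
```

A real browser driven by WebDriver runs the script first, so the same lookup would find the list item there.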

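What Selenium offers

Launches a real browser to automate web page operations
Supports locating elements by tag
Supports locating elements with CSS selectors
Supports locating elements with XPath expressions
Supports extracting data from client-side dynamic pages built with AJAX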
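
As a rough sketch of these points working together, the function below starts Chrome, waits explicitly for AJAX-loaded elements to appear, and returns their text. It assumes Selenium 4 and a usable Chrome/ChromeDriver setup; `fetch_texts` and its parameters are hypothetical names chosen for this example.

```python
def fetch_texts(url, class_name, timeout=10):
    """Open url in Chrome and return the text of elements with class_name."""
    # Imported inside the function so this file loads even without Selenium
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until at least one matching element exists;
        # a TimeoutException is raised if none appears within `timeout` seconds
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, class_name))
        )
        return [el.text for el in driver.find_elements(By.CLASS_NAME, class_name)]
    finally:
        # Always close the browser, even when the wait fails
        driver.quit()
```

A call might look like `fetch_texts("https://example.com", "item")`, where both the URL and the class name are placeholders.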

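Selenium element locator functions

Locator function	Description
find_element(s)_by_id	Returns the first element (a list of all elements) with a matching id
find_element(s)_by_name	Returns the first element (a list of all elements) with a matching name
find_element(s)_by_class_name	Returns the first element (a list of all elements) with a matching class
find_element(s)_by_tag_name	Returns the first element (a list of all elements) with a matching tag name
find_element(s)_by_link_text	Returns the first element (a list of all elements) whose link text matches exactly
find_element(s)_by_partial_link_text	Returns the first element (a list of all elements) whose link text partially matches
find_element(s)_by_css_selector	Returns the first element (a list of all elements) matching a CSS selector
find_element(s)_by_xpath	Returns the first element (a list of all elements) matching an XPath expression

Note: these helpers come from older Selenium releases and were removed in Selenium 4.3; newer code calls find_element(By.ID, ...) or find_elements(By.ID, ...) instead, with By imported from selenium.webdriver.common.by.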
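
For code that has to run on Selenium 4.3 or later, where the `_by_` helpers are no longer available, each one maps onto a `By` constant. The sketch below is a hypothetical helper that only rewrites a legacy call name into its `By` form as text; it does not touch a browser, and `modern_call` is a name made up for this example.

```python
import re

# Legacy method suffix -> name of the matching By constant
BY_CONSTANTS = {
    "id": "By.ID",
    "name": "By.NAME",
    "class_name": "By.CLASS_NAME",
    "tag_name": "By.TAG_NAME",
    "link_text": "By.LINK_TEXT",
    "partial_link_text": "By.PARTIAL_LINK_TEXT",
    "css_selector": "By.CSS_SELECTOR",
    "xpath": "By.XPATH",
}


def modern_call(legacy_name, value):
    """Rewrite e.g. find_element_by_name + "yes" as a By-style call string."""
    match = re.fullmatch(r"(find_elements?)_by_(\w+)", legacy_name)
    if match is None or match.group(2) not in BY_CONSTANTS:
        raise ValueError(f"not a legacy locator: {legacy_name!r}")
    method, suffix = match.groups()
    return f'{method}({BY_CONSTANTS[suffix]}, "{value}")'


print(modern_call("find_element_by_name", "yes"))
# find_element(By.NAME, "yes")
print(modern_call("find_elements_by_class_name", "r-ent"))
# find_elements(By.CLASS_NAME, "r-ent")
```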

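# Scrape PTT Gossiping [Selenium]

# 1. Imports
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# 2. Start Chrome; chromedriver must be discoverable (same directory or on PATH),
#    otherwise its path has to be specified
www = webdriver.Chrome()
url = "https://www.ptt.cc/ask/over18?from=%2Fbbs%2FGossiping%2Findex.html"

# 3. Implicit wait: keep retrying element lookups for up to 10 seconds
www.implicitly_wait(10)

# 4. Load the page
www.get(url)

www.maximize_window()  # maximize the window
# find_element(By.NAME, ...) is the equivalent of the older find_element_by_name(...)
button = www.find_element(By.NAME, "yes")  # the "are you over 18?" confirmation
button.click()

for line in www.find_elements(By.CLASS_NAME, "r-ent"):
    try:
        print(line.find_element(By.CLASS_NAME, "date").text.strip())    # date
        print(line.find_element(By.CLASS_NAME, "title").text.strip())   # title
        # get_attribute("href") already returns an absolute URL, so no prefix is added
        print(line.find_element(By.TAG_NAME, "a").get_attribute("href"))  # link
        print(line.find_element(By.CLASS_NAME, "author").text.strip())  # author
        print(line.find_element(By.CLASS_NAME, "nrec").text.strip())    # nrec: number of responses
        print("==================================================")
    except NoSuchElementException:
        # Skip entries that lack one of the elements (e.g. no link)
        continue

# 5. Close the browser
www.quit()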
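
One detail worth checking when printing article links: Selenium's `get_attribute("href")` generally returns an already-resolved absolute URL, so prepending a host by string concatenation would double it up. If a base URL has to be applied at all (for example to hrefs taken from raw HTML), `urllib.parse.urljoin` copes with both relative and absolute inputs. The article path below is a made-up placeholder.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ptt.cc"

# Hypothetical article path, used only for illustration
relative_href = "/bbs/Gossiping/M.123.A.456.html"
absolute_href = "https://www.ptt.cc/bbs/Gossiping/M.123.A.456.html"


def absolute_link(href):
    """Resolve href against BASE_URL; absolute URLs pass through unchanged."""
    return urljoin(BASE_URL, href)


print(absolute_link(relative_href))
# https://www.ptt.cc/bbs/Gossiping/M.123.A.456.html
print(absolute_link(absolute_href))
# https://www.ptt.cc/bbs/Gossiping/M.123.A.456.html

# Plain concatenation breaks when the href is already absolute
print("https://ptt.cc" + absolute_href)
# https://ptt.cchttps://www.ptt.cc/bbs/Gossiping/M.123.A.456.html
```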
