Scraping Dynamic Web Pages with R: RSelenium

LHB阿好伯, 2021/07/15

tags: R

RSelenium

https://github.com/ropensci/RSelenium

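Introduction

Selenium is a tool for automating web browsers, so it can also serve as a scraper for dynamic web pages.

You may be wondering what a dynamic web page is.
In my earlier post on scraping historical monitoring-station data with R and plotting a wind rose with ggplot2 (using the Daliao station as an example), the page's data changed with the URL, so the required data could be fetched simply by modifying the URL.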

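Sometimes, however, you run into a page whose data changes in response to the menus you interact with on the page itself.
In that case Selenium lets a program operate the page the way a person would.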

Installing Java

https://java.com/zh-TW/download/ie_manual.jsp?locale=zh_TW

Installing WebDriver

https://chromedriver.chromium.org/
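Create a directory to hold the executables, for example C:\WebDriver\bin or /opt/WebDriver/bin.
On Windows: open a Command Prompt as administrator and run the following command to permanently add that directory to the PATH for all users on your machine. The command below uses C:\R\WebDriver; substitute the directory you actually created: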

setx /m path "%path%;C:\R\WebDriver"

Bash users on macOS and Linux: in a terminal, append the export line to ~/.profile:
echo 'export PATH=$PATH:/opt/WebDriver/bin' >> ~/.profile

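You are now ready to test your changes. Close all open command prompts and open a new one.
Type the name of one of the binaries in the folder you created in the previous step, for example: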
chromedriver
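
The same check can also be made from inside R. This is a minimal sketch, assuming the executables are named java and chromedriver; the helper name on_path is made up for illustration.

```r
# Report whether each command-line tool can be found on the PATH.
# Sys.which() returns "" for a tool it cannot find.
on_path <- function(tools) {
  found <- nzchar(Sys.which(tools))
  names(found) <- tools
  found
}

print(on_path(c("java", "chromedriver")))
```

A FALSE entry suggests the PATH change has not taken effect yet in the session R was started from, so restarting R (or RStudio) after editing the PATH is worth trying first.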

Installing Selenium Server

https://www.selenium.dev/downloads/

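There is no need to run the downloaded file by opening it.
Just place it in the directory that CMD (Command Prompt) is working in, so that the command below can find it.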

Running the server

java -jar selenium-server-standalone-3.141.59.jar

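Here is a key point that I have not seen mentioned anywhere online:
do not close the CMD window at this point, because RSelenium only works while the server is running.
That means selenium-server-standalone has to be started every time before the package is used.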
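
Because a forgotten server is an easy mistake to make, it can help to check for it from R before connecting. The sketch below uses only base R and assumes the server listens on localhost at port 4444, the default for both the standalone server and remoteDriver(). The helper name selenium_is_up is made up for illustration, and the check only confirms that something accepts connections on that port, not that it is Selenium.

```r
# Return TRUE if something accepts TCP connections on host:port.
selenium_is_up <- function(host = "localhost", port = 4444L, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL  # connection refused or timed out
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

if (!selenium_is_up()) {
  message("No server found on localhost:4444; start selenium-server-standalone first.")
}
```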

Common functions

Opening and closing the browser

library(RSelenium)
remDr <- remoteDriver(browserName = "chrome")
# Open the browser
remDr$open()

The browser shows a notice that it is being controlled by automated test software.
That means it worked XD

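# Close the browser (use one or the other)
remDr$quit()   # quit directly
remDr$close()  # close ends the current session; it can also be used to close the browser

Opening a specific page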

remDr$navigate("https://wq.epa.gov.tw/EWQP/zh/EnvWaterMonitoring/RiverWaterQuality.aspx")

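Locating elements

To locate an object on the page, use the function

findElements(
  using = c("xpath", "css selector", "id", "name", "tag name",
            "class name", "link text", "partial link text"),
  value
)

Here using can be "xpath", "css selector", "id", "name", "tag name", "class name", "link text", or "partial link text",
and value is the string to search for with that method. findElements() returns a list of every match; findElement() takes the same arguments and returns a single element.

For example, inspecting the drop-down menu on this page gives the XPath
//*[(@id = "CPH1_ddl_Unit")]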
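
Since the element has an id, the same target can be expressed with several of the using methods. A minimal sketch follows; the helper name locators_for_id is made up for illustration, and the commented-out call assumes a session opened as remDr.

```r
# Build three equivalent locators for an element with a given id attribute.
# Each entry can be passed to findElement() as its using/value pair.
locators_for_id <- function(id) {
  list(
    id    = list(using = "id",           value = id),
    css   = list(using = "css selector", value = paste0("#", id)),
    xpath = list(using = "xpath",        value = sprintf("//*[@id='%s']", id))
  )
}

loc <- locators_for_id("CPH1_ddl_Unit")
print(loc$xpath$value)

# With a live session, any of the three should find the same element:
# elem <- remDr$findElement(using = loc$css$using, value = loc$css$value)
```

Locating by id or CSS selector is usually shorter to write, while XPath is handy when the target has to be described relative to another element.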

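Ways to locate elements on a web page

Mouse clicks

Combining the located element with clickElement() simulates a mouse click, which is how an option in a drop-down list gets selected.

click(buttonId = 0): the default buttonId of 0 means a left click,
1 means a middle-button (scroll wheel) click,
2 means a right click.

unit_xpath2 <- "//*[@id='CPH1_ddl_Unit']/option[7]"  # option 7
btn2 <- remDr$findElement(using = 'xpath', value = unit_xpath2)
btn2$clickElement()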
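
When several options need to be visited, the XPath strings can be generated rather than typed by hand. This is a sketch only: the helper name option_xpath is made up, the range 1:3 is arbitrary, and the commented-out loop assumes a live remDr session.

```r
# Build the XPath of the n-th <option> inside the element with the given id.
option_xpath <- function(select_id, n) {
  sprintf("//*[@id='%s']/option[%d]", select_id, as.integer(n))
}

xpaths <- vapply(1:3, function(n) option_xpath("CPH1_ddl_Unit", n), character(1))
print(xpaths)

# With a live session, each option could then be clicked in turn:
# for (xp in xpaths) {
#   remDr$findElement(using = "xpath", value = xp)$clickElement()
#   Sys.sleep(1)  # give the page time to update; the delay is a guess
# }
```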

Alternatively, mouseMoveToLocation(x = NA_integer_, y = NA_integer_, webElement = NULL) can be used to control where the mouse moves.

Controlling input boxes

unit_xpath3 <- "//*[@id='txtSearch']"
btn3 <- remDr$findElement(using = 'xpath', value = unit_xpath3)
# key is the keyboard key pressed after the text has been typed
# key names map to UTF-8 codes, see
# https://github.com/SeleniumHQ/selenium/wiki/JsonWireProtocol#sessionsessionidelementidvalue
text <- list('地下水', key = 'enter')  # '地下水' is the search term "groundwater"
btn3$sendKeysToElement(text)
Rivername <- strsplit(as.character(btn3$getElementText()), "\n")
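
The last line splits the element's text on newlines. That step can be tried without a browser. In the sketch below the sample text is invented and merely stands in for the list that getElementText() returns; the helper name split_element_text is also made up.

```r
# Split element text into clean lines: one string per line,
# surrounding whitespace trimmed, empty lines dropped.
split_element_text <- function(element_text) {
  lines <- unlist(strsplit(as.character(element_text), "\n", fixed = TRUE))
  lines <- trimws(lines)
  lines[nzchar(lines)]
}

# Made-up sample standing in for the result of getElementText().
sample_text <- list("River A\n River B \n\nRiver C")
print(split_element_text(sample_text))
```

Returning a plain character vector instead of the nested list that strsplit() produces makes the result easier to loop over or to put into a data frame.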

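Extracting data

getElementTagName()  # look up the element's tag name
getElementText()     # get the element's text

Other useful functions

# Accept the alert dialog currently shown
remDr$acceptAlert()
# Dismiss the alert dialog currently shown
remDr$dismissAlert()
# Delete all cookies visible to the current page
remDr$deleteAllCookies()
# Maximize the window
remDr$maxWindowSize()
# Take a screenshot
remDr$screenshot(display = FALSE, useViewer = TRUE, file = NULL)
# Get the title of the current page
remDr$getTitle()
# Get the URL of the current page
remDr$getCurrentUrl()
# Go back or forward in the browser history
remDr$goBack()
remDr$goForward()
# Reload the current page
remDr$refresh()
# Position of an element on the page; (0, 0) is the top-left corner.
# This is a method of a located element, such as btn2 above.
btn2$getElementLocation()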

References
https://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf

https://docs.ropensci.org/RSelenium/

https://mran.microsoft.com/snapshot/2017-12-11/web/packages/RSelenium/vignettes/RSelenium-basics.html#sending-mouse-events-to-elements
