R Tutorial 1014 - Web Scraping

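# R Tutorial 1014 - Web Scraping

[slido](https://app.sli.do/event/ui4hmrw5/live/questions)

## Outline

- Preamble
- Scraping workflow
- The R style guide, if time allows

## Preamble

- What is a web crawler?

  A web crawler, also called a web spider or automatic indexer, is a program that browses the web automatically; in other words, a kind of web robot. A crawler can automatically collect the content of every page it is able to access, so that the user can analyze it further. From a site owner's point of view, aggressive crawling can look like a form of hacking, because it can consume a large share of the site's bandwidth and is generally unwelcome. A crawler should therefore behave as much like an ordinary user as possible, to avoid unnecessary legal trouble.
- How search engines work
- robots.txt
- Things to watch out for when scraping
  - Try to get each request right the first time, and avoid sending many malformed requests.
  - Behave like a normal user (for example, pause with `Sys.sleep()` for a random interval drawn with `runif()`), and do not send a large number of requests in a short time.
  - Respect the site you are scraping: do not use up its bandwidth or interfere with its normal service.
  - Mind data copyright, that is, whether the data may be published and freely reused. If the site does not say, assume the copyright belongs to the site.

## Scraping workflow

![](https://i.imgur.com/xT13nM2.png)

### Basic scraping tools

This tutorial uses the [`rvest`](https://rvest.tidyverse.org/) package, a tidyverse tool modeled on [beautiful soup](https://www.crummy.com/software/BeautifulSoup/). The most commonly used functions are:

- `read_html()`: reads the content of a web page.
- `html_nodes()`: selects nodes from the page structure, specified by tag name or other CSS selector, or by XPath.
- `html_text()`: gets the text content of a node.
- `html_attr()`: gets an attribute of a node.

When scraping, we have to give the program the exact path to the data we want. Page structures can be quite complex, and finding the path by hand can take a long time, so it helps to install a browser plug-in that finds the correct path quickly. Chrome users can install [XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=zh-TW) and the CSS selector tool [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?itemlang=it&hl=zh-tw).

```r
library(tidyverse)
library(rvest)
```

### Example 1: TWSE daily closing quotes

[Link](https://www.twse.com.tw/exchangeReport/MI_INDEX?response=html&type=ALLBUT0999&date=20191008)

```r
# 1014 R web scraping
rm(list = ls()); gc()
# install.packages("rvest")  # run once if rvest is not installed yet
library(tidyverse)
library(rvest)

# TWSE stock price scraper
# URL
url <- "https://www.twse.com.tw/exchangeReport/MI_INDEX?response=html&type=ALLBUT0999&date=20191008"

# Download the page once and reuse it for both the header and the body
page <- read_html(url, encoding = "utf-8")

# Column titles of the price table
stockTitle <- page %>%
  html_nodes(xpath = "/html/body/div/table[9]/thead/tr[3]/td") %>%
  html_text()

# Body of the price table (16 cells per row)
stockPriceData <- page %>%
  html_nodes(xpath = "/html/body/div/table[9]/tbody/tr/td") %>%
  html_text() %>%
  matrix(ncol = 16, byrow = TRUE) %>%
  as.data.frame(stringsAsFactors = FALSE)
colnames(stockPriceData) <- stockTitle

# Scrape several consecutive days
dateList <- seq.Date(from = as.Date("2019-09-01"), to = as.Date("2019-09-03"), by = "days") %>%
  gsub("-", "", .)   # "2019-09-01" -> "20190901"

# Result table
output <- NULL

for (di in seq_along(dateList)) {

  cat(paste0("Downloading trading day ", dateList[di], ", progress: ", di, " / ", length(dateList), "\n"))

  # URL
  url <- paste0("https://www.twse.com.tw/exchangeReport/MI_INDEX?response=html&date=",
                dateList[di], "&type=ALLBUT0999")

  # One request per day; the parsed page is reused below
  page <- read_html(url, encoding = "utf-8")

  # Check whether there is price data for this day
  # (the Chinese string is the site's "no matching data" message)
  content <- page %>%
    html_nodes(xpath = "/html/body/div[1]") %>%
    html_text()

  if (!any(grepl("沒有符合條件的資料", content))) {

    # Column titles of the price table
    stockTitle <- page %>%
      html_nodes(xpath = "/html/body/div/table[9]/thead/tr[3]/td") %>%
      html_text()

    # Body of the price table
    stockPriceData <- page %>%
      html_nodes(xpath = "/html/body/div/table[9]/tbody/tr/td") %>%
      html_text() %>%
      matrix(ncol = 16, byrow = TRUE) %>%
      as.data.frame(stringsAsFactors = FALSE)

    # Replace the column names with the scraped titles
    colnames(stockPriceData) <- stockTitle

    # Store the data
    output <- bind_rows(output, stockPriceData %>% mutate(date = dateList[di]))

  } else {

    cat(paste0("Market closed on ", dateList[di], "\n"))
  }

  # Pause before the next request
  Sys.sleep(5)
}
```

Points to note:

- The header and the body of the table are scraped separately.
- The URL is used like an API: changing the `date` parameter selects the trading day.

### Example 2: cnyes news

[Link](https://news.cnyes.com/news/cat/tw_stock_news)

```r
# cnyes news scraper
url <- "https://news.cnyes.com/news/cat/tw_stock_news"

# Download the list page once and reuse it
page <- read_html(url, encoding = "utf-8")

# News titles (selected with a CSS selector)
title <- page %>%
  html_nodes(css = "._1xc2") %>%
  # XPath alternative:
  # html_nodes(xpath = "//div/a[@class='_1Zdp']/div[@class='_1xc2']/h3")
  html_text()

# News links
newsLink <- page %>%
  html_nodes(css = "._1Zdp") %>%
  html_attr("href")

# Content of a single news article
data <- read_html("https://news.cnyes.com/news/id/4394884", encoding = "utf-8") %>%
  html_nodes(xpath = "//p") %>%
  html_text() %>%
  paste(., collapse = "")

# Loop over the links and fetch each article's content
newsContent <- character(length(newsLink))
for (ix in seq_along(newsLink)) {

  cat(paste0("Progress: ", ix, "/", length(newsLink), "\n"))

  # URL
  url <- paste0("https://news.cnyes.com", newsLink[ix])

  # Article content: all paragraphs joined into one string
  data <- read_html(url, encoding = "utf-8") %>%
    html_nodes(xpath = "//p") %>%
    html_text() %>%
    paste(., collapse = "")

  # Store the data
  newsContent[ix] <- data

  # Pause before the next request
  Sys.sleep(1)
}

# Combine into a tibble
output <- tibble(title, newsLink, newsContent)
```

Points to note:

- CSS is used instead of XPath, so that all titles are fetched in one call.
- `html_attr()` is used to read `href`, so that all links are fetched in one call.

### Example 3: PTT movie board

[Link](https://www.ptt.cc/bbs/movie/index.html)

```r
# PTT movie board scraper
url <- "https://www.ptt.cc/bbs/movie/index.html"

# Infer the current page number from the "previous page" link
pageNum <- read_html(url, encoding = "utf-8") %>%
  html_nodes(xpath = "//a[@class='btn wide'][2]") %>%   # the "previous page" button
  html_attr("href") %>%                                  # value of its href attribute
  gsub("\\D", "", .) %>%                                 # keep digits only
  as.numeric()

# How many pages to go back from the previous page
# (the loop below reads pageRead + 2 pages in total)
pageRead <- 3

# Table of article links
articleTable <- NULL

# pageNum is the previous page's index, so pageNum + 1 is the latest page
pages <- seq(pageNum - pageRead, pageNum + 1, 1)

# Loop over the pages and collect the article links on each
for (i in seq_along(pages)) {

  page <- pages[i]
  cat(paste0("Reading page ", page, ", progress: ", i, " / ", length(pages), "\n"))

  # URL
  url <- paste0("https://www.ptt.cc/bbs/movie/index", page, ".html")

  # Download the page source
  html <- read_html(url, encoding = "utf-8")

  # Titles
  title <- html %>%
    html_nodes(xpath = "//div[@class='title']") %>%
    html_text() %>%
    gsub("[\n\t]", "", .)

  # Article links (deleted articles have no <a>, so they are absent here)
  link <- html %>%
    html_nodes(xpath = "//div[@class='title']/a") %>%
    html_attr("href") %>%
    paste0("https://www.ptt.cc", .)

  # Article dates
  articleDate <- html %>%
    html_nodes(xpath = "//div[@class='date']") %>%
    html_text()

  # Drop deleted articles so title and date line up with link
  # (the pattern matches the literal text at the end of PTT's deletion notice)
  removeSite <- grep("刪除)", title, fixed = TRUE)
  if (length(removeSite) > 0) {
    articleDate <- articleDate[-removeSite]
    title <- title[-removeSite]
  }

  # Store the data
  articleTable <- bind_rows(articleTable, tibble(articleDate, title, link))

  # Pause before the next request
  Sys.sleep(0.1)
}
```

Points to note:

- Pagination has to be handled.
- Browse the PTT site itself to get a feel for its structure.

## Other scraping tools

[RSelenium](https://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf)

---

# [R style guide](https://google.github.io/styleguide/Rguide.html)

# Ref

https://github.com/SuYenTing/Quantitative_investment_material_in_R/tree/master/R%E8%AA%9E%E8%A8%80%E7%88%AC%E8%9F%B2%E6%95%99%E5%AD%B8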
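
Supplement to the scraping etiquette in the preamble: the advice to pause between requests for a random interval can be sketched in base R as follows. This is only a minimal illustration. The helper name `polite_sleep` and its default bounds are assumptions made for this example and are not part of the original material.

```r
# Hypothetical helper (not from the original tutorial): pause for a random
# number of seconds between min_sec and max_sec before the next request.
polite_sleep <- function(min_sec = 1, max_sec = 3) {
  stopifnot(min_sec >= 0, max_sec >= min_sec)
  delay <- runif(1, min = min_sec, max = max_sec)  # one uniform random draw
  Sys.sleep(delay)
  invisible(delay)  # return the delay so the caller can log it
}

# Usage: wait between iterations of a scraping loop
for (i in 1:3) {
  waited <- polite_sleep(0.1, 0.3)
  cat(sprintf("request %d sent after waiting %.2f seconds\n", i, waited))
}
```

Replacing a fixed `Sys.sleep(5)` with a call like `polite_sleep(3, 7)` keeps the average wait similar while making the request pattern less regular.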
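
Supplement to the basic scraping tools section: the four `rvest` functions can be tried without touching any real site by parsing a small HTML string. The HTML snippet and its class name `news` are invented for this sketch and do not correspond to any of the sites in the examples.

```r
library(rvest)

# A made-up page used only for illustration; the class name is hypothetical.
page <- read_html('<html><body>
  <div class="news"><a href="/news/id/1">First headline</a></div>
  <div class="news"><a href="/news/id/2">Second headline</a></div>
</body></html>')

# CSS selector: every <a> inside an element with class "news", then its text
titles <- page %>%
  html_nodes(css = ".news a") %>%
  html_text()

# XPath: every <a> directly under a <div class="news">, then its href attribute
links <- page %>%
  html_nodes(xpath = "//div[@class='news']/a") %>%
  html_attr("href")

print(titles)
print(links)
```

Both selectors return every matching node, so one call yields a whole vector of titles or links. Examples 2 and 3 rely on the same pattern.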