tags: RSelenium

Scrape web tables from QuickFS using RSelenium on Windows 7

The aim is to scrape all the tables at


Connect Chrome driver, RSelenium

Data Analysis and Extraction - RSelenium tutorial

  • Install R packages
dir.R.packages <- "C:/Program Files/R/R-4.0.3/library" 

#install.packages("tidyquant", lib = dir.R.packages)
#install.packages("RSelenium", lib = dir.R.packages)

library(RSelenium,lib.loc = dir.R.packages)
  • Download ChromeDriver. The driver version must match the installed Chrome browser; a mismatch (e.g., a downloaded driver that does not support an older browser) can prevent the session from starting. The version that worked here is ChromeDriver 89.0.4389.23

  • Download Selenium Server

  • Launch cmd.exe and change directory to D:/My Software, where selenium-server-standalone-3.141.59.jar is located

D:
cd D:\My Software
REM Execute the following command
java -Dwebdriver.chrome.driver="C:\drivers\chromedriver_win32\chromedriver.exe" -jar selenium-server-standalone-3.141.59.jar
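
The version note above can be checked before starting the server. This is a minimal sketch, assuming the ChromeDriver convention that driver and browser should share the major, minor, and build numbers (the first three components). The helper names `version_prefix` and `versions_match` are hypothetical and not part of RSelenium.

```r
# Keep the first n dot-separated components of a version string
version_prefix <- function(version, n = 3) {
  parts <- strsplit(version, ".", fixed = TRUE)[[1]]
  paste(head(parts, n), collapse = ".")
}

# Assumption: driver and browser are compatible when major.minor.build agree
versions_match <- function(browser, driver) {
  version_prefix(browser) == version_prefix(driver)
}

# Chrome 89.0.4389.90 and ChromeDriver 89.0.4389.23 are the versions in this note
versions_match("89.0.4389.90", "89.0.4389.23")
```

The version strings still have to be read manually, for example from chrome://version and from running chromedriver.exe --version.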

con <- RSelenium::remoteDriver(remoteServerAddr="localhost"
                                ,port=4444
                                ,browserName="chrome")

# Open the connection
con$open()

# Send an URL to the new session
con$navigate("https://quickfs.net/company/CKF:AU")

  • The website should open in the new session. If an error occurs, navigate to the URL again
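
Retrying the navigation by hand can be wrapped in a small helper. This is a sketch only: `with_retry` is a hypothetical function, not part of RSelenium, and the number of attempts and the pause are arbitrary defaults.

```r
# Run `action` up to `attempts` times, pausing `wait` seconds between failures
with_retry <- function(action, attempts = 3, wait = 1) {
  stopifnot(attempts >= 1)
  for (i in seq_len(attempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = action()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) return(invisible(result$value))
    if (i < attempts) Sys.sleep(wait)
  }
  # Every attempt failed: report the last error message
  stop("all ", attempts, " attempts failed: ", conditionMessage(result$error))
}

# Usage with the session `con` opened above (needs a running Selenium server):
# with_retry(function() con$navigate("https://quickfs.net/company/CKF:AU"))
```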


Scrape all tables of the webpage where dropdown option is "Overview"

#-------------------------------------------------------------
# Dropdown= Overview
# Select "overview" from the dropdown list and then get the content of all tables
#-------------------------------------------------------------
library(XML, lib.loc = dir.R.packages) # provides htmlParse() and readHTMLTable()
tables <- htmlParse(con$getPageSource()[[1]]) # class(tables)
readHTMLTable(tables)

# Extracting tables
library(rvest, lib.loc = dir.R.packages)
x <- con$getPageSource()[[1]] %>% 
  read_html() %>%
  html_table()

table.1 <- x[[1]] # class(table.1) "data.frame"
table.2 <- x[[2]] # class(table.2) "data.frame"

# Extract sub tables
names(table.1)
str(table.1)

# Reshape the table Valuation Ratios 
library(tidyr, lib.loc = dir.R.packages)

table.1.1 <- table.1[c(2:9),c(1:2)] %>%
  dplyr::rename(name=`Key Statistics`
                ,value=`Key Statistics.1`) %>%
  tidyr::pivot_wider(names_from = name, values_from=value)

# Clean column names
colnames(table.1.1) <- sub(x=colnames(table.1.1), pattern = "/", replacement = ".")

# Reshape the table 10-Yr Median Returns
table.1.2 <- table.1[c(2:4),c(3:4)] %>%
  dplyr::rename(name=`Key Statistics`
                ,value=`Key Statistics.1`) %>%
  tidyr::pivot_wider(names_from = name, values_from=value)

# Reshape the table 10-Year CAGR
table.1.3 <- table.1[c(6:9),c(3:4)] %>%
  dplyr::rename(name=`Key Statistics`
                ,value=`Key Statistics.1`) %>%
  tidyr::pivot_wider(names_from = name, values_from=value)

# Reshape the table 10-Yr Median Margins
table.1.4 <- table.1[c(2:5),c(5:6)] %>%
  dplyr::rename(name=`Key Statistics`
                ,value=`Key Statistics.1`) %>%
  tidyr::pivot_wider(names_from = name, values_from=value)

# Reshape the table Capital Structure
table.1.5 <- table.1[c(7:9),c(5:6)] %>%
  dplyr::rename(name=`Key Statistics`
                ,value=`Key Statistics.1`) %>%
  tidyr::pivot_wider(names_from = name, values_from=value)
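
The five reshaping steps above repeat one pattern: take a block of rows and two columns, then spread the names into column headers. A minimal sketch of a helper for that pattern follows. It uses base R instead of the tidyr::pivot_wider call used in this note, `block_to_wide` is a hypothetical name, and the `toy` table with its values is made up for illustration.

```r
# Turn a two-column name/value block into a one-row data frame
block_to_wide <- function(tbl, rows, cols) {
  block <- tbl[rows, cols]
  # One row of values, then label each column with the matching name
  out <- as.data.frame(t(block[[2]]), stringsAsFactors = FALSE)
  colnames(out) <- block[[1]]
  out
}

# Toy stand-in for table.1 (illustrative values, not scraped data)
toy <- data.frame(
  a = c("Valuation Ratios", "P/E", "P/B"),
  b = c("", "25.1", "4.2"),
  stringsAsFactors = FALSE
)

ratios <- block_to_wide(toy, rows = 2:3, cols = 1:2)
print(ratios)
```

With the real data, a call such as block_to_wide(table.1, 2:9, 1:2) would stand in for the first reshaping step, assuming table.1 has the layout used above.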

#----------------------------------------
# Reshape the table with 10 year overview
#----------------------------------------
str(table.2)
table.2$X1 <- table.2$X1 %>%
  gsub(x=., pattern = " ", replacement = ".") %>%
  gsub(x=., pattern = "%", replacement = "percent")

base.table.2 <- data.frame()
iterators <- colnames(table.2)[2:11]
years <- table.2[1,c(2:11)] # dim(years) 1 10
item.names <- table.2[c(2:14),1]

for(i in 1:ncol(years)){
  # Get the year and the column name by position
  year <- years[1,i] # "2011"
  name <- colnames(years)[i]
  # Reshape a single year of data to long format
  .year.long <- data.frame( name=table.2[c(2:14), 1]
                            ,value=table.2[c(2:14),i+1]
                            ,stringsAsFactors = FALSE)
  
  .year.wide <- .year.long %>% 
    tidyr::pivot_wider(names_from = name, values_from=value) %>%
    # Add year
    dplyr::mutate(year=year) %>%
    dplyr::select(year,everything())
  # Vertically add the current year of data to the base data set
  base.table.2 <- dplyr::bind_rows(base.table.2, .year.wide)
}
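
The loop above builds one row per year. The same result can be sketched without a loop by transposing the table body. This is an alternative in base R, not the method used in this note; `reshape_year_table` is a hypothetical name and the `toy` table holds made-up values that mimic the layout of table.2 (years in the first row, item names in the first column).

```r
# Transpose a table whose first row holds years and first column holds item names
reshape_year_table <- function(tbl) {
  years <- unlist(tbl[1, -1], use.names = FALSE)
  items <- tbl[-1, 1]
  # After transposing, each year is a row and each item is a column
  body <- t(as.matrix(tbl[-1, -1, drop = FALSE]))
  out <- data.frame(year = years, body, stringsAsFactors = FALSE, row.names = NULL)
  colnames(out) <- c("year", items)
  out
}

# Toy stand-in for table.2 (illustrative values, not scraped data)
toy <- data.frame(
  X1 = c("", "Revenue", "Net.Income"),
  X2 = c("2011", "100", "10"),
  X3 = c("2012", "120", "15"),
  stringsAsFactors = FALSE
)

wide <- reshape_year_table(toy)
print(wide)
```

All columns stay character, as in the loop version; converting them to numeric is left out because the formatting of the scraped values is not shown here.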


Scrape tables on the webpage where dropdown option is "Income Statement" (not working)

Inspect the web elements

<div _ngcontent-c1="" class="col-xs-offset-3 col-xs-2"><select-fs-dropdown _ngcontent-c1="" _nghost-c4="">
  <div _ngcontent-c4="" class="btn-group open" dropdown="">
  <button _ngcontent-c4="" class="selectDropdown dropdown-toggle" dropdowntoggle="" type="button" aria-haspopup="true" aria-expanded="true">
  <div _ngcontent-c4="" class="dropdownLabel">Overview</div>
  </button>
  <!----><ul _ngcontent-c4="" class="dropdown-menu" id="select-fs-dropdown" role="menu">
  <!----><li _ngcontent-c4="">
  <a _ngcontent-c4="" id="ovr">Overview</a>
  </li><li _ngcontent-c4="">
  <a _ngcontent-c4="" id="is">Income Statement</a>
  </li><li _ngcontent-c4="">
  <a _ngcontent-c4="" id="bs">Balance Sheet</a>
  </li><li _ngcontent-c4="">
  <a _ngcontent-c4="" id="cf">Cash Flow Statement</a>
  </li><li _ngcontent-c4="">
  <a _ngcontent-c4="" id="ratios">Key Ratios</a>
  </li>
  </ul>
  </div>
  </select-fs-dropdown></div>

Locating the "Overview" option by its id fails with NoSuchElement:

id.ovr <- con$findElement(using = 'id', value = "ovr")

Selenium message:no such element: Unable to locate element: {"method":"css selector","selector":"#ovr"}
(Session info: chrome=89.0.4389.90)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:25:53'
System info: host: 'CHANG-PC', ip: '192.168.0.167', os.name: 'Windows 7', os.arch: 'amd64', os.version: '6.1', java.version: '1.8.0_231'
Driver info: driver.version: unknown

Error: 	 Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
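
One possible cause, which this note does not confirm, is that the page had not finished rendering the menu when findElement ran. A sketch of a polling helper follows, assuming that explanation. `wait_for` is a hypothetical name, the timeout values are arbitrary, and the CSS selector in the usage comments is taken from the class attribute in the HTML above. If the menu sits inside a frame or is otherwise unreachable, this will not help.

```r
# Call `find` repeatedly until it succeeds or `timeout` seconds have passed
wait_for <- function(find, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat an error (e.g. NoSuchElement) as "not found yet"
    found <- tryCatch(find(), error = function(e) NULL)
    if (!is.null(found)) return(found)
    if (Sys.time() >= deadline) stop("element not found within timeout")
    Sys.sleep(interval)
  }
}

# Usage with the session `con` from earlier (needs a live Selenium session):
# toggle <- wait_for(function() con$findElement(using = "css selector",
#                                               value = "button.selectDropdown"))
# toggle$clickElement()   # open the dropdown first
# is.link <- wait_for(function() con$findElement(using = "id", value = "is"))
# is.link$clickElement()  # then pick "Income Statement"
```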

R - Rselenium - navigate drop down menu / list / box using = 'id'