---
tags: automate-job
---
# Job Search Automation: Week 2
This week, we will get started on building web scrapers for JavaScript-heavy webpages. [Link](https://drive.google.com/open?id=1tb-agxhifxq36LQ9bu4i5ejzfc5Kxcxc_ZF8POGzBeU) to this week's slides.
## Table of Contents
[TOC]
## Static Webpages
A static webpage never changes unless the developer edits the source code. When we make a request to the website's server, the server will load the site from an actual `.html` file. Every single user who visits the site will get the same information.
We can think of a static webpage as a vending machine. We get what we see!
## Dynamic Webpages
A dynamic webpage, on the other hand, contains information that can change based on the user. When the user requests the webpage, the server will consider the user's information and generate a site based on the user.
We can think of a dynamic webpage as a restaurant. We can order anything from the menu and the order gets sent to the kitchen where ingredients are assembled to create the dish we requested.
## Dynamic Website Example
When we log in to Facebook, we can instantly see why Facebook is a **dynamic website**. We're greeted with our news feed, which contains news from friends we often interact with or information that aligns with our personal interests. The content of the webpage *changes* to suit the user!

This makes dynamic webpages *difficult* to scrape with the tools we've already learned: Requests and BeautifulSoup. If the content of a webpage is changing based on various factors, the data we want may not be available when we download the source code through Requests. We would have to *interact* with the website first to access the data we want.
If we log out of our account and type facebook.com into the address bar of our browser, we're greeted with a login page instead. Most likely, this is the source code that Requests will download for us. However, suppose we want to scrape the information from our *personal news feed*. We'd have to run some program that can simulate or control a browser for us. Specifically, we want a program that will fill out the Email and Password forms for us and log us in.

## JavaScript
**JavaScript** is a programming language that allows webpages to interact with the user or accomplish complex actions.

> Quoted from [here](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/First_steps/What_is_JavaScript):
>
> HTML is the markup language that we use to structure and give meaning to our web content, for example defining paragraphs, headings, and data tables, or embedding images and videos in the page.
>
> CSS is a language of style rules that we use to apply styling to our HTML content, for example setting background colors and fonts, and laying out our content in multiple columns.
>
> JavaScript is a scripting language that enables you to create dynamically updating content, control multimedia, animate images, and pretty much everything else. (Okay, not everything, but it is amazing what you can achieve with a few lines of JavaScript code.)
Sometimes JavaScript is embedded into the HTML source code we download from web pages. Other times, it's in a separate JS file. For **Chrome users**, we can also debug the JavaScript of a webpage by clicking **View -> Developer -> JavaScript Console**.
If you're interested in understanding JavaScript syntax, watch this series of [videos](https://www.youtube.com/watch?v=fGdd9qNwQdQ&list=PLoYCgNOIyGACTDHuZtn0qoBdpzV9c327V&index=1). If you're having a busy week and running low on time, watch this [one](https://www.youtube.com/watch?v=e57ReoUn6kM&list=PLoYCgNOIyGACTDHuZtn0qoBdpzV9c327V&index=6) to understand selectors.
## Selenium
We will be using a program called Selenium to automate our browser. More specifically, [Selenium Python](https://selenium-python.readthedocs.io/installation.html#introduction), which is an API that lets you access Selenium functions through Python code.
### Install Selenium
Run in your terminal:
```
$ pip install selenium
```
### ChromeDriver
Selenium requires a **webdriver**, which is an automation framework that lets you run tests against different browsers.
It seems like y'all are Chrome users, so we will be focusing on installing ChromeDriver, a webdriver for Chrome. Later on, we'll configure the driver to work on [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome).
:::info
Headless Chrome is Google Chrome running without its GUI. Headless browsers are web browsers without a graphical interface that can be controlled through programs or from your terminal.
:::
### Install ChromeDriver
Download [ChromeDriver](http://chromedriver.chromium.org/) and place it in the same directory as your code.
If you have a Mac, you can also install it using Homebrew by running:
```
brew install chromedriver
```
## BearX Demo
Now that we know the tools we can use, let's walk through an example! The website we're looking at is [BearX](https://bearx.co/login/). [Demo code](https://drive.google.com/open?id=19QEv992LD4L1grjcronsvskan2-yYIhD).
For our new file, we want an import statement.
```Python
from selenium import webdriver
```
Now, we'll set up our **ChromeDriver**. There are a few options we want to set up.
```Python
#neat way of setting options for our ChromeDriver
options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('window-size=1200x600')
```
Adding the argument *headless* indicates we will be running our tests on **Headless Chrome**. Adding the argument for *window-size* lets us decide how large the window of our headless browser will be. Set it to a reasonable size! If you set it to a size such as 1 x 1, nothing will be clickable. If you're interested, there are other [options](https://sites.google.com/a/chromium.org/chromedriver/capabilities).
Now, we initialize our driver with the options we set up.
```Python
driver = webdriver.Chrome(chrome_options=options)
```
Now have the driver visit the website we want to scrape.
```Python
driver.get('https://bearx.co/login/')
driver.implicitly_wait(10)
```
:::info
The `implicitly_wait` function tells Selenium to keep retrying for up to *10 seconds* whenever it looks for an element that hasn't appeared yet. This is useful because webpages can take time to load, and we want to give them a chance to finish loading before interacting with them.
:::
If we're not logged in, we get a webpage that looks like the one below. We want to use Selenium to log in for us and click to the jobs page so we can scrape it.
If we want to check that our headless browser is going to the right webpage, we can tell the driver to take a screenshot by running the following:
```Python
driver.get_screenshot_as_file('login.png')
```
:::info
The input into the `get_screenshot_as_file` function will be the filename of the screenshot. It will appear in the same directory as your code. The screenshot should match the picture below.
:::

To log in, we need to find the specific page elements that allow us to input text into the **Email Address** and **Password** forms, so we write code to access these elements. We can refer to the [Selenium documentation](https://selenium-python.readthedocs.io/locating-elements.html) on locating elements to use the following functions.
```Python
#selects the email address input
email = driver.find_element_by_css_selector('input[type=text]')
#selects the password input
password = driver.find_element_by_css_selector('input[type=password]')
```
:::info
The function `find_element_by_css_selector` allows you to reference the tag that the CSS element falls under and any attribute that it contains. For example, `input[type=text]` tells the driver that it's looking for an element under the `<input>` tag with `type = text`.
:::
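To build intuition for what `input[type=text]` matches, here's a browser-free sketch using Python's built-in `html.parser` (the form HTML below is made up for illustration):

```Python
from html.parser import HTMLParser

class InputFinder(HTMLParser):
    """Collects <input> tags whose type attribute is "text",
    mimicking what the CSS selector input[type=text] matches."""
    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "input" and ("type", "text") in attrs:
            self.matches.append(dict(attrs))

finder = InputFinder()
finder.feed('<form><input type="text" name="email">'
            '<input type="password" name="pass"></form>')
print(finder.matches)  # [{'type': 'text', 'name': 'email'}]
```

Selenium does this matching for us inside the browser; the sketch just shows which element the selector singles out.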
After selecting these elements, we want to input our email and password.
```Python
email.send_keys(secret.username)
password.send_keys(secret.password)
```
:::info
The function `send_keys` allows you to type anything into any element that requires an input.
:::
In this case, I have another file called `secret.py` that holds my username and password, so `secret.username` and `secret.password` are meant to represent my username and password respectively. Personally, I suggest using another file to hold your login information and not pushing it to your GitHub repository if you plan on posting your code to a public repo.
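As a rough sketch, `secret.py` is just a tiny module with two variables (the values below are obviously placeholders, not real credentials):

```Python
# secret.py -- keep this file out of version control (e.g. list it in .gitignore)
username = "me@example.com"       # placeholder, not a real account
password = "not-a-real-password"  # placeholder, not a real password
```

Then `import secret` in the main script makes `secret.username` and `secret.password` available without the credentials ever appearing in the scraper code itself.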
Let's take a screenshot to see if we're on the right track.
```Python
driver.get_screenshot_as_file('login-page.png')
```

All we have to do is select the element that represents the login button and click it.
```Python
#selects the login button
login = driver.find_element_by_css_selector('input[value="Login"]')
#click the login button
login.click()
```
:::info
The function `click` allows the driver to click on clickable elements.
:::
Let's wait for the page to load and take a screenshot.
```Python
driver.implicitly_wait(1000)
driver.get_screenshot_as_file('post_login.png')
```

Now, let's navigate to all the job listings using the drop down menu next to **Dashboard**.
```Python
dropdowns = driver.find_element_by_id('navbarDropdown')
dropdowns.click()
driver.get_screenshot_as_file('dropdown.png')
```
:::info
The function `find_element_by_id` finds the first element in the code that matches the `id` you pass in as the argument.
:::
We should get the following screenshot:

With the drop-down menu open, we can now click on **Jobs** to navigate to the right webpage.
```Python
jobs = driver.find_element_by_link_text('Jobs')
jobs.click()
driver.get_screenshot_as_file('dropdown-click.png')
```
:::info
The function `find_element_by_link_text` selects the first `<a>` element whose link text matches the string you pass in as the parameter.
:::
The screenshot we get:

Now, we can use Selenium to obtain the elements we want. Suppose we want to scrape the position name and the corresponding links.
```Python
positions = driver.find_elements_by_tag_name('h6')
```
:::info
The function `find_elements_by_tag_name` returns an array of elements that have the tag we specified as the parameter.
:::
If we want to check the text in these elements, we can run
```Python
for elem in positions:
    print(elem.text)
```
Finally, we want to obtain all the links that correspond to the position.
```Python
all_links = driver.find_elements_by_xpath("//a[@href]")
```
:::info
The function `find_elements_by_xpath` returns an array of elements that have the path we've specified in the parameter. Refer to the [documentation](https://selenium-python.readthedocs.io/locating-elements.html) for more details on formatting the xpath.
:::
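To see what `//a[@href]` selects without launching a browser, we can try the same predicate on a toy fragment with Python's built-in `xml.etree.ElementTree`, which supports a small subset of XPath (the links below are made up):

```Python
import xml.etree.ElementTree as ET

# toy stand-in for a scraped page
doc = ET.fromstring(
    '<div>'
    '<a href="https://example.com/job/1">Job 1</a>'
    '<a name="no-link">Not a link</a>'
    '<a href="https://example.com/job/2">Job 2</a>'
    '</div>'
)
# ".//a[@href]" keeps only <a> elements that actually have an href attribute,
# mirroring the "//a[@href]" xpath we pass to Selenium
links = [a.get("href") for a in doc.findall(".//a[@href]")]
print(links)  # ['https://example.com/job/1', 'https://example.com/job/2']
```

Note how the `<a>` tag without an `href` is skipped -- the `[@href]` predicate filters on the presence of the attribute.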
If we want to check the links in these elements, we can run the following code:
```Python
for elem in all_links:
    print(elem.get_attribute("href"))
```
:::info
`get_attribute` allows us to access the value of the attribute we specify inside a tag.
:::
### Final Formatting
```Python
position_names = []
for elem in positions:
    position_names.append(elem.text)
#hardcoding to obtain the relevant links
wanted_links = all_links[13:33]
#the application links alternate with other links, so keep every other one
application_links = []
add_link = 1
for elem in wanted_links:
    if add_link == 1:
        application_links.append(elem.get_attribute("href"))
        add_link = 0
    else:
        add_link = 1
table_list = [position_names, application_links]
```
**Image**
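The every-other-link filtering above can be sanity-checked on toy data without a browser (all names and links below are made up):

```Python
# hypothetical scraped data: application links alternate with other links
wanted_links = ["apply/1", "profile/1", "apply/2", "profile/2"]
# keeping indices 0, 2, 4, ... is equivalent to the add_link toggle above
application_links = wanted_links[::2]

position_names = ["SWE Intern", "Data Analyst"]
# the final table pairs positions with their application links
table_list = [position_names, application_links]
print(table_list)  # [['SWE Intern', 'Data Analyst'], ['apply/1', 'apply/2']]
```

Hardcoded slices like `[13:33]` are brittle -- if the site layout changes, the indices need to be re-counted -- but they're fine for a quick one-off scrape like this one.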

## Spotted Bugs/Tips
### Have conflicting programs installed?
**Set up a virtual environment!**
If you want to avoid dependency conflicts, set up [virtualenv](https://virtualenv.pypa.io/en/stable/).
**If you don't have virtualenv, run**
```
$ sudo pip install virtualenv
```
**Set up a virtual environment for Selenium**
1. Navigate to the directory where you want to run your code.
2. Move ChromeDriver to your current directory.
3. To create your virtual environment and activate it, run:
```
$ virtualenv -p python3 env
$ source env/bin/activate
```
Having trouble with virtualenv? Check [here](https://stackoverflow.com/questions/31133050/virtualenv-command-not-found).
### Can't get webdriver.Chrome to work?
Try adding an [executable path](https://stackoverflow.com/questions/51199515/selenium-error-chromedriver-executable-needs-to-be-in-path) to the location of your webdriver as an argument.
### Having Trouble Selecting the Right Element?
Click the blue button on the top left when you inspect a webpage. Then, click the element that you want to locate in your code.

**Demonstration**

### Can't get webpage to load completely?
Use the function `implicitly_wait` to have the driver keep retrying element lookups for up to a set number of seconds while the page finishes loading. We specify the number of seconds as the input.
### Trying to load a Webpage by Scrolling?
Taken from this [StackOverflow answer](https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python):
```Python
import time

SCROLL_PAUSE_TIME = 0.5
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
```
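Stripped of Selenium, the loop above is a generic poll-until-stable pattern: keep measuring until the measurement stops changing. A browser-free sketch with made-up height readings:

```Python
import time

# made-up stand-ins for successive document.body.scrollHeight readings
heights = iter([1000, 2000, 3000, 3000])

def get_height():
    return next(heights)

last_height = get_height()
while True:
    time.sleep(0.01)  # stand-in for SCROLL_PAUSE_TIME
    new_height = get_height()
    if new_height == last_height:
        break  # the page stopped growing, so all content has loaded
    last_height = new_height

print(last_height)  # 3000
```

The pause between measurements matters: without it, the height can look stable simply because the page hasn't had time to fetch the next batch of content yet.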
## Resources
**About Static/Dynamic Webpages**
[Static vs Dynamic Webpage](https://noahveltman.com/static-dynamic/)
**About Javascript**
[What is Javascript?](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/First_steps/What_is_JavaScript)
**About Headless Chrome and Selenium**
[Driving Headless Chrome with Python](https://duo.com/decipher/driving-headless-chrome-with-python)
[Running Selenium with Headless Chrome](https://intoli.com/blog/running-selenium-with-headless-chrome/)