---
tags: automate-job
---

# Job Search Automation: Week 1

Check the project README.md for the motivation and design behind this project. Here are this week's [slides](https://slides.com/tzee/deck#/).

## Table of Contents

[Toc]

## A Basic Introduction to Web Scraping

Web scraping is the process of having a program download and process content from the web. The simplest web scraper downloads a website's HTML code and extracts information from it. Running a scraper that goes through websites and returns a cumulative list of content is much easier than spending precious time searching for that content by hand.

**A scenario:** Suppose you'd like to wake up to an updated album of cute cats every single day. Since you're a busy student, you decide to web scrape cat pictures from your favorite websites, starting with Google Images (*see the picture below*).

![](https://i.imgur.com/0n99Zcf.jpg)

:::danger
**STOP.** Before you research how to scrape these images from a webpage, check whether either of the conditions below applies. If **any** of them does, don't web scrape!
:::

- [ ] Does the website provide an [API](https://medium.com/@TebbaVonMathenstien/what-is-an-api-and-why-should-i-use-one-863c3365726b) and does it make your life [easier](https://www.grepsr.com/web-scraping-vs-api)?
- [ ] Is it unethical to scrape content from this website? (For example, check its [robots.txt](http://www.robotstxt.org/).)

**The answer:**

* Google provides an [API](https://developers.google.com/custom-search/docs/overview#available_apis). Is it worth it? Yes. *Refer to the image below. The source code is pretty complicated!*

  ![](https://i.imgur.com/v480BfO.jpg)
* Yup. [Not ethical](https://www.google.com/robots.txt). Don't risk it.

Want more practice? Try looking at the sites below.
* [LinkedIn](https://www.linkedin.com)
* [Unsplash](https://www.unsplash.com)
* [Reddit](https://reddit.com)
* [Medium](https://medium.com)

## HTML & CSS

HTML is the format webpages are written in. Depending on the web browser, there are different [ways](https://blog.kissmetrics.com/how-to-read-source-code/) to view a webpage's source code. In Google Chrome, right-click the page and choose "Inspect" to view the source code.

![](https://i.imgur.com/ZhkOpgR.png)

If you have no idea what HTML is, I recommend watching this [video](https://www.youtube.com/watch?v=3JluqTojuME). Try to get familiar with using Developer Tools.

## Static Website Demo Walkthrough

In the spirit of Codeology, I will go over a basic web scraper for the following [website](https://weworkremotely.com/categories/remote-programming-jobs) and introduce the required modules: [Requests](http://docs.python-requests.org/en/master/) and [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

## Requests

Requests is a module that lets you download files and web pages from the Internet. We want to download the HTML code behind the webpage below:

![](https://i.imgur.com/pTPeSVO.png)

### Installation:

```
$ pip install requests
```

### Basic Overview:

**Importing**

```python
import requests
```

**Get**: takes the URL string of the chosen webpage as a parameter and returns a Response object.

```python
res = requests.get('https://weworkremotely.com/categories/remote-programming-jobs')
```

**Raise_for_status**: raises an exception if the server responded with an error status code (4xx or 5xx). If the download succeeded, it does nothing.

```python
try:
    res.raise_for_status()
except Exception as e:
    print(e)
```

### Tie it Together

```python
import requests

def get_webpage(url):
    """Download the page at `url` and return the Response object."""
    res = requests.get(url)
    try:
        # Raises requests.exceptions.HTTPError on a 4xx/5xx status code.
        res.raise_for_status()
    except Exception as e:
        print(e)
    return res
```

:::success
If all goes well, the contents of the response object should look like the HTML code below.
:::

**Code:**

```python
# res.content holds the raw bytes of the page; res.text gives the decoded string.
print(res.content)
```

**Returns:** Downloaded HTML code of the webpage

![](https://i.imgur.com/tQQ4wYI.png)

Feel free to refer to the [documentation](http://docs.python-requests.org/en/master/) for more information!

## BeautifulSoup4

The HTML above looks super complicated, so we use BeautifulSoup4 to parse the downloaded HTML code and select just the content we want.

### Installation

```
$ pip install beautifulsoup4
```

### Basic Overview

Import BeautifulSoup4.

```python
import bs4
```

:::warning
If you're not very familiar with HTML and CSS, please refer back to the HTML & CSS section before proceeding.
:::

Refer to the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for more methods!

## Recommended Websites

Easier:

* https://job-ca.com
* https://www.careerbuilder.com
* https://www.glassdoor.com

More Difficult:

* https://uncubed.com

## Resources

* [Automate the Boring Stuff: Web Scraping](https://automatetheboringstuff.com/chapter11/)
* [Inspect Element Tutorial](https://zapier.com/blog/inspect-element-tutorial/)
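
**Appendix: a BeautifulSoup4 parsing sketch.** To follow up on the BeautifulSoup4 overview above, here is a minimal sketch of how parsing and selecting content might look. The sample markup, the `li.job` and `span.title` selectors, and the `extract_jobs` helper are all assumptions made up for illustration; they are not the real structure of any job site, so inspect the actual page with Developer Tools before choosing selectors.

```python
import bs4

# Hypothetical sample markup standing in for a downloaded page (for example, res.text).
# The tag and class names are assumptions for illustration only.
SAMPLE_HTML = """
<ul>
  <li class="job"><span class="title">Backend Engineer</span><a href="/jobs/1">Apply</a></li>
  <li class="job"><span class="title">Data Analyst</span><a href="/jobs/2">Apply</a></li>
</ul>
"""


def extract_jobs(html):
    """Return a list of (title, link) tuples found in `html`."""
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    soup = bs4.BeautifulSoup(html, "html.parser")
    jobs = []
    # select() takes a CSS selector and returns every matching element.
    for item in soup.select("li.job"):
        title = item.select_one("span.title")
        link = item.select_one("a")
        if title is None or link is None:
            continue  # skip entries that are missing a title or a link
        jobs.append((title.get_text(strip=True), link.get("href")))
    return jobs


if __name__ == "__main__":
    for title, link in extract_jobs(SAMPLE_HTML):
        print(title, link)
```

To try this on a page downloaded with Requests, you could pass `res.text` to `extract_jobs` and swap in the selectors that match that site's HTML.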