# Hands-on data scraping in Python

## Example 1: current time and date in Paris

Let's build a simple scraper that gives us the current time and date in Paris. For this, we will rely on https://www.timeanddate.com/worldclock/france/paris-department

```Python
import requests

# Fetch the raw HTML of the page
time_html = requests.get("https://www.timeanddate.com/worldclock/france/paris-department")
print(time_html.text)
```

You can then inspect the returned HTML in a code editor to better understand it and locate the data you are interested in. You can also do that using the Developer Console in your browser:

![](https://i.imgur.com/UyxjNe4.jpg)

As you can see, the time and date are contained in two `span` HTML tags with the IDs `ct` and `ctdat`.

Next, we will use BeautifulSoup to select these `span` tags by their IDs and get their text contents:

```Python
import requests
from bs4 import BeautifulSoup

time_html = requests.get("https://www.timeanddate.com/worldclock/france/paris-department")
soup = BeautifulSoup(time_html.text, "html.parser")

# "#ct" and "#ctdat" are CSS selectors matching the tags by their IDs
local_time_paris = soup.select_one("#ct").text
local_date_paris = soup.select_one("#ctdat").text

print(f"Paris local time: {local_time_paris}")
print(f"Paris local date: {local_date_paris}")
```

Voilà! You now have a neat script that tells you the time and date in Paris whenever you need it.

## Example 2: list Pokémon of Generation I

So your Data Science Engineer's research has led them to the list of first-generation Pokémon as the new data set that will take your whole business to the next level? No problem. We'll take care of it!

We first have to identify a reputable website that lists them all: [Wikipedia List of generation I Pokémon](https://en.wikipedia.org/wiki/List_of_generation_I_Pok%C3%A9mon)

Inspecting the page in the Developer Console will lead you to a `table` HTML tag that contains the following about all 151 Pokémon of generation I:

- English name
- Japanese name in both formats
- Pokémon type(s)
- Evolves from/into
- Notes

![Pokémon table](https://i.imgur.com/lXRT4bg.png)

So as usual, let's make a `GET` request to fetch the HTML of the page, load it into BeautifulSoup and select our table:

```Python
import requests
from bs4 import BeautifulSoup

pokemon_wiki = requests.get("https://en.wikipedia.org/wiki/List_of_generation_I_Pokémon")
soup = BeautifulSoup(pokemon_wiki.text, "html.parser")

pokemon_table = soup.select_one(".wikitable.sortable.plainrowheaders.jquery-tablesorter")
print(pokemon_table)
```

The code above looks for an HTML tag that has all 4 CSS classes attached (`wikitable`, `sortable`, `plainrowheaders` and `jquery-tablesorter`), which matches what we see in the screenshot of the browser Developer Console:

```HTML
<table class="wikitable sortable plainrowheaders jquery-tablesorter">
  <caption>...</caption>
  <thead>...</thead>
  <tbody>
    ...
  </tbody>
</table>
```

However, if you execute the script, you will be surprised to notice that `pokemon_table` is `None`!

![](https://i.imgur.com/xvlPu1Q.gif)

This is where debugging the returned HTML will reveal a difference between what your code gets and what your browser shows you.

> You can do it quickly in your terminal using the command `curl https://en.wikipedia.org/wiki/List_of_generation_I_Pokémon`.

Here is how our table shows up in the HTML when we do so:

```html
<table class="wikitable sortable plainrowheaders">
  <caption>...</caption>
  <tbody>
    <tr>...</tr>
    <tr>...</tr>
    <tr id="Bulbasaur">...</tr>
    <tr id="Ivysaur">...</tr>
    ...
  </tbody>
</table>
```

We notice that one CSS class is missing and, even more surprising, that the table has no `thead` HTML tag inside of it. This is a good example of the roadblocks you might face while building your scraper. The difference comes from the browser executing JavaScript that modifies the HTML tree. In this case, jQuery adds the missing class, creates a `thead` HTML tag, and makes the table sortable.

**TIP**:
> Always get the HTML of the page using `curl` and save it in a file on your disk. This will give you a better idea of what your script will get as a response than inspecting the source code in the browser.
> It will also help you detect early whether the website relies heavily on JavaScript, requires specific HTTP headers, or uses anti-bot measures.

So, relying on the HTML snippet, we will need to:

1. Adjust our CSS selector
2. Select the `tr` tags inside of it
3. Remove the first 2 `tr` tags since they only include header information
4. Inside each `tr` tag, select the first and last `th` tags, which include the English name of the Pokémon and the notes about it.

```Python
import requests
from bs4 import BeautifulSoup

pokemon_wiki = requests.get("https://en.wikipedia.org/wiki/List_of_generation_I_Pokémon")
soup = BeautifulSoup(pokemon_wiki.text, "html.parser")

# The jquery-tablesorter class is added by JavaScript, so it is absent from the raw HTML
pokemon_table = soup.select_one(".wikitable.sortable.plainrowheaders")

# Skip the first 2 rows, which only include header information
pokemon_rows = pokemon_table.select("tr")[2:]

data = []  # var to store the list of Pokemon
for row in pokemon_rows:
    columns = row.select("th")
    name = columns[0].text.strip()
    data.append({
        "name": name,
        "notes": columns[-1].text.strip(),
    })
    print(f"Scraped data about Pokemon named {name}")

print(data)
```

### Going further

- Scrape all Pokémon attributes in the table: this will present a challenge since some Pokémon do not have a secondary type, and this needs to be taken into account.
- Save URLs to each Pokémon Wiki entry when possible
- Save the data into a JSON file
- Transform the Evolves Into/From attributes to foreign keys, when possible
- Compute descriptive statistics about the Pokémon: distribution by types, evolution rank
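
As a starting point for two of the ideas above (saving into a JSON file and the distribution by types), here is a minimal sketch that uses only the standard library. It does not scrape anything: `sample_data` is a small hand-written stand-in for what an extended scraper might produce, and the field names `primary_type` and `secondary_type`, as well as the helpers `save_as_json` and `count_types`, are assumptions made for illustration. A missing secondary type is represented as `None`.

```python
import json
from collections import Counter
from pathlib import Path

# Hand-written sample standing in for scraped rows (NOT real scraper output).
# The field names are hypothetical; "secondary_type" is None when absent.
sample_data = [
    {"name": "Bulbasaur", "primary_type": "Grass", "secondary_type": "Poison"},
    {"name": "Charmander", "primary_type": "Fire", "secondary_type": None},
    {"name": "Squirtle", "primary_type": "Water", "secondary_type": None},
]


def save_as_json(records, path):
    """Write the list of records to `path` as UTF-8 encoded JSON."""
    # ensure_ascii=False keeps characters such as "é" readable in the file.
    text = json.dumps(records, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


def count_types(records):
    """Count how often each type appears, skipping missing secondary types."""
    counts = Counter()
    for record in records:
        counts[record["primary_type"]] += 1
        if record["secondary_type"] is not None:
            counts[record["secondary_type"]] += 1
    return counts
```

Calling `save_as_json(sample_data, "pokemon.json")` writes the records to a file in the current directory, and `count_types(sample_data)` returns a `Counter` mapping each type to the number of Pokémon that have it. Storing `None` rather than an empty string for a missing secondary type keeps the "no value" case explicit, and `json` serializes it as `null`.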