Web scraping

Web scraping is a technique used to collect information from websites in an organised way. It is a powerful method for requesting information "in batch" for many different purposes.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.

— from Wikipedia

Note that web scraping is not always permitted, and some websites would rather provide an API for you to interact with their content.
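Before scraping a site, it is also good practice to check its robots.txt file, which states which paths automated clients may visit. A minimal sketch using only Python's standard library (example.com is a placeholder, replace it with your target site):

from urllib.robotparser import RobotFileParser

# Ask the site's robots.txt whether a given page may be fetched
# (example.com is a placeholder, not a real target)
robots = RobotFileParser()
robots.set_url('https://www.example.com/robots.txt')
robots.read()
print(robots.can_fetch('*', 'https://www.example.com/some/page'))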

Python scripting

Web scraping can be done easily with Python. To be able to write your own scripts you need to install some software:

  1. Install Python.
    You can learn what Python is here and then follow these instructions to install it.
  2. Install the beautifulsoup4 and requests-html packages. You can do it with pip using the following commands:
pip install beautifulsoup4
pip install requests-html

A simple example

Imagine you want to know how long the bus will take to arrive at a specific stop and show that data on an LED screen. You know the web page where that information is shown, but you just need the number of minutes, not the rest of the page! And you don't want to open a browser and find your bookmark every time to get the number.

Writing a script to get that data is simple, and you can update the values automatically so your screen always shows the real thing. Let's get into it.

First you need to find out where the data is stored in the source code of the web page. To do this, right click on the number you want to get and select Inspect:

You can see that the number is inside a div with the id line--bus__number__H4_2041 and in the class next-buses__line-next. That's everything we need!

First we need to import the libraries needed to get the data from the internet and process it:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

Then we request the page by its address and render it:

session = HTMLSession()
response = session.get('https://www.tmb.cat/ca/barcelona/autobusos/-/lineabus/H4/parada/406')
response.html.render()
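Note that render() runs the page's JavaScript in a headless browser (requests-html downloads Chromium automatically the first time it is called). If you want to guard against network problems before rendering, you can check the status code first; a small optional sketch:

# Optional guard: stop early if the request did not succeed
if response.status_code != 200:
    raise RuntimeError('Request failed with status ' + str(response.status_code))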

Then we parse the rendered response with BeautifulSoup and search for the container of the bus we want (line--bus__number__H4_2041):

soup = BeautifulSoup(response.html.html, "lxml")
bus = soup.find(id='line--bus__number__H4_2041')

And finally we get the time for the next bus, inside the next-buses__line-next tag:

time = bus.find(class_='next-buses__line-next').string

At this point the time variable holds the text that indicates how long until the next bus, so we can just print it:

print(time)

Full code

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://www.tmb.cat/ca/barcelona/autobusos/-/lineabus/H4/parada/406')
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
bus = soup.find(id='line--bus__number__H4_2041')
time = bus.find(class_='next-buses__line-next').string
print(time)
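Since the point was to keep an LED screen up to date, you can wrap the same code in a loop that refreshes periodically. A minimal sketch, assuming a one-minute refresh is acceptable for the site (replace print with whatever drives your screen):

from time import sleep
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
while True:
    resp = session.get('https://www.tmb.cat/ca/barcelona/autobusos/-/lineabus/H4/parada/406')
    resp.html.render()
    soup = BeautifulSoup(resp.html.html, "lxml")
    bus = soup.find(id='line--bus__number__H4_2041')
    print(bus.find(class_='next-buses__line-next').string)
    sleep(60)  # wait a minute before asking again, to be polite to the server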

Get the news

Now let's try to get the news headlines from the bbc.com website:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://www.bbc.com/news')
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")

for heading in soup.select('h3[class*=heading]'):
    print(heading.string)

and the output:

Two Americans dead after kidnapping in Mexico
Why a million Americans a year risk Mexico medical tourism
Ukraine denies reported link to Nord Stream attack
Thousands protest at Georgian 'foreign agent' bill
Ukraine investigates killing of unarmed soldier
EU agency unhappy at Amsterdam 'erotic centre' plan
Six Palestinians killed in Israeli West Bank raid
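The selector h3[class*=heading] matches any h3 tag whose class attribute contains the substring "heading". If you prefer BeautifulSoup's find_all interface over CSS selectors, a regular expression does essentially the same filtering; a small equivalent sketch, reusing the soup object from above:

import re

# Same idea as the CSS selector: h3 tags with a class containing 'heading'
for heading in soup.find_all('h3', class_=re.compile('heading')):
    print(heading.string)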

Batch example

This example produces a list of foods from the Open Food Facts page.

Product names

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://world.openfoodfacts.org/1')
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')

for name in soup.find_all(class_='list_product_name v-space-tiny'):
    print(name.string)

This is a small piece of the output:

Eaux de sources - Cristaline - 1,5 L
Nutella - Ferrero - 400 g
Prince - Lu - 300 g
Coca-cola - 330 mL
Nutella - Ferrero - 1 kg
Coca Cola Zero - 330 ml
Sésame - Gerblé - 230g
nutella biscuits - 304g
Cruesli Mélange De Noix - Quaker - 450 g
Céréales chocapic - Nestlé - 430 g

You can create a list of comma-separated values (for example, product name and link) just by adding a couple of lines:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://world.openfoodfacts.org/2')
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')

for product in soup.find_all(class_='list_product_a'):
    print(product.find(class_='list_product_name v-space-tiny').string, ', ', product['href'])

Its output (trimmed):

Harrys brioche tranchee -30% de sucre sans additif 485g - 485 g ,  https://world.openfoodfacts.org/product/3228857001378/harrys-brioche-tranchee-30-de-sucre-sans-additif-485g
Coca Cola gout original - 1,25 L ,  https://world.openfoodfacts.org/product/5449000267412/coca-cola-gout-original
Biscoff à tartiner - Lotus - 400g ,  https://world.openfoodfacts.org/product/5410126006957/biscoff-a-tartiner-lotus
Alpro Mango (meer fruit) - 400 g ,  https://world.openfoodfacts.org/product/5411188125808/alpro-mango-meer-fruit
Maple joe - 250g ,  https://world.openfoodfacts.org/product/3088542500285/maple-joe
Céréales Extra Pépites Chocolat Noisettes - Kellogg's - 500 g ,  https://world.openfoodfacts.org/product/3159470005071/cereales-extra-pepites-chocolat-noisettes-kellogg-s

After this example you can imagine how to get more data for each product; you could even open the product links one by one and get, for example, the ingredients, as sketched below.
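As a sketch of that idea, the loop below visits the first three product pages and prints the text of their ingredients section. The id panel_ingredients_content is only a guess at how the product pages are structured; inspect a real product page to confirm the right selector before relying on it:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://world.openfoodfacts.org/2')
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')

for product in soup.find_all(class_='list_product_a')[:3]:  # first three products only
    page = session.get(product['href'])
    page.html.render()
    detail = BeautifulSoup(page.html.html, 'lxml')
    # 'panel_ingredients_content' is a hypothetical id; check with your browser's inspector
    ingredients = detail.find(id='panel_ingredients_content')
    if ingredients is not None:
        print(product['href'])
        print(ingredients.get_text(strip=True))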

Saving it to a CSV file

You can save the scraped data directly to a file instead of printing it to the console:

myFile = open('test.csv', 'w')

for product in soup.find_all(class_='list_product_a'):
    myFile.write(product.find(class_='list_product_name v-space-tiny').string)
    myFile.write(',')
    myFile.write(product['href'])
    myFile.write('\n')

myFile.close()
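Be aware that several product names above contain commas themselves (for example "Coca Cola gout original - 1,25 L"), which would corrupt a hand-written CSV like this one. Python's built-in csv module quotes such fields automatically; a safer variant of the same loop:

import csv

with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for product in soup.find_all(class_='list_product_a'):
        name = product.find(class_='list_product_name v-space-tiny').string
        # writerow quotes fields that contain commas, so names stay intact
        writer.writerow([name, product['href']])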