---
tags: mdef
---
# Web scraping
[TOC]
Web scraping is a technique for collecting information from websites in an organised way. It is a powerful method to request information "in batch" for different purposes.
> Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.
_from [wikipedia](https://en.wikipedia.org/wiki/Web_scraping)_
:::info
Note that web scraping is not always permitted, and some websites would rather provide an [API](https://en.wikipedia.org/wiki/API) for you to interact with their content.
:::
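When a site does offer an API, use it: the data arrives already structured and there is no HTML to parse. Here is a minimal sketch, assuming a hypothetical JSON endpoint (the URL below is made up for illustration):
~~~python
import requests  # installed automatically as a dependency of requests-html

# Hypothetical endpoint, replace it with the one the site actually documents
resp = requests.get('https://api.example.com/v1/stops/406/arrivals')
data = resp.json()  # JSON is parsed straight into Python dicts and lists
print(data)
~~~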
## Python scripting
Web scraping can be done easily with the Python language. To write your own scripts you first need to install some software:
1. Install Python.
You can learn what Python is [here](https://www.python.org/doc/essays/blurb/) and then follow [these instructions](https://docs.python-guide.org/starting/installation/#installation-guides) to install it.
2. Install the [beautifulSoup](https://www.crummy.com/software/BeautifulSoup/), [requests-html](https://requests.readthedocs.io/projects/requests-html/en/latest/index.html) and [lxml](https://lxml.de/) packages (lxml is the HTML parser used in the examples below). You can do it with [pip](https://pypi.org/project/pip/) using the following commands:
~~~
pip install beautifulsoup4
pip install requests-html
pip install lxml
~~~
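A quick way to check that everything installed correctly is to run these imports in a Python shell; if they succeed silently, you are ready to go:
~~~python
# No output means the packages were found
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import lxml
~~~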
## A simple example
Imagine you want to know how long the bus will take to arrive at a specific stop and show that data on an LED screen. You know the web page where that information is shown, but you just need the number of minutes, not the rest of the page! And you don't want to open a browser and find your bookmark every time just to get the number.

Writing a script to get that data is simple, and you can update the values automatically so your screen always shows the real thing. Let's get into it.
First you need to find out **where the data lives in the source code of the web page**. To do this, **right-click on the number** you want to get and select **_inspect_**:

You can see that the number is inside a _div_ with the id **_line--bus__number__H4_2041_**, in an element with the _class_ **_next-buses__line-next_**. That's everything we need!
First we need to import the needed libraries to **get the data from the internet** and **process it**:
~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup
~~~
Then we request the page by its address and [_render_](https://requests.readthedocs.io/projects/requests-html/en/latest/index.html#javascript-support) it, which executes the page's JavaScript (the first call to `render()` downloads a headless Chromium browser, so it can take a while).
~~~python
session = HTMLSession()
response = session.get('https://www.tmb.cat/ca/barcelona/autobusos/-/lineabus/H4/parada/406')
response.html.render()
~~~
Next we parse the rendered HTML with BeautifulSoup and search for the container of the bus we want (_line--bus__number__H4_2041_):
~~~python
soup = BeautifulSoup(response.html.html, "lxml")
bus = soup.find(id='line--bus__number__H4_2041')
~~~
And finally the time for next bus inside the _next-buses__line-next_ tag:
~~~python
time = bus.find(class_='next-buses__line-next').string
~~~
At this point the _time_ variable holds the text that indicates how long until the next bus! So we can just print it:
~~~python
print(time)
~~~
### Full code
~~~python=
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get('https://www.tmb.cat/ca/barcelona/autobusos/-/lineabus/H4/parada/406')
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
bus = soup.find(id='line--bus__number__H4_2041')
time = bus.find(class_='next-buses__line-next').string
print(time)
~~~
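Ids and class names like these are generated by the site and can change at any time, so it is worth guarding against `find` returning `None`. Here is a small defensive variant of the lookup, reusing the `soup` from the full code above:
~~~python
bus = soup.find(id='line--bus__number__H4_2041')
if bus is None:
    print('Bus container not found, the page structure may have changed')
else:
    next_el = bus.find(class_='next-buses__line-next')
    print(next_el.string if next_el else 'Next-bus element not found')
~~~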
## Get the news
Now let's try to get the news headlines from the bbc.com website. The CSS selector `h3[class*=heading]` in the code matches any `<h3>` whose class attribute contains the text "heading":
~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get('https://www.bbc.com/news')
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
for heading in soup.select('h3[class*=heading]'):
    print(heading.string)
~~~
and the output:
~~~
Two Americans dead after kidnapping in Mexico
Why a million Americans a year risk Mexico medical tourism
Ukraine denies reported link to Nord Stream attack
Thousands protest at Georgian 'foreign agent' bill
Ukraine investigates killing of unarmed soldier
EU agency unhappy at Amsterdam 'erotic centre' plan
Six Palestinians killed in Israeli West Bank raid
~~~
## Batch example
This example produces a list of foods from the [Open Food Facts](https://world.openfoodfacts.org/) page.

### Product names
~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get('https://world.openfoodfacts.org/1')
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')
for name in soup.find_all(class_='list_product_name v-space-tiny'):
    print(name.string)
~~~
This is a small piece of the output:
~~~
Eaux de sources - Cristaline - 1,5 L
Nutella - Ferrero - 400 g
Prince - Lu - 300 g
Coca-cola - 330 mL
Nutella - Ferrero - 1 kg
Coca Cola Zero - 330 ml
Sésame - Gerblé - 230g
nutella biscuits - 304g
Cruesli Mélange De Noix - Quaker - 450 g
Céréales chocapic - Nestlé - 430 g
~~~
### Names and links
You can create a list of comma-separated values, for example _product name, link_, just by adding a couple of lines:
~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get('https://world.openfoodfacts.org/2')
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')
for product in soup.find_all(class_='list_product_a'):
    print(product.find(class_='list_product_name v-space-tiny').string, ', ', product['href'])
~~~
Its output (trimmed):
~~~
Harrys brioche tranchee -30% de sucre sans additif 485g - 485 g , https://world.openfoodfacts.org/product/3228857001378/harrys-brioche-tranchee-30-de-sucre-sans-additif-485g
Coca Cola gout original - 1,25 L , https://world.openfoodfacts.org/product/5449000267412/coca-cola-gout-original
Biscoff à tartiner - Lotus - 400g , https://world.openfoodfacts.org/product/5410126006957/biscoff-a-tartiner-lotus
Alpro Mango (meer fruit) - 400 g , https://world.openfoodfacts.org/product/5411188125808/alpro-mango-meer-fruit
Maple joe - 250g , https://world.openfoodfacts.org/product/3088542500285/maple-joe
Céréales Extra Pépites Chocolat Noisettes - Kellogg's - 500 g , https://world.openfoodfacts.org/product/3159470005071/cereales-extra-pepites-chocolat-noisettes-kellogg-s
~~~
After this example you can imagine how to get more data for each product; you could even open the product links one by one and get, for example, the ingredients.
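As a sketch of that idea, the loop below opens the first three product links and looks for an ingredients element on each product page. Note that the `id='ingredients_list'` selector is an assumption for illustration only; right-click and _inspect_ a real product page to find the actual element:
~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://world.openfoodfacts.org/2')
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')

# Follow the first three product links and scrape each product page in turn
for product in soup.find_all(class_='list_product_a')[:3]:
    page = session.get(product['href'])
    page.html.render()
    product_soup = BeautifulSoup(page.html.html, 'lxml')
    # Hypothetical id, check the real page with inspect first
    ingredients = product_soup.find(id='ingredients_list')
    print(product['href'])
    print(ingredients.get_text(strip=True) if ingredients else 'ingredients element not found')
~~~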
## Saving it to a CSV file
You can save the scraped data directly to a file instead of printing it to the console:
~~~python
myFile = open('test.csv', 'w')
for product in soup.find_all(class_='list_product_a'):
    myFile.write(product.find(class_='list_product_name v-space-tiny').string)
    myFile.write(',')
    myFile.write(product['href'])
    myFile.write('\n')
myFile.close()
~~~
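Note that product names can themselves contain commas (for example "1,5 L" in the output above), which would break a naively joined CSV. Python's built-in [csv](https://docs.python.org/3/library/csv.html) module quotes such fields automatically; here is the same loop rewritten with it, reusing the `soup` from above:
~~~python
import csv

with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'link'])  # header row
    for product in soup.find_all(class_='list_product_a'):
        name = product.find(class_='list_product_name v-space-tiny').string
        writer.writerow([name, product['href']])  # fields are quoted as needed
~~~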