---
tags: mdef
---

# Web scraping

[TOC]

Web scraping is a technique used to collect information from websites in an organised way. It is a very powerful method to request information "in batch" for different purposes.

> Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.

_from [wikipedia](https://en.wikipedia.org/wiki/Web_scraping)_

:::info
Note that web scraping is not always permitted, and some websites would rather provide an [API](https://en.wikipedia.org/wiki/API) for you to interact with their content.
:::

## Python scripting

Web scraping can be done easily with the Python language. To write your own scripts you need to install some software:

1. Install Python. You can learn what Python is [here](https://www.python.org/doc/essays/blurb/) and then follow [these instructions](https://docs.python-guide.org/starting/installation/#installation-guides) to install it.
2. Install the [beautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and [requests-html](https://requests.readthedocs.io/projects/requests-html/en/latest/index.html) packages. You can do it with [pip](https://pypi.org/project/pip/) using the following commands:

~~~
pip install beautifulsoup4
pip install requests-html
~~~

## A simple example

Imagine you want to know how long the bus will take to arrive at a specific stop and show that data on a LED screen. You know the web page where that information is shown, but you just need the number of minutes, not the rest of the page! And you don't want to open a browser and find your bookmark every time to get the number.
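As the note above says, scraping is not always permitted. Many sites publish their crawling rules in a `robots.txt` file, and Python's standard library can check it for you before you scrape. A minimal sketch (the rules below are made up and parsed inline so the snippet works offline; for a real site you would call `set_url()` and `read()` instead):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Made-up rules parsed inline; for a real site use:
#   rp.set_url('https://example.com/robots.txt'); rp.read()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('*', 'https://example.com/private/page'))  # False
print(rp.can_fetch('*', 'https://example.com/docs/page'))     # True
```

If `can_fetch()` returns `False` for a page, the site is asking bots not to request it.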
![](https://i.imgur.com/96ZfgpW.png)

Writing a script to get that data is simple, and you can update the values automatically so your screen shows the real thing. Let's get into it.

First you need to find out **where the data is stored in the source code of the web page**. To do this, **right click on the number** you want to get and select **_inspect_**:

![](https://i.imgur.com/z7AaMLs.png)

You can see that the number is inside a _div_ with the id **_line--bus__number__H4_2041_** and in the _class_ **_next-buses__line-next_**. That's everything we need!

We start by importing the libraries needed to **get the data from the internet** and **process it**:

~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup
~~~

Then we get the data with the address of the page, and [_render_](https://requests.readthedocs.io/projects/requests-html/en/latest/index.html#javascript-support) it:

~~~python
session = HTMLSession()
response = session.get('https://www.tmb.cat/ca/barcelona/autobusos/-/lineabus/H4/parada/406')
response.html.render()
~~~

Next we parse the rendered response with BeautifulSoup and search for the container of the bus we want (_line--bus__number__H4_2041_):

~~~python
soup = BeautifulSoup(response.html.html, 'lxml')
bus = soup.find(id='line--bus__number__H4_2041')
~~~

And finally the time for the next bus, inside the _next-buses__line-next_ tag:

~~~python
time = bus.find(class_='next-buses__line-next').string
~~~

At this point the _time_ variable holds the text that indicates how long until the next bus!
So we can just print it:

~~~python
print(time)
~~~

### Full code

~~~python=
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://www.tmb.cat/ca/barcelona/autobusos/-/lineabus/H4/parada/406')
resp.html.render()

soup = BeautifulSoup(resp.html.html, "lxml")
bus = soup.find(id='line--bus__number__H4_2041')
time = bus.find(class_='next-buses__line-next').string

print(time)
~~~

## Get the news

Now let's try to get the news headlines from the bbc.com website:

~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://www.bbc.com/news')
resp.html.render()

soup = BeautifulSoup(resp.html.html, "lxml")
for heading in soup.select('h3[class*=heading]'):
    print(heading.string)
~~~

And the output:

~~~
Two Americans dead after kidnapping in Mexico
Why a million Americans a year risk Mexico medical tourism
Ukraine denies reported link to Nord Stream attack
Thousands protest at Georgian 'foreign agent' bill
Ukraine investigates killing of unarmed soldier
EU agency unhappy at Amsterdam 'erotic centre' plan
Six Palestinians killed in Israeli West Bank raid
~~~

## Batch example

This example produces a list of foods from the [Open food facts](https://world.openfoodfacts.org/) page.
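The batch examples below rely on `findAll` (spelled `find_all` in current Beautiful Soup versions), which returns every matching element rather than just the first one. A self-contained sketch with made-up HTML, so it runs without any network access:

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking a product list
html = """
<ul>
  <li class="product">Nutella - Ferrero - 400 g</li>
  <li class="product">Coca-cola - 330 mL</li>
  <li class="ad">Sponsored</li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
# find_all returns a list of every element with the given class
names = [li.string for li in soup.find_all(class_='product')]
print(names)  # ['Nutella - Ferrero - 400 g', 'Coca-cola - 330 mL']
```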
![](https://i.imgur.com/P6G8mPm.png)

### Product names

~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://world.openfoodfacts.org/1')
resp.html.render()

soup = BeautifulSoup(resp.html.html, 'lxml')
for name in soup.findAll(class_='list_product_name v-space-tiny'):
    print(name.string)
~~~

This is a small piece of the output:

~~~
Eaux de sources - Cristaline - 1,5 L
Nutella - Ferrero - 400 g
Prince - Lu - 300 g
Coca-cola - 330 mL
Nutella - Ferrero - 1 kg
Coca Cola Zero - 330 ml
Sésame - Gerblé - 230g
nutella biscuits - 304g
Cruesli Mélange De Noix - Quaker - 450 g
Céréales chocapic - Nestlé - 430 g
~~~

### Names and links

You can create a list of comma-separated values, for example _product name, link_, just by adding a couple of lines:

~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://world.openfoodfacts.org/2')
resp.html.render()

soup = BeautifulSoup(resp.html.html, 'lxml')
for product in soup.findAll(class_='list_product_a'):
    print(product.find(class_='list_product_name v-space-tiny').string, ', ', product['href'])
~~~

Its output (trimmed):

~~~
Harrys brioche tranchee -30% de sucre sans additif 485g - 485 g , https://world.openfoodfacts.org/product/3228857001378/harrys-brioche-tranchee-30-de-sucre-sans-additif-485g
Coca Cola gout original - 1,25 L , https://world.openfoodfacts.org/product/5449000267412/coca-cola-gout-original
Biscoff à tartiner - Lotus - 400g , https://world.openfoodfacts.org/product/5410126006957/biscoff-a-tartiner-lotus
Alpro Mango (meer fruit) - 400 g , https://world.openfoodfacts.org/product/5411188125808/alpro-mango-meer-fruit
Maple joe - 250g , https://world.openfoodfacts.org/product/3088542500285/maple-joe
Céréales Extra Pépites Chocolat Noisettes - Kellogg's - 500 g , https://world.openfoodfacts.org/product/3159470005071/cereales-extra-pepites-chocolat-noisettes-kellogg-s
~~~

After this
example you can imagine how to get more data for each product; you could even open the product links one by one and get, for example, the ingredients.

## Saving it to a CSV file

You can save the scraped data directly to a file instead of printing it to the console. Here is the full script:

~~~python
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get('https://world.openfoodfacts.org/2')
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')

myFile = open('test.csv', 'w')

for product in soup.findAll(class_='list_product_a'):
    myFile.write(product.find(class_='list_product_name v-space-tiny').string)
    myFile.write(',')
    myFile.write(product['href'])
    myFile.write('\n')

myFile.close()
~~~
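Writing the commas and newlines by hand works until a product name itself contains a comma (many do, e.g. "1,5 L"), which would break the columns when the file is read back. Python's standard `csv` module handles the quoting for you. A minimal sketch using made-up rows in place of the scraped _(name, link)_ pairs (the links here are placeholders):

```python
import csv

# Made-up rows standing in for the scraped (name, link) pairs
products = [
    ('Eaux de sources - Cristaline - 1,5 L', 'https://example.org/product/1'),
    ('Nutella - Ferrero - 400 g', 'https://example.org/product/2'),
]

with open('test.csv', 'w', newline='') as myFile:
    writer = csv.writer(myFile)
    writer.writerow(['name', 'link'])  # header row
    writer.writerows(products)         # fields containing commas are quoted automatically
```

The `with` block also closes the file for you, even if an error occurs halfway through the loop.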