# Web Scraping

---
title: Web Scraping
card_type: cue_card
---

### Content

- What is Web Scraping?
- Understanding a website's structure (HTML/CSS)
- Using the `requests` module
- Using `Beautiful Soup` for scraping data
- Fixing the data format

---
title: Introduction to Web Scraping
description:
duration: 900
card_type: cue_card
---

## Introduction to Web Scraping

#### <u>What is Web Scraping?</u>

Web scraping is a technique used to extract large amounts of data from websites. It involves programmatically visiting web pages and extracting information from them. This process can range from simple data collection tasks, like scraping prices from an e-commerce site, to more complex activities, such as harvesting contact information for research purposes.

#### <u>Is web scraping legal?</u>

The legality of web scraping depends on several factors and can vary by country. Generally, scraping publicly accessible data for personal, non-commercial use is considered legal. However, issues arise when scraping:

- Data behind a login or other access controls.
- Copyrighted material or proprietary information.
- Websites with explicit terms of service that prohibit scraping.

It's crucial to respect privacy laws (like GDPR in Europe), adhere to the website's terms of service, and avoid overloading servers. When in doubt, seeking legal advice or obtaining explicit permission from the website owner is advisable.

#### <u>How to make sure that whatever we are scraping is appropriate?</u>

Websites often use a `robots.txt` file to communicate with web crawlers. Access this file by appending `/robots.txt` to the website's base URL (e.g., https://en.wikipedia.org/robots.txt). This file provides guidelines on which parts of the site can or can't be crawled by automated agents. However, note that `robots.txt` is more of a guideline for search engines and does not have legal standing.

Most websites have a Terms of Service (ToS) or Terms of Use agreement, often found in the footer of the homepage. Review this document to see if it explicitly prohibits scraping. We can also check `robots.txt` programmatically, as shown below.
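Python's standard library ships a small `robots.txt` parser, so we can check a URL before scraping it. A minimal sketch (the Wikipedia URL here is just an example):

Code
```python=
from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# can a generic crawler ("*") fetch this particular page?
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Main_Page"))
```

Keep in mind this is only an etiquette check, not a legal one; the ToS review described above still applies.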
---
title: Quiz-1
description: Quiz-1
duration: 60
card_type: quiz_card
---

# Question
Which of the following statements regarding web scraping is correct?

# Choices
- [ ] Web scraping is illegal under all circumstances and in all countries.
- [ ] Web scraping cannot be detected by website administrators.
- [x] Web scraping involves extracting data from websites, but it must comply with legal and ethical standards.
- [ ] All of the above

---
title: Understanding HTML
description:
duration: 900
card_type: cue_card
---

### Understanding how web pages are structured and how we can extract data from them.

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/179/original/img1.png?1706078289">

Web pages are developed using `HTML (Hyper Text Markup Language)`, which provides pre-defined `tags` used to develop/design such pages. A tag is written as an opening `<tag>` and a matching closing `</tag>`. For example, the `<h1>` tag is used for writing headings and the `<p>` tag is used for writing paragraphs. Knowing these tags is useful as it helps us select which data to scrape.

Each tag has some special attributes, like:

- `class`
    - Used to add special properties to the corresponding tag, like styling, etc.
    - Multiple tags can have the same `class` attribute.
- `id`
    - Also used to add extra properties to tags, but an `id` must be unique to a single tag.

We use such attributes to target specific tags while extracting data, e.g., targeting all tags that have a certain `class` attribute.

One more important thing to notice here is the hierarchy of tags: as in the above image, some tags (children) are nested within other tags (parents). This hierarchy is also useful for targeting certain tags.

---
title: Quiz-2
description: Quiz-2
duration: 60
card_type: quiz_card
---

# Question
What is the primary difference between the 'class' and 'id' attributes in HTML?

# Choices
- [ ] The 'class' attribute is used for JavaScript, while the 'id' attribute is used for CSS.
- [x] The 'class' attribute can be used multiple times on different tags, whereas the 'id' attribute must be unique.
- [ ] The 'class' attribute is mandatory for all HTML tags, while the 'id' attribute is optional.
- [ ] The 'id' attribute refers to the content of a tag, whereas the 'class' attribute describes its style.

---
title: Problem Statement
description:
duration: 300
card_type: cue_card
---

### Problem Statement

* We'll be scraping the website http://books.toscrape.com/index.html to collect book data such as the name of the book, genre, price, reviews, etc.
* Before scraping, it is recommended to explore the website and understand its structure and what data it holds.
* As per the current task, there are around 1000 books distributed over multiple pages. We will scrape the data of all of these books.

---
title: Extracting Web Content
description:
duration: 1200
card_type: cue_card
---

### Getting web content

We will be using the `requests` module here to fetch the content of a webpage. Whenever we load a URL, the browser makes a `GET` request to that particular endpoint, and the respective server responds with the HTML code for the corresponding page. We can perform a similar action with the `requests.get()` function, as it pings the URL and returns the corresponding response.

Code
```python=
import requests

base_url = "http://books.toscrape.com/index.html"
home_page = requests.get(base_url)

# checking whether the request made was successful or not
if home_page.status_code == 200:
    print("SUCCESS")
else:
    print(f"FAILED, status code: {home_page.status_code}")
```

> Output

SUCCESS

#### Information about some common status codes:

1. **200 OK**: The request has been successfully processed, and the server returns the requested content.
2. **404 Not Found**: The requested resource or page could not be found on the server.
3. **403 Forbidden**: Access to the requested resource is forbidden or not allowed for the client.
4. **500 Internal Server Error**: The server encountered an unexpected error while processing the request.
5. **302 Found (or 301 Moved Permanently)**: The requested resource has been temporarily (or permanently) moved to a different URL, and the client should follow the redirection.

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/288/original/img1.png?1706109577">
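In practice it is common to guard every request with a timeout and an explicit error check, rather than comparing status codes by hand. A small sketch of a more defensive version of the request above (the timeout value is just an example):

Code
```python=
import requests

base_url = "http://books.toscrape.com/index.html"

# a timeout avoids hanging forever on a slow or unresponsive server
home_page = requests.get(base_url, timeout=10)

# raises requests.exceptions.HTTPError for any 4xx/5xx status code
home_page.raise_for_status()

print(home_page.status_code, home_page.headers.get("Content-Type"))
```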
---
title: Quiz-3
description: Quiz-3
duration: 60
card_type: quiz_card
---

# Question
When using the `requests` module in Python to make a GET request, what status code is typically returned if the URL is valid but the requested resource does not exist on the server?

# Choices
- [x] 404 Not Found
- [ ] 500 Internal Server Error
- [ ] 400 Bad Request
- [ ] 200 OK

---
title: Extracting all books from a page
description:
duration: 1200
card_type: cue_card
---

### Getting all books from a page

Now that we have the web content as HTML code, we need to parse it so that we can extract the relevant information. One popular Python module for this task is `Beautiful Soup`. It parses the web content and lets the user extract data from specific tags.

Code
```python=
from bs4 import BeautifulSoup

soup = BeautifulSoup(markup=home_page.content, features="html.parser")
```

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/289/original/img2.png?1706109628">

Now we will use the HTML tag information to extract data. Here we are extracting the section responsible for rendering book data on the main page. As we can see in the above image, the highlighted `<li>` tag is our target, and since there are 20 books per page, we can see multiple such `<li>` tags one after another.

We will use the `soup.find_all` method to target a specific tag by its name, the `class` attribute it holds, or even the `id` attribute if required. It returns a list of all the matching `<li>` tags.

Code
```python=
books = soup.find_all(name="li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")
len(books)
```

> Output

20

Beautiful Soup (bs4) is a Python library designed for web scraping purposes, used to pull data out of HTML and XML files. It creates a parse tree that makes it easy to extract data. Here are some of the commonly used methods in Beautiful Soup (a CSS-selector sketch follows this list):

1. **find()**: This method is used to find the first tag that matches the given criteria. For example, `soup.find('div')` would find the first `div` tag in the HTML document. You can also pass attributes to refine the search, like `soup.find('div', class_='example')`.
2. **find_all()**: Unlike `find()`, `find_all()` retrieves all tags that match the criteria. It's useful when you want to extract information from multiple tags of the same type. For example, `soup.find_all('a')` would return a list of all anchor tags in the document.
3. **select()**: This method allows you to use CSS selectors to find elements in the document. It's particularly handy when dealing with classes or IDs. For instance, `soup.select('.someclass')` would find all elements with the class `someclass`.
4. **select_one()**: Similar to `select()`, but instead of returning all matches, it only returns the first match. For example, `soup.select_one('#uniqueId')` would find the first element with the ID `uniqueId`.

These methods are integral to navigating and parsing HTML/XML documents with Beautiful Soup, making it easier to scrape data from websites.
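For comparison, here is what the same extraction could look like using CSS selectors. The `li.col-xs-6...` selector mirrors the `find_all()` call above; the `h3 > a` selector assumes each book title sits in an `<a>` tag (carrying a `title` attribute) inside an `<h3>`, as visible in the screenshots:

Code
```python=
# CSS-selector equivalent of the find_all() call above
books_css = soup.select("li.col-xs-6.col-sm-4.col-md-3.col-lg-3")
print(len(books_css))  # should match find_all(), i.e. 20

# select_one() returns only the first match
first_link = soup.select_one("h3 > a")
print(first_link.get("title"))
```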
---
title: Break & Doubt Resolution
description:
duration: 600
card_type: cue_card
---

### Break & Doubt Resolution

`Instructor Note:`

* Take this time (up to 5-10 mins) to give a short break to the learners.
* Meanwhile, you can ask them to share their doubts (if any) regarding the topics covered so far.

---
title: Extracting data of a single book
description:
duration: 1800
card_type: cue_card
---

### Extracting data of a single book

In this section, we will go to the actual page where a book's data is present and extract it into a dictionary.

Code
```python=
book = books[0]
```

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/290/original/3.png?1706109693">

As we can see, each book thumbnail holds the URL of the page where the data of the corresponding book is present. First, we are going to extract this URL by utilising the hierarchy of the web page. The `<a>` tag holding the URL of a book sits within that book's `<li>` tag, which we already extracted, and as we learnt earlier, this `<a>` tag is one of the children of the `<li>` tag. Hence, we can query that book using the `.findChild()` method to target the `<a>` tag, and then capture its URL by accessing the `href` attribute.

Code
```python=
book_url = book.findChild(name="a").get("href")
book_url
```

> Output

'catalogue/a-light-in-the-attic_1000/index.html'

The issue is that this is a relative URL, which needs to be converted into a complete absolute URL before we can access it using `requests.get()`. One way is to directly prefix `base_url` to it; the `urljoin` function from the `urllib.parse` module is a more popular and error-free way to do so.

Code
```python=
from urllib.parse import urljoin

book_url = urljoin(base_url, book_url)
book_url
```

> Output

'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'

This is what a book's page looks like; now we understand its structure and extract the data. From this page we need the **Name of book** and the **Product Information** table.

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/291/original/4.png?1706109767">

We start by getting its HTML content using `requests` and parsing it via `Beautiful Soup`.

Code
```python=
book_info = requests.get(book_url).content
book_soup = BeautifulSoup(markup=book_info, features="html.parser")
```

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/292/original/5.png?1706110251">

From the web structure, the **Name of book** is within the only `<h1>` tag present on this page. Let's grab this first. We can use `.getText()` to get the text present between any pair of tags.

Code
```python=
name = book_soup.find(name="h1").getText()
name
```

> Output

'A Light in the Attic'

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/293/original/6.png?1706110345">

Now, the **Product Information** table is structured as an HTML table, where each row's data is between `<tr>` tags, and each `<tr>` tag holds the heading between `<th>` tags and the corresponding value between `<td>` tags. So we get hold of all the rows and, one by one, extract the headings and their corresponding values.

Code
```python=
book_table_data = book_soup.find_all(name="tr")
len(book_table_data)
```

> Output

7

Code
```python=
book_data = {}

for row in book_table_data:
    key = row.find(name="th").getText()
    value = row.find(name="td").getText()
    book_data[key] = value

book_data
```

> Output

{'UPC': 'a897fe39b1053632', 'Product Type': 'Books', 'Price (excl. tax)': '£51.77', 'Price (incl. tax)': '£51.77', 'Tax': '£0.00', 'Availability': 'In stock (22 available)', 'Number of reviews': '0'}

Let's wrap all the functionality into one function that takes the absolute URL of a particular book page and returns its data in a dictionary.

Code
```python=
def scrape_book(book_url):
    book_info = requests.get(book_url).content
    book_soup = BeautifulSoup(markup=book_info, features="html.parser")

    book_data = {}

    # getting the name
    name = book_soup.find(name="h1").getText()
    book_data['name'] = name

    # getting other data
    book_table_data = book_soup.find_all(name="tr")
    for row in book_table_data:
        key = row.find(name="th").getText()
        value = row.find(name="td").getText()
        book_data[key] = value

    # let's also keep the URL of the book in the final result
    book_data['url'] = book_url

    return book_data

# let's test this
scrape_book(book_url)
```

> Output

{'name': 'A Light in the Attic', 'UPC': 'a897fe39b1053632', 'Product Type': 'Books', 'Price (excl. tax)': '£51.77', 'Price (incl. tax)': '£51.77', 'Tax': '£0.00', 'Availability': 'In stock (22 available)', 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'}
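Note that `scrape_book` assumes every request succeeds and every page has an `<h1>` and a table. A hedged sketch of a more defensive variant (`scrape_book_safe` is a hypothetical name, not part of the lesson's code):

Code
```python=
def scrape_book_safe(book_url):
    # returns None instead of raising if the page is missing or malformed
    response = requests.get(book_url, timeout=10)
    if response.status_code != 200:
        return None

    book_soup = BeautifulSoup(response.content, "html.parser")
    heading = book_soup.find(name="h1")
    if heading is None:  # page without the expected structure
        return None

    book_data = {"name": heading.getText(), "url": book_url}
    for row in book_soup.find_all(name="tr"):
        key, value = row.find(name="th"), row.find(name="td")
        if key is not None and value is not None:
            book_data[key.getText()] = value.getText()

    return book_data
```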
---
title: Quiz-4
description: Quiz-4
duration: 60
card_type: quiz_card
---

# Question
In `BeautifulSoup`, what is the primary purpose of the `findChild` function?

# Choices
- [x] To find and return the first child of a given tag, based on optional criteria like tag name or attributes.
- [ ] To retrieve all child elements of a specified tag in the form of a list.
- [ ] To find and return the parent of a specified tag.
- [ ] To search for and return a specific sibling tag based on its position.

---
title: Fetching the data of all books in one page
description:
duration: 1200
card_type: cue_card
---

### Fetching the data of all books in one page

Let's see how the URL changes as we move from page to page.

* If we go to the base URL in the browser and open the next page, we can see that the URL changes to `https://books.toscrape.com/catalogue/page-2.html`
* And if we go to the third page, the URL changes to `https://books.toscrape.com/catalogue/page-3.html`
* From here we can formulate that the URL for the `i-th` page will be `https://books.toscrape.com/catalogue/page-i.html`

Code
```python=
# Fetching the data of all the books from the 1st page
page_url = "https://books.toscrape.com/catalogue/page-1.html"

page_content = requests.get(page_url).content
page_soup = BeautifulSoup(markup=page_content, features="html.parser")
page_books = page_soup.find_all(name="li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")

print(len(page_books))
```

> Output

20

Now, since we already have a function to scrape each book's data, we can directly use it to scrape all the books present on one page and store the results. Note that on these catalogue pages the book links are relative to the page itself, so we join them against `page_url` rather than `base_url`.

Code
```python=
books_data = []

for book in page_books:
    book_url = book.findChild(name="a").get("href")
    # converting the relative URL to an absolute one
    book_url = urljoin(page_url, book_url)
    book_data = scrape_book(book_url)
    books_data.append(book_data)

books_data[:3]
```

> Output

[{'name': 'A Light in the Attic', 'UPC': 'a897fe39b1053632', 'Product Type': 'Books', 'Price (excl. tax)': '£51.77', 'Price (incl. tax)': '£51.77', 'Tax': '£0.00', 'Availability': 'In stock (22 available)', 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'}, {'name': 'Tipping the Velvet', 'UPC': '90fa61229261140a', 'Product Type': 'Books', 'Price (excl. tax)': '£53.74', 'Price (incl. tax)': '£53.74', 'Tax': '£0.00', 'Availability': 'In stock (20 available)', 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'}, {'name': 'Soumission', 'UPC': '6957f44c3847a760', 'Product Type': 'Books', 'Price (excl. tax)': '£50.10', 'Price (incl. tax)': '£50.10', 'Tax': '£0.00', 'Availability': 'In stock (20 available)', 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/soumission_998/index.html'}]
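One etiquette point before we scale this up: firing requests in a tight loop can overload a small server. A common courtesy is to sleep briefly between requests. A sketch of the same loop with a pause (`books_data_throttled` and the 0.5-second delay are illustrative choices):

Code
```python=
import time

# same loop as above, but with a polite pause between consecutive requests
books_data_throttled = []
for book in page_books:
    book_url = urljoin(page_url, book.findChild(name="a").get("href"))
    books_data_throttled.append(scrape_book(book_url))
    time.sleep(0.5)  # arbitrary example delay; tune to the target site
```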
Let's wrap that functionality into a function so that we can directly use it to scrape all the pages.

Code
```python=
def scrape_page(page_url):
    books_data = []

    page_content = requests.get(page_url).content
    page_soup = BeautifulSoup(markup=page_content, features="html.parser")
    page_books = page_soup.find_all(name="li", class_="col-xs-6 col-sm-4 col-md-3 col-lg-3")

    for book in page_books:
        book_url = book.findChild(name="a").get("href")
        book_url = urljoin(page_url, book_url)
        book_data = scrape_book(book_url)
        books_data.append(book_data)

    return books_data
```

---
title: Scraping all books from all the pages
description:
duration: 1200
card_type: cue_card
---

### Scraping all books from all the pages

From the website we can see that there are 50 pages in total, but sometimes this information is not given on a website. That is why we'll use a generic approach here: keep extracting page data until we run out of pages.

Let's try requesting the data from page 100, which is more than the total number of pages on our website.

Code
```python=
requests.get("https://books.toscrape.com/catalogue/page-100.html").status_code
```

> Output

404

A `404` status code means that the page does not exist. So our approach is to keep requesting pages until we get a 404, and stop scraping there.

Code
```python=
page_count = 1
data = []

while True:
    page_url = f"https://books.toscrape.com/catalogue/page-{page_count}.html"
    status = requests.get(page_url).status_code

    # break the loop if we exceed the total page count
    if status == 404:
        break

    page_data = scrape_page(page_url)
    data.extend(page_data)  # do not use .append() since the function returns a list

    print(f"Page: {page_count} is SUCCESSFULLY scraped")
    page_count += 1
```

> Output

Page: 1 is SUCCESSFULLY scraped
Page: 2 is SUCCESSFULLY scraped
Page: 3 is SUCCESSFULLY scraped
Page: 4 is SUCCESSFULLY scraped
Page: 5 is SUCCESSFULLY scraped
Page: 6 is SUCCESSFULLY scraped
Page: 7 is SUCCESSFULLY scraped
Page: 8 is SUCCESSFULLY scraped
Page: 9 is SUCCESSFULLY scraped
Page: 10 is SUCCESSFULLY scraped
Page: 11 is SUCCESSFULLY scraped
Page: 12 is SUCCESSFULLY scraped
Page: 13 is SUCCESSFULLY scraped
Page: 14 is SUCCESSFULLY scraped
Page: 15 is SUCCESSFULLY scraped
Page: 16 is SUCCESSFULLY scraped
Page: 17 is SUCCESSFULLY scraped
Page: 18 is SUCCESSFULLY scraped
Page: 19 is SUCCESSFULLY scraped
Page: 20 is SUCCESSFULLY scraped
Page: 21 is SUCCESSFULLY scraped
Page: 22 is SUCCESSFULLY scraped
Page: 23 is SUCCESSFULLY scraped
Page: 24 is SUCCESSFULLY scraped
Page: 25 is SUCCESSFULLY scraped
Page: 26 is SUCCESSFULLY scraped
Page: 27 is SUCCESSFULLY scraped
Page: 28 is SUCCESSFULLY scraped
Page: 29 is SUCCESSFULLY scraped
Page: 30 is SUCCESSFULLY scraped
Page: 31 is SUCCESSFULLY scraped
Page: 32 is SUCCESSFULLY scraped
Page: 33 is SUCCESSFULLY scraped
Page: 34 is SUCCESSFULLY scraped
Page: 35 is SUCCESSFULLY scraped
Page: 36 is SUCCESSFULLY scraped
Page: 37 is SUCCESSFULLY scraped
Page: 38 is SUCCESSFULLY scraped
Page: 39 is SUCCESSFULLY scraped
Page: 40 is SUCCESSFULLY scraped
Page: 41 is SUCCESSFULLY scraped
Page: 42 is SUCCESSFULLY scraped
Page: 43 is SUCCESSFULLY scraped
Page: 44 is SUCCESSFULLY scraped
Page: 45 is SUCCESSFULLY scraped
Page: 46 is SUCCESSFULLY scraped
Page: 47 is SUCCESSFULLY scraped
Page: 48 is SUCCESSFULLY scraped
Page: 49 is SUCCESSFULLY scraped
Page: 50 is SUCCESSFULLY scraped
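As an aside, probing for a 404 is not the only generic stopping rule. Many paginated sites render a "next" button, so we can follow it until it disappears. A sketch of that alternative, assuming the pager is marked up as `<li class="next">` (which this site appears to use); note it fetches each listing page twice, once here and once inside `scrape_page`:

Code
```python=
page_url = "https://books.toscrape.com/catalogue/page-1.html"
data_alt = []

while page_url is not None:
    data_alt.extend(scrape_page(page_url))

    # look for the pager's "next" link on the current page
    page_soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    next_link = page_soup.select_one("li.next > a")
    page_url = urljoin(page_url, next_link.get("href")) if next_link else None
```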
> **Note:** For the sake of this class, we'll only scrape 5 pages, as scraping all 50 would take a good amount of time.

Code
```python=
page_count = 1
data = []

while True:
    page_url = f"https://books.toscrape.com/catalogue/page-{page_count}.html"
    status = requests.get(page_url).status_code

    # break the loop if we exceed the total page count, or after 5 pages
    if status == 404 or page_count == 6:
        break

    page_data = scrape_page(page_url)
    data.extend(page_data)  # do not use .append() since the function returns a list

    print(f"Page: {page_count} is SUCCESSFULLY scraped")
    page_count += 1
```

> Output

Page: 1 is SUCCESSFULLY scraped
Page: 2 is SUCCESSFULLY scraped
Page: 3 is SUCCESSFULLY scraped
Page: 4 is SUCCESSFULLY scraped
Page: 5 is SUCCESSFULLY scraped

Code
```python=
data[:2]
```

> Output

[{'name': 'A Light in the Attic', 'UPC': 'a897fe39b1053632', 'Product Type': 'Books', 'Price (excl. tax)': '£51.77', 'Price (incl. tax)': '£51.77', 'Tax': '£0.00', 'Availability': 'In stock (22 available)', 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'}, {'name': 'Tipping the Velvet', 'UPC': '90fa61229261140a', 'Product Type': 'Books', 'Price (excl. tax)': '£53.74', 'Price (incl. tax)': '£53.74', 'Tax': '£0.00', 'Availability': 'In stock (20 available)', 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'}]

---
title: Quiz-5
description: Quiz-5
duration: 60
card_type: quiz_card
---

# Question
What will be the output of the following piece of code?

Code
```python=
from bs4 import BeautifulSoup

# HTML
html_snippet = """
<div>
  <p>Hello, <b>World!</b></p>
  <p>Welcome to <a href="https://example.com">Example</a>.</p>
</div>
"""

soup = BeautifulSoup(html_snippet, 'html.parser')
extracted_text = soup.find(name="div").getText()
print(extracted_text)
```

# Choices
- [ ] `<p>Hello, <b>World!</b></p>`
- [ ] `Hello, World!`
- [x] `Hello, World! \n Welcome to Example.`
- [ ] `None of the above`

---
title: Fixing the data formatting
description:
duration: 1200
card_type: cue_card
---

### Fixing the data formatting

The only issue now is that our data is not in a proper format, so in this section we will fix the formatting issues one by one.

Code
```python=
book = data[0].copy()
book
```

> Output

{'name': 'A Light in the Attic', 'UPC': 'a897fe39b1053632', 'Product Type': 'Books', 'Price (excl. tax)': '£51.77', 'Price (incl. tax)': '£51.77', 'Tax': '£0.00', 'Availability': 'In stock (22 available)', 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'}

Code
```python=
# removing the £ symbol and converting the price to float
float(book['Price (excl. tax)'][1:])
```

> Output

51.77

Code
```python=
# splitting 'Availability' into 2 keys: 'quantity_available' and 'is_available'
quantity_available = int(book['Availability'].split("(")[-1][:-1].split()[0])
is_available = book['Availability'].split("(")[0].strip()

quantity_available, is_available
```

> Output

(22, 'In stock')

Code
```python=
def fix(item):
    # strip the currency symbol and convert prices to floats
    item['Price (excl. tax)'] = float(item['Price (excl. tax)'][1:])
    item['Price (incl. tax)'] = float(item['Price (incl. tax)'][1:])
    item['Tax'] = float(item['Tax'][1:])

    # split 'Availability' into a boolean flag and a quantity
    availability = item.pop('Availability')
    item['is_available'] = availability.split("(")[0].strip() == 'In stock'
    item['quantity_available'] = int(availability.split("(")[-1][:-1].split()[0])

    return item

formatted_data = [fix(item.copy()) for item in data]
formatted_data[:3]
```

> Output

[{'name': 'A Light in the Attic', 'UPC': 'a897fe39b1053632', 'Product Type': 'Books', 'Price (excl. tax)': 51.77, 'Price (incl. tax)': 51.77, 'Tax': 0.0, 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'is_available': True, 'quantity_available': 22}, {'name': 'Tipping the Velvet', 'UPC': '90fa61229261140a', 'Product Type': 'Books', 'Price (excl. tax)': 53.74, 'Price (incl. tax)': 53.74, 'Tax': 0.0, 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html', 'is_available': True, 'quantity_available': 20}, {'name': 'Soumission', 'UPC': '6957f44c3847a760', 'Product Type': 'Books', 'Price (excl. tax)': 50.1, 'Price (incl. tax)': 50.1, 'Tax': 0.0, 'Number of reviews': '0', 'url': 'http://books.toscrape.com/catalogue/soumission_998/index.html', 'is_available': True, 'quantity_available': 20}]

Now we have well-formatted data scraped from the website.
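Scraped data is usually persisted for later analysis. A minimal sketch that writes the cleaned records to a JSON file using only the standard library (the filename `books.json` is an arbitrary choice):

Code
```python=
import json

# write the list of dictionaries to disk;
# ensure_ascii=False keeps the £ symbol readable in the file
with open("books.json", "w", encoding="utf-8") as f:
    json.dump(formatted_data, f, ensure_ascii=False, indent=2)
```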
---
title: Practice Coding Question(s)
description:
duration: 300
card_type: cue_card
---

### Practice Coding Question(s)

You can pick the following question and solve it during the lecture itself. This will help the learners get familiar with the problem-solving process and motivate them to solve the assignments.

<span style="background-color: pink;">Make sure to start the doubt session before you solve this question.</span>

Q. https://www.scaler.com/hire/test/problem/101484/