# Practical 2 - Python Web Crawler and Storage
The aim of the practical is to create a [Web Crawler](https://en.wikipedia.org/wiki/Web_crawler) that will store its scraped information in a database. For this, we are going to be using the [Scrapy](https://docs.scrapy.org/en/latest/index.html) framework and Python's [SQLite](https://docs.python.org/3.7/library/sqlite3.html) library.
## Exercise 1: Creating a Web Crawler
The first thing we need to do is make sure Scrapy is installed on your machine. It should already be, but as we all know by this point, things rarely go exactly to plan.
First, clone [this repository](https://nucode.ncl.ac.uk/scomp/stage1/csc1033/practical-2-web-scraping-and-storage) into your workspace and open it in PyCharm. Once it has finished loading, enter the below command into the terminal:
`pipenv update`. This may take a few minutes, but it should install Scrapy into your project's virtual environment.
From here, we should now be able to run the following command:
`pipenv run scrapy startproject bookscraper`.
Scrapy will now create a project consisting of a few folders and files. These files make up a boilerplate crawler for us to work from. A breakdown of the boilerplate Scrapy project can be seen below.
```
bookscraper/
|---scrapy.cfg          # deploy configuration file
|---bookscraper/        # project's Python module, you'll import your code from here
    |---__init__.py
    |---items.py        # project items definition file, we'll be using this to define how we store our scraped data
    |---middlewares.py  # project middlewares file, we won't be using this today
    |---pipelines.py    # project pipelines file, we'll be using this to put our scraped data into a database
    |---settings.py     # project settings file, we'll use this just a little bit
    |---spiders/        # a directory where you'll later put your spiders, this is where we'll make our spider
        |---__init__.py
```
## Exercise 2: Preparing to Scrape
The first thing we need to do is decide what information we want to scrape. The site we are going to be scraping is [books.toscrape.com](http://books.toscrape.com/). This site was created with the intent of being used to practice scraping.
The site's premise is that it is a book store and it displays a catalog of books that can be purchased. Let's imagine that we want to create a database that consists of all the details of the books that the store offers. The first bit of information we may want could be a book's title.
Open the `items.py` file that Scrapy created for us. This file defines a data structure, called an `Item`, that we will be using to store a book's details in. [See here for more details](https://docs.scrapy.org/en/latest/topics/items.html). All item objects inherit from Scrapy's `scrapy.Item` class and behave similarly to how dictionaries do.
Let's create an item called `BookItem` that we will use to store the details of the books we are going to scrape. See the below code:
```python
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
```
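Since an item behaves like a dictionary, you can set and read its fields with the usual `[]` syntax. As a quick illustration (assuming the `BookItem` defined above is importable), you could try the following in a Python console:
```python
book_item = BookItem()
book_item["title"] = "A Light in the Attic"  # set a field, just like a dictionary key
print(book_item["title"])                    # read it back
print(dict(book_item))                       # an item converts cleanly into a plain dict
```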
**Task:** the above code creates a `BookItem` that has a Scrapy field that can store a book's title. Add the following fields to the `BookItem` (a possible completed version is sketched after this list):
- page
- url (the url of the book's page)
- price
- thumbnail (the url of the book's thumbnail image)
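If you get stuck, a completed `BookItem` might look something like the sketch below; the exact field names matter, as the spider and pipeline later in this practical refer to them.
```python
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()      # the book's title
    page = scrapy.Field()       # the catalogue page the book was found on
    url = scrapy.Field()        # the url of the book's own page
    price = scrapy.Field()      # the listed price (scraped as a string)
    thumbnail = scrapy.Field()  # the url of the book's thumbnail image
```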
Now that we have a place to store our scraped info, let's implement the spider that will be doing the crawling.
## Exercise 3: Crawling and Getting Some Data
Now that we have created a `BookItem` item, we want to start scraping data from the website. The first step is to create a new file called `book_store_spider.py` in the `spiders` folder Scrapy created for us. In this file, we will define where we want to scrape information from (which website) and what information we want to scrape.
A Scrapy spider needs a few things before it can get going. A spider inherits from Scrapy's own `scrapy.Spider` class, and it requires the following (a bare-bones skeleton is sketched after this list):
- `name`: a field that identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.
We will be using this `name` field to run our spider.
- `start_requests()`: must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
We will be using this method to initially tell the spider what page to start from.
- `parse()`: a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
This is the method that will be selecting the information we want and translating it into a `BookItem`.
The `parse()` method usually parses the response, extracting the scraped data as a dictionary (or in our case, a `BookItem`) and also finds new URLs to follow, creating new requests (Request) from them.
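Putting those three pieces together, a bare-bones spider looks roughly like the sketch below (the names here are placeholders; the real spider we build next uses the `BookItem` and XPath selectors described in the rest of this exercise):
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # the unique name used to run this spider: `scrapy crawl example`
    name = "example"

    def start_requests(self):
        # yield the initial request(s); each downloaded response is passed to self.parse
        yield scrapy.Request(url="http://books.toscrape.com/", callback=self.parse)

    def parse(self, response):
        # select data out of the response and yield items (and/or further requests)
        yield {"page_title": response.xpath("//title/text()").get()}
```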
### XPath Selectors
It is important to understand that Scrapy uses XPath selectors to extract data from a webpage. XPath is a language for selecting nodes in XML documents, but it can also be used with HTML. You will need to investigate XPath and its syntax yourself; please see [Scrapy's documentation on XPath](https://doc.scrapy.org/en/0.12/topics/selectors.html). Scrapy's examples may also help, see [here](https://doc.scrapy.org/en/0.12/intro/tutorial.html#introduction-to-selectors) and [here](https://doc.scrapy.org/en/0.12/intro/overview.html#write-a-spider-to-extract-the-data). There are many other sites out there that you can learn from as well.
Using Google Chrome, we can easily identify a specific XPath using the Developer Tools. Open them by pressing F12 (or right-clicking and choosing Inspect), then use the inspect-element tool to find the XPath of an HTML element.
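Another handy way to experiment with selectors is Scrapy's interactive shell. If you run `pipenv run scrapy shell "http://books.toscrape.com/"` in the terminal, the downloaded page is made available as `response`, and you can try expressions against it (the selectors below are just illustrations, not necessarily the ones you need):
```python
# the title attribute of every book link on the front page
response.xpath("//article[@class='product_pod']/h3/a/@title").getall()

# the text of the page's <title> tag
response.xpath("//title/text()").get()
```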
In this practical we will start scraping information about books. The majority of the code is given to you, but a few pieces are missing. Through your research and by investigating the code provided, you should be able to figure out how to scrape:
- the page number.
- the url of a book's own page.
- the url of a book's thumbnail.
```python
import scrapy

from bookscraper.bookscraper.items import BookItem


class BookStoreSpider(scrapy.Spider):
    name = "book-store-crawler"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]

    def start_requests(self):
        for page_url in self.start_urls:
            yield scrapy.Request(url=page_url, callback=self.parse)

    def parse(self, response):
        # get all items that are inside the article tag with class "product_pod"
        # (this is the grid of all books)
        books = response.xpath("//article[@class='product_pod']")
        # for every book on the page
        for book in books:
            book_item = BookItem()

            # get the page the book is displayed on
            # book_item["page"] = ...

            # get the title of the book
            # (the title attribute of the a tag, as the p title gets truncated)
            book_item["title"] = book.xpath("h3/a/@title").get()

            # get the price of the book as a str
            book_item["price"] = book.xpath("div[@class='product_price']/p[@class='price_color']/text()").get()

            # get the url to the book's page
            # book_item["url"] = ...
            # Hint: look up urljoin in the scrapy documentation

            # get the thumbnail of the book page
            # book_item["thumbnail"] = ...

            yield book_item
```
<details>
<summary>Hints:</summary>
One possible way to fill in the missing lines is sketched below. The exact XPath expressions depend on the page's HTML, so treat these as a guide rather than the only correct answer.

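```python
# the page number: the first page is the site root, later pages have "page-N" in their url
# (this assumes catalogue urls look like ".../catalogue/page-2.html")
if "page-" in response.url:
    book_item["page"] = int(response.url.split("page-")[-1].split(".")[0])
else:
    book_item["page"] = 1

# the url of the book's own page: join the relative href onto the url of the current response
book_item["url"] = response.urljoin(book.xpath("h3/a/@href").get())

# the url of the book's thumbnail image, again joined to make it absolute
book_item["thumbnail"] = response.urljoin(book.xpath("div[@class='image_container']/a/img/@src").get())
```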
</details>
We can run the above code by typing the command: `pipenv run scrapy crawl book-store-crawler`. If you have written your code correctly, you will begin to see output in the terminal window about the information your spider is collecting.
So far we have only scraped data from a single page; however, it is possible to "automate" our spider so that it scrapes information from multiple pages on a website. To enable this, we will write some code that utilises the [pagination](https://en.wikipedia.org/wiki/Pagination#In_web_browsers) functionality on the website. Place the below code at the bottom of your `parse()` method.
```python
# If there is a next button on the page, crawl the page it links to
next_page = response.xpath("//li[@class='next']/a/@href").get()
if next_page is not None:
    yield response.follow(next_page, self.parse)
```
The above code checks whether a "next" button exists at the bottom of the page and, if it does, follows the link it points to and crawls that page as well. Now when you re-run your scraper, you should notice that it returns a lot more information than it did previously *(across all 50 pages of the site there is a total of 1,000 book listings)*.
## Exercise 4: Storing the Data
So far, we are just crawling the data but not doing anything with it. What we want to do is store it in a database. We can do this using the [`sqlite3`](https://docs.python.org/3.7/library/sqlite3.html) module, Python's built-in interface to SQLite, which you should now be familiar with.
Scrapy comes with some nice commands out of the box that will allow you to create files like `json` and `csv` which contain your crawled data. The following commands will create a `json` and a `csv` file containing all of your crawled data:
```terminal
pipenv run scrapy crawl book-store-crawler -o books.json
pipenv run scrapy crawl book-store-crawler -o books.csv
```
The way Scrapy manages data storage is through [item pipelines](https://docs.scrapy.org/en/latest/topics/item-pipeline.html). We will need to create our own pipeline that creates (if needed) a database with a table and then inserts every scraped `BookItem` our crawler produces. A pipeline needs the following methods:
- `process_item(self, item, spider)`: this method is called for every item, by every pipeline component. `process_item()` must either return a dict with data, return an `Item` (or any descendant class) object, return a Twisted Deferred, or raise a `DropItem` exception. Dropped items are no longer processed by further pipeline components.
We will be inserting the items into the database from inside this method.
- `open_spider(self, spider)`: this method is called when the spider is opened.
We're going to want to connect to our database in this method.
- `close_spider(self, spider)`: this method is called when the spider is closed.
We're going to disconnect from our database in this method.
Let's open up the `pipelines.py` file and place the following into it:
```python
import sqlite3


class SQLitePipeline(object):

    def open_spider(self, spider):
        # what will be the primary key of our database table
        self.id = 0
        # our connection to the local database 'Books.db'
        self.connection = sqlite3.connect("Books.db")
        self.cursor = self.connection.cursor()
        # the name of the table we will be storing our data in
        self.db_table = "books_details"
        # creating the table (if it does not exist)
        create_table_sql = f"""
            CREATE TABLE IF NOT EXISTS {self.db_table} (
                id INTEGER PRIMARY KEY,
                page INTEGER,
                title TEXT,
                price TEXT,
                url TEXT,
                thumbnail TEXT
            )
        """
        # execute the command and save the changes (commit them)
        self.cursor.execute(create_table_sql)
        self.connection.commit()

    def close_spider(self, spider):
        # when we have finished crawling, print all the contents of the database to the console
        for row in self.cursor.execute(f"SELECT * FROM {self.db_table}"):
            print(row)
        # terminate the connection
        self.connection.close()

    def process_item(self, book, spider):
        # an IntegrityError exception is thrown when you try to insert a record into a table
        # but a record with that primary key already exists
        try:
            # the insert statement
            insert_into_sql = f"""
                INSERT INTO {self.db_table} (id, page, title, price, url, thumbnail)
                VALUES (?, ?, ?, ?, ?, ?)
            """
            # convert the `BookItem` our spider created into a dictionary
            b = dict(book)
            self.id += 1
            values = (self.id, b["page"], b["title"], b["price"], b["url"], b["thumbnail"])
            # insert the item's data into the database
            self.cursor.execute(insert_into_sql, values)
            self.connection.commit()
        except sqlite3.IntegrityError:
            return book
        return book
```
Most of the above code is quite self-explanatory: for every `BookItem` our spider collects, we insert it into the `Books.db` database.
The last thing we need to do is enable the `SQLitePipeline` pipeline we just created. Open `settings.py` and navigate to line `68`. Here we just need to uncomment the `ITEM_PIPELINES` setting (the line itself and the lines above and below it) and make sure it points to the `SQLitePipeline` class in the `pipelines.py` file we just edited.
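Once uncommented and edited, the setting should look roughly like this (the exact module path depends on the project layout, so adjust it if your imports differ):
```python
ITEM_PIPELINES = {
    "bookscraper.pipelines.SQLitePipeline": 300,
}
```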
If we run the spider now, you should see that a `Books.db` file is created and, when the crawler finishes (you might need to scroll up a little), the details of all the site's books are printed to the console.
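If you want to double-check the database outside of Scrapy, a few lines of `sqlite3` in a separate script (or the Python console) will do it. The table name below matches the one used in the pipeline:
```python
import sqlite3

connection = sqlite3.connect("Books.db")
cursor = connection.cursor()
# count how many books were stored (1,000 if all 50 pages were crawled)
print(cursor.execute("SELECT COUNT(*) FROM books_details").fetchone()[0])
# print the first few rows
for row in cursor.execute("SELECT * FROM books_details LIMIT 5"):
    print(row)
connection.close()
```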
We have now created a web crawler that will retrieve information from a website and store it into a database.
## Exercise 5: Further Work
- Create a boolean value that is `True` when a book is in stock, `False` if it is out of stock. Update your `BookItem` fields and `pipelines.py` to add the new field to the database.
- At the moment, the price of a book is stored as a string. Change this so that it is stored in the database as a number (one possible approach is sketched after this list).
- *Advanced*: Scrape a book's rating and modify your code to add it to the database. This will take some doing and will require using some more XPath methods as a book's rating is stored as a class name.
- *Advanced*: Instead of just storing the book's thumbnail url, implement the ability to store the book's full resolution image url. You will need to `yield` another response that calls back to another parse method (a method you will have to create).
- Create another spider that crawls another website of your choice, [quotes.toscrape.com](http://quotes.toscrape.com/) is another good site to practice on. Make sure to create a new item and also a new pipeline for your new crawler.
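For the price exercise, one possible approach (assuming the scraped prices look like `£51.77`) is to strip the currency symbol and convert the rest to a `float` inside the spider's `parse()` method, before the item reaches the pipeline:
```python
# the scraped price text, e.g. "£51.77"
price_text = book.xpath("div[@class='product_price']/p[@class='price_color']/text()").get()
# drop the leading currency symbol and store the price as a number
book_item["price"] = float(price_text.lstrip("£"))
```
You would also need to change the `price` column in the pipeline's `CREATE TABLE` statement from `TEXT` to `REAL` (and delete the old `Books.db` so the table is recreated).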