# Practical 2 - Python Web Crawler and Storage

The aim of this practical is to create a [Web Crawler](https://en.wikipedia.org/wiki/Web_crawler) that will store its scraped information in a database. For this, we are going to be using the [Scrapy](https://docs.scrapy.org/en/latest/index.html) framework and Python's [SQLite](https://docs.python.org/3.7/library/sqlite3.html) library.

## Exercise 1: Creating a Web Crawler

The first thing we need to do is make sure Scrapy is installed on your machine. It should be, but as we all know by this point, things don't always go to plan.

First, clone this repository [REMEMBER TO CHANGE THIS](https://nucode.ncl.ac.uk/scomp/stage1/csc1033/practical-2-web-scraping-and-storage) into your workspace and open it in PyCharm. Once it has finished loading, enter the command below into the terminal:

`pipenv update`

This may take a few minutes, but it should install Scrapy into your project's virtual environment. From here, we should now be able to run the following command:

`pipenv run scrapy startproject bookscraper`

Scrapy will now create a project consisting of a few folders and files. These files make up a boilerplate crawler for us to work from. A breakdown of the boilerplate Scrapy project can be seen below.

```
bookscraper/
|---scrapy.cfg          # deploy configuration file
|---bookscraper/        # project's Python module, you'll import your code from here
    |---__init__.py
    |---items.py        # project items definition file, we'll be using this to define how we store our scraped data
    |---middlewares.py  # project middlewares file, we won't be using this today
    |---pipelines.py    # project pipelines file, we'll be using this to put our scraped data into a database
    |---settings.py     # project settings file, we'll use this just a little bit
    |---spiders/        # a directory where you'll later put your spiders, this is where we'll make our spider
        |---__init__.py
```

## Exercise 2: Preparing to Scrape

The first thing we need to do is decide what information we want to scrape. The site we are going to be scraping is [books.toscrape.com](http://books.toscrape.com/). This site was created specifically to be used for practising scraping. Its premise is a book store that displays a catalog of books that can be purchased. Let's imagine that we want to create a database containing the details of all the books the store offers. The first piece of information we might want is a book's title.

Open the `items.py` file that Scrapy created for us. This file defines a data structure, called an `Item`, that we will use to store a book's details. [See here for more details](https://docs.scrapy.org/en/latest/topics/items.html). All item objects inherit from Scrapy's `scrapy.Item` class and behave similarly to dictionaries. Let's create an item called `BookItem` that we will use to store the details of the books we are going to scrape. See the code below:

```python
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
```

**Task:** the above code creates a `BookItem` that has a Scrapy field that can store a book's title. Add the following fields to the `BookItem`:

- page
- url (the url of the book's page)
- price
- thumbnail (the url of the book's thumbnail image)
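Try the task yourself first; for reference, a minimal sketch of how the completed `BookItem` could look once the four fields above have been added (the inline comments are our own annotations, not part of the original file):

```python
import scrapy


class BookItem(scrapy.Item):
    # the details we want to store for each book
    title = scrapy.Field()
    page = scrapy.Field()       # the catalogue page the book appears on
    url = scrapy.Field()        # the url of the book's own page
    price = scrapy.Field()      # the price, scraped as a string for now
    thumbnail = scrapy.Field()  # the url of the book's thumbnail image
```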
Now that we have a place to store our scraped info, let's implement the spider that will be doing the crawling.

## Exercise 3: Crawling and Getting Some Data

Now that we have created a `BookItem` item, we want to start scraping data from the website. The first step is to create a new file called `book_store_spider.py` in the `spiders` folder Scrapy created for us. In this file, we will define where we want to scrape information from (which website) and what information we want to scrape.

A Scrapy spider needs a few things before it can get going. A spider inherits from Scrapy's own `scrapy.Spider` class, and the things it requires are:

- `name`: a field that identifies the spider. It must be unique within a project; that is, you can't set the same name for different spiders. We will be using this `name` field to run our spider.
- `start_requests()`: must return an iterable of requests (you can return a list of requests or write a generator function) which the spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests. We will be using this method to tell the spider which page to start from.
- `parse()`: a method that will be called to handle the response downloaded for each of the requests made. The `response` parameter is an instance of `TextResponse` that holds the page content and has further helpful methods to handle it. This is the method that will select the information we want and translate it into a `BookItem`. The `parse()` method usually parses the response, extracting the scraped data as a dictionary (or in our case, a `BookItem`), and also finds new URLs to follow, creating new requests (`Request`) from them.

### XPath Selectors

It is important to understand that Scrapy uses XPath selectors to extract data from a webpage. XPath is a language for selecting nodes in XML documents, but it can also be used with HTML. You will need to investigate XPath and its syntax yourself; please see [Scrapy's documentation on XPath](https://doc.scrapy.org/en/0.12/topics/selectors.html). Scrapy's own examples may also help, see [here](https://doc.scrapy.org/en/0.12/intro/tutorial.html#introduction-to-selectors) and [here](https://doc.scrapy.org/en/0.12/intro/overview.html#write-a-spider-to-extract-the-data). There are many other sites you can learn from as well. Using Google Chrome, we can easily identify an element's XPath using the Developer Tools, which can be opened by pressing F12 (or by right-clicking an element and choosing Inspect). You can then use the inspect element tool to find the XPath value of an HTML element.
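A quick way to experiment with XPath before putting it in a spider is Scrapy's `Selector` class. The snippet below is only an illustrative sketch: the HTML fragment is made up (it just mimics the shape of one book listing on books.toscrape.com), and the selectors it uses are the same ones that appear in the spider later on.

```python
from scrapy import Selector

# a made-up HTML fragment shaped like one book listing on books.toscrape.com
html = """
<article class="product_pod">
  <h3><a href="catalogue/some-book_1/index.html" title="Some Book">Some Bo...</a></h3>
  <div class="product_price"><p class="price_color">£9.99</p></div>
</article>
"""

selector = Selector(text=html)
# select the whole listing, then pick pieces out of it with relative XPath
book = selector.xpath("//article[@class='product_pod']")[0]
print(book.xpath("h3/a/@title").get())                                                  # Some Book
print(book.xpath("div[@class='product_price']/p[@class='price_color']/text()").get())  # £9.99
```

You can also try selectors against the live site with `pipenv run scrapy shell "http://books.toscrape.com/"`, which drops you into an interactive session where `response` is already populated for you.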
In this practical we will start scraping information about books. Most of the code is given to you, but a few items are missing. Through your research and by investigating the code that is present, you should be able to figure out how to scrape:

- the page number.
- the URL of a book's own page.
- the URL of a book's thumbnail image.

```python
import scrapy

from bookscraper.bookscraper.items import BookItem


class BookStoreSpider(scrapy.Spider):
    name = "book-store-crawler"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]

    def start_requests(self):
        for page_url in self.start_urls:
            yield scrapy.Request(url=page_url, callback=self.parse)

    def parse(self, response):
        # get all items that are inside the article tag with class "product_pod"
        # (this is the grid of all books)
        books = response.xpath("//article[@class='product_pod']")

        # for every book on the page
        for book in books:
            book_item = BookItem()

            # get the page the book is displayed on
            # book_item["page"] = ...

            # get the title of the book
            # (the title of the a tag, as the p title gets truncated)
            book_item["title"] = book.xpath("h3/a/@title").get()

            # get the price of the book as a str
            book_item["price"] = book.xpath("div[@class='product_price']/p[@class='price_color']/text()").get()

            # get the url to the book's page
            # book_item["url"] = ...
            # Hint: look up urljoin in the Scrapy documentation

            # get the thumbnail of the book page
            # book_item["thumbnail"] = ...

            yield book_item
```

<details>
<summary>Hints:</summary>

Place the "answers" here. Basically just the exact code they need.

</details>

We can run the above code with the command: `pipenv run scrapy crawl book-store-crawler`. If you have written your code correctly, you will now begin to see some output in the terminal window about the information your spider is collecting.

So far we have scraped data from a single page; however, it is possible to "automate" our spider so that it scrapes information from multiple pages of a website. To enable this, we will need to write some code that utilises the [pagination](https://en.wikipedia.org/wiki/Pagination#In_web_browsers) functionality on the website. Place the code below at the bottom of your `parse()` method.

```python
# If there is a next button on the page, crawl the page it links to
next_page = response.xpath("//li[@class='next']/a/@href").get()
if next_page is not None:
    yield response.follow(next_page, self.parse)
```

The above code checks whether a "next" button exists at the bottom of the page and, if it does, follows its link and crawls the newly loaded page as well. Now when you re-run your scraper, you should notice that it returns a lot more information than it did previously *(across all 50 pages of the site there is a total of 1,000 book listings)*.

## Exercise 4: Storing the Data

So far we are just crawling the data, not doing anything with it. What we want to do is store it in a database. We can do this using [`sqlite3`](https://docs.python.org/3.7/library/sqlite3.html), the Python module that you should by now be familiar with.

Scrapy comes with some nice commands out of the box that allow you to create files such as `json` and `csv` containing your crawled data. The following commands will create a `json` and a `csv` file containing all of your crawled data:

```terminal
pipenv run scrapy crawl book-store-crawler -o books.json
pipenv run scrapy crawl book-store-crawler -o books.csv
```

The way Scrapy manages data storage is through [item pipelines](https://docs.scrapy.org/en/latest/topics/item-pipeline.html). We will need to create our own pipeline that will create (if needed) a database with a table and then insert every scraped `BookItem` our crawler produces. A pipeline needs the following methods:

- `process_item(self, item, spider)`: this method is called for every item pipeline component. `process_item()` must either return a dict with data, return an `Item` (or any descendant class) object, return a Twisted Deferred, or raise a `DropItem` exception. Dropped items are no longer processed by further pipeline components. We will be inserting the items into the database from inside this method.
- `open_spider(self, spider)`: this method is called when the spider is opened. We are going to connect to our database in this method.
- `close_spider(self, spider)`: this method is called when the spider is closed. We are going to disconnect from our database in this method.
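For reference, after uncommenting, the relevant block in `settings.py` should end up looking something like the sketch below. This assumes the project module is named `bookscraper` as created earlier; the exact line number and the class name in the generated comment may differ slightly, and the important part is that the dictionary points at our `SQLitePipeline` class.

```python
# in settings.py: uncomment the ITEM_PIPELINES setting and point it at our pipeline class;
# the value (300) is the pipeline's priority order and can stay as generated
ITEM_PIPELINES = {
    "bookscraper.pipelines.SQLitePipeline": 300,
}
```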
Let's open up the `pipelines.py` file and place the following into it:

```python
import sqlite3


class SQLitePipeline(object):

    def open_spider(self, spider):
        # what will be the primary key of our database table
        self.id = 0
        # our connection to the local database 'Books.db'
        self.connection = sqlite3.connect("Books.db")
        self.cursor = self.connection.cursor()
        # the name of the table we will be storing our data in
        self.db_table = "books_details"
        # creating the table (if it does not exist)
        create_table_sql = f"""
            CREATE TABLE IF NOT EXISTS {self.db_table} (
                id INTEGER PRIMARY KEY,
                page INTEGER,
                title TEXT,
                price TEXT,
                url TEXT,
                thumbnail TEXT
            )
        """
        # execute the command and save the changes (commit them)
        self.cursor.execute(create_table_sql)
        self.connection.commit()

    def close_spider(self, spider):
        # when we have finished crawling, print all the contents of the database to the console
        for row in self.cursor.execute(f"SELECT * FROM {self.db_table}"):
            print(row)
        # terminate the connection
        self.connection.close()

    def process_item(self, book, spider):
        # an IntegrityError exception is thrown when you try to insert a record into a table
        # but the record already exists
        try:
            # the insert statement
            insert_into_sql = f"""
                INSERT INTO {self.db_table} (id, page, title, price, url, thumbnail)
                VALUES (?, ?, ?, ?, ?, ?)
            """
            # convert the `BookItem` our spider created into a dictionary
            b = dict(book)
            self.id += 1
            values = (self.id, b["page"], b["title"], b["price"], b["url"], b["thumbnail"])
            # insert the item's data into the database
            self.cursor.execute(insert_into_sql, values)
            self.connection.commit()
        except sqlite3.IntegrityError:
            return book
        return book
```

Most of the above code is self-explanatory: for every `BookItem` our spider collects, we insert it into the `Books.db` database.

The last thing we need to do is enable the `SQLitePipeline` pipeline we just created. Open `settings.py` and navigate to line `68`. Here we just need to uncomment the `ITEM_PIPELINES` setting (include the lines above and below it) and make sure it's pointing to the pipeline class in the `pipelines.py` file we just edited.

If we run the spider now, you should see that a `Books.db` file is created, and when the crawler finishes (you might need to scroll up a little), the details of all the site's books should be printed to the console. We have now created a web crawler that retrieves information from a website and stores it in a database.

## Exercise 5: Further Work

- Create a boolean value that is `True` when a book is in stock and `False` if it is out of stock. Update your `BookItem` fields and `pipelines.py` to add the new field to the database.
- At the moment, the price of a book is stored as a string. Change this so that it is stored in the database as a number (one possible approach is sketched after this list).
- *Advanced*: Scrape a book's rating and modify your code to add it to the database. This will take some doing and will require using some more XPath methods, as a book's rating is stored as a class name.
- *Advanced*: Instead of just storing the book's thumbnail URL, implement the ability to store the URL of the book's full-resolution image. You will need to `yield` another request that calls back to another parse method (a method you will have to create).
- Create another spider that crawls another website of your choice; [quotes.toscrape.com](http://quotes.toscrape.com/) is another good site to practise on. Make sure to create a new item and also a new pipeline for your new crawler.
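As a nudge for the price task above, one possible approach (a sketch, not the only answer) is to clean the scraped string and convert it to a `float` before it reaches the database; you would then also change the `price` column in the `CREATE TABLE` statement from `TEXT` to `REAL`. The helper name below is our own and assumes the scraped price looks like `£51.77`.

```python
def price_to_number(price_str):
    """Convert a scraped price string such as '£51.77' into a float."""
    # keep only digits and the decimal point, then convert
    cleaned = "".join(ch for ch in price_str if ch.isdigit() or ch == ".")
    return float(cleaned)


# e.g. inside process_item(), before building the values tuple:
# b["price"] = price_to_number(b["price"])
```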
