# Web crawler
The goal of this app is to discover Python by creating a web crawler.
## Features
This crawler shall be able to:
- Read files for configuration:
  - URL to query
  - URL allow list
  - URL block list
  - File extension allow list
  - File extension block list
- Accept extra configuration through the command line
- Get the content of a webpage
- Go through the content of the page to find interesting elements
- Understand JSON and HTML formats
- Extract links to recursively download the whole site
- Set HTTP options via a configuration file
- Interact with the filesystem to store downloaded content
- Avoid loops (query A -> B -> A -> B -> ...), as shown in the sketch after this list
- Resume an interrupted download
- Log what's happening
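
For the loop-avoidance point, one common approach is to remember every URL already seen in a set and skip it on the second encounter. A minimal sketch, assuming a `fetch_links` callable that stands in for the real download-and-parse step:

```python
from collections import deque

def crawl(start_url, fetch_links):
    """Breadth-first crawl that never visits the same URL twice."""
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # already seen: this is what breaks A -> B -> A cycles
        visited.add(url)
        for link in fetch_links(url):  # placeholder for "download and parse"
            if link not in visited:
                queue.append(link)
    return visited
```

Persisting the `visited` set to disk is also one easy way to make a download resumable.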
## Python side
The following modules can be useful:
- requests
- html.parser
- json
- re
- argparse
- datetime
- yaml
- logging
The software architecture should make use of multiple files (one per "role" in the app) and of classes.
The configuration files can be in any format, but YAML is nice.
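
As an illustration, a YAML configuration covering the features above might look like this (every key name here is hypothetical; pick your own):

```yaml
# Hypothetical layout; key names are examples, not a required schema.
url: https://example.com
allow:
  - example.com
block:
  - ads.example.com
extensions:
  allow: [html, json]
  block: [exe, iso]
http:
  timeout: 10
  user_agent: my-crawler/0.1
```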
## How to start?
First make sure that you have a working environment running Python 3.
Install external dependencies using `pip`, like the `requests` module. Don't
forget to create a `requirements.txt` using `pip freeze`.
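
For example:

```sh
pip install requests pyyaml   # pyyaml provides the `yaml` module listed above
pip freeze > requirements.txt
```

Then proceed step by step: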
1. Handle configuration files
1. Handle the command line (a configuration/CLI sketch follows this list)
1. Create the class to handle a "web resource" (see the class sketch below)
1. Add methods to get this resource, parse it, etc.
1. ...
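
For steps 1 and 2, here is a minimal sketch combining a YAML configuration file with command-line overrides; the flag names and the `crawler.yaml` default are assumptions, not requirements:

```python
import argparse

import yaml  # provided by the pyyaml package


def load_config(path):
    """Read the YAML configuration file into a plain dict."""
    with open(path) as f:
        return yaml.safe_load(f)


def parse_args():
    """Command-line options; the flag names are only an example."""
    parser = argparse.ArgumentParser(description="Toy web crawler")
    parser.add_argument("--config", default="crawler.yaml",
                        help="path to the YAML configuration file")
    parser.add_argument("--url", help="override the URL to query")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    config = load_config(args.config)
    if args.url:  # command-line values take precedence over the file
        config["url"] = args.url
    print(config)
```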
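
For steps 3 and 4, the "web resource" class could wrap `requests` for fetching and `html.parser`/`json` for parsing. A sketch, with class and method names that are just one possible design:

```python
import json
from html.parser import HTMLParser

import requests


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class WebResource:
    """One URL and its downloaded content, with parsing helpers."""

    def __init__(self, url):
        self.url = url
        self.response = None

    def get(self, timeout=10):
        """Download the resource; returns the HTTP status code."""
        self.response = requests.get(self.url, timeout=timeout)
        return self.response.status_code

    def links(self):
        """Extract links when the content is HTML."""
        parser = LinkExtractor()
        parser.feed(self.response.text)
        return parser.links

    def data(self):
        """Decode the content when it is JSON."""
        return json.loads(self.response.text)
```

Each URL pulled from the crawl queue shown earlier could then become one `WebResource` instance.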