# Web crawler
The goal of this app is to discover Python by creating a web crawler.
## Features
This crawler shall be able to:
- Read files for configuration:
  - URL to query
  - URL allow list
  - URL block list
  - File extension allow list
  - File extension block list
- Accept extra configuration through the command line
- Get the content of a webpage
- Go through the content of the page to find interesting elements
- Understand JSON and HTML formats
- Extract links to recursively download the whole site
- Set HTTP options via a configuration file
- Interact with the filesystem to store downloaded content
- Avoid loops (query A -> B -> A -> B -> ...), as shown in the sketch after this list
- Resume an interrupted download
- Log what's happening
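
For the loop-avoidance point, one common approach is to remember every URL already seen in a set and skip it on the second encounter. A minimal sketch, assuming a `fetch_links` callable that stands in for the real download-and-parse step:

```python
from collections import deque

def crawl(start_url, fetch_links):
    """Breadth-first crawl that never visits the same URL twice."""
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # already seen: this is what breaks A -> B -> A cycles
        visited.add(url)
        for link in fetch_links(url):  # placeholder for "download and parse"
            if link not in visited:
                queue.append(link)
    return visited
```

Persisting the `visited` set to disk is also one easy way to make a download resumable.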
## Python side
The following modules can be useful:
- requests
- html.parser
- json
- re
- argparse
- datetime
- yaml
- logging
The software architecture should make use of multiple files (one per "role" in the app) and of classes.
The configuration files can be in any format, but YAML is nice.
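
As an illustration, a YAML configuration covering the features above might look like this (every key name here is hypothetical; pick your own):

```yaml
# Hypothetical layout; key names are examples, not a required schema.
url: https://example.com
allow:
  - example.com
block:
  - ads.example.com
extensions:
  allow: [html, json]
  block: [exe, iso]
http:
  timeout: 10
  user_agent: my-crawler/0.1
```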
## How to start?
First make sure that you have a working environment running Python 3.
Install external dependencies using `pip`, like the `requests` module. Don't
forget to create a `requirements.txt` using `pip freeze`.
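
For example:

```sh
pip install requests pyyaml   # pyyaml provides the `yaml` module listed above
pip freeze > requirements.txt
```

Then proceed step by step: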
1. Handle configuration files
1. Handle the command line (a configuration/CLI sketch follows this list)
1. Create the class to handle a "web resource" (see the class sketch below)
1. Add methods to get this resource, parse it, etc.
1. ...
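
For steps 1 and 2, here is a minimal sketch combining a YAML configuration file with command-line overrides; the flag names and the `crawler.yaml` default are assumptions, not requirements:

```python
import argparse

import yaml  # provided by the pyyaml package


def load_config(path):
    """Read the YAML configuration file into a plain dict."""
    with open(path) as f:
        return yaml.safe_load(f)


def parse_args():
    """Command-line options; the flag names are only an example."""
    parser = argparse.ArgumentParser(description="Toy web crawler")
    parser.add_argument("--config", default="crawler.yaml",
                        help="path to the YAML configuration file")
    parser.add_argument("--url", help="override the URL to query")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    config = load_config(args.config)
    if args.url:  # command-line values take precedence over the file
        config["url"] = args.url
    print(config)
```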
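
For steps 3 and 4, the "web resource" class could wrap `requests` for fetching and `html.parser`/`json` for parsing. A sketch, with class and method names that are just one possible design:

```python
import json
from html.parser import HTMLParser

import requests


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class WebResource:
    """One URL and its downloaded content, with parsing helpers."""

    def __init__(self, url):
        self.url = url
        self.response = None

    def get(self, timeout=10):
        """Download the resource; returns the HTTP status code."""
        self.response = requests.get(self.url, timeout=timeout)
        return self.response.status_code

    def links(self):
        """Extract links when the content is HTML."""
        parser = LinkExtractor()
        parser.feed(self.response.text)
        return parser.links

    def data(self):
        """Decode the content when it is JSON."""
        return json.loads(self.response.text)
```

Each URL pulled from the crawl queue shown earlier could then become one `WebResource` instance.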