# Crawl framework

Participants: Vishy, Shishir, Venu, Mohan Sri Harsha, Ram

## URL Schema

```json
{
    "_id": "",
    "url": "",
    "domain": "",
    "latest_document_id": "",
    "latest_document_content_hash": ""
}
```

### URL Schedule schema

```json
{
    "_id": "",
    "url": "",
    "domain": "",
    "last_run": "",                  # Timestamp in epoch format
    "last_success_run": "",
    "next_run": "",
    "first_run": "",
    "nvisit": "",                    # Number of times the crawler visited this URL
    "nsuccess": "",
    "nfailed": "",
    "default_visit_interval": "",    # Max visit interval by default
    "deduced_visit_interval": ""     # Interval decided by the scheduler at which this URL should be picked
}
```

### Document Schema

```json
{
    "_id": "",
    "url": "",
    "domain": "",
    "parent_id": "",
    "fields": [
        {"key": "title", "value": ""},
        {"key": "text", "value": ""},
        {"key": "tags", "value": ""}
    ],
    "dt_published": "",
    "dt_updated": "",
    "dt_crawled": "",
    "content": "",
    "content_hash": ""
}
```

## Kinds of crawling

- Unstructured
- Semi-structured
- Structured

## Types of crawling

- HTML
- PDF
- JSON downloads
- XML downloads
- CSV
- Image crawl

## Crawl requirements

- User agent
- Proxy

## Crawl settings

- URL_TIME_OUT
- MAX_CONTENT_SIZE
- MAX_RETRIES
- DEPTH_LIMIT
- Allowed domains
- FETCH_DELAY - Delay between every request
- DNS cache
- Max URL length - URLs with length > 2048 won't work

## Canonicalise URL to handle duplicates

- http/https
- With/without www
- Ending with/without slash
- URL fragments, e.g. #bottom, #comments
- URL variants/redirects

## Duplicates handling

- When doing a deep crawl, the URLs present in the menu bar of the website will show up on every page request.
- These URLs shouldn't be crawled again and again; we should crawl each of them only once.

## Id construction

- Id construction should be done on the canonicalized URL, so we reduce duplicates.
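The canonicalisation rules and id construction above can be folded into one helper. The following is a minimal Python sketch, assuming http/https and www/no-www variants are treated as the same URL, fragments are dropped, and the `_id` is a SHA-1 hex digest of the canonical URL; the exact normalisation rules (especially query-string handling) and the hash function are assumptions, not decisions recorded here.

```python
# Minimal sketch: canonicalise a URL and derive the document _id from it.
# Assumptions: http ~ https, leading "www." stripped, fragments dropped, SHA-1 for the id.
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url: str) -> str:
    """Normalise a URL so trivial variants map to the same string."""
    parts = urlsplit(url.strip())
    scheme = "http" if parts.scheme in ("http", "https") else parts.scheme
    netloc = parts.netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[len("www."):]
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment (#bottom, #comments, ...); keep the query string.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def url_id(url: str) -> str:
    """Build the _id from the canonicalized URL so duplicate variants collapse."""
    return hashlib.sha1(canonicalize_url(url).encode("utf-8")).hexdigest()


assert url_id("https://www.example.com/a/#comments") == url_id("http://example.com/a")
```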
## Logs and metrics

- Need better logs for debugging
- Need to store stats so we can build dashboards of:
  - Number of URLs crawled per second
  - Average response time
  - Number of failed requests
  - etc.

## Expectations

- Scalability: able to launch N workers at a time to crawl a huge website with around 50 million documents.
- Real-time crawling: the gap between a document's published date and its crawled date should be < 30 minutes.
- Real-time document consumption: crawled documents should be added to Elasticsearch in < 10 minutes.
- Adaptive scheduling: able to schedule the next run of a URL based on the website's update frequency.

## Notes

### Oct 1st 2019

#### Schema

- URL schema
  - Contains information about the URL and its id
  - Contains the latest document id crawled from this URL
- URL Schedule schema
  - Contains the URL, its id and meta information
  - Contains all the crawl information: when it was first crawled, last crawled, next run, etc.
  - The scheduler uses this to find the next run of the URL
- Document schema
  - Actual data crawled from a URL
  - Contains data like title, text, dt_published, etc.
  - One document captures a snapshot of a URL as crawled at that time
  - One URL might have multiple document versions; the document with the latest dt_crawled is the latest one

#### Components

- Discovery queue
  - Holds all the new URLs which need to be crawled
  - All the URLs provided by the user are stored here
  - New URLs identified by the crawler are sent to this queue
  - The scheduler periodically picks URLs from this queue and sends them to the crawl queue
- Crawl queue
  - Holds all the URLs which are ready to crawl
  - Crawlers fetch URLs from this queue and do the crawl
  - Only the scheduler sends URLs to this queue for crawling
- Documents queue
  - All crawled documents are pushed to this queue by the crawler
  - Data processing starts from this queue
  - One URL might have multiple versions of documents in this queue
- Job status queue
  - The crawler sends the status of each crawled URL to this queue
  - Success
    - Successfully crawled the URL and a new document was written to the documents queue
    - The scheduler picks the URL and reduces the next visit interval
    - E.g. if the current visit interval is 6 hours, the scheduler might reduce it to 4 hours
  - Failed
    - Failed to crawl the URL; the website might be down, or there was some other network issue
    - The scheduler will schedule it to be crawled immediately
  - No Change
    - Successfully crawled but no change in the page
    - The scheduler will increase the next visit interval
    - E.g. if the current visit interval is 6 hours, the scheduler might increase it to 8 hours
- Scheduler
  - Decides when to crawl a particular URL and sends it to the crawl queue
  - Depends on the status queue to decide when a particular URL should be crawled (see the interval-adjustment sketch at the end of these notes)
- Crawler
  - Crawls the given URL, prepares a document out of it and sends it to the documents queue
  - Finds new URLs and sends them to the discovery queue
- Transformer
  - Processes the documents from the documents queue and converts them to a standard format
- URL flow
  - A URL's starting point is the discovery queue
  - The scheduler picks the URL, checks the last run/next run and, based on that, sends it to the crawl queue
  - Once the URL is crawled by the crawler, on success it sends the URL to the status queue

### Arch diagram

![](https://i.imgur.com/ZSrHb5S.png)

#### Data location

- We need to continue maintaining the existing mongo collections
- So we need to parse the crawled data and prepare individual collections, basically at source level, the way we have them now
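The adaptive scheduling described under the job status queue (shrink the visit interval when a page changed, grow it when nothing changed, retry immediately on failure) could look roughly like the sketch below. The 0.75/1.5 adjustment factors, the 30-minute lower bound and the use of default_visit_interval as the upper bound are assumptions for illustration, not decisions recorded in these notes.

```python
# Minimal sketch of the scheduler's interval adjustment based on the job status
# queue. Field names follow the URL Schedule schema; factors and bounds are assumed.
import time

MIN_INTERVAL = 30 * 60    # assumed lower bound: 30 minutes
SHRINK, GROW = 0.75, 1.5  # assumed adjustment factors


def next_schedule(schedule: dict, status: str, now: float | None = None) -> dict:
    """Update a URL Schedule document in place after a crawl attempt."""
    now = now if now is not None else time.time()
    interval = schedule.get("deduced_visit_interval") or schedule["default_visit_interval"]
    schedule["nvisit"] = schedule.get("nvisit", 0) + 1
    schedule["last_run"] = now

    if status == "failed":
        # Website down / network issue: schedule an immediate retry.
        schedule["nfailed"] = schedule.get("nfailed", 0) + 1
        schedule["next_run"] = now
        return schedule

    schedule["nsuccess"] = schedule.get("nsuccess", 0) + 1
    schedule["last_success_run"] = now
    if status == "success":
        # Page changed: visit more often (e.g. 6h -> 4.5h with these factors).
        interval = max(MIN_INTERVAL, interval * SHRINK)
    elif status == "no_change":
        # Page unchanged: back off, capped at the default (max) visit interval.
        interval = min(schedule["default_visit_interval"], interval * GROW)

    schedule["deduced_visit_interval"] = interval
    schedule["next_run"] = now + interval
    return schedule
```

With the 6-hour example from the notes, repeated "success" statuses pull the interval down towards 4–4.5 hours, while "no_change" pushes it back up towards the default maximum.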