# Crawl framework
Participants: `Vishy, Shishir, Venu, Mohan Sri Harsha, Ram`
## URL Schema
```json
{
"_id": "",
"url": "",
"domain": "",
"latest_document_id": "",
"latest_document_content_hash": ""
}
```
### URL Schedule schema
```json
{
"_id": "",
"url": "",
"domain": "",
"last_run": "" # Timestamp in epoch format,
"last_success_run": "",
"next_run": "",
"first_run": "",
"nvisit": "" # Number of times crawler visited this URL,
"nsuccess": "",
"nfailed": "",
"default_visit_interval": "" # Max visit interval by default,
"deduced_visit_interval": "" # Interval decided by the scheduler at which this url should be picked
}
```
### Document Schema
```json
{
"_id": "",
"url": "",
"domain": "",
"parent_id": "",
"fields": [
{
"key": "title",
"value": ""
},
{
"key": "text",
"value": ""
},
{
"key": "tags",
"value": ""
}
],
"dt_published": "",
"dt_updated": "",
"dt_crawled": "",
"content": "",
"cotent_hash": ""
}
```
## Kinds of crawling
- Unstructured
- Semi-structured
- Structured
## Types of crawling
- HTML
- PDF
- JSON downloads
- XML downloads
- CSV
- Image crawl
## Crawl requirements
- User Agent
- Proxy
## Crawl settings
- URL_TIME_OUT
- MAX_CONTENT_SIZE
- MAX_RETRIES
- DEPTH_LIMIT
- Allowed domains
- FETCH_DELAY
  - Delay between every request
- DNS cache
- Max URL length
  - URLs with length > 2048 won't work
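A minimal sketch of how these settings could be grouped into one config object. The key names loosely mirror the bullets above and the values are placeholders for illustration, not agreed-upon defaults.

```python
# Hypothetical defaults for the crawl settings listed above; values would be tuned per source.
CRAWL_SETTINGS = {
    "URL_TIME_OUT": 30,                     # seconds to wait before giving up on a request
    "MAX_CONTENT_SIZE": 10 * 1024 * 1024,   # skip responses larger than 10 MB
    "MAX_RETRIES": 3,                       # retries on network/5xx failures
    "DEPTH_LIMIT": 5,                       # how deep to follow links from the seed URL
    "ALLOWED_DOMAINS": ["example.com"],     # stay within these domains
    "FETCH_DELAY": 1.0,                     # delay (seconds) between every request
    "DNS_CACHE_ENABLED": True,              # cache DNS lookups across requests
    "MAX_URL_LENGTH": 2048,                 # URLs longer than this are dropped
}
```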
## Canonicalise URL to handle duplicates
- http/https
- with/without www
- Ending with/without slash
- URL fragments, Eg: #bottom, #comments
- URL variants/redirects
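A rough sketch of a canonicalisation step covering the cases above (scheme, www, trailing slash, fragments). `canonicalize_url` is an illustrative helper, not an agreed-upon implementation; URL variants/redirects would still have to be resolved at fetch time.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize_url(url: str) -> str:
    """Normalise a URL so that trivial variants map to the same string."""
    parts = urlsplit(url.strip())
    scheme = "https"                                  # treat http/https as the same
    netloc = parts.netloc.lower()
    if netloc.startswith("www."):                     # with/without www
        netloc = netloc[len("www."):]
    path = parts.path.rstrip("/") or "/"              # ending with/without slash
    # Drop fragments like #bottom or #comments; keep the query string as-is.
    return urlunsplit((scheme, netloc, path, parts.query, ""))

# canonicalize_url("http://www.example.com/news/") -> "https://example.com/news"
# canonicalize_url("https://example.com/news#top") -> "https://example.com/news"
```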
## Duplicates handling
- When doing a deep crawl, the URLs present in the menu bar of the website will be identified on every page request.
- These URLs shouldn't be crawled again and again; we should crawl each of them only once.
## Id construction
- ID construction should be done on the canonicalized URL so that duplicates are reduced
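One possible way to build the `_id`, reusing the `canonicalize_url` helper sketched above; the choice of a SHA-1 hex digest is an assumption for illustration, not a decision from the discussion.

```python
import hashlib

def url_id(url: str) -> str:
    """Build a stable _id from the canonicalized URL so duplicates collapse to one record."""
    canonical = canonicalize_url(url)    # helper sketched in the canonicalisation section
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Both variants produce the same _id, so menu-bar links found on every page
# during a deep crawl map to a URL record that is crawled only once.
assert url_id("http://www.example.com/about/") == url_id("https://example.com/about")
```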
## Logs and metrics
- Need better logs for debugging
- Need to store stats so we can build dashboards of
  - Num URLs crawled per second
  - Avg response time
  - Num requests failed
  - etc.
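A small sketch of the counters these dashboards would need. The in-memory class and metric names are illustrative only; a real setup would push these to a metrics store.

```python
import time
from collections import Counter

class CrawlStats:
    """Illustrative in-memory counters for the dashboard metrics listed above."""

    def __init__(self) -> None:
        self.counters = Counter()          # e.g. urls_crawled, requests_failed
        self.total_response_time = 0.0
        self.started_at = time.time()

    def record_response(self, elapsed: float, failed: bool = False) -> None:
        self.counters["urls_crawled"] += 1
        self.total_response_time += elapsed
        if failed:
            self.counters["requests_failed"] += 1

    def snapshot(self) -> dict:
        crawled = self.counters["urls_crawled"] or 1
        return {
            "urls_per_second": self.counters["urls_crawled"] / (time.time() - self.started_at),
            "avg_response_time": self.total_response_time / crawled,
            "requests_failed": self.counters["requests_failed"],
        }
```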
## Expectations
- Scalability: Able to launch N workers at a time to crawl a huge website which has around 50 million documents.
- Real-time crawling: The difference between a document's date published and date crawled should be <30 mins.
- Real-time document consumption: Crawled documents should be added to ElasticSearch in <10 mins.
- Adaptive scheduling: Able to schedule the next run of a URL based on the website's update frequency.
## Notes
### Oct 1st 2019
#### Schema
- URL schema
  - Contains information about the URL and its ID
  - Contains the latest document ID crawled from this URL
- URL Schedule schema
  - Contains the URL, its ID, and meta information
  - Contains all the crawl information, like when it was first crawled, last crawled, next run, etc.
  - The scheduler uses this to find the next run of the URL
- Document schema
  - Actual data crawled from a URL
  - Contains data like title, text, dt_published, etc.
  - One document captures the snapshot of a URL crawled at that time
  - One URL might have multiple document versions
  - The document with the latest dt_crawled is the latest document
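Since one URL can have several document versions, the latest snapshot can be picked by sorting on `dt_crawled`. A sketch with pymongo, assuming a `documents` collection shaped like the Document schema above (the connection string and db/collection names are assumptions).

```python
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
documents = client["crawl"]["documents"]            # assumed db/collection names

def latest_document(url: str):
    """Return the most recent snapshot of a URL: the highest dt_crawled wins."""
    return documents.find_one(
        {"url": url},
        sort=[("dt_crawled", DESCENDING)],
    )
```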
#### Components
- Discovery queue
  - Holds all the new URLs which need to be crawled
  - All the URLs provided by the user will be stored here
  - New URLs identified by the crawler will be sent to this queue
  - The scheduler picks URLs from this queue and sends them to the crawl queue periodically
- Crawl queue
  - Holds all the URLs which are ready to be crawled
  - Crawlers fetch URLs from this queue and do the crawl
  - Only the scheduler sends URLs to this queue for crawling
- Documents queue
  - All the crawled documents are pushed to this queue by the crawler
  - Data processing starts from this queue
  - One URL might have multiple versions of documents in this queue
- Job status queue
  - The crawler sends the status of each crawled URL to this queue (see the scheduling sketch after this list)
  - Success
    - Successfully crawled the URL and a new document is written to the documents queue
    - The scheduler picks the URL and reduces the next visit interval
      - Eg: if the current visit interval is 6 hours, the scheduler might reduce it to 4 hours
  - Failed
    - Failed to crawl the URL; the website might be down, or there was some other network issue
    - The scheduler will schedule it to be crawled again immediately
  - No Change
    - Successfully crawled, but no change in the page
    - The scheduler will increase the next visit interval
      - Eg: if the current visit interval is 6 hours, the scheduler might increase it to 8 hours
- Scheduler
  - Decides when to crawl a particular URL and sends it to the crawl queue
  - Depends on the job status queue to decide when a particular URL should be crawled
- Crawler
  - Crawls the given URL, prepares a document out of it, and sends it to the documents queue
  - Finds new URLs and sends them to the discovery queue
- Transformer
  - Processes the documents from the documents queue and converts them to a standard format
- URL flow
  - A URL's starting point is the discovery queue
  - The scheduler picks the URL, checks its last_run/next_run, and based on that sends it to the crawl queue
  - Once the URL is crawled by the crawler, its status is sent to the job status queue
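A sketch of how the scheduler could react to the job status queue and compute the next run, following the Success / Failed / No Change rules above. The 2/3 and 4/3 factors, the 30-minute floor, and the assumption that intervals and timestamps are epoch seconds are all illustrative, not agreed-upon numbers.

```python
import time

def update_schedule(schedule: dict, status: str) -> dict:
    """Adjust deduced_visit_interval and next_run based on the crawl status.

    `schedule` is a record following the URL Schedule schema; intervals and
    timestamps are assumed to be in epoch seconds.
    """
    now = int(time.time())
    interval = schedule.get("deduced_visit_interval") or schedule["default_visit_interval"]

    if status == "success":
        # Page changed: visit more often, e.g. 6 hours -> 4 hours.
        interval = max(interval * 2 / 3, 30 * 60)        # floor at 30 minutes (assumption)
    elif status == "no_change":
        # Page unchanged: back off, e.g. 6 hours -> 8 hours, capped at the default max.
        interval = min(interval * 4 / 3, schedule["default_visit_interval"])
    elif status == "failed":
        # Website down or some other network issue: retry immediately.
        interval = 0

    schedule["deduced_visit_interval"] = int(interval)
    schedule["last_run"] = now
    schedule["next_run"] = now + int(interval)
    schedule["nvisit"] = schedule.get("nvisit", 0) + 1
    if status == "failed":
        schedule["nfailed"] = schedule.get("nfailed", 0) + 1
    else:
        schedule["nsuccess"] = schedule.get("nsuccess", 0) + 1
        schedule["last_success_run"] = now
    return schedule
```

The scheduler would then push the URL back to the crawl queue once `next_run` is reached.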
### Arch diagram

#### Data location
- We need to continue maintaining the existing mongo collections
- So, we need to parse the crawled data and prepare individual collections at the source level, the way we have them now.
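A rough sketch of what writing transformed documents back into per-source Mongo collections could look like; the one-collection-per-domain naming and the connection/db names are assumptions for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
db = client["crawl"]                                # assumed database name

def write_to_source_collection(document: dict) -> None:
    """Upsert a transformed document into a per-source (per-domain) collection."""
    db[document["domain"]].replace_one(
        {"_id": document["_id"]}, document, upsert=True
    )
```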