# EPF - Update 3

I didn't originally have a plan for the future of the project, but I think what needs to happen is becoming clearer now.

### The past

The original setup had an infinitely-growing crawler database because it was insert-only. This would eventually lead to a lockup, because reading the crawler database for each update of the API database would take too long.

![The original design of the crawler/API database setup](https://hackmd.io/_uploads/Bkaobi_l6.png)

### The present

In my [last update](https://hackmd.io/@angaz/B1sPsVAR2), I explained how I changed the architecture a bit to stop that problem from happening.

![The current design of the crawler/API database setup](https://hackmd.io/_uploads/ByeyGoul6.png)

### The future (Probably)

One of the things I'd like to have for the future is a history of the crawled and discovered nodes. The latter is something the current setup doesn't support very well.

![The future setup of the database, crawler, and API](https://hackmd.io/_uploads/Bk3LBj_ea.png)

The crawler and API will share the same database. The crawler will be responsible for updating and aggregating the data, and the API will just query the aggregated data for the home page, and the crawl data for the node-specific page.

My idea is for the crawler to start up N processes/goroutines, each assigned a "slice" of the node IDs to crawl. Each process will `SELECT` the least-recently crawled node within its node ID range, collect its data, update the row, and repeat forever. As long as only one crawler is connected to the database, this should work without needing to lock the database for the entire duration of the crawl. We can use a lock file to ensure only one crawler runs at a time. (A sketch of this loop is at the end of this post.)

![The crawler sharding system](https://hackmd.io/_uploads/H1w2isOxT.png)

Updating a few thousand to a few tens of thousands of nodes per day shouldn't be a problem for a single database, locking-wise. The [Write-Ahead Log](https://www.sqlite.org/wal.html) can help here because it doesn't require an `fsync()` of the database file for each write. It also helps with concurrency, since reads and writes don't block each other. This feature will help a lot on exceptionally slow storage, such as a MicroSD card.

My original plan was to allow a user to submit their node so it could be crawled, but I'm not sure this is a good way to do it. If the node is getting any peers at all, the discovery protocol will find it eventually, and it will be added to the list of nodes to be crawled. If no history shows up after some time, that's a pretty good sign your setup is not exposing your node correctly, and having a history of the last N days on the node's page can help with debugging.

Perhaps a crawler feature to connect to a specific set of nodes, instead of the entire network, could also be useful for self-hosting your own external monitoring. You could control the frequency of updates, which could be more frequent than what the standard crawler is configured with.
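
### Sketching it in code

To make the lock-file and WAL ideas above a bit more concrete, here's a minimal sketch of how the crawler could open its database. It assumes Go with `database/sql` and the `github.com/mattn/go-sqlite3` driver; the file names, the lock-file scheme, and the `synchronous` setting are assumptions, not decisions.

```go
package main

import (
	"database/sql"
	"fmt"
	"os"

	_ "github.com/mattn/go-sqlite3" // assumption: any SQLite driver would do
)

// openCrawlerDB grabs a simple lock file and opens the SQLite database in
// WAL mode. The lock file is just a sketch: it doesn't handle a stale lock
// left behind by a crashed crawler.
func openCrawlerDB(dbPath, lockPath string) (*sql.DB, error) {
	// O_EXCL makes the create fail if the file already exists, which we
	// treat as "another crawler is already using this database".
	lock, err := os.OpenFile(lockPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, fmt.Errorf("another crawler appears to be running: %w", err)
	}
	fmt.Fprintf(lock, "%d\n", os.Getpid())
	lock.Close()

	db, err := sql.Open("sqlite3", dbPath)
	if err != nil {
		return nil, err
	}

	// journal_mode is persistent in the database file, but other PRAGMAs are
	// per-connection, so keep the pool at one connection for simplicity.
	db.SetMaxOpenConns(1)

	// WAL mode: readers and writers stop blocking each other, and commits no
	// longer force an fsync of the main database file on every write.
	var mode string
	if err := db.QueryRow(`PRAGMA journal_mode = WAL`).Scan(&mode); err != nil {
		return nil, err
	}

	// Assumption: NORMAL is an acceptable durability trade-off here; it cuts
	// syncs down to checkpoints, which matters a lot on slow storage.
	if _, err := db.Exec(`PRAGMA synchronous = NORMAL`); err != nil {
		return nil, err
	}

	return db, nil
}

func main() {
	db, err := openCrawlerDB("crawler.db", "crawler.lock")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	// ... start the crawl workers here ...
}
```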
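
And here is a rough sketch of the per-slice crawl loop itself. The `nodes` table, its columns, and the `crawlNode` helper are placeholders for whatever the real schema and crawl logic end up being; the point is just the SELECT-oldest / crawl / UPDATE cycle inside each worker's slice of the node ID space.

```go
package main

import (
	"database/sql"
	"sync"
	"time"

	_ "github.com/mattn/go-sqlite3" // assumption: any SQLite driver would do
)

// nodeInfo is a hypothetical stand-in for whatever gets collected during a crawl.
type nodeInfo struct {
	ClientName string
}

// crawlNode is a placeholder for dialing the node and reading its Hello/Status data.
func crawlNode(nodeID string) (nodeInfo, error) {
	return nodeInfo{ClientName: "unknown"}, nil
}

// crawlSlice is one worker. It repeatedly picks the least-recently crawled
// node whose ID falls in [lo, hi), crawls it, and writes the result back.
// The nodes table and its columns are assumptions about the schema.
func crawlSlice(db *sql.DB, lo, hi string) {
	for {
		var nodeID string
		err := db.QueryRow(
			`SELECT node_id
			   FROM nodes
			  WHERE node_id >= ? AND node_id < ?
			  ORDER BY last_crawled ASC
			  LIMIT 1`,
			lo, hi,
		).Scan(&nodeID)
		if err != nil {
			// Empty slice or transient error; back off and try again.
			time.Sleep(time.Minute)
			continue
		}

		info, crawlErr := crawlNode(nodeID)
		errText := ""
		if crawlErr != nil {
			errText = crawlErr.Error()
		}

		// Error handling elided for the sketch.
		_, _ = db.Exec(
			`UPDATE nodes
			    SET last_crawled = ?, client_name = ?, crawl_error = ?
			  WHERE node_id = ?`,
			time.Now().Unix(), info.ClientName, errText, nodeID,
		)
	}
}

func main() {
	// openCrawlerDB from the previous sketch would slot in here.
	db, err := sql.Open("sqlite3", "crawler.db")
	if err != nil {
		panic(err)
	}

	// Slice the node ID space by first hex character: 16 workers, each
	// owning one sixteenth of the (lowercase hex) node IDs.
	hexDigits := "0123456789abcdef"
	var wg sync.WaitGroup
	for i := 0; i < len(hexDigits); i++ {
		lo := hexDigits[i : i+1]
		hi := "g" // sorts after "f", so the last slice is open-ended
		if i+1 < len(hexDigits) {
			hi = hexDigits[i+1 : i+2]
		}
		wg.Add(1)
		go func(lo, hi string) {
			defer wg.Done()
			crawlSlice(db, lo, hi)
		}(lo, hi)
	}
	wg.Wait()
}
```

Because each worker only ever touches its own slice of node IDs, no two workers compete for the same row, and SQLite's single-writer lock is only held for the short `UPDATE` at the end of each crawl rather than for the whole crawl.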