EPF - Update 2

The setup

I'm running the node crawler on its own Linode Nanode instance (1 shared CPU, 1GB of memory), as I mentioned in the last update. You can see it here.

But I ran into a problem

The problem

The system is made up of an API and a crawler, which run as separate processes. In the NixOS setup, these run as separate SystemD services; in Docker Compose, they are separate containers.

The crawler process has its own SQLite database file where it writes all of the nodes it has scraped, and the API process has its own SQLite database containing a filtered/aggregated dataset, which is updated periodically.

The original setup had the API querying for all the new rows inserted into the crawler database each time it updated the nodes. This worked fine for a day or two. But after two to three days, the crawler database had grown so big that it no longer fit into memory, and reading all the new nodes took a considerable amount of time. This caused so many issues in the crawler and API that everything came to a halt.

DELETE ALL THE DATA

My solution to this problem was to delete the data from the crawler database while updating the API database. This way, the crawler database should never grow, because the API will come along eventually and delete everything. SQL has a pretty nice feature where a DELETE statement can return all the rows it deleted. A few other statements support the same idea, so check the docs for your database to see how it works. This is exactly what the API updater does: it starts a transaction, runs DELETE FROM nodes RETURNING ..., inserts the returned rows into its own database, and then calls COMMIT on the crawler database, leaving it empty. The crawler is blocked from writing between the DELETE and the COMMIT, but even on this really resource-constrained hardware, the whole process takes less than a second.
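To make the flow concrete, here's a minimal sketch of that updater step in Go, assuming a hypothetical nodes table with just id and record columns, the mattn/go-sqlite3 driver, and a SQLite build new enough for RETURNING (3.35+). The real crawler's schema and processing are more involved; this only shows the DELETE ... RETURNING / copy / COMMIT shape.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

// node mirrors a hypothetical two-column nodes table; the real schema
// has more columns.
type node struct {
	ID     string
	Record string
}

// moveNodes drains the crawler database into the API database inside one
// crawler-side transaction: DELETE ... RETURNING, copy the rows, COMMIT.
func moveNodes(crawlerDB, apiDB *sql.DB) error {
	crawlerTx, err := crawlerDB.Begin()
	if err != nil {
		return err
	}
	defer crawlerTx.Rollback() // no-op once Commit succeeds

	// DELETE ... RETURNING hands back every deleted row (SQLite 3.35+).
	rows, err := crawlerTx.Query(`DELETE FROM nodes RETURNING id, record`)
	if err != nil {
		return err
	}
	defer rows.Close()

	var deleted []node
	for rows.Next() {
		var n node
		if err := rows.Scan(&n.ID, &n.Record); err != nil {
			return err
		}
		deleted = append(deleted, n)
	}
	if err := rows.Err(); err != nil {
		return err
	}
	rows.Close() // finish reading before touching the other database

	// Copy the rows into the API database. The real updater filters and
	// aggregates here instead of copying verbatim.
	apiTx, err := apiDB.Begin()
	if err != nil {
		return err
	}
	defer apiTx.Rollback()

	for _, n := range deleted {
		if _, err := apiTx.Exec(
			`INSERT INTO nodes (id, record) VALUES (?, ?)`, n.ID, n.Record,
		); err != nil {
			return err
		}
	}
	if err := apiTx.Commit(); err != nil {
		return err
	}

	// Committing here releases the crawler database with an empty nodes
	// table, so the crawler can carry on writing.
	return crawlerTx.Commit()
}

func main() {
	crawlerDB, err := sql.Open("sqlite3", "crawler.db") // placeholder paths
	if err != nil {
		log.Fatal(err)
	}
	apiDB, err := sql.Open("sqlite3", "api.db")
	if err != nil {
		log.Fatal(err)
	}
	if err := moveNodes(crawlerDB, apiDB); err != nil {
		log.Fatal(err)
	}
}
```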

This worked pretty well, but there were still a few things to improve. One problem which came up quite often: if the crawler was writing to the database while the API updater tried to delete the rows, or vice versa, an error would occur because the database was locked by the other process.

Being polite

This was solved pretty easily with a parameter: busy_timeout. It sets how long the application will wait for the database to be unlocked before returning an error. I created a CLI argument for this, so it can be configured by the user. Pretty easy.
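Here's a rough sketch of how that might be wired up, using the standard library flag package and SQLite's PRAGMA busy_timeout; the flag name and default below are my own placeholders, not the crawler's actual CLI.

```go
package main

import (
	"database/sql"
	"flag"
	"fmt"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// Placeholder flag; the real CLI argument may be named differently.
	busyTimeout := flag.Duration("db-busy-timeout", 5*time.Second,
		"how long SQLite waits on a locked database before returning an error")
	flag.Parse()

	db, err := sql.Open("sqlite3", "crawler.db")
	if err != nil {
		log.Fatal(err)
	}

	// busy_timeout is set in milliseconds. With it in place, a locked
	// database means "wait and retry" instead of an immediate
	// "database is locked" error.
	pragma := fmt.Sprintf("PRAGMA busy_timeout = %d", busyTimeout.Milliseconds())
	if _, err := db.Exec(pragma); err != nil {
		log.Fatal(err)
	}
}
```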

Size sometimes matters

The other improvement is more of a nice-to-have than a necessity. Auto-vacuum is an important feature of more complicated database management systems like Postgres, where it cleans out old versions of rows, and SQLite has something similar. By default, SQLite keeps the database file the same size when you delete rows, leaving a lot of empty space behind, so it's pretty nice to be able to reclaim that space. With auto-vacuum enabled, if the API dies, or takes a long time to process the crawler database, and the file grows bigger than normal, the database shrinks again once the rows are deleted. On really cheap virtual machines with 1GB of memory, this feature is even more appreciated.
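In SQLite this is the auto_vacuum pragma. One gotcha: it only takes effect on a brand-new database file, or after a one-off VACUUM rebuilds an existing one. A small sketch, again assuming database/sql with a SQLite driver:

```go
package database

import "database/sql"

// enableAutoVacuum switches an existing SQLite database to
// auto_vacuum = FULL, so freed pages are returned to the filesystem as
// rows are deleted. The pragma only sticks on a new database, so an
// existing file needs a one-off VACUUM to rebuild it with the setting.
func enableAutoVacuum(db *sql.DB) error {
	if _, err := db.Exec("PRAGMA auto_vacuum = FULL"); err != nil {
		return err
	}
	// Rebuilds the file; cheap here because the delete-on-update scheme
	// keeps the nodes table small.
	_, err := db.Exec("VACUUM")
	return err
}
```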

Flags get their own file

I also moved all the CLI flags into their own file, since the API and crawler share more and more of them. It doesn't make sense to define a flag in the api.go file and then also use it in the crawler.go file, so a file just for flags seemed like a nice idea.
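As a rough illustration of what such a file might look like (the flag names, defaults, and the use of urfave/cli/v2 here are assumptions on my part, not the crawler's actual definitions):

```go
// flags.go: one place for CLI flags shared by the api and crawler commands,
// living in the same package as api.go and crawler.go.
package main

import (
	"time"

	"github.com/urfave/cli/v2"
)

var (
	// Path to the crawler's SQLite database, read by the API updater and
	// written by the crawler.
	crawlerDBFlag = &cli.StringFlag{
		Name:  "crawler-db",
		Usage: "path to the crawler SQLite database",
		Value: "crawler.db",
	}

	// How long SQLite waits on a locked database before erroring.
	busyTimeoutFlag = &cli.DurationFlag{
		Name:  "db-busy-timeout",
		Usage: "SQLite busy_timeout value",
		Value: 5 * time.Second,
	}
)

// api.go and crawler.go can then both reference the same definitions:
//
//	Flags: []cli.Flag{crawlerDBFlag, busyTimeoutFlag}
```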

SQLite is pretty cool

I appreciate SQLite a lot more than I did before. It seems to have all the features I would expect for a lot of projects, even some more complicated ones, and the docs are pretty nice. I will reach for it as the starting database for projects where I would previously have gone for Postgres basically every time. The single-file nature makes it easier to manage and back up, and moving to Postgres later is a pretty simple process; there's already a tool for that: pgloader.