# EPF Update 5: Refinement

Since my last update, I think the most important thing I added was metrics: both to the server via [Node Exporter](https://github.com/prometheus/node_exporter), and to the crawler itself. But there were also a lot of backend changes.

### Solving problems with data

#### Metrics

The metrics have been really big for me in solving problems, because I could see a pattern in the Grafana dashboard, think of what could fix it, make the change, and see how it looked the next day. They helped me find so many problems, and bad ideas, which I probably would not have found otherwise. Problematic queries, and changes to the crawler's logic, could easily be spotted in the graphs.

![image.png](https://hackmd.io/_uploads/SkOGeFrX6.png)

![image.png](https://hackmd.io/_uploads/rJjvlKSm6.png)

That blue dotted line is an annotation. I would make these whenever I made a big change so I could see what difference it made to the metrics. In this case, it didn't make any difference, which is good: nothing was negatively impacted.

![image.png](https://hackmd.io/_uploads/rJtrxFH7a.png)

![image.png](https://hackmd.io/_uploads/BJj9lYBQ6.png)

That big spike was due to a new index being created. This was needed for the `Crawl History` page.

![image.png](https://hackmd.io/_uploads/rywngFB7p.png)

### More frontends with more data

The website, I think, is pretty complete now. The set of features I was looking for at the beginning is all there.

#### The stats page

We have a stats page, showing off some aggregated stats. The page itself is not cached, but the raw data used for the stats is: there is a daemon which refreshes it every 5 minutes. You can see in the database queries graph how this refresh is actually taking up the most time.

![image.png](https://hackmd.io/_uploads/BJYCWKSQ6.png)

You might notice there is a filter for `Synced` or `Not Synced`. This was pretty fun to solve, and I will get to it later.

#### The nodes list page

![image.png](https://hackmd.io/_uploads/SyGuGFS7T.png)

We have the same filters as on the stats page, and the settings you chose there follow you when you navigate to this page. All the state is stored in the query parameters, so you can try to break the site by changing them. I would like to make them a bit more human-readable by using names instead of numbers, but that's pretty low on the priority list right now. We also have a search box which can be used to find a node using its IP or Node ID.

#### The node details page

![image.png](https://hackmd.io/_uploads/ry5cXtS7p.png)

The map from the last update is still here, and the Eth logo is still placed using CSS. But there have been a lot of quality-of-life improvements added here.

* All the timestamps are represented by their relative times, for the Western side of the globe who cannot conceptualize UTC for some reason.
* Names have been added to things like the network ID, and fork IDs, so we can recognise them by name, not `dce96c2d`.
* The Head Hash also has the timestamp of the block, if it's known. This is used to say whether the node is synced or not. The logic is basically: if the `Last Update` is within a minute of the `Head Hash` block timestamp, then it's in sync (see the sketch after this list).
* The `Next Crawl` timestamp I think could also be useful for the user. If they have seen a bunch of `i/o timeout` errors, like we have in the above image, and they make changes to their firewall settings, they can get an idea of when the crawler will next try to connect to their node.
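Since I'll come back to it later, here is what that synced check boils down to; a minimal sketch in Go, with names that are illustrative rather than the crawler's actual API:

```go
package crawler

import "time"

// syncedTolerance is the allowed gap between the time we last heard from the
// node and its head block's timestamp (one minute, per the logic above).
const syncedTolerance = time.Minute

// isSynced reports whether a node counts as synced: its last update must be
// within a minute of its head block's timestamp. Where the block timestamp
// comes from is covered in the "Synced or Not Synced" section below.
func isSynced(lastUpdate, headBlockTime time.Time) bool {
	diff := lastUpdate.Sub(headBlockTime)
	if diff < 0 {
		diff = -diff
	}
	return diff < syncedTolerance
}
```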
#### The crawl history page

![image.png](https://hackmd.io/_uploads/rJLG8FBXa.png)

This one, I think, will be the most useful for researchers, because we can look at the data to find all sorts of interesting things. One which stands out is how many attempts the Nethermind and Reth clients make to connect to the crawler. We can also see this in the `Node Update` graphs. Nethermind has a very periodic reconnect strategy, retrying about every 15 minutes, and it makes up the vast majority of accepted connections.

It's probably filling up the `crawl_history` table a bit too much. We might have to create something to prune this data, but as it's currently going, I don't think that will be a problem, unless you are worried about disk space. Maybe we can add a subcommand to do the pruning manually, or via a cron job, so the user decides how much data is acceptable to them.

### Changes behind the scenes

#### SQLite

While trying to debug some performance issues I identified after adding the metrics, I discovered that SQLite is not the database I thought it was. This was my table schema before my great awakening:

```sql
CREATE TABLE IF NOT EXISTS crawl_history (
    node_id     TEXT                    NOT NULL,
    crawled_at  TIMESTAMP WITH TIMEZONE NOT NULL,
    direction   TEXT                    NOT NULL,
    error       TEXT                    DEFAULT NULL,

    PRIMARY KEY (node_id, crawled_at)
);
```

I thought I could reduce the size of the database by saving the `node_id` columns as `BLOB` instead of `TEXT`, making them half the size because they would be encoded as raw bytes instead of the hex-encoded form. Instead, I discovered that SQLite data types mean absolutely nothing, and that `TIMESTAMP WITH TIMEZONE` is just `TEXT`, because SQLite doesn't have any concept of a timestamp datatype. What I've actually been saving into that field this whole time has been a string. In fact, even if I had used a "recognized" SQLite datatype, it would still let me insert whatever garbage I want into any column.

Needless to say, I was a bit annoyed with this discovery. Why pretend to have datatypes if they are not enforced? For a day or two, I considered rolling my own database based on [pebble](https://github.com/cockroachdb/pebble), the database which [Geth](https://github.com/ethereum/go-ethereum) recently switched to. But the declarative nature of SQL is quite nice. Stating what data you want, and not really having to worry about how the database engine will give it to you, I find quite compelling, so I stuck with SQLite.

I discovered a pretty recent feature added to SQLite: [3.37.0](https://www.sqlite.org/changes.html#version_3_37_0) (2021-11-27) added the `STRICT` keyword for tables, which causes the database to return an error if you try to insert the wrong datatype, or try to create a table with a fictional datatype. You will note the author's apprehension at this idea:

> STRICT tables provide a prescriptive style of data type management, for developers who prefer that kind of thing.

And from the [STRICT tables](https://www.sqlite.org/stricttables.html) page:

> Some developers appreciate the freedom that SQLite's flexible typing rules provide and use that freedom to advantage. But other developers are aghast at SQLite's flagrant rule-breaking and prefer the traditional rigid type system found in all other SQL database engines and in the SQL standard. For this latter group, SQLite supports a strict typing mode, as of version 3.37.0 (2021-11-27), that is enabled separately for each table.
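To see what the enforcement looks like in practice, here is a minimal sketch in Go against a simplified `crawl_history`-style table. The [mattn/go-sqlite3](https://github.com/mattn/go-sqlite3) driver is an assumption for the example (any driver bundling SQLite ≥ 3.37 should behave the same); it is not necessarily what the crawler uses:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumes a build bundling SQLite >= 3.37
)

func main() {
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// STRICT restricts columns to real SQLite datatypes and enforces them
	// on every insert.
	_, err = db.Exec(`
		CREATE TABLE crawl_history (
			node_id    BLOB    NOT NULL,
			crawled_at INTEGER NOT NULL,
			direction  TEXT    NOT NULL,
			error      TEXT    DEFAULT NULL,
			PRIMARY KEY (node_id, crawled_at)
		) STRICT`)
	if err != nil {
		log.Fatal(err)
	}

	// Without STRICT, SQLite would happily store this string in the INTEGER
	// column; with STRICT, the insert fails with a datatype error.
	_, err = db.Exec(
		`INSERT INTO crawl_history (node_id, crawled_at, direction) VALUES (?, ?, ?)`,
		[]byte{0xde, 0xad}, "not a timestamp", "dial",
	)
	fmt.Println(err) // expected: cannot store TEXT value in INTEGER column
}
```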
Everything seemed like it was what I was looking for, until:

> Columns that are part of the PRIMARY KEY are implicitly NOT NULL. However, even though the PRIMARY KEY has an implicit NOT NULL constraint, when a NULL value is inserted into an INTEGER PRIMARY KEY column, the NULL is automatically converted into a unique integer, using the same rules for INTEGER PRIMARY KEY on ordinary, non-strict tables.

Disagreement with the author aside, I tried it. The `INTEGER PRIMARY KEY` caveat wouldn't be a problem for us, because the primary key for all our tables is either the Node ID, or a combination of the Node ID and some other column.

So I went about re-defining the tables. I used `BLOB` wherever any kind of byte array was stored, for example the Node IDs and block hashes. All timestamps are now Unix timestamps in seconds, stored as `INTEGER`s. SQLite has a native `unixepoch` function which returns exactly this, which is why I went with seconds over milliseconds; I felt the resolution would be enough for our purposes. Everything else basically stayed the same.

I wrote a little migration function, and witnessed the change to the database size:

![image.png](https://hackmd.io/_uploads/ry45AKHX6.png)

This is the size of the database in GB: `node_crawler_database_stats{stat_name="database_bytes"} / 1e9`

Quite a significant drop. All those Node IDs and timestamps must have been taking up a lot of space as strings: the Node IDs are half the size as raw bytes, and each timestamp went from 19 bytes, plus whatever overhead strings have, down to 8 bytes. I did verify that the change in size was not due to data being lost: I counted the rows before and after, and the counts were the same for all the tables.

This is pretty significant, because it means more of the database can fit into memory/cache pages, so performance should hold up much better as the database grows.

#### Synced or Not Synced

This feature was pretty cool, because you would probably think: how would you get the block timestamps? The crawler must be connecting to some kind of service like [Etherscan](https://etherscan.io/). But no. I don't even have a local node. The node crawler does everything itself. I didn't want a reliance on some external service, and I didn't want something extra which could not run on the same 1 vCPU/1 GB of RAM machine I'm currently running the system on.

I realised that I'm already connecting to nodes on the network, so I can just ask them. After the [Status](https://github.com/ethereum/devp2p/blob/master/caps/eth.md#status-0x00) message is received, I have the client's details, so I use that to send the [GetBlockHeaders](https://github.com/ethereum/devp2p/blob/master/caps/eth.md#getblockheaders-0x03) message.

This idea also kind of solved another problem: clients were disconnecting early, because a peer needs to believe the crawler is on the same network before it will send us block headers. The solution was to copy the peer's `Status` message and send it back, along with the `GetBlockHeaders` message, and since then a lot more clients stay connected (see the sketch at the end of this section).

I also used this opportunity to fill in the gaps of blocks which were unknown. With each connection, the database is also queried for node head hash values for which we don't have the associated block. There are still some gaps, but this has almost eliminated the problem. As you can see in the graphs, we have over 1.5 million blocks in the database now.
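Here is a sketch of that message flow. The connection interface and its methods are hypothetical stand-ins, not the crawler's real code or go-ethereum's API; only the flow itself comes from the description above:

```go
package crawler

// Status carries the fields of the eth Status message we care about here
// (simplified; the real message has more fields, like the fork ID).
type Status struct {
	ProtocolVersion uint32
	NetworkID       uint64
	Head            [32]byte // the peer's head block hash
	Genesis         [32]byte
}

// Conn is a hypothetical connection to the peer being crawled.
type Conn interface {
	ReadStatus() (Status, error)
	WriteStatus(Status) error
	RequestHeaderByHash(hash [32]byte) error // sends GetBlockHeaders
}

// requestHeadHeader echoes the peer's own Status back, so the peer believes
// the crawler is on its network and doesn't disconnect early, then requests
// the header of the peer's head hash, which carries the block timestamp used
// for the synced check.
func requestHeadHeader(c Conn) error {
	status, err := c.ReadStatus()
	if err != nil {
		return err
	}
	// Echo the peer's own network ID, genesis, and head back at it.
	if err := c.WriteStatus(status); err != nil {
		return err
	}
	return c.RequestHeaderByHash(status.Head)
}
```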
A lot of work for such a simple data point, but I think this is quite important, considering the recent talk of a certain client overtaking Geth as the client with the most "Synced" nodes on mainnet.

#### Stats

The stats page required rewriting the client name parser. I tried to make it as simple as possible, while still parsing every row we have in the database. [No REGEXs](https://regexlicensing.org/) were used in the making of the parser. Errors will be printed in the API logs if anything is missed in the future. Hopefully everything is covered.

Not everything is displayed at the moment. We could also show version numbers per client, and the programming language used, but I didn't think that was super important, so it's not there yet.

I have to call out [`nimbus-eth1`](https://github.com/status-im/nimbus-eth1) for not following the naming convention. WTF is [this](https://node-crawler.angaz.io/nodes/06a8921aa2e748702d40d25e7273de606d5bd19cf706ea869528f68b061df022)?

> nimbus-eth1 v0.1.0 [linux: amd64, rocksdb, nimvm, 682b16]

Anyway, we could add more features to this in the future. The values are not saved to the database at this time, so creating filters is a bit impractical. But after some proper testing, I think it should be fine to put these parsed values into the database.
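To give an idea of why no regexes are needed: most clients use a slash-separated identifier along the lines of `Geth/v1.13.0-stable/linux-amd64/go1.21.1`, which splits cleanly on `/`. This is an illustrative sketch of that approach, not the actual parser; the type and function names are made up, and oddballs like the `nimbus-eth1` string above would need their own special cases:

```go
package parser

import "strings"

// ParsedClient holds the fields most clients encode in their identifier.
// Hypothetical type, for illustration only.
type ParsedClient struct {
	Name     string
	Version  string
	OS       string
	Arch     string
	Language string
}

// ParseClientName splits the common `name/version/os-arch/language` format.
// Identifiers that don't match are rejected so the caller can log them,
// mirroring the "errors printed in the API logs" behaviour described above.
func ParseClientName(id string) (ParsedClient, bool) {
	parts := strings.Split(id, "/")
	if len(parts) < 4 {
		return ParsedClient{}, false
	}

	p := ParsedClient{
		Name:     strings.ToLower(parts[0]),
		Version:  parts[1],
		Language: parts[3],
	}

	// The third field is usually `os-arch`, e.g. `linux-amd64`.
	if osArch := strings.SplitN(parts[2], "-", 2); len(osArch) == 2 {
		p.OS, p.Arch = osArch[0], osArch[1]
	} else {
		p.OS = parts[2]
	}

	return p, true
}
```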