# EPF Update 4: All New

### The New Crawler

I tried the ideas outlined in my [previous update](https://hackmd.io/@angaz/r1DYC5Ox6). It seems that the database can keep up with the write workload. There was an issue where the database is locked while writing the WAL records, and while this is happening, it doesn't seem to obey the [busy_timeout](https://www.sqlite.org/c3ref/busy_timeout.html) setting. A simple retry system seemed to make it work, and I haven't seen the issue since. There were also some bugs which are now fixed.

We are getting Nethermind nodes! I don't know why this is happening now, but some nodes send a `Ping` before the status message, and this messed up the previous crawler because it expected the messages in a very specific order; if a client did something in the wrong order, it would return an error. The new crawler has a loop where it reads messages until it has both the `hello` and `status` messages, at which point it sends the disconnect message and saves the data into the database. This seems to be much more reliable, and I think simpler, than the previous method.

There still aren't any Reth clients. I think we would have to find someone from Reth to have a look and see what we can find. Maybe it's something similar to the previous issue? Or is it one of the clients closing the connection with `useless peer` based on our status message?

The new table layout allows us to save a lot more data about the crawling. We have a new `crawl_history` table, which saves the crawl history, as well as any errors. This is useful to see, for example, that your node was accessible before and is not anymore. Separating the discovery from the crawler also allows us to keep track of the discovered nodes, in the `discovered_nodes` table, and make sure to crawl all of them.

There are over 180k nodes discovered after running for about a day. That's pretty crazy. The crawler still has to crawl 110k of them. There are a lot of useless nodes, which are scheduled to be crawled again in 14 days' time, so they will be crawled again, but not as often as the "real" nodes we want to keep updated. The discovered nodes are found much faster than the crawler can crawl them, but over time this slows down, and the crawler can catch up. Hopefully it's pretty stable after a few days.

We can also see that there are a lot of nodes behind some specific IPs, which is quite suspicious. The top IP doesn't even have any of the ports exposed, so we can't connect to it to see what's going on. It could be someone running the original node-crawler: it would create a new discovery session for v4 and v5, and re-create these every 5 minutes by default, which would leave a lot of distinct node records behind a single IP.

```
$ sqlite3 crawler.db_v2 'SELECT ip_address, COUNT(*) AS count FROM discovered_nodes GROUP BY ip_address ORDER BY count DESC LIMIT 10;'
```

| IP Address      | Count |
|-----------------|-------|
| 18.130.140.251  | 3095  |
| 74.63.254.152   | 2336  |
| 146.190.149.215 | 2175  |
| 177.54.148.201  | 1996  |
| 170.64.192.67   | 1916  |
| 161.35.159.38   | 1724  |
| 137.184.8.233   | 1596  |
| 170.64.136.244  | 1588  |
| 206.189.13.198  | 1276  |
| 107.6.113.181   | 1099  |

### The New Frontend

I haven't replaced the aggregated statistics page yet. That should come soon, but for now we have a page to list the found nodes, with a filter for the network ID. There are so many networks using the Ethereum discovery protocol that it was just insanity to list them all. You can hack the HTML to see the list. It's just `display: none;`.
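If you'd rather query the network breakdown directly instead of un-hiding the list, something along these lines should work against the crawler database. Note that the `crawled_nodes` table and `network_id` column names here are my guesses for the sake of illustration, so adjust them to the actual schema:

```
$ sqlite3 crawler.db_v2 'SELECT network_id, COUNT(*) AS count FROM crawled_nodes GROUP BY network_id ORDER BY count DESC LIMIT 10;'
```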
![Nodes List Page](https://hackmd.io/_uploads/ByddWy9W6.png)

The page for a specific node shows the information I thought would be useful. And a map, which took too many hours past midnight, via trial and error, to figure out.

The crawl history is pretty cool. We can see the timestamp of the crawl, and whether it was a `dial`ed out connection, or an `accept`ed one, meaning that the node connected to our crawler instead of the crawler initiating the connection through the discovery process. When a node is found via the latter method, the only thing which is not ideal is that the enode port will be wrong, because it will show the ephemeral port which was used for the connection. But everything else should still be correct.

And it looks like [someone's crawler](https://node-crawler.angaz.io/nodes/03783584de34893577b85fb91093d2c3f59e9e5c68aad16a75bd5b7237b15b39) found me!

![Someone's crawler found me!](https://hackmd.io/_uploads/HklBBk9bT.png)

### Some weirdness

For some reason, we don't have the client name for a large set of the nodes, which is obviously not ideal. I have no idea why this is happening. Hopefully it's something weird in my code, but it could also be something like old versions which don't report their client name? Anyway, it's good to have this information.
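To put a number on how widespread the missing names are, a query along these lines should do it. As before, the `crawled_nodes` table and `client_name` column are assumptions for the example, so the real schema may differ:

```
$ sqlite3 crawler.db_v2 'SELECT SUM(client_name IS NULL) AS missing, COUNT(*) AS total FROM crawled_nodes;'
```

In SQLite, `client_name IS NULL` evaluates to 0 or 1, so summing it counts the rows with no reported name.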