# ethernets.io - Node Crawler
This is a high-level overview of the project, mainly offering the details needed
for operating and understanding the website. In a future iteration of the
website, I would like to incorporate the information in this post into little
information bubbles you can click on to see the information where it's most
relevant.
<!-- A companion article can be found [here][technical-article] (WIP) going
through all the technical details of how the crawler works.
-->
You can read my [Updates][epf-updates] to learn more about what I was doing
while working on the project. You can also watch my
[EPF Day Presentation][epf-day-presentation] from
[Devconnect Istanbul 2023][devconnect-istanbul].
## What is the goal of the project?
I wanted to make a tool for node operators to get information about their
nodes. Initially, I just wanted to show if another node on the network could
connect to your node, so you knew your router/firewalls were set up correctly,
but then [Mario Havel][twitter-mario] suggested I revive the
[Node Crawler][github-node-crawler] project for my
[Ethereum Protocol Fellowship][epf] project. I worked on the project
permissionlessly (when I had time) as I still have a full-time job.
The goal of the project is still for node operators to get information about
their nodes, but also for researchers and core developers to get information
about the various Ethereum networks. So I'm focusing on features which will be
most useful to these groups. All crawled data is kept. Nothing is filtered or
excluded. Even non-Ethereum network data is kept. Consensus layer and portal
network clients are coming soon.
It's also very important that the project be
[GPL-compatible][wikipedia-gpl-compatibility], and that it be just as easy for
anyone to run as an Ethereum node.
These have been the main design considerations for the technologies used.
## Concepts
Most concepts should be pretty easy to understand if you are running a node,
but I think some should be explained in more detail.
The crawler not only connects to nodes found on the discovery network, but
also accepts connections, so nodes will randomly find the crawler on the
discovery network and try to connect to it as a peer. The crawler is connected
to, and crawls, both the [DiscV4][discv4] and [DiscV5][discv5] discovery
networks.
Once crawled, the node's details are saved to the database and the crawler
disconnects, so the node's peer slots are not needlessly consumed.
Accepting connections is a critical feature of the crawler for nodes which
cannot be reached because they are behind a firewall or
[Carrier-grade NAT][wikipedia-cgnat]. We would not have the details of these
nodes if they did not connect at least once to the crawler.
Since we have to wait for these nodes to connect to update their details, there
are quite a few nodes with stale data.
### Dial / Accept / Direction
![image](https://hackmd.io/_uploads/HJEx8C-U6.png)
This is how the details of the node were acquired, relative to the crawler.
So `Dial` means the crawler dialed the connection, and `Accept` means the
crawler accepted a connection from the node.
### Dial Success
![image](https://hackmd.io/_uploads/B1PKs6ZLa.png)
As we can see from the graph above, most (> 60%) of the nodes cannot accept
connections. If this is your node, please review your router/firewall
configuration so we can push that success percentage up.
### Client Identifier
A string containing a bunch of information about the node. From this, we can
extract:
- Name
- User Data / Identity (Geth-specific feature)
- Version
- Extra build info, beta versions, git commit, ...
- Operating system
- Architecture
- Programming Language / Version
Not all client identifiers contain all of this data, so you might see places
where a field is blank.
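For illustration, here's how such an identifier might be split apart. This is a hypothetical sketch, not the crawler's actual parser, and client identifier layouts vary between implementations:

```python
def parse_client_id(client_id: str) -> dict:
    """Split a '/'-delimited client identifier into its common fields.

    Illustrative only; field order and presence vary between clients.
    """
    parts = client_id.split("/")
    info = {"name": parts[0] if parts else ""}
    rest = parts[1:]

    # Geth optionally inserts a user-set identity between the name and version.
    if rest and not rest[0].startswith("v"):
        info["identity"] = rest.pop(0)

    if rest:
        # e.g. "v1.13.5-stable-916d6a44" -> version + extra build info
        version, _, build = rest.pop(0).partition("-")
        info["version"] = version
        if build:
            info["build"] = build
    if rest:
        # e.g. "linux-amd64" -> operating system + architecture
        os_arch = rest.pop(0).split("-")
        info["os"] = os_arch[0]
        if len(os_arch) > 1:
            info["arch"] = os_arch[1]
    if rest:
        # e.g. "go1.21.4" -> programming language + version
        info["language"] = rest.pop(0)
    return info

print(parse_client_id("Geth/v1.13.5-stable-916d6a44/linux-amd64/go1.21.4"))
```

The `identity` slot is the Geth-specific User Data field mentioned above; other clients skip straight from name to version.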
## How do we get the best possible view of the network?
This is a difficult thing to do since most of the nodes are not exposed to
the internet so they cannot accept connections from peers.
A crawler which is only trying to connect to nodes will have a very limited
view of the network because they will only be able to get details of nodes
which are accepting connections.
Even when nodes are properly open to connections, they very commonly already
have too many peers, so we cannot update their details. This still counts as
a successful dial attempt, even though we are unable to update the node's
details. This problem is most common with Geth; other clients seem to be less
strict with their peer limit.
## Index / Stats page
![image](https://hackmd.io/_uploads/HkqPX2bUp.png)
The goal of this page is to give you access to aggregated stats about the
various Ethereum networks. It only shows nodes which could still be found
on the discovery network in the last 24 hours. The stats are collected every
30 minutes, on the hour, and 30 minutes past the hour.
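In other words, collections land on every half-hour boundary. The next collection time after any given moment can be computed by rounding up to the next boundary (an illustrative helper, not code from the crawler):

```python
from datetime import datetime, timedelta, timezone

def next_stats_collection(now: datetime) -> datetime:
    """Round up to the next half-hour boundary (:00 or :30), when stats are collected."""
    # Drop seconds/microseconds and step back to the previous boundary...
    floored = now.replace(minute=(now.minute // 30) * 30, second=0, microsecond=0)
    # ...then the next collection is one half-hour later.
    return floored + timedelta(minutes=30)

now = datetime(2023, 11, 14, 10, 42, tzinfo=timezone.utc)
print(next_stats_collection(now))  # 2023-11-14 11:00:00+00:00
```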
### Filters
1. Filter the network - This will filter not only the network ID, but also
make sure the node's fork ID is part of the selected network.
1. Synced status - When a connection with a node is made, the head block
header is fetched from the node so we can get the timestamp of the block.
If it's within a minute of the scrape time, we mark the node as synced.
It's possible we don't know the timestamp of the head block. This is
counted as `Unsynced` on this page, and shown as `Unknown` on
other pages.
1. Next fork - This is currently disabled until the date of the Cancun fork
is known. So come back when we have this, and watch the graphs fill up
with updated nodes.
1. Each of the client names is a link which will add a filter to only show
that specific client name, and you can see the stats on the versions
for that client.
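The synced check described in filter 2 above boils down to a timestamp comparison, roughly like this (a sketch with hypothetical names, not the crawler's actual code):

```python
from datetime import datetime, timedelta, timezone

def synced_status(head_block_time, crawl_time) -> str:
    """Classify a node by comparing its head block timestamp to the crawl time."""
    if head_block_time is None:
        # Head block timestamp unknown: counted as Unsynced on the stats
        # page, shown as Unknown elsewhere.
        return "Unknown"
    if crawl_time - head_block_time <= timedelta(minutes=1):
        return "Synced"
    return "Unsynced"

crawl = datetime(2023, 11, 14, 12, 0, tzinfo=timezone.utc)
print(synced_status(crawl - timedelta(seconds=12), crawl))  # Synced
print(synced_status(crawl - timedelta(hours=2), crawl))     # Unsynced
print(synced_status(None, crawl))                           # Unknown
```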
Each of these filters adds a query parameter to the URL, so you can
share/bookmark a specific set of filters. You can also put in values which
are not given as options. You can change the `network` parameter to `56` to
see stats of the Binance Smart Chain, for example.
[Chainlist][chainlist] or [Chainid][chainid] have a bunch of networks you can
try.
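Assuming the stats page lives at the site root, such a shareable link could be built like this (the base URL here is an assumption; only the `network` parameter is confirmed above):

```python
from urllib.parse import urlencode

# Build a shareable stats URL for a non-default network.
# 56 is the chain ID of the Binance Smart Chain.
base = "https://www.ethernets.io/"
query = urlencode({"network": 56})
print(f"{base}?{query}")  # https://www.ethernets.io/?network=56
```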
At the moment, only 3 days of data are shown for each of the filtered networks,
and 1 day for the `All` filter. I'm working on a better database for this.
I would like to make the date range configurable. All the stats since the
beginning of the project exist in the database, so I would like to make them
available.
## Nodes List
![image](https://hackmd.io/_uploads/Hy5ND0bUa.png)
This is where you can filter for a specific set of nodes, or even search to
find your node.
There are similar filters to the stats page, except the network filter only
filters on the network ID; the fork ID is not taken into consideration.
The inputs should be pretty simple to understand. These will let you find your
node by IP address, node ID, or public key.
These inputs search the discovery data, so you can find nodes which could
not be crawled. If your node is there, the [Help][help-add-peer] page
explains how to add the crawler as a peer to your node, so your node will
connect to the crawler and be added to the database.
Geth lets you set an `identity` flag. This is added to the client identifier,
which you can also search for if you have it set.
The `0` to `f` row is just a shortcut to add that character as a prefix for
the node ID/public key filter.
## Node Details
![image](https://hackmd.io/_uploads/rJ_MGLGLp.png)
This page shows the details about the node I think are most useful.
Some fields of note:
- Last Found (Discovery): The last time the discovery crawler found the node.
This is the value used to show stats for "Live" nodes in the last 24 hours.
- Last Update (Crawled): The last time the details of this page were updated.
This can be from the crawler dialing, or accepting a connection.
- Next Crawl (Scheduled): Depending on the type of connection, and if it was
successful, the crawler will schedule the next time it will attempt to dial
this node again. The next dial should be pretty soon after this time. The
crawler very rarely has a backlog.
- Is Synced: This is calculated by taking the Head Hash block time, and
comparing it to the Last Update time. If these are within a minute, then the
node is considered to be synced.
- Dial Success: Calculated from the crawl history. Shows if the crawler was
able to successfully connect to this node in the last 14 days.
- Country / City: Uses the [Maxmind GeoIP 2][maxmind-geoip2] database to map
an IP address to a city.
- Enode / Record: URIs for the node; you can use these to add the node as a
peer on your own node. Enodes are used on [DiscV4][discv4], and Records are
used on [DiscV5][discv5]. Records are preferred over Enodes in the database,
as they contain a lot more data. The Enode field is constructed from the
Record.
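To illustrate that last point, an enode URI is just the node's uncompressed public key, IP address, and ports arranged into a URL. A simplified sketch (a real Record is RLP-encoded and signed; here the fields are passed in directly, and the key is a dummy value):

```python
def enode_from_record(pubkey_hex: str, ip: str, tcp_port: int, udp_port: int) -> str:
    """Assemble an enode URI from fields carried by a node record.

    Format: enode://<uncompressed pubkey, 128 hex chars>@<ip>:<tcp port>,
    with ?discport=<udp port> appended only when the UDP (discovery) port
    differs from the TCP (listening) port.
    """
    uri = f"enode://{pubkey_hex}@{ip}:{tcp_port}"
    if udp_port != tcp_port:
        uri += f"?discport={udp_port}"
    return uri

dummy_key = "ab" * 64  # placeholder for a real 64-byte public key
print(enode_from_record(dummy_key, "10.3.58.6", 30303, 30301))
```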
### Crawl History
![image](https://hackmd.io/_uploads/Bk3rGUfL6.png)
Shows the last 10 accepted and dialed connections, along with whether there
was an error. I hope this will be very useful to node operators trying to
debug peer connection or network issues. I plan to have a GitHub-style year
map of dial connection issues for each node. If the history is very important,
you can download the database snapshot and query all of it.
## Crawl History Page
![image](https://hackmd.io/_uploads/ryf1TC-8T.png)
Shows the history of dialed and accepted connections for the Node Crawler.
## Help Page
![image](https://hackmd.io/_uploads/HJmpPn7UT.png)
Shows some information to help you add the node crawler as a peer to your node,
and how to find your node's ID/public key so you can find it on the website.
## Snapshots
![image](https://hackmd.io/_uploads/HksXK7XUa.png)
Database snapshots, taken once a day at midnight UTC. Available for anyone to
download and use in their research.
[technical-article]: https://hackmd.io/@angaz/H1-pMC-8a
[epf-updates]: https://hackmd.io/@angaz?tags=%5B%22EPF%22%5D
[epf-day-presentation]: https://app.streameth.org/devconnect/epf_day/session/node_crawler
[devconnect-istanbul]: https://devconnect.org/istanbul
[twitter-mario]: https://twitter.com/TMIYChao
[github-node-crawler]: https://github.com/ethereum/node-crawler
[epf]: https://github.com/eth-protocol-fellows/cohort-four
[chainlist]: https://chainlist.org/
[chainid]: https://chainid.network/
[help-add-peer]: https://www.ethernets.io/help/#add-your-node
[wikipedia-cgnat]: https://en.wikipedia.org/wiki/Carrier-grade_NAT
[maxmind-geoip2]: https://dev.maxmind.com/geoip/geolite2-free-geolocation-data
[discv4]: https://github.com/ethereum/devp2p/blob/master/discv4.md
[discv5]: https://github.com/ethereum/devp2p/blob/master/discv5/discv5.md
[wikipedia-gpl-compatibility]: https://en.wikipedia.org/wiki/License_compatibility#GPL_compatibility