# ethernets.io - Node Crawler
This is a high-level overview of the project, mainly offering the details needed
for operating and understanding the website. In a future iteration of the
website, I would like to incorporate the information in this post into little
information bubbles you can click on to see the information where it's most
relevant.
<!-- A companion article can be found [here][technical-article] (WIP) going
through all the technical details of how the crawler works.
-->
You can read my [Updates][epf-updates] to learn more about what I was doing
while working on the project. You can also watch my
[EPF Day Presentation][epf-day-presentation] from
[Devconnect Istanbul 2023][devconnect-istanbul].
## What is the goal of the project?
I wanted to make a tool for node operators to get information about their
nodes. Initially, I just wanted to show if another node on the network could
connect to your node, so you knew your router/firewalls were set up correctly,
but then [Mario Havel][twitter-mario] suggested I revive the
[Node Crawler][github-node-crawler] project for my
[Ethereum Protocol Fellowship][epf] project. I worked on the project
permissionlessly (when I had time) as I still have a full-time job.
The goal of the project is still for node operators to get information about
their nodes, but also for researchers and core developers to get information
about the various Ethereum networks. So I'm focusing on features which will be
most useful to these groups. All crawled data is kept. Nothing is filtered or
excluded. Even non-Ethereum network data is kept. Consensus layer and portal
network clients are coming soon.
It's also very important that the project be
[GPL-compatible][wikipedia-gpl-compatibility], and that it be just as easy for
anyone to run as an Ethereum node.
These have been the main design considerations for the technologies used.
## Concepts
Most concepts should be pretty easy to understand if you are running a node,
but I think some should be explained in more detail.
The crawler not only connects to nodes found on the discovery network, but
also accepts connections, so nodes will randomly find the crawler on the
discovery network and try to connect to it as a peer. The crawler is connected
to, and crawls, both the [DiscV4][discv4] and [DiscV5][discv5] discovery
networks.
Once crawled, the node's details are saved to the database and the crawler
disconnects, so the node's peer slots are not needlessly consumed.
Accepting connections is a critical feature of the crawler for nodes which
cannot be reached because they are behind a firewall or
[Carrier-grade NAT][wikipedia-cgnat]. We would not have the details of these
nodes if they did not connect at least once to the crawler.
Since we have to wait for these nodes to connect to update their details, there
are quite a few nodes with stale data.
### Dial / Accept / Direction
![image](https://hackmd.io/_uploads/HJEx8C-U6.png)
This is how the details of the node were acquired, relative to the crawler.
So `Dial` means the crawler dialed the connection, and `Accept` means the
crawler accepted a connection from the node.
### Dial Success
![image](https://hackmd.io/_uploads/B1PKs6ZLa.png)
As we can see from the graph above, most (> 60%) of the nodes cannot accept
connections. If this is your node, please review your router/firewall
configuration so we can push that success percentage up.
### Client Identifier
A string containing a bunch of information about the node. From this, we can
extract:
- Name
- User Data / Identity (Geth-specific feature)
- Version
- Extra build info, beta versions, git commit, ...
- Operating system
- Architecture
- Programming Language / Version
Not all client identifiers contain all of this data, so you might see places
where a field is blank.
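For illustration, here's how such an identifier might be split apart. This is a hypothetical sketch, not the crawler's actual parser, and client identifier layouts vary between implementations:

```python
def parse_client_id(client_id: str) -> dict:
    """Split a '/'-delimited client identifier into its common fields.

    Illustrative only; field order and presence vary between clients.
    """
    parts = client_id.split("/")
    info = {"name": parts[0] if parts else ""}
    rest = parts[1:]

    # Geth optionally inserts a user-set identity between the name and version.
    if rest and not rest[0].startswith("v"):
        info["identity"] = rest.pop(0)

    if rest:
        # e.g. "v1.13.5-stable-916d6a44" -> version + extra build info
        version, _, build = rest.pop(0).partition("-")
        info["version"] = version
        if build:
            info["build"] = build
    if rest:
        # e.g. "linux-amd64" -> operating system + architecture
        os_arch = rest.pop(0).split("-")
        info["os"] = os_arch[0]
        if len(os_arch) > 1:
            info["arch"] = os_arch[1]
    if rest:
        # e.g. "go1.21.4" -> programming language + version
        info["language"] = rest.pop(0)
    return info

print(parse_client_id("Geth/v1.13.5-stable-916d6a44/linux-amd64/go1.21.4"))
```

The `identity` slot is the Geth-specific User Data field mentioned above; other clients skip straight from name to version.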
## How do we get the best possible view of the network?
This is a difficult thing to do since most of the nodes are not exposed to
the internet so they cannot accept connections from peers.
A crawler which is only trying to connect to nodes will have a very limited
view of the network because they will only be able to get details of nodes
which are accepting connections.
Even when nodes are properly open to connections, they very commonly already
have too many peers, so we cannot update their details. This still counts as
a successful dial attempt, even though we are unable to update the node's
details. This problem is most common with Geth; other clients seem to be less
strict with their peer limit.
## Index / Stats page
![image](https://hackmd.io/_uploads/HkqPX2bUp.png)
The goal of this page is to give you access to aggregated stats about the
various Ethereum networks. It only shows nodes which could still be found
on the discovery network in the last 24 hours. The stats are collected every
30 minutes, on the hour, and 30 minutes past the hour.
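In other words, collections land on every half-hour boundary. The next collection time after any given moment can be computed by rounding up to the next boundary (an illustrative helper, not code from the crawler):

```python
from datetime import datetime, timedelta, timezone

def next_stats_collection(now: datetime) -> datetime:
    """Round up to the next half-hour boundary (:00 or :30), when stats are collected."""
    # Drop seconds/microseconds and step back to the previous boundary...
    floored = now.replace(minute=(now.minute // 30) * 30, second=0, microsecond=0)
    # ...then the next collection is one half-hour later.
    return floored + timedelta(minutes=30)

now = datetime(2023, 11, 14, 10, 42, tzinfo=timezone.utc)
print(next_stats_collection(now))  # 2023-11-14 11:00:00+00:00
```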
### Filters
1. Filter the network - This will filter not only the network ID, but also
make sure the node's fork ID is part of the selected network.
1. Synced status - When a connection with a node is made, the head block
header is fetched from the node so we can get the timestamp of the block.
If it's within a minute of the scrape time, we mark the node as synced.
It's possible we don't know the timestamp of the head block. This is
counted as `Unsynced` on this page, and shown as `Unknown` on
other pages.
1. Next fork - This is currently disabled until the date of the Cancun fork
is known. So come back when we have this, and watch the graphs fill up
with updated nodes.
1. Each of the client names is a link which will add a filter to only show
that specific client name, and you can see the stats on the versions
for that client.
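The synced check described in filter 2 above boils down to a timestamp comparison, roughly like this (a sketch with hypothetical names, not the crawler's actual code):

```python
from datetime import datetime, timedelta, timezone

def synced_status(head_block_time, crawl_time) -> str:
    """Classify a node by comparing its head block timestamp to the crawl time."""
    if head_block_time is None:
        # Head block timestamp unknown: counted as Unsynced on the stats
        # page, shown as Unknown elsewhere.
        return "Unknown"
    if crawl_time - head_block_time <= timedelta(minutes=1):
        return "Synced"
    return "Unsynced"

crawl = datetime(2023, 11, 14, 12, 0, tzinfo=timezone.utc)
print(synced_status(crawl - timedelta(seconds=12), crawl))  # Synced
print(synced_status(crawl - timedelta(hours=2), crawl))     # Unsynced
print(synced_status(None, crawl))                           # Unknown
```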
Each of these filters adds a query parameter to the URL, so you can
share/bookmark a specific set of filters. You can also put in values which
are not given as options. You can change the `network` parameter to `56` to
see stats of the Binance Smart Chain, for example.
[Chainlist][chainlist] or [Chainid][chainid] have a bunch of networks you can
try.
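Assuming the stats page lives at the site root, such a shareable link could be built like this (the base URL here is an assumption; only the `network` parameter is confirmed above):

```python
from urllib.parse import urlencode

# Build a shareable stats URL for a non-default network.
# 56 is the chain ID of the Binance Smart Chain.
base = "https://www.ethernets.io/"
query = urlencode({"network": 56})
print(f"{base}?{query}")  # https://www.ethernets.io/?network=56
```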
At the moment, only 3 days of data are shown for each of the filtered networks,
and 1 day for the `All` filter. I'm working on a better database for this.
I would like to make the date range configurable. All the stats since the
beginning of the project exist in the database, so I would like to make them
available.
## Nodes List
![image](https://hackmd.io/_uploads/Hy5ND0bUa.png)
This is where you can filter for a specific set of nodes, or even search to
find your node.
There are similar filters to the stats page, except the network filter only
filters on the network ID; the fork ID is not taken into consideration.
The inputs should be pretty simple to understand. These will let you find your
node by IP address, node ID, or public key.
These inputs search the discovery data, so you can find nodes which could
not be crawled. If your node is there, the [Help][help-add-peer] page
explains how to add the crawler as a peer to your node, so your node will
connect to the crawler and be added to the database.
Geth lets you set an `identity` flag. This is added to the client identifier,
which you can also search for if you have it set.
The `0` to `f` row is just a shortcut to add that character as a prefix for
the node ID/public key filter.
## Node Details
![image](https://hackmd.io/_uploads/rJ_MGLGLp.png)
This page shows the details about the node I think are most useful.
Some fields of note:
- Last Found (Discovery): The last time the discovery crawler found the node.
This is the value used to show stats for "Live" nodes in the last 24 hours.
- Last Update (Crawled): The last time the details of this page were updated.
This can be from the crawler dialing, or accepting a connection.
- Next Crawl (Scheduled): Depending on the type of connection, and if it was
successful, the crawler will schedule the next time it will attempt to dial
this node again. The next dial should be pretty soon after this time. The
crawler very rarely has a backlog.
- Is Synced: This is calculated by taking the Head Hash block time, and
comparing it to the Last Update time. If these are within a minute, then the
node is considered to be synced.
- Dial Success: Calculated from the crawl history. Shows if the crawler was
able to successfully connect to this node in the last 14 days.
- Country / City: Uses the [Maxmind GeoIP 2][maxmind-geoip2] database to map
an IP address to a city.
- Enode / Record: URIs for the node; you can use these to add the node as a
peer on your own node. Enodes are used on [DiscV4][discv4], and Records are
used on [DiscV5][discv5]. Records are preferred over Enodes in the database,
as they contain a lot more data. The Enode field is constructed from the
Record.
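To illustrate that last point, an enode URI is just the node's uncompressed public key, IP address, and ports arranged into a URL. A simplified sketch (a real Record is RLP-encoded and signed; here the fields are passed in directly, and the key is a dummy value):

```python
def enode_from_record(pubkey_hex: str, ip: str, tcp_port: int, udp_port: int) -> str:
    """Assemble an enode URI from fields carried by a node record.

    Format: enode://<uncompressed pubkey, 128 hex chars>@<ip>:<tcp port>,
    with ?discport=<udp port> appended only when the UDP (discovery) port
    differs from the TCP (listening) port.
    """
    uri = f"enode://{pubkey_hex}@{ip}:{tcp_port}"
    if udp_port != tcp_port:
        uri += f"?discport={udp_port}"
    return uri

dummy_key = "ab" * 64  # placeholder for a real 64-byte public key
print(enode_from_record(dummy_key, "10.3.58.6", 30303, 30301))
```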
### Crawl History
![image](https://hackmd.io/_uploads/Bk3rGUfL6.png)
Shows the last 10 accepted and dialed connections, along with whether there
was an error. I hope this will be very useful to node operators trying to
debug peer connection or network issues. I plan to have a GitHub-style year
map of dial connection issues for each node. If the history is very important,
you can download the database snapshot and query all of it.
## Crawl History Page
![image](https://hackmd.io/_uploads/ryf1TC-8T.png)
Shows the history of dialed and accepted connections for the Node Crawler.
## Help Page
![image](https://hackmd.io/_uploads/HJmpPn7UT.png)
Shows some information to help you add the node crawler as a peer to your node,
and how to find your node's ID/public key so you can find it on the website.
## Snapshots
![image](https://hackmd.io/_uploads/HksXK7XUa.png)
Database snapshots, taken once a day at midnight UTC. Available for anyone to
download and use in their research.
[technical-article]: https://hackmd.io/@angaz/H1-pMC-8a
[epf-updates]: https://hackmd.io/@angaz?tags=%5B%22EPF%22%5D
[epf-day-presentation]: https://app.streameth.org/devconnect/epf_day/session/node_crawler
[devconnect-istanbul]: https://devconnect.org/istanbul
[twitter-mario]: https://twitter.com/TMIYChao
[github-node-crawler]: https://github.com/ethereum/node-crawler
[epf]: https://github.com/eth-protocol-fellows/cohort-four
[chainlist]: https://chainlist.org/
[chainid]: https://chainid.network/
[help-add-peer]: https://www.ethernets.io/help/#add-your-node
[wikipedia-cgnat]: https://en.wikipedia.org/wiki/Carrier-grade_NAT
[maxmind-geoip2]: https://dev.maxmind.com/geoip/geolite2-free-geolocation-data
[discv4]: https://github.com/ethereum/devp2p/blob/master/discv4.md
[discv5]: https://github.com/ethereum/devp2p/blob/master/discv5/discv5.md
[wikipedia-gpl-compatibility]: https://en.wikipedia.org/wiki/License_compatibility#GPL_compatibility