# Lighthouse Peer Scoring
The peer scoring implementation in Lighthouse is an initial version that was designed to be iterated upon based on data we collect from various networks. Take everything written here with a big grain of salt :).
## Overview
All peers are associated with a score. A score is an `f64` in the range `[-100, 100]`, although in the current implementation only negative scores exist, i.e. peers can only have scores in `[-100, 0]`. Initially we had envisaged positive scores, where peers get rewarded for certain behaviours, such as valid messages, connection time, useful blocks etc. We quickly realised that positive scores greatly increased the complexity of the system, so we have removed them for now (there's a section below with some thoughts about positive scores).
The code in Lighthouse that handles the basic scores is [here](https://github.com/sigp/lighthouse/blob/master/beacon_node/eth2_libp2p/src/peer_manager/score.rs)
There are currently three states a peer can be in, related to their score:
- **Healthy**: Their score is > -20 (this is a normal peer with no restrictions)
- **Disconnected**: -50 < Score <= -20 (If a peer transitions into this state we disconnect them. We allow them to reconnect but we remember their score. We may also re-dial these peers if the client would like more connected peers)
- **Banned**: Score <= -50 (The peer is banned. If it's connected, we disconnect it and do not allow reconnections. We don't dial or try to reconnect.)
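The mapping from score to state can be sketched roughly as follows. This is a minimal illustration, not the Lighthouse code itself: the constant and function names are hypothetical, and treating the boundary values as `<=` (so a score of exactly -50 counts as banned) is an assumption of this sketch.

```rust
// Hypothetical names; the threshold values come from the list above.
const MIN_SCORE_BEFORE_DISCONNECT: f64 = -20.0;
const MIN_SCORE_BEFORE_BAN: f64 = -50.0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScoreState {
    Healthy,
    Disconnected,
    Banned,
}

/// Classifies a score into one of the three states.
/// Assumption: both thresholds are inclusive on the "worse" side.
fn score_state(score: f64) -> ScoreState {
    if score <= MIN_SCORE_BEFORE_BAN {
        ScoreState::Banned
    } else if score <= MIN_SCORE_BEFORE_DISCONNECT {
        ScoreState::Disconnected
    } else {
        ScoreState::Healthy
    }
}
```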
Peers currently get penalized for a variety of actions. To keep things simple throughout the codebase, every penalizable event falls under one of four penalty actions, each of which adjusts a peer's score:
- **Fatal** - The peer has done something really bad; we instantly ban it and give it the minimum possible score. These are things that are provably malicious (think bad signatures) or that mean we don't want to connect to the peer (it's on the wrong fork).
- **LowToleranceError** - The peer has done something bad. Not provably malicious, but a behaviour that we really don't want to have in a connected peer (think not supporting protocols, bad encoding, not following spec etc). This subtracts 10 from the peer's score.
- **MidToleranceError** - The peer has done something moderately bad. This subtracts 5 from the peer's score.
- **HighToleranceError** - The peer has done something that isn't ideal and we penalize it slightly. An example of this is a peer not responding to RPC requests in time. This subtracts 1 from the peer's score.
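As a rough sketch of how these four actions could translate into score changes (the enum mirrors the action names above, but `apply_peer_action`, `MIN_SCORE` and `MAX_SCORE` are hypothetical names chosen for illustration, and clamping to the score range is an assumption):

```rust
// Score bounds from the overview; the names are hypothetical.
const MIN_SCORE: f64 = -100.0;
const MAX_SCORE: f64 = 100.0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PeerAction {
    Fatal,
    LowToleranceError,
    MidToleranceError,
    HighToleranceError,
}

/// Returns the peer's new score after the given penalty action.
fn apply_peer_action(score: f64, action: PeerAction) -> f64 {
    let new_score = match action {
        // Fatal jumps straight to the minimum possible score.
        PeerAction::Fatal => MIN_SCORE,
        PeerAction::LowToleranceError => score - 10.0,
        PeerAction::MidToleranceError => score - 5.0,
        PeerAction::HighToleranceError => score - 1.0,
    };
    // Assumption: scores are kept inside [-100, 100].
    new_score.clamp(MIN_SCORE, MAX_SCORE)
}
```

With these numbers, a fresh peer (score 0) can make two low-tolerance errors before being disconnected at -20, and a single fatal action bans it immediately.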
## Score Decay
Scores are not permanent. We don't want to penalize peers for all time. Thus all scores "decay" (i.e. return to 0). This was designed for both positive and negative scores, but as we are only using negative scores at the moment, this brings negative scores back to 0.
The decay is a simple exponential decay with a specified half-life. This is the time it takes for the score to decay to half its value. The current implementation has a half-life of 600 seconds (10 minutes). Thus, if a peer gets disconnected with a score of -20, after 10 minutes its score will be -10 and it will be back in the healthy state.
The decay logic is modified a little for banned peers however. If a peer gets banned, we wait a set amount of time before their score is eligible to start decaying. The current implementation sets this value to 1800 seconds (30 mins). Therefore if a peer gets banned, it stays banned for 30 minutes and then the standard half-life decay kicks in and the peer's score will slowly decay back to 0.
Based on this logic, we never ban peers permanently, rather for a short period of time.
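The decay rules can be sketched like this. It is only an illustration under stated assumptions: the names are hypothetical, the function assumes no new penalties arrive during the elapsed time, and it treats a peer as banned when its score is at or below -50.

```rust
// Values from the text above; the constant names are hypothetical.
const HALFLIFE_SECS: f64 = 600.0;
const BANNED_BEFORE_DECAY_SECS: f64 = 1800.0;
const MIN_SCORE_BEFORE_BAN: f64 = -50.0;

/// Score after `elapsed_secs` seconds, assuming no further penalties.
fn decayed_score(score: f64, elapsed_secs: f64) -> f64 {
    // Banned peers only start decaying once the ban delay has passed.
    let decaying_secs = if score <= MIN_SCORE_BEFORE_BAN {
        (elapsed_secs - BANNED_BEFORE_DECAY_SECS).max(0.0)
    } else {
        elapsed_secs
    };
    // Exponential decay: the score halves every HALFLIFE_SECS.
    score * 0.5_f64.powf(decaying_secs / HALFLIFE_SECS)
}
```

Under these assumptions a peer with the minimum score of -100 stays at -100 for the first 30 minutes and reaches -50 after one further half-life (40 minutes in total), so it needs slightly longer than that to leave the banned range.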
## Peer History
We can't remember all peers and their scores forever. We have a database that stores peers, but it only stores a maximum number of `Disconnected` and `Banned` peers. Currently we keep a history of the most recent 500 disconnected peers and 1000 banned peers. In principle, if someone wanted to unban themselves, they could connect to us with 1000 new peers that each get banned, pushing the older banned peers out of the history. To do this, however, they would need many IP addresses (see the IP banning section).
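To illustrate the eviction behaviour described above, here is a small sketch of a bounded history that forgets the oldest banned peer once the limit is exceeded. `BannedHistory` and its methods are hypothetical and peer ids are plain strings for simplicity; this is not how the Lighthouse peer database is actually structured.

```rust
use std::collections::VecDeque;

// Limit from the text above; the name is hypothetical.
const MAX_BANNED_PEERS: usize = 1000;

/// Remembers only the most recently banned peers (oldest first).
struct BannedHistory {
    peers: VecDeque<String>,
    capacity: usize,
}

impl BannedHistory {
    fn new(capacity: usize) -> Self {
        BannedHistory { peers: VecDeque::new(), capacity }
    }

    /// Records a newly banned peer and returns the peer that was
    /// forgotten to make room for it, if any.
    fn record(&mut self, peer_id: String) -> Option<String> {
        self.peers.push_back(peer_id);
        if self.peers.len() > self.capacity {
            self.peers.pop_front()
        } else {
            None
        }
    }

    fn contains(&self, peer_id: &str) -> bool {
        self.peers.iter().any(|p| p == peer_id)
    }
}
```

In this sketch a banned peer is only forgotten after `MAX_BANNED_PEERS` newer bans have been recorded, which is the "push out older peers" scenario mentioned above.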
## Score Uses
The score is used in various parts of Lighthouse.
#### Accepting new connections
A beacon node, depending on its local resources, can only support a finite number of peers (to handle all the RPC requests etc). By default, Lighthouse targets a peer count of 50 and allows 10% on top of this for new peers to connect. This prevents the scenario where nodes in a testnet all connect amongst each other and refuse new connections, making it harder for new nodes to join the network. It also allows us to cycle peers and find potentially better-scored peers.
As we allow extra peers to connect, we often end up with more peers than the target. Every 30 seconds we then prune the excess peers. The pruning process removes the lowest scoring peers.
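A simplified sketch of the pruning step might look like the following. The function names and the `(peer id, score)` tuple representation are hypothetical, the 10% excess is computed with integer division as an assumption, and tie-breaking between equally scored peers is left out.

```rust
// Default target from the text above; the name is hypothetical.
const TARGET_PEERS: usize = 50;

/// Maximum number of connected peers: the target plus 10% headroom.
fn max_peers(target: usize) -> usize {
    target + target / 10
}

/// Returns the ids of the peers to disconnect so that only `target`
/// peers remain, picking the lowest-scoring peers first.
fn peers_to_prune(peers: &[(String, f64)], target: usize) -> Vec<String> {
    let excess = peers.len().saturating_sub(target);
    let mut sorted: Vec<&(String, f64)> = peers.iter().collect();
    // Ascending by score, so the worst peers come first.
    sorted.sort_by(|a, b| a.1.total_cmp(&b.1));
    sorted
        .into_iter()
        .take(excess)
        .map(|(id, _)| id.clone())
        .collect()
}
```

With the default target of 50, `max_peers` gives 55, so at most 5 peers would need pruning in each 30-second round.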
#### IP Banning
If there are 5 or more banned peers who connected to us from the same IP, that IP becomes banned (until fewer than 5 banned peers remain on that IP).
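One way to picture this rule is a per-IP counter of banned peers. The sketch below is an assumption about how such bookkeeping could look; `BannedIps`, its methods and the constant name are hypothetical.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

// Threshold from the text above; the name is hypothetical.
const BANNED_PEERS_PER_IP_THRESHOLD: usize = 5;

/// Tracks how many currently banned peers were seen on each IP.
#[derive(Default)]
struct BannedIps {
    banned_peers_per_ip: HashMap<IpAddr, usize>,
}

impl BannedIps {
    /// Call when a peer seen on `ip` becomes banned.
    fn peer_banned(&mut self, ip: IpAddr) {
        *self.banned_peers_per_ip.entry(ip).or_insert(0) += 1;
    }

    /// Call when a peer seen on `ip` is unbanned (e.g. its score decayed).
    fn peer_unbanned(&mut self, ip: IpAddr) {
        let now_empty = match self.banned_peers_per_ip.get_mut(&ip) {
            Some(count) => {
                *count = (*count).saturating_sub(1);
                *count == 0
            }
            None => false,
        };
        if now_empty {
            self.banned_peers_per_ip.remove(&ip);
        }
    }

    /// An IP is banned while it has at least the threshold of banned peers.
    fn is_ip_banned(&self, ip: &IpAddr) -> bool {
        self.banned_peers_per_ip
            .get(ip)
            .map_or(false, |&n| n >= BANNED_PEERS_PER_IP_THRESHOLD)
    }
}
```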
#### Sync Progression
Very weird connections, faulty client software etc. have led to some sync states where peers send us junk responses to sync (blocks by range) requests. The blocks can't be processed and the client has to determine whether it should retry from different peers and figure out which peers are faulty. Without the scoring system, the sync algorithm would constantly try to re-download blocks from faulty peers in an endless loop. By penalizing peers that provide junk information during sync, we eventually ban or kick these peers, allowing sync to either progress or eventually kick all its peers and remain idle. This was very handy during the Medalla chain explosion. The client, if left long enough, would eventually filter out and ban all peers on invalid software/incompatible chains, leaving it with a collection of peers that it could progress and sync from.
#### Potential Sync Choices (Not implemented)
Depending on how we are scoring peers, it can be a measure of their bandwidth/reliability (peers get penalized for not responding to requests). We don't currently use it as such, but there was some thought about using the score to decide on best peers to download blocks from during sync.
### Examples of Scoring Penalties
To give some more concrete examples of how peers get penalized in Lighthouse, I'll list some examples for the different peer action types:
**Fatal**
- Peer is on the wrong fork or wrong genesis etc.
- Peer sent RPC data that is not spec-compliant (we would constantly fail requests, so we ban instantly)
- Peer doesn't support Ping protocol
- Peer responds with blocks that don't correspond to the requested hash
**LowToleranceError**
- A peer references an unknown block, but when we try to download the block's ancestors the peer doesn't respond.
- When syncing, we request blocks by range. If, during a finalized sync, a peer sends us a chain of blocks which cannot be processed (i.e. not canonical or diverges etc) and we request the same section of blocks from another peer who sends the correct information, we penalize the original peer with this action.
- When syncing, if a group of peers all provide a chain of blocks that cannot be processed, and we try multiple times from different peers and the chain still makes no sense, all peers in this group get penalized with this action.
- If a peer doesn't respond to a ping or status within the RPC timeout
- If a peer hits our rate limit for goodbye, metadata or status requests
**MidToleranceError**
- When doing parent lookups (a peer sent an attestation or block that references a block we don't know, so we recursively download a chain of blocks from that peer until we reach our head), if the chain hits a certain limit, or one of the blocks fails processing, we penalize the peer with this action.
- If a peer hits our rate limit for ping requests
- If a peer sends us a bad blocks by range response and then after re-requesting sends the correct blocks by range response
**HighToleranceError**
- Peers not responding to blocks by range or blocks by root requests within the timeout
- Various network dropouts or disconnections
- We get rate-limited for our blocks by range or blocks by root requests
- Sending invalid exits, slashings etc, which could be due to the state of our chain
### Positive Score Thoughts
Positive scores bring in complexity in choosing exactly which behaviours we want to encourage. They also add avenues to game the system, where peers can perform some actions to gain score and offset other bad behaviours. They also introduce complexity in how we cycle new peers that connect to us. Currently, in the case that all peers are scored equally (0) and we need to prune excess peers, we randomly select peers to disconnect. With positive scores we run the risk of positively scored peers becoming "sticky", in the sense that we are unlikely to cycle them because new peers coming into the system are likely to have lower scores. (We had considered giving new peers the median score of current peers, but didn't fully think through the implications.) We'd need to balance positive scores against the decay so it's not too easy for peers to get maximum scores and never get pruned.