Block broadcasting on epoch transitions

We collect some measurements made on the Pyrmont and mainnet ETH 2.0 networks related to delays in block production and propagation at epoch transitions.

The problem:

Validators assigned to attest in the first slot of an epoch are about 20% likely to vote incorrectly on it. In each epoch a validator has a 1/32 chance of being assigned to this slot, and a bad vote penalizes the validator on both head and target in this case. This accounts for at least (assuming perfect inclusion distance and a good vote on source)

\frac{1}{32} \cdot \frac{1}{5} = \frac{1}{160} = 0.625\%

of the validator rewards.

The measured data:

The approach is straightforward and minimal. For each block we receive, we collect the slot, the delay, the graffiti, and the attestation data. Here delay is measured as timestamp - genesis_time - 12*slot, where timestamp is the time at which we received the block. On Pyrmont, the graffiti allows us to identify the client without much error, since dev teams and the EF are currently running over 90% of validators.
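The delay formula above can be sketched in Python as follows. This is a minimal illustration, not the collection code used for these measurements: the function names and the constants `SECONDS_PER_SLOT` and `SLOTS_PER_EPOCH` are names chosen here, with values (12 and 32) taken from the text, and the genesis time in the example is an assumption made only to have concrete numbers.

```python
SECONDS_PER_SLOT = 12   # slot duration used in the formula above
SLOTS_PER_EPOCH = 32    # slots per epoch, as used in the text


def block_delay(timestamp: float, genesis_time: float, slot: int) -> float:
    """Seconds between the start of `slot` and the moment the block was received."""
    return timestamp - genesis_time - SECONDS_PER_SLOT * slot


def slot_in_epoch(slot: int) -> int:
    """Position of `slot` relative to the beginning of its epoch (0..31)."""
    return slot % SLOTS_PER_EPOCH


# Hypothetical example: a block for slot 64 received 2.5 s into the slot.
genesis_time = 1_606_824_023  # assumed genesis time (Unix seconds) for the example
received_at = genesis_time + SECONDS_PER_SLOT * 64 + 2.5
print(block_delay(received_at, genesis_time, 64))  # 2.5
print(slot_in_epoch(64))                           # 0
```

Counting slots relative to the beginning of the epoch, as in `slot_in_epoch`, is what lets blocks from different epochs be compared by position.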

Attestation data is collected as follows. We let TargetRoot be the root of the block we got in slot=N % 32 and HeadRoot be the root of the block we got in slot=N-1. We declare the attestation data on target to be correct if it votes for TargetRoot, and the attestation data on head to be correct if it votes for HeadRoot (i.e. we assume that the monitoring node is on the canonical chain). In this particular case we were on the canonical chain for target during the whole measurement, so no manual intervention was necessary. Assuming the wrong HeadRoot for a few slots will not skew the aggregated data.
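The bookkeeping described above can be sketched as follows. This is a hedged illustration only: `Vote`, `count_wrong_votes`, and the field names are hypothetical and do not come from the monitoring code, and roots are shown as short strings for readability.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Vote:
    """Hypothetical, simplified view of one attestation's data."""
    head_root: str
    target_root: str


def count_wrong_votes(
    votes: Iterable[Vote], head_root: str, target_root: str
) -> Tuple[int, int, int]:
    """Return (total, wrong_head, wrong_target) against the roots seen by the monitoring node."""
    total = wrong_head = wrong_target = 0
    for vote in votes:
        total += 1
        wrong_head += vote.head_root != head_root        # True counts as 1
        wrong_target += vote.target_root != target_root  # True counts as 1
    return total, wrong_head, wrong_target


# Made-up votes for illustration only.
votes = [Vote("0xaa", "0x01"), Vote("0xbb", "0x01"), Vote("0xbb", "0x02")]
print(count_wrong_votes(votes, head_root="0xaa", target_root="0x01"))  # (3, 2, 1)
```

Because correctness is judged against the roots the monitoring node holds, the counts are only as good as the assumption that this node is on the canonical chain.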

The last systematic error we are making is that timestamp is measured after the monitoring node (Prysm in this case) has synced the block. This was benchmarked by Nisdas at prysmaticlabs to be less than 15 ms for slots other than the first two, and less than 100 ms and 60 ms for slots 0 and 1 respectively [1].

The metrics:

[Figure not available]

In this graph of block propagation delay on Pyrmont we can immediately see the problem. The scale is logarithmic and each point represents a received block, proposed by the color-coded client in the given slot (we count slots relative to the beginning of the epoch). We see that most blocks in the first slot arrive over two seconds after the start of the slot. We can immediately pinpoint some client problems and qualities:

Teku's nodes are the best performing. Most of their blocks arrive within the 4-second window, and they systematically send blocks faster than other clients throughout the epoch.
Lighthouse nodes perform just as well as Teku's on the first block, but clearly have a problem throughout the epoch: all of their blocks are received at least 1 second into the slot.
Prysm's nodes might as well not propose in slot 0, since their blocks are essentially guaranteed not to be voted for.
Nimbus is an interesting case: in general it is the client whose blocks are delayed the most, yet it also consistently has the fastest-arriving blocks in slot 0.

Attestation Data

One issue with the above graph is that it shows the point of view of a single node in the network. To rule out a skew in block arrival times, we look at the number of wrong votes on each block. We expect a strong correlation between the delay at which a block arrived and the number of wrong votes for head.
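One way to put a number on the expected relationship is a Pearson correlation between the logarithm of the delay and the per-block fraction of wrong head votes. The sketch below is an assumption about how this could be done, not the analysis behind the graphs in this document; the function names are hypothetical and the sample values are made up.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def delay_vote_correlation(
    delays_ms: Sequence[float], wrong_head_fractions: Sequence[float]
) -> float:
    """Correlate log10(delay in ms) with the per-block fraction of wrong head votes."""
    return pearson([math.log10(d) for d in delays_ms], wrong_head_fractions)


# Made-up illustrative values, not measured data.
delays_ms = [100, 1000, 10000]
wrong_fractions = [0.0, 0.5, 1.0]
print(delay_vote_correlation(delays_ms, wrong_fractions))  # close to 1 for this toy input
```

Taking the logarithm of the delay mirrors the logarithmic delay axis used in the graphs; a single coefficient would still hide effects such as short-lived forks, which produce badly voted blocks at every latency.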

In the following graph, each point represents a block and the color encodes the client. The x-axis is logarithmic and shows the delay in milliseconds.

[Figure not available]

We confirm that Lighthouse's blocks are concentrated in a small timeframe, Teku's arrive consistently earlier, and Prysm's are all over the place. While it is true that most fast-arriving blocks are correctly voted, we see a large concentration of fully badly voted blocks at all latencies. This can be caused by short-lived forks. In fact, we can check the same metric on mainnet, where voting is generally much better (all points are at the bottom), but we also see a small stripe at the top with essentially all votes being wrong:

[Figure not available]

We can also correlate wrong head votes with the position of the slot in the epoch. Here is the data for Pyrmont:

[Figure not available]

We confirm our thesis about Prysm nodes in slot 0: Prysm-proposed blocks generate over 95% of wrong votes in slot 0. Besides this, most clients are comparable by this measure (one might point out Lighthouse's performance in slot 0, which is consistent with our observation above that Lighthouse performs like Teku in this slot).

The same measure on mainnet is much better, although it shows a higher correlation between the first few slots. Warning: this could be due to a lack of data points on mainnet. This document will be updated.

[Figure not available]

In the graphs above I collected the percentage of wrong votes out of all votes, summed over every congruence class of the slot modulo 32. This is the measure most relevant from the validator's perspective, since it is how likely you are to vote wrong and be penalized. If, however, we look at the percentage of blocks that were badly voted (namely, with more than 50% bad votes), we see the following on mainnet:
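The difference between the two measures can be made concrete with a short sketch. This is an illustration under assumptions, not the code behind the graphs: the record layout `(slot, total_votes, wrong_votes)` and both function names are hypothetical, and the sample values are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SLOTS_PER_EPOCH = 32

# Each record is (slot, total_votes, wrong_votes) for one received block.
Record = Tuple[int, int, int]


def wrong_vote_percentage(records: Iterable[Record]) -> Dict[int, float]:
    """Percentage of wrong votes, summed over all blocks sharing slot % 32."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for slot, total_votes, wrong_votes in records:
        position = slot % SLOTS_PER_EPOCH
        totals[position] += total_votes
        wrong[position] += wrong_votes
    return {p: 100.0 * wrong[p] / totals[p] for p in totals if totals[p] > 0}


def badly_voted_block_percentage(records: Iterable[Record]) -> Dict[int, float]:
    """Percentage of blocks with more than 50% wrong votes, per slot % 32."""
    blocks = defaultdict(int)
    bad = defaultdict(int)
    for slot, total_votes, wrong_votes in records:
        position = slot % SLOTS_PER_EPOCH
        blocks[position] += 1
        if total_votes > 0 and 2 * wrong_votes > total_votes:
            bad[position] += 1
    return {p: 100.0 * bad[p] / blocks[p] for p in blocks}


# Made-up values for illustration only.
records = [(0, 100, 90), (32, 100, 10), (1, 100, 0), (33, 100, 20)]
print(wrong_vote_percentage(records))         # {0: 50.0, 1: 10.0}
print(badly_voted_block_percentage(records))  # {0: 50.0, 1: 0.0}
```

The first measure weights every vote equally, so a single badly voted block with many attestations can dominate a slot position; the second counts each block once, which is why the two graphs can disagree.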

[Figure not available]

The contrast between this graph and the previous one teaches us something: we are getting some blocks in slot 1 (the second slot) with lots of bad votes, but most blocks in slots >0 are correctly voted.

Bad Target votes

Regardless of what happens on slot 0, we would expect good metrics on target votes after a few slots. Indeed we see that we are much more likely to vote wrong for target on the first slot:

[Figure not available]

We are still seeing 20% of wrong target votes after half an epoch. In fact, we see that there are blocks produced with bad votes at every slot.

[Figure not available]

On mainnet again these metrics are much better:

[Figure not available]

[Figure not available]

Total votes by clients

If we sum over all blocks in the epoch on Pyrmont, we see that no particular client is producing blocks with worse voting than the others:

[Figure not available]

Prysm gets a lower total number of wrong votes but the same proportion of wrong votes, meaning that Prysm-produced blocks seem to receive fewer votes on Pyrmont, at least during this measurement. Warning: during part of this measurement, EF Prysm nodes were down.

Conclusions

No conclusions yet; I will write something up once I have gathered enough data on mainnet.

References

[1] https://github.com/prysmaticlabs/prysm/tree/benchmarkGossip
