2024-09-29
Hey! I ended the last update with a cliffhanger about simulations of changes aiming to improve PeerDAS in Lighthouse, so I will get right to that. Afterward, I will discuss my plans for the next week and, as we are approaching the end of the cohort, my plans for the final phase of my project. This is by far my longest update post so far, so buckle up!
As mentioned last time, PeerDAS did not perform well in my simulations. This seems to match what teams see on devnets. This is somewhat expected, as both the spec and the implementations are still works in progress. Anyway, I ran some MORE simulations with modified parameters and implementation variants to try and fix that.
First, I will explain the simulation setup, followed by an explanation of why we see such poor performance.
We run 1000 nodes in total, each with one Lighthouse beacon node, one Reth node, and one Lighthouse validator client (having 4 validators each for a total of 4000 validators). We simulate latency between the nodes, with a rough geographical distribution:
These values try to roughly mimic the IRL distribution of nodes, based on Etherscan's and ProbeLab's data.
Furthermore, half of the nodes are simulated with 1 Gbit/s up- and downstream, and the other half with 50 Mbit/s up- and downstream plus 20 ms of additional latency. This is meant to mimic home stakers; however, I admit that these values may be a bit optimistic.
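To make the split concrete, here is a minimal sketch of how the two host classes could be generated before being fed into the simulation's network graph. This is illustrative only - the field names and generation code are not my actual tooling; the bandwidth and latency values are the ones described above.

```python
# Illustrative sketch (not my actual tooling) of the two host classes described above.
def make_hosts(n: int = 1000) -> list[dict]:
    hosts = []
    for i in range(n):
        if i % 2 == 0:
            # "datacenter" half: 1 Gbit/s symmetric, no extra latency
            hosts.append({"name": f"node{i}", "bw_mbit": 1000, "extra_latency_ms": 0})
        else:
            # "home staker" half: 50 Mbit/s symmetric, 20 ms of additional latency
            hosts.append({"name": f"node{i}", "bw_mbit": 50, "extra_latency_ms": 20})
    return hosts

print(sum(h["bw_mbit"] == 50 for h in make_hosts()), "home-staker-like hosts")
```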
We run a single host with a Geth boot node and a Lighthouse boot node. All other nodes use it for initial discovery.
To capture Lighthouse metrics, we run a single Prometheus server that scrapes metrics from every Lighthouse beacon node in the simulation.
Lastly, we run one host with my purpose-built blob spammer. The spammer picks one node at random and sends between 1 and 6 blobs each slot.
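The core of the spammer is roughly the loop below. This is a hypothetical sketch, not the real tool; `send_blob_transaction` is a placeholder for building and submitting a blob-carrying transaction to the chosen node's RPC endpoint.

```python
import random
import time

SLOT_SECONDS = 12
nodes = [f"http://node{i}:8545" for i in range(1000)]  # hypothetical EL RPC endpoints

def send_blob_transaction(endpoint: str, blob_count: int) -> None:
    # Placeholder: build and submit one blob-carrying (type-3) transaction.
    print(f"sending {blob_count} blob(s) to {endpoint}")

while True:
    target = random.choice(nodes)   # pick one node at random
    blobs = random.randint(1, 6)    # between 1 and 6 blobs per slot
    send_blob_transaction(target, blobs)
    time.sleep(SLOT_SECONDS)        # once per slot
```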
Genesis occurs at 5 minutes into the simulation to give nodes enough time to find good peers. The simulation runs for 40 more minutes - this equates to 200 slots and a total simulated time of 45 minutes.
These simulations were run on several different cloud instance types. You need at least 1TB of memory.
This explanation considers the spec as of ~3 weeks ago.
In PeerDAS, we split the blobs into 128 columns. Each node, by default, takes custody of 4 of those columns. This means that the node subscribes to one subnet for each column it has custody of, and only imports a block once it has all of its custody columns for that block. Not importing a block means that it cannot become the head block, and no block descending from it can be imported either.
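In pseudocode, the custody and import rule described above boils down to something like this. This is only a sketch: the real spec derives custody columns deterministically from the node ID, which I replace with an arbitrary placeholder here.

```python
NUM_COLUMNS = 128
CUSTODY_REQUIREMENT = 4

def custody_columns(node_id: int) -> set[int]:
    # Placeholder: the spec derives custody columns from the node ID;
    # here we just pick 4 deterministic columns for illustration.
    return {(node_id * 7 + k) % NUM_COLUMNS for k in range(CUSTODY_REQUIREMENT)}

def can_import(node_id: int, received_columns: set[int]) -> bool:
    # A block is only importable once ALL custody columns have arrived.
    return custody_columns(node_id) <= received_columns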
Lighthouse targets 100 connected peers, selected based on their scoring. A proposing node should have peers available for every subnet, so that all of the block's columns can be sent out. Lighthouse does not take this into account during peer selection yet, so the set of column subnets covered by the connected peers is effectively random. With 100 peers and 4 custody columns each, it is rather likely that we do not cover all 128 subnets. The related probability calculus is left as an exercise for the reader.[1]
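For the curious, a rough version of that exercise, building on the per-column number in the footnote and treating the 128 columns as independent (which is only approximately true, since each peer custodies exactly 4 distinct columns):

```latex
% per-column coverage with 100 peers, each custodying 4 of 128 columns:
P(\text{a given column is covered}) = 1 - \left(1 - \tfrac{4}{128}\right)^{100} \approx 0.958
% assuming (approximate) independence between columns:
P(\text{all 128 columns covered}) \approx 0.958^{128} \approx 0.004
```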
When a column cannot be sent out by the proposing node, the node does not retry sending it. Furthermore, no one will ask the node for the column - after all, it most likely does not advertise custody of it. These columns will never reach any other node. However, as mentioned above, some nodes rely on these columns to be able to import the block! Therefore, these nodes will fork off. I will call this "losing" a node in this document. This will happen on almost every block that includes blobs! Quickly, we end up with a lot of forks, causing the network to fall apart. Lost nodes will never join the main fork again, as the block that caused them to get lost will never be importable to them.
To assess the quality of the networks I simulate, I will use three metrics. First, I determine the last slot in the simulation for which, six seconds into the slot, fewer than one-third of the nodes are missing a head block for that slot. I will call this the "last healthy slot". For the stock PeerDAS implementation and specification parameters, this is slot 3. As slot 4 is the first slot containing blobs, this means that the very first block containing blobs already loses us enough nodes to eventually prevent finalization.
Second, I will use the average rate of empty slots over all nodes during the last ten slots of each simulation (slots 190 - 199). This is useful for simulations where the previous metric equals the last slot of the simulation (in other words: networks that do not fall apart before the end of the simulation).
Finally, I will use the percentage of nodes lost in slot 4 (i.e. the first slot with blobs). This is useful to compare simulations where the last healthy slot is 3.
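To make the first metric concrete, here is how the "last healthy slot" could be computed from per-node data. This is a sketch against a hypothetical `has_head[node][slot]` table extracted from the metrics, not my actual analysis code.

```python
def last_healthy_slot(has_head: list[list[bool]], num_slots: int) -> int:
    """has_head[node][slot]: did this node have a head block for `slot`
    six seconds into the slot? A slot counts as healthy if fewer than
    one-third of the nodes are missing a head for it."""
    n = len(has_head)
    healthy = -1
    for slot in range(num_slots):
        missing = sum(1 for node in has_head if not node[slot])
        if missing < n / 3:
            healthy = slot
    return healthy
```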
I will publish the full captured simulation data Soon™.
Let's compare that to my attempts to mitigate this. In total, I ran 18 simulation configurations, including the "spec" simulation without any changes.
The quotation marks in the title of this section hint at the fact that none of these simulations are full fixes. That is out of the scope of my project. Nevertheless, the simulations show which approaches to the "proper" fix might be valid, and show us that the simulations themselves are actually useful.
In this section, I want to get into the details of every simulation variant.
PR 6268 (aka Decentralized Blob Building) is an approach to reduce bandwidth by trying to get blobs from the local EL client instead of the gossip network. If the EL is already aware of all blobs, we no longer need any blob data from gossip - and can import the block. This should help with our issue, as we can now import the block in this case, even if the proposing node was unable to publish all columns.
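Conceptually, this boils down to a single Engine API call from the CL to its local EL. Below is a minimal sketch of that call, with the endpoint, JWT handling, and error handling simplified; the versioned hashes would come from the block's blob KZG commitments.

```python
import requests

def fetch_blobs_from_el(engine_url: str, jwt_token: str, versioned_hashes: list[str]):
    """Ask the local EL for blobs it already has in its mempool.
    The result has one entry per hash; hashes the EL does not know come back as null/None."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "engine_getBlobsV1",
        "params": [versioned_hashes],
    }
    resp = requests.post(
        engine_url,
        json=payload,
        headers={"Authorization": f"Bearer {jwt_token}"},
    )
    return resp.json()["result"]

# If no entry is None, all blobs are known locally and the node can import the
# block without waiting for the missing columns on gossip.
```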
In the simulation, PR 6268 improved the situation - the last healthy slot is now 45! However, when investigating the logs, I quickly noticed that something was wrong, as in many cases the head was only chosen 11.5 s into the slot. This was due to a bug: the head was not recomputed after fetching the blobs from the EL.
After fixing this, the network behaved even better: the last healthy slot is now slot 199, i.e. the end of the simulation. Still, the rate of empty slots in the last 10 slots is 40.84%, showing that the network suffers. This makes sense: while we increase the chance that a block can be imported, we still lose nodes on occasion, and this adds up over time, as those nodes can never rejoin.
Recall that there is a certain likelihood that we have at least one peer for each subnet. Three variables influence this: the number of peers, the number of subnets, and the number of subnets each peer has custody of. To confirm whether we correctly identified the issue, we can run simulations modifying these variables. Let's start by increasing and decreasing the custody requirement!
When lowering the custody requirement from 4 to 2 columns, the network behaves even worse (as expected): we lose 75.4% instead of 35.2% of the nodes in the first slot with blobs. This happens because our connected peers now cover even fewer column subnets.
Raising the custody requirement to 8 columns moves the last healthy slot to 13. This shows that we now lose nodes at a slower rate, but eventually still collapse.
Cranking up the custody requirement to 16 columns, we now make it through the whole simulation, with 0% missed slots in the last 10 slots of the simulation! Perfect. Still, this is not a good solution, as it increases the bandwidth requirements.
Currently, a subnet carries 1 of the 128 columns. Let's try to increase that while maintaining the relative custody amount: two columns per subnet (ergo, 64 subnets) with each node custodying 2 of those subnets, and four columns per subnet (32 subnets) with each node custodying 1 subnet. This should slightly increase our coverage of subnets, and indeed, the last healthy slot increases to 6 and 17, respectively.
The last variable to change is the number of target peers. Lighthouse, by default, targets 100 peers. I increased this to 150, 200, and finally 300 peers. As expected, this increases the likelihood that proposing nodes cover all subnets, raising the last healthy slot to 13, 59, and 199, respectively. For 300 peers, the average rate of empty slots is 18.16%. This is even lower than the fixed variant of pr_6268, making this the second-best attempt yet.
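To tie the three knobs together, here is the same kind of back-of-the-envelope coverage estimate as a function of peer count, subnet count, and custody, again assuming independent, uniform custody assignments - so treat the absolute numbers as rough indicators only.

```python
def subnet_coverage(peers: int, subnets: int, custody_subnets: int) -> float:
    """Approximate probability that ALL column subnets are covered by at least one peer."""
    p_one = 1 - (1 - custody_subnets / subnets) ** peers  # one given subnet covered
    return p_one ** subnets                               # all subnets (independence approximation)

print(subnet_coverage(100, 128, 4))   # "spec":     ~0.004
print(subnet_coverage(100, 128, 16))  # custody_16: ~1.0
print(subnet_coverage(300, 128, 4))   # peers_300:  ~0.99
```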
I would love to analyze the impact of a higher peer count on bandwidth. Curiously, however, the fewer target peers we have, the higher the average bandwidth: the nodes then try to fetch the columns via RPC very often. This prevents a fair comparison until the lower peer-count configurations are stable enough to follow the network.
A supernode is a node that takes custody of ALL the column subnets! This is pretty useful for us: a proposing node peered with a supernode can send all columns to at least that supernode. Let's investigate how supernodes affect network performance!
I added 5, 10, 25, and 75 supernodes to the 1000 regular nodes of the simulation. These supernodes have no validators. Surprisingly, they had only a minimal, though measurable, impact: the empty slot rate on slot 4 first rose to 36.52% and 40.3% with 5 and 10 supernodes respectively, and sank to 24.49% and 15.72% with 25 and 75 supernodes. This surprised me: with 75 of 1075 nodes being supernodes, it should be fairly likely that every node is connected to at least one supernode. Unfortunately, I cannot explain these poor results. They point to more issues than the one I outlined in this post.
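For reference, here is the quick sanity check behind that expectation, assuming peers were picked uniformly at random (which real peer selection certainly is not):

```python
# Chance that a node with 100 randomly chosen peers sees NO supernode,
# with 75 supernodes among 1075 nodes in total:
p_no_supernode = (1 - 75 / 1075) ** 100
print(p_no_supernode)  # ~0.0007, i.e. nearly every node should be peered with a supernode
```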
What if we are a bit fault-tolerant? With peer sampling, we can tolerate some missing columns, because it is still likely enough that at least half of the columns are available. I emulated that by modifying Lighthouse to custody 6 columns, but import the block even if only 4 of them are available. This increases the likelihood that a node can import a block, because for an import to fail, at least three of the six custodied columns now need to be missing.
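The intuition can be made concrete with a small binomial estimate. The per-column availability p below is an arbitrary illustrative value, and the columns are treated as independent, which is optimistic:

```python
from math import comb

def import_probability(columns: int, needed: int, p: float) -> float:
    """P(at least `needed` of `columns` custody columns are available),
    assuming each column is available independently with probability p."""
    return sum(comb(columns, k) * p**k * (1 - p) ** (columns - k)
               for k in range(needed, columns + 1))

p = 0.9
print(import_probability(4, 4, p))  # strict 4-of-4 rule:  ~0.66
print(import_probability(6, 4, p))  # relaxed 4-of-6 rule: ~0.98
```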
This 4-of-6 variant worked well! We stay stable enough until the end of the simulation, ending with a 14.88% average empty slot rate.
Up until now, we ran simulations where the peer selection algorithm did not attempt to maintain a diverse peer set with regard to the column subnets, leaving it to chance whether all subnets are covered.
I quickly implemented two strategies into Lighthouse: attempting to always have at least one peer for each subnet (always_maintain) and trying to quickly connect to three peers for each subnet right before proposing (maintain_before). I want to stress that this was a quick and dirty hack - certainly not comparable to a proper implementation of smart peer selection.
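For context, "always_maintain" boils down to a loop of roughly the following shape. This is a pseudocode-level sketch of the idea, not the actual Lighthouse changes; `discovered_peers_for` and `dial` are placeholder stubs.

```python
NUM_SUBNETS = 128

def discovered_peers_for(subnet: int) -> list[str]:
    # Placeholder: query discovery for peers advertising custody of this subnet.
    return []

def dial(peer: str) -> bool:
    # Placeholder: attempt to establish a connection.
    return False

def maintain_subnet_peers(peers_by_subnet: dict[int, list[str]]) -> None:
    # Ensure at least one connected peer per column subnet ("always_maintain").
    for subnet in range(NUM_SUBNETS):
        if not peers_by_subnet.get(subnet):
            for candidate in discovered_peers_for(subnet):
                if dial(candidate):
                    peers_by_subnet.setdefault(subnet, []).append(candidate)
                    break
```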
"always_maintain" performed worse than the spec simulation, losing 82.3% of nodes on slot 4. "maintain_before" performed slightly better, losing 22.9%.
This shows that peer selection is not a problem one solves in 15 minutes of coding, but I don't think this is a new revelation. :)
Reminder: these metrics are explained in the last paragraph of the section "The issue".
Simulation | Last healthy slot | Avg. rate of empty slots on slots 190 - 199 | Empty slot rate on slot 4 |
---|---|---|---|
spec | 3 | 99.99% | 35.2% |
pr_6268 broken | 45 | N/A[2] | 91.4% |
pr_6268 fixed | 199 | 40.84% | 0.1% |
custody_02 | 3 | 99.94% | 75.4% |
custody_08 | 13 | 99.98% | 1.4% |
custody_16 | 199 | 0% | 0.2% |
subnets_32 | 17 | 97.87% | 9.5% |
subnets_64 | 6 | 99.99% | 17.2% |
peers_150 | 13 | 100% | 14% |
peers_200 | 59 | N/A[3] | 0.6% |
peers_300 | 199 | 18.16% | 0% |
supernodes_05 | 3 | 100% | 36.52% |
supernodes_10 | 3 | 100% | 40.3% |
supernodes_25 | 5 | 99.99% | 24.49% |
supernodes_75 | 5 | 100% | 15.72% |
4_of_6 | 199 | 14.88% | 0.1% |
always_maintain | 3 | 100% | 82.3% |
maintain_before | 4 | 99.95% | 22.9% |
There are some issues with my approach that I want to outline here, so you can judge the meaningfulness of the results yourself.
PeerDAS needs more work. We should run more simulations when it's working better. Likely, 100 target peers are not viable with the current PeerDAS spec.
Also, we can see that Shadow is useful! It allowed me to troubleshoot the spec simulation, check whether my suspicions were correct, and identify potential ways to improve the situation.
While writing this, I've got more simulations in the oven 👨‍🍳. With the recent discussions around a fork split, there is strong support for increasing the blob target and/or maximum, so we don't have to wait until Pectra B/Fusaka, when PeerDAS would allow increasing capacity. Some voices in the core development community are critical of this, as it further raises hardware and especially bandwidth requirements, hurting decentralization.
There are some ideas to reduce this effect even before PeerDAS. One is engine_getBlobsV1, which I already tested in conjunction with PeerDAS (see "PR 6268" above). The other is IDONTWANT, aka Gossipsub 1.2, with which nodes can signal that they do not need a certain message because they already received it. This avoids nodes unnecessarily sending blob data and should reduce bandwidth requirements. Currently, only about 15% of the network supports this - I want to check exactly how much bandwidth we would save if we increased this to 100% (achievable by updating all clients before Pectra). Maybe this already saves more bandwidth than the blob count increase requires.
I hope to have the data out as soon as possible so that a data-driven discussion in the upcoming ACD is possible!
Time truly flies. In my project proposal, I wrote:
If the experiments prove to be useful, I want to extend the simulation framework to be as convenient to use as possible and support simulations with all major clients, so that the client teams can easily run their own simulations.
This phase will start towards the end of Cohort 5, or even afterwards. Its length will depend on the progress on ethereum-shadow during phases 1 and 2, but I expect at least two weeks.
I believe Shadow has proven its viability. While it is tempting to run experiments until the end of the cohort, I want to make sure that it is easily and conveniently usable for everyone. Therefore, I have the following things to do:
- Refactor ethereum-shadow so that it is easily maintainable. I want to get it to a state where others and I can maintain it on the side after the end of the cohort. I already started with the refactor.

I think I was far too optimistic with my estimate of two weeks, so I will start working on this soon. After only ONE more simulation series. I promise!
…no, I didn't fully calculate that. But I believe the likelihood for a single column to be covered by at least one peer is 1 - (1 - 1/32)^100 ≈ 95.82%, which is not good enough. ↩︎
Cancelled before completion. ↩︎
Crashed out of memory few minutes before completion, could not be bothered to run it again. ↩︎