2024-09-16
I want to start this week's update with a quote from last week's update:
This again shows that careful preparation is important
🙃
Over the past two weeks, I analyzed the data I talked about in the last update, ran performance tests, and ran more simulations (and am waiting for even more to finish as I'm writing this).
As mentioned last time, Shadow offers many parameters for performance tuning, so I did a series of simulations that are equivalent in content, but with different performance settings. It turns out the default parameters were already the best ones! Oh well, it was worth a try.
Additionally, I figured out that at these simulation sizes, the simulation speed does NOT scale with CPU count - but rather with memory speed, as the simulator has to switch between nodes very frequently, and CPU cache sizes can't really handle this at 1000 nodes. This allowed me to bring down simulation costs a bit, as I could select instances with fewer cores without sacrificing too much performance. Unfortunately, the instance sizes offered by the various cloud services still have too high a CPU-to-memory ratio for my taste.
At first I was disappointed by the first week's data (which is now published), because the different configurations barely made an impact - the data looked superb. Very good performance across the board!
Then I noticed that I accidentally ran every node in the simulation as a supernode, i.e. downloading every column. Well, that is not quite realistic, so the resulting data is not that useful.
This again shows that careful preparation is important
After noticing this mistake, I ran a second set of simulations, quickly updating Reth and Lighthouse to newer versions.
Turns out that an update to rust-libp2p in Lighthouse broke the simulation entirely, making nodes unable to connect to each other. Libp2p changed the way port reuse (as per the SO_REUSEPORT socket option in Linux) is configured by the user of the library. Lighthouse used the opportunity to switch to port reuse - but SO_REUSEPORT is not supported by Shadow, so opening sockets would fail with "port already in use".
Guess who did not test locally first and didn't realize this until after the simulations ran.
This again shows that careful preparation is important
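For readers unfamiliar with port reuse, here is a minimal Python sketch of what SO_REUSEPORT changes on a regular Linux host. It does not use any libp2p, Lighthouse, or Shadow code; the helper name `bind_twice` is made up for illustration. It only shows why a second bind to the same port fails when the option is not in effect, which is roughly the situation a simulator without SO_REUSEPORT support ends up in.

```python
import errno
import socket


def bind_twice(reuse_port: bool) -> str:
    """Bind two TCP sockets to the same local port and report the outcome.

    Hypothetical helper for illustration only.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_port:
            # The option has to be set on both sockets before bind().
            first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
            second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        # Port 0 lets the OS pick a free port for the first socket.
        first.bind(("127.0.0.1", 0))
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as err:
            if err.errno == errno.EADDRINUSE:
                return "address in use"
            raise
        return "ok"
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print("without SO_REUSEPORT:", bind_twice(False))
    print("with SO_REUSEPORT:   ", bind_twice(True))
```

An application that relies on the second bind succeeding will break in any environment where the option is unavailable or ignored, so a quick local run like this one is a cheap check before starting a large simulation.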
Ok. I then VERY CAREFULLY started a third run after fixing and testing everything. And that one turned out quite nicely!
Lighthouse did not perform very well this time, with nodes quickly going out of sync and not being able to catch up. However, I could show that PR6268 partially mitigates that issue, so that's nice!
I have some ideas to improve Lighthouse here, and will run simulations to validate them in the coming days. I'll explain them in the next dev update, sorry for the cliffhanger. Hopefully I'll prepare carefully and already have some results by then!
See ya!