_2024-09-02_
# EPF5: Week 12
Hello! Over the past weeks I've finally run simulations on large servers - and got sick :(
While I have finally gathered data (and a large AWS bill), the illness means I do not have any interesting conclusions for you yet. **So the proper writeup of the *results* will follow.**
Let me tell you about my experiment anyway:
## First Experiment
The first experiment consists of six simulations with 45 "simulated" minutes each. In the experiment, I want to figure out the effects of different PeerDAS parameters on the network.
You can find the experiment setup and detailed configuration files [here](https://github.com/dknopik/ethshadow-experiments/tree/main/das_params). Please tell me if you try to reproduce it and run into any issues!
In each simulation, we have 1000 nodes distributed across the globe, with simulated latency between the nodes according to their locations. The six simulations we ran (summarized in the sketch after the list) are:
1. Spec: Use the current spec values (128 columns in 128 subnets, custody 4, no sampling)
2. [PR6268](https://github.com/sigp/lighthouse/pull/6268): Use the current spec values and engine_getBlobs (128 columns in 128 subnets, custody 4, no sampling). Unfortunately, this was not effective, since I didn't use a compatible build of Reth :(
3. Custody 8 instead of 4
4. Custody 16 instead of 4
5. 64 subnets instead of 128
6. 32 subnets instead of 128
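
To make the parameter variations easier to compare, here is a minimal Python sketch of the six runs as overrides on a baseline. The dictionary keys are purely illustrative names for this post, not ethshadow's or Lighthouse's actual configuration keys.

```python
# Illustrative baseline; key names are my own, not the real config schema.
BASELINE = {
    "columns": 128,          # number of data columns
    "column_subnets": 128,   # gossip subnets the columns are spread over
    "custody": 4,            # columns each node custodies
    "sampling": False,       # no peer sampling in any run
    "engine_get_blobs": False,
}

RUNS = {
    "1-spec": {},
    "2-get-blobs": {"engine_get_blobs": True},  # needs a compatible Reth build
    "3-custody-8": {"custody": 8},
    "4-custody-16": {"custody": 16},
    "5-subnets-64": {"column_subnets": 64},
    "6-subnets-32": {"column_subnets": 32},
}

for name, overrides in RUNS.items():
    print(name, {**BASELINE, **overrides})
```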
Each simulation took approximately 6 hours to run on an AWS r7a.24xlarge instance, which has 96 cores and 768 GB of memory. Additionally, I needed to add 80 GB of swap. There was also some overhead for setting up the server and compiling the clients at the beginning, and for compressing the simulation results after each run.
The metrics were captured with Prometheus, and I've started visualizing them with Grafana. This is what it looks like so far:

I will explain this in detail as soon as my interpretation of the experiment is done.
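
Since the metrics land in Prometheus, pulling a time series out for offline analysis only needs its standard HTTP API. Here is a minimal sketch, assuming Prometheus is reachable at `localhost:9090`; the metric name in the comment is a placeholder, as the actual names depend on what the clients expose.

```python
import requests  # third-party: pip install requests

PROMETHEUS = "http://localhost:9090"  # assumed address of the Prometheus instance

def query_range(metric: str, start: float, end: float, step: str = "15s"):
    """Fetch a time series via Prometheus' range-query API."""
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query_range",
        params={"query": metric, "start": start, "end": end, "step": step},
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Example with a placeholder metric name and Unix timestamps:
# series = query_range("libp2p_peers", start=1725235200, end=1725237900)
```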
## A Pity
WHEW - while running something on such large machines is fun, receiving the bill is not. Right now, I am a bit disappointed by the scalability of the simulations. I will try out more instance types and maybe find ways to further improve the performance. After all, the quicker and cheaper the simulations are to run, the more viable they are as a tool for testing Ethereum.
This again shows that careful preparation is important - I could have realized that my setup for simulation 2 was faulty, which would have saved me some money. Unfortunately, this is hard in the context of PeerDAS, as we can only properly test it with enough peers around, or by making every node a super node.
## Next steps
Of course, the first order of business is finishing the analysis of the first experiment, and writing it down (as research without writing it down is just messing around). I probably should also rerun the invalid second simulation properly.
With the results in hand, I will try to figure out which scenarios are interesting enough to test further. For example, if we now introduce more factors into the simulation (like super nodes, and especially attackers), which of the scenarios above do we want to extend or modify? I would love to have a huge multidimensional test series, where I define, say, 6 parameters with ~5 possible values each (which means 5^6 = 15625 simulations), but this is not really feasible if each simulation costs ~15€. Therefore I have to think carefully about what really makes sense to simulate.
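
For a sense of scale, the back-of-the-envelope arithmetic for that full grid, using the rough ~15€ per simulation figure from my runs:

```python
parameters = 6             # independent parameters to vary
values_per_parameter = 5
cost_per_simulation = 15   # €, rough figure from the runs above

simulations = values_per_parameter ** parameters   # 5**6 = 15625
total_cost = simulations * cost_per_simulation     # 234375 €
print(simulations, total_cost)
```

So the full grid is clearly out of reach, and the parameter space has to be pruned to the combinations that actually matter.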
For the same reason, I will again take a step back and do some performance tests. Shadow offers many options for performance tuning, and maybe I'll find some nice improvements!
Last but not least, I will do my best to stay healthy :)