# DAS Simulation Scenarios
Discussion and proposals for DAS simulation scenarios.
## Current simulator config space
Our code is configurable through a number of parameters defining a multi-dimensional simulation parameter space. Below we explain the parameters and show the values from an example configuration used for quick tests.
See the [2-class config params](https://github.com/status-im/das-research/blob/twoClasses/config_example.py) (to be merged into the [develop branch](https://github.com/status-im/das-research/blob/develop/config_example.py)).
These are only illustrative examples. Later we will redefine values (or ranges to explore) for each Scenario.
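For orientation, a hypothetical sketch (not the actual simulator code; the linked config_example.py is the authoritative reference) of how this parameter space is spanned: one simulation per combination of the configured ranges, using the example values explained below.
```python
# Hypothetical sketch of how the multi-dimensional parameter space is spanned:
# one simulation is run per combination of the configured parameter ranges.
from itertools import product

runs = range(10)
numberNodes = range(256, 513, 128)
failureRates = range(10, 91, 40)
blockSizes = range(32, 65, 16)
netDegrees = range(6, 9, 2)

combinations = list(product(runs, numberNodes, failureRates, blockSizes, netDegrees))
print(len(combinations))  # 540 individual simulations with the example values above
```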
#### Number of simulation runs with the same parameters
This is to have results with statistical relevance
`runs = range(10)`
#### Number of nodes
Number of beacon nodes, each potentially running multiple validators.
`numberNodes = range(256, 513, 128)`
// Dankrad suggestion: `numberNodes = [2**x for x in range(8, 18)]`
#### Percentage of blocks not released by producer
`failureRates = range(10, 91, 40)`
#### Block size in one dimension in segments
A block is composed of blockSizes * blockSizes segments.
`blockSizes = range(32,65,16)`
#### Per-topic mesh neighborhood size
We simulate GossipSub topics as d-regular random graphs.
`netDegrees = range(6, 9, 2)`
#### Ratio of "class1" nodes in the total
We currently support 2 node classes, each having different bandwidth and validator count. This is the ratio of class 1 nodes.
`class1ratios = np.arange(0, 1, .2)`
Note: we also have more complex multi-class and distribution-based versions of the code; we just don't think they are necessary at this stage.
#### Number of rows and columns a validator is interested in
The node-to-validator ratio is currently simulated through an increased value of Chi.
`chis = range(1, 5, 2)`
// Dankrad suggestion: `chi = range(1, 5)`
#### Number of validators per node
`vpn1 = [1]`
`vpn2 = [64, 128, 256]`
// Dankrad suggestion: `vpn2 = [64, 128, 256, 512, 1024]`
#### Uplink bandwidth
Bandwidth is currently set in segments (~560 bytes) per simulation timestep, where one timestep is also the simulated transmission latency. Assuming 560-byte segments and 50 ms timesteps, we get
1 Mbps ~= 1e6 bps * 0.05 s / 8 (bits/byte) / 560 (bytes/segment) ~= 11 segments per timestep
`bwUplinksProd = [2200]`
`bwUplinks1 = [2200]`
`bwUplinks2 = [110]`
Note: we'll change this to Mbps in the code, so let's use that as the unit in what follows.
`bwUplinksProd = [1000]` # 1 Gbps
`bwUplinks1 = [1000]` # 1 Gbps
`bwUplinks2 = [10]` # 10 Mbps
// Dankrad suggestion: `bwUplinks1 = [1000, 10000]`
// Dankrad suggestion: `bwUplinks2 = [10, 20, 100]`
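As a sanity check of the conversion above, a minimal sketch assuming 560-byte segments and 50 ms timesteps as stated; the helper name is illustrative, not part of the simulator:
```python
# Mbps -> whole segments per simulation timestep, assuming 560-byte segments
# and 50 ms timesteps as stated above.
SEGMENT_BYTES = 560
STEP_SECONDS = 0.05

def mbps_to_segments_per_step(mbps):
    return int(mbps * 1e6 * STEP_SECONDS / 8 / SEGMENT_BYTES)

print(mbps_to_segments_per_step(1))    # ~11 segments per timestep
print(mbps_to_segments_per_step(10))   # ~111 segments per timestep (cf. bwUplinks2 = [110])
```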
## Scenario 1: "As it would be if introduced now"
**Goal**
The goal of this scenario is to approximate how the system would behave if DAS were introduced now.
**Approach**
We can use this as a baseline for 1D or 2D parameter explorations, sweeping 1-2 parameters to understand their effect. See the individual parameter studies below.
**Parameters**
`numberNodes = [8000]`
`blockSizes = [512]`
`class1ratios = [0.800]` # TODO, check data
`bwUplinksProd = [1000]` # 1 Gbps
`bwUplinks1 = [10]` # ?
// Dankrad suggestion: `bwUplinks1 = [50]`
`bwUplinks2 = [1000]` # ?
// Dankrad suggestion: `bwUplinks2 = [5000]`
`chis = [2]` # I would use 2 as default as it was the example in previous calculations
`vpn1 = [1]`
`vpn2 = [100]` # how many validators should we assume behind a fat node? Is 100 a reasonable number?
// Dankrad suggestion: `vpn2 = [500]`
### Parameter studies
#### Failure rate
What happens if the full block is not released? How many validators are fooled, and how much time does it take for results to settle? (While the simulation has a closing criterion, is there anything in real life besides a hard time limit?)
`failureRates = range(0, 80, 3)`
// Answer Dankrad: There is a hard time limit for validators voting for a block. However, balancing attacks can happen and are probably the most interesting case to study.
#### Block size
Explore how much room there is to bump the block size. What's the communication cost of a larger block size? How does it influence other parameters and metrics?
`blockSizes = [128, 256, 512, 1024, 2048]`
Note: the block is blockSizes*blockSizes segments.
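For intuition on the data volumes behind these values, a rough calculation assuming the ~560-byte segments from the bandwidth section above:
```python
# Approximate size of the extended 2D block for the explored block sizes,
# assuming ~560-byte segments (see the bandwidth section above).
SEGMENT_BYTES = 560

for blockSize in [128, 256, 512, 1024, 2048]:
    segments = blockSize * blockSize   # the block is blockSizes * blockSizes segments
    print(f"blockSize {blockSize}: {segments} segments, ~{segments * SEGMENT_BYTES / 2**20:.0f} MiB")
```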
#### Number of rows/columns observed (Chi)
Changing (increasing) Chi has several expected consequences:
- more informed validators
- better connected 2D structure
- more data to distribute (more validators interested in each segment)
First we focus on a single class and scale Chi to understand the overall effect:
`class1ratios = [1]`
`chis = range(1,20,1)`
What is the optimal value of Chi?
Then we use the class distribution expected in real life and scale Chi again to understand the overall effect:
`chis = range(1,20,1)`
What is the optimal value of Chi?
#### Optimal Chi under different bandwidth constraints
While we can set many parameters of the scheme, bandwidth is a big unknown. Here we explore how changes in bandwidth affect diffusion speed under different values of Chi.
#### Resilient Chi under different bandwidth constraints
Here we are looking for resilient values of Chi: resilient as a function of some lower bandwidth limit and some reasonable failureRate limit, and possibly also some network distortion.
Acceptance criteria:
- time to availability
#### Optimal Chi under different node type distributions
// Added by Dankrad
I think the biggest influence on Chi will be how many "supernodes" there are around. If lots of nodes run many validators, the optimal value is probably close to 1, whereas higher values are needed if most nodes run only 1-2 validators.
## Scenario 2: "Adversarial erasures"
**Goal**
The goal of these simulations is to study how the system behaves in case of "nasty" erasure patterns. While these are very unlikely to happen if we assume random erasures, it might be possible for adversaries to induce such erasure patterns. For example, the block producer can simply release according to such a pattern.
### Scenario 2a: "Minimal Erasure Pattern"
We call a worst-case erasure pattern, defined as a non-recoverable pattern with the smallest number of erased segments, a *Minimal Erasure Pattern (MEP)*. Any pattern where exactly the intersection of k+1 columns and k+1 rows is missing, and nothing else, is a MEP.
This is to check how resilient the system is against a MEP.
`failureRates = [MEP]` # to be implemented
We scale Chi to understand the overall effect
`class1ratios = [1]`
`chis1 = range(1,20,1)`
Then we check with 2 classes, as in Scenario 1.
Notes:
- a MEP is a special (and, assuming random erasures, rare) case of a 0.252 failureRate,
- Most non-recoverable erasure patterns have a MEP subset.
- If a non-recoverable erasure does not have a MEP subset, it has a Diagonal Erasure Pattern (DEP) subset (see https://colab.research.google.com/gist/cskiraly/5ea92116a5707b87f1fd2beec76b9f12/das-sampling-numeric-analysis.ipynb).
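As a reference for the `failureRates = [MEP]` implementation, a minimal sketch of an MEP erasure mask for a 2k x 2k extended block; the function and the chosen rows/columns are illustrative (any k+1 rows and k+1 columns work):
```python
import numpy as np

def mep_mask(k, rows=None, cols=None):
    """Boolean mask of shape (2k, 2k); True marks an erased segment.

    A Minimal Erasure Pattern erases exactly the intersection of k+1 rows
    and k+1 columns; which rows/columns are chosen does not matter.
    """
    rows = range(k + 1) if rows is None else rows
    cols = range(k + 1) if cols is None else cols
    mask = np.zeros((2 * k, 2 * k), dtype=bool)
    mask[np.ix_(list(rows), list(cols))] = True
    return mask

m = mep_mask(256)          # blockSize 512 => k = 256
print(m.sum() / m.size)    # ~0.252, matching the note above
```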
### Scenario 2b: "Diagonal Erasure Pattern"
A Diagonal Erasure Pattern is one in which every row and every column has exactly k+1 erasures.
For example, erase segment (x, y) whenever `(x+y) % 2k <= k`, or take any row/column permutation of such a pattern.
A DEP corresponds to a failureRate above 0.5, therefore it is not expected to create issues. Still, it might be worth checking.
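A minimal sketch of the corresponding DEP mask, directly following the `(x+y) % 2k <= k` rule above (function name and numpy usage are ours):
```python
import numpy as np

def dep_mask(k):
    """Boolean mask of shape (2k, 2k); True marks an erased segment.

    Diagonal Erasure Pattern: segment (x, y) is erased iff (x + y) % (2k) <= k,
    giving exactly k+1 erasures in every row and every column.
    """
    x, y = np.indices((2 * k, 2 * k))
    return (x + y) % (2 * k) <= k

m = dep_mask(256)
print(m.sum(axis=0).min(), m.sum(axis=0).max())  # both k+1 = 257
print(m.mean())                                  # (k+1)/(2k) ~ 0.502 > 0.5 failureRate
```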
### Scenario 2c: Hard to reconstruct patterns
// Added by Dankrad
I think the above patterns are actually not that interesting because both are ultimately unavailable. More interesting are patterns that eventually become available, but do not appear to be so at first. It would be interesting to find out which pattern takes the longest to reconstruct.
## Scenario 3: Multiple diffusion overlap
### Scenario 3a: Overlap between slots due to long diffusion
These would be diffusions of subsequent blocks overlapping, though the hard time limit might eliminate this possibility.
Can this happen? It could, if someone is late to start the diffusion.
### Scenario 3c: Overlap due to multiple blocks in same epoch
Can this happen? I suppose the assumption is that only the selected builder starts diffusing the block.
## Scenario 4: GossipSub details and dynamics
GossipSub (assuming we would use the base version of GossipSub here) has a number of potential shortcomings that could impact performance. A partial list:
- d (degree) of a node is simulated as a fixed number. In reality implementations try to keep it between d_min and d_max
- peer selection is simulated as perfectly random, using an externally generated random d-regular graph per topic (see the sketch after this list). In reality peer selection is biased in many ways, resulting in different topologies
- connectivity is not perfect, sometimes missing at IP level (potentially an issue between home validators with NAT traversal)
- topic peers are selected from a local view, which is not perfect (depends on peer exchange and topic discovery used). This results in more clustering
- peer scoring indirectly prefers low-latency connections, including local ones. This results in a larger diameter and more clustering than a random graph
- Initial push currently has an idealized model, pushing out individual segments (560 bytes) to 1000+ nodes in a reliable way. In reality:
  - the node might not know nodes in each of the 4k topics
  - if pushing with UDP, some segments might be lost
  - if pushing over TCP or another connection-oriented transport, the cost is much higher and the delay might be different
- We are currently not modeling the actual gossip part of GossipSub; we focus on the mesh-based push, which dominates the diffusion. We could add gossip in a next step
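As referenced in the peer-selection item above, a minimal sketch of how a per-topic mesh can be modelled as a random d-regular graph; networkx and the node count are illustrative assumptions, not the simulator's actual code:
```python
# Per-topic mesh modelled as a random d-regular graph (illustrative only).
import networkx as nx

d = 8                 # netDegree
n = 8192              # nodes subscribed to the topic (assumed for illustration)
topic_mesh = nx.random_regular_graph(d, n, seed=42)
print(topic_mesh.number_of_edges())   # n * d / 2 edges
```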
Simulation options:
- full protocol simulation
- adding gossip
- adding peer dynamics
- adding network induced dynamics
- performance under artificially distorted ("crippled") topologies
## Scenario 5: Network splits
Here we are focusing on network imperfections, of which two main types are:
- overlay issues: unfortunate situations due to randomness in protocol behavior, such as peer selection
- underlay issues: actual problems in the underlying IP network, such as connectivity issues, capacity bottlenecks, clustering, heterogeneity
We still need to work out what we want to focus on here.
***********************
# UPDATE for subnetDAS
We need to consider a potential attack on a single attestation subnet. An attacker could craft specific nodeIds and spin up malicious nodes to eclipse a specific subnet for a specific range of epochs. An attack on a single subnet would require only 1/128 of the malicious nodes needed to attack the whole network.