# Network scale limitations - thoughts
The current SSV network has 128 pubsub topics (subnets) to which validators are deterministically assigned.
Each operator subscribes to the relevant topics depending on the validators it manages.
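To illustrate, a minimal sketch of such a deterministic assignment, assuming the mapping hashes the validator public key modulo the subnet count (the exact mapping used by SSV nodes may differ; any stable function works as long as every node derives the same topic for the same validator):
```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const subnetsCount = 128

// subnetOf maps a validator public key to one of the 128 pubsub topics.
// Hypothetical mapping: hash the key and reduce modulo the subnet count.
func subnetOf(validatorPK []byte) uint64 {
	h := sha256.Sum256(validatorPK)
	return binary.BigEndian.Uint64(h[:8]) % subnetsCount
}

func main() {
	pk := []byte("example validator public key")
	fmt.Printf("validator assigned to subnet %d\n", subnetOf(pk))
}
```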
We divide topic messages into committee and non-committee messages. Committee messages are messages the operator needs in order to carry out its duties; they must propagate fast for performance and maximal rewards.
Non-committee messages are messages received by the node that do not require any "action"; the node simply propagates them to its peers.
There is one special type of message that is relevant for all peers: the decided message. All peers process it and store it as part of SSV's distributed slashing-protection database.
To harden the P2P network, nodes need to detect peers sending "junk" messages, defined as any message that is too late, invalid, malicious, etc.
To detect such messages, nodes need to validate even non-committee messages so they can penalize peers propagating junk.
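A minimal sketch of how such validation could hook into go-libp2p-pubsub's topic validators, whose results feed gossipsub peer scoring (`decodeAndCheck` and the sentinel errors are hypothetical stand-ins for full SSV message validation):
```go
package validation

import (
	"context"
	"errors"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

// Hypothetical sentinel errors; real SSV validation distinguishes many more cases.
var (
	errTooLate    = errors.New("message arrived too late")
	errInvalidSig = errors.New("invalid signature")
)

// decodeAndCheck is a hypothetical stand-in for full SSV message validation
// (decoding, slot/round checks, BLS signature verification, ...).
func decodeAndCheck(data []byte) error {
	_ = data
	return nil
}

// RegisterSubnetValidator wires a validator into gossipsub so peers that
// propagate junk are penalized by pubsub's peer-scoring machinery.
func RegisterSubnetValidator(ps *pubsub.PubSub, topic string) error {
	return ps.RegisterTopicValidator(topic, func(ctx context.Context, from peer.ID, msg *pubsub.Message) pubsub.ValidationResult {
		err := decodeAndCheck(msg.Data)
		switch {
		case err == nil:
			return pubsub.ValidationAccept // valid: keep propagating
		case errors.Is(err, errTooLate), errors.Is(err, errInvalidSig):
			return pubsub.ValidationReject // junk: penalize the sending peer
		default:
			return pubsub.ValidationIgnore // unsure: drop without penalty
		}
	})
}
```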
**How many messages are there?**
We can simulate a network of 128K validators distributed across 128 subnets.
| # of validators | # of topics | msgs/ validator/ epoch | avg msg size |
| -------- | -------- | -------- | -------- |
| 128,000 | 128 | 13 | 700 bytes |
*13 msgs/ validator/ epoch includes: proposal, prepare ×4, commit ×4, post-consensus ×4 as a bare minimum*
`Per subnet messages`
| # of validators | msgs/ epoch | msgs/ sec | MB/ sec | msgs processed/ sec |
| -------- | -------- | -------- | -------- | -------- |
| 1,000 | 13,000 | 33.85 | 0.02 | 8.46 |
`All subnets messages`
| # of validators | msgs/ epoch | msgs/ sec | MB/ sec | msgs processed/ sec |
| -------- | -------- | -------- | -------- | -------- |
| 128,000 | 1,664,000 | 4,333 | 2.9 | 4,333 |
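The table values above follow from simple arithmetic, assuming an epoch of 384 seconds (32 slots × 12s); a small sketch reproducing both rows:
```go
package main

import "fmt"

const (
	msgsPerValidatorPerEpoch = 13  // proposal + prepare ×4 + commit ×4 + post-consensus ×4
	epochSeconds             = 384 // 32 slots × 12 seconds
	avgMsgBytes              = 700
)

// printLoad reproduces a table row: message and bandwidth rates for a given
// validator count.
func printLoad(validators int) {
	msgsPerEpoch := validators * msgsPerValidatorPerEpoch
	msgsPerSec := float64(msgsPerEpoch) / epochSeconds
	mbPerSec := msgsPerSec * avgMsgBytes / (1 << 20)
	fmt.Printf("%d validators: %d msgs/epoch, %.2f msgs/sec, %.2f MB/sec\n",
		validators, msgsPerEpoch, msgsPerSec, mbPerSec)
}

func main() {
	printLoad(1_000)   // one subnet:  13,000 msgs/epoch, ~33.85 msgs/sec, ~0.02 MB/sec
	printLoad(128_000) // all subnets: 1,664,000 msgs/epoch, ~4,333 msgs/sec, ~2.9 MB/sec
}
```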
**Network limitations**
We can broadly identify two network limitations:
1. Traffic overhead - how much data is propagated through the network
2. Message processing - how long it takes to fully process a message
In both cases we aim to stay below the capacity of an average $100-$200/month machine, so even large operators that need to subscribe to all subnets can handle the message load.
Traffic overhead seems well within reason (~3 MB/second), though in reality it will be larger, as pubsub adds overhead beyond the pure message size.
Message processing is harder to measure, but we can proxy it via its most expensive component: BLS signature verification.
From benchmarking on a desktop computer (Apple M1 Max), pure BLS verification can manage up to 8K signatures/second using the code below.
```go
package ssv_test

import (
	"fmt"
	"sync"
	"testing"
	"time"

	"github.com/bloxapp/ssv-spec/types"       // provides InitBLS; import paths assumed
	"github.com/herumi/bls-eth-go-binary/bls" // BLS implementation used by SSV
	"github.com/stretchr/testify/require"
)

// TestBLSVerify measures raw BLS verification throughput: 50 goroutines each
// verify the same signature 10,000 times, i.e. 500,000 verifications total.
func TestBLSVerify(t *testing.T) {
	var wg sync.WaitGroup
	types.InitBLS()

	// One key pair and one signature over a fixed message.
	sk := bls.SecretKey{}
	sk.SetByCSPRNG()
	pk := sk.GetPublicKey()
	sign := sk.SignByte([]byte{1, 2, 3, 4})

	start := time.Now()
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(pk bls.PublicKey) {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				require.True(t, sign.VerifyByte(&pk, []byte{1, 2, 3, 4}))
			}
		}(*pk)
	}
	wg.Wait()

	elapsed := time.Since(start)
	fmt.Printf("took %s\n", elapsed)
}
```
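The loop performs 500,000 verifications in total (50 goroutines × 10,000), so ~8K/second corresponds to roughly a minute of wall time; it can be run with `go test -run TestBLSVerify -v`.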
Conservatively, we might cap full message processing at 4K messages/second, leaving headroom below the benchmarked limit above.
**Random Sampling**
If we don't process every single non-committee message but instead randomly sample them, we can reduce overhead significantly.
Random sampling is acceptable because the goal is to "catch" junk messages and penalize their senders (specifically for non-committee messages).
Committee messages are processed in full.
`All subnets messages`
| # of validators | msgs/ epoch | msgs/ sec | MB/ sec | msgs processed/ sec | random sampling rate |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 128,000 | 1,664,000 | 4,333 | 2.9 | 4,333 | 100% |
| 128,000 | 1,664,000 | 4,333 | 2.9 | 2,166 | 50% |
| 128,000 | 1,664,000 | 4,333 | 2.9 | 1,083 | 25% |
Random sampling can put us back well within a reasonable overhead.
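To make the sampling concrete, a minimal sketch of the per-message decision (`shouldValidate` and the fixed rate are illustrative; a real implementation might adapt the rate per peer or per topic):
```go
package main

import (
	"fmt"
	"math/rand"
)

// shouldValidate decides whether a message gets full validation. Committee
// messages are always processed; non-committee messages are validated with
// probability sampleRate, which is enough to eventually catch peers that
// consistently propagate junk.
func shouldValidate(isCommittee bool, sampleRate float64) bool {
	if isCommittee {
		return true
	}
	return rand.Float64() < sampleRate
}

func main() {
	const sampleRate = 0.25 // the 25% row in the table above
	validated := 0
	for i := 0; i < 1_000_000; i++ {
		if shouldValidate(false, sampleRate) {
			validated++
		}
	}
	fmt.Printf("validated %d of 1,000,000 non-committee messages (~%.0f%%)\n",
		validated, 100*float64(validated)/1_000_000)
}
```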