
Alternative DAS concept based on RLNC

Abstract

The present DAS approach for Ethereum is based on Reed-Solomon erasure coding + KZG commitments. The novel scheme (described in Faster block/blob propagation in Ethereum by @potuz) introduces another pair: RLNC (Random Linear Network Coding) erasure coding + Pedersen commitments. It has nice properties which could be used to develop an alternative Data Availability approach, with its own pros and cons.

In this write-up, we'll describe the concepts of this approach: its benefits and its challenges. The reference algorithm does not pretend to be a final solution; it may simply highlight potential directions for investigating alternative Data Availability approaches.

We'll begin with the simplest ideas and then iteratively address their limitations. The following reasoning may be a bit unstructured or messy, but it can help clarify the implementation's underlying ideas. Feel free to skip it and go straight to the final algorithm description.

Intro

Much like the write-up Faster block/blob propagation in Ethereum, the original data is split into N vectors $v_1, \dots, v_N$ over the field $\mathbb{F}_p$. These vectors can be coded using RLNC and committed via a Pedersen commitment (of size $32 \times N$ bytes) included in the block. Any linear combination can be verified given the commitment and its associated coefficients.

We also assume that the coefficients may come from a much smaller subfield $C_q$, whose size could be as small as a single byte, or even a single bit.

Let's also introduce the sharding factor S: any regular node needs to download and custody 1/S of the original data size.

Idea #1

Let's assume we have enough 'super-nodes' (receiving and storing all N chunks) to serve all other [regular] nodes.

A regular node then performs sampling and custody by doing the following:

  • Choose locally N random coefficients $A = [a_1, \dots, a_N]$
  • Request from a super-node a vector which is the linear combination of all original data vectors: $w_A = a_1 v_1 + \dots + a_N v_N$
  • Validate the received vector $w_A$ against the Pedersen commitment from the corresponding block and its random coefficients

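For illustration, here is a minimal sketch of this check. It uses a toy multiplicative group modulo a prime in place of a real elliptic-curve group and toy parameters; production code would use something like Ristretto points and reduce scalars modulo the proper field and group orders.

import random

P = 2**127 - 1                      # toy prime modulus, NOT a secure group choice
VECTOR_LEN, N = 4, 4                # tiny illustrative sizes
GENS = [pow(7, k + 1, P) for k in range(VECTOR_LEN)]   # one generator per element

def commit(vector):
    # Pedersen-style vector commitment: product of GENS[k] ^ vector[k] (mod P)
    c = 1
    for g, x in zip(GENS, vector):
        c = c * pow(g, x, P) % P
    return c

# Publisher: original vectors v_1..v_N and their commitments (placed into the block)
originals = [[random.randrange(1000) for _ in range(VECTOR_LEN)] for _ in range(N)]
commitments = [commit(v) for v in originals]

# Sampling node: local random coefficients A = [a_1, ..., a_N]
A = [random.randrange(256) for _ in range(N)]

# Super-node: responds with w_A = a_1*v_1 + ... + a_N*v_N
w_A = [sum(a * v[k] for a, v in zip(A, originals)) for k in range(VECTOR_LEN)]

# Sampling node: verifies w_A without knowing the originals, using the homomorphism
# commit(a_1*v_1 + ... + a_N*v_N) == C_1^a_1 * ... * C_N^a_N
expected = 1
for a, c in zip(A, commitments):
    expected = expected * pow(c, a, P) % P
assert commit(w_A) == expected
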
The major takeaways from the above process:

  • The sampling node is [almost] sure that the responding super-node has all the data (Sampling side of the process)
  • The node obtains and stores a vector which is almost surely not collinear with any other honest node's custody vector (the custody side of the process)
    • (derivative statement) Any set of N honest nodes is sufficient to recover the original data

If all super-nodes are malicious, the maximum harm they can do is to deceive N - 1 nodes, making them treat unavailable data as available.

Here we have S == N, and any regular node downloads and custodies only 1/S of the original data size.

Challenge 1: the network is required to have enough super-nodes to serve regular nodes. The network is not homogeneous: there are 'servers' and 'clients', and no p2p data dissemination.

Idea #2

As an intermediary step, instead of super-nodes let's now have a set of specific nodes (let's call them 'seed nodes') which custody their own subsets of the original vectors. Let's say every such node stores just K (K < N) original vectors:

$[v_1, \dots, v_K]$, $[v_{K+1}, \dots, v_{2K}]$, etc.

A regular node now needs to query at least N/K seed nodes to make sure the whole data is collectively possessed by them.

After sampling, the node may aggregate the received vectors into a single one (just a linear combination with random coefficients) and store it in its custody.

We still have S == N so far

Challenge 2: we still have different kinds of nodes in the network
Challenge 3: while a regular node may custody just a single vector, it now needs to download K/N of the original data size instead of 1/N

Idea #3

Let's address Challenge 3 first: for this, let's increase the number of original chunks. Let S remain as before and now let N = S * K. So a regular node again needs to download just 1/S of the original data.

Idea #4

Let's now try to address Challenge 2.

Now let the 'seed' nodes store K random linear combinations of the original vectors ($w_1, \dots, w_K$) instead of K original vectors:

$w_1 = a_{1,1} v_1 + \dots + a_{1,N} v_N$
$\vdots$
$w_K = a_{K,1} v_1 + \dots + a_{K,N} v_N$

And let a sampling node request a linear combination of the $w_i$ possessed by its 'seed' peer, with local random coefficients $C = [c_1, \dots, c_K]$:

$w_C = c_1 w_1 + \dots + c_K w_K$

The seed peer would need to respond with $w_C$ together with the coefficients $a_{i,j}$.

After a node requests K samples it becomes a 'seed' node itself and may serve other nodes' requests. So here we are getting a canonical p2p network structure with all the nodes playing the same client/server role.

However there are new challenges:

Challenge 4: the seed peer may have fewer vectors than declared (it may even have just a single vector) and maliciously craft the coefficients $a_{i,j}$

Example of attack:
Suppose we have $\mathbb{F}_{256}$, $C_{16}$, data vectors of just a single element, and S = 2 (thus N = 4).

Suppose the original vectors are as follows (each data vector has just a single $\mathbb{F}_{256}$ element):

$(v_1, \dots, v_4) = ([32], [55], [1], [255])$

An honest node should have 2 linear combinations, but suppose a malicious node has just a single linear combination with coefficients

$a_1 = [a_{1,1}, \dots, a_{1,4}] = [1, 3, 7, 15]$

thus

$w_1 = [32] \cdot 1 + [55] \cdot 3 + [1] \cdot 7 + [255] \cdot 15 = [189]$

while $w_2$ is missing.

Suppose a sampling node requests the combination with the following coefficients $C$:

$C = [c_1, c_2] = [15, 3]$
$w_C = 15 w_1 + 3 w_2$

The sampling node doesn't yet know which linear combinations the responder has ($w_1$ and $w_2$); it expects to find that out from the response.

The malicious node responds with the only vector it has, $w_C = w_1 = [189]$, but claims that this is the linear combination of vectors with the following coefficients:

$a'_1 = [1, 1, 1, 1] / 15 = [15, 15, 15, 15]$
$a'_2 = [0, 2, 6, 14] / 3 = [0, 6, 2, 10]$

$15 a'_1 + 3 a'_2 = [1, 3, 7, 15] = a_1$

$w_C$ is legit and can be verified against $15 a'_1 + 3 a'_2$, and $a'_1$, $a'_2$ are linearly independent, so the response looks correct.
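
To make the forging trick concrete, here is a small sketch of it. It works over a toy prime field instead of $\mathbb{F}_{256}$/$C_{16}$ (so the concrete field arithmetic differs from the numbers above), but the linear-algebraic idea is the same.

q = 257                           # toy prime field standing in for the real fields

a1 = [1, 3, 7, 15]                # coefficients of the only vector the attacker has
v = [32, 55, 1, 255]              # original single-element data vectors
w1 = sum(a * x for a, x in zip(a1, v)) % q   # the only combination the attacker holds

c1, c2 = 15, 3                    # the sampling node's random request C = [c1, c2]
inv = lambda x: pow(x, q - 2, q)  # field inverse

# Attacker: split a1 into u + r and claim the (independent) basis (u/c1, r/c2)
u = [1, 1, 1, 1]
r = [(x - y) % q for x, y in zip(a1, u)]
a1_forged = [x * inv(c1) % q for x in u]
a2_forged = [x * inv(c2) % q for x in r]

# The claimed combination c1*a1_forged + c2*a2_forged collapses back to the true a1,
# so the single vector w1 verifies as if it were built from two custody vectors
claimed = [(c1 * x + c2 * y) % q for x, y in zip(a1_forged, a2_forged)]
assert claimed == a1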

Idea #5a

Let's try to address Challenge 4.

Split the request into 2 interactions:

  1. request the $a_{i,j}$ coefficients of the existing vectors
  2. request a linear combination of these vectors with local random coefficients $c_i$

As the existing vectors are now fixed after request (1), the seed peer has no more possibilities to cheat and would need to possess the declared vectors (or another basis for the same subspace) to calculate $w_C$ and respond to request (2).

Example:

Continuing with the example from the previous section: suppose the sampling node requests the existing $a_i$ first, and the malicious node responds with the same $[a'_1, a'_2]$ coefficients as above.

In this case the responder is able to provide a correct answer only if the requester asks for $C = [15, 3] \cdot k$ for some $k$. Otherwise it wouldn't be able to respond correctly.

Idea #5b

The above approach solves Challenge 4 but adds another interaction step, which may significantly matter during time-critical data dissemination.

Basically, the K vectors $a_i = [a_{i,1}, \dots, a_{i,N}]$ describe a K-dimensional subspace of the N-dimensional space (our node needs to sample just a single randomly selected vector from this subspace). There are almost infinitely many basis vector sets describing the same subspace, but what if there were just a single canonical basis describing that subspace?

I believe the Reduced Row Echelon Form (RREF) $\bar{a}$ of the matrix $a$ could probably serve as a canonical basis for the mentioned subspace:

$a = [a_1, \dots, a_K]$
$\bar{a} = \mathrm{rref}(a)$
$\bar{a} = [\bar{a}_1, \dots, \bar{a}_K]$
$\bar{a}_i = [\bar{a}_{i,1}, \dots, \bar{a}_{i,N}]$

The seed node may then derive the following data vectors:

$\bar{w}_i = \bar{a}_{i,1} v_1 + \dots + \bar{a}_{i,N} v_N$

And respond to the request with the following linear combination:

$w_C = c_1 \bar{w}_1 + \dots + c_K \bar{w}_K$

As the canonical basis is implicitly unique and fixed, this should make the additional preliminary interaction for fixing the existing coefficients obsolete.

Example:

Continuing with the example from the previous section:

The requesting node now has an additional requirement on the coefficient vectors $a_1$ and $a_2$: the matrix formed by the rows $a_1$ and $a_2$ should be in RREF.

The trick is that while an attacker may craft any number of pairs $(a'_1, a'_2)$ satisfying $15 a'_1 + 3 a'_2 = a_1$, with very high probability it can't craft such a pair that also forms an RREF matrix.

DISCLAIMER: The last statement is quite intuitive and needs a better justification.
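
For illustration, here is a small sketch of the canonical-basis property, again over a toy prime field in place of the small coefficient subfield: two different bases of the same subspace reduce to the same RREF matrix, which is what lets the requester pin down the responder's subspace without an extra round-trip.

q = 257                                     # toy prime "coefficient field"

def rref(matrix):
    # Reduced Row Echelon Form over GF(q) via Gauss-Jordan elimination
    m = [row[:] for row in matrix]
    rows, cols = len(m), len(m[0])
    lead = 0
    for r in range(rows):
        if lead >= cols:
            return m
        i = r
        while m[i][lead] == 0:
            i += 1
            if i == rows:
                i, lead = r, lead + 1
                if lead == cols:
                    return m
        m[i], m[r] = m[r], m[i]
        inv = pow(m[r][lead], q - 2, q)      # normalize the pivot to 1
        m[r] = [x * inv % q for x in m[r]]
        for j in range(rows):                # eliminate the pivot column elsewhere
            if j != r and m[j][lead] != 0:
                m[j] = [(a - m[j][lead] * b) % q for a, b in zip(m[j], m[r])]
        lead += 1
    return m

# Two different bases of the same 2-dimensional subspace of GF(q)^4 ...
basis_a = [[1, 3, 7, 15], [0, 2, 6, 14]]
basis_b = [[1, 5, 13, 29], [2, 8, 20, 44]]   # rows are a_1 + a_2 and 2*a_1 + a_2
# ... reduce to the same canonical matrix
assert rref(basis_a) == rref(basis_b)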

Idea #6

We could probably remove the remaining interaction without sacrificing security. It's likely fine to have static public random coefficients $c_i$ which could be derived, for example, from the node's nodeID. In this case a seed node may compose and propagate a sample vector to its peers as soon as it has received enough vectors of its own.

Idea #7

So far we have:

  • S - sharding factor
  • N - total number of original vectors
  • K = N / S - the number of vectors a node needs to download and custody

We want to minimize N to reduce the commitment and proof overheads.
We want to maximize S potentially to have a better sharding effect

Thus we want to minimize K.

However, every regular node stores just 1/S of the data, so to sample the whole N-dimensional space a node needs to download at least S vectors (at least one from each of S distinct peers).

Thus K needs to be at least as large as S, and since we want to minimize it, it should be equal to S:

S == K
N = S ^ 2 == K ^ 2

Idea #8

Assume the subfield $C_q$ is pretty small, e.g. its elements fit in just 1 byte. Then with S = 32 a sampling message should contain 32 * 1024 coefficients, i.e. 32 KBytes, which is the same size as a data vector itself. That sounds like a significant overhead.

But what if we make the subfield $C_q$ even smaller: just 1 bit. A vector sampled from 32 original vectors still has only a small chance of being collinear with another random vector sampled from the same 32 original vectors: 1 / 2^32. This is not negligible, but rare collisions are acceptable here. With 1-bit coefficients we would need just 4 KBytes to pass the original coefficients. That looks like a good saving.

The final algorithm

Brief overview

A node receives pseudo-random linear combination vectors from its peers.
Every vector is accompanied by the basis vectors of the subspace it was randomly sampled from. The subspace basis vectors provably exist in the peer's custody.
The node samples until:

  • all received basis vectors together span the full N-dimensional space
  • it has received at least S vectors for its custody

After that the node completes sampling and becomes an 'active' node which can be used for sampling by other nodes.

[Diagram: DA-RLNC.drawio]

Note

The Python-like code snippets below are for illustration purposes only; they are not intended to be compiled and executed for now. Some obvious or library-like functions lack implementations so far.

Types

Let data vector elements be from a finite field $\mathbb{F}_p$ whose elements can be mapped to 32 bytes (exactly 32 for simplicity; in fact it's a bit less):

def FScalar

Let coefficients be from a significantly smaller subfield $C_q$; they can be encoded with COEF_SIZE bytes (which could be fractional):

def CScalar

Let a single data vector have VECTOR_LEN elements from $\mathbb{F}_p$:

def DataVector = Vector[FScalar, VECTOR_LEN]

The sharding factor S means that every regular node receives, sends and stores just 1/S fraction of the whole data.

Let N = S * S and let the whole data be split into N original DataVectors:

def OriginalData = Vector[DataVector, N]

A derived DataVector is a linear combination of original DataVectors. Since an original vector is just a special case of a derived vector, we will no longer make any distinction between them.

To validate a DataVector against the commitment, its linear combination coefficients should be known:

def ProofVector = Vector[CScalar, N]

There is also a slightly different sequence of coefficients which is used when deriving a new vector from existing DataVectors:

def CombineVector = List[CScalar]

Structures

A data vector accompanied by its linear combination coefficients:

class DataAndProof:
    data_vector: DataVector
    proof: ProofVector

The only message for dissemination is

class DaMessage:
    data_vector: DataVector
    # RREF coefficients (proofs) of the sender's custody vectors;
    # the ProofVector for validation is derived from these
    orig_coefficients: List[ProofVector]
    seq_no: Int

There is an ephemeral store for a block's data:

class BlockDaStore:
    custody: List[DataAndProof, N]
    sample_matrix: List[ProofVector]
    
    # commitment is initialized first from the corresponding block
    commitment: Vector[RistrettoPoint]

Functions

Let the function random_coef generate a deterministically random coefficient vector from the seed (node_id, seq_no):

def random_coef(node_id: UInt256, length: int, seq_no: int) -> CombineVector
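
A possible sketch of such a function, assuming (purely for illustration) SHA-256-based expansion and 1-byte coefficients, and that CScalar can be constructed from an int:

import hashlib

def random_coef(node_id: int, length: int, seq_no: int) -> CombineVector:
    # Deterministically expand (node_id, seq_no) into `length` 1-byte coefficients
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(
            node_id.to_bytes(32, 'big')
            + seq_no.to_bytes(4, 'big')
            + counter.to_bytes(4, 'big')
        ).digest()
        counter += 1
    return [CScalar(b) for b in out[:length]]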

Let's define an abstract utility function to compute a linear combination of any vectors:

def linear_combination(
        vectors: Sequence[AnyVector], 
        coefficients: CombineVector) -> AnyVector
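
A possible generic implementation, assuming scalar multiplication and addition are defined for the vector types involved:

def linear_combination(vectors, coefficients):
    # Sum of coefficient-scaled vectors: c_1*u_1 + c_2*u_2 + ...
    assert len(vectors) == len(coefficients) and len(vectors) > 0
    result = vectors[0] * coefficients[0]
    for vec, coef in zip(vectors[1:], coefficients[1:]):
        result = result + vec * coef
    return result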

Functions which perform linear operations on DataVectors and ProofVectors simultaneously to transform the ProofVectors to RREF, and check whether a set of proofs is in RREF:

def to_rref(data_vectors: Sequence[DataAndProof]) -> Sequence[DataAndProof]
def is_rref(proofs: Sequence[ProofVector]) -> bool

A function creating a new deterministically random message from the existing custody vectors for a specific peer:

def create_da_message(
        da_store: BlockDaStore,
        receiver_node_id: uint256,
        seq_no: int) -> DaMessage:
    rref_custody_data = to_rref(da_store.custody)
    rnd_coefs = random_coef(receiver_node_id, len(rref_custody_data), seq_no)
    data_vector = linear_combination(
        [e.data_vector for e in rref_custody_data], rnd_coefs)
    custody_proofs = [e.proof for e in rref_custody_data]
    return DaMessage(data_vector, custody_proofs, seq_no)

Computes the rank of the data vectors' coefficients. A rank equal to N means that the original data can be restored from the data vectors:

def rank(proof_matrix: Sequence[ProofVector]) -> int
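
A possible sketch for the 1-bit coefficient case (GF(2)), where each proof vector is treated as a row of 0/1 values and the rank is computed by Gaussian elimination:

def rank(proof_matrix) -> int:
    # Gaussian elimination over GF(2); rows are lists of 0/1 coefficients
    rows = [[int(c) & 1 for c in row] for row in proof_matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r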

Publish

The publisher should split the data and get OriginalData.

Before propagating the data itself, the publisher should calculate the Pedersen commitments (of size 32 * N bytes), then build and propagate the block with these commitments.

Then it should populate the custody with the (original) data vectors and their proofs, which are simply standard basis vectors: [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], etc.:

def init_da_store(data: OriginalData, da_store: BlockDaStore):
    for i, v in enumerate(data):
        e_proof = ProofVector(
            [CScalar(1) if index == i else CScalar(0) for index in range(N)]
        )
        da_store.custody += DataAndProof(v, e_proof)

The publishing process is basically the same as the propagation process, except that the publisher sends S messages to a single peer in a single round instead of sending just 1 message as during propagation.

def publish(da_store: BlockDaStore, mesh: Sequence[Peer]):
    # The Peer type is left undefined here
    for peer in mesh:
        for seq_no in range(S):
            peer.send(
                create_da_message(da_store, peer.node_id, seq_no)
            )

Receive

Assume that the corresponding block has been received and BlockDaStore.commitment has been filled prior to receiving a message: DaMessage

Derive the ProofVector from the original coefficient vectors:

def derive_proof_vector(my_node_id: uint256, message: DaMessage) -> ProofVector:
    lin_c = random_coef(
        my_node_id,
        len(message.orig_coefficients),
        message.seq_no)
    return linear_combination(message.orig_coefficients, lin_c)
Validate

The message is validated first:

def validate_message(message: DaMessage, da_store: BlockDaStore):
    # Verify the original coefficients are in Reduced Row Echelon Form
    assert is_rref(message.orig_coefficients)
    # Verify that the source vectors are linearly independent
    assert rank(message.orig_coefficients) == len(message.orig_coefficients)
    # Validate Pedersen commitment
    proof = derive_proof_vector(my_node_id, message)
    validate_pedersen(message.data_vector, da_store.commitment, proof)
Store

The received DataVector is added to the custody.
The sample_matrix accumulates all the original coefficients, and as soon as the matrix rank reaches N the sampling process has succeeded:

def process_message(message: DaMessage, da_store: BlockDaStore) -> bool:
    da_store.custody += DataAndProof(
        message.data_vector,
        derive_proof_vector(my_node_id, message)
    )
    da_store.sample_matrix += message.orig_coefficients
    is_sampled = N == rank(da_store.sample_matrix)
    return is_sampled

is_sampled == true means that we are [almost] 100% sure that the peers who sent us messages collectively possess 100% of the information needed to recover the data.

Propagate

When process_message() returns True, the sampling was successful and the custody is filled with a sufficient number of vectors. From now on the node may start creating and propagating its own vectors.

We send a vector to every peer in the mesh which hasn't sent any vector to us yet:

# mesh: here the set of peers which haven't sent any vectors yet 
def propagate(da_store: BlockDaStore, mesh: Sequence[Peer]):
    for peer in mesh:
        peer.send(
            create_da_message(da_store, peer.node_id, 0)
        )

Pros and Cons

Pros

  • The total data size to be disseminated, custodied and sampled is x1 (+ coefficients overhead of about 20% for S = 32) of the original data size (compared to x2 for PeerDAS and x4 for FullDAS)
  • We may have a sharding factor as large as 32 (or probably even greater with more modifications), which means every node needs to download and custody just 1/32 of the original data size (vs. 1/8 in the present PeerDAS approach: 8 columns out of 64 original)
  • Data is available and may be recovered from any S regular honest nodes
  • The approach is 'what you sample is what you custody and serve' (like PeerDAS subnet sampling, but unlike full sharding peer sampling). The concerns regarding the peer sampling approach are described here
  • Since no column slicing is involved, separate blobs may simply span several original data vectors, and thus the sampling process may potentially benefit from the EL blob transaction pool (aka engine_getBlobs())
  • No smaller subnets which may be vulnerable to sybil attacks
  • Duplication of data at the transport level could potentially be lower, due to the fact that a node needs S messages from any of its peers
  • RLNC + Pedersen could be less resource-consuming than RS + KZG

Cons

  • Commitment and proof sizes grow quadratically with the sharding factor S.
  • Still need to estimate dissemination latency as a node may propagate data only after it completes its own sampling
  • In the happy case the minimal number of peers for sampling is 32 (compared to 8 in PeerDAS)

Statements to be proved

The above algorithm relies on a number of [yet fuzzy] statements which still need to be proved:

  1. A valid linear combination with requested random coefficients returned by a super-node proves that the responding node has access to the full data (enough vectors to recover it). This statement doesn't look that tricky to prove
  2. If a full node claims it has basis vectors spanning a coefficient subspace, and a node requests a random vector from this subspace and gets a valid data vector, then this proves the responding node indeed has the basis vectors of the claimed subspace. The proof could be a generalization of statement (1)
  3. An EC over $\mathbb{F}_p$ with coefficient subfield $C_2$ makes sense and is secure
  4. The trick with the RREF basis form for zero-RTT sampling actually works
  5. The described algorithm yields evenly distributed basis vectors across honest nodes for the full N-dimensional space. There is a pretty small E (on the order of 1-3) such that the rank of the basis vectors of any randomly selected S + E nodes would be N with 'very high' probability. The same should hold in a Byzantine environment