The present DAS approach for Ethereum is based on Reed-Solomon erasure coding + KZG commitments. The novel scheme (described here: Faster block/blob propagation in Ethereum by @potuz) introduces another pair: RLNC (Random Linear Network Coding) erasure coding + Pedersen commitments. It has nice properties which could be used to develop an alternative Data Availability approach, with its own pros and cons.
In this write-up, we'll describe the concepts of this approach: its benefits and its challenges. The reference algorithm does not pretend to be a final solution; it may just highlight potential directions to investigate alternative Data Availability approaches.
We'll begin with the simplest ideas and then iteratively address their limitations. The following reasoning may be a bit unstructured or messy, but it can help clarify the implementation's underlying ideas. Feel free to skip it and go straight to the final algorithm description.
Much like the write-up Faster block/blob propagation in Ethereum, the original data is split into N vectors over a finite field. These vectors can be coded using RLNC and committed via Pedersen commitments (of size 32 * N in total) included in the block. Any linear combination can be verified given the commitments and its associated coefficients.
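To make this property concrete, here is a minimal toy sketch (deliberately tiny, insecure parameters; the names `commit`, `GENS`, etc. are illustrative, not from the scheme) showing how the homomorphism of Pedersen-style commitments lets anyone verify a linear combination given the commitments and the coefficients:

```python
# Toy Pedersen-style vector commitments over a small prime-order group.
# NOT secure parameters -- only illustrating the homomorphic verification.
import random

Q = 1019            # toy field / exponent modulus (prime)
P = 2 * Q + 1       # 2039, also prime; we work in the order-Q subgroup of Z_P*
VECTOR_LEN = 4      # elements per data vector (toy value)
N = 3               # number of original vectors (toy value)

rnd = random.Random(42)
GENS = [pow(rnd.randrange(2, P - 1), 2, P) for _ in range(VECTOR_LEN)]

def commit(vector):
    """Commitment = product of g_j ^ v_j (mod P)."""
    c = 1
    for g, v in zip(GENS, vector):
        c = c * pow(g, v, P) % P
    return c

# Original vectors and their commitments (the commitments go into the block).
originals = [[rnd.randrange(Q) for _ in range(VECTOR_LEN)] for _ in range(N)]
commitments = [commit(v) for v in originals]

# Any linear combination of the originals...
coefs = [rnd.randrange(Q) for _ in range(N)]
combined = [sum(c * v[j] for c, v in zip(coefs, originals)) % Q
            for j in range(VECTOR_LEN)]

# ...can be verified against the commitments using only the coefficients:
expected = 1
for c, cm in zip(coefs, commitments):
    expected = expected * pow(cm, c, P) % P
assert commit(combined) == expected
```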
We also assume that the coefficients may come from a much smaller subfield, whose size could be as small as a single byte or even a single bit.
Let's also introduce the sharding factor S: any regular node needs to download and custody 1/S of the original data size.
Let's assume we have enough 'super-nodes' (receiving and storing all N chunks) to serve all other [regular] nodes.
A regular node then performs sampling and custody by requesting from a super-node a random linear combination of all N chunks with N random coefficients, and storing the received vector in its custody.

The major takeaways from the above process:

- Any N honest [regular] nodes are collectively sufficient to recover the original data
- If all super-nodes are malicious, the maximum harm they can do is to deceive N - 1 nodes, making them treat non-available data as available
Here we have S == N and any regular node downloads and custodies only 1/S of the original data size.
Challenge 1: the network is required to have enough super-nodes to serve regular nodes. The network is not homogeneous – there are 'servers' and 'clients'. No p2p data dissemination.
As an intermediate step, instead of super-nodes let's now have a set of specific nodes (let's call them 'seed nodes') which custody their own subset of the original vectors. Let's say every such node stores just K (K < N) original vectors.
A regular node now needs to query at least N/K seed nodes to make sure the whole data is collectively possessed by them.
After sampling, the node may aggregate the received vectors into a single one (just a linear combination with random coefficients) and store it in its custody.
We still have S == N so far.
Challenge 2: we still have different kinds of nodes in the network
Challenge 3: while a regular node may custody just a single vector, it now still needs to download K/N of the original data size instead of 1/N
Let's address Challenge 3 first: for this, let's increase the number of original chunks. Let S remain as before and now let N = S * K. So a regular node again needs to download just 1/S of the original data.
Let's now try to address Challenge 2.

Now let the 'seed' nodes store K random linear combinations of the original vectors instead of K original vectors.

And let a sampling node request a linear combination of the vectors possessed by its 'seed' peer, with local random coefficients. The seed peer would need to respond with the resulting vector together with its coefficients relative to the original vectors (so the result can be verified against the commitments).
After a node requests K samples it becomes a 'seed' node itself and may serve other nodes' requests. So here we are getting a canonical p2p network structure with all the nodes playing the same client/server role.
However there are new challenges:
Challenge 4: the seed peer may have fewer vectors than declared (it may even have just a single vector) and maliciously craft the claimed coefficients
Example of attack:

Suppose we have data vectors of just a single element each and S = 2 (thus having N = 4).

An honest node should have 2 linear combinations of the original vectors, but suppose a malicious node has just a single linear combination, so one dimension of its declared subspace is missing.

A sampling node requests a combination of the two declared vectors with its own random coefficients. The sampling node doesn't yet know what linear combinations the responder actually has; it expects to find that out from the response.

The malicious node responds with the only vector it has, however it claims that this vector is the requested linear combination of two vectors, crafting the claimed coefficients after seeing the request. The returned vector is legit and can be verified against the commitments, and the claimed coefficient vectors are linearly independent, so the response looks correct.
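A runnable toy version of this attack (the field modulus, vectors and coefficients below are made up for illustration and are not the original example's values):

```python
# A malicious 'seed' node holds a single combination w with coefficients c,
# but claims to hold two vectors; it crafts the claimed basis AFTER seeing
# the request, so its response still verifies.
Q = 1019                            # toy field modulus
originals = [5, 11, 23, 42]         # N = 4 single-element original vectors

c = [1, 2, 3, 4]                    # coefficients of the only vector it has
w = sum(ci * di for ci, di in zip(c, originals)) % Q

r1, r2 = 7, 9                       # the sampling node's local coefficients

# Pick an arbitrary first claimed basis vector a' and solve r1*a' + r2*b' = c.
a_claim = [1, 0, 0, 0]
r2_inv = pow(r2, -1, Q)
b_claim = [(ci - r1 * ai) * r2_inv % Q for ci, ai in zip(c, a_claim)]

# The verifier recombines the claimed basis with its own r1, r2 ...
combined_coefs = [(r1 * ai + r2 * bi) % Q for ai, bi in zip(a_claim, b_claim)]
# ... and checks the returned vector against the resulting coefficients
# (this is what the commitment check boils down to). The check passes even
# though the responder only holds a 1-dimensional subspace.
assert combined_coefs == [ci % Q for ci in c]
assert w == sum(ci * di for ci, di in zip(combined_coefs, originals)) % Q
```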
Let's try to approach Challenge 4.

Split the request into 2 interactions:

1. First request the coefficients of the linear combinations the seed peer currently possesses.
2. Then request a linear combination of those vectors with the sampling node's own random coefficients.

As the existing vectors are now fixed after request (1), the seed peer has no more possibilities to cheat: it would need to possess the declared vectors (or another basis of the same subspace) to calculate the response to request (2).
Example:
Continuing with the sample from the previous topic: suppose the sampling node requests the existing coefficients first, and the malicious node responds with the same crafted coefficients as before. In this case the responder is able to provide a correct answer to the follow-up request only for those requested coefficients that happen to land in the subspace it actually possesses; for any other coefficients it wouldn't be able to respond correctly.
The above approach solves Challenge 4 but adds another interaction step, which may significantly matter during time-critical data dissemination.
Basically, K vectors describe a K-dimensional subspace of the N-dimensional space (our node needs to sample just a single randomly selected vector from this subspace). There are infinitely many basis sets describing the same subspace, but what if there were just a single canonical basis describing that subspace?

I believe the Reduced Row Echelon Form (RREF) of the coefficient matrix could serve as a canonical basis for the mentioned subspace.
The seed node may then derive the data vectors corresponding to this canonical basis and respond to a request with a linear combination of them. As the canonical basis is unique and implicitly fixed, this should make the additional preliminary interaction for fixing the existing coefficients obsolete.
Example:
Continuing with the sample from the previous topic…
So the requesting node now has an additional requirement on the responder's claimed coefficient vectors: the matrix composed of them should be in RREF form.

The trick is that while an attacker may craft any number of coefficient vector sets which satisfy the verification check, with very high probability it can't craft such a set which also forms an RREF matrix.

DISCLAIMER: The last statement is quite intuitive and needs a better justification.
We could probably remove the remaining interaction without sacrificing security. It's likely fine to have static public random coefficients which could be derived, for example, from the node's nodeID. In this case a seed node may compose and propagate a sample vector to its peers as soon as it has received enough vectors of its own.
So far we have:

- S - sharding factor
- N - total number of original vectors
- K = N / S - the number of vectors a node needs to download and custody

We want to minimize N to reduce the commitment and proof overheads. We want to maximize S to potentially have a better sharding effect. Thus we want to minimize K.
However every regular node stores just 1/S of the data, and to sample the whole N-dimensional space a node needs to download at least S vectors. Thus K needs to be at least as large as S, and hence it needs to be equal to S: K = S (and N = S * S).
Assume the subfield is pretty small, e.g. its elements fit in just 1 byte. Then with S = 32 a sampling message should contain 32 * 1024 coefficients, i.e. 32 KBytes, which is the same size as a data vector itself. That sounds like a significant overhead.

But what if we make the subfield even smaller: just 1 bit. A vector sampled from 32 original vectors still has only a small chance of being collinear with another random vector sampled from the same 32 original vectors: 1 / 2^32. This is not negligible, but rare collisions are acceptable here. With 1-bit coefficients we would need just 4 KBytes to pass the original coefficients. That looks like a good saving.
- A node receives pseudo-random linear combination vectors from its peers.
- Every vector is accompanied by the basis vectors of the subspace this vector was randomly sampled from. These subspace basis vectors provably exist in the peer's custody.
- The node samples until the accumulated basis vectors span the whole space, collecting S vectors for its custody.
- After that the node completes sampling and becomes an 'active' node which can be used for sampling by other nodes.
The Python-like code snippets below are for illustration purposes only; they are not intended to be compiled and executed for now. Some obvious or library-like functions lack their implementations so far.
Let data vector elements be from a finite field which could be mapped to 32 bytes (exactly 32 for simplicity; in fact it's a bit less). Let coefficients be from a significantly smaller subfield and be encoded with COEF_SIZE bytes (which could be fractional). Let a single data vector have VECTOR_LEN elements of that field. The sharding factor S means that every regular node receives, sends and stores just a 1/S fraction of the whole data.
Let N = S * S and let the whole data be split into N original DataVectors.
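A minimal sketch of how these definitions could look; S = 32, N = S * S and the 32-KByte vector size follow the text above, while the exact byte packing and the helper name are assumptions of this sketch:

```python
FIELD_ELEMENT_SIZE = 32        # bytes per field element (approximately; see above)
COEF_SIZE = 1 / 8              # bytes per coefficient (1-bit coefficients)
VECTOR_LEN = 1024              # field elements per data vector (32 KBytes per vector)
S = 32                         # sharding factor
N = S * S                      # total number of original vectors

# A DataVector is simply a list of VECTOR_LEN field elements,
# OriginalData is a list of N such vectors.

def split_data(data: bytes) -> list:
    """Split (and zero-pad) the raw data into N vectors of VECTOR_LEN elements.
    A real implementation must keep every element below the field modulus."""
    step = FIELD_ELEMENT_SIZE
    elements = [int.from_bytes(data[i:i + step], "little")
                for i in range(0, len(data), step)]
    elements += [0] * (N * VECTOR_LEN - len(elements))
    return [elements[i * VECTOR_LEN:(i + 1) * VECTOR_LEN] for i in range(N)]
```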
A derived DataVector is a linear combination of original DataVectors. Since an original vector is just a special case of a derived vector, we will no longer make any distinction between them.
To validate a DataVector against the commitments, its linear combination coefficients should be known.
There is also a slightly different sequence of coefficients which is used when deriving a new vector from existing DataVectors.
A data vector is accompanied by its linear combination coefficients.
The only message needed for dissemination is the DaMessage.
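One possible shape of these types; the class and field names (ProvedDataVector, custody_basis, seq_no) are assumptions of this sketch, guided by the descriptions above:

```python
from dataclasses import dataclass
from typing import List

DataVector = List[int]    # VECTOR_LEN field elements
ProofVector = List[int]   # N coefficients w.r.t. the original vectors
CoefVector = List[int]    # coefficients w.r.t. the vectors already held in a custody

@dataclass
class ProvedDataVector:
    """A data vector accompanied by its linear combination coefficients."""
    data: DataVector
    proof: ProofVector    # enough to validate `data` against the block commitments

@dataclass
class DaMessage:
    """The only message needed for dissemination."""
    vector: ProvedDataVector
    # Original-basis coefficients of the sender's custody vectors: the basis of
    # the subspace the sent vector was pseudo-randomly sampled from.
    custody_basis: List[ProofVector]
    seq_no: int           # together with the nodeId it seeds the sampling coefficients
```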
There is an ephemeral store for block data.
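A sketch of that store; the commitment, custody, sample_matrix and is_sampled fields follow the descriptions in this write-up, their exact types are an assumption:

```python
from dataclasses import dataclass, field

@dataclass
class BlockDaStore:
    # N Pedersen commitments taken from the block
    commitment: list = field(default_factory=list)
    # ProvedDataVectors this node custodies (and serves to its peers)
    custody: list = field(default_factory=list)
    # All original-basis coefficient vectors seen from peers so far;
    # rank N here means our peers collectively possess the whole data
    sample_matrix: list = field(default_factory=list)
    is_sampled: bool = False
```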
Let the function random_coef generate a deterministically random coefficient vector seeded by (nodeId, seq_no).
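A possible implementation; the use of SHA-256 as the expander and the 1-bit coefficients are assumptions of this sketch:

```python
import hashlib
from typing import List

def random_coef(node_id: bytes, seq_no: int, length: int) -> List[int]:
    """Deterministically expand (node_id, seq_no) into `length` 1-bit coefficients."""
    coefs: List[int] = []
    counter = 0
    while len(coefs) < length:
        digest = hashlib.sha256(node_id + seq_no.to_bytes(8, "little")
                                + counter.to_bytes(8, "little")).digest()
        for byte in digest:
            for bit in range(8):
                coefs.append((byte >> bit) & 1)
        counter += 1
    return coefs[:length]
```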
Let's define an abstract utility function to compute a linear combination of arbitrary vectors.
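A straightforward version over a prime field; FIELD_MODULUS is a placeholder prime here, not the scheme's actual field:

```python
FIELD_MODULUS = 2**255 - 19   # placeholder prime modulus for the sketches below

def linear_combination(vectors, coefs):
    """Element-wise sum of coefs[i] * vectors[i], reduced modulo the field."""
    assert vectors and len(vectors) == len(coefs)
    length = len(vectors[0])
    return [sum(c * v[j] for c, v in zip(coefs, vectors)) % FIELD_MODULUS
            for j in range(length)]
```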
A function which performs linear operations on DataVectors and ProofVectors simultaneously, to transform the ProofVectors to RREF.
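A sketch of such a function: plain Gaussian elimination over the field (FIELD_MODULUS from the sketch above), applying every row operation to the data vectors and the proof vectors in lockstep so they stay consistent:

```python
def to_rref(data_vectors, proof_vectors):
    """Return (data, proofs) with the proof rows in RREF; zero rows are dropped."""
    data = [row[:] for row in data_vectors]
    proofs = [row[:] for row in proof_vectors]
    pivot_row = 0
    for col in range(len(proofs[0])):
        # Find a row with a non-zero entry in this column.
        pivot = next((r for r in range(pivot_row, len(proofs)) if proofs[r][col]), None)
        if pivot is None:
            continue
        proofs[pivot_row], proofs[pivot] = proofs[pivot], proofs[pivot_row]
        data[pivot_row], data[pivot] = data[pivot], data[pivot_row]
        # Normalize the pivot to 1.
        inv = pow(proofs[pivot_row][col], -1, FIELD_MODULUS)
        proofs[pivot_row] = [x * inv % FIELD_MODULUS for x in proofs[pivot_row]]
        data[pivot_row] = [x * inv % FIELD_MODULUS for x in data[pivot_row]]
        # Eliminate this column from all other rows.
        for r in range(len(proofs)):
            if r != pivot_row and proofs[r][col]:
                f = proofs[r][col]
                proofs[r] = [(a - f * b) % FIELD_MODULUS
                             for a, b in zip(proofs[r], proofs[pivot_row])]
                data[r] = [(a - f * b) % FIELD_MODULUS
                           for a, b in zip(data[r], data[pivot_row])]
        pivot_row += 1
    return data[:pivot_row], proofs[:pivot_row]
```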
The function creating a new deterministically random message from existing custody vectors for a specific peer.
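Continuing the sketches above, one way this could look (the function name create_da_message is an assumption, and so is seeding the coefficients with the receiving peer's nodeId):

```python
def create_da_message(da_store, peer_node_id: bytes, seq_no: int):
    """Deterministically sample one vector from our custody subspace for a peer."""
    custody_data = [pv.data for pv in da_store.custody]
    custody_proofs = [pv.proof for pv in da_store.custody]
    # Canonical (RREF) basis of our custody subspace.
    rref_data, rref_proofs = to_rref(custody_data, custody_proofs)
    # Pseudo-random coefficients the receiving peer can recompute on its own.
    coefs = random_coef(peer_node_id, seq_no, len(rref_data))
    new_vector = ProvedDataVector(
        data=linear_combination(rref_data, coefs),
        proof=linear_combination(rref_proofs, coefs),
    )
    return DaMessage(vector=new_vector, custody_basis=rref_proofs, seq_no=seq_no)
```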
Computes the rank of the data vectors' coefficients. A rank equal to N means that the original data can be restored from the data vectors.
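A simple way to compute it, reusing the elimination sketch above:

```python
def coef_rank(proof_vectors) -> int:
    """Rank of the accumulated coefficient vectors (rank == N: data is recoverable)."""
    if not proof_vectors:
        return 0
    dummy_data = [[0] for _ in proof_vectors]   # the data part is irrelevant here
    _, rref_proofs = to_rref(dummy_data, proof_vectors)
    return len(rref_proofs)
```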
The publisher should split the data and get OriginalData.

Before propagating the data itself, the publisher should calculate the Pedersen commitments (of size 32 * N), then build and propagate the block with these commitments.

Then it should populate its custody with the (original) data vectors and their proofs, which are simply the elements of the standard basis: [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], etc.
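A sketch of the publisher's initialization, continuing the sketches above; commit_vectors is a placeholder for the Pedersen commitment step:

```python
def init_publisher_custody(data: bytes, da_store) -> None:
    originals = split_data(data)                      # N original DataVectors
    da_store.commitment = commit_vectors(originals)   # placeholder; 32 * N bytes in the block
    for i, vec in enumerate(originals):
        # The proof of an original vector is an element of the standard basis.
        standard_basis = [1 if j == i else 0 for j in range(N)]
        da_store.custody.append(ProvedDataVector(data=vec, proof=standard_basis))
    # The publisher trivially possesses the whole data.
    da_store.sample_matrix = [pv.proof for pv in da_store.custody]
    da_store.is_sampled = True
```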
The publishing process is basically the same as the propagation process, except the publisher sends S messages to a single peer in a single round instead of sending just 1 message as during propagation.
Assuming that the corresponding block has been received and BlockDaStore.commitment is filled prior to receiving a DaMessage:

First, derive the ProofVector of the received vector with respect to the original vectors.
The message is then validated.
The received DataVector is added to the custody.
The sample_matrix accumulates all the original coefficients, and as soon as the matrix rank reaches N the sampling process has succeeded.
is_sampled == true means that we are [almost] 100% sure that the peers which sent us messages collectively possess 100% of the information needed to recover the data.
When process_message() returns True, this means that the sampling was successful and the custody is filled with a sufficient number of vectors. From now on the node may start creating and propagating its own vectors.
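Putting the processing steps above together, a sketch of process_message (continuing the earlier sketches; is_rref and verify_against_commitments are placeholder helpers, and seeding the coefficients with the receiving node's own nodeId is an assumption):

```python
def process_message(msg, my_node_id: bytes, da_store) -> bool:
    # The sender's declared basis must be in canonical RREF form.
    assert is_rref(msg.custody_basis)
    # Recompute the pseudo-random coefficients the sender was obliged to use
    # and derive the expected proof of the received vector.
    coefs = random_coef(my_node_id, msg.seq_no, len(msg.custody_basis))
    assert msg.vector.proof == linear_combination(msg.custody_basis, coefs)
    # Validate the data vector against the block commitments (placeholder helper).
    assert verify_against_commitments(msg.vector.data, msg.vector.proof,
                                      da_store.commitment)
    # The received vector goes to the custody.
    da_store.custody.append(msg.vector)
    # Accumulate the sender's subspace basis; rank N means our peers
    # collectively possess the whole data.
    da_store.sample_matrix.extend(msg.custody_basis)
    if not da_store.is_sampled and coef_rank(da_store.sample_matrix) == N:
        da_store.is_sampled = True
    return da_store.is_sampled
```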
We send a vector to every peer in the mesh which hasn't sent any vector to us yet.
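A sketch of that propagation step (the peer bookkeeping, i.e. mesh_peers, received_from and peer.send(), is an assumption of this sketch):

```python
def propagate(da_store, mesh_peers, received_from) -> None:
    """Send one deterministic vector to every mesh peer we haven't heard from."""
    for peer in mesh_peers:
        if peer.node_id in received_from:
            continue
        peer.send(create_da_message(da_store, peer.node_id, seq_no=0))
```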
Summing up the properties of the approach (with S = 32):

- a regular node downloads just a fraction of the original data size during dissemination (compared to x2 for PeerDAS and x4 for FullDAS)
- a regular node custodies just 1/32 of the original data size (v.s. 1/8 in the present PeerDAS approach: 8 columns from 64 original)
- the original data can be collectively recovered from any S regular honest nodes
- a node which already has the blobs locally (e.g. via engine_getBlobs()) can act as a publisher right away
- sampling is complete after receiving S messages from any peers

The above algorithm relies on a number of [yet fuzzy] statements which still need to be proved:
- The pseudo-randomly sampled vectors collectively span the whole N-dimensional space: there is a pretty small E (around 1-3) such that the rank of the basis vectors of any randomly selected S + E nodes would be N with some 'very high' probability. The same should be true in a Byzantine environment.