
Alternative DAS concept based on RLNC

Abstract

The present DAS approach for Ethereum is based on Reed-Solomon erasure coding + KZG commitments. The novel scheme (described in Faster block/blob propagation in Ethereum by @potuz) introduces another pair: RLNC (Random Linear Network Coding) erasure coding + Pedersen commitments. It has nice properties which could be used to develop an alternative Data Availability approach, with its own pros and cons.

In this write-up, we'll describe the concepts of this approach: its benefits and its challenges. The reference algorithm does not pretend to be a final solution; it may simply highlight potential directions for investigating alternative Data Availability approaches.

We'll begin with the simplest ideas and then iteratively address their limitations. The following reasoning may be a bit unstructured or messy, but it can help clarify the implementation's underlying ideas. Feel free to skip it and go straight to the final algorithm description.

Intro

Much like the write-up Faster block/blob propagation in Ethereum, the original data is split into N vectors $v_1, \dots, v_N$ over the field $\mathbb{F}_p$. These vectors can be coded using RLNC and committed via a Pedersen commitment (of size $32 \times N$ bytes) included in the block. Any linear combination can be verified given the commitment and its associated coefficients.

We also assume that the coefficients may come from a much smaller subfield $C_q$, whose size could be as small as a single byte, or even a single bit.

Let's also introduce the sharding factor S: any regular node needs to download and custody 1/S of the original data size.

Idea #1

Let's assume we have enough 'super-nodes' (receiving and storing all N chunks) to serve all other [regular] nodes.

A regular node then performs sampling and custody by doing the following:

  • Choose locally N random coefficients $A = [a_1, \dots, a_N]$
  • Request from a super-node a vector which is the linear combination of all original data vectors: $w_A = a_1 v_1 + \dots + a_N v_N$
  • Validate the received vector $w_A$ against the Pedersen commitment from the corresponding block and its random coefficients

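For illustration, here is a minimal sketch of this check. It uses a toy multiplicative group modulo a prime in place of a real elliptic-curve group and toy parameters; production code would use something like Ristretto points and reduce scalars modulo the proper field and group orders.

import random

P = 2**127 - 1                      # toy prime modulus, NOT a secure group choice
VECTOR_LEN, N = 4, 4                # tiny illustrative sizes
GENS = [pow(7, k + 1, P) for k in range(VECTOR_LEN)]   # one generator per element

def commit(vector):
    # Pedersen-style vector commitment: product of GENS[k] ^ vector[k] (mod P)
    c = 1
    for g, x in zip(GENS, vector):
        c = c * pow(g, x, P) % P
    return c

# Publisher: original vectors v_1..v_N and their commitments (placed into the block)
originals = [[random.randrange(1000) for _ in range(VECTOR_LEN)] for _ in range(N)]
commitments = [commit(v) for v in originals]

# Sampling node: local random coefficients A = [a_1, ..., a_N]
A = [random.randrange(256) for _ in range(N)]

# Super-node: responds with w_A = a_1*v_1 + ... + a_N*v_N
w_A = [sum(a * v[k] for a, v in zip(A, originals)) for k in range(VECTOR_LEN)]

# Sampling node: verifies w_A without knowing the originals, using the homomorphism
# commit(a_1*v_1 + ... + a_N*v_N) == C_1^a_1 * ... * C_N^a_N
expected = 1
for a, c in zip(A, commitments):
    expected = expected * pow(c, a, P) % P
assert commit(w_A) == expected
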
The major takeaways from the above process:

  • The sampling node is [almost] sure that the responding super-node has all the data (Sampling side of the process)
  • The node obtains and stores a vector which is almost surely not collinear with any other honest node's custody vector (the custody side of the process)
    • (derivative statement) Any set of N honest nodes is sufficient to recover the original data

If all super-nodes are malicious, the maximum harm they can do is to deceive N - 1 nodes, making them treat unavailable data as available.

Here we have S == N, and any regular node downloads and custodies only 1/S of the original data size.

Challenge 1: the network is required to have enough super-nodes to serve regular nodes. The network is not homogeneous: there are 'servers' and 'clients', and no p2p data dissemination.

Idea #2

As an intermediary step, instead of super-nodes let's now have a set of specific nodes (let's call them 'seed nodes') which custody their own subsets of the original vectors. Let's say every such node stores just K (K < N) original vectors:

$[v_1, \dots, v_K]$, $[v_{K+1}, \dots, v_{2K}]$, etc.

A regular node now needs to query at least N/K seed nodes to make sure the whole data is collectively possessed by them.

After sampling, the node may aggregate the received vectors into a single one (just a linear combination with random coefficients) and store it in its custody.

We still have S == N so far

Challenge 2: we still have different kinds of nodes in the network
Challenge 3: while a regular node may custody just a single vector, it now needs to download K/N of the original data size instead of 1/N

Idea #3

Let's address Challenge 3 first: for this, let's increase the number of original chunks. Let S remain as before and now let N = S * K. So a regular node again needs to download just 1/S of the original data.

Idea #4

Let's now try to address Challenge 2.

Now let the 'seed' nodes store K random linear combinations of the original vectors ($w_1, \dots, w_K$) instead of K original vectors:

$w_1 = a_{1,1} v_1 + \dots + a_{1,N} v_N$
$\vdots$
$w_K = a_{K,1} v_1 + \dots + a_{K,N} v_N$

And let a sampling node request a linear combination of the $w_i$ possessed by its 'seed' peer, with local random coefficients $C = [c_1, \dots, c_K]$:

$w_C = c_1 w_1 + \dots + c_K w_K$

The seed peer would need to respond with $w_C$ together with the coefficients $a_{i,j}$.

After a node requests K samples it becomes a 'seed' node itself and may serve other nodes' requests. So here we are getting a canonical p2p network structure with all the nodes playing the same client/server role.

However there are new challenges:

Challenge 4: the seed peer may have fewer vectors than declared (it may even have just a single vector) and maliciously craft the coefficients $a_{i,j}$

Example of attack:
Suppose we have $\mathbb{F}_{256}$, $C_{16}$, data vectors of just a single element, and S = 2 (thus N = 4).

Suppose the original vectors are as follows (each data vector has just a single $\mathbb{F}_{256}$ element):

$(v_1, \dots, v_4) = ([32], [55], [1], [255])$

An honest node should have 2 linear combinations, but suppose a malicious node has just a single linear combination with coefficients

$a_1 = [a_{1,1}, \dots, a_{1,4}] = [1, 3, 7, 15]$

thus

$w_1 = [32] \cdot 1 + [55] \cdot 3 + [1] \cdot 7 + [255] \cdot 15 = [189]$

while $w_2$ is missing.

Suppose a sampling node requests the combination with the following coefficients $C$:

$C = [c_1, c_2] = [15, 3]$
$w_C = 15 w_1 + 3 w_2$

The sampling node doesn't yet know which linear combinations the responder has ($w_1$ and $w_2$); it expects to find that out from the response.

The malicious node responds with the only vector it has, $w_C = w_1 = [189]$, but claims that this is the linear combination of vectors with the following coefficients:

$a'_1 = [1, 1, 1, 1] / 15 = [15, 15, 15, 15]$
$a'_2 = [0, 2, 6, 14] / 3 = [0, 6, 2, 10]$

$15 a'_1 + 3 a'_2 = [1, 3, 7, 15] = a_1$

$w_C$ is legit and can be verified against $15 a'_1 + 3 a'_2$, and $a'_1$, $a'_2$ are linearly independent, so the response looks correct.
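
To make the forging trick concrete, here is a small sketch of it. It works over a toy prime field instead of $\mathbb{F}_{256}$/$C_{16}$ (so the concrete field arithmetic differs from the numbers above), but the linear-algebraic idea is the same.

q = 257                           # toy prime field standing in for the real fields

a1 = [1, 3, 7, 15]                # coefficients of the only vector the attacker has
v = [32, 55, 1, 255]              # original single-element data vectors
w1 = sum(a * x for a, x in zip(a1, v)) % q   # the only combination the attacker holds

c1, c2 = 15, 3                    # the sampling node's random request C = [c1, c2]
inv = lambda x: pow(x, q - 2, q)  # field inverse

# Attacker: split a1 into u + r and claim the (independent) basis (u/c1, r/c2)
u = [1, 1, 1, 1]
r = [(x - y) % q for x, y in zip(a1, u)]
a1_forged = [x * inv(c1) % q for x in u]
a2_forged = [x * inv(c2) % q for x in r]

# The claimed combination c1*a1_forged + c2*a2_forged collapses back to the true a1,
# so the single vector w1 verifies as if it were built from two custody vectors
claimed = [(c1 * x + c2 * y) % q for x, y in zip(a1_forged, a2_forged)]
assert claimed == a1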

Idea #5a

Let's try to address Challenge 4.

Split the request into 2 interactions:

  1. request the $a_{i,j}$ coefficients of the existing vectors
  2. request a linear combination of these vectors with local random coefficients $c_i$

As the existing vectors are now fixed after request (1), the seed peer has no more possibilities to cheat and would need to possess the declared vectors (or another basis for the same subspace) to calculate $w_C$ and respond to request (2).

Example:

Continuing with the example from the previous section: suppose the sampling node requests the existing $a_i$ first, and the malicious node responds with the same $[a'_1, a'_2]$ coefficients as above.

In this case the responder is able to provide a correct answer only if the requester asks for $C = [15, 3] \cdot k$ for some $k$. Otherwise it wouldn't be able to respond correctly.

Idea #5b

The above approach solves Challenge 4 but adds another interaction step, which may significantly matter during time-critical data dissemination.

Basically, the K vectors $a_i = [a_{i,1}, \dots, a_{i,N}]$ describe a K-dimensional subspace of the N-dimensional space (our node needs to sample just a single randomly selected vector from this subspace). There are almost infinitely many basis vector sets describing the same subspace, but what if there were just a single canonical basis describing that subspace?

I believe the Reduced Row Echelon Form (RREF) $\bar{a}$ of the matrix $a$ could probably serve as a canonical basis for the mentioned subspace:

$a = [a_1, \dots, a_K]$
$\bar{a} = \mathrm{rref}(a)$
$\bar{a} = [\bar{a}_1, \dots, \bar{a}_K]$
$\bar{a}_i = [\bar{a}_{i,1}, \dots, \bar{a}_{i,N}]$

The seed node may then derive the following data vectors:

$\bar{w}_i = \bar{a}_{i,1} v_1 + \dots + \bar{a}_{i,N} v_N$

And respond to the request with the following linear combination:

$w_C = c_1 \bar{w}_1 + \dots + c_K \bar{w}_K$

As the canonical basis is implicitly unique and fixed, this should make the additional preliminary interaction for fixing the existing coefficients obsolete.

Example:

Continuing with the example from the previous section:

The requesting node now has an additional requirement on the coefficient vectors $a_1$ and $a_2$: the matrix formed by the rows $a_1$ and $a_2$ should be in RREF.

The trick is that while an attacker may craft any number of pairs $(a'_1, a'_2)$ satisfying $15 a'_1 + 3 a'_2 = a_1$, with very high probability it can't craft such a pair that also forms an RREF matrix.

DISCLAIMER: The last statement is quite intuitive and needs a better justification.
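
For illustration, here is a small sketch of the canonical-basis property, again over a toy prime field in place of the small coefficient subfield: two different bases of the same subspace reduce to the same RREF matrix, which is what lets the requester pin down the responder's subspace without an extra round-trip.

q = 257                                     # toy prime "coefficient field"

def rref(matrix):
    # Reduced Row Echelon Form over GF(q) via Gauss-Jordan elimination
    m = [row[:] for row in matrix]
    rows, cols = len(m), len(m[0])
    lead = 0
    for r in range(rows):
        if lead >= cols:
            return m
        i = r
        while m[i][lead] == 0:
            i += 1
            if i == rows:
                i, lead = r, lead + 1
                if lead == cols:
                    return m
        m[i], m[r] = m[r], m[i]
        inv = pow(m[r][lead], q - 2, q)      # normalize the pivot to 1
        m[r] = [x * inv % q for x in m[r]]
        for j in range(rows):                # eliminate the pivot column elsewhere
            if j != r and m[j][lead] != 0:
                m[j] = [(a - m[j][lead] * b) % q for a, b in zip(m[j], m[r])]
        lead += 1
    return m

# Two different bases of the same 2-dimensional subspace of GF(q)^4 ...
basis_a = [[1, 3, 7, 15], [0, 2, 6, 14]]
basis_b = [[1, 5, 13, 29], [2, 8, 20, 44]]   # rows are a_1 + a_2 and 2*a_1 + a_2
# ... reduce to the same canonical matrix
assert rref(basis_a) == rref(basis_b)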

Idea #6

We could probably remove the remaining interaction without sacrificing security. It's likely fine to have static public random coefficients $c_i$ which could be derived, for example, from the node's nodeID. In this case a seed node may compose and propagate a sample vector to its peers as soon as it has received enough vectors of its own.

Idea #7

So far we have:

  • S - sharding factor
  • N - total number of original vectors
  • K = N / S - the number of vectors a node needs to download and custody

We want to minimize N to reduce the commitment and proof overheads.
We want to maximize S potentially to have a better sharding effect

Thus we want to minimize K.

However, every regular node stores just 1/S of the data, so to sample the whole N-dimensional space a node needs to download at least S vectors (at least one from each of S distinct peers).

Thus K needs to be at least as large as S, and since we want to minimize it, it should be equal to S:

S == K
N = S ^ 2 == K ^ 2

Idea #8

Assume the subfield $C_q$ is pretty small, e.g. its elements fit in just 1 byte. Then with S = 32 a sampling message should contain 32 * 1024 coefficients, i.e. 32 KBytes, which is the same size as a data vector itself. That sounds like a significant overhead.

But what if we make the subfield $C_q$ even smaller: just 1 bit. A vector sampled from 32 original vectors still has only a small chance of being collinear with another random vector sampled from the same 32 original vectors: 1 / 2^32. This is not negligible, but rare collisions are acceptable here. With 1-bit coefficients we would need just 4 KBytes to pass the original coefficients. That looks like a good saving.

The final algorithm

Brief overview

A node receives pseudo-random linear combination vectors from its peers.
Every vector is accompanied by the basis vectors of the subspace it was randomly sampled from. The subspace basis vectors provably exist in the peer's custody.
The node samples until:

  • all received basis vectors together span the full N-dimensional space
  • it has received at least S vectors for its custody

After that the node completes sampling and becomes an 'active' node which can be used for sampling by other nodes.

[Diagram: DA-RLNC.drawio]

Note

The Python-like code snippets below are for illustration purposes only; they are not intended to be compiled and executed for now. Some obvious or library-like functions lack implementations so far.

Types

Let data vector elements be from a finite field $\mathbb{F}_p$ whose elements can be mapped to 32 bytes (exactly 32 for simplicity; in fact it's a bit less):

def FScalar

Let coefficients be from a significantly smaller subfield $C_q$; they can be encoded with COEF_SIZE bytes (which could be fractional):

def CScalar

Let a single data vector have VECTOR_LEN elements from $\mathbb{F}_p$:

def DataVector = Vector[FScalar, VECTOR_LEN]

The sharding factor S means that every regular node receives, sends and stores just 1/S fraction of the whole data.

Let N = S * S and let the whole data be split into N original DataVectors:

def OriginalData = Vector[DataVector, N]

A derived DataVector is a linear combination of original DataVectors. Since an original vector is just a special case of a derived vector, we will no longer make any distinction between them.

To validate a DataVector against the commitment, its linear combination coefficients should be known:

def ProofVector = Vector[CScalar, N]

There is also a slightly different sequence of coefficients which is used when deriving a new vector from existing DataVectors:

def CombineVector = List[CScalar]

Structures

A data vector accompanied by its linear combination coefficients:

class DataAndProof:
    data_vector: DataVector
    proof: ProofVector

The only message for dissemination is

class DaMessage:
    data_vector: DataVector
    # RREF coefficients (proofs) of the sender's custody vectors;
    # the ProofVector for validation is derived from these
    orig_coefficients: List[ProofVector]
    seq_no: Int

There is an ephemeral store for a block's data:

class BlockDaStore:
    custody: List[DataAndProof, N]
    sample_matrix: List[ProofVector]
    
    # commitment is initialized first from the corresponding block
    commitment: Vector[RistrettoPoint]

Functions

Let the function random_coef generate a deterministically random coefficient vector from the seed (node_id, seq_no):

def random_coef(node_id: UInt256, length: int, seq_no: int) -> CombineVector
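
A possible sketch of such a function, assuming (purely for illustration) SHA-256-based expansion and 1-byte coefficients, and that CScalar can be constructed from an int:

import hashlib

def random_coef(node_id: int, length: int, seq_no: int) -> CombineVector:
    # Deterministically expand (node_id, seq_no) into `length` 1-byte coefficients
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(
            node_id.to_bytes(32, 'big')
            + seq_no.to_bytes(4, 'big')
            + counter.to_bytes(4, 'big')
        ).digest()
        counter += 1
    return [CScalar(b) for b in out[:length]]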

Let's define an abstract utility function to compute a linear combination of any vectors:

def linear_combination(
        vectors: Sequence[AnyVector], 
        coefficients: CombineVector) -> AnyVector
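
A possible generic implementation, assuming scalar multiplication and addition are defined for the vector types involved:

def linear_combination(vectors, coefficients):
    # Sum of coefficient-scaled vectors: c_1*u_1 + c_2*u_2 + ...
    assert len(vectors) == len(coefficients) and len(vectors) > 0
    result = vectors[0] * coefficients[0]
    for vec, coef in zip(vectors[1:], coefficients[1:]):
        result = result + vec * coef
    return result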

Functions which perform linear operations on DataVectors and ProofVectors simultaneously to transform the ProofVectors to RREF, and check whether a set of proofs is in RREF:

def to_rref(data_vectors: Sequence[DataAndProof]) -> Sequence[DataAndProof]
def is_rref(proofs: Sequence[ProofVector]) -> bool

A function creating a new deterministically random message from the existing custody vectors for a specific peer:

def create_da_message(
        da_store: BlockDaStore,
        receiver_node_id: uint256,
        seq_no: int) -> DaMessage:
    rref_custody_data = to_rref(da_store.custody)
    rnd_coefs = random_coef(receiver_node_id, len(rref_custody_data), seq_no)
    data_vector = linear_combination(
        [e.data_vector for e in rref_custody_data], rnd_coefs)
    custody_proofs = [e.proof for e in rref_custody_data]
    return DaMessage(data_vector, custody_proofs, seq_no)

Computes the rank of the data vectors' coefficients. A rank equal to N means that the original data can be restored from the data vectors:

def rank(proof_matrix: Sequence[ProofVector]) -> int
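
A possible sketch for the 1-bit coefficient case (GF(2)), where each proof vector is treated as a row of 0/1 values and the rank is computed by Gaussian elimination:

def rank(proof_matrix) -> int:
    # Gaussian elimination over GF(2); rows are lists of 0/1 coefficients
    rows = [[int(c) & 1 for c in row] for row in proof_matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r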

Publish

The publisher should split the data and get OriginalData.

Before propagating the data itself, the publisher should calculate the Pedersen commitments (of size 32 * N bytes), then build and propagate the block with these commitments.

Then it should populate the custody with the (original) data vectors and their proofs, which are simply standard basis vectors: [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], etc.:

def init_da_store(data: OriginalData, da_store: BlockDaStore):
    for i, v in enumerate(data):
        e_proof = ProofVector(
            [CScalar(1) if index == i else CScalar(0) for index in range(N)]
        )
        da_store.custody += DataAndProof(v, e_proof)

The publishing process is basically the same as the propagation process, except that the publisher sends S messages to a single peer in a single round instead of sending just 1 message as during propagation.

def publish(da_store: BlockDaStore, mesh: Sequence[Peer]):
    # The Peer type is left undefined here
    for peer in mesh:
        for seq_no in range(S):
            peer.send(
                create_da_message(da_store, peer.node_id, seq_no)
            )

Receive

Assume that the corresponding block has been received and BlockDaStore.commitment has been filled prior to receiving a message: DaMessage

Derive the ProofVector from the original coefficient vectors:

def derive_proof_vector(my_node_id: uint256, message: DaMessage) -> ProofVector:
    lin_c = random_coef(
        my_node_id,
        len(message.orig_coefficients),
        message.seq_no)
    return linear_combination(message.orig_coefficients, lin_c)
Validate

The message is validated first:

def validate_message(message: DaMessage, da_store: BlockDaStore):
    # Verify the original coefficients are in Reduced Row Echelon Form
    assert is_rref(message.orig_coefficients)
    # Verify that the source vectors are linearly independent
    assert rank(message.orig_coefficients) == len(message.orig_coefficients)
    # Validate Pedersen commitment
    proof = derive_proof_vector(my_node_id, message)
    validate_pedersen(message.data_vector, da_store.commitment, proof)
Store

The received DataVector is added to the custody.
The sample_matrix accumulates all the original coefficients, and as soon as the matrix rank reaches N the sampling process has succeeded:

def process_message(message: DaMessage, da_store: BlockDaStore) -> bool:
    da_store.custody += DataAndProof(
        message.data_vector,
        derive_proof_vector(my_node_id, message)
    )
    da_store.sample_matrix += message.orig_coefficients
    is_sampled = N == rank(da_store.sample_matrix)
    return is_sampled

is_sampled == true means that we are [almost] 100% sure that the peers who sent us messages collectively possess 100% of the information needed to recover the data.

Propagate

When process_message() returns True, the sampling was successful and the custody is filled with a sufficient number of vectors. From now on the node may start creating and propagating its own vectors.

We send a vector to every peer in the mesh which hasn't sent any vector to us yet:

# mesh: here the set of peers which haven't sent any vectors yet 
def propagate(da_store: BlockDaStore, mesh: Sequence[Peer]):
    for peer in mesh:
        peer.send(
            create_da_message(da_store, peer.node_id, 0)
        )

Pros and Cons

Pros

  • The total data size to be disseminated, custodied and sampled is x1 (+ coefficients overhead of about 20% for S = 32) of the original data size (compared to x2 for PeerDAS and x4 for FullDAS)
  • We may have a sharding factor as large as 32 (or probably even greater with more modifications), which means every node needs to download and custody just 1/32 of the original data size (vs. 1/8 in the present PeerDAS approach: 8 columns out of 64 original)
  • Data is available and may be recovered from any S regular honest nodes
  • The approach is 'what you sample is what you custody and serve' (like PeerDAS subnet sampling, but unlike full sharding peer sampling). The concerns regarding the peer sampling approach are described here
  • Since no column slicing is involved, separate blobs may simply span several original data vectors, and thus the sampling process may potentially benefit from the EL blob transaction pool (aka engine_getBlobs())
  • No smaller subnets which may be vulnerable to sybil attacks
  • Duplication of data at the transport level could potentially be lower, due to the fact that a node needs S messages from any of its peers
  • RLNC + Pedersen could be less resource-consuming than RS + KZG

Cons

  • Commitment and proof sizes grow quadratically with the sharding factor S.
  • Still need to estimate dissemination latency as a node may propagate data only after it completes its own sampling
  • In the happy case the minimal number of peers for sampling is 32 (compared to 8 in PeerDAS)

Statements to be proved

The above algorithm relies on a number of [yet fuzzy] statements which still need to be proved:

  1. A valid linear combination with requested random coefficients returned by a super-node proves that the responding node has access to the full data (enough vectors to recover it). This statement doesn't look that tricky to prove
  2. If a full node claims it has basis vectors spanning a coefficient subspace, and a node requests a random vector from this subspace and gets a valid data vector, then this proves the responding node indeed has the basis vectors of the claimed subspace. The proof could be a generalization of statement (1)
  3. An EC over $\mathbb{F}_p$ with coefficient subfield $C_2$ makes sense and is secure
  4. The trick with the RREF basis form for zero-RTT sampling actually works
  5. The described algorithm yields evenly distributed basis vectors across honest nodes for the full N-dimensional space. There is a pretty small E (on the order of 1-3) such that the rank of the basis vectors of any randomly selected S + E nodes would be N with 'very high' probability. The same should hold in a Byzantine environment