# New Data Availability Sampling spec draft

## _Note_

This is a draft version of the refactored Sharding DAS specification: https://github.com/ethereum/eth2.0-specs/blob/dev/specs/das/das-core.md

The goals of this refactoring:
- Fill the gaps in the original version
- Allow arbitrary length data to be sampled/reconstructed/proven (not just power-of-2 lengths as in the original version)
- Generify the spec by relying on fundamental concepts (like polynomials and pairings) instead of specific optimized functions like FFT, FK20, KZG lib functions
- (special case of the above point) Get rid of the numerous point permutations in the spec, as they are only applicable to optimized algorithms' inputs/outputs. Those permutations were the major source of confusion for me while understanding the spec core.
- Preserve 100% compatibility with the original version, i.e. functions could be substituted with their optimized variants in a compatible way
- Potentially use any arbitrary (even fractional) extension factor (hardcoded to 2x in the original spec)

_Disclaimer_: please consider the Python code below as 'pseudo-code': it is not 100% correct and needs to be revised later

Implementation of this spec: https://github.com/Nashatyrev/artemis/tree/8a5cfe60978c82edbe237ffeca5edeaf51f9dfb2/ethereum/spec/src/main/java/tech/pegasys/teku/spec/logic/common/das/baseline

## Eth2 Data Availability Sampling

All operations related to [Erasure coding](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction) and [Kate commitments](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html) are based on polynomials over the field $\Bbb{F}_R$ (the [finite field](https://en.wikipedia.org/wiki/Finite_field) of integers modulo `MODULUS`):

$$P(x) = c_0 + c_1x + c_2x^2 + ... + c_Nx^N$$

All polynomial elements $c_i$, $x$ and $P(x)$ are from $\Bbb{F}_R$. In the spec the `Point` class represents an $\Bbb{F}_R$ element.

Let's choose and fix $M$ distinct arbitrary $x$ values: $[x_0, x_1, ... , x_M]$.
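As a minimal toy illustration of this arithmetic (everything here is an assumption for readability: the tiny prime `97` stands in for the real `MODULUS`, and `eval_poly` is a hypothetical helper, not part of the spec), a polynomial can be evaluated at some fixed $x$ values with plain modular arithmetic:

```python
# Toy illustration only: a tiny prime stands in for the real MODULUS
MODULUS = 97

def eval_poly(coeffs: list, x: int) -> int:
    """Evaluate P(x) = c0 + c1*x + ... + cN*x^N in F_MODULUS (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % MODULUS
    return result

coeffs = [3, 0, 2]   # P(x) = 3 + 2x^2
xs = [1, 5, 7]       # some fixed, distinct evaluation points
evals = [eval_poly(coeffs, x) for x in xs]  # [5, 53, 4]
```

Horner's rule here is just an evaluation convenience; the spec itself only assumes that `Point` arithmetic is done modulo `MODULUS`.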
For now let's assume these $x$ values are arbitrary; a smarter set will be chosen [later](#Data-evaluation-points). This set of values is fixed for all data ever to be encoded, thus it should be large enough and contain at least `MAX_EXT_POINTS` elements (the maximum number of extended data points).

Given $N$ datapoints $[a_0, a_1, ... , a_N]$, let's find the coefficients $[p_0, p_1, ... , p_N]$ such that the polynomial $P(x)$ evaluations would be equal to our datapoints $[a_0, a_1, ... , a_N]$ at the $x$ points $[x_0, x_1, ... , x_N]$. I.e. $P(x_0) = a_0, P(x_1) = a_1, ... P(x_N) = a_N$, or $P(x_i) = a_i$ for short. See the `poly_from_evals()` function.

Thus we get the specific polynomial $P_a(x) = p_0 + p_1x + p_2x^2 + ... + p_Nx^N$ corresponding to our original dataset $[a_0, a_1, ... , a_N]$

## More constants

```python
# max points in extended data
MAX_EXT_POINTS = MAX_SAMPLES_PER_BLOCK * POINTS_PER_SAMPLE
# max data polynomial degree
MAX_DEGREE = MAX_EXT_POINTS // 2 - 1
# Unity (element 'one') in G2
G2_GENERATOR = G2_SETUP[0]
# Zero element in Fr
POINT_ZERO = Point(0)
# Unity element in Fr
POINT_ONE = Point(1)
```

## Polynomial operations

The `Poly` class represents a polynomial, which is essentially just a vector of its coefficients $[c_0, c_1,...,c_N]$ plus a set of standard operations.

```python
class Poly:
    # polynomial coefficients (any of them can be 0)
    coeff: List[Point]

    def __add__(self, other: Poly) -> Poly: ...
    def __sub__(self, other: Poly) -> Poly: ...
    # multiply by another polynomial or by a scalar coefficient
    def __mul__(self, other: Union[Poly, Point]) -> Poly: ...
    # divide by another polynomial or by a scalar coefficient
    def __div__(self, other: Union[Poly, Point]) -> Poly: ...

    # max degree of x with a non-zero coefficient
    def degree(self) -> int: ...

    # evaluate the polynomial at point x
    def eval(self, x: Point) -> Point: ...

# calculates the product of several polynomials
def prod(polys: List[Poly]) -> Poly:
    return reduce(lambda p1, p2: p1 * p2, polys)
```

`eval_in_G1()` calculates the evaluation of the polynomial $P(x)$ (the `poly` parameter) at the secret point $s$, mapped to $\Bbb{G}_1$: $G\cdot P(s)$.
But since $s$ is not known to anyone, we can't evaluate the polynomial directly. However, we have `G1_SETUP` with the powers of $s$ mapped to $\Bbb{G}_1$: $[s^0]_1, [s^1]_1, [s^2]_1...[s^N]_1$. Thus we may evaluate the polynomial directly in $\Bbb{G}_1$:

$$ \texttt{eval_in_G1}(P(x)) = p_0[s^0]_1 + p_1[s^1]_1 + p_2[s^2]_1 +...+ p_N[s^N]_1$$

The same applies to `eval_in_G2()` for the group $\Bbb{G}_2$ and the predefined setup points `G2_SETUP`

```python
def eval_in_G1(poly: Poly) -> G1:
    return sum([G1_SETUP[i] * poly.coeff[i] for i in range(0, poly.degree() + 1)])

def eval_in_G2(poly: Poly) -> G2:
    return sum([G2_SETUP[i] * poly.coeff[i] for i in range(0, poly.degree() + 1)])
```

The `poly_from_evals()` function derives a polynomial from its evaluations `ys` at points `xs`. The baseline method for this calculation is the [Lagrange polynomial](https://en.wikipedia.org/wiki/Lagrange_polynomial), which is implemented below. The complexity of the reference implementation is $O(n^2)$, which is not perfect. Luckily, when the $x$ points selection is made smartly, like [below](#Data-evaluation-points), a faster inverse FFT ([Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform)) algorithm could be used.

```python
def poly_from_evals(xs: List[Point], ys: List[Point]) -> Poly:
    assert length(xs) == length(ys)

    def lagrange_basis(j: int) -> Poly:
        return prod([
            Poly([-xm, POINT_ONE]) / (xs[j] - xm)
            for xm in xs[:j] + xs[j+1:]
        ])

    return sum([ys[j] * lagrange_basis(j) for j in range(0, length(xs))])

# Calculates the polynomial (x - x0)*(x - x1)*...*(x - xN)
def zero_poly(xs: List[Point]) -> Poly:
    return prod([Poly([-x, POINT_ONE]) for x in xs])
```

## Kate commitments

### Kate polynomial commitment

A _polynomial commitment_ could be thought of as a compact polynomial fingerprint, or hash. A _Kate_ polynomial commitment is an element of the group $\Bbb{G}_1$, and is essentially the evaluation of $P_a(x)$ at the secret point $s$, mapped to $\Bbb{G}_1$: $G\cdot P_a(s)$.
But since $s$ is not known to anyone, we can't evaluate the polynomial directly. However, we have `G1_SETUP` with the powers of $s$ mapped to $\Bbb{G}_1$: $[s^0]_1, [s^1]_1, [s^2]_1...[s^N]_1$. Thus we may evaluate the polynomial directly in $\Bbb{G}_1$:

$$ Commitment_a = p_0[s^0]_1 + p_1[s^1]_1 + p_2[s^2]_1 +...+ p_N[s^N]_1$$

### Kate multi-proof

A multi-proof is an element of the group $\Bbb{G}_1$ which proves that the polynomial $P_a(x)$ evaluates to the values $[y_0, y_1, ... , y_K]$ at the points $[z_0, z_1, ... , z_K]$ (it proves this to a verifier who doesn't know the polynomial but has the polynomial's $Commitment_a$ constructed above). I.e. $P_a(z_i) = y_i$ for $i = 0,...,K$

### Functions

`commit_to_poly()` calculates a commitment for a polynomial represented by its evaluation points `xs` and `ys`. Read [here](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html#kate-commitment) for more details.

```python
def commit_to_poly(xs: List[Point], data_ys: List[Point]) -> BLSCommitment:
    poly = poly_from_evals(xs, data_ys)
    return BLSCommitment(eval_in_G1(poly))
```

Please refer to [this chapter](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html#multiproofs) for more details on Kate multiple point proofs

```python
# Creates a proof that the polynomial evaluates to
# values ys[0], ys[1],... at points xs[0], xs[1],...
def construct_proof(poly: Poly, xs: List[Point], ys: List[Point]) -> BLSCommitment:
    i_x = poly_from_evals(xs, ys)
    z_x = zero_poly(xs)
    proof_poly = (poly - i_x) / z_x
    return BLSCommitment(eval_in_G1(proof_poly))

# Verifies the `proof` that a polynomial with `commitment` indeed evaluates to
# values ys[0], ys[1],... at points xs[0], xs[1],...
def check_proof(commitment: BLSCommitment, proof: BLSCommitment,
                xs: List[Point], ys: List[Point]) -> bool:
    i_x = poly_from_evals(xs, ys)
    z_x = zero_poly(xs)
    return (
        bls.Pairing(proof, eval_in_G2(z_x)) ==
        bls.Pairing(commitment - eval_in_G1(i_x), G2_GENERATOR)
    )
```

`commit_to_max_degree()` calculates a proof that the polynomial `poly` has degree not above `poly.degree()` (the claimed maximum degree, which the verifier receives as `max_degree`).

The degree proof works as follows. For a polynomial of degree $N$: $B(x)=c_0+c_1x+...+c_Nx^N$, the degree proof is the commitment to the polynomial $q(x)=B(x) \cdot x^{(MAX\_DEGREE - N)}$, where `MAX_DEGREE` is the maximum polynomial power available in the setup. The goal is to ensure that a proof can only be constructed if $deg(B) \le N$ (there are no hidden higher-order terms in the polynomial, which would thwart reconstruction).

```python
def commit_to_max_degree(poly: Poly) -> BLSCommitment:
    add_degree = MAX_DEGREE - poly.degree()
    add_poly = Poly([POINT_ZERO] * add_degree + [POINT_ONE])
    proof_poly = poly * add_poly
    return BLSCommitment(eval_in_G1(proof_poly))

def check_max_degree(commitment: BLSCommitment, proof: BLSCommitment, max_degree: int) -> bool:
    add_degree = MAX_DEGREE - max_degree
    add_poly = Poly([POINT_ZERO] * add_degree + [POINT_ONE])
    return (
        bls.Pairing(proof, G2_GENERATOR) ==
        bls.Pairing(commitment, eval_in_G2(add_poly))
    )
```

## Data evaluation points

Since we are free to choose any $x_0, x_1, ... , x_N$, let's do it smartly to help optimized algorithms (Fast Fourier Transform, FK20) do their job.
Thus let's choose and fix the following values:

$$ x_i = \omega^{rev(i)}$$

where `rev(i) = reverse_bit_order(i)` (with the bit width derived from `MAX_EXT_POINTS`), and $\omega$ is the root of unity (of order `MAX_EXT_POINTS`) defined by the constant `ROOT_OF_UNITY`

```python
def reverse_bit_order(n: int) -> int:
    """
    Reverse the bit order of an integer n
    """
    assert n < MAX_EXT_POINTS
    return int(('{:0' + str(MAX_EXT_POINTS.bit_length() - 1) + 'b}').format(n)[::-1], 2)

def get_x(point_index: int) -> Point:
    return ROOT_OF_UNITY ** reverse_bit_order(point_index)

def get_xs(point_start_index: int, point_end_index: int) -> List[Point]:
    return [get_x(index) for index in range(point_start_index, point_end_index)]
```

## Data Availability Sampling

### Erasure coding

The extension of the original data, $[a_{N+1}, a_{N+2}, ... , a_{2N}]$, is simply another $N$ evaluations of the polynomial $P_a(x)$ at the points $[x_{N+1}, x_{N+2}, ... , x_{2N}]$, i.e. $a_i = P_a(x_i)$ for $i = (N+1),...,2N$

Thus the full extended data is 2x bigger than the original and consists of the values $[a_0, a_1, ... , a_{2N}]$. The polynomial $P_a(x)$ can be reconstructed from just any $N$ values of this set, and thus the whole extended data set can be recovered from just any $N$ values.

### Functions

`eval_multi()` evaluates the polynomial `poly` at a number of consecutive data points.
This reference implementation is suboptimal (it has quadratic complexity); it's worth considering the faster FFT ([Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform)) algorithm. The other methods could also be implemented more efficiently.

```python
# Evaluate the polynomial at consecutive x points
# An optimized version would use FFT
def eval_multi(poly: Poly, start_point_index: int, end_point_index: int) -> List[Point]:
    return [poly.eval(x) for x in get_xs(start_point_index, end_point_index)]

# Returns the extended data (2x points) for the original unextended data
def extend_data(data: List[Point]) -> List[Point]:
    poly = poly_from_evals(get_xs(0, length(data)), data)
    return eval_multi(poly, 0, 2 * length(data))

# Returns the original data from the extended data
def unextend_data(extended_data: List[Point]) -> List[Point]:
    return extended_data[:length(extended_data) // 2]

# Taking any N points (or more) out of the 2N extended data points,
# recovers and returns all 2N points of the extended data
def recover_data(some_ext_data: List[Optional[Point]]) -> List[Point]:
    xs = [get_x(i) for i, opt in enumerate(some_ext_data) if opt is not None]
    ys = [y for y in some_ext_data if y is not None]
    poly = poly_from_evals(xs, ys)
    return eval_multi(poly, 0, length(some_ext_data))
```

### Sampled Kate proofs

```python
# Returns the commitment for the full dataset
def commit_to_data(data: List[Point]) -> BLSCommitment:
    return commit_to_poly(get_xs(0, length(data)), data)
```

`construct_proofs()` creates proofs for all samples (data chunks of length `POINTS_PER_SAMPLE`) in a batch way. This reference implementation has suboptimal complexity $O(n^2)$.
Assuming the specific $x$ [points selection](#Data-evaluation-points), the better optimized [FK20 (TODO link)]() algorithm can be used here

```python
def construct_proofs(extended_data: List[Point]) -> List[BLSCommitment]:
    xs = get_xs(0, length(extended_data))
    poly = poly_from_evals(xs, extended_data)
    assert poly.degree() < length(extended_data) // 2
    sample_count = length(extended_data) // POINTS_PER_SAMPLE
    return [
        construct_proof(
            poly,
            xs[i * POINTS_PER_SAMPLE : (i + 1) * POINTS_PER_SAMPLE],
            extended_data[i * POINTS_PER_SAMPLE : (i + 1) * POINTS_PER_SAMPLE]
        )
        for i in range(0, sample_count)
    ]
```
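The extend-then-recover round trip described above can be checked end to end with a toy field. Everything below is an illustrative assumption, not part of the spec: the prime `337` stands in for the real `MODULUS` (chosen because it admits an 8th root of unity, `85`), `lagrange_interp` is a hypothetical helper playing the role of `poly_from_evals` plus evaluation, and the bit-reversed point ordering is skipped for brevity:

```python
# Toy sketch only: small prime field, no bit-reversal, plain Lagrange interpolation
MODULUS = 337        # prime; 337 - 1 is divisible by 8, so 8th roots of unity exist
ROOT_OF_UNITY = 85   # 85 has multiplicative order 8 mod 337

def inv(a: int) -> int:
    # modular inverse via Fermat's little theorem
    return pow(a, MODULUS - 2, MODULUS)

def lagrange_interp(xs: list, ys: list, x: int) -> int:
    """Evaluate at x the unique polynomial passing through (xs[i], ys[i]) mod MODULUS."""
    total = 0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for m, xm in enumerate(xs):
            if m != j:
                num = num * (x - xm) % MODULUS
                den = den * (xj - xm) % MODULUS
        total = (total + yj * num * inv(den)) % MODULUS
    return total

xs_all = [pow(ROOT_OF_UNITY, i, MODULUS) for i in range(8)]  # 8 distinct points
data = [10, 20, 30, 40]                                      # original data points

# Extend: interpolate through the 4 original points, evaluate at all 8
extended = [lagrange_interp(xs_all[:4], data, x) for x in xs_all]
assert extended[:4] == data   # the extension preserves the original points

# Recover: any 4 surviving points reconstruct the full extended set
survivors = [1, 3, 4, 6]
recovered = [lagrange_interp([xs_all[i] for i in survivors],
                             [extended[i] for i in survivors], x) for x in xs_all]
assert recovered == extended
```

Replacing `lagrange_interp` with an inverse FFT over the same roots of unity is exactly the optimization path the spec alludes to for `poly_from_evals()` and `eval_multi()`.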