# New Data Availability Sampling spec draft

## _Note_

This is a draft version of the refactored Sharding DAS specification: https://github.com/ethereum/eth2.0-specs/blob/dev/specs/das/das-core.md

The goals of this refactoring:
- Fill the gaps in the original version
- Allow arbitrary length data to be sampled/reconstructed/proven (not just power-of-2 lengths as in the original version)
- Generify the spec by relying on fundamental concepts (like polynomials and pairings) instead of specific optimized functions like FFT, FK20, KZG lib functions
- (special case of the above point) Get rid of the numerous point permutations in the spec, as they are only applicable to optimized algorithms' inputs/outputs. Those permutations were the major source of confusion for me while understanding the spec core.
- Preserve 100% compatibility with the original version, i.e. functions could be substituted with their optimized variants in a compatible way
- Potentially use any arbitrary (even fractional) extension factor (hardcoded to 2x in the original spec)

_Disclaimer_: please consider the Python code below as 'pseudo-code': it is not 100% correct and needs to be revised later

Implementation of this spec: https://github.com/Nashatyrev/artemis/tree/8a5cfe60978c82edbe237ffeca5edeaf51f9dfb2/ethereum/spec/src/main/java/tech/pegasys/teku/spec/logic/common/das/baseline

## Eth2 Data Availability Sampling

All operations related to [Erasure coding](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction) and [Kate commitments](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html) are based on polynomials over the field $\Bbb{F}_R$ (the [finite field](https://en.wikipedia.org/wiki/Finite_field) of integers modulo `MODULUS`):

$$P(x) = c_0 + c_1x + c_2x^2 + ... + c_Nx^N$$

All polynomial elements $c_i$, $x$ and $P(x)$ are from $\Bbb{F}_R$. In the spec the `Point` class represents an $\Bbb{F}_R$ element.

Let's choose and fix $M$ distinct arbitrary $x$ values: $[x_0, x_1, ... , x_M]$.
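As a minimal toy illustration of this arithmetic (everything here is an assumption for readability: the tiny prime `97` stands in for the real `MODULUS`, and `eval_poly` is a hypothetical helper, not part of the spec), a polynomial can be evaluated at some fixed $x$ values with plain modular arithmetic:

```python
# Toy illustration only: a tiny prime stands in for the real MODULUS
MODULUS = 97

def eval_poly(coeffs: list, x: int) -> int:
    """Evaluate P(x) = c0 + c1*x + ... + cN*x^N in F_MODULUS (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % MODULUS
    return result

coeffs = [3, 0, 2]   # P(x) = 3 + 2x^2
xs = [1, 5, 7]       # some fixed, distinct evaluation points
evals = [eval_poly(coeffs, x) for x in xs]  # [5, 53, 4]
```

Horner's rule here is just an evaluation convenience; the spec itself only assumes that `Point` arithmetic is done modulo `MODULUS`.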
For now let's assume these $x$ values are arbitrary; a smarter set will be chosen [later](#Data-evaluation-points). This set of values is fixed for all data ever to be encoded, thus it should be large enough and contain at least `MAX_EXT_POINTS` elements (the maximum number of extended data points).

Given $N$ datapoints $[a_0, a_1, ... , a_N]$, let's find the coefficients $[p_0, p_1, ... , p_N]$ such that the polynomial $P(x)$ evaluations would be equal to our datapoints $[a_0, a_1, ... , a_N]$ at the $x$ points $[x_0, x_1, ... , x_N]$. I.e. $P(x_0) = a_0, P(x_1) = a_1, ... P(x_N) = a_N$, or $P(x_i) = a_i$ for short. See the `poly_from_evals()` function.

Thus we get the specific polynomial $P_a(x) = p_0 + p_1x + p_2x^2 + ... + p_Nx^N$ corresponding to our original dataset $[a_0, a_1, ... , a_N]$

## More constants

```python
# max points in extended data
MAX_EXT_POINTS = MAX_SAMPLES_PER_BLOCK * POINTS_PER_SAMPLE
# max data polynomial degree
MAX_DEGREE = MAX_EXT_POINTS // 2 - 1
# Unity (element 'one') in G2
G2_GENERATOR = G2_SETUP[0]
# Zero element in Fr
POINT_ZERO = Point(0)
# Unity element in Fr
POINT_ONE = Point(1)
```

## Polynomial operations

The `Poly` class represents a polynomial, which is essentially just a vector of its coefficients $[c_0, c_1,...,c_N]$ plus a set of standard operations.

```python
class Poly:
    # polynomial coefficients (any of them can be 0)
    coeff: List[Point]

    def __add__(self, other: Poly) -> Poly: ...
    def __sub__(self, other: Poly) -> Poly: ...
    # multiply by another polynomial or by a scalar coefficient
    def __mul__(self, other: Union[Poly, Point]) -> Poly: ...
    # divide by another polynomial or by a scalar coefficient
    def __div__(self, other: Union[Poly, Point]) -> Poly: ...

    # max degree of x with a non-zero coefficient
    def degree(self) -> int: ...

    # evaluate the polynomial at point x
    def eval(self, x: Point) -> Point: ...

# calculates the product of several polynomials
def prod(polys: List[Poly]) -> Poly:
    return reduce(lambda p1, p2: p1 * p2, polys)
```

`eval_in_G1()` calculates the evaluation of the polynomial $P(x)$ (the `poly` parameter) at the secret point $s$, mapped to $\Bbb{G}_1$: $G\cdot P(s)$.
But since $s$ is not known to anyone, we can't evaluate the polynomial directly. However, we have `G1_SETUP` with the powers of $s$ mapped to $\Bbb{G}_1$: $[s^0]_1, [s^1]_1, [s^2]_1...[s^N]_1$. Thus we may evaluate the polynomial directly in $\Bbb{G}_1$:

$$ \texttt{eval_in_G1}(P(x)) = p_0[s^0]_1 + p_1[s^1]_1 + p_2[s^2]_1 +...+ p_N[s^N]_1$$

The same applies to `eval_in_G2()` for the group $\Bbb{G}_2$ and the predefined setup points `G2_SETUP`

```python
def eval_in_G1(poly: Poly) -> G1:
    return sum([G1_SETUP[i] * poly.coeff[i] for i in range(0, poly.degree() + 1)])

def eval_in_G2(poly: Poly) -> G2:
    return sum([G2_SETUP[i] * poly.coeff[i] for i in range(0, poly.degree() + 1)])
```

The `poly_from_evals()` function derives a polynomial from its evaluations `ys` at points `xs`. The baseline method for this calculation is the [Lagrange polynomial](https://en.wikipedia.org/wiki/Lagrange_polynomial), which is implemented below. The complexity of the reference implementation is $O(n^2)$, which is not perfect. Luckily, when the $x$ points selection is made smartly, like [below](#Data-evaluation-points), a faster inverse FFT ([Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform)) algorithm could be used.

```python
def poly_from_evals(xs: List[Point], ys: List[Point]) -> Poly:
    assert length(xs) == length(ys)

    def lagrange_basis(j: int) -> Poly:
        return prod([
            Poly([-xm, POINT_ONE]) / (xs[j] - xm)
            for xm in xs[:j] + xs[j+1:]
        ])

    return sum([ys[j] * lagrange_basis(j) for j in range(0, length(xs))])

# Calculates the polynomial (x - x0)*(x - x1)*...*(x - xN)
def zero_poly(xs: List[Point]) -> Poly:
    return prod([Poly([-x, POINT_ONE]) for x in xs])
```

## Kate commitments

### Kate polynomial commitment

A _polynomial commitment_ could be thought of as a compact polynomial fingerprint, or hash. A _Kate_ polynomial commitment is an element of the group $\Bbb{G}_1$, and is essentially the evaluation of $P_a(x)$ at the secret point $s$, mapped to $\Bbb{G}_1$: $G\cdot P_a(s)$.
But since $s$ is not known to anyone, we can't evaluate the polynomial directly. However, we have `G1_SETUP` with the powers of $s$ mapped to $\Bbb{G}_1$: $[s^0]_1, [s^1]_1, [s^2]_1...[s^N]_1$. Thus we may evaluate the polynomial directly in $\Bbb{G}_1$:

$$ Commitment_a = p_0[s^0]_1 + p_1[s^1]_1 + p_2[s^2]_1 +...+ p_N[s^N]_1$$

### Kate multi-proof

A multi-proof is an element of the group $\Bbb{G}_1$ which proves that the polynomial $P_a(x)$ evaluates to the values $[y_0, y_1, ... , y_K]$ at the points $[z_0, z_1, ... , z_K]$ (it proves this to a verifier who doesn't know the polynomial but has the polynomial's $Commitment_a$ constructed above). I.e. $P_a(z_i) = y_i$ for $i = 0,...,K$

### Functions

`commit_to_poly()` calculates a commitment for a polynomial represented by its evaluation points `xs` and `ys`. Read [here](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html#kate-commitment) for more details.

```python
def commit_to_poly(xs: List[Point], data_ys: List[Point]) -> BLSCommitment:
    poly = poly_from_evals(xs, data_ys)
    return BLSCommitment(eval_in_G1(poly))
```

Please refer to [this chapter](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html#multiproofs) for more details on Kate multiple point proofs

```python
# Creates a proof that the polynomial evaluates to
# values ys[0], ys[1],... at points xs[0], xs[1],...
def construct_proof(poly: Poly, xs: List[Point], ys: List[Point]) -> BLSCommitment:
    i_x = poly_from_evals(xs, ys)
    z_x = zero_poly(xs)
    proof_poly = (poly - i_x) / z_x
    return BLSCommitment(eval_in_G1(proof_poly))

# Verifies the `proof` that a polynomial with `commitment` indeed evaluates to
# values ys[0], ys[1],... at points xs[0], xs[1],...
def check_proof(commitment: BLSCommitment, proof: BLSCommitment,
                xs: List[Point], ys: List[Point]) -> bool:
    i_x = poly_from_evals(xs, ys)
    z_x = zero_poly(xs)
    return (
        bls.Pairing(proof, eval_in_G2(z_x)) ==
        bls.Pairing(commitment - eval_in_G1(i_x), G2_GENERATOR)
    )
```

`commit_to_max_degree()` calculates a proof that the polynomial `poly` has degree not above `poly.degree()` (the claimed maximum degree, which the verifier receives as `max_degree`).

The degree proof works as follows. For a polynomial of degree $N$: $B(x)=c_0+c_1x+...+c_Nx^N$, the degree proof is the commitment to the polynomial $q(x)=B(x) \cdot x^{(MAX\_DEGREE - N)}$, where `MAX_DEGREE` is the maximum polynomial power available in the setup. The goal is to ensure that a proof can only be constructed if $deg(B) \le N$ (there are no hidden higher-order terms in the polynomial, which would thwart reconstruction).

```python
def commit_to_max_degree(poly: Poly) -> BLSCommitment:
    add_degree = MAX_DEGREE - poly.degree()
    add_poly = Poly([POINT_ZERO] * add_degree + [POINT_ONE])
    proof_poly = poly * add_poly
    return BLSCommitment(eval_in_G1(proof_poly))

def check_max_degree(commitment: BLSCommitment, proof: BLSCommitment, max_degree: int) -> bool:
    add_degree = MAX_DEGREE - max_degree
    add_poly = Poly([POINT_ZERO] * add_degree + [POINT_ONE])
    return (
        bls.Pairing(proof, G2_GENERATOR) ==
        bls.Pairing(commitment, eval_in_G2(add_poly))
    )
```

## Data evaluation points

Since we are free to choose any $x_0, x_1, ... , x_N$, let's do it smartly to help optimized algorithms (Fast Fourier Transform, FK20) do their job.
Thus let's choose and fix the following values:

$$ x_i = \omega^{rev(i)}$$

where `rev(i) = reverse_bit_order(i)` (with the bit width derived from `MAX_EXT_POINTS`), and $\omega$ is the root of unity (of order `MAX_EXT_POINTS`) defined by the constant `ROOT_OF_UNITY`

```python
def reverse_bit_order(n: int) -> int:
    """
    Reverse the bit order of an integer n
    """
    assert n < MAX_EXT_POINTS
    return int(('{:0' + str(MAX_EXT_POINTS.bit_length() - 1) + 'b}').format(n)[::-1], 2)

def get_x(point_index: int) -> Point:
    return ROOT_OF_UNITY ** reverse_bit_order(point_index)

def get_xs(point_start_index: int, point_end_index: int) -> List[Point]:
    return [get_x(index) for index in range(point_start_index, point_end_index)]
```

## Data Availability Sampling

### Erasure coding

The extension of the original data, $[a_{N+1}, a_{N+2}, ... , a_{2N}]$, is simply another $N$ evaluations of the polynomial $P_a(x)$ at the points $[x_{N+1}, x_{N+2}, ... , x_{2N}]$, i.e. $a_i = P_a(x_i)$ for $i = (N+1),...,2N$

Thus the full extended data is 2x bigger than the original and consists of the values $[a_0, a_1, ... , a_{2N}]$. The polynomial $P_a(x)$ can be reconstructed from just any $N$ values of this set, and thus the whole extended data set can be recovered from just any $N$ values.

### Functions

`eval_multi()` evaluates the polynomial `poly` at a number of consecutive data points.
This reference implementation is suboptimal (it has quadratic complexity); it's worth considering the faster FFT ([Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform)) algorithm. The other methods could also be implemented more efficiently.

```python
# Evaluate the polynomial at consecutive x points
# An optimized version would use FFT
def eval_multi(poly: Poly, start_point_index: int, end_point_index: int) -> List[Point]:
    return [poly.eval(x) for x in get_xs(start_point_index, end_point_index)]

# Returns the extended data (2x points) for the original unextended data
def extend_data(data: List[Point]) -> List[Point]:
    poly = poly_from_evals(get_xs(0, length(data)), data)
    return eval_multi(poly, 0, 2 * length(data))

# Returns the original data from the extended data
def unextend_data(extended_data: List[Point]) -> List[Point]:
    return extended_data[:length(extended_data) // 2]

# Taking any N points (or more) out of the 2N extended data points,
# recovers and returns all 2N points of the extended data
def recover_data(some_ext_data: List[Optional[Point]]) -> List[Point]:
    xs = [get_x(i) for i, opt in enumerate(some_ext_data) if opt is not None]
    ys = [y for y in some_ext_data if y is not None]
    poly = poly_from_evals(xs, ys)
    return eval_multi(poly, 0, length(some_ext_data))
```

### Sampled Kate proofs

```python
# Returns the commitment for the full dataset
def commit_to_data(data: List[Point]) -> BLSCommitment:
    return commit_to_poly(get_xs(0, length(data)), data)
```

`construct_proofs()` creates proofs for all samples (data chunks of length `POINTS_PER_SAMPLE`) in a batch way. This reference implementation has suboptimal complexity $O(n^2)$.
Assuming the specific $x$ [points selection](#Data-evaluation-points), the better optimized [FK20 (TODO link)]() algorithm can be used here

```python
def construct_proofs(extended_data: List[Point]) -> List[BLSCommitment]:
    xs = get_xs(0, length(extended_data))
    poly = poly_from_evals(xs, extended_data)
    assert poly.degree() < length(extended_data) // 2
    sample_count = length(extended_data) // POINTS_PER_SAMPLE
    return [
        construct_proof(
            poly,
            xs[i * POINTS_PER_SAMPLE : (i + 1) * POINTS_PER_SAMPLE],
            extended_data[i * POINTS_PER_SAMPLE : (i + 1) * POINTS_PER_SAMPLE]
        )
        for i in range(0, sample_count)
    ]
```
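The extend-then-recover round trip described above can be checked end to end with a toy field. Everything below is an illustrative assumption, not part of the spec: the prime `337` stands in for the real `MODULUS` (chosen because it admits an 8th root of unity, `85`), `lagrange_interp` is a hypothetical helper playing the role of `poly_from_evals` plus evaluation, and the bit-reversed point ordering is skipped for brevity:

```python
# Toy sketch only: small prime field, no bit-reversal, plain Lagrange interpolation
MODULUS = 337        # prime; 337 - 1 is divisible by 8, so 8th roots of unity exist
ROOT_OF_UNITY = 85   # 85 has multiplicative order 8 mod 337

def inv(a: int) -> int:
    # modular inverse via Fermat's little theorem
    return pow(a, MODULUS - 2, MODULUS)

def lagrange_interp(xs: list, ys: list, x: int) -> int:
    """Evaluate at x the unique polynomial passing through (xs[i], ys[i]) mod MODULUS."""
    total = 0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for m, xm in enumerate(xs):
            if m != j:
                num = num * (x - xm) % MODULUS
                den = den * (xj - xm) % MODULUS
        total = (total + yj * num * inv(den)) % MODULUS
    return total

xs_all = [pow(ROOT_OF_UNITY, i, MODULUS) for i in range(8)]  # 8 distinct points
data = [10, 20, 30, 40]                                      # original data points

# Extend: interpolate through the 4 original points, evaluate at all 8
extended = [lagrange_interp(xs_all[:4], data, x) for x in xs_all]
assert extended[:4] == data   # the extension preserves the original points

# Recover: any 4 surviving points reconstruct the full extended set
survivors = [1, 3, 4, 6]
recovered = [lagrange_interp([xs_all[i] for i in survivors],
                             [extended[i] for i in survivors], x) for x in xs_all]
assert recovered == extended
```

Replacing `lagrange_interp` with an inverse FFT over the same roots of unity is exactly the optimization path the spec alludes to for `poly_from_evals()` and `eval_multi()`.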