# New Data Availability Sampling spec draft
## _Note_
This is a draft version of refactored Sharding DAS specification:
https://github.com/ethereum/eth2.0-specs/blob/dev/specs/das/das-core.md
The goals of this refactoring:
- Fill the gaps in the original version
- Allow arbitrary length data to be sampled/reconstructed/proven (not just power-of-2 lengths as in the original version)
- Generify the spec by relying on fundamental concepts (like polynomials and pairings) instead of specific optimized functions like FFT, FK20, KZG lib functions
- (special case of the above point) Get rid of the numerous point permutations in the spec, as they are only applicable to the inputs/outputs of optimized algorithms. These permutations were the major source of confusion for me while understanding the spec core.
- Preserve 100% compatibility with the original version, i.e. functions could be substituted with their optimized variants in a compatible way
- Potentially use any arbitrary (even fractional) extension factor (hardcoded to 2x in the original spec)
_Disclaimer_: please consider the Python code below as pseudo-code: it is not 100% correct and needs to be revised later
Implementation of this spec: https://github.com/Nashatyrev/artemis/tree/8a5cfe60978c82edbe237ffeca5edeaf51f9dfb2/ethereum/spec/src/main/java/tech/pegasys/teku/spec/logic/common/das/baseline
## Eth2 Data Availability Sampling
All operations related to [Erasure coding](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction) and [Kate commitments](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html) are based on polynomials over $\Bbb{F}_R$ field ([finite field](https://en.wikipedia.org/wiki/Finite_field) of integers modulo `MODULUS`):
$$P(x) = c_0 + c_1x + c_2x^2 + ... + c_Nx^N$$
All polynomial elements $c_i$, $x$ and $P(x)$ are from $\Bbb{F}_R$. In the spec the `Point` class represents an $\Bbb{F}_R$ element
Let's choose and fix $M$ distinct arbitrary $x$ values: $[x_0, x_1, ... , x_M]$. For now let's assume they are arbitrary; a smarter set will be chosen [later](#Data-evaluation-points). This set of values is fixed for any datapoints to be encoded ever, thus the set should be large enough and contain at least `MAX_EXT_POINTS` elements (the maximum number of extended data points)
Given datapoints $[a_0, a_1, ... , a_N]$ let's find the coefficients $[p_0, p_1, ... , p_N]$ such that the evaluations of the polynomial $P(x)$ would be equal to our datapoints $[a_0, a_1, ... , a_N]$ at the $x$ points $[x_0, x_1, ... , x_N]$. I.e. $P(x_0) = a_0, P(x_1) = a_1, ... P(x_N) = a_N$, or $P(x_i) = a_i$ for short. See the `poly_from_evals()` function.
Thus we get the specific polynomial $P_a(x) = p_0 + p_1x + p_2x^2 + ... + p_Nx^N$ corresponding to our original dataset $[a_0, a_1, ... , a_N]$
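The interpolation step above can be sketched over a tiny stand-in field. Everything here (`P = 17`, the chosen points, the helper `lagrange_eval`) is made up for illustration and is not part of the spec:

```python
P = 17  # toy modulus standing in for the real MODULUS

def lagrange_eval(xs, ys, x):
    # Evaluate the unique interpolating polynomial through (xs[j], ys[j]) at x,
    # in Lagrange form, with all arithmetic in F_P
    total = 0
    for j, xj in enumerate(xs):
        num = den = 1
        for m, xm in enumerate(xs):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        total = (total + ys[j] * num * pow(den, -1, P)) % P
    return total

xs = [2, 5, 7]     # fixed distinct x values
data = [3, 11, 6]  # datapoints a_i
# The interpolant hits every datapoint: P(x_i) = a_i
assert all(lagrange_eval(xs, data, x) == a for x, a in zip(xs, data))
```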
## More constants
```python
# max points in extended data
MAX_EXT_POINTS = MAX_SAMPLES_PER_BLOCK * POINTS_PER_SAMPLE
# max data polynomial degree
MAX_DEGREE = MAX_EXT_POINTS // 2 - 1
# The G2 generator: [s^0] mapped to G2
G2_GENERATOR = G2_SETUP[0]
# Zero element in Fr
POINT_ZERO = Point(0)
# Unity element in Fr
POINT_ONE = Point(1)
```
## Polynomial operations
The `Poly` class represents a polynomial, which is essentially just a vector of its coefficients $[c_0, c_1,...,c_N]$, and a set of standard operations.
```python
class Poly:
    # polynomial coefficients (any of them can be 0)
    coeff: List[Point]

    def __add__(self, other: Poly) -> Poly: ...
    def __sub__(self, other: Poly) -> Poly: ...
    # multiplication by another polynomial or by a scalar
    def __mul__(self, other: Union[Poly, Point]) -> Poly: ...
    # division by another polynomial or by a scalar
    def __truediv__(self, other: Union[Poly, Point]) -> Poly: ...
    # max degree of x with a non-zero coefficient
    def degree(self) -> int: ...
    # evaluate the polynomial at point x
    def eval(self, x: Point) -> Point: ...

# calculates the product of several polynomials
def prod(polys: List[Poly]) -> Poly:
    return reduce(lambda p1, p2: p1 * p2, polys)
```
`eval_in_G1()` calculates the evaluation of polynomial $P(x)$ (the `poly` parameter) at the secret point $s$ mapped to $\Bbb{G}_1$: $G\cdot P(s)$. Since $s$ is not known to anyone, we can't evaluate the polynomial directly. However we have `G1_SETUP` with the powers of $s$ mapped to $\Bbb{G}_1$: $[s^0]_1, [s^1]_1, [s^2]_1...[s^N]_1$. Thus we may evaluate the polynomial directly in $\Bbb{G}_1$:
$$ \texttt{eval_in_G1}(P(x)) = p_0[s^0]_1 + p_1[s^1]_1 + p_2[s^2]_1 +...+ p_N[s^N]_1$$
The same applies to `eval_in_G2()` for the group $\Bbb{G}_2$ and the predefined setup points `G2_SETUP`
```python
def eval_in_G1(poly: Poly) -> G1:
    return sum([G1_SETUP[i] * poly.coeff[i] for i in range(0, poly.degree() + 1)])

def eval_in_G2(poly: Poly) -> G2:
    return sum([G2_SETUP[i] * poly.coeff[i] for i in range(0, poly.degree() + 1)])
```
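For intuition, the same trick can be sketched with the multiplicative group modulo a small prime standing in for $\Bbb{G}_1$. All numbers here are made-up toy values; there is no pairing and no security in this sketch:

```python
q = 101            # modulus of the toy "group" (multiplicative group mod q)
g = pow(2, 20, q)  # generator of the order-5 subgroup (2 generates Z_101^*)
r = 5              # toy scalar field modulus (the subgroup order)
s = 3              # the "secret": used once to build the setup, then discarded
SETUP = [pow(g, pow(s, i, r), q) for i in range(4)]  # toy G1_SETUP: g^(s^i)

def eval_in_group(coeffs):
    # product of SETUP[i]^{c_i} equals g^{P(s)}, without ever knowing s
    acc = 1
    for c, gs in zip(coeffs, SETUP):
        acc = acc * pow(gs, c, q) % q
    return acc

coeffs = [2, 1, 4]  # P(x) = 2 + x + 4x^2
p_at_s = (2 + 1 * s + 4 * s * s) % r  # what P(s) would be, for checking only
assert eval_in_group(coeffs) == pow(g, p_at_s, q)
```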
The `poly_from_evals()` function derives a polynomial from its evaluations `ys` at points `xs`.
The baseline method for this calculation is the [Lagrange polynomial](https://en.wikipedia.org/wiki/Lagrange_polynomial), which is implemented below.
The complexity of the reference implementation is $O(n^2)$, which is not perfect. Luckily, when the $x$ points selection is made smartly like [below](#Data-evaluation-points), a faster inverse FFT ([Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform)) algorithm can be used.
```python
def poly_from_evals(xs: List[Point], ys: List[Point]) -> Poly:
    assert len(xs) == len(ys)

    def lagrange_basis(j: int) -> Poly:
        return prod([
            Poly([-xm, POINT_ONE]) / (xs[j] - xm)
            for xm in xs[:j] + xs[j+1:]
        ])

    return sum([
        lagrange_basis(j) * ys[j]
        for j in range(0, len(xs))
    ])

# Calculates the polynomial (x - x0)*(x - x1)*...*(x - xN)
def zero_poly(xs: List[Point]) -> Poly:
    return prod([Poly([-x, POINT_ONE]) for x in xs])
```
## Kate commitments
### Kate polynomial commitment
_Polynomial commitment_ could be thought of as a compact polynomial fingerprint, or hash.
A _Kate_ polynomial commitment is an element of the group $\Bbb{G}_1$, and is essentially the evaluation of $P_a(x)$ at the secret point $s$ mapped to $\Bbb{G}_1$: $G\cdot P_a(s)$. Since $s$ is not known to anyone, we can't evaluate the polynomial directly, but using `G1_SETUP` (the powers of $s$ mapped to $\Bbb{G}_1$: $[s^0]_1, [s^1]_1, [s^2]_1...[s^N]_1$) we may evaluate it directly in $\Bbb{G}_1$:
$$ Commitment_a = p_0[s^0]_1 + p_1[s^1]_1 + p_2[s^2]_1 +...+ p_N[s^N]_1$$
### Kate multi-proof
A multi-proof is an element of the group $\Bbb{G}_1$ which proves to a verifier (who doesn't know the polynomial but has the $Commitment_a$ constructed above) that the polynomial $P_a(x)$ evaluates to values $[y_0, y_1, ... , y_K]$ at points $[z_0, z_1, ... , z_K]$. I.e. $P_a(z_i) = y_i$ for $i = 0,...,K$
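The heart of the multi-proof is the quotient polynomial $(P_a(x) - I(x)) / Z(x)$, where $I(x)$ interpolates the claimed values and $Z(x)$ vanishes at the opened points: the division leaves no remainder exactly when $P_a(z_i) = y_i$ holds. A small sanity check over $\Bbb{F}_{17}$ (toy field; the helper names `poly_mul`, `poly_divmod`, `poly_eval` are ad hoc, not part of the spec):

```python
P = 17  # toy modulus standing in for the real MODULUS

def poly_mul(a, b):
    # multiply coefficient lists (lowest degree first) mod P
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % P
    return out

def poly_divmod(num, den):
    # schoolbook long division mod P; returns (quotient, remainder)
    num = list(num)
    q = [0] * (len(num) - len(den) + 1)
    inv_lead = pow(den[-1], -1, P)
    for i in range(len(q) - 1, -1, -1):
        q[i] = num[i + len(den) - 1] * inv_lead % P
        for j, d in enumerate(den):
            num[i + j] = (num[i + j] - q[i] * d) % P
    return q, num[:len(den) - 1]

def poly_eval(c, x):
    # Horner evaluation mod P
    acc = 0
    for ci in reversed(c):
        acc = (acc * x + ci) % P
    return acc

p_a = [3, 1, 0, 2]                 # P_a(x) = 3 + x + 2x^3
zs = [2, 5]                        # the opened points z_i
ys = [poly_eval(p_a, z) for z in zs]
# I(x): the line through (z_0, y_0), (z_1, y_1)
slope = (ys[1] - ys[0]) * pow(zs[1] - zs[0], -1, P) % P
i_x = [(ys[0] - slope * zs[0]) % P, slope]
# Z(x) = (x - z_0)(x - z_1)
z_x = poly_mul([-zs[0] % P, 1], [-zs[1] % P, 1])
# P_a(x) - I(x), with I padded to the same length
diff = [(a - b) % P for a, b in zip(p_a, i_x + [0] * (len(p_a) - len(i_x)))]
quot, rem = poly_divmod(diff, z_x)
assert all(c == 0 for c in rem)    # exact division: the claim P_a(z_i) = y_i holds
```

The real proof is then the commitment (the `eval_in_G1()` evaluation) of that quotient, which this toy check does not attempt.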
### Functions
`commit_to_poly()` calculates a commitment for a polynomial represented by its evaluation points `xs` and `ys`. Read [here](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html#kate-commitment) for more details.
```python
def commit_to_poly(xs: List[Point], data_ys: List[Point]) -> BLSCommitment:
    poly = poly_from_evals(xs, data_ys)
    return BLSCommitment(eval_in_G1(poly))
```
Please refer to [this chapter](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html#multiproofs) for more details on Kate multiple point proofs
```python
# Creates a proof that the polynomial evaluates to
# values ys[0], ys[1], ... at points xs[0], xs[1], ...
def construct_proof(poly: Poly, xs: List[Point], ys: List[Point]) -> BLSCommitment:
    i_x = poly_from_evals(xs, ys)
    z_x = zero_poly(xs)
    proof_poly = (poly - i_x) / z_x
    return BLSCommitment(eval_in_G1(proof_poly))

# Verifies the `proof` that a polynomial with `commitment` indeed evaluates to
# values ys[0], ys[1], ... at points xs[0], xs[1], ...
def check_proof(commitment: BLSCommitment, proof: BLSCommitment,
                xs: List[Point], ys: List[Point]) -> bool:
    i_x = poly_from_evals(xs, ys)
    z_x = zero_poly(xs)
    return (
        bls.Pairing(proof, eval_in_G2(z_x)) ==
        bls.Pairing(commitment - eval_in_G1(i_x), G2_GENERATOR)
    )
```
`commit_to_max_degree()` calculates the proof that the polynomial `poly` has degree not above `poly.degree()`
The degree proof works as follows. For a polynomial of degree $N$: $B(x)=c_0+c_1x+...+c_Nx^N$, the degree proof is the commitment to the polynomial $q(x)=B(x) \cdot x^{(MAX\_DEGREE - N)}$, where `MAX_DEGREE` is the maximum polynomial degree available in the setup. The goal is to ensure that the proof can only be constructed if `deg(B) <= N` (there are no hidden higher-order terms in the polynomial, which would thwart reconstruction).
```python
def commit_to_max_degree(poly: Poly) -> BLSCommitment:
    add_degree = MAX_DEGREE - poly.degree()
    add_poly = Poly([POINT_ZERO] * add_degree + [POINT_ONE])
    proof_poly = poly * add_poly
    return BLSCommitment(eval_in_G1(proof_poly))

def check_max_degree(commitment: BLSCommitment, proof: BLSCommitment, max_degree: int) -> bool:
    add_degree = MAX_DEGREE - max_degree
    add_poly = Poly([POINT_ZERO] * add_degree + [POINT_ONE])
    return (
        bls.Pairing(proof, G2_GENERATOR) ==
        bls.Pairing(commitment, eval_in_G2(add_poly))
    )
```
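A quick coefficient-level view of what multiplying by $x^{MAX\_DEGREE - N}$ does: it prepends zeros to the coefficient vector, so the resulting $q(x)$ needs exactly the highest power available in the setup, and a hidden higher-order term would push it past the setup. Toy values only, no group arithmetic:

```python
MAX_DEGREE = 7  # toy value

def shift_by_x_power(coeffs, k):
    # coefficients of B(x) * x^k: k zeros are prepended
    return [0] * k + list(coeffs)

b = [5, 0, 3]  # B(x) = 5 + 3x^2, claimed degree N = 2
q = shift_by_x_power(b, MAX_DEGREE - 2)
# q(x) = B(x) * x^(MAX_DEGREE - N) has degree exactly MAX_DEGREE,
# the highest power available in the setup
assert len(q) - 1 == MAX_DEGREE
# A hidden higher-order term would push q past the setup:
b_cheat = [5, 0, 3, 0, 1]  # secretly degree 4
q_cheat = shift_by_x_power(b_cheat, MAX_DEGREE - 2)
assert len(q_cheat) - 1 > MAX_DEGREE  # no setup point for this power
```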
## Data evaluation points
Since we are free to choose any $x_0, x_1, ... , x_N$, let's choose them smartly to help optimized algorithms (Fast Fourier Transform, FK20) do their job. Thus let's choose and fix the following values:
$$ x_i = \omega^{rev(i)}$$
where `rev(i) = reverse_bit_order(i)`, and $\omega$ is a root of unity of order `MAX_EXT_POINTS`, defined by the constant `ROOT_OF_UNITY`
```python
def reverse_bit_order(n: int) -> int:
    """
    Reverse the bit order of the integer n
    """
    assert n < MAX_EXT_POINTS
    return int(('{:0' + str(MAX_EXT_POINTS.bit_length() - 1) + 'b}').format(n)[::-1], 2)

def get_x(point_index: int) -> Point:
    return ROOT_OF_UNITY ** reverse_bit_order(point_index)

def get_xs(point_start_index: int, point_end_index: int) -> List[Point]:
    return [get_x(index) for index in range(point_start_index, point_end_index)]
```
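For intuition, here is the permutation for a toy `MAX_EXT_POINTS = 8` (the real constant is much larger). A useful property of this ordering is that any prefix of length $2^k$ maps onto the multiples of `MAX_EXT_POINTS`$/2^k$, i.e. $x_0..x_{2^k-1}$ are exactly the complete set of $2^k$-th roots of unity:

```python
MAX_EXT_POINTS = 8  # toy value for illustration

def reverse_bit_order(n: int) -> int:
    # same logic as the spec function, over the toy constant
    assert n < MAX_EXT_POINTS
    bits = MAX_EXT_POINTS.bit_length() - 1
    return int(format(n, '0{}b'.format(bits))[::-1], 2)

perm = [reverse_bit_order(i) for i in range(MAX_EXT_POINTS)]
assert perm == [0, 4, 2, 6, 1, 5, 3, 7]
# Any prefix of length 2^k covers the multiples of MAX_EXT_POINTS / 2^k,
# so x_0..x_{2^k - 1} are exactly the full set of 2^k-th roots of unity
assert sorted(perm[:4]) == [0, 2, 4, 6]
assert sorted(perm[:2]) == [0, 4]
```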
## Data Availability Sampling
### Erasure coding
The extension of the original data, $[a_{N+1}, a_{N+2}, ... , a_{2N}]$, is simply another $N$ evaluations of the polynomial $P_a(x)$ at points $[x_{N+1}, x_{N+2}, ... , x_{2N}]$, i.e. $a_i = P_a(x_i)$ for $i = (N+1),...,2N$
Thus the full extended data is 2x bigger than the original and consists of the values $[a_0, a_1, ... , a_{2N}]$.
The polynomial $P_a(x)$ can be reconstructed from any $N$ values of this set, and thus the whole extended data set can be recovered from any $N$ of its values.
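A toy end-to-end run of extend-then-recover over $\Bbb{F}_{17}$, with arbitrary distinct $x$ points standing in for the real roots-of-unity points (the helper `interp_eval` is ad hoc, not part of the spec):

```python
P = 17  # toy modulus; the real spec uses the BLS scalar field MODULUS

def interp_eval(xs, ys, x):
    # Lagrange evaluation of the interpolating polynomial at x, mod P
    total = 0
    for j, xj in enumerate(xs):
        num = den = 1
        for m, xm in enumerate(xs):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        total = (total + ys[j] * num * pow(den, -1, P)) % P
    return total

all_xs = [1, 2, 3, 4, 5, 6, 7, 8]  # fixed distinct evaluation points
data = [9, 4, 13, 7]               # N = 4 original datapoints
# Extend: evaluate the degree-<4 interpolant at the next N points
ext = data + [interp_eval(all_xs[:4], data, x) for x in all_xs[4:]]
# Lose half of the extended points, keeping some arbitrary N of them
kept = [1, 3, 4, 6]
kept_xs = [all_xs[i] for i in kept]
kept_ys = [ext[i] for i in kept]
# Any N of the 2N extended points recover the whole extended set
recovered = [interp_eval(kept_xs, kept_ys, x) for x in all_xs]
assert recovered == ext
```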
### Functions
`eval_multi()` evaluates the polynomial `poly` at a number of consecutive data points.
This reference implementation is suboptimal (has quadratic complexity); it's worth considering the faster FFT ([Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform)) algorithm.
Other methods could also be implemented more efficiently.
```python
# Evaluates the polynomial at consecutive x points
# Optimized version is FFT
def eval_multi(poly: Poly, start_point_index: int, end_point_index: int) -> List[Point]:
    return [poly.eval(x) for x in get_xs(start_point_index, end_point_index)]

# Returns the extended data (2x points) of the original unextended data
def extend_data(data: List[Point]) -> List[Point]:
    poly = poly_from_evals(get_xs(0, len(data)), data)
    return eval_multi(poly, 0, 2 * len(data))

# Returns the original data from the extended data
def unextend_data(extended_data: List[Point]) -> List[Point]:
    return extended_data[:len(extended_data) // 2]

# Taking any N points (or more) from the 2N extended data points,
# recovers and returns all 2N points of the extended data
def recover_data(some_ext_data: List[Optional[Point]]) -> List[Point]:
    xs = [get_x(i) for i, opt in enumerate(some_ext_data) if opt is not None]
    ys = [y for y in some_ext_data if y is not None]
    poly = poly_from_evals(xs, ys)
    return eval_multi(poly, 0, len(some_ext_data))
```
### Sampled Kate proofs
```python
# Returns the commitment for the full dataset
def commit_to_data(data: List[Point]) -> BLSCommitment:
    return commit_to_poly(get_xs(0, len(data)), data)
```
`construct_proofs()` creates proofs for all samples (data chunks of length `POINTS_PER_SAMPLE`) in a batch way. This reference implementation has suboptimal complexity $O(n^2)$. Assuming the specific $x$ [points selection](#Data-evaluation-points), the better optimized [FK20 (TODO link)]() algorithm can be used here
```python
def construct_proofs(extended_data: List[Point]) -> List[BLSCommitment]:
    xs = get_xs(0, len(extended_data))
    poly = poly_from_evals(xs, extended_data)
    assert poly.degree() < len(extended_data) // 2
    sample_count = len(extended_data) // POINTS_PER_SAMPLE
    return [
        construct_proof(
            poly,
            xs[i * POINTS_PER_SAMPLE : (i + 1) * POINTS_PER_SAMPLE],
            extended_data[i * POINTS_PER_SAMPLE : (i + 1) * POINTS_PER_SAMPLE]
        ) for i in range(0, sample_count)
    ]
```