Week 44 - HackMD

# Week 44 Notes/Report: Working on understanding the fft evaluation gpu code, kernels and any math behind it. ## General poly eval notes: There are two was to represent polynomials: **Coefficient representation:** $\sum_k c_k*x^k$ Easy to do addition & subtraction, however it is difficult to do multiplication & division **Evaluation representation:** or point-value representationof a polynomial $A(x)$ of degree-bound $n$ is a set of point-value pairs ${(x_0, y_0),(x_1, y_1) ... (x_{n-1}, y_{n-1})}$ such that all of the $x_k are distinct and $y_k = A(x_{k})$ for $k = 0,1,...,n-1$ in eval form its easy to do addition & multiplication. FFT and InverseFFT converts between "eval form" and "coeff form" InvFFT turns coeff form into eval form. FFT turns eval form int coeff form. i.e., The inverse of evaluation is called interpolation. It determines the coefficient form of polynomial point-value representation. For any set ${(x_0, y_0),(x_1, y_1) ... (x_{n-1}, y_{n-1})}$ of $n$ point-value pairs such that all the $x_k$ values are distinct, there is a **unique** polynomial $A(x)$ of degree-bound $n$ such that $y_k = A(x_k)$ for $k = 0,1,...,n-1$ ## Polynomial multi the trick is to represent the poly not in terms of its coefficients, but in terms of their values at specific points. FFT allows us to get very efficiently from coefficitns to values. In point-value form, multiplication $C(x) = A(x)B(x)$ is given by $C(x)= A(x_k)B(x_k)$ for any point $x_k$. **Problem:** if $A$ and $B$ are of degree-bound $n$, then $C$ is of degree-bound $2n$. We need to start with the " " point-value forms for $A$ and $B$ consisting of $2n$ point-value pairs each. $A: {(x_0, y_0),(x_1, y_1) ... (x_{2n-1}, y_{2n-1})}$ $B: {(x_0, y'_0),(x_1, y'_1) ... (x_{2n-1}, y'_{2n-1})}$ $C: {(x_0, y_0 y'_0),(x_1, y_1 y'_1) ... (x_{2n-1}, y_{2n-1} y'_{2n-1})}$ The time to multiply two polys of degree-bound $n$ in point-value form is $O(n)$. FFT takes advantage of the special properties of the (complex) roots of unity to comput $DFT_n(a)$ in $)(n log n)$ time. Divide-and-conquer strategy: --> define two new polynomials of degree-bound $n/2$, using even-index and odd-index coefficients of $A(x)$ separately. $A^[0](x) = a_0 + a_2x + a_4x^2 +...+ a_{n-2}x^{{n/2}-1}$ $A^[1](x) = a_0 + a_2x + a_4x^2 +...+ a_{n-2}x^{{n/2}-1}$ $A(x) = A^[0](x^2) + xA^[1](x^2)$ The problem of evaluating $A(x)$ at $\omega^{0}_n,\omega^{1}_n,\omega^{n-1}_n$ reduces to: - evaluating the degree-bond $n/2$ polynomials $A^[0](x)$ and $A^[1](x)$ at the points $((\omega^{0}_n)^2, ((\omega^{1}_n)^2, ((\omega^{n-1}_n)^2$. - combining the results by $A(x) = A^[0](x^2) + xA^[1](x^2)$ Polynomials $A^[0]$ and $A^[1]$ are recursively evaluated at the ${n/2}^{th}$ roots of unity. looking at the `extended_fft()` function `evaluation_gpu.rs` ```Rust! #[cfg(feature = "cuda")] pub(crate) fn do_extended_fft<F: FieldExt, C: CurveAffine<ScalarExt = F>>( pk: &ProvingKey<C>, program: &Program, origin_values: &Polynomial<F, Coeff>, allocator: &mut LinkedList<Buffer<F>>, helper: &mut ExtendedFFTHelper<F>, ) -> EcResult<Buffer<F>> { let origin_size = 1u32 << pk.vk.domain.k(); let extended_size = 1u32 << pk.vk.domain.extended_k(); let local_work_size = 128; let global_work_size = extended_size / local_work_size; //let timerall = start_timer!(|| "gpu eval unit"); let values = allocator .pop_front() .unwrap_or_else(|| unsafe { program.create_buffer::<F>(extended_size as usize).unwrap() }); let origin_values_buffer = Rc::get_mut(&mut helper.origin_value_buffer).unwrap(); //let timer = start_timer!(|| "write from buffer"); program.write_from_buffer(origin_values_buffer, &origin_values.values)?; //end_timer!(timer); //let timer = start_timer!(|| "distribute powers zeta"); do_distribute_powers_zeta( pk, program, origin_values_buffer, &helper.coset_powers_buffer, )?; //end_timer!(timer); //let timer = start_timer!(|| "eval fft prepare"); let kernel_name = format!("{}_eval_fft_prepare", "Bn256_Fr"); let kernel = program.create_kernel( &kernel_name, global_work_size as usize, local_work_size as usize, )?; kernel .arg(origin_values_buffer) .arg(&values) .arg(&origin_size) .run()?; //end_timer!(timer); //let timer = start_timer!(|| "do fft pure"); let domain = &pk.vk.domain; let a = do_fft_pure( program, values, domain.extended_k(), allocator, &helper.pq_buffer, &helper.omegas_buffer, ); //end_timer!(timer); //end_timer!(timerall); a } ``` from `fft.cl` the kernel for `_eval_fft_prepare`: ```opencl! KERNEL void FIELD_eval_fft_prepare( GLOBAL FIELD* origin_value, GLOBAL FIELD* value, uint origin_size ) { uint gid = GET_GLOBAL_ID(); // --> gets the work item id uint idx = gid; // --> assing the work item to idx if (idx < origin_size) { value[idx] = origin_value[idx]; // copy the value at index (idx) from the `origin_value[idx]` fro mthe `vlaue` array } else { value[idx] = FIELD_ZERO; // anything out of bounds assign as zero. } } ``` this kernel prepares the `value` array by copying data from `origin_value` array. If the loop goes out of bounds it fills `value` with zero. `origin_value` is: ```Rust! let origin_values_buffer = Rc::get_mut(&mut helper.origin_value_buffer).unwrap(); ``` `value` is: ```Rust! let values = allocator .pop_front() .unwrap_or_else(|| unsafe { program.create_buffer::<F>(extended_size as usize).unwrap() }); ``` but before the above kernel is called. `do_distribute_powers_zeta()` is called. i.e., distributing powers of the primitive $n^{th}$ root of unity, commonly denoted as "zeta." ### NTT general The Number-Theoretic Transfer (NTT) algorithm is a variation of the FFT algorithm that enables efficient conversions between different polynomial representations like coefficient and evaluation forms. These conversions are necessary because arithmetic like multiplications among large polynomials are more amenable in evaluation form since they can be performed in log-linear time. While traditional FFTs are defined over complex numbers, NTTs are defined over finite fields F with a multiplicative subgroup $H$ of order $n = 2k$ . The subgroup $H$ corresponds to the domain consisting of the $n^{th}$ roots of unity $H = {\omega^0 , \omega^1 , ..., \omega^n }$ and are known as the twiddle factors. Divide-and-conquer --> the Cooley-Tukey algorithm is the most common, which recursively decomposes the NNT of size N into smaller size $N/2$ by preforming a butterfly operation at every stage. define NTT as two $N$-sized vectors $X$ and $Y$ where: $X_i = \sum^{N-1}_{j=0} Y_j \omega^{ij}_N (mod P)$ where $x$ is the size of the input vector, and $\omega$ are teh twiddle factors (roots of unity). inverse NTT is define as: $X_i = N^{-1} \sum^{N-1}_{j=0} Y_j \omega^{-ij}_N (mod P)$. **WIP** --------------------- ## ProveExpression::Scale **WIP** from `fft.cl` ```opencl! KERNEL void FIELD_eval_scale( GLOBAL FIELD* res, GLOBAL FIELD* l, int l_rot, uint size, GLOBAL FIELD* c ) { uint gid = GET_GLOBAL_ID(); uint idx = gid; uint lidx = (idx + size + l_rot) & (size - 1); res[idx] = FIELD_mul(l[lidx], c[0]); } ```