## Week 47 Notes/Report Still on poly eval in gpu. working on benching (continued work next week), working on the split-radix fft algo and any adjustments to it. basic bench on a simple circuit produces the number of 30% of time spent in `fft main` --> which is the `format!("{}_radix_fft", "Bn256_Fr"` kernel. General notes and basic bench below. This is heavily being changed and worked on in the next week. ### fft in `evaluation_gpu.rs` ``` in --> pub(crate) fn do_extended_fft<F: FieldExt, C: CurveAffine<ScalarExt = F>>(...) | | setup buffers and values: |--> writes origin_values: &Polynomial<F, Coeff>, to gpu buffer. |--> distribute powers zeta |--> eval fft prepare |--> call do_fft_pure() which is entry to below Both --> pub(crate) fn do_fft_pure<F: FieldExt>(...) --> pub(crate) fn _do_ifft<F: FieldExt>(...) // not actually used in evaluation_gpu.rs call: ----- |--> pub(crate) fn do_fft_core<F: FieldExt>(...) | which uses the format!("{}_radix_fft kernel | ``` So for `do_extended_fft` $$ \begin{aligned} \sum_{j=0}^m (\textrm{write_from_buffer}) + (\textrm{distribute powers zeta}) + (\textrm{_eval_fft_prepare}) + (\textrm{fft main(log_n)}) \end{aligned} $$ where fft main(log_n) is the function that runs the kernel `_radix_fft` within `do_fft_core()` #### Bench basic circuit: Components: ``` Total s for expressions gpu eval : 1642.000ms Total ms for permutations : 183.264ms Total ms for eval_h_lookups : 573.290ms Total ms for h_poly : 2640.000ms Total ms for multi_open : 539.771ms Total s for create_proof : 6.185s Total evals (distribute_powers_zeta) : 729 Total evals (eval_fft_prepare) : 729 Total evals : 729 Poly evals: Total µs for write_from_buffer : 546.075 --> 0.546075ms average time for write_from_buffer: 0.749µs Total µs for distribute_powers_zeta : 17.698 --> 0.017698ms average time for distribute_powers_zeta: 0.024µs Total µs for eval_fft_prepare : 38.994 --> 0.038994ms average time for eval_fft_prepare: 0.053µs Total ms for fft main 20 : 1259.488 average time for fft_main: 1.727693ms Total ms for do fft pure : 1277.932 average time for fft_pure: 1.752993ms Total ms for gpu eval unit : 1900.563 --> 1.900563s average time for gpu_eval_unit: 2.607ms Percentages: |--> advice : 4.228 |--> lookups 30 : 9.016 |--> lookups commit product : 7.393 |--> lookups add blinding value : 0.012 |--> lookups msm and fft : 13.426 |--> permutation commit : 1.522 |--> vanishing commit : 0.520 | |--> h_poly : 42.684 |--> lagrange_to_coeff_st : 3.752 |--> expressions_gpu_eval : 26.548 |--> permutations : 2.963 |--> eval_h_lookups : 9.269 | |--> vanishing construct : 1.540 |--> eval poly : 8.042 |--> eval poly vanishing : 0.000 -- fix missing |--> eval poly permutation : 0.000 -- fix missing |--> eval poly lookups : 0.000 -- fix missing | |--> multi_open : 8.727 % time in gpu_eval_unit : 30.729 Total: 97.109 ``` note: need to fix missing the values for eval poly in late stages of proof ![basicchart](https://hackmd.io/_uploads/HJoary0ET.png) -------------------------------------------------------- ### prove notes: WIP ``` --------------------------------------- prover.rs --> create_proof_from_witnes(); | | into L1148 | | let h_poly = pk.ev.evaluate_h(... | --------------------------------------- evaluation.rs --> pub(in crate::plonk) fn evaluate_h( ) | | --------------------------------------- evaluation_gpu.rs |--> pub(crate) fn eval_gpu<C: CurveAffine<ScalarExt = F>>( ...) | ----------------- |--> setup closure that: | precalculates omega(s), cosets, | ----------------- |--> | | | ``` ```rust! let h_poly: Polynomial<<<C as CurveAffine>::CurveExt as CurveExt>::ScalarExt, ExtendedLagrangeCoeff> ``` ```rust! x.eval_gpu(group_idx, pk, &unit_stat, &advice_poly[0], &instance_poly[0], y)) //in the below: let mut values = pk .ev .gpu_gates_expr .par_iter() .enumerate() .map(|(group_idx, x)| x.eval_gpu(group_idx, pk, &unit_stat, &advice_poly[0], &instance_poly[0], y)) .collect::<Vec<_>>() .into_iter() .reduce(|acc, x| acc + &x) .unwrap(); ``` #### Note 3: L615(ish) ```rust! let mut helper = gen_do_extended_fft(pk, program)?; let values_buf = self._eval_gpu( pk, program, memory_cache, advice, instance, &mut ys, &mut unit_cache, &mut LinkedList::new(), &mut helper, )?; ``` ----------------------------------------------------------------------------------------------- ## `prover.rs` ```rust // Obtain challenge for keeping all separate gates linearly independent let y: ChallengeY<_> = transcript.squeeze_challenge_scalar(); ``` ```rust! // Evaluate the h(X) polynomial // Evaluate the h(X) polynomial's constraint system expressions for the permutation constraints. ``` //------------------------------------------------ //------------------------------------------------ **From halo2 zcash:** Evaluate the h(X) polynomial's constraint system expressions for the permutation constraints. ```rust! // Evaluate the h(X) polynomial's constraint system expressions for the permutation constraints. let (permutations, permutation_expressions): (Vec<_>, Vec<_>) = permutations .into_iter() .zip(advice_cosets.iter()) .zip(instance_cosets.iter()) .map(|((permutation, advice), instance)| { permutation.construct( pk, &pk.vk.cs.permutation, advice, &fixed_cosets, instance, &permutation_cosets, l0, l_blind, l_last, beta, gamma, ) }) .unzip(); ``` //------------------------------------------------ **From halo2 zcash:** Evaluate the h(X) polynomial's constraint system expressions for the lookup constraints, if any. ```rust! let (lookups, lookup_expressions): (Vec<Vec<_>>, Vec<Vec<_>>) = lookups ... ``` then feed this into the below ```rust! let expressions = advice_cosets .iter() .zip(instance_cosets.iter()) .zip(permutation_expressions.into_iter()) .zip(lookup_expressions.into_iter()) .flat_map( |(((advice_cosets, instance_cosets), permutation_expressions), lookup_expressions)| { ... ``` **From GPUmod halo2:** Evaluate the h(X) polynomial ```rust! let advice = advice .into_iter() .map(|advice| { let timer = start_timer!(|| "lagrange_to_coeff_st"); let advice_polys: Vec<_> = advice .into_par_iter() .map(|poly| domain.lagrange_to_coeff_st(poly)) .collect(); end_timer!(timer); #[cfg(not(feature = "cuda"))] let advice_cosets: Vec<_> = advice_polys .iter() .map(|poly| domain.coeff_to_extended(poly.clone())) .collect(); AdviceSingle::<C> { advice_polys, #[cfg(not(feature = "cuda"))] advice_cosets, } }) .collect::<Vec<_>>(); #[cfg(feature = "cuda")] let h_poly = pk.ev.evaluate_h( pk, advice.iter().map(|a| &a.advice_polys).collect(), instance.iter().map(|i| &i.instance_polys).collect(), *y, *beta, *gamma, *theta, &lookups, &permutations, ); ``` //------------------------------------------------ Construct the vanishing argument's h(X) commitments ```rust! // Construct the vanishing argument's h(X) commitments let vanishing = vanishing.construct( params, domain, coset_evaluator, expressions, y, &mut rng, transcript, )?; ``` ```rust! let x: ChallengeX<_> = transcript.squeeze_challenge_scalar(); let xn = x.pow(&[params.n as u64, 0, 0, 0]); ``` ---------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------------- ### Notes on `ec-gpu-gen` & the `fft.cl` and `fft.rs` Notes of the gpu `fft.cl` implementation. is a version of: http://www.bealto.com/gpu-fft_group-1.html Radix FFT, such as Radix-2 or Radix-4 FFT, is a variation of the Fast Fourier Transform (FFT) algorithm that is optimized for factors of 2 or 4. The Radix-2 FFT, is a Cooley-Tukey FFT algorithm that recursively breaks down a DFT of size N into two smaller DFTs of sizes N/2. It is the most common form of the Cooley-Tukey algorithm and is used when N is a power of two. Similarly, the Radix-4 FFT breaks down a DFT of size N into four smaller DFTs of sizes N/4. It can save a few multiplications compared to Radix-2, but it also has higher complexity in terms of code structure, exception handling, coefficient management, register management, digit-reverse addressing, etc Another variation of the FFT algorithm is the Split-Radix FFT. It's a blend of radices 2 and 4, which means it recursively expresses a DFT of length N in terms of one smaller DFT of length N/2 and two smaller DFTs of length N/4. For a long time, the Split-Radix FFT was believed to achieve the lowest arithmetic operation count for power-of-two sizes, although recent variations achieve an even lower count