## Week 47 Notes/Report
Still on poly eval in gpu. working on benching (continued work next week), working on the split-radix fft algo and any adjustments to it. basic bench on a simple circuit produces the number of 30% of time spent in `fft main` --> which is the `format!("{}_radix_fft", "Bn256_Fr"` kernel.
General notes and basic bench below. This is heavily being changed and worked on in the next week.
### fft in `evaluation_gpu.rs`
```
in
--> pub(crate) fn do_extended_fft<F: FieldExt, C: CurveAffine<ScalarExt = F>>(...)
|
| setup buffers and values:
|--> writes origin_values: &Polynomial<F, Coeff>, to gpu buffer.
|--> distribute powers zeta
|--> eval fft prepare
|--> call do_fft_pure() which is entry to below
Both
--> pub(crate) fn do_fft_pure<F: FieldExt>(...)
--> pub(crate) fn _do_ifft<F: FieldExt>(...) // not actually used in evaluation_gpu.rs
call:
-----
|--> pub(crate) fn do_fft_core<F: FieldExt>(...)
| which uses the format!("{}_radix_fft kernel
|
```
So for `do_extended_fft`
$$
\begin{aligned}
\sum_{j=0}^m (\textrm{write_from_buffer}) + (\textrm{distribute powers zeta}) + (\textrm{_eval_fft_prepare}) + (\textrm{fft main(log_n)})
\end{aligned}
$$
where fft main(log_n) is the function that runs the kernel `_radix_fft` within `do_fft_core()`
#### Bench basic circuit:
Components:
```
Total s for expressions gpu eval : 1642.000ms
Total ms for permutations : 183.264ms
Total ms for eval_h_lookups : 573.290ms
Total ms for h_poly : 2640.000ms
Total ms for multi_open : 539.771ms
Total s for create_proof : 6.185s
Total evals (distribute_powers_zeta) : 729
Total evals (eval_fft_prepare) : 729
Total evals : 729
Poly evals:
Total µs for write_from_buffer : 546.075 --> 0.546075ms
average time for write_from_buffer: 0.749µs
Total µs for distribute_powers_zeta : 17.698 --> 0.017698ms
average time for distribute_powers_zeta: 0.024µs
Total µs for eval_fft_prepare : 38.994 --> 0.038994ms
average time for eval_fft_prepare: 0.053µs
Total ms for fft main 20 : 1259.488
average time for fft_main: 1.727693ms
Total ms for do fft pure : 1277.932
average time for fft_pure: 1.752993ms
Total ms for gpu eval unit : 1900.563 --> 1.900563s
average time for gpu_eval_unit: 2.607ms
Percentages:
|--> advice : 4.228
|--> lookups 30 : 9.016
|--> lookups commit product : 7.393
|--> lookups add blinding value : 0.012
|--> lookups msm and fft : 13.426
|--> permutation commit : 1.522
|--> vanishing commit : 0.520
|
|--> h_poly : 42.684
|--> lagrange_to_coeff_st : 3.752
|--> expressions_gpu_eval : 26.548
|--> permutations : 2.963
|--> eval_h_lookups : 9.269
|
|--> vanishing construct : 1.540
|--> eval poly : 8.042
|--> eval poly vanishing : 0.000 -- fix missing
|--> eval poly permutation : 0.000 -- fix missing
|--> eval poly lookups : 0.000 -- fix missing
|
|--> multi_open : 8.727
% time in gpu_eval_unit : 30.729
Total: 97.109
```
note: need to fix missing the values for eval poly in late stages of proof

--------------------------------------------------------
### prove notes: WIP
```
--------------------------------------- prover.rs
--> create_proof_from_witnes();
|
| into L1148
|
| let h_poly = pk.ev.evaluate_h(...
|
--------------------------------------- evaluation.rs
--> pub(in crate::plonk) fn evaluate_h( )
|
|
--------------------------------------- evaluation_gpu.rs
|--> pub(crate) fn eval_gpu<C: CurveAffine<ScalarExt = F>>( ...)
|
-----------------
|--> setup closure that:
| precalculates omega(s), cosets,
|
-----------------
|-->
|
|
|
```
```rust!
let h_poly: Polynomial<<<C as CurveAffine>::CurveExt as CurveExt>::ScalarExt, ExtendedLagrangeCoeff>
```
```rust!
x.eval_gpu(group_idx, pk, &unit_stat, &advice_poly[0], &instance_poly[0], y))
//in the below:
let mut values = pk
.ev
.gpu_gates_expr
.par_iter()
.enumerate()
.map(|(group_idx, x)| x.eval_gpu(group_idx, pk, &unit_stat, &advice_poly[0], &instance_poly[0], y))
.collect::<Vec<_>>()
.into_iter()
.reduce(|acc, x| acc + &x)
.unwrap();
```
#### Note 3:
L615(ish)
```rust!
let mut helper = gen_do_extended_fft(pk, program)?;
let values_buf = self._eval_gpu(
pk,
program,
memory_cache,
advice,
instance,
&mut ys,
&mut unit_cache,
&mut LinkedList::new(),
&mut helper,
)?;
```
-----------------------------------------------------------------------------------------------
## `prover.rs`
```rust
// Obtain challenge for keeping all separate gates linearly independent
let y: ChallengeY<_> = transcript.squeeze_challenge_scalar();
```
```rust!
// Evaluate the h(X) polynomial
// Evaluate the h(X) polynomial's constraint system expressions for the permutation constraints.
```
//------------------------------------------------
//------------------------------------------------
**From halo2 zcash:**
Evaluate the h(X) polynomial's constraint system expressions for the permutation constraints.
```rust!
// Evaluate the h(X) polynomial's constraint system expressions for the permutation constraints.
let (permutations, permutation_expressions): (Vec<_>, Vec<_>) = permutations
.into_iter()
.zip(advice_cosets.iter())
.zip(instance_cosets.iter())
.map(|((permutation, advice), instance)| {
permutation.construct(
pk,
&pk.vk.cs.permutation,
advice,
&fixed_cosets,
instance,
&permutation_cosets,
l0,
l_blind,
l_last,
beta,
gamma,
)
})
.unzip();
```
//------------------------------------------------
**From halo2 zcash:**
Evaluate the h(X) polynomial's constraint system expressions for the lookup constraints, if any.
```rust!
let (lookups, lookup_expressions): (Vec<Vec<_>>, Vec<Vec<_>>) = lookups ...
```
then feed this into the below
```rust!
let expressions = advice_cosets
.iter()
.zip(instance_cosets.iter())
.zip(permutation_expressions.into_iter())
.zip(lookup_expressions.into_iter())
.flat_map(
|(((advice_cosets, instance_cosets), permutation_expressions), lookup_expressions)| {
...
```
**From GPUmod halo2:**
Evaluate the h(X) polynomial
```rust!
let advice = advice
.into_iter()
.map(|advice| {
let timer = start_timer!(|| "lagrange_to_coeff_st");
let advice_polys: Vec<_> = advice
.into_par_iter()
.map(|poly| domain.lagrange_to_coeff_st(poly))
.collect();
end_timer!(timer);
#[cfg(not(feature = "cuda"))]
let advice_cosets: Vec<_> = advice_polys
.iter()
.map(|poly| domain.coeff_to_extended(poly.clone()))
.collect();
AdviceSingle::<C> {
advice_polys,
#[cfg(not(feature = "cuda"))]
advice_cosets,
}
})
.collect::<Vec<_>>();
#[cfg(feature = "cuda")]
let h_poly = pk.ev.evaluate_h(
pk,
advice.iter().map(|a| &a.advice_polys).collect(),
instance.iter().map(|i| &i.instance_polys).collect(),
*y,
*beta,
*gamma,
*theta,
&lookups,
&permutations,
);
```
//------------------------------------------------
Construct the vanishing argument's h(X) commitments
```rust!
// Construct the vanishing argument's h(X) commitments
let vanishing = vanishing.construct(
params,
domain,
coset_evaluator,
expressions,
y,
&mut rng,
transcript,
)?;
```
```rust!
let x: ChallengeX<_> = transcript.squeeze_challenge_scalar();
let xn = x.pow(&[params.n as u64, 0, 0, 0]);
```
----------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------
### Notes on `ec-gpu-gen` & the `fft.cl` and `fft.rs`
Notes of the gpu `fft.cl` implementation.
is a version of:
http://www.bealto.com/gpu-fft_group-1.html
Radix FFT, such as Radix-2 or Radix-4 FFT, is a variation of the Fast Fourier Transform (FFT) algorithm that is optimized for factors of 2 or 4. The Radix-2 FFT, is a Cooley-Tukey FFT algorithm that recursively breaks down a DFT of size N into two smaller DFTs of sizes N/2. It is the most common form of the Cooley-Tukey algorithm and is used when N is a power of two.
Similarly, the Radix-4 FFT breaks down a DFT of size N into four smaller DFTs of sizes N/4. It can save a few multiplications compared to Radix-2, but it also has higher complexity in terms of code structure, exception handling, coefficient management, register management, digit-reverse addressing, etc
Another variation of the FFT algorithm is the Split-Radix FFT. It's a blend of radices 2 and 4, which means it recursively expresses a DFT of length N in terms of one smaller DFT of length N/2 and two smaller DFTs of length N/4. For a long time, the Split-Radix FFT was believed to achieve the lowest arithmetic operation count for power-of-two sizes, although recent variations achieve an even lower count