## Week 51 - December/EoY wrap up summary.
Thoughout Q4 of 2023 I have been stuudying polynomial evaluation and fft within the scope of halo2 and the codebase of the [halo2-gpu-specific](https://github.com/DelphinusLab/halo2-gpu-specific)
With the objective of improving the time (reducing) of the polynomial evaluation in proof generation.
#### small review:
```
-------------------------------- prover.rs
--> pub fn create_proof()
|
|
-------------------------------- evaluation.rs
--> evaluate_h()
|
| for feature = "cuda"
-------------------------------- evaluation_gpu.rs
|-->_eval_gpu()
|
```
halo2 and more generally fft is gone by using The Number-Theoretic Transfer (NTT) which is an algorithm variation of the FFT algorithm that enables efficient conversions between different polynomial representations like coefficient and evaluation forms.
These conversions are necessary because arithmetic like multiplications among large polynomials are more amenable in evaluation form since they can be performed in log-linear time. While traditional FFTs are defined over complex numbers, NTTs are defined over finite fields F with a multiplicative subgroup $H$ of order $n = 2k$ . The subgroup $H$ corresponds to the domain consisting of the $n^{th}$ roots of unity $H = {\omega^0 , \omega^1 , ..., \omega^n }$ and are known as the twiddle factors.
**Method:**
Divide-and-conquer --> the Cooley-Tukey algorithm is the most common, which recursively decomposes the NTT of size N into smaller size $N/2$ by preforming a butterfly operation at every stage.
define NTT as two $N$-sized vectors $X$ and $Y$ where:
$X_i = \sum^{N-1}_{j=0} Y_j \omega^{ij}_N (mod P)$
where $x$ is the size of the input vector, and $\omega$ are teh twiddle factors (roots of unity).
inverse NTT is define as:
$X_i = N^{-1} \sum^{N-1}_{j=0} Y_j \omega^{-ij}_N (mod P)$.
A simple radix-2 Decimate-In-Time is given by
$$
\begin{aligned}
X(k) = \sum_{n=0}^{N-1} x(n)W_{N}^{kn} \;\;\;\;\; k =0,1,2,\dots,N-1 \;\;\;\;\;\; W_N = e^{-j2\pi / N}
\end{aligned}
$$
#### gpu fft:
In `halo2-gpu-specific` fft is done using a opencl kernel invoked from within the `evaluation_gpu.rs` source
```rust
#[cfg(feature = "cuda")]
pub(crate) fn do_extended_fft<F: FieldExt, C: CurveAffine<ScalarExt = F>>(
...
let timer = start_timer!(|| "do fft pure");
let domain = &pk.vk.domain;
let a = do_fft_pure(
program,
values,
domain.extended_k(),
allocator,
&helper.pq_buffer,
&helper.omegas_buffer,
);
end_timer!(timer);
}
```
which calls `fft_core` that runs a split `radix_fft` kernel from `ec-gpu-gen`.
```rust
#[cfg(feature = "cuda")]
pub(crate) fn do_fft_core<F: FieldExt>(
...
let mut log_p = 0u32;
// Each iteration performs a FFT round
while log_p < log_n {
// 1=>radix2, 2=>radix4, 3=>radix8, ...
let deg = cmp::min(max_deg, log_n - log_p);
let n = 1u32 << log_n;
let local_work_size = 1 << cmp::min(deg - 1, MAX_LOG2_LOCAL_WORK_SIZE);
let global_work_size = n >> deg;
let kernel_name = format!("{}_radix_fft", "Bn256_Fr");
let kernel = program.create_kernel(
&kernel_name,
global_work_size as usize,
local_work_size as usize,
)?;
kernel
.arg(&src_buffer)
.arg(&dst_buffer)
.arg(pq_buffer)
.arg(omegas_buffer)
.arg(&LocalBuffer::<F>::new(1 << deg))
.arg(&n)
.arg(&log_p)
.arg(°)
.arg(&max_deg)
.run()?;
log_p += deg;
std::mem::swap(&mut src_buffer, &mut dst_buffer);
}
...
Ok(src_buffer)
}
```
### Indicator benchmarks for fft time (current situation):
Some basic becnes for a simple test circuit from zkWasm and also a more complex circuit from zkWasm-host-circuits give an indication of how much time is spend in polynomial evaluation, specifically fft.
Benchmark for bn254 circuit from host-circuits reveals that approximately 25% of time is spent in `h_poly` eval and in time approxiamtely 14% is spent on the pure fft.
```
Total s for expressions gpu eval : 1344.000ms
Total ms for permutations : 3523.000ms
Total ms for eval_h_lookups : 1631.000ms
Total ms for h_poly : 7316.000ms
Total ms for multi_open : 2600.000ms
Total s for create_proof : 29.622s
Total evals (distribute_powers_zeta) : 124
Total evals (eval_fft_prepare) : 124
Total evals : 124
Poly evals:
Total µs for write_from_buffer : 1745.116 --> 1.745116ms
average time for write_from_buffer: 14.074µs
Total µs for distribute_powers_zeta : 37.751 --> 0.037751ms
average time for distribute_powers_zeta: 0.304µs
Total µs for eval_fft_prepare : 124.083 --> 0.124083ms
average time for eval_fft_prepare: 1.001µs
Total ms for fft main 20 : 0.000
average time for fft_main: 0.000000ms
Total ms for do fft pure : 2321.861
average time for fft_pure: 18.724685ms
Total ms for gpu eval unit : 4237.073 --> 4.237073s
average time for gpu_eval_unit: 34.170m
Percentages:
|--> advice : 24.880
|--> lookups 30/4 : 5.236
|--> lookups commit product : 6.634
|--> lookups add blinding value : 0.000
|--> lookups msm and fft : 3.987
|--> permutation commit : 11.046
|--> vanishing commit : 1.214
|
|--> h_poly : 24.698
|--> lagrange_to_coeff_st : 2.757
|--> expressions_gpu_eval : 4.537
|--> permutations : 11.893
|--> eval_h_lookups : 5.506
|
|--> vanishing construct : 3.393
|--> eval poly : 8.436
|--> eval poly vanishing : 0.000
|--> eval poly permutation : 0.000
|--> eval poly lookups : 0.000
|
|--> multi_open : 8.777
% time in gpu_eval_unit : 14.304
Total: 98.301
```

The raw numbers for the circuit as a percentage of time is quite different in comparison to a very basic circuit (posted in week 47), where approximately 42% of time was spent in `h_poly` and (for pure fft) `gpu_eval_unit was` 30.729%.

## WIP:
The below can be seen in PR: https://github.com/DelphinusLab/polyeval/pull/5
Generally experiments into fft evaluation via gpu with the use of cuda. Note, this is in comparison to/and will be benched against the current `ec-gpu-gen` lib [rolled by DelphinusLab](https://github.com/lanbones/ec-gpu).
The setup is to generate the field `Fp` in `bn256.cu` as the Scalar field from BN254.
```cuda
namespace p256
{
// BN256 Scalar field
// modulus = 21888242871839275222246405745257275088548364400416034343698204186575808495617
// 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001
using Fp = Fp256<
/* =N **/ /*u256(*/ 3486998266802970665, 13281191951274694749, 2896914383306846353, 4891460686036598785 /*)*/,
/* =R_SQUARED **/ /*u256(*/ 1011752739694698287, 7381016538464732718,
3962172157175319849, 12436184717236109307 /*)*/,
/* =N_PRIME **/ /*u256(*/ 0, 0,
0, 14042775128853446655 /*)*/
>;
} // namespace p256
```
constants taken from `halo2curves` BN256 scalar: https://github.com/privacy-scaling-explorations/halo2curves/blob/main/src/bn256/fr.rs
Calcualtion of 2-adic-root-of-unity done with sage
```sage
modulus = 21888242871839275222246405745257275088548364400416034343698204186575808495617
assert(modulus.is_prime())
Fp = GF(modulus)
generator = Fp(0);
for i in range(0, 20):
i = Fp(i);
neg_i = Fp(-i)
if not(i.is_primitive_root() or neg_i.is_primitive_root()):
continue
elif i.is_primitive_root():
assert(i.is_primitive_root());
print("Generator: %d" % i)
generator = i
break
else:
assert(neg_i.is_primitive_root());
print("Generator: %d" % neg_i)
generator = neg_i
break
two_adicity = valuation(modulus - 1, 2);
trace = (modulus - 1) / 2**two_adicity;
two_adic_root_of_unity = generator^trace
print("2-adicity: %d " % two_adicity)
print("2-adic Root of Unity: %d " % two_adic_root_of_unity)
```
```
Generator: 5
2-adicity: 28
2-adic Root of Unity: 19103219067921713944291392827692070036145651957329286315305642004821462161904
```
Which is included in the trait for `Fftield` in the PR. This is usefull to check that roots can be calulated from the scalar field constants `Self::ROOT_OF_UNITY` and `Self::S` **OR** using the values for `two_adic_root_of_unity` and `2-adicity` (above), using the formula below:
$$
\textrm{root} = \textrm{two_adic_primitive_root_of_unity}^{(\textrm{log_power + 1})}
$$
These two sets of constants should give the same value `F::ONE` when
$$
1 = \textrm{root}^{1 << n}
$$
or
```rust
root.pow([1 << n]);
```
where $n$ is the order. This checks out.
#### Running the PR:
The cuda `.ptx` can be built and tests can be run by
```console
/polyeval$ make build-cuda-bn256
nvcc -ptx src/cuda/field_bn256/bn256.cu -o src/cuda/field_bn256/bn256.ptx
/polyeval$ cargo test --package poly-optimizer
```
#### geenrate twiddles:
Comparing the cpu calculation of the twiddle factors used in fft, using the `root : 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b`, calcualte $\omega^k$ powers up to $k = 8$.
via cpu:
```
order = 4
1 << order = 16
1 << order / 2 = 8
log_power = 24
root : 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b
up_to: 8
twiddles len : 8
twiddle[0] : 0x0000000000000000000000000000000000000000000000000000000000000001
twiddle[1] : 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b
twiddle[2] : 0x2b337de1c8c14f22ec9b9e2f96afef3652627366f8170a0a948dad4ac1bd5e80
twiddle[3] : 0x107aab49e65a67f9da9cd2abf78be38bd9dc1d5db39f81de36bcfa5b4b039043
twiddle[4] : 0x30644e72e131a029048b6e193fd841045cea24f6fd736bec231204708f703636
twiddle[5] : 0x2290ee31c482cf92b79b1944db1c0147635e9004db8c3b9d13644bef31ec3bd3
twiddle[6] : 0x1d59376149b959ccbd157ac850893a6f07c2d99b3852513ab8d01be8e846a566
twiddle[7] : 0x2d8040c3a09c49698c53bfcb514d55a5b39e9b17cb093d128b8783adb8cbd723
```
via gpu:
```
order = 4
S = 28
ROOT_OF_UNITY = 0x03ddb9f5166d18b798865ea93dd31f743215cf6dd39329c8d34f1ed960c37c9c
log_power = 24
root = 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b
root : FieldElement { value: 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b }
gpu twiddles output len : 8
twiddle[00] : FieldElement { value: 0x1b4ad2dc0898b49d5135a186eaf032da17cfeb66b26e268a686badc0b6a7c4d3 }
twiddle[01] : FieldElement { value: 0x2d7b771fb39bc32f1500bde3969809784ac51699e12c80f6ae112e24256b9d8b }
twiddle[02] : FieldElement { value: 0x2229514fd6249489969a80684015e44d12171bc564aeafa582e4d4720af4005f }
twiddle[03] : FieldElement { value: 0x116dea3e9d818f9eedf47ed7db0cf58a82e5ca63f3c9b933d935d278d683887e }
twiddle[04] : FieldElement { value: 0x1d353dfabdf3d4ed085832ff8ab5e5f3eed74e133b25ce519b667e0037a30d76 }
twiddle[05] : FieldElement { value: 0x0f5e4c425e199ec2116a72d41196280cad3a5b867592000eca7e25558c55bcfc }
twiddle[06] : FieldElement { value: 0x0325aa7a5029aa5606e966d79fad1c6bd59a7a5df19609fb1c4d47c290f6503f }
twiddle[07] : FieldElement { value: 0x1bb7ad4b76831af26fad33715b026d607b51e5093ff880c90881ff5cc2075ebb }
```
The results of the twiddle calculation between cpu and gpu are not consistnent which leads to a simple sanity check below.
#### sanity check:
Simple sanity check on gpu arithmetic,
cpu calcualte multiplication of $\omega * \omega$ gives:
```
omega = 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b
omega * omega = 0x2b337de1c8c14f22ec9b9e2f96afef3652627366f8170a0a948dad4ac1bd5e80
```
gpu calculation gives the below.
```
order = 4
S = 28
ROOT_OF_UNITY = 0x03ddb9f5166d18b798865ea93dd31f743215cf6dd39329c8d34f1ed960c37c9c
log_power = 24
root = 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b
sm[00] : FieldElement { value: 0x156738b403007e389bc60d78d65b8cd01895d78c1ed40c532f12c51ad92bb602 }
```
Its easy to see these values are not consistent. The assumption here is that the gpu evaluation/arithmetic is done and the results returned in Montgomery form.
### Direction for Janurary:
* rework/implement Montgomery backend for gpu evaluation and produce the correct results for gpu kernels.
* implement the radix2/8 algorithm (from Week48) in cuda, and or other fast fft algorithms, bench against the current fft algorithms.
* incorporate into halo2-gpu-specific.
-----------------------------
##### radix-2/8 from Week 48.
radix-2/8 algorithm for length-$2^m$. The radix-2/8 split-radix FFT algorithm has the same asymptotic arithmetic complexity as the conventional split-radix FFT algorithm, but with the advantage of fewer loads and stores.
The idea of the extended split-radix FFT is the application of a radix-2 index map to the even-indexed tersm and a radix-8 index to the odd-indexed tersm. i.e., its the combination of one half-length and four eigth-length DFTs.
If $N = 2^m$, then for the even index tersm
$$
\begin{aligned}
X(2k) = \sum_{n=0}^{(N/2) -1} \big[ x(n) + x(n + N/2) \big] W_{N}^{2kn}
\end{aligned}
$$
and for the odd indexed terms:
$$
\begin{aligned}
X(8k+1) = \sum_{n=0}^{(N/8) -1} \big[ ( (x(n) - x(n + N/2)) - (j(x(n+N/4) - x(n+3N/4)))) \\+ \frac{1}{\sqrt{2}} \bigl\{ (1-j) (x(n + N/8) - x(n + 5N/8))\bigl\} \\ - (1-j)(x(n + 3N/8) - x(n + 7N/8)) \big] W_N^n W_N^{8nk}
\end{aligned}
$$
-----
$$
\begin{aligned}
X(8k+3) = = \sum_{n=0}^{(N/8) -1} \big[ ( (x(n) - x(n + N/2)) + (j(x(n+N/4) - x(n+3N/4)))) \\+ \frac{1}{\sqrt{2}} \bigl\{ (1+j) (x(n + N/8) - x(n + 5N/8))\bigl\} \\ - (1+j)(x(n + 3N/8) - x(n + 7N/8)) \big] W_N^3n W_N^{8nk}
\end{aligned}
$$
-----
$$
\begin{aligned}
X(8k+5) = \sum_{n=0}^{(N/8) -1} \big[ ( (x(n) - x(n + N/2)) - (j(x(n+N/4) - x(n+3N/4)))) \\ - \frac{1}{\sqrt{2}} (1-j) (x(n + N/8) - x(n + 5N/8)) \\ - (1+j)(x(n + 3N/8) - x(n + 7N/8)) \big] W_N^5n W_N^{8nk}
\end{aligned}
$$
-----
$$
\begin{aligned}
X(8k+7) = \sum_{n=0}^{(N/8) -1} \big[ ( (x(n) - x(n + N/2)) + (j(x(n+N/4) - x(n+3N/4)))) \\ + \frac{1}{\sqrt{2}} (1+j) (x(n + N/8) - x(n + 5N/8)) \\ - (1-j)(x(n + 3N/8) - x(n + 7N/8)) \big] W_N^7n W_N^{8nk}
\end{aligned}
$$
-----------------------
#### BN256 constants:
```rust
/*
Modulus:
Limb 1: 30644E72E131A029(hex) --> 3486998266802970665(dec)
Limb 2: B85045B68181585D(hex) --> 13281191951274694749(dec)
Limb 3: 2833E84879B97091(hex) --> 2896914383306846353(dec)
Limb 4: 43E1F593F0000001(hex) --> 4891460686036598785(dec)
2-adicity:
Limb 1: 0000000000000000(hex) --> 0(dec)
Limb 2: 0000000000000000(hex) --> 0(dec)
Limb 3: 0000000000000000(hex) --> 0(dec)
Limb 4: 000000000000001C(hex) --> 28(dec)
2-adic root of unity:
Limb 1: 2A3C09F0A58A7E85(hex) --> 3043318377369730693(dec)
Limb 2: 00E0A7EB8EF62ABC(hex) --> 63235024940837564(dec)
Limb 3: 402D111E41112ED4(hex) --> 4624371214017703636(dec)
Limb 4: 9BD61B6E725B19F0(hex) --> 11229192882073836016(dec)
const TWO_ADICITY: u64 = 28;
*/
```