## Week 51 - December/EoY wrap up summary. Thoughout Q4 of 2023 I have been stuudying polynomial evaluation and fft within the scope of halo2 and the codebase of the [halo2-gpu-specific](https://github.com/DelphinusLab/halo2-gpu-specific) With the objective of improving the time (reducing) of the polynomial evaluation in proof generation. #### small review: ``` -------------------------------- prover.rs --> pub fn create_proof() | | -------------------------------- evaluation.rs --> evaluate_h() | | for feature = "cuda" -------------------------------- evaluation_gpu.rs |-->_eval_gpu() | ``` halo2 and more generally fft is gone by using The Number-Theoretic Transfer (NTT) which is an algorithm variation of the FFT algorithm that enables efficient conversions between different polynomial representations like coefficient and evaluation forms. These conversions are necessary because arithmetic like multiplications among large polynomials are more amenable in evaluation form since they can be performed in log-linear time. While traditional FFTs are defined over complex numbers, NTTs are defined over finite fields F with a multiplicative subgroup $H$ of order $n = 2k$ . The subgroup $H$ corresponds to the domain consisting of the $n^{th}$ roots of unity $H = {\omega^0 , \omega^1 , ..., \omega^n }$ and are known as the twiddle factors. **Method:** Divide-and-conquer --> the Cooley-Tukey algorithm is the most common, which recursively decomposes the NTT of size N into smaller size $N/2$ by preforming a butterfly operation at every stage. define NTT as two $N$-sized vectors $X$ and $Y$ where: $X_i = \sum^{N-1}_{j=0} Y_j \omega^{ij}_N (mod P)$ where $x$ is the size of the input vector, and $\omega$ are teh twiddle factors (roots of unity). inverse NTT is define as: $X_i = N^{-1} \sum^{N-1}_{j=0} Y_j \omega^{-ij}_N (mod P)$. A simple radix-2 Decimate-In-Time is given by $$ \begin{aligned} X(k) = \sum_{n=0}^{N-1} x(n)W_{N}^{kn} \;\;\;\;\; k =0,1,2,\dots,N-1 \;\;\;\;\;\; W_N = e^{-j2\pi / N} \end{aligned} $$ #### gpu fft: In `halo2-gpu-specific` fft is done using a opencl kernel invoked from within the `evaluation_gpu.rs` source ```rust #[cfg(feature = "cuda")] pub(crate) fn do_extended_fft<F: FieldExt, C: CurveAffine<ScalarExt = F>>( ... let timer = start_timer!(|| "do fft pure"); let domain = &pk.vk.domain; let a = do_fft_pure( program, values, domain.extended_k(), allocator, &helper.pq_buffer, &helper.omegas_buffer, ); end_timer!(timer); } ``` which calls `fft_core` that runs a split `radix_fft` kernel from `ec-gpu-gen`. ```rust #[cfg(feature = "cuda")] pub(crate) fn do_fft_core<F: FieldExt>( ... let mut log_p = 0u32; // Each iteration performs a FFT round while log_p < log_n { // 1=>radix2, 2=>radix4, 3=>radix8, ... let deg = cmp::min(max_deg, log_n - log_p); let n = 1u32 << log_n; let local_work_size = 1 << cmp::min(deg - 1, MAX_LOG2_LOCAL_WORK_SIZE); let global_work_size = n >> deg; let kernel_name = format!("{}_radix_fft", "Bn256_Fr"); let kernel = program.create_kernel( &kernel_name, global_work_size as usize, local_work_size as usize, )?; kernel .arg(&src_buffer) .arg(&dst_buffer) .arg(pq_buffer) .arg(omegas_buffer) .arg(&LocalBuffer::<F>::new(1 << deg)) .arg(&n) .arg(&log_p) .arg(&deg) .arg(&max_deg) .run()?; log_p += deg; std::mem::swap(&mut src_buffer, &mut dst_buffer); } ... Ok(src_buffer) } ``` ### Indicator benchmarks for fft time (current situation): Some basic becnes for a simple test circuit from zkWasm and also a more complex circuit from zkWasm-host-circuits give an indication of how much time is spend in polynomial evaluation, specifically fft. Benchmark for bn254 circuit from host-circuits reveals that approximately 25% of time is spent in `h_poly` eval and in time approxiamtely 14% is spent on the pure fft. ``` Total s for expressions gpu eval : 1344.000ms Total ms for permutations : 3523.000ms Total ms for eval_h_lookups : 1631.000ms Total ms for h_poly : 7316.000ms Total ms for multi_open : 2600.000ms Total s for create_proof : 29.622s Total evals (distribute_powers_zeta) : 124 Total evals (eval_fft_prepare) : 124 Total evals : 124 Poly evals: Total µs for write_from_buffer : 1745.116 --> 1.745116ms average time for write_from_buffer: 14.074µs Total µs for distribute_powers_zeta : 37.751 --> 0.037751ms average time for distribute_powers_zeta: 0.304µs Total µs for eval_fft_prepare : 124.083 --> 0.124083ms average time for eval_fft_prepare: 1.001µs Total ms for fft main 20 : 0.000 average time for fft_main: 0.000000ms Total ms for do fft pure : 2321.861 average time for fft_pure: 18.724685ms Total ms for gpu eval unit : 4237.073 --> 4.237073s average time for gpu_eval_unit: 34.170m Percentages: |--> advice : 24.880 |--> lookups 30/4 : 5.236 |--> lookups commit product : 6.634 |--> lookups add blinding value : 0.000 |--> lookups msm and fft : 3.987 |--> permutation commit : 11.046 |--> vanishing commit : 1.214 | |--> h_poly : 24.698 |--> lagrange_to_coeff_st : 2.757 |--> expressions_gpu_eval : 4.537 |--> permutations : 11.893 |--> eval_h_lookups : 5.506 | |--> vanishing construct : 3.393 |--> eval poly : 8.436 |--> eval poly vanishing : 0.000 |--> eval poly permutation : 0.000 |--> eval poly lookups : 0.000 | |--> multi_open : 8.777 % time in gpu_eval_unit : 14.304 Total: 98.301 ``` ![host_circuits_bn254](https://hackmd.io/_uploads/HJjpjcGD6.png) The raw numbers for the circuit as a percentage of time is quite different in comparison to a very basic circuit (posted in week 47), where approximately 42% of time was spent in `h_poly` and (for pure fft) `gpu_eval_unit was` 30.729%. ![basicchart](https://hackmd.io/_uploads/Bk-hRqMv6.png) ## WIP: The below can be seen in PR: https://github.com/DelphinusLab/polyeval/pull/5 Generally experiments into fft evaluation via gpu with the use of cuda. Note, this is in comparison to/and will be benched against the current `ec-gpu-gen` lib [rolled by DelphinusLab](https://github.com/lanbones/ec-gpu). The setup is to generate the field `Fp` in `bn256.cu` as the Scalar field from BN254. ```cuda namespace p256 { // BN256 Scalar field // modulus = 21888242871839275222246405745257275088548364400416034343698204186575808495617 // 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001 using Fp = Fp256< /* =N **/ /*u256(*/ 3486998266802970665, 13281191951274694749, 2896914383306846353, 4891460686036598785 /*)*/, /* =R_SQUARED **/ /*u256(*/ 1011752739694698287, 7381016538464732718, 3962172157175319849, 12436184717236109307 /*)*/, /* =N_PRIME **/ /*u256(*/ 0, 0, 0, 14042775128853446655 /*)*/ >; } // namespace p256 ``` constants taken from `halo2curves` BN256 scalar: https://github.com/privacy-scaling-explorations/halo2curves/blob/main/src/bn256/fr.rs Calcualtion of 2-adic-root-of-unity done with sage ```sage modulus = 21888242871839275222246405745257275088548364400416034343698204186575808495617 assert(modulus.is_prime()) Fp = GF(modulus) generator = Fp(0); for i in range(0, 20): i = Fp(i); neg_i = Fp(-i) if not(i.is_primitive_root() or neg_i.is_primitive_root()): continue elif i.is_primitive_root(): assert(i.is_primitive_root()); print("Generator: %d" % i) generator = i break else: assert(neg_i.is_primitive_root()); print("Generator: %d" % neg_i) generator = neg_i break two_adicity = valuation(modulus - 1, 2); trace = (modulus - 1) / 2**two_adicity; two_adic_root_of_unity = generator^trace print("2-adicity: %d " % two_adicity) print("2-adic Root of Unity: %d " % two_adic_root_of_unity) ``` ``` Generator: 5 2-adicity: 28 2-adic Root of Unity: 19103219067921713944291392827692070036145651957329286315305642004821462161904 ``` Which is included in the trait for `Fftield` in the PR. This is usefull to check that roots can be calulated from the scalar field constants `Self::ROOT_OF_UNITY` and `Self::S` **OR** using the values for `two_adic_root_of_unity` and `2-adicity` (above), using the formula below: $$ \textrm{root} = \textrm{two_adic_primitive_root_of_unity}^{(\textrm{log_power + 1})} $$ These two sets of constants should give the same value `F::ONE` when $$ 1 = \textrm{root}^{1 << n} $$ or ```rust root.pow([1 << n]); ``` where $n$ is the order. This checks out. #### Running the PR: The cuda `.ptx` can be built and tests can be run by ```console /polyeval$ make build-cuda-bn256 nvcc -ptx src/cuda/field_bn256/bn256.cu -o src/cuda/field_bn256/bn256.ptx /polyeval$ cargo test --package poly-optimizer ``` #### geenrate twiddles: Comparing the cpu calculation of the twiddle factors used in fft, using the `root : 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b`, calcualte $\omega^k$ powers up to $k = 8$. via cpu: ``` order = 4 1 << order = 16 1 << order / 2 = 8 log_power = 24 root : 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b up_to: 8 twiddles len : 8 twiddle[0] : 0x0000000000000000000000000000000000000000000000000000000000000001 twiddle[1] : 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b twiddle[2] : 0x2b337de1c8c14f22ec9b9e2f96afef3652627366f8170a0a948dad4ac1bd5e80 twiddle[3] : 0x107aab49e65a67f9da9cd2abf78be38bd9dc1d5db39f81de36bcfa5b4b039043 twiddle[4] : 0x30644e72e131a029048b6e193fd841045cea24f6fd736bec231204708f703636 twiddle[5] : 0x2290ee31c482cf92b79b1944db1c0147635e9004db8c3b9d13644bef31ec3bd3 twiddle[6] : 0x1d59376149b959ccbd157ac850893a6f07c2d99b3852513ab8d01be8e846a566 twiddle[7] : 0x2d8040c3a09c49698c53bfcb514d55a5b39e9b17cb093d128b8783adb8cbd723 ``` via gpu: ``` order = 4 S = 28 ROOT_OF_UNITY = 0x03ddb9f5166d18b798865ea93dd31f743215cf6dd39329c8d34f1ed960c37c9c log_power = 24 root = 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b root : FieldElement { value: 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b } gpu twiddles output len : 8 twiddle[00] : FieldElement { value: 0x1b4ad2dc0898b49d5135a186eaf032da17cfeb66b26e268a686badc0b6a7c4d3 } twiddle[01] : FieldElement { value: 0x2d7b771fb39bc32f1500bde3969809784ac51699e12c80f6ae112e24256b9d8b } twiddle[02] : FieldElement { value: 0x2229514fd6249489969a80684015e44d12171bc564aeafa582e4d4720af4005f } twiddle[03] : FieldElement { value: 0x116dea3e9d818f9eedf47ed7db0cf58a82e5ca63f3c9b933d935d278d683887e } twiddle[04] : FieldElement { value: 0x1d353dfabdf3d4ed085832ff8ab5e5f3eed74e133b25ce519b667e0037a30d76 } twiddle[05] : FieldElement { value: 0x0f5e4c425e199ec2116a72d41196280cad3a5b867592000eca7e25558c55bcfc } twiddle[06] : FieldElement { value: 0x0325aa7a5029aa5606e966d79fad1c6bd59a7a5df19609fb1c4d47c290f6503f } twiddle[07] : FieldElement { value: 0x1bb7ad4b76831af26fad33715b026d607b51e5093ff880c90881ff5cc2075ebb } ``` The results of the twiddle calculation between cpu and gpu are not consistnent which leads to a simple sanity check below. #### sanity check: Simple sanity check on gpu arithmetic, cpu calcualte multiplication of $\omega * \omega$ gives: ``` omega = 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b omega * omega = 0x2b337de1c8c14f22ec9b9e2f96afef3652627366f8170a0a948dad4ac1bd5e80 ``` gpu calculation gives the below. ``` order = 4 S = 28 ROOT_OF_UNITY = 0x03ddb9f5166d18b798865ea93dd31f743215cf6dd39329c8d34f1ed960c37c9c log_power = 24 root = 0x21082ca216cbbf4e1c6e4f4594dd508c996dfbe1174efb98b11509c6e306460b sm[00] : FieldElement { value: 0x156738b403007e389bc60d78d65b8cd01895d78c1ed40c532f12c51ad92bb602 } ``` Its easy to see these values are not consistent. The assumption here is that the gpu evaluation/arithmetic is done and the results returned in Montgomery form. ### Direction for Janurary: * rework/implement Montgomery backend for gpu evaluation and produce the correct results for gpu kernels. * implement the radix2/8 algorithm (from Week48) in cuda, and or other fast fft algorithms, bench against the current fft algorithms. * incorporate into halo2-gpu-specific. ----------------------------- ##### radix-2/8 from Week 48. radix-2/8 algorithm for length-$2^m$. The radix-2/8 split-radix FFT algorithm has the same asymptotic arithmetic complexity as the conventional split-radix FFT algorithm, but with the advantage of fewer loads and stores. The idea of the extended split-radix FFT is the application of a radix-2 index map to the even-indexed tersm and a radix-8 index to the odd-indexed tersm. i.e., its the combination of one half-length and four eigth-length DFTs. If $N = 2^m$, then for the even index tersm $$ \begin{aligned} X(2k) = \sum_{n=0}^{(N/2) -1} \big[ x(n) + x(n + N/2) \big] W_{N}^{2kn} \end{aligned} $$ and for the odd indexed terms: $$ \begin{aligned} X(8k+1) = \sum_{n=0}^{(N/8) -1} \big[ ( (x(n) - x(n + N/2)) - (j(x(n+N/4) - x(n+3N/4)))) \\+ \frac{1}{\sqrt{2}} \bigl\{ (1-j) (x(n + N/8) - x(n + 5N/8))\bigl\} \\ - (1-j)(x(n + 3N/8) - x(n + 7N/8)) \big] W_N^n W_N^{8nk} \end{aligned} $$ ----- $$ \begin{aligned} X(8k+3) = = \sum_{n=0}^{(N/8) -1} \big[ ( (x(n) - x(n + N/2)) + (j(x(n+N/4) - x(n+3N/4)))) \\+ \frac{1}{\sqrt{2}} \bigl\{ (1+j) (x(n + N/8) - x(n + 5N/8))\bigl\} \\ - (1+j)(x(n + 3N/8) - x(n + 7N/8)) \big] W_N^3n W_N^{8nk} \end{aligned} $$ ----- $$ \begin{aligned} X(8k+5) = \sum_{n=0}^{(N/8) -1} \big[ ( (x(n) - x(n + N/2)) - (j(x(n+N/4) - x(n+3N/4)))) \\ - \frac{1}{\sqrt{2}} (1-j) (x(n + N/8) - x(n + 5N/8)) \\ - (1+j)(x(n + 3N/8) - x(n + 7N/8)) \big] W_N^5n W_N^{8nk} \end{aligned} $$ ----- $$ \begin{aligned} X(8k+7) = \sum_{n=0}^{(N/8) -1} \big[ ( (x(n) - x(n + N/2)) + (j(x(n+N/4) - x(n+3N/4)))) \\ + \frac{1}{\sqrt{2}} (1+j) (x(n + N/8) - x(n + 5N/8)) \\ - (1-j)(x(n + 3N/8) - x(n + 7N/8)) \big] W_N^7n W_N^{8nk} \end{aligned} $$ ----------------------- #### BN256 constants: ```rust /* Modulus: Limb 1: 30644E72E131A029(hex) --> 3486998266802970665(dec) Limb 2: B85045B68181585D(hex) --> 13281191951274694749(dec) Limb 3: 2833E84879B97091(hex) --> 2896914383306846353(dec) Limb 4: 43E1F593F0000001(hex) --> 4891460686036598785(dec) 2-adicity: Limb 1: 0000000000000000(hex) --> 0(dec) Limb 2: 0000000000000000(hex) --> 0(dec) Limb 3: 0000000000000000(hex) --> 0(dec) Limb 4: 000000000000001C(hex) --> 28(dec) 2-adic root of unity: Limb 1: 2A3C09F0A58A7E85(hex) --> 3043318377369730693(dec) Limb 2: 00E0A7EB8EF62ABC(hex) --> 63235024940837564(dec) Limb 3: 402D111E41112ED4(hex) --> 4624371214017703636(dec) Limb 4: 9BD61B6E725B19F0(hex) --> 11229192882073836016(dec) const TWO_ADICITY: u64 = 28; */ ```