# Tutorial 5 - Iterative NTT # Introduction. Previously, we have covered [The Numeric Theoretic Transform (NTT)](https://hackmd.io/@qozymandias/S1_vxnp5a) where we detailed the recursive implementation of the NTT. In this tutorial, I will endeavour to explain the iterative implementation of the transform and hopefully provide some more insight into the hidden magic of the algorithm by unravelling the steps of the core sub procedure. All the implementations showcased will be using the Rust [Finite field library](https://docs.rs/ff/latest/ff/). If you would like to see the iterative implementation with `u64` types, then you may refer to this [repo](https://github.com/qozymandias/ntt-iterative) where the iterative algorithm uses a "bare bones" finite field with a known modulus and primitive root, $GF(998244353)$. If you would like to recap the recursive NTT, please see [this section of the previous tut](https://hackmd.io/@qozymandias/S1_vxnp5a#Core-FFT-stage-of-NTT-implementation). # Iterative implementation. ## Recap on calculating primitive roots Before we get into this, the one thing to remember from the previous tutorial is how we determine the $n$'th primitive root. Remember, the finite field library doesn't expose the modulus since it defined as traits where the underlying implementation may vary (in our case [BN256](https://asecuritysite.com/bn/) finite field is used). But knowing the modulus is not required because the library already exposes enough to easily calculate the $n$'th primitive root for any $n$ that can fit into an unsigned 64 bit type. In [this section,](https://hackmd.io/bnXn6rSNR4OyyES4vD6oiw?view#Pre-processing-stage-of-NTT-implementation) we detail how the `NTT` constructor calculates the primitive root, but I will also provide a recap here. In our iterative implementation we have to "lift" this logic into its own helper function, `Self::calc_primitive_root(n)`, but more on that later, let's get into the pseudocode for our implementation and once we understand that I will explain how to compute the $n$'th primitive root required for the implementation. ## Pseudocode The following is pseudocode taken from [Cooley-Tukey wiki page](https://en.wikipedia.org/wiki/Cooley–Tukey_FFT_algorithm). However, I have slightly modified it to match the implementation found [here on geeks for geeks](https://www.geeksforgeeks.org/iterative-fast-fourier-transformation-polynomial-multiplication/), as I think it's slight more understandable, the only difference being the two inner loops are swapped. This is known as the iterative "radix-2" FFT algorithm because we are dividing and stepping through half each time, if it were to be split into four segments each step then it would be a "radix-4" version. ### Radix-2 FFT Pseudocode ``` algorithm iterative_fft is input: Array a of n complex values where n is a power of 2. output: Array A the DFT of a. bit_reverse_copy(a, A) n <- a.length for s = 1 to log(n) do m <- 2^s wm <- calc_primitive_root(m) w <- 1 for j = 0 to m/2 - 1 do for k = 0 to n-1 by m do t <- w * A[k + j + m/2] u <- A[k + j] A[k + j] <- u + t A[k + j + m/2] <- u - t w <- w * wm return A ``` As can be seen in the above pseudocode, the core butterfly of the FFT remains the same. The main changes are to the looping which now has to achieve the "divide and conquer" pattern without using the recursion technique. "Divide and conquer" recursive technique works by continually dividing inputs until it reaches the base case or "bottom", this I refer to as the "top to bottom" approach; once the base case is reached each recursive call gets its results and the function calls start unwinding. This is how the "bottom" level computations of the transform are recycled into the "upper" levels computations. The iterative version of the FFT achieves the same thing but instead uses indexing based on a "step" amount. It also relies on the inputs being in their odd and even pairs, which is achieved in the recursive implementation by recursively partitioning inputs by their odd and even indexes. To achieve this in the iterative version, the inputs are sorted by the bit reversed indexes, more on this later. ## Primitive root intuition. The only difference between the FFT (with complex numbers) and the NTT (with finite field) is the primitive root calculation. Let the function `calc_primitive_root(n)` be the function for calculating the $n$'th primitive root. For complex numbers, `calc_primitive_root(n)` would simply be: ``` wn <- e^(2*pi*i/n) ``` For generic finite fields, the function would instead look like this: ``` wn <- power_mod(primitive_root, (MODULUS - 1) / n as u64, MODULUS); ``` Where `power_mod` computes the power with modulo arithmetic, and the primitive root is the `MODULUS`'th primitive root of unity. This approach can be seen in this [repo](https://github.com/qozymandias/ntt-iterative), using the field $GF(998244353)$, where the modulus is a carefully chosen prime, $998244353$, which is a common prime for implementations of NTT because it has a small primitive root, $3$! You can read more about it in [this blog](https://codeforces.com/blog/entry/75326). Now what about primitive root calculation in our Finite Field library, that, as mentioned previously, is not so straightforward, the following function is how we compute it. ```rust fn calc_primitive_root(degree: usize) -> F { // ... let k = degree.ilog2() as u32; let mut omega = F::root_of_unity(); for _ in 0..(F::S - k) { omega = omega.square(); } let primitive_root = omega; // ... primitive_root } ``` Both `F::S` and `k` and exponents of two, `2^F::S` and `2^k` respectively. `2^k` is the size of our input vector and the degree of our polynomial. Let's just think of `2^F::S` and `2^k` as vector lengths because we represent our polynomials as vectors. We can directly access the `2^F::S`'th primitive root of unity because this is exposed by the finite field library as `F::root_of_unity()`. `2^F::S` is our maximum vector length possible, since `F::S` is $28$, it is more than enough. So how do we start with `2^F::S`'th primitive root and compute the `2^k`'th primitive root? We start with `2^F::S`'th root of unity (which is for `2^F::S` vector length, aka number of evaluation points we want for the polynomial from the NTT). The `2^F::S`'th primitive root of unity is `F::root_of_unity()`. Now, consider the loop in `calc_primitive_root()`, moving from `F::S` to `F::S - 1` is simulating halving the vector length of `2^F::S`, think back to recursive FFT implementation, whenever we wanted to get the primitive root of a halved vector, we would square the previous root! The exact same logic is working here, `2^{F::S - 1}` is half of `2^F::S` and therefore the `2^{F::S - 1}`'th primitive root is the square of the `2^F::S`'th which is given by `F::root_of_unity`! We can continue this process to reach any $n$'th primitive root. Therefore, we can get the `2^{k}`'th primitive root by repetitively squaring `F::root_of_unity` `F::S - k` times. ## First approach. ### Forward and Inverse transforms. ```rust fn forward_iterative(&self, f: &Vec<F>, inverse: bool) -> Vec<F> { let n = f.len(); assert!(n.is_power_of_two()); // Reorder is equivalent to recursively separating vector into odds and evens indexes. let mut g = Self::bit_reverse_copy(f); for i in 0..(n.ilog2() as usize) { // Calculate 2^(i+1) let step = 1 << (i + 1); // Calculate step/2 -1 let half_step = step >> 1; let mut base = Self::calc_primitive_root(mh); if inv { base = base.invert().unwrap(); } let mut w = F::one(); for j in 0..half_step { for k in (0..n).step_by(step as usize) { let u = g[k + j]; let t = g[k + j + half_step] * w; g[k + j] = u + t; g[k + j + half_step] = u - t; } w = w * base; } } g } fn inverse_iterative(&self, f: &Vec<F>) -> Vec<F> { let scaler = F::from(f.len() as u64).invert().unwrap(); self.forward_iterative(f, true) .iter() .map(|&x| x * scaler) .collect::<Vec<F>>() } ``` ### Vector bit-reverse copy The "bit-reverse copy" function gets the input into the order that would normally be achieved by recursively splitting the vector into even and odd indexes. This is done with explicit recursion in the recursive FFT, as it's very much baked into the algorithm. For the iterative approach, we instead use the bit-reverse approach. The core idea is to represent the indexes in binary and then reorder the bits to be "backwards", e.g. `b3 b2 b1 b0` becomes `b0 b1 b2 b3`. Turns out that flipping the order of the bits sorts the indexes perfectly into odd and even pairs just like the recursive FFT, for example `0 1 2 3 4 5 6 7` becomes `0 4 2 6 1 5 3 7`. The following function performs a vector copy and reorders the values with a bit reverse technique that leverages bit operations. For more details on how bit twiddles achieve this, please check out this [excellent page](https://www.katjaas.nl/bitreversal/bitreversal.html). ```rust fn bit_reverse_copy(f: &Vec<F>) -> Vec<F> { let n = f.len(); let mut a = f.clone(); let mut rev = vec![0; n]; for i in 0..n { rev[i] = rev[i >> 1] >> 1 | (if i & 1 == 1 { n >> 1 } else { 0 }); if i < rev[i] { a.swap(i, rev[i]); } } a } ``` ### Example, 8-point iterative NTT. Let's work through an example, in coefficient form, consider the vector, where the indexes correspond to the powers of $x$. `input = vec![2, 3, 4, 5, 6, 7, 8, 9];`. The following is the output of the algorithm. #### Forward transform (NTT) output. ``` Pre NTT [0x0000000000000000000000000000000000000000000000000000000000000002, 0x0000000000000000000000000000000000000000000000000000000000000003, 0x0000000000000000000000000000000000000000000000000000000000000004, 0x0000000000000000000000000000000000000000000000000000000000000005, 0x0000000000000000000000000000000000000000000000000000000000000006, 0x0000000000000000000000000000000000000000000000000000000000000007, 0x0000000000000000000000000000000000000000000000000000000000000008, 0x0000000000000000000000000000000000000000000000000000000000000009] Bit reversed input = [0x0000000000000000000000000000000000000000000000000000000000000002, 0x0000000000000000000000000000000000000000000000000000000000000006, 0x0000000000000000000000000000000000000000000000000000000000000004, 0x0000000000000000000000000000000000000000000000000000000000000008, 0x0000000000000000000000000000000000000000000000000000000000000003, 0x0000000000000000000000000000000000000000000000000000000000000007, 0x0000000000000000000000000000000000000000000000000000000000000005, 0x0000000000000000000000000000000000000000000000000000000000000009] ------------------------------------------------------------------------- step = 2, half_step = 1, i = 0, j = 0, k = 0 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[0] = 0x0000000000000000000000000000000000000000000000000000000000000002 (pre bfly) g[1] = 0x0000000000000000000000000000000000000000000000000000000000000006 (pre bfly) g[0] = 0x0000000000000000000000000000000000000000000000000000000000000008 g[1] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd ------------------------------------------------------------------------- step = 2, half_step = 1, i = 0, j = 0, k = 2 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[2] = 0x0000000000000000000000000000000000000000000000000000000000000004 (pre bfly) g[3] = 0x0000000000000000000000000000000000000000000000000000000000000008 (pre bfly) g[2] = 0x000000000000000000000000000000000000000000000000000000000000000c g[3] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd ------------------------------------------------------------------------- step = 2, half_step = 1, i = 0, j = 0, k = 4 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[4] = 0x0000000000000000000000000000000000000000000000000000000000000003 (pre bfly) g[5] = 0x0000000000000000000000000000000000000000000000000000000000000007 (pre bfly) g[4] = 0x000000000000000000000000000000000000000000000000000000000000000a g[5] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd ------------------------------------------------------------------------- step = 2, half_step = 1, i = 0, j = 0, k = 6 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[6] = 0x0000000000000000000000000000000000000000000000000000000000000005 (pre bfly) g[7] = 0x0000000000000000000000000000000000000000000000000000000000000009 (pre bfly) g[6] = 0x000000000000000000000000000000000000000000000000000000000000000e g[7] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd ------------------------------------------------------------------------- step = 4, half_step = 2, i = 1, j = 0, k = 0 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[0] = 0x0000000000000000000000000000000000000000000000000000000000000008 (pre bfly) g[2] = 0x000000000000000000000000000000000000000000000000000000000000000c (pre bfly) g[0] = 0x0000000000000000000000000000000000000000000000000000000000000014 g[2] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd ------------------------------------------------------------------------- step = 4, half_step = 2, i = 1, j = 0, k = 4 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[4] = 0x000000000000000000000000000000000000000000000000000000000000000a (pre bfly) g[6] = 0x000000000000000000000000000000000000000000000000000000000000000e (pre bfly) g[4] = 0x0000000000000000000000000000000000000000000000000000000000000018 g[6] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd ------------------------------------------------------------------------- step = 4, half_step = 2, i = 1, j = 1, k = 0 w = 0x30644e72e131a029048b6e193fd841045cea24f6fd736bec231204708f703636 g[1] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd (pre bfly) g[3] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd (pre bfly) g[1] = 0x0000000000000002cf135e7506a45d632d270d45f1181294833fc48d823f2728 g[3] = 0x30644e72e131a026e93ce7417adcfaf9fb0cdb0288a15dfcc0a231066dc0d8d1 ------------------------------------------------------------------------- step = 4, half_step = 2, i = 1, j = 1, k = 4 w = 0x30644e72e131a029048b6e193fd841045cea24f6fd736bec231204708f703636 g[5] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd (pre bfly) g[7] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd (pre bfly) g[5] = 0x0000000000000002cf135e7506a45d632d270d45f1181294833fc48d823f2728 g[7] = 0x30644e72e131a026e93ce7417adcfaf9fb0cdb0288a15dfcc0a231066dc0d8d1 ------------------------------------------------------------------------- step = 8, half_step = 4, i = 2, j = 0, k = 0 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[0] = 0x0000000000000000000000000000000000000000000000000000000000000014 (pre bfly) g[4] = 0x0000000000000000000000000000000000000000000000000000000000000018 (pre bfly) g[0] = 0x000000000000000000000000000000000000000000000000000000000000002c g[4] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd ------------------------------------------------------------------------- step = 8, half_step = 4, i = 2, j = 1, k = 0 w = 0x2b337de1c8c14f22ec9b9e2f96afef3652627366f8170a0a948dad4ac1bd5e80 g[1] = 0x0000000000000002cf135e7506a45d632d270d45f1181294833fc48d823f2728 (pre bfly) g[5] = 0x0000000000000002cf135e7506a45d632d270d45f1181294833fc48d823f2728 (pre bfly) g[1] = 0x002701a4fd3f1d3e7a309cdc72c7c8fcb5c94af009cb48e6e51461367a2f1796 g[5] = 0x303d4ccde3f282f0dc4665c41c024a26ccb8b7e4521e4cd3654d1d787a4f36bb ------------------------------------------------------------------------- step = 8, half_step = 4, i = 2, j = 2, k = 0 w = 0x30644e72e131a029048b6e193fd841045cea24f6fd736bec231204708f703636 g[2] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd (pre bfly) g[6] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd (pre bfly) g[2] = 0x0000000000000002cf135e7506a45d632d270d45f1181294833fc48d823f2728 g[6] = 0x30644e72e131a026e93ce7417adcfaf9fb0cdb0288a15dfcc0a231066dc0d8d1 ------------------------------------------------------------------------- step = 8, half_step = 4, i = 2, j = 3, k = 0 w = 0x1d59376149b959ccbd157ac850893a6f07c2d99b3852513ab8d01be8e846a566 g[3] = 0x30644e72e131a026e93ce7417adcfaf9fb0cdb0288a15dfcc0a231066dc0d8d1 (pre bfly) g[7] = 0x30644e72e131a026e93ce7417adcfaf9fb0cdb0288a15dfcc0a231066dc0d8d1 (pre bfly) g[3] = 0x002701a4fd3f1d38dc09dff2657f0e365b7b3064279b23bdde94d81b75b0c93e g[7] = 0x303d4ccde3f282eb3e1fa8da0eb98f60726a9d586fee27aa5ecd945d75d0e863 ------------------------------------------------------------------------- Post NTT [0x000000000000000000000000000000000000000000000000000000000000002c, 0x002701a4fd3f1d3e7a309cdc72c7c8fcb5c94af009cb48e6e51461367a2f1796, 0x0000000000000002cf135e7506a45d632d270d45f1181294833fc48d823f2728, 0x002701a4fd3f1d38dc09dff2657f0e365b7b3064279b23bdde94d81b75b0c93e, 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd, 0x303d4ccde3f282f0dc4665c41c024a26ccb8b7e4521e4cd3654d1d787a4f36bb, 0x30644e72e131a026e93ce7417adcfaf9fb0cdb0288a15dfcc0a231066dc0d8d1, 0x303d4ccde3f282eb3e1fa8da0eb98f60726a9d586fee27aa5ecd945d75d0e863] ``` We can see the iteration acting like the recursion, for the "step = 2" outputs we see it's the same as the bottom level of recursion for the recursive version. For the "step = 4" outputs, we see the results from "step = 2" are feed into the computations at this level, which is exactly the same as the recursive version! So we can see that the core loops, while they look quite obscure, perform the indexing that is required for the butterfly operation to be applied to odd and even pairs at each level. ``` for j in 0..half_step { for k in (0..n).step_by(step as usize) { ``` This indexing is best visualised with the definitive butterfly diagram, [example here](https://www.researchgate.net/figure/8-point-FFT-butterfly-calculation_fig1_241164556). Now we can see the inverse output is very similar, but I have included it so you can observe how the values are unravelled by their inverses. #### Inverse transform (INTT) output. ``` Pre INTT [0x000000000000000000000000000000000000000000000000000000000000002c, 0x002701a4fd3f1d3e7a309cdc72c7c8fcb5c94af009cb48e6e51461367a2f1796, 0x0000000000000002cf135e7506a45d632d270d45f1181294833fc48d823f2728, 0x002701a4fd3f1d38dc09dff2657f0e365b7b3064279b23bdde94d81b75b0c93e, 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd, 0x303d4ccde3f282f0dc4665c41c024a26ccb8b7e4521e4cd3654d1d787a4f36bb, 0x30644e72e131a026e93ce7417adcfaf9fb0cdb0288a15dfcc0a231066dc0d8d1, 0x303d4ccde3f282eb3e1fa8da0eb98f60726a9d586fee27aa5ecd945d75d0e863] Bit reversed input = [0x000000000000000000000000000000000000000000000000000000000000002c, 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd, 0x0000000000000002cf135e7506a45d632d270d45f1181294833fc48d823f2728, 0x30644e72e131a026e93ce7417adcfaf9fb0cdb0288a15dfcc0a231066dc0d8d1, 0x002701a4fd3f1d3e7a309cdc72c7c8fcb5c94af009cb48e6e51461367a2f1796, 0x303d4ccde3f282f0dc4665c41c024a26ccb8b7e4521e4cd3654d1d787a4f36bb, 0x002701a4fd3f1d38dc09dff2657f0e365b7b3064279b23bdde94d81b75b0c93e, 0x303d4ccde3f282eb3e1fa8da0eb98f60726a9d586fee27aa5ecd945d75d0e863] ------------------------------------------------------------------------- step = 2, half_step = 1, i = 0, j = 0, k = 0 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[0] = 0x000000000000000000000000000000000000000000000000000000000000002c (pre bfly) g[1] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffffd (pre bfly) g[0] = 0x0000000000000000000000000000000000000000000000000000000000000028 g[1] = 0x0000000000000000000000000000000000000000000000000000000000000030 ------------------------------------------------------------------------- step = 2, half_step = 1, i = 0, j = 0, k = 2 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[2] = 0x0000000000000002cf135e7506a45d632d270d45f1181294833fc48d823f2728 (pre bfly) g[3] = 0x30644e72e131a026e93ce7417adcfaf9fb0cdb0288a15dfcc0a231066dc0d8d1 (pre bfly) g[2] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffff9 g[3] = 0x00000000000000059e26bcea0d48bac65a4e1a8be2302529067f891b047e4e58 ------------------------------------------------------------------------- step = 2, half_step = 1, i = 0, j = 0, k = 4 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[4] = 0x002701a4fd3f1d3e7a309cdc72c7c8fcb5c94af009cb48e6e51461367a2f1796 (pre bfly) g[5] = 0x303d4ccde3f282f0dc4665c41c024a26ccb8b7e4521e4cd3654d1d787a4f36bb (pre bfly) g[4] = 0x00000000000000059e26bcea0d48bac65a4e1a8be2302529067f891b047e4e50 g[5] = 0x004e0349fa7e3a77563a7cced846d73311447b5431666ca4c3a93951efdfe0dc ------------------------------------------------------------------------- step = 2, half_step = 1, i = 0, j = 0, k = 6 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[6] = 0x002701a4fd3f1d38dc09dff2657f0e365b7b3064279b23bdde94d81b75b0c93e (pre bfly) g[7] = 0x303d4ccde3f282eb3e1fa8da0eb98f60726a9d586fee27aa5ecd945d75d0e863 (pre bfly) g[6] = 0x30644e72e131a0241a2988cc74389d96cde5cdbc97894b683d626c78eb81b1a1 g[7] = 0x004e0349fa7e3a77563a7cced846d73311447b5431666ca4c3a93951efdfe0dc ------------------------------------------------------------------------- step = 4, half_step = 2, i = 1, j = 0, k = 0 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[0] = 0x0000000000000000000000000000000000000000000000000000000000000028 (pre bfly) g[2] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffff9 (pre bfly) g[0] = 0x0000000000000000000000000000000000000000000000000000000000000020 g[2] = 0x0000000000000000000000000000000000000000000000000000000000000030 ------------------------------------------------------------------------- step = 4, half_step = 2, i = 1, j = 0, k = 4 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[4] = 0x00000000000000059e26bcea0d48bac65a4e1a8be2302529067f891b047e4e50 (pre bfly) g[6] = 0x30644e72e131a0241a2988cc74389d96cde5cdbc97894b683d626c78eb81b1a1 (pre bfly) g[4] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffff1 g[6] = 0x000000000000000b3c4d79d41a91758cb49c3517c4604a520cff123608fc9cb0 ------------------------------------------------------------------------- step = 4, half_step = 2, i = 1, j = 1, k = 0 w = 0x0000000000000000b3c4d79d41a91758cb49c3517c4604a520cff123608fc9cb g[1] = 0x0000000000000000000000000000000000000000000000000000000000000030 (pre bfly) g[3] = 0x00000000000000059e26bcea0d48bac65a4e1a8be2302529067f891b047e4e58 (pre bfly) g[1] = 0x0000000000000000000000000000000000000000000000000000000000000028 g[3] = 0x0000000000000000000000000000000000000000000000000000000000000038 ------------------------------------------------------------------------- step = 4, half_step = 2, i = 1, j = 1, k = 4 w = 0x0000000000000000b3c4d79d41a91758cb49c3517c4604a520cff123608fc9cb g[5] = 0x004e0349fa7e3a77563a7cced846d73311447b5431666ca4c3a93951efdfe0dc (pre bfly) g[7] = 0x004e0349fa7e3a77563a7cced846d73311447b5431666ca4c3a93951efdfe0dc (pre bfly) g[5] = 0x22a8ba9ea5d3704302fa32b82b953a1034e365cfa06cf7d9b1628efef42a180f g[7] = 0x0e579a68305aa4d561cb0c9c0679ccb315d979213c19520119d1d938db95a9aa ------------------------------------------------------------------------- step = 8, half_step = 4, i = 2, j = 0, k = 0 w = 0x0000000000000000000000000000000000000000000000000000000000000001 g[0] = 0x0000000000000000000000000000000000000000000000000000000000000020 (pre bfly) g[4] = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593effffff1 (pre bfly) g[0] = 0x0000000000000000000000000000000000000000000000000000000000000010 g[4] = 0x0000000000000000000000000000000000000000000000000000000000000030 ------------------------------------------------------------------------- step = 8, half_step = 4, i = 2, j = 1, k = 0 w = 0x130b17119778465cfb3acaee30f81dee20710ead41671f568b11d9ab07b95a9b g[1] = 0x0000000000000000000000000000000000000000000000000000000000000028 (pre bfly) g[5] = 0x22a8ba9ea5d3704302fa32b82b953a1034e365cfa06cf7d9b1628efef42a180f (pre bfly) g[1] = 0x0000000000000000000000000000000000000000000000000000000000000018 g[5] = 0x0000000000000000000000000000000000000000000000000000000000000038 ------------------------------------------------------------------------- step = 8, half_step = 4, i = 2, j = 2, k = 0 w = 0x0000000000000000b3c4d79d41a91758cb49c3517c4604a520cff123608fc9cb g[2] = 0x0000000000000000000000000000000000000000000000000000000000000030 (pre bfly) g[6] = 0x000000000000000b3c4d79d41a91758cb49c3517c4604a520cff123608fc9cb0 (pre bfly) g[2] = 0x0000000000000000000000000000000000000000000000000000000000000020 g[6] = 0x0000000000000000000000000000000000000000000000000000000000000040 ------------------------------------------------------------------------- step = 8, half_step = 4, i = 2, j = 3, k = 0 w = 0x0530d09118705106cbb4a786ead16926d5d174e181a26686af5448492e42a181 g[3] = 0x0000000000000000000000000000000000000000000000000000000000000038 (pre bfly) g[7] = 0x0e579a68305aa4d561cb0c9c0679ccb315d979213c19520119d1d938db95a9aa (pre bfly) g[3] = 0x0000000000000000000000000000000000000000000000000000000000000028 g[7] = 0x0000000000000000000000000000000000000000000000000000000000000048 ------------------------------------------------------------------------- Post INTT [0x0000000000000000000000000000000000000000000000000000000000000002, 0x0000000000000000000000000000000000000000000000000000000000000003, 0x0000000000000000000000000000000000000000000000000000000000000004, 0x0000000000000000000000000000000000000000000000000000000000000005, 0x0000000000000000000000000000000000000000000000000000000000000006, 0x0000000000000000000000000000000000000000000000000000000000000007, 0x0000000000000000000000000000000000000000000000000000000000000008, 0x0000000000000000000000000000000000000000000000000000000000000009] ``` ## Second approach: pre-computing roots. Notice we don't care about the primitive root for the original vector length, but rather compute on the fly for what polynomial length we are currently at. In the recursive implementation, we would square the root for $n$ length and get the primitive root for $n/2$, since each step halved the vector and passed it to the next recursive call. For iterative version instead we are step through $2^i$ lengths, $2, 4, 8, 16, \dots$, which is essentially starting from the "bottom" recursive level and stepping "upwards". Theoretically, we could start with the primitive root for $2$ length, and then each step, square root the current primitive root to get the next, however, I haven't had a chance to test this and haven't recorded any other implementations doing this. So we will just focus optimising our standard primitive root computation. Simply pre-computing these roots and storing them in a lookup table provides an increase in efficiency. This is because the main operation that composes the primitive root computation is field multiplication, and in the initial approach, roots were calculated every time! Which means every forward/inverse transform would recompute the exact same roots, so if we pre-compute the roots in the constructor of our `NTT` object, then every call of transform would reuse the roots. The following code snippet shows how we have achieved this. We must introduce new constants for our root table, these are `LD` and `RD` which will always be dependent on the size of the input polynomial, `D`. ```rust const D = // .. array.len() const LD: usize = 2 * (D.ilog2() as usize); const RD: usize = D / 2; ``` The function `new_coeff_context` is our new constructor which generates the roots for the transforms; the "invoker" of the function is expected to provide a chunk of memory for storage, this is because we want to control memory allocations. The `NTT` object has been renamed to `NTTContext` and now has a new phantom type which dictates which polynomial representation the context supports. ```rust impl<F: PrimeField, const D: usize, const LD: usize, const RD: usize> NTTContext<Coeff, F, D, LD, RD> { pub fn new_coeff_context( rt_chunk_ptr: *mut [[F; RD]; LD], modifier_buffer: *mut [F; D], ) -> Self { // Generate the omegas, which are the points we evaluate at for the evaluation form. let primitive_root = Self::calc_primitive_root(D); let inv_primitive_root = primitive_root.invert().unwrap(); let roots_table = unsafe { &mut *rt_chunk_ptr }; Self::generate_roots(roots_table); NTTContext { primitive_root, inv_primitive_root, roots_table, modifier_buffer, repr: PhantomData::<Coeff>, } } fn calc_primitive_root(degree: usize) -> F { let mut omega = F::root_of_unity(); let k = degree.ilog2() as u32; for _ in 0..(F::S - k) { omega = omega.square(); } let primitive_root = omega; primitive_root } fn generate_roots(roots_table: &mut [[F; RD]; LD]) { let logd = D.ilog2() as usize; for i in 0..logd { // Calculate 2^(i+1) let step = 1 << (i + 1); // Calculate step/2 -1 let half_step = step >> 1; // Root of unity for "step" number of roots; // equivalent to w_step = e^(2 * pi * i/ step) for complex number roots. let base = Self::calc_primitive_root(step); let inv_base = base.invert().unwrap(); let mut w = F::one(); let mut inv_w = F::one(); for j in 0..half_step { roots_table[i][j] = w; roots_table[i + logd][j] = inv_w; w *= base; inv_w *= inv_base; } } } } ``` Now our implementation becomes the following, which looks very similar to the initial implementation, apart from out `get_roots` which does our root table lookup. ```rust fn get_roots(&self, i: usize, j: usize) -> (F, F) { let rt = unsafe { &*self.roots_table }; let logd = D.ilog2() as usize; (rt[i][j], rt[i + logd][j]) } fn forward_iterative(&self, f: &Vec<F>, inverse: bool) -> Vec<F> { let n = f.len(); assert!(n.is_power_of_two()); // Reorder is equivalent to recursively separating vector into odds and evens indexes. let mut g = Self::bit_reverse_copy(f); for i in 0..(n.ilog2() as usize) { // Calculate 2^(i+1) let step = 1 << (i + 1); // Calculate step/2 -1 let half_step = step >> 1; for j in 0..half_step { let (root, inv_root) = self.get_roots(i, j); let w = if inverse { inv_root } else { root }; for k in (0..n).step_by(step as usize) { let u = g[k + j]; let t = g[k + j + half_step] * w; g[k + j] = u + t; g[k + j + half_step] = u - t; } } } g } fn inverse_iterative(&self, f: &Vec<F>) -> Vec<F> { let scaler = F::from(f.len() as u64).invert().unwrap(); self.forward_iterative(f, true) .iter() .map(|&x| x * scaler) .collect::<Vec<F>>() } ```