# The FFT Algorithm over Finite Fields

## 1 DFT and FFT

### 1.1 Background of DFT

The FFT is a fast algorithm for computing the DFT, which transforms between the "time domain" and the "frequency domain." For an n-point DFT, the transformation from $\{x_{0},x_{1},...,x_{n-1}\}$ to $\{X_{0},X_{1},...,X_{n-1}\}$ is defined as:

$X(k)=\sum_{i=0}^{n-1} x(i)w_{n}^{ki}=x_{0}w_{n}^{0}+x_{1}w_{n}^{k}+x_{2}w_{n}^{2k}+...+x_{n-1}w_{n}^{(n-1)k}$

where $w_{n}=e^{\frac{-2\pi i}{n}}$.

From a signal processing perspective, $w_{n}$ can be regarded as the basis frequency $f$, so that $w_{n}^{k}$ corresponds to the frequency $kf$. This transformation maps the time-domain signal $\{x_{0},x_{1},x_{2},...,x_{n-1}\}$ to the frequency-domain signal $\{X_{0},X_{1},...,X_{n-1}\}$.

### 1.2 Applications in Finite Fields

The primitive n-th root of unity $w_{n}$ in the complex field and an n-th root of unity $w$ in a finite field both serve as generators of cyclic groups of order n. In the complex field, periodicity arises from rotational symmetry, while in finite fields, it stems from the algebraic structure of finite multiplicative groups. Therefore, in a finite field, we can take $w_{n}$ to be a generator of a multiplicative subgroup of order n, with the property that $w_{n}^n=1$. When n is a power of 2, additional properties hold:

$w_{n}^{\frac{n}{2}}=-1$, $w_{n}^{i}=-w_{n}^{{\frac{n}{2}}+i}$, and $w_{n}^{2i}=w_{\frac{n}{2}}^{i}$

These properties, arising from the radix-2 condition ($n=2^k$), form the basis for the FFT optimization discussed later.

### 1.3 Matrix Representation of DFT

The DFT can be expressed in matrix form (writing $w=w_{n}$ for brevity):

$$\mathbf{X} =\begin{bmatrix}X_{0} \\X_{1} \\...\\X_{n-1}\end{bmatrix} =\begin{bmatrix} 1 & 1 & 1 &... &1 \\ 1 & w & w^2 &... &w^{n-1} \\ 1 & w^2 & w^4 &... &w^{2(n-1)} \\ 1 & ... & ... &... &...\\ 1 & w^{n-1} & w^{2(n-1)} &... &w^{(n-1)(n-1)}\end{bmatrix} \cdot \begin{bmatrix}x_{0} \\x_{1}\\... \\x_{n-1} \end{bmatrix}$$

The above computation can be abbreviated as $X=M \cdot x$, where the matrix $M$ is a Vandermonde matrix.
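As a concrete illustration of the definition above, here is a naive $O(n^2)$ DFT sketch over a small prime field. The field $GF(17)$, the size $n=8$, and the root $w=2$ are illustrative choices, not fixed by the text ($2^8 = 256 \equiv 1 \pmod{17}$, so 2 is an 8th root of unity):

```python
# A naive O(n^2) DFT over GF(17). The field p = 17, size n = 8, and
# root w = 2 are illustrative choices (2^8 = 256 ≡ 1 mod 17).
p, n, w = 17, 8, 2

def dft(x, w, p):
    """X[k] = sum_i x[i] * w^(k*i) mod p."""
    n = len(x)
    return [sum(x[i] * pow(w, k * i, p) for i in range(n)) % p
            for k in range(n)]

# Radix-2 properties from the text:
assert pow(w, n, p) == 1                      # w^n = 1
assert pow(w, n // 2, p) == p - 1             # w^(n/2) = -1
assert all(pow(w, i, p) == (-pow(w, n // 2 + i, p)) % p for i in range(n))

x = [3, 1, 4, 1, 5, 9, 2, 6]
X = dft(x, w, p)

# Round trip: the inverse DFT uses w^-1 and a 1/n scaling.
w_inv, n_inv = pow(w, p - 2, p), pow(n, p - 2, p)
x_back = [v * n_inv % p for v in dft(X, w_inv, p)]
assert x_back == x
```

Replacing the complex $e^{-2\pi i/n}$ with a field element of multiplicative order $n$ is the only change relative to the classical definition.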
The inverse transformation from the frequency-domain signal $X$ back to the time-domain signal $x$ is given by:

$$\mathbf{x} =\begin{bmatrix}x_{0} \\x_{1} \\...\\x_{n-1}\end{bmatrix} =\frac{1}{n} \begin{bmatrix} 1 & 1 & 1 &... &1 \\ 1 & w^{-1} & w^{-2} &... &w^{-(n-1)} \\ 1 & w^{-2} & w^{-4} &... &w^{-2(n-1)} \\ 1 & ... & ... &... &...\\ 1 & w^{-(n-1)} & w^{-2(n-1)} &... &w^{-(n-1)(n-1)}\end{bmatrix} \cdot \begin{bmatrix}X_{0} \\X_{1}\\... \\X_{n-1} \end{bmatrix}=\frac{1}{n} M^{'} \cdot X$$

We observe that the inverse matrix $M^{-1}=\frac{1}{n}M^{'}$ effectively defines a frequency domain constructed with $w'=w^{-1}$. So we can see that: the time-domain signal $x$ is transformed into the frequency-domain signal $X$ through the operation $M$ (which we can regard as a DFT module); conversely, the frequency-domain signal $X$ is transformed back to the time-domain signal $x$ through the inverse operation $M^{-1}$.

### 1.4 FFT Algorithm

When the input size $n$ is a power of 2 (radix-2), the DFT computation can be optimized from $O(n^2)$ to $O(n \log n)$<sup><a href="#ref1">[1]</a></sup>. In the finite field setting, if $n$ is radix-2, then the domain $\{1, w, w^2, ..., w^{n-1}\}$ folds onto itself. For example, when $n=8$:

$w=-w^5$, $w^2=(w^5)^2$; $w^2=-w^6$, $(w^2)^2=(w^6)^2=w^4$; $w^4=-w^0=-1$, $(w^4)^2=1$

So we can visualize this: $w$ and $w^5$ fold to $w^2$, $w^2$ and $w^6$ fold to $w^4$, and $w^4$ folds to 1.

<div style="text-align: center;"> <img src="https://hackmd.io/_uploads/B1nb0lVTyx.png" width="350"/> </div>

The optimization idea is: for any $X(k)$, we evaluate the polynomial at $w^k$, then split it into even and odd parts:

$f(k)=f(w^k)=\sum_{i=0}^{n-1} x(i)w_{n}^{ki}=f_{even}(w^{2k})+w^k\cdot f_{odd}(w^{2k})$

And since $w^{k+\frac{n}{2}}=-w^{k}$, we also get:

$f({k+\frac{n}{2}})=f(w^{k+\frac{n}{2}})=f_{even}(w^{2k})-w^k\cdot f_{odd}(w^{2k})$

Therefore, once we compute $f_{even}(w^{2k})$ and $f_{odd}(w^{2k})$, we get both $f(k)$ and $f(k+n/2)$.
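This even/odd split can be checked numerically. A small sketch over $GF(17)$ with $n=8$ and $w=2$ (illustrative choices; `evaluate` is a hypothetical helper doing Horner evaluation mod p):

```python
# Numeric check of the even/odd split: f(w^k) = f_even(w^2k) + w^k * f_odd(w^2k)
# over GF(17), where w = 2 is an 8th root of unity. The field, n, w, and
# helper names are illustrative choices, not from the text.
p, n, w = 17, 8, 2

def evaluate(coeffs, point, p):
    """Horner evaluation of a polynomial with the given coefficients mod p."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * point + c) % p
    return acc

x = [3, 1, 4, 1, 5, 9, 2, 6]            # coefficients x_0 .. x_7
even, odd = x[0::2], x[1::2]
for k in range(n // 2):
    wk, w2k = pow(w, k, p), pow(w, 2 * k, p)
    f_e, f_o = evaluate(even, w2k, p), evaluate(odd, w2k, p)
    # f(w^k) = f_even(w^2k) + w^k * f_odd(w^2k)
    assert evaluate(x, wk, p) == (f_e + wk * f_o) % p
    # f(w^(k+n/2)) = f_even(w^2k) - w^k * f_odd(w^2k)
    assert evaluate(x, pow(w, k + n // 2, p), p) == (f_e - wk * f_o) % p
```

One evaluation of the two half-size polynomials thus yields two evaluations of the full polynomial, which is where the $O(n \log n)$ saving comes from.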
We can continue recursively breaking down even/odd terms like:

$f_{even}(w^{2k}) = f_{even1}(w^{4k}) + w^{2k} \cdot f_{odd1}(w^{4k})$

(and similarly for $f_{odd}$) until we reach $w^{nk}=1$ and each subproblem has a single term.

#### 1.4.1 Butterfly Algorithm

The radix-2 FFT computation process is often represented with the "butterfly algorithm" diagram (where $y$ corresponds to $X$ mentioned earlier). The "frequency factors" used in butterfly calculations are called twiddle factors. The butterfly algorithm is shown below<sup><a href="#ref2">[2]</a></sup>:

<div style="text-align: center;"> <img src="https://hackmd.io/_uploads/BkSfqZQc1e.png" width="500"/> </div>

Butterfly computation for n points goes through $\log n$ rounds. In the diagram above: the first round (on the left) is an 8-point DFT, the middle round splits the 8-point DFT into two 4-point DFTs, and the final round further splits each into 2-point DFTs to compute the result.

## 2. DIT and DIF

When implementing FFT, there are two main strategies: Decimation in Time (DIT) and Decimation in Frequency (DIF). Both use the butterfly operation, but the order of input and output differs.

### 2.1 Computational Methods

- DIT: Starts from 2-point DFTs, then proceeds to 4-point, 8-point, ... and finally computes the n-point DFT.

$x_{L-next}=x_{L}+twiddlefactor \cdot x_{R}$
$x_{R-next}=x_{L}-twiddlefactor \cdot x_{R}$

Taking an 8-point computation as an example, the $twiddlefactor$ values are:

First round: $w_{8}^0=1$
Second round (within the same 4-point DFT): $w_{8}^0, w_{8}^2$
Third round (within the 8-point DFT): $w_{8}^0, w_{8}^1, w_{8}^2, w_{8}^3$

As shown below<sup><a href="#ref3">[3]</a></sup>:

<div style="text-align: center;"> <img src="https://hackmd.io/_uploads/S1-pCXXcJg.png" width="550"/> </div>

- DIF: Starts from the n-point DFT, then n/2-point, ... and ends with 2-point DFTs.
$x_{L-next}=x_{L}+x_{R}$
$x_{R-next}=twiddlefactor \cdot (x_{L}-x_{R})$

For the $twiddlefactor$ values:

First round (8-point DFT): $w_{8}^{0}, w_{8}^{1}, w_{8}^{2}, w_{8}^{3}$
Second round (within the same 4-point DFT): $w_{8}^0, w_{8}^2$
Third round (each 2-point DFT): $w_{8}^0=1$

As shown below<sup><a href="#ref4">[4]</a></sup>:

<div style="text-align: center;"> <img src="https://hackmd.io/_uploads/HJXxyEm9kl.png" width="500"/> </div>

In the above DIT and DIF computations, both use a pair of points from the previous layer to compute a pair in the next layer. The differences are: (1) the butterfly computation itself, and (2) the values of the $twiddlefactor$s.

### 2.2 Input and Output Order

For DIT: input in bit-reversed order → output in natural order
For DIF: input in natural order → output in bit-reversed order

### 2.3 Code Implementation<sup><a href="#ref5">[5]</a></sup>

Here are code snippets from my own implementation of the DIT and DIF parts. $a$ is the input data; $roots$ are the twiddle factors, i.e., $\{1,w_{n},w_{n}^{2},...,w_{n}^{n-1}\}$.
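The snippets below also rely on a `bit_reverse` helper whose body is not shown here; a minimal standalone sketch (an assumption, for inputs whose length is a power of 2) could look like:

```python
# A minimal sketch of a bit_reverse helper. The snippets in this section
# call self.bit_reverse; its exact implementation is not shown in the
# text, so this standalone version is an assumption.
def bit_reverse(a):
    """Return a permuted copy of a (length a power of 2) in bit-reversed index order."""
    n = len(a)
    bits = n.bit_length() - 1
    return [a[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]

# For 8 elements, index i maps to the value whose index is i's reversed 3-bit pattern.
assert bit_reverse([0, 1, 2, 3, 4, 5, 6, 7]) == [0, 4, 2, 6, 1, 5, 3, 7]
# Bit reversal is an involution: applying it twice restores the order.
assert bit_reverse(bit_reverse(list(range(16)))) == list(range(16))
```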
- dit

```python=
def dit(self, a, roots):
    n = len(a)
    a = self.bit_reverse(a)
    logn = n.bit_length() - 1
    for s in range(1, logn + 1):
        m = 1 << s
        wm = roots[n // m]
        for k in range(0, n, m):
            w = 1
            for j in range(m // 2):
                u = a[k + j]
                v = self.gf.mul(w, a[k + j + m // 2])
                a[k + j] = self.gf.add(u, v)
                a[k + j + m // 2] = self.gf.sub(u, v)
                w = self.gf.mul(w, wm)
    return a
```

- dif

```python=
def dif(self, a, roots):
    n = len(a)
    logn = n.bit_length() - 1
    for s in range(logn, 0, -1):
        m = 1 << s
        wm = roots[n // m]
        for k in range(0, n, m):
            w = 1
            for j in range(m // 2):
                u = a[k + j]
                v = a[k + j + m // 2]
                a[k + j] = self.gf.add(u, v)
                a[k + j + m // 2] = self.gf.mul(w, self.gf.sub(u, v))
                w = self.gf.mul(w, wm)
    return self.bit_reverse(a)
```

- Inverse DIT and DIF Computation

In Section 1.3 (DFT in matrix form), we saw that converting frequency-domain data $X$ back into the time-domain signal $x$ only requires: changing the twiddle factors formed from the primitive root $w$ to those formed from the inverse root $w^{-1}$, i.e., from $\{1,w_{n},w_{n}^{2},...,w_{n}^{n-1}\}$ to $\{1,w_{n}^{-1},w_{n}^{-2},...,w_{n}^{-(n-1)}\}$ (equivalently, using the inverse Vandermonde matrix $M^{-1}$), and applying a $\frac{1}{n}$ scaling. Therefore, the inverse operation for either DIT or DIF can be implemented in the same way:

```python=
# inverse dit fft
def inverse_dit(self, a):
    inverse_roots = self.get_inverse_roots(len(a))
    a = self.dit(a, inverse_roots)
    n_inv = self.gf.inv(len(a))
    return [self.gf.mul(x, n_inv) for x in a]

# inverse dif fft
def inverse_dif(self, a):
    inverse_roots = self.get_inverse_roots(len(a))
    a = self.dif(a, inverse_roots)
    n_inv = self.gf.inv(len(a))
    return [self.gf.mul(x, n_inv) for x in a]
```

For full code details, see Eric's plonky3-python-notebook FFT section: https://github.com/0xxEric/plonky3-python-notebook/blob/main/fft/four_step.ipynb

### 2.4 An Example: Deriving the Consistency Between DIT and DIF

Why do DIT and DIF FFT computations produce the same result?
Let's use a 4-point DFT as an example and derive the computations to verify their consistency. Let the input $a$ be $\{x_{0},x_{1},x_{2},x_{3}\}$ and the $roots$ be $\{w_{4}^0,w_{4}^1,w_{4}^2,w_{4}^3\}$.

- DIT

Input after bit-reversal: $\{x_{0},x_{2},x_{1},x_{3}\}$
First round, $twiddlefactors$ are $\{w_{4}^0,w_{4}^0\}$; output is $\{x_{0}+x_{2},\ x_{0}-x_{2},\ x_{1}+x_{3},\ x_{1}-x_{3}\}$
Second round, $twiddlefactors$ are $\{w_{4}^0,w_{4}^1\}$; output is $\{(x_{0}+x_{2})+(x_{1}+x_{3}),\ (x_{0}-x_{2})+w_{4}^1(x_{1}-x_{3}),\ (x_{0}+x_{2})-(x_{1}+x_{3}),\ (x_{0}-x_{2})-w_{4}^1(x_{1}-x_{3})\}$

- DIF

Input: $\{x_{0},x_{1},x_{2},x_{3}\}$
First round, $twiddlefactors$ are $\{w_{4}^0,w_{4}^1\}$; output is $\{x_{0}+x_{2},\ x_{1}+x_{3},\ x_{0}-x_{2},\ w_{4}^1(x_{1}-x_{3})\}$
Second round, $twiddlefactors$ are $\{w_{4}^0,w_{4}^0\}$; output is $\{(x_{0}+x_{2})+(x_{1}+x_{3}),\ (x_{0}+x_{2})-(x_{1}+x_{3}),\ (x_{0}-x_{2})+w_{4}^1(x_{1}-x_{3}),\ (x_{0}-x_{2})-w_{4}^1(x_{1}-x_{3})\}$

So, after applying bit-reversal to the second-round output of DIF, we get:

$\{(x_{0}+x_{2})+(x_{1}+x_{3}),\ (x_{0}-x_{2})+w_{4}^1(x_{1}-x_{3}),\ (x_{0}+x_{2})-(x_{1}+x_{3}),\ (x_{0}-x_{2})-w_{4}^1(x_{1}-x_{3})\}$

As we can see, the results of DIT and DIF are identical.

## 3. Recursion vs. Iteration: A Comparison of Two Implementation Methods

In Section 1.4, we introduced the core idea of the FFT algorithm: breaking down the computation of $f(w^k)$ into two smaller problems, $f_{even}(w^{2k})$ and $f_{odd}(w^{2k})$, and solving them recursively. In the DIT and DIF implementations shown earlier, the algorithm is implemented via multiple layers of iteration. Although the core algorithmic idea remains the same, the iterative implementation calculates each recursive layer explicitly, step by step: from the initial n input values (first layer), to the second layer, and so on, until the $\log n$-th layer.
### 3.1 Recursive FFT Code Implementation<sup><a href="#ref6">[6]</a></sup>

```python=
# fft: from coefficients to evaluations
def fft(vals, domain):
    if (len(domain) & (len(domain) - 1)) != 0:
        raise ValueError("Domain length must be a power of 2.")
    if len(vals) < len(domain):
        if len(vals) == 0:
            zero = Field(0)
        else:
            zero = vals[0] - vals[0]
        vals = vals + [zero] * (len(domain) - len(vals))
    if len(vals) == 1:
        return vals
    half_domain = halve_domain(domain)
    f0 = fft(vals[::2], half_domain)
    f1 = fft(vals[1::2], half_domain)
    left = [L + x * R for L, R, x in zip(f0, f1, domain)]
    right = [L - x * R for L, R, x in zip(f0, f1, domain)]
    return left + right

# ifft: from evaluations to coefficients
def inv_fft(vals, domain):
    if (len(domain) & (len(domain) - 1)) != 0:
        raise ValueError("Domain length must be a power of 2.")
    if len(vals) < len(domain):
        if len(vals) == 0:
            zero = Field(0)
        else:
            zero = vals[0] - vals[0]
        vals = vals + [zero] * (len(domain) - len(vals))
    if len(vals) == 1:
        return vals
    half_domain = halve_domain(domain)
    left = vals[:len(domain) // 2]
    right = vals[len(domain) // 2:]
    f0 = [(L + R) / 2 for L, R in zip(left, right)]
    f1 = [(L - R) / (2 * x) for L, R, x in zip(left, right, domain)]
    o = [0] * len(domain)
    o[::2] = inv_fft(f0, half_domain)
    o[1::2] = inv_fft(f1, half_domain)
    return o
```

### 3.2 Comparison Between Recursion and Iteration

#### 3.2.1 Usage of Twiddle Factors in Butterfly Computation

- Recursive Approach

In the recursive approach, a domain is used to store the twiddle factors at each layer, and they must be recalculated at every level. For example, in the recursive code above, a half_domain is used to compute the next step:

First layer: $domain = \{1, w, w^2, ..., w^{n/2 - 1}\}$
Second layer: $domain = \{1, w^2, ..., w^{n - 2}\}$
...until the final layer: $domain = \{1\}$

- Iterative Approach

In the DIT and DIF implementations, the twiddle factors are precomputed in advance. At each of the $\log n$ layers, the relevant values are accessed as needed.
Example (for 8-point DIT FFT):

$twiddle\ factors = t = \{1, w, w^2, ..., w^7\}$
First round: use $t[0]$
Second round: use $t[0], t[2]$
Third round: use $t[0], t[1], t[2], t[3]$

#### 3.2.2 Output Order

- Recursive Approach
Each recursive call naturally returns results in natural order, with no need for bit-reversal sorting.
- Iterative Approach
Requires explicit bit-reversal sorting of the data.

#### 3.2.3 Summary of the Comparison

Based on the two points above, and considering other related factors, we can summarize the characteristics of the two approaches as follows:

- Recursive Implementation
The data is implicitly decomposed recursively, and the output is naturally in the correct order, requiring no extra processing. This method is relatively simple to implement and easier for learning and conceptual understanding. However, it can introduce overhead due to function calls, consume more memory for deep recursion, and lead to cache-unfriendly memory access patterns.
- Iterative Implementation
The data decomposition requires explicit bit-reversal ordering. Although the logic is slightly more complex and less intuitive, it usually operates in place, offering better performance. This approach is more suitable for hardware optimization and high-performance applications due to better memory locality and cache friendliness. Hence, for large-scale data and performance-critical scenarios, iteration is the better choice.

## 4. An Optimization: Bowers FFT

In Plonky3, based on Kevin J. Bowers' paper<sup><a href="#ref2">[2]</a></sup>, an optimized variant called Bowers FFT is implemented. It is very similar to the DIT and DIF FFTs, but includes a subtle optimization in how the $twiddlefactors$ are accessed, improving memory access patterns and efficiency.

### 4.1 Algorithm Overview

Bowers FFT includes two components: bowers_g and bowers_g_t, which are mutual inverses.
- bowers_g

This is similar to the DIF FFT and computes the forward DFT. The difference is: the $twiddle factors$ are provided in bit-reversed order. The butterfly computation is the same as DIF:

$x_{L-next}=x_{L}+x_{R}$
$x_{R-next}=twiddlefactor \cdot (x_{L}-x_{R})$

- bowers_g_t

This is similar to the inverse DIF and implements the inverse DFT. The difference is: the twiddle factors are constructed from the inverse primitive root $w^{-1}$, also in bit-reversed order. The butterfly computation is the same as DIT:

$x_{L-next}=x_{L}+twiddlefactor \cdot x_{R}$
$x_{R-next}=x_{L}-twiddlefactor \cdot x_{R}$

Taking an 8-point DFT as an example, the butterfly diagram of bowers_g_t is shown below:

<div style="text-align: center;"> <img src="https://hackmd.io/_uploads/ByuLFH49kx.png" width="500"/> </div>

### 4.2 Optimization Details

Because the twiddle factors in Bowers FFT are given in bit-reversed order, they can be accessed sequentially in memory during butterfly computations. Taking an 8-point DFT as an example:

Original twiddle factors: $\{w_8^0, w_8^1, w_8^2, w_8^3\}$
Bit-reversed: $\{w_8^0, w_8^2, w_8^1, w_8^3\}$

In bowers_g_t:

First round: read the first element of the array 4 times: $\{w_8^0, w_8^0, w_8^0, w_8^0\}$
Second round: read the first two elements twice each: $\{w_8^0, w_8^0, w_8^2, w_8^2\}$
Third round: read the first four elements once each: $\{w_8^0, w_8^2, w_8^1, w_8^3\}$

In contrast, in the DIT FFT of Section 2.1, the twiddle factors accessed in the second round are:

$\{w_8^0, w_8^2, w_8^0, w_8^2\}$

This results in non-sequential memory access, which is less efficient than in Bowers FFT.
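This access pattern can be reproduced with a short standalone script. It tracks only the exponents of $w_8$, using the same `roots[k // m]` indexing as the implementation in the next section (the helper name `bit_reverse_perm` is an illustrative assumption):

```python
# Reproduce the per-round twiddle accesses of bowers_g_t for n = 8,
# tracking exponents of w_8 only (no field arithmetic needed).
def bit_reverse_perm(xs):
    """Permute xs (length a power of 2) into bit-reversed index order."""
    n = len(xs)
    bits = n.bit_length() - 1
    return [xs[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]

n, logn = 8, 3
exps = bit_reverse_perm([0, 1, 2, 3])        # bit-reversed exponents: [0, 2, 1, 3]

rounds = []
for s in range(logn, 0, -1):                 # same loop structure as bowers_g_t
    m = 1 << s
    # each k reads roots[k // m], once per butterfly pair in the block
    rounds.append([exps[k // m] for k in range(0, n, m) for _ in range(m // 2)])

# Round 1 reads exps[0] four times, round 2 reads exps[0..1] twice each,
# round 3 reads exps[0..3] once each: always a sequential prefix of the array.
assert rounds == [[0, 0, 0, 0], [0, 0, 2, 2], [0, 2, 1, 3]]
```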
### 4.3 Code Implementation<sup><a href="#ref7">[7]</a></sup>

```python=
def bower_g(self, a):
    n = len(a)
    a = self.bit_reverse(a)
    roots = self.get_forward_roots(n)
    roots = self.bit_reverse(roots[:len(roots) // 2])
    logn = n.bit_length() - 1
    for s in range(1, logn + 1):
        m = 1 << s
        for k in range(0, n, m):
            w = roots[k // m]
            for j in range(m // 2):
                u = a[k + j]
                v = a[k + j + m // 2]
                a[k + j] = self.gf.add(u, v)
                a[k + j + m // 2] = self.gf.mul(w, self.gf.sub(u, v))
    return a

def bower_g_t(self, a):
    n = len(a)
    roots = self.get_inverse_roots(n)
    roots = self.bit_reverse(roots[:len(roots) // 2])
    logn = n.bit_length() - 1
    for s in range(logn, 0, -1):
        m = 1 << s
        for k in range(0, n, m):
            w = roots[k // m]
            for j in range(m // 2):
                u = a[k + j]
                v = self.gf.mul(w, a[k + j + m // 2])
                a[k + j] = self.gf.add(u, v)
                a[k + j + m // 2] = self.gf.sub(u, v)
    n_inv = self.gf.inv(len(a))
    a = self.bit_reverse(a)
    return [self.gf.mul(x, n_inv) for x in a]
```

## 5. Four-Step FFT

Apart from optimizations like the Bowers FFT, which improve local memory access, is there an algorithm better suited for parallel processing of large-scale data? The Four-Step FFT is designed exactly for this purpose<sup><a href="#ref8">[8]</a></sup><sup><a href="#ref9">[9]</a></sup>.

The Four-Step FFT (also known as Bailey's FFT) is a high-performance algorithm for computing FFTs. This variant of the Cooley-Tukey FFT was originally designed for the hierarchical memory systems commonly found in modern computers. The algorithm treats the input samples as a 2D matrix (hence it is also known as the matrix FFT), and performs short FFTs independently on the columns and rows of the matrix.

### 5.1 Computational Steps

1. Reshape the input sequence into a matrix.
2. Apply a standard FFT independently to each column of the matrix.
3. Multiply each matrix element by a correction factor (the twiddle factors).
4. Apply a standard FFT independently to each row of the matrix.
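The four steps above can be sketched over the complex numbers with numpy for brevity; a finite-field version would replace `np.fft.fft` with an NTT and the complex exponential with powers of the field's root of unity. The decomposition $n = n_1 \cdot n_2$ and the helper name `four_step_fft` are illustrative assumptions:

```python
# Four-Step (Bailey) FFT sketch over the complex numbers, checked against
# numpy's reference FFT. n1/n2 and the function name are illustrative.
import numpy as np

def four_step_fft(x, n1, n2):
    n = n1 * n2
    A = np.asarray(x, dtype=complex).reshape(n1, n2)   # step 1: n1 x n2 matrix
    B = np.fft.fft(A, axis=0)                          # step 2: FFT each column
    k1 = np.arange(n1).reshape(-1, 1)
    j2 = np.arange(n2).reshape(1, -1)
    B = B * np.exp(-2j * np.pi * k1 * j2 / n)          # step 3: twiddle correction w_n^(k1*j2)
    C = np.fft.fft(B, axis=1)                          # step 4: FFT each row
    return C.T.reshape(-1)                             # transpose and flatten: X[k1 + n1*k2]

x = np.arange(16, dtype=complex)
assert np.allclose(four_step_fft(x, 4, 4), np.fft.fft(x))
```

The final transpose-and-flatten is what turns the row-major matrix of partial results into the naturally ordered output, matching the last step of the implementation shown in Section 5.3.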
<div style="text-align: center;"> <img src="https://hackmd.io/_uploads/BJ5_oINc1g.png" width="300"/> </div>

### 5.2 Performance and Efficiency Benefits

Take a dataset of size $2^{10}$ as an example. Directly using the Cooley-Tukey FFT algorithm would require $2^{10} \cdot 10$ operations. With the Four-Step FFT, the data is reshaped into a matrix of $2^5$ columns and $2^5$ rows. Then, an FFT is applied to each row and each column independently, with each small FFT requiring $2^5 \cdot 5$ operations.

Total operations: $2^5 \cdot 2 \cdot 2^5 \cdot 5 = 2^{10} \cdot 10$

So the operation count is essentially unchanged; although the Four-Step FFT also introduces extra steps for twiddle factor correction, it offers several practical advantages in terms of memory access and hardware compatibility:

- Higher memory access efficiency: By decomposing a large FFT into smaller row and column FFTs, each operating only on a portion of the matrix, this approach reduces random global memory access compared to traditional FFTs.
- Better parallelism: Since the row FFTs and column FFTs are independent, they can be computed in parallel. Modern hardware (such as FPGAs and GPUs) can exploit this to compute multiple FFTs simultaneously.
- Reduced memory requirements: The algorithm processes smaller blocks (rows or columns), which can fit within limited on-chip RAM, avoiding the need to store the full FFT input/output in main memory.

### 5.3 Code Implementation<sup><a href="#ref10">[10]</a></sup>

The following code snippet shows the main steps of the Four-Step FFT.
Supporting functions are assumed to be implemented elsewhere:

```python=
def four_step(array, log_rows, modulus):
    n = len(array)
    logn = n.bit_length() - 1
    log_cols = logn - log_rows
    assert log_rows > 0
    assert log_cols > 0
    assert modulus > n
    gf = Field(modulus)
    ntt = NTT(modulus, n)

    # first step: transfer the array into a matrix
    matrix = ntt.matrix(array, log_cols, log_rows)
    print("origin matrix is:", matrix)

    # second step: do FFT for each column
    ntt.apply_column_fft(matrix)
    print("after column fft, matrix is:", matrix)

    # third step: apply twiddles wm^(i*j)
    wm = ntt.get_forward_roots(n)[1]
    ntt.apply_twiddles(wm, matrix)

    # fourth step: do FFT for each row
    ntt.apply_row_fft(matrix)
    print("after row fft, matrix is:", matrix)

    # transpose the matrix, and flatten it into an array
    out_array = ntt.transpose_and_flatten(matrix)
    print("after transpose and flatten, array is:", out_array)
    return out_array
```

## 6. Conclusion

This article has summarized the DFT and FFT algorithms, introduced some common FFT variants (DIT, DIF), and described optimized FFT algorithms suitable for large-scale computation (Bowers FFT, Four-Step FFT). Beyond these, many other FFT variants have been developed. Furthermore, in Circle STARKs, a specialized FFT algorithm called the circle FFT is used for computations over a circle domain. The core algorithmic ideas remain similar, but the folding of the domain and the bit-reversal strategy are slightly different.

I've made some attempts to organize my understanding of this topic through notes and code implementations<sup><a href="#ref11">[11]</a></sup><sup><a href="#ref12">[12]</a></sup><sup><a href="#ref13">[13]</a></sup>. These may contain imperfections; feedback and suggestions are warmly welcome. I'm still a beginner and have much to learn. Constructive criticism is greatly appreciated!

## References

1. 
<a id="ref1"></a> [The Fast Fourier Transform (FFT): Most Ingenious Algorithm Ever?](https://www.youtube.com/watch?v=h7apO7q16V0&feature=emb_logo) 2. <a id="ref2"></a> [Improved Twiddle Access for Fast Fourier Transforms](https://ieeexplore.ieee.org/document/5313934) 3. <a id="ref3"></a> [Part 1 of FFT:Radix-2 DIT FFT](https://zhuanlan.zhihu.com/p/663306670) 4. <a id="ref4"></a> [Part 2 of FFT:Radix-2 DIF FFT](https://zhuanlan.zhihu.com/p/663449060) 5. <a id="ref5"></a> [Code:dit and dif fft](https://github.com/coset-io/plonky3-python-notebook/blob/main/fft/four_step.ipynb) 6. <a id="ref6"></a> [Code:Recursively performing FFT](https://github.com/coset-io/plonky3-python-notebook/blob/main/fft/radix_2_dit.ipynb) 7. <a id="ref7"></a> [Code:bowers fft](https://github.com/coset-io/plonky3-python-notebook/blob/main/fft/radix_2_bowers.ipynb) 8. <a id="ref8"></a> [Fast Fourier transform on a 3D FPGA](https://www.researchgate.net/publication/37995260_Fast_Fourier_transform_on_a_3D_FPGA) 9. <a id="ref9"></a> [Bailey's FFT algorithm](https://en.wikipedia.org/wiki/Bailey%27s_FFT_algorithm#) 10. <a id="ref9"></a> [Code:four-step fft](https://github.com/coset-io/plonky3-python-notebook/blob/main/fft/four_step.ipynb) 11. <a id="ref11"></a> [Circle Domain&Circle FFT](https://drive.google.com/file/d/1a_xzPu0iIaKNC9nt6kp7R5v2PYXwOnnS/view?usp=sharing) 12. <a id="ref12"></a> [Demonstration of Circle FFT](https://www.youtube.com/watch?v=ur3c4mIi1Jc&list=PLbQFt1T_44DylxPQWM1OgVRlbyriKeXqk) 13. <a id="ref13"></a> [Rust implementation of Circle FFT](https://github.com/0xxEric/CircleFFT)