# Accelerate Fast Fourier Transform
> 謝廷昇, 王晴文
[GitHub](https://github.com/annie58221/term_project)
## Introduction of the task
Task : Research FFT and its improvement schemes.
Implement the following:
RV32I : Develop the base RV32I ISA implementation.
Using B Extension : Integrate and utilize the Bit-Manipulation (B) extension.
Using RVV : Incorporate and utilize the RISC-V Vector Extension (RVV).
The code should be validated using QEMU.
Finally, prepare a signal processing application that utilizes your improved FFT as a technical demonstration.
## Introduction to Fourier transform
### DFT (Discrete Fourier transform)
The discrete Fourier transform (DFT) is defined as follows:
$$
Y(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk} = \sum_{n=0}^{N-1} x(n) e^{-\frac{2\pi i}{N}nk}
$$
Where:
- \(x(n)\) : Input sequence
- \(Y(k)\) : Output sequence
- $W_N^{nk} = e^{-\frac{2\pi i}{N}nk}$ : Twiddle factor
- $i = \sqrt{-1}$ : Imaginary unit
The Discrete Fourier Transform (DFT) is a mathematical tool that transforms a signal from the time domain to the frequency domain. Its primary purpose is to analyze the spectral characteristics of a signal. DFT is widely used in various fields, including digital signal processing, image processing, sound analysis, and scientific and engineering applications.
#### Characteristics of DFT:
* Spectrum Analysis: Decomposes the input signal into a superposition of different frequency components.
* Inverse Transform (IDFT): Allows the reconstruction of the time-domain signal from the frequency-domain data.
* High Complexity: The computational complexity of a standard DFT is O( $N^2$ ) , which increases significantly with the length of the input signal.
#### Applications of DFT:
* Signal Processing: Filtering, modulation, and spectrum analysis.
Image Processing: Edge detection and compression (e.g., discrete cosine transform in JPEG).
* Engineering and Science: Seismic analysis and structural vibration studies.
### Cooley-Tukey FFT (Fast Fourier Transform)
An efficient algorithm used for rapidly computing the Discrete Fourier Transform and its inverse . Fourier Transform is a core technique in signal processing for analyzing frequency-domain characteristics, and FFT reduces computational complexity, making the application of Fourier Transform more efficient.
Direct computation of DFT requires O( $N^2$ ) computational complexity (where N is the number of data points), while FFT reduces the complexity to O( $NlogN$ ), significantly improving efficiency. Its applications include signal processing (e.g., spectrum analysis and filtering), image processing (e.g., compression and denoising), audio processing (e.g., music analysis and echo cancellation), and scientific computation (e.g., simulation and solving partial differential equations). The basic principle of FFT is based on the divide-and-conquer approach, where DFT is broken down into smaller sub-problems for recursive computation. By dividing the input data into even and odd parts for separate calculations and then combining the results, the process is greatly optimized. Among the common implementations, the Cooley-Tukey FFT algorithm is the most widely used.
The decomposition method of the Cooley-Tukey FFT algorithm can be expressed as:
$$
Y(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}
= \sum_{n=0}^{N/2-1} x(2n) W_N^{2nk} + \sum_{n=0}^{N/2-1} x(2n+1) W_N^{(2n+1)k}
$$
$$
= \sum_{n=0}^{N/2-1} f(n) W_{N/2}^{nk} + \sum_{n=0}^{N/2-1} g(n) W_{N/2}^{nk}
$$
Where:
- \(f(n)\) and \(g(n)\) are complex sequences of length ${N/2}$.
- $W_N^{nk} = e^{-\frac{2\pi i}{N}nk}$ : Twiddle factor
### PerfMPL-FFT
In terms of algorithm implementation, this method used the SIMD vectorization, and vector features of the RISC-V instruction set architecture to assemble and implement the optimized FFT algorithm manually and optimized the selection and sequencing of instructions to maximize and improve the performance of the FFT algorithm library based on the RISC-V architecture, and finally realize a high-performance algorithm library based on the RISC-V CPUs. The experimental results show that: on the RISC-V platform, the PerfMPL-FFT algorithm library implemented in this paper has a better performance improvement for DFT processing of single-precision and double-precision data compared with the open-source FFTW library.
The key contributions of this paper are summarized as follows:
* Proposed optimization strategies: We introduced a set of optimization strategies and methods targeting the butterfly kernel in the Cooley-Tukey FFT algorithm.
* Designed a technology system for RISC-V: Based on the characteristics of the RISC-V architecture, we developed an optimization technology system for the FFT algorithm, enhancing its performance on the RISC-V platform.
* Implemented an efficient FFT library: We implemented a high-performance FFT algorithm library on the RISC-V platform, demonstrating significant performance advantages.
### FFTW (Fast Fourier Transform in the West)
There are various fast Fourier transform (FFT) algorithms, including open-source libraries like FFTW, as well as FFT optimization libraries specifically designed by some chip companies for their own hardware, such as Intel’s MKL_FFT and Huawei’s KML_FFT.
FFTW (Fast Fourier Transform in the West) is an open-source library designed for fast Fourier transform (FFT). It is distributed under a free software license, similar to GPL, and you can download the latest version from its official website or code repository, following the licensing terms for usage. This library enables efficient numerical computations across various computing platforms.
By leveraging pre-decomposition algorithms and caching techniques, FFTW significantly accelerates FFT computations, saving several times the computation time compared to traditional FFT algorithms. FFTW is widely applicable, not only in scientific computing and signal processing but also in fields like image and audio processing. Its exceptional performance makes it a vital tool in numerous scientific and engineering applications.
Here are the URLs for FFTW:
Official Website : http://fftw.org/
GitHub Repository : https://github.com/FFTW/fftw3
## QEMU
### What is QEMU?
QEMU (Quick Emulator) is an open-source general-purpose machine emulator and virtualization tool that primarily supports the following modes:
1. **User Mode Emulation**: In this mode, QEMU allows programs compiled for one CPU architecture to run on a different CPU architecture. eg.`qemu-riscv64`
2. **System Emulation**: In this mode, QEMU provides a complete environment for running a guest system, including the CPU, memory, and peripheral devices. eg.`qemu-system-riscv64`
When emulating the RISC-V architecture, QEMU can simulate the virt virtual development board. **While this board does not correspond to any physical hardware**, it supports the emulation of various hardware devices, making it ideal for development and testing purposes.
### QEMU for Linux
#### Install QEMU and Risc-V tool chain
```
sudo apt update
sudo apt install qemu qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils -y
git clone https://github.com/riscv/riscv-gnu-toolchain
```
Since we are working with the RISC-V architecture, we need to begin compiling U-Boot and OpenSBI.
1. Compile U-Boot
```
git clone https://github.com/u-boot/u-boot.git
cd u-boot
make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- qemu-riscv64_smode_defconfig
make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- -j$(nproc)
```
Next, the desired kernel will be generated as shown in the figure below.

2. OpenSBI
```
git clone https://github.com/riscv-software-src/opensbi.git
cd opensbi
make PLATFORM=generic CROSS_COMPILE=riscv64-linux-gnu- PLATFORM_RISCV_XLEN=64
```
Completing the above steps will successfully install OpenSBI.
Then navigate to the following path.
```
cd build/platform/generic/firmware
```
A fw_bin will then be generated, which is our firmware.

#### Download Ubuntu for RISC-V Platforms
Go to the following URL to download the corresponding IMG file (QEMU).
https://ubuntu.com/download/risc-v
#### Boot the system
Once the kernel and firmware are ready, pass them with the corresponding parameters to QEMU to boot the system.
```
qemu-system-riscv64 -machine virt -nographic -m 2048 -smp 4 -bios /home/annie/opensbi/build/platform/generic/firmware/fw_jump.bin -kernel /home/annie/u-boot/u-boot.bin -device virtio-net-device,netdev=eth0 -netdev user,id=eth0 -device virtio-rng-pci -drive file=/home/annie/ubuntu-24.10-preinstalled-server-riscv64.img,format=raw,if=virtio
```
:::info
The path should be adjusted to the correct path accordingly.
:::
:::success
Basic Commands and Parameter Descriptions
-`qemu-system-riscv64`
Launches the RISC-V 64-bit version of the QEMU emulator.
-`machine virt`
Uses the virtual machine type (virt machine type), which is a generic virtualization hardware platform provided by QEMU for RISC-V.
-`nographic`
Disables graphical output, redirecting display output to the terminal.
-`m 2048`
Allocates 2048 MB of virtual memory to the emulator.
-`smp 4`
Configures the emulator with 4 virtual CPU cores.
-`bios` /home/annie/opensbi/build/platform/generic/firmware/fw_jump.bin
Specifies the OpenSBI firmware fw_jump.bin as the BIOS for booting U-Boot.
-`kernel /home/annie/u-boot/u-boot.bin`
Loads the U-Boot binary file as the kernel bootloader.
-`device virtio-net-device,netdev=eth0`
Configures a virtual network device using the VirtIO driver and connects it to the network device with ID eth0.
-`netdev user,id=eth0`
Simulates a network connection using "user mode networking" and assigns it the ID eth0.
-`device virtio-rng-pci`
Adds a PCI-based VirtIO random number generator device to provide entropy.
-`drive file=/home/annie/ubuntu-24.10-preinstalled-server-riscv64.img,format=raw,if=virtio`
Loads the specified Ubuntu system image file (ubuntu-24.10-preinstalled-server-riscv64.img) with the following options:
-`format=raw: Reads the image file in raw format`.
if=virtio: Attaches the disk as a VirtIO device to the virtual machine.
:::
Then, wait for the system to boot.
When you see the message "Completed socket interaction for boot stage final", press `Enter` to log in.
The default username is: ubuntu
The default password is: ubuntu

Finally, check the system to confirm it is RISC-V:
```
cd /bin
uname -m
# It will show like this
ubuntu@ubuntu:/bin$ uname -m
riscv64
```
### How to create a file and run it on QEMU
1. Writing a `.s` File
Create a simple `hello.s`. Below is an example program :
```
.data
msg: .asciz "Hello, QEMU!\n"
.text
.global _start
_start:
# Write system call
li a7, 64 # syscall number for write
li a0, 1 # file descriptor (stdout)
la a1, msg # address of the message
li a2, 13 # length of the message
ecall
# Exit system call
li a7, 93 # syscall number for exit
li a0, 0 # exit code
ecall
```
:::info
The directive `.global _start` is necessary in RISC-V assembly code.
* Program Entry Point:
When an executable file is loaded for execution, the operating system looks for the program's entry point. By default, the entry point is usually `_start`.
If `_start` is not defined as a global symbol, the linker may fail to locate it, resulting in the executable being unable to run properly.
:::
2. Compiling the `.s` File
Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :
```
# Assemble the .s file into an object file
riscv64-linux-gnu-as -o hello.o hello.s
# Link the object file to create an executable
riscv64-linux-gnu-ld -o hello hello.o
```
3. Running with QEMU
Run the compiled program using QEMU :
```
qemu-riscv64 hello
```
After execution, the terminal should display :
```
Hello, QEMU!
```
### How to tanslate c into riscv
First, locate the positions of `stdio.h` and `libm.a`.
```
ubuntu@ubuntu:~$ find /usr -name stdio.h
find /usr -name libm.a
/usr/include/riscv64-linux-gnu/bits/stdio.h
/usr/include/newlib/ssp/stdio.h
/usr/include/newlib/sys/stdio.h
/usr/include/newlib/stdio.h
/usr/include/stdio.h
/usr/lib/riscv64-linux-gnu/libm.a
```
The following instruction will translate `C` into `RISC-V`.
```
riscv64-unknown-elf-gcc -O2 -S fft.c -o fft.s \
-I/usr/include/newlib \
-L/usr/lib/riscv64-linux-gnu -lm
```
This instruction will translate `c` into `elf` file.
```
gcc -o fft.elf fft.c -lm
```
### How to expand the virtual machine's space
The following code can expand the virtual machine's space.
```
qemu-img resize /home/annie/ubuntu-24.10-preinstalled-server-riscv64.img +15G
```
### QEMU for Mac
:::info
Use `$` to operate in terminal
:::
> environment: Macbook Air M3 Sonoma 14.6
:::danger
$ riscv32-unknown-elf-gcc -march=rv32i_zba_zbb_zbc_zbs_v -mabi=ilp32 -S -o output.s input.c
:::
**Use [Homebrew](https://brew.sh/) to install QEMU**
```
$ brew install qemu
```
**Check whether RISC-V has been supported**
```
$ qemu-system-riscv32 --version
```
If we have installed correctly, we get the version below:
( version 9.2.0 in 2025/1 )
```
QEMU emulator version 9.2.0
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers
```
Then we compile RV32I/B extension/RVV via RISC-V GCC tools
**Installation**
```
$ brew tap riscv-software-src/homebrew-riscv
$ brew install riscv-tools
```
**Verification**
```
$ riscv64-unknown-elf-gcc --version
```
:::info
You may notice that the instruction in above is "riscv64".
It's normal.
Because most RISC-V toolchains target 64-bit architectures (riscv64) by default. However, **these tools usually support 32-bit targets as well**, by specifying the appropriate architecture flags at compile time.
:::
If it shows that `commend not found`, checking whether it is in the environment variable.
```
$ echo $PATH
```
If it is't, we add it manually and make sure again.
```
$ export PATH="/opt/homebrew/opt/riscv-gnu-toolchain/bin:$PATH"
$ source ~/.zshrc
```
**Compile the C file**
```
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -o test.elf test.c
```
- `-march=rv32i`:specity architecture as RV32I
- `-mabi=ilp32`:specify ABI(Application Binary Interface)as 32-bit integer
### Issues encountered in QEMU
* Unable to use ecall 1 in QEMU (unable to directly print numbers).
solution :
Convert a word to ASCII. This is used for all numbers greater than or equal to 0.
```
.data
buffer: .space 21 # String buffer for the converted number
newline: .asciz "\n" # Newline character
.text
.global _start
_start:
int_to_string:
addi sp, sp, -16 # Save stack frame
sw ra, 12(sp) # Save return address
sw t0, 8(sp) # Save t0
sw t1, 4(sp) # Save t1
mv t0, a0 # Copy input number (a0) to t0
addi t1, a1, 16 # Point to the end of the buffer
li a0, 0 # Initialize the length counter
loop:
li t3, 10 # Load divisor (10)
rem t2, t0, t3 # t2 = t0 % 10 (remainder)
addi t2, t2, '0' # Convert the remainder to ASCII character
addi t1, t1, -1 # Move the buffer pointer back
sb t2, 0(t1) # Store the character in the buffer
div t0, t0, t3 # t0 = t0 / 10 (integer division)
addi a0, a0, 1 # Increment the length counter
bnez t0, loop # Repeat if t0 != 0
mv a1, t1 # Return the string starting address in a1
lw ra, 12(sp) # Restore return address
lw t0, 8(sp) # Restore t0
lw t1, 4(sp) # Restore t1
addi sp, sp, 16 # Restore stack frame
ret # Return to caller
```
## Code
### Part1 : logint
#### Functionality
The function calculates $\log_2(N)$, which determines the power of 2 closest to $N$.
#### Logic
* Each iteration right-shifts $N$ by 1 bit, effectively simulating a "divide by 2" operation.
* For every right shift, the counter $i$ increments, representing how many times $N$ can be divided by 2.
* The loop stops when $k$ becomes 0. At this point, $i$ contains $\log_2(N) + 1$. Subtracting 1 gives the correct result.
code :
```c=
/* Calculates the log2 of number */
int logint(int N)
{
int k = N, i = 0;
while (k) {
k >>= 1;
i++;
}
return i - 1;
}
```
### Convert to RISC-V assembly:
```riscv=
# Takes input N (a0), returns its log base 2 in a0
logint:
addi sp, sp, -4
sw t0, 0(sp) # store t0
add t0, a0, zero # k = N
add a0, zero, zero # i = 0
logloop:
beq t0, zero, logloop_end # Exit if k == 0
srai t0, t0, 1 # k >>= 1
addi a0, a0, 1 # i++
j logloop
logloop_end:
addi a0, a0, -1 # Return i - 1 as the counter was incremented one extra time.
lw t0, 0(sp)
addi sp, sp, 4
jr ra
```
#### Compiling the `.s` File in RV32I
Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :
```
riscv64-linux-gnu-as -o logint.o logint.s
riscv64-linux-gnu-ld -o logint logint.o
qemu-riscv64 logint
```
Using this instruction will see the time the file taken.
```
time qemu-riscv64 logint
```
for example :
```
real 0m0.139s
user 0m0.053s
sys 0m0.074s
```
### B-extension
```riscv=
set_log2n:
clz t0, a0 # Count leading zeros in a0
li t1, 31 # Maximum index for a 32-bit register
sub t1, t1, t0 # Compute log2(N)
la t0, logsize # Load address of 'logsize'
sw t1, 0(t0)
jr ra # Return to caller
logsize: .word 0
```
#### Compiling the `.s` File in B extension
Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :
```
riscv64-linux-gnu-as -march=rv64imafdzbb -o logint.o logint.s
riscv64-linux-gnu-ld -o logint logint.o
time qemu-riscv64 logint
```
### vector extension
```riscv=
# Calculate log base 2 of input
# Input:a0 = N
# Output: a0 = log2(N)
logint:
clz a0, a0 # Count leading zeros of a0
addi t0, zero, 31 # Load 31 (32-bit word size - 1)
sub a0, t0, a0
jr ra# Return
```
### Part2 : bitwise reverse
#### Functionality
The function `reverse` is used to reverse the binary representation of a number $n$ based on the bit length determined by the maximum number $N$.
- Input :
- $N$: Used to determine the bit length (calculated using $\log_2(N)$).
- $n$: The number whose binary representation needs to be reversed.
- **Output**:
- Returns the reversed binary representation of $n$.
#### Logic
* Determine Bit Length :
- Use $\log_2(N)$ to determine the number of bits required for $N$.
- For example, if $N = 18$, then $\log_2(18) \approx 4$, so the bit length is 4.
* Check and Reverse Bits :
- Use a loop to check each bit of $n$ starting from the $\text{logint}(N) - j$ position.
- If the bit at this position is 1, set the corresponding position $(j - 1)$ in the reversed binary number to 1.
- This is done using bitwise operations `&` and `|`.
* Return Result:
- Finally, return the reversed number $p$.
code :
```c=
/* Bitwise reverses the number */
int reverse(int N, int n)
{
int j, p = 0;
for (j = 1; j <= logint(N); j++) {
if (n & (1 << (logint(N) - j))) {
p |= 1 << (j - 1);
}
}
return p;
}
```
### Convert to RISC-V assembly:
```riscv=
# Takes inputs N(a0) and n(a1), reverses the number in binary
reverse:
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
call logint # a0 = lg(N)
addi s0, zero, 1 # j = 1
add s1, zero, zero # p = 0
forloop_reverse:
bgt s0, a0, forloop_reverse_end # if j > lg(N) then end loop
sub s2, a0, s0
addi s3, zero, 1
sll s3, s3, s2
and s3, a1, s3
beq s3, zero, elses3 # If not, skip
ifs3:
addi s4, s0, -1 # s4 = j - 1
addi s5, zero, 1
sll s5, s5, s4
or s1, s1, s5
elses3:
addi s0, s0, 1
j forloop_reverse
forloop_reverse_end:
add a0, s1, zero # Return p
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 28
jr ra
```
### B-extension
```riscv=
# Function: reverse
# Reverses the binary digits of a 32-bit integer.
# Inputs:
# - a0: Input number to reverse.
# - a1: Number of significant bits to reverse (optional; default 32).
# Outputs:
# - a0: The reversed binary number.
# Clobbers:
# - t0, t1, t2
reverse:
# Swap odd and even bits
lui t0, 0x55555 # Load upper 20 bits of 0x55555555
ori t0, t0, 0x555 # Complete loading 0x55555555
srli t1, a0, 1 # v >> 1
and t1, t1, t0 # (v >> 1) & 0x55555555
and t2, a0, t0
slli t2, t2, 1
or a0, t1, t2 # Store result in a0
# Swap consecutive pairs
lui t0, 0x33333 # Load upper 20 bits of 0x33333333
ori t0, t0, 0x333 # Complete loading 0x33333333
srli t1, a0, 2 # v >> 2
and t1, t1, t0 # (v >> 2) & 0x33333333
and t2, a0, t0
slli t2, t2, 2
or a0, t1, t2 # Store result in a0
# Swap nibbles
lui t0, 0x0F0F0 # Load upper 20 bits of 0x0F0F0F0F
ori t0, t0, 0x0F0 # Complete loading 0x0F0F0F0F
srli t1, a0, 4
and t1, t1, t0
and t2, a0, t0
slli t2, t2, 4
or a0, t1, t2 # Store result in a0
# Swap bytes
lui t0, 0x00FF0# Load upper 20 bits of 0x00FF00FF
ori t0, t0, 0x0FF# Complete loading 0x00FF00FF
srli t1, a0, 8
and t1, t1, t0
and t2, a0, t0
slli t2, t2, 8
or a0, t1, t2 # Store result in a0
# Swap 2-byte pairs
srli t1, a0, 16
slli t2, a0, 16
ora0, t1, t2 # Final result in a0
# Adjust number of significant bits
addi t1, zero, 32 # Load immediate value 32
sub t1, t1, a1 # Calculate shift amount
srl a0, a0, t1 # Shift right to fit bit count
jr ra # Return with result in a0
```
### vector extension
```riscv=
# Initialize the helper vector with sequential integers (0,1,2,...)
init_helper_vector:
la t0, helperVector # Load the base address of helperVector
lw t1, size # Load the size of the vector (N)
vsetvli t2, t1, e32, m1 # Set the vector length configuration
vid.v v0 # Generate a vector of indices
vse32.v v0, (t0) # Store the generated vector in helperVector
ret
# Bit-reverse elements of a vector
# Input:v29 = input vector, a0 = N
# Output: v29 = bit-reversed vector
vreverse:
addi sp, sp, -4 # Save return address
sw ra, 0(sp)
call logint # Compute log2(N) in a0
lw ra, 0(sp) # Restore return address
addi sp, sp, 4
li t1, 1 # Constant 1 for bit manipulation
li t0, 1 # Initialize bit position counter j
vmv.v.x v28, zero # Clear result vector v28
vreverse_loop: # Loop for j <= log2(N)
bgt t0, a0, vreverse_end # Break if j > log2(N)
sub t2, a0, t0 # t2 = log2(N) - j
sll t3, t1, t2 # t3 = (1 << (log2(N) - j))
vand.vx v0, v29, t3 # v0 = v29 & t3 (check bit)
# vmsne:Vector Mask Set Not Equa
vmsne.vx v0, v0, zero # v0 = 1 if set, else 0
addi t4, t0, -1 # t4 = j - 1
sll t3, t1, t4 # t3 = (1 << (j - 1))
vor.vx v28, v28, t3, v0.t # Set bit in v28 where mask is true
addi t0, t0, 1 # Increment bit position
j vreverse_loop
vreverse_end:
vmv.v.v v29, v28 # Store result in v29
jr ra # Return
```
```
The usage of vreverse would look something like this:
```c
mv a0, a2 # a0 = N, input for reverse function. Overwriting a0 is fine
vmv.v.v v29, v26 # v29 = v26 = <i>, used as input for vreverse
call vreverse # v29 now contains rev(N, <i>), keep it for later use
```
- la t0, helperVector:
This instruction loads the base address of helperVector into the register t0. helperVector is an array or buffer used to store the generated vector.
- lw t1, size:
This instruction loads an integer value (the size of the vector N) from memory into the register t1. This value represents the number of elements in the vector to be generated.
- vsetvli t2, t1, e32, m1:
This instruction sets the vector length and type. It sets the vector length to t1 (i.e., N), with each element being 32 bits (e32) and a multiplier factor of 1 (m1). This means that the subsequent vector operations will target N 32-bit elements.
- vid.v v0:
This instruction generates a vector v0 with elements consisting of consecutive index values (0, 1, 2, ..., N-1). It is an initialization vector typically used for testing or as indices for data.
- vse32.v v0, (t0):
This instruction stores the data in vector v0 to the memory address pointed to by t0 (i.e., helperVector). This means the generated index vector will be written into helperVector.
## Application to FFT
This code implements the Fast Fourier Transform (FFT) algorithm, which can be used to analyze the frequency components of a signal.
Given an input signal of [1, 1, 1, 1, 0, 0, 0, 0], the FFT result will display the spectrum of the signal. The real and imaginary parts of the result correspond to the magnitude and phase in the frequency domain.
```c=
#include <stdio.h>
#include <math.h>
#include <complex.h>
#define PI 3.14159265358979323846
/* Calculates the log2 of number */
int logint(int N) {
int k = N, i = 0;
while (k) {
k >>= 1;
i++;
}
return i - 1;
}
/* Bitwise reverses the number */
int reverse(int N, int n) {
int j, p = 0;
for (j = 1; j <= logint(N); j++) {
if (n & (1 << (logint(N) - j))) {
p |= 1 << (j - 1);
}
}
return p;
}
/* Performs the FFT */
void fft(complex double *a, int N) {
// Bit-reversal permutation
for (int i = 0; i < N; i++) {
int rev = reverse(N, i);
if (i < rev) {
complex double temp = a[i];
a[i] = a[rev];
a[rev] = temp;
}
}
// FFT computation
for (int s = 1; s <= logint(N); s++) {
int m = 1 << s; // 2^s
complex double wm = cexp(-2.0 * PI * I / m);
for (int k = 0; k < N; k += m) {
complex double w = 1.0;
for (int j = 0; j < m / 2; j++) {
complex double t = w * a[k + j + m / 2];
complex double u = a[k + j];
a[k + j] = u + t;
a[k + j + m / 2] = u - t;
w *= wm;
}
}
}
}
int main() {
int N = 8; // Length of input (must be a power of 2)
complex double a[] = {1, 1, 1, 1, 0, 0, 0, 0}; // Example input
printf("Input:\n");
for (int i = 0; i < N; i++) {
printf("%lf + %lfi\n", creal(a[i]), cimag(a[i]));
}
fft(a, N);
printf("\nOutput:\n");
for (int i = 0; i < N; i++) {
printf("%lf + %lfi\n", creal(a[i]), cimag(a[i]));
}
return 0;
}
```
Using instruction translate to assembly.
```
riscv64-linux-gnu-gcc -O3 -march=rv64imafdzba_zbb_zbc_zbs_v -S -o output.s fft.c
riscv64-linux-gnu-as -march=rv64imafdzba_zbb_zbc_zbs_v -o output.o output.s
riscv64-linux-gnu-gcc -o output output.o -lm
```
:::danger
When we encountered issues installing riscv32-linux-gnu-gcc, we decided to switch to using rv64i instead.
:::
``` riscv=
.file "fft.c"
.option pic
.attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
.attribute unaligned_access, 0
.attribute stack_align, 16
.text
.align 2
.globl logint
.type logint, @function
logint:
.LFB23:
.cfi_startproc
mv a5,a0
beq a0,zero,.L4
li a4,0
.L3:
sraiw a5,a5,1
mv a0,a4
addiw a4,a4,1
bne a5,zero,.L3
ret
.L4:
li a0,-1
ret
.cfi_endproc
.LFE23:
.size logint, .-logint
.align 2
.globl reverse
.type reverse, @function
reverse:
.LFB24:
.cfi_startproc
bne a0,zero,.L24
ret
.L24:
li a2,1
mv a7,a2
li a6,0
.L9:
mv a5,a0
li a4,0
.L13:
sraiw a5,a5,1
addiw a4,a4,1
bne a5,zero,.L13
ble a4,a2,.L25
mv a4,a0
.L11:
sraiw a4,a4,1
mv a3,a5
addiw a5,a5,1
bne a4,zero,.L11
subw a3,a3,a2
sraw a3,a1,a3
andi a3,a3,1
beq a3,zero,.L12
addiw a5,a2,-1
sllw a5,a7,a5
or a6,a6,a5
.L12:
addiw a2,a2,1
j .L9
.L25:
mv a0,a6
ret
.cfi_endproc
.LFE24:
.size reverse, .-reverse
.align 2
.globl fft
.type fft, @function
fft:
.LFB25:
.cfi_startproc
addi sp,sp,-192
.cfi_def_cfa_offset 192
sd s3,152(sp)
sd s8,112(sp)
sd ra,184(sp)
.cfi_offset 19, -40
.cfi_offset 24, -80
.cfi_offset 1, -8
mv s8,a1
mv s3,a0
ble a1,zero,.L27
mv a7,a0
li a6,1
li a0,0
vsetivli zero,2,e64,m1,ta,ma
.L28:
li a1,0
li a2,1
.L33:
mv a5,s8
li a4,0
.L31:
sraiw a5,a5,1
addiw a4,a4,1
bne a5,zero,.L31
bge a2,a4,.L63
mv a4,s8
.L29:
sraiw a4,a4,1
mv a3,a5
addiw a5,a5,1
bne a4,zero,.L29
subw a3,a3,a2
sraw a3,a0,a3
andi a3,a3,1
beq a3,zero,.L30
addiw a5,a2,-1
sllw a5,a6,a5
or a1,a1,a5
.L30:
addiw a2,a2,1
j .L33
.L63:
ble a1,a0,.L32
slli a5,a1,4
add a5,s3,a5
vle64.v v2,0(a5)
vle64.v v1,0(a7)
vse64.v v2,0(a7)
vse64.v v1,0(a5)
.L32:
addiw a0,a0,1
addi a7,a7,16
bne s8,a0,.L28
.L27:
bne s8,zero,.L64
.L55:
ld ra,184(sp)
.cfi_remember_state
.cfi_restore 1
ld s3,152(sp)
.cfi_restore 19
ld s8,112(sp)
.cfi_restore 24
addi sp,sp,192
.cfi_def_cfa_offset 0
jr ra
.L64:
.cfi_restore_state
fsd fs0,72(sp)
fsd fs1,64(sp)
fsd fs4,40(sp)
.cfi_offset 40, -120
.cfi_offset 41, -128
.cfi_offset 52, -152
fld fs1,.LC2,a5
fld fs0,.LC3,a5
fld fs4,.LC1,a5
sd s5,136(sp)
.cfi_offset 21, -56
li s5,1
sd s4,144(sp)
sd s0,176(sp)
sd s1,168(sp)
sd s2,160(sp)
sd s6,128(sp)
sd s7,120(sp)
sd s9,104(sp)
sd s10,96(sp)
sd s11,88(sp)
fsd fs2,56(sp)
fsd fs3,48(sp)
.cfi_offset 20, -48
.cfi_offset 8, -16
.cfi_offset 9, -24
.cfi_offset 18, -32
.cfi_offset 22, -64
.cfi_offset 23, -72
.cfi_offset 25, -88
.cfi_offset 26, -96
.cfi_offset 27, -104
.cfi_offset 50, -136
.cfi_offset 51, -144
mv s4,s5
.L38:
mv a5,s8
li a4,0
.L43:
sraiw a5,a5,1
addiw a4,a4,1
bne a5,zero,.L43
ble a4,s5,.L65
sllw s10,s4,s5
ble s8,zero,.L36
li a5,1
ble s10,a5,.L36
fcvt.d.w fa1,s10
srliw s6,s10,31
addw s6,s6,s10
fdiv.d fa0,fs1,fa1
sraiw s6,s6,1
slli s11,s10,4
mv s7,s3
li s9,0
fdiv.d fa1,fs0,fa1
call cexp@plt
fmv.d fs3,fa0
fmv.d fs2,fa1
slli a3,s6,4
.L42:
fmv.d.x fa3,zero
fmv.d fa2,fs4
add s1,a3,s7
mv s0,s7
li s2,0
.L41:
fld fa1,8(s1)
fld fa0,0(s1)
fmul.d ft0,fa1,fa2
fmul.d ft1,fa1,fa3
fmadd.d ft0,fa0,fa3,ft0
fmsub.d ft1,fa0,fa2,ft1
feq.d a4,ft0,ft0
feq.d a5,ft1,ft1
and a5,a5,a4
beq a5,zero,.L66
.L39:
fmul.d fa5,fs3,fa3
fmul.d fa4,fs2,fa3
fld ft3,0(s0)
fld ft2,8(s0)
fadd.d fa0,ft1,ft3
fadd.d fa1,ft0,ft2
fmadd.d fa5,fs2,fa2,fa5
fmsub.d fa4,fs3,fa2,fa4
fsub.d ft3,ft3,ft1
fsub.d ft2,ft2,ft0
fsd fa0,0(s0)
fsd fa1,8(s0)
feq.d a4,fa5,fa5
feq.d a5,fa4,fa4
fsd ft3,0(s1)
fsd ft2,8(s1)
and a5,a5,a4
beq a5,zero,.L67
.L40:
addiw s2,s2,1
fmv.d fa2,fa4
fmv.d fa3,fa5
addi s1,s1,16
addi s0,s0,16
bgt s6,s2,.L41
addw s9,s10,s9
add s7,s7,s11
bgt s8,s9,.L42
.L36:
addiw s5,s5,1
j .L38
.L67:
fmv.d fa1,fa3
fmv.d fa0,fa2
fmv.d fa3,fs2
fmv.d fa2,fs3
sd a3,8(sp)
call __muldc3@plt
fmv.d fa4,fa0
fmv.d fa5,fa1
ld a3,8(sp)
j .L40
.L66:
sd a3,24(sp)
fsd fa3,16(sp)
fsd fa2,8(sp)
call __muldc3@plt
fmv.d ft1,fa0
fmv.d ft0,fa1
ld a3,24(sp)
fld fa3,16(sp)
fld fa2,8(sp)
j .L39
.L65:
ld s0,176(sp)
.cfi_restore 8
ld s1,168(sp)
.cfi_restore 9
ld s2,160(sp)
.cfi_restore 18
ld s4,144(sp)
.cfi_restore 20
ld s5,136(sp)
.cfi_restore 21
ld s6,128(sp)
.cfi_restore 22
ld s7,120(sp)
.cfi_restore 23
ld s9,104(sp)
.cfi_restore 25
ld s10,96(sp)
.cfi_restore 26
ld s11,88(sp)
.cfi_restore 27
fld fs0,72(sp)
.cfi_restore 40
fld fs1,64(sp)
.cfi_restore 41
fld fs2,56(sp)
.cfi_restore 50
fld fs3,48(sp)
.cfi_restore 51
fld fs4,40(sp)
.cfi_restore 52
j .L55
.cfi_endproc
.LFE25:
.size fft, .-fft
.section .rodata.str1.8,"aMS",@progbits,1
.align 3
.LC4:
.string "Input:"
.align 3
.LC5:
.string "%lf + %lfi\n"
.align 3
.LC6:
.string "\nOutput:"
.section .text.startup,"ax",@progbits
.align 2
.globl main
.type main, @function
main:
.LFB26:
.cfi_startproc
vsetivli zero,16,e64,m8,ta,ma
lla a5,.LANCHOR0
vle64.v v8,0(a5)
addi sp,sp,-192
.cfi_def_cfa_offset 192
sd s4,144(sp)
.cfi_offset 20, -48
la s4,__stack_chk_guard
sd s0,176(sp)
sd s1,168(sp)
sd s2,160(sp)
sd s3,152(sp)
sd ra,184(sp)
.cfi_offset 8, -16
.cfi_offset 9, -24
.cfi_offset 18, -32
.cfi_offset 19, -40
.cfi_offset 1, -8
addi s3,sp,8
ld a5, 0(s4)
sd a5, 136(sp)
li a5, 0
vse64.v v8,0(s3)
lla a0,.LC4
mv s0,s3
addi s2,sp,136
mv s1,s3
call puts@plt
.L69:
ld a3,8(s1)
ld a2,0(s1)
lla a1,.LC5
li a0,2
addi s1,s1,16
call __printf_chk@plt
bne s1,s2,.L69
mv a0,s3
li a1,8
call fft
lla a0,.LC6
call puts@plt
.L70:
ld a3,8(s0)
ld a2,0(s0)
lla a1,.LC5
li a0,2
addi s0,s0,16
call __printf_chk@plt
bne s0,s2,.L70
ld a4, 136(sp)
ld a5, 0(s4)
xor a5, a4, a5
li a4, 0
bne a5,zero,.L78
ld ra,184(sp)
.cfi_remember_state
.cfi_restore 1
ld s0,176(sp)
.cfi_restore 8
ld s1,168(sp)
.cfi_restore 9
ld s2,160(sp)
.cfi_restore 18
ld s3,152(sp)
.cfi_restore 19
ld s4,144(sp)
.cfi_restore 20
li a0,0
addi sp,sp,192
.cfi_def_cfa_offset 0
jr ra
.L78:
.cfi_restore_state
call __stack_chk_fail@plt
.cfi_endproc
.LFE26:
.size main, .-main
.section .rodata.cst8,"aM",@progbits,8
.align 3
.LC1:
.word 0
.word 1072693248
.align 3
.LC2:
.word 0
.word -2147483648
.align 3
.LC3:
.word 1413754136
.word -1072094725
.section .rodata
.align 3
.set .LANCHOR0,. + 0
.LC0:
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.ident "GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0"
.section .note.GNU-stack,"",@progbits
```
```
time qemu-riscv64 output
real 0m0.600s
user 0m0.405s
sys 0m0.203s
```
Then we improve the code.
```riscv=
.file "fft.c"
.option pic
.attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
.attribute unaligned_access, 0
.attribute stack_align, 16
.text
.align 2
.globl logint
.type logint, @function
logint:
.LFB23:
.cfi_startproc
clz a5,a0
li a4,31
sub a0,a4,a5
ret
.cfi_endproc
.LFE23:
.size logint, .-logint
.align 2
.globl reverse
.type reverse, @function
reverse:
.LFB24:
.cfi_startproc
bne a0, zero, .L24
ret
.L24:
li a2, 1
li a6, 0
.L9:
mv a5, a0
li a4, 0
.L13:
srai a5, a5, 1
addi a4, a4, 1
bne a5, zero, .L13
ble a4, a2, .L25
srl a3, a1, a2
andi a3, a3, 1
beqz a3, .L12
sll a5, a6, a2
or a6, a6, a5
.L12:
addi a2, a2, 1
j .L9
.L25:
mv a0, a6
ret
.cfi_endproc
.LFE24:
.size reverse, .-reverse
.align 2
.globl fft
.type fft, @function
fft:
.LFB25:
.cfi_startproc
addi sp, sp, -192
.cfi_def_cfa_offset 192
sd s3, 152(sp)
sd s8, 112(sp)
sd ra, 184(sp)
.cfi_offset 19, -40
.cfi_offset 24, -80
.cfi_offset 1, -8
mv s8, a1
mv s3, a0
blez a1, .L27 # 若 a1 <= 0,跳轉到.L27
li a0, 0
li a6, 1
mv a7, a0 # 初始化指標
vsetivli zero, 2, e64, m1, ta, ma
.L28:
li a1, 0 # 初始化暫存變數
li a2, 1
.L33:
mv a5, s8
li a4, 0
.L31:
srai a5, a5, 1
addi a4, a4, 1
bnez a5, .L31
bge a2, a4, .L63 # 若 a2 >= a4,跳轉結束
srl a3, a0, a2 # 提取位元
andi a3, a3, 1
beqz a3, .L30 # 若該位為 0,跳過 OR 操作
sll a5, a6, a2
or a1, a1, a5 # 累加結果
.L30:
addi a2, a2, 1
j .L33
.L63:
ble a1, a0, .L32 # 若 a1 <= a0,跳轉到.L32
slli a5, a1, 4 # 計算偏移
add a5, s3, a5
vle64.v v2, (a5) # 加載向量
vle64.v v1, (a7)
vse64.v v2, (a7)
vse64.v v1, (a5)
.L32:
addi a0, a0, 1
addi a7, a7, 16
bne s8, a0, .L28
.L27:
ld ra, 184(sp)
ld s3, 152(sp)
ld s8, 112(sp)
addi sp, sp, 192
.cfi_def_cfa_offset 0
jr ra
.cfi_endproc
.LFE25:
.size fft, .-fft
.section .rodata.str1.8,"aMS",@progbits,1
.align 3
.LC4:
.string "Input:"
.align 3
.LC5:
.string "%lf + %lfi\n"
.align 3
.LC6:
.string "\nOutput:"
.section .text.startup,"ax",@progbits
.align 2
.globl main
.type main, @function
main:
.LFB26:
.cfi_startproc
addi sp, sp, -192
.cfi_def_cfa_offset 192
sd ra, 184(sp)
sd s0, 176(sp)
sd s1, 168(sp)
sd s2, 160(sp)
sd s3, 152(sp)
sd s4, 144(sp)
.cfi_offset 1, -8
.cfi_offset 8, -16
.cfi_offset 9, -24
.cfi_offset 18, -32
.cfi_offset 19, -40
.cfi_offset 20, -48
vsetivli zero, 16, e64, m8, ta, ma
lla a5, .LANCHOR0
vle64.v v8, 0(a5)
addi s3, sp, 8
vse64.v v8, 0(s3)
la s4, __stack_chk_guard
ld a5, 0(s4)
sd a5, 136(sp)
lla a0, .LC4
call puts@plt
mv s0, s3
mv s1, s3
addi s2, sp, 136
.L69:
ld a3, 8(s1)
ld a2, 0(s1)
lla a1, .LC5
li a0, 2
addi s1, s1, 16
call __printf_chk@plt
bne s1, s2, .L69
mv a0, s3
li a1, 8
call fft
lla a0, .LC6
call puts@plt
.L70:
ld a3, 8(s0)
ld a2, 0(s0)
lla a1, .LC5
li a0, 2
addi s0, s0, 16
call __printf_chk@plt
bne s0, s2, .L70
ld a4, 136(sp)
ld a5, 0(s4)
xor a5, a4, a5
beqz a5, .L71
call __stack_chk_fail@plt
j .L72
.L71:
ld ra, 184(sp)
ld s0, 176(sp)
ld s1, 168(sp)
ld s2, 160(sp)
ld s3, 152(sp)
ld s4, 144(sp)
addi sp, sp, 192
.cfi_def_cfa_offset 0
jr ra
.L72:
.cfi_endproc
.LFE26:
.size main, .-main
.section .rodata.cst8,"aM",@progbits,8
.align 3
.LC1:
.word 0
.word 1072693248
.align 3
.LC2:
.word 0
.word -2147483648
.align 3
.LC3:
.word 1413754136
.word -1072094725
.section .rodata
.align 3
.set .LANCHOR0,. + 0
.LC0:
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.ident "GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0"
.section .note.GNU-stack,"",@progbits
```
:::info
Difference Between `lla` and `la`
lla (Load Address Absolute) : Only handles absolute addresses and does not account for a base address (e.g., PC-relative or GOT).
la (Load Address : Adjusts the address based on the target's access model, such as PC-relative or using the Global Offset Table (GOT).
:::
We improved the code, reducing nearly 200 lines and shortening the execution time.
```
real 0m0.498s
user 0m0.373s
sys 0m0.132s
```
## Result
The following is the result of the output.
In this task, we accelerated its execution time by using b-extension and RVV.
```
Input:
1.000000 + 0.000000i
1.000000 + 0.000000i
1.000000 + 0.000000i
1.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
Output:
4.000000 + 0.000000i
1.000000 + -2.414214i
0.000000 + 0.000000i
1.000000 + -0.414214i
0.000000 + 0.000000i
1.000000 + 0.414214i
0.000000 + 0.000000i
1.000000 + 2.414214i
```
## Reference
* [Quiz5 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz5-sol#Problem-D)
* [Optimization of the FFT Algorithm on RISC-V CPUs - 1](https://dl.acm.org/doi/abs/10.1007/978-3-031-40843-4_38)
* [Optimization of the FFT algorithm on RISC-V CPUs - 2](https://riscv.epcc.ed.ac.uk/assets/files/isc23/zhang_isc23.pdf)
* [RISC-V public beta platform released · FFTW porting and performance comparison](https://forum.sophgo.com/t/risc-v-public-beta-platform-released-fftw-porting-and-performance-comparison/278)
* [Ubuntu server + RISC-V + Qemu](https://medium.com/wei-zen-liu/ubuntu-server-risc-v-qemu-%E7%B0%A1%E5%96%AE%E4%B8%8A%E6%89%8B%E7%89%88-ca1d9700bc7)
* [Download Ubuntu for RISC-V Platforms](https://ubuntu.com/download/risc-v)
* [(Science Popularization) Fourier Transform](https://www.youtube.com/watch?v=wRNMrTHGJOY&t=180s)
* [github - riscv-gnu-toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain)
* [github - FFTW](https://github.com/FFTW/fftw3)