Try   HackMD

Accelerate Fast Fourier Transform

謝廷昇, 王晴文

GitHub

Introduction of the task

Task : Research FFT and its improvement schemes.

Implement the following:
RV32I : Develop the base RV32I ISA implementation.
Using B Extension : Integrate and utilize the Bit-Manipulation (B) extension.
Using RVV : Incorporate and utilize the RISC-V Vector Extension (RVV).

The code should be validated using QEMU.

Finally, prepare a signal processing application that utilizes your improved FFT as a technical demonstration.

Introduction to Fourier transform

DFT (Discrete Fourier transform)

The discrete Fourier transform (DFT) is defined as follows:

Y(k)=n=0N1x(n)WNnk=n=0N1x(n)e2πiNnk

Where:

  • (x(n)) : Input sequence
  • (Y(k)) : Output sequence
  • WNnk=e2πiNnk : Twiddle factor
  • i=1 : Imaginary unit

The Discrete Fourier Transform (DFT) is a mathematical tool that transforms a signal from the time domain to the frequency domain. Its primary purpose is to analyze the spectral characteristics of a signal. DFT is widely used in various fields, including digital signal processing, image processing, sound analysis, and scientific and engineering applications.

Characteristics of DFT:

  • Spectrum Analysis: Decomposes the input signal into a superposition of different frequency components.
  • Inverse Transform (IDFT): Allows the reconstruction of the time-domain signal from the frequency-domain data.
  • High Complexity: The computational complexity of a standard DFT is O( N2 ) , which increases significantly with the length of the input signal.

Applications of DFT:

  • Signal Processing: Filtering, modulation, and spectrum analysis.
    Image Processing: Edge detection and compression (e.g., discrete cosine transform in JPEG).
  • Engineering and Science: Seismic analysis and structural vibration studies.

Cooley-Tukey FFT (Fast Fourier Transform)

An efficient algorithm used for rapidly computing the Discrete Fourier Transform and its inverse . Fourier Transform is a core technique in signal processing for analyzing frequency-domain characteristics, and FFT reduces computational complexity, making the application of Fourier Transform more efficient.

Direct computation of DFT requires O( N2 ) computational complexity (where N is the number of data points), while FFT reduces the complexity to O( NlogN ), significantly improving efficiency. Its applications include signal processing (e.g., spectrum analysis and filtering), image processing (e.g., compression and denoising), audio processing (e.g., music analysis and echo cancellation), and scientific computation (e.g., simulation and solving partial differential equations). The basic principle of FFT is based on the divide-and-conquer approach, where DFT is broken down into smaller sub-problems for recursive computation. By dividing the input data into even and odd parts for separate calculations and then combining the results, the process is greatly optimized. Among the common implementations, the Cooley-Tukey FFT algorithm is the most widely used.

The decomposition method of the Cooley-Tukey FFT algorithm can be expressed as:
Y(k)=n=0N1x(n)WNnk=n=0N/21x(2n)WN2nk+n=0N/21x(2n+1)WN(2n+1)k

=n=0N/21f(n)WN/2nk+n=0N/21g(n)WN/2nk

Where:

  • (f(n)) and (g(n)) are complex sequences of length N/2.
  • WNnk=e2πiNnk : Twiddle factor

PerfMPL-FFT

In terms of algorithm implementation, this method used the SIMD vectorization, and vector features of the RISC-V instruction set architecture to assemble and implement the optimized FFT algorithm manually and optimized the selection and sequencing of instructions to maximize and improve the performance of the FFT algorithm library based on the RISC-V architecture, and finally realize a high-performance algorithm library based on the RISC-V CPUs. The experimental results show that: on the RISC-V platform, the PerfMPL-FFT algorithm library implemented in this paper has a better performance improvement for DFT processing of single-precision and double-precision data compared with the open-source FFTW library.

The key contributions of this paper are summarized as follows:

  • Proposed optimization strategies: We introduced a set of optimization strategies and methods targeting the butterfly kernel in the Cooley-Tukey FFT algorithm.
  • Designed a technology system for RISC-V: Based on the characteristics of the RISC-V architecture, we developed an optimization technology system for the FFT algorithm, enhancing its performance on the RISC-V platform.
  • Implemented an efficient FFT library: We implemented a high-performance FFT algorithm library on the RISC-V platform, demonstrating significant performance advantages.

FFTW (Fast Fourier Transform in the West)

There are various fast Fourier transform (FFT) algorithms, including open-source libraries like FFTW, as well as FFT optimization libraries specifically designed by some chip companies for their own hardware, such as Intel’s MKL_FFT and Huawei’s KML_FFT.

FFTW (Fast Fourier Transform in the West) is an open-source library designed for fast Fourier transform (FFT). It is distributed under a free software license, similar to GPL, and you can download the latest version from its official website or code repository, following the licensing terms for usage. This library enables efficient numerical computations across various computing platforms.

By leveraging pre-decomposition algorithms and caching techniques, FFTW significantly accelerates FFT computations, saving several times the computation time compared to traditional FFT algorithms. FFTW is widely applicable, not only in scientific computing and signal processing but also in fields like image and audio processing. Its exceptional performance makes it a vital tool in numerous scientific and engineering applications.

Here are the URLs for FFTW:
Official Website : http://fftw.org/
GitHub Repository : https://github.com/FFTW/fftw3

QEMU

What is QEMU?

QEMU (Quick Emulator) is an open-source general-purpose machine emulator and virtualization tool that primarily supports the following modes:

  1. User Mode Emulation: In this mode, QEMU allows programs compiled for one CPU architecture to run on a different CPU architecture. eg.qemu-riscv64

  2. System Emulation: In this mode, QEMU provides a complete environment for running a guest system, including the CPU, memory, and peripheral devices. eg.qemu-system-riscv64

When emulating the RISC-V architecture, QEMU can simulate the virt virtual development board. While this board does not correspond to any physical hardware, it supports the emulation of various hardware devices, making it ideal for development and testing purposes.

QEMU for Linux

Install QEMU and Risc-V tool chain

sudo apt update
sudo apt install qemu qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils -y
git clone https://github.com/riscv/riscv-gnu-toolchain

Since we are working with the RISC-V architecture, we need to begin compiling U-Boot and OpenSBI.

  1. Compile U-Boot
git clone https://github.com/u-boot/u-boot.git
cd u-boot
make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- qemu-riscv64_smode_defconfig
make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- -j$(nproc)

Next, the desired kernel will be generated as shown in the figure below.
image

  1. OpenSBI
git clone https://github.com/riscv-software-src/opensbi.git
cd opensbi
make PLATFORM=generic CROSS_COMPILE=riscv64-linux-gnu- PLATFORM_RISCV_XLEN=64

Completing the above steps will successfully install OpenSBI.
Then navigate to the following path.

cd build/platform/generic/firmware

A fw_bin will then be generated, which is our firmware.
image

Download Ubuntu for RISC-V Platforms

Go to the following URL to download the corresponding IMG file (QEMU).
https://ubuntu.com/download/risc-v

Boot the system

Once the kernel and firmware are ready, pass them with the corresponding parameters to QEMU to boot the system.

qemu-system-riscv64 -machine virt -nographic -m 2048 -smp 4 -bios /home/annie/opensbi/build/platform/generic/firmware/fw_jump.bin -kernel /home/annie/u-boot/u-boot.bin -device virtio-net-device,netdev=eth0 -netdev user,id=eth0 -device virtio-rng-pci -drive file=/home/annie/ubuntu-24.10-preinstalled-server-riscv64.img,format=raw,if=virtio

The path should be adjusted to the correct path accordingly.

Basic Commands and Parameter Descriptions
-qemu-system-riscv64
Launches the RISC-V 64-bit version of the QEMU emulator.

-machine virt
Uses the virtual machine type (virt machine type), which is a generic virtualization hardware platform provided by QEMU for RISC-V.

-nographic
Disables graphical output, redirecting display output to the terminal.

-m 2048
Allocates 2048 MB of virtual memory to the emulator.

-smp 4
Configures the emulator with 4 virtual CPU cores.

-bios /home/annie/opensbi/build/platform/generic/firmware/fw_jump.bin
Specifies the OpenSBI firmware fw_jump.bin as the BIOS for booting U-Boot.

-kernel /home/annie/u-boot/u-boot.bin
Loads the U-Boot binary file as the kernel bootloader.

-device virtio-net-device,netdev=eth0
Configures a virtual network device using the VirtIO driver and connects it to the network device with ID eth0.

-netdev user,id=eth0
Simulates a network connection using "user mode networking" and assigns it the ID eth0.

-device virtio-rng-pci
Adds a PCI-based VirtIO random number generator device to provide entropy.

-drive file=/home/annie/ubuntu-24.10-preinstalled-server-riscv64.img,format=raw,if=virtio
Loads the specified Ubuntu system image file (ubuntu-24.10-preinstalled-server-riscv64.img) with the following options:

-format=raw: Reads the image file in raw format.
if=virtio: Attaches the disk as a VirtIO device to the virtual machine.

Then, wait for the system to boot.
When you see the message "Completed socket interaction for boot stage final", press Enter to log in.
The default username is: ubuntu
The default password is: ubuntu
image
Finally, check the system to confirm it is RISC-V:

cd /bin
uname -m
# It will show like this
ubuntu@ubuntu:/bin$ uname -m
riscv64

How to create a file and run it on QEMU

  1. Writing a .s File

Create a simple hello.s. Below is an example program :

.data
msg: .asciz "Hello, QEMU!\n"

.text
.global _start

_start:
    # Write system call
    li a7, 64          # syscall number for write
    li a0, 1           # file descriptor (stdout)
    la a1, msg         # address of the message
    li a2, 13          # length of the message
    ecall

    # Exit system call
    li a7, 93          # syscall number for exit
    li a0, 0           # exit code
    ecall

The directive .global _start is necessary in RISC-V assembly code.

  • Program Entry Point:
    When an executable file is loaded for execution, the operating system looks for the program's entry point. By default, the entry point is usually _start.
    If _start is not defined as a global symbol, the linker may fail to locate it, resulting in the executable being unable to run properly.
  1. Compiling the .s File

Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :

# Assemble the .s file into an object file
riscv64-linux-gnu-as -o hello.o hello.s

# Link the object file to create an executable
riscv64-linux-gnu-ld -o hello hello.o
  1. Running with QEMU

Run the compiled program using QEMU :

qemu-riscv64 hello

After execution, the terminal should display :

Hello, QEMU!

How to tanslate c into riscv

First, locate the positions of stdio.h and libm.a.

ubuntu@ubuntu:~$ find /usr -name stdio.h
find /usr -name libm.a
/usr/include/riscv64-linux-gnu/bits/stdio.h
/usr/include/newlib/ssp/stdio.h
/usr/include/newlib/sys/stdio.h
/usr/include/newlib/stdio.h
/usr/include/stdio.h
/usr/lib/riscv64-linux-gnu/libm.a

The following instruction will translate C into RISC-V.

riscv64-unknown-elf-gcc -O2 -S fft.c -o fft.s \
-I/usr/include/newlib \
-L/usr/lib/riscv64-linux-gnu -lm

This instruction will translate c into elf file.

gcc -o fft.elf fft.c -lm

How to expand the virtual machine's space

The following code can expand the virtual machine's space.

qemu-img resize /home/annie/ubuntu-24.10-preinstalled-server-riscv64.img +15G

QEMU for Mac

Use $ to operate in terminal

environment: Macbook Air M3 Sonoma 14.6

$ riscv32-unknown-elf-gcc -march=rv32i_zba_zbb_zbc_zbs_v -mabi=ilp32 -S -o output.s input.c

Use Homebrew to install QEMU

$ brew install qemu

Check whether RISC-V has been supported

$ qemu-system-riscv32 --version

If we have installed correctly, we get the version below:
( version 9.2.0 in 2025/1 )

QEMU emulator version 9.2.0
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers

Then we compile RV32I/B extension/RVV via RISC-V GCC tools

Installation

$ brew tap riscv-software-src/homebrew-riscv
$ brew install riscv-tools

Verification

$ riscv64-unknown-elf-gcc --version

You may notice that the instruction in above is "riscv64".
It's normal.
Because most RISC-V toolchains target 64-bit architectures (riscv64) by default. However, these tools usually support 32-bit targets as well, by specifying the appropriate architecture flags at compile time.

If it shows that commend not found, checking whether it is in the environment variable.

$ echo $PATH

If it is't, we add it manually and make sure again.

$ export PATH="/opt/homebrew/opt/riscv-gnu-toolchain/bin:$PATH"
$ source ~/.zshrc

Compile the C file

$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -o test.elf test.c
  • -march=rv32i:specity architecture as RV32I
  • -mabi=ilp32:specify ABI(Application Binary Interface)as 32-bit integer

Issues encountered in QEMU

  • Unable to use ecall 1 in QEMU (unable to directly print numbers).

solution :
Convert a word to ASCII. This is used for all numbers greater than or equal to 0.

.data
buffer: .space 21               # String buffer for the converted number
newline: .asciz "\n"            # Newline character

.text
.global _start
_start:
int_to_string:
    addi sp, sp, -16            # Save stack frame
    sw ra, 12(sp)               # Save return address
    sw t0, 8(sp)                # Save t0
    sw t1, 4(sp)                # Save t1

    mv t0, a0                   # Copy input number (a0) to t0
    addi t1, a1, 16             # Point to the end of the buffer
    li a0, 0                    # Initialize the length counter

loop:
    li t3, 10                   # Load divisor (10)
    rem t2, t0, t3              # t2 = t0 % 10 (remainder)
    addi t2, t2, '0'            # Convert the remainder to ASCII character
    addi t1, t1, -1             # Move the buffer pointer back
    sb t2, 0(t1)                # Store the character in the buffer
    div t0, t0, t3              # t0 = t0 / 10 (integer division)
    addi a0, a0, 1              # Increment the length counter
    bnez t0, loop               # Repeat if t0 != 0

    mv a1, t1                   # Return the string starting address in a1
    lw ra, 12(sp)               # Restore return address
    lw t0, 8(sp)                # Restore t0
    lw t1, 4(sp)                # Restore t1
    addi sp, sp, 16             # Restore stack frame
    ret                         # Return to caller

Code

Part1 : logint

Functionality

The function calculates log2(N), which determines the power of 2 closest to N.

Logic

  • Each iteration right-shifts N by 1 bit, effectively simulating a "divide by 2" operation.
  • For every right shift, the counter i increments, representing how many times N can be divided by 2.
  • The loop stops when k becomes 0. At this point, i contains log2(N)+1. Subtracting 1 gives the correct result.

code :

/* Calculates the log2 of number */ int logint(int N) { int k = N, i = 0; while (k) { k >>= 1; i++; } return i - 1; }

Convert to RISC-V assembly:

# Takes input N (a0), returns its log base 2 in a0 logint: addi sp, sp, -4 sw t0, 0(sp) # store t0 add t0, a0, zero # k = N add a0, zero, zero # i = 0 logloop: beq t0, zero, logloop_end # Exit if k == 0 srai t0, t0, 1 # k >>= 1 addi a0, a0, 1 # i++ j logloop logloop_end: addi a0, a0, -1 # Return i - 1 as the counter was incremented one extra time. lw t0, 0(sp) addi sp, sp, 4 jr ra

Compiling the .s File in RV32I

Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :

riscv64-linux-gnu-as -o logint.o logint.s
riscv64-linux-gnu-ld -o logint logint.o
qemu-riscv64 logint

Using this instruction will see the time the file taken.

time qemu-riscv64 logint

for example :

real	0m0.139s
user	0m0.053s
sys	0m0.074s

B-extension

set_log2n: clz t0, a0 # Count leading zeros in a0 li t1, 31 # Maximum index for a 32-bit register sub t1, t1, t0 # Compute log2(N) la t0, logsize # Load address of 'logsize' sw t1, 0(t0) jr ra # Return to caller logsize: .word 0

Compiling the .s File in B extension

Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :

riscv64-linux-gnu-as -march=rv64imafdzbb -o logint.o logint.s
riscv64-linux-gnu-ld -o logint logint.o
time qemu-riscv64 logint

vector extension

# Calculate log base 2 of input # Input:a0 = N # Output: a0 = log2(N) logint: clz a0, a0 # Count leading zeros of a0 addi t0, zero, 31 # Load 31 (32-bit word size - 1) sub a0, t0, a0 jr ra# Return

Part2 : bitwise reverse

Functionality

The function reverse is used to reverse the binary representation of a number n based on the bit length determined by the maximum number N.

  • Input :
    • N: Used to determine the bit length (calculated using log2(N)).
    • n: The number whose binary representation needs to be reversed.
  • Output:
    • Returns the reversed binary representation of n.

Logic

  • Determine Bit Length :

    • Use log2(N) to determine the number of bits required for N.
    • For example, if N=18, then log2(18)4, so the bit length is 4.
  • Check and Reverse Bits :

    • Use a loop to check each bit of n starting from the logint(N)j position.
    • If the bit at this position is 1, set the corresponding position (j1) in the reversed binary number to 1.
    • This is done using bitwise operations & and |.
  • Return Result:

    • Finally, return the reversed number p.

code :

/* Bitwise reverses the number */ int reverse(int N, int n) { int j, p = 0; for (j = 1; j <= logint(N); j++) { if (n & (1 << (logint(N) - j))) { p |= 1 << (j - 1); } } return p; }

Convert to RISC-V assembly:

# Takes inputs N(a0) and n(a1), reverses the number in binary reverse: addi sp, sp, -28 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) sw s4, 20(sp) sw s5, 24(sp) call logint # a0 = lg(N) addi s0, zero, 1 # j = 1 add s1, zero, zero # p = 0 forloop_reverse: bgt s0, a0, forloop_reverse_end # if j > lg(N) then end loop sub s2, a0, s0 addi s3, zero, 1 sll s3, s3, s2 and s3, a1, s3 beq s3, zero, elses3 # If not, skip ifs3: addi s4, s0, -1 # s4 = j - 1 addi s5, zero, 1 sll s5, s5, s4 or s1, s1, s5 elses3: addi s0, s0, 1 j forloop_reverse forloop_reverse_end: add a0, s1, zero # Return p lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) addi sp, sp, 28 jr ra

B-extension

# Function: reverse # Reverses the binary digits of a 32-bit integer. # Inputs: # - a0: Input number to reverse. # - a1: Number of significant bits to reverse (optional; default 32). # Outputs: # - a0: The reversed binary number. # Clobbers: # - t0, t1, t2 reverse: # Swap odd and even bits lui t0, 0x55555 # Load upper 20 bits of 0x55555555 ori t0, t0, 0x555 # Complete loading 0x55555555 srli t1, a0, 1 # v >> 1 and t1, t1, t0 # (v >> 1) & 0x55555555 and t2, a0, t0 slli t2, t2, 1 or a0, t1, t2 # Store result in a0 # Swap consecutive pairs lui t0, 0x33333 # Load upper 20 bits of 0x33333333 ori t0, t0, 0x333 # Complete loading 0x33333333 srli t1, a0, 2 # v >> 2 and t1, t1, t0 # (v >> 2) & 0x33333333 and t2, a0, t0 slli t2, t2, 2 or a0, t1, t2 # Store result in a0 # Swap nibbles lui t0, 0x0F0F0 # Load upper 20 bits of 0x0F0F0F0F ori t0, t0, 0x0F0 # Complete loading 0x0F0F0F0F srli t1, a0, 4 and t1, t1, t0 and t2, a0, t0 slli t2, t2, 4 or a0, t1, t2 # Store result in a0 # Swap bytes lui t0, 0x00FF0# Load upper 20 bits of 0x00FF00FF ori t0, t0, 0x0FF# Complete loading 0x00FF00FF srli t1, a0, 8 and t1, t1, t0 and t2, a0, t0 slli t2, t2, 8 or a0, t1, t2 # Store result in a0 # Swap 2-byte pairs srli t1, a0, 16 slli t2, a0, 16 ora0, t1, t2 # Final result in a0 # Adjust number of significant bits addi t1, zero, 32 # Load immediate value 32 sub t1, t1, a1 # Calculate shift amount srl a0, a0, t1 # Shift right to fit bit count jr ra # Return with result in a0

vector extension

# Initialize the helper vector with sequential integers (0,1,2,...) init_helper_vector: la t0, helperVector # Load the base address of helperVector lw t1, size # Load the size of the vector (N) vsetvli t2, t1, e32, m1 # Set the vector length configuration vid.v v0 # Generate a vector of indices vse32.v v0, (t0) # Store the generated vector in helperVector ret # Bit-reverse elements of a vector # Input:v29 = input vector, a0 = N # Output: v29 = bit-reversed vector vreverse: addi sp, sp, -4 # Save return address sw ra, 0(sp) call logint # Compute log2(N) in a0 lw ra, 0(sp) # Restore return address addi sp, sp, 4 li t1, 1 # Constant 1 for bit manipulation li t0, 1 # Initialize bit position counter j vmv.v.x v28, zero # Clear result vector v28 vreverse_loop: # Loop for j <= log2(N) bgt t0, a0, vreverse_end # Break if j > log2(N) sub t2, a0, t0 # t2 = log2(N) - j sll t3, t1, t2 # t3 = (1 << (log2(N) - j)) vand.vx v0, v29, t3 # v0 = v29 & t3 (check bit) # vmsne:Vector Mask Set Not Equa vmsne.vx v0, v0, zero # v0 = 1 if set, else 0 addi t4, t0, -1 # t4 = j - 1 sll t3, t1, t4 # t3 = (1 << (j - 1)) vor.vx v28, v28, t3, v0.t # Set bit in v28 where mask is true addi t0, t0, 1 # Increment bit position j vreverse_loop vreverse_end: vmv.v.v v29, v28 # Store result in v29 jr ra # Return
    The usage of vreverse would look something like this:
    ```c
    mv a0, a2 # a0 = N, input for reverse function. Overwriting a0 is fine
    vmv.v.v v29, v26 # v29 = v26 = <i>, used as input for vreverse

    call vreverse # v29 now contains rev(N, <i>), keep it for later use
  • la t0, helperVector:
    This instruction loads the base address of helperVector into the register t0. helperVector is an array or buffer used to store the generated vector.

  • lw t1, size:
    This instruction loads an integer value (the size of the vector N) from memory into the register t1. This value represents the number of elements in the vector to be generated.

  • vsetvli t2, t1, e32, m1:
    This instruction sets the vector length and type. It sets the vector length to t1 (i.e., N), with each element being 32 bits (e32) and a multiplier factor of 1 (m1). This means that the subsequent vector operations will target N 32-bit elements.

  • vid.v v0:
    This instruction generates a vector v0 with elements consisting of consecutive index values (0, 1, 2, , N-1). It is an initialization vector typically used for testing or as indices for data.

  • vse32.v v0, (t0):
    This instruction stores the data in vector v0 to the memory address pointed to by t0 (i.e., helperVector). This means the generated index vector will be written into helperVector.

Application to FFT

This code implements the Fast Fourier Transform (FFT) algorithm, which can be used to analyze the frequency components of a signal.

Given an input signal of [1, 1, 1, 1, 0, 0, 0, 0], the FFT result will display the spectrum of the signal. The real and imaginary parts of the result correspond to the magnitude and phase in the frequency domain.

#include <stdio.h> #include <math.h> #include <complex.h> #define PI 3.14159265358979323846 /* Calculates the log2 of number */ int logint(int N) { int k = N, i = 0; while (k) { k >>= 1; i++; } return i - 1; } /* Bitwise reverses the number */ int reverse(int N, int n) { int j, p = 0; for (j = 1; j <= logint(N); j++) { if (n & (1 << (logint(N) - j))) { p |= 1 << (j - 1); } } return p; } /* Performs the FFT */ void fft(complex double *a, int N) { // Bit-reversal permutation for (int i = 0; i < N; i++) { int rev = reverse(N, i); if (i < rev) { complex double temp = a[i]; a[i] = a[rev]; a[rev] = temp; } } // FFT computation for (int s = 1; s <= logint(N); s++) { int m = 1 << s; // 2^s complex double wm = cexp(-2.0 * PI * I / m); for (int k = 0; k < N; k += m) { complex double w = 1.0; for (int j = 0; j < m / 2; j++) { complex double t = w * a[k + j + m / 2]; complex double u = a[k + j]; a[k + j] = u + t; a[k + j + m / 2] = u - t; w *= wm; } } } } int main() { int N = 8; // Length of input (must be a power of 2) complex double a[] = {1, 1, 1, 1, 0, 0, 0, 0}; // Example input printf("Input:\n"); for (int i = 0; i < N; i++) { printf("%lf + %lfi\n", creal(a[i]), cimag(a[i])); } fft(a, N); printf("\nOutput:\n"); for (int i = 0; i < N; i++) { printf("%lf + %lfi\n", creal(a[i]), cimag(a[i])); } return 0; }

Using instruction translate to assembly.

riscv64-linux-gnu-gcc -O3 -march=rv64imafdzba_zbb_zbc_zbs_v -S -o output.s fft.c
riscv64-linux-gnu-as -march=rv64imafdzba_zbb_zbc_zbs_v -o output.o output.s
riscv64-linux-gnu-gcc -o output output.o -lm

When we encountered issues installing riscv32-linux-gnu-gcc, we decided to switch to using rv64i instead.

.file "fft.c" .option pic .attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0" .attribute unaligned_access, 0 .attribute stack_align, 16 .text .align 2 .globl logint .type logint, @function logint: .LFB23: .cfi_startproc mv a5,a0 beq a0,zero,.L4 li a4,0 .L3: sraiw a5,a5,1 mv a0,a4 addiw a4,a4,1 bne a5,zero,.L3 ret .L4: li a0,-1 ret .cfi_endproc .LFE23: .size logint, .-logint .align 2 .globl reverse .type reverse, @function reverse: .LFB24: .cfi_startproc bne a0,zero,.L24 ret .L24: li a2,1 mv a7,a2 li a6,0 .L9: mv a5,a0 li a4,0 .L13: sraiw a5,a5,1 addiw a4,a4,1 bne a5,zero,.L13 ble a4,a2,.L25 mv a4,a0 .L11: sraiw a4,a4,1 mv a3,a5 addiw a5,a5,1 bne a4,zero,.L11 subw a3,a3,a2 sraw a3,a1,a3 andi a3,a3,1 beq a3,zero,.L12 addiw a5,a2,-1 sllw a5,a7,a5 or a6,a6,a5 .L12: addiw a2,a2,1 j .L9 .L25: mv a0,a6 ret .cfi_endproc .LFE24: .size reverse, .-reverse .align 2 .globl fft .type fft, @function fft: .LFB25: .cfi_startproc addi sp,sp,-192 .cfi_def_cfa_offset 192 sd s3,152(sp) sd s8,112(sp) sd ra,184(sp) .cfi_offset 19, -40 .cfi_offset 24, -80 .cfi_offset 1, -8 mv s8,a1 mv s3,a0 ble a1,zero,.L27 mv a7,a0 li a6,1 li a0,0 vsetivli zero,2,e64,m1,ta,ma .L28: li a1,0 li a2,1 .L33: mv a5,s8 li a4,0 .L31: sraiw a5,a5,1 addiw a4,a4,1 bne a5,zero,.L31 bge a2,a4,.L63 mv a4,s8 .L29: sraiw a4,a4,1 mv a3,a5 addiw a5,a5,1 bne a4,zero,.L29 subw a3,a3,a2 sraw a3,a0,a3 andi a3,a3,1 beq a3,zero,.L30 addiw a5,a2,-1 sllw a5,a6,a5 or a1,a1,a5 .L30: addiw a2,a2,1 j .L33 .L63: ble a1,a0,.L32 slli a5,a1,4 add a5,s3,a5 vle64.v v2,0(a5) vle64.v v1,0(a7) vse64.v v2,0(a7) vse64.v v1,0(a5) .L32: addiw a0,a0,1 addi a7,a7,16 bne s8,a0,.L28 .L27: bne s8,zero,.L64 .L55: ld ra,184(sp) .cfi_remember_state .cfi_restore 1 ld s3,152(sp) .cfi_restore 19 ld s8,112(sp) .cfi_restore 24 addi sp,sp,192 .cfi_def_cfa_offset 0 jr ra .L64: .cfi_restore_state fsd fs0,72(sp) fsd fs1,64(sp) fsd fs4,40(sp) .cfi_offset 40, -120 .cfi_offset 41, -128 .cfi_offset 52, -152 fld fs1,.LC2,a5 fld fs0,.LC3,a5 fld fs4,.LC1,a5 sd s5,136(sp) .cfi_offset 21, -56 li s5,1 sd s4,144(sp) sd s0,176(sp) sd s1,168(sp) sd s2,160(sp) sd s6,128(sp) sd s7,120(sp) sd s9,104(sp) sd s10,96(sp) sd s11,88(sp) fsd fs2,56(sp) fsd fs3,48(sp) .cfi_offset 20, -48 .cfi_offset 8, -16 .cfi_offset 9, -24 .cfi_offset 18, -32 .cfi_offset 22, -64 .cfi_offset 23, -72 .cfi_offset 25, -88 .cfi_offset 26, -96 .cfi_offset 27, -104 .cfi_offset 50, -136 .cfi_offset 51, -144 mv s4,s5 .L38: mv a5,s8 li a4,0 .L43: sraiw a5,a5,1 addiw a4,a4,1 bne a5,zero,.L43 ble a4,s5,.L65 sllw s10,s4,s5 ble s8,zero,.L36 li a5,1 ble s10,a5,.L36 fcvt.d.w fa1,s10 srliw s6,s10,31 addw s6,s6,s10 fdiv.d fa0,fs1,fa1 sraiw s6,s6,1 slli s11,s10,4 mv s7,s3 li s9,0 fdiv.d fa1,fs0,fa1 call cexp@plt fmv.d fs3,fa0 fmv.d fs2,fa1 slli a3,s6,4 .L42: fmv.d.x fa3,zero fmv.d fa2,fs4 add s1,a3,s7 mv s0,s7 li s2,0 .L41: fld fa1,8(s1) fld fa0,0(s1) fmul.d ft0,fa1,fa2 fmul.d ft1,fa1,fa3 fmadd.d ft0,fa0,fa3,ft0 fmsub.d ft1,fa0,fa2,ft1 feq.d a4,ft0,ft0 feq.d a5,ft1,ft1 and a5,a5,a4 beq a5,zero,.L66 .L39: fmul.d fa5,fs3,fa3 fmul.d fa4,fs2,fa3 fld ft3,0(s0) fld ft2,8(s0) fadd.d fa0,ft1,ft3 fadd.d fa1,ft0,ft2 fmadd.d fa5,fs2,fa2,fa5 fmsub.d fa4,fs3,fa2,fa4 fsub.d ft3,ft3,ft1 fsub.d ft2,ft2,ft0 fsd fa0,0(s0) fsd fa1,8(s0) feq.d a4,fa5,fa5 feq.d a5,fa4,fa4 fsd ft3,0(s1) fsd ft2,8(s1) and a5,a5,a4 beq a5,zero,.L67 .L40: addiw s2,s2,1 fmv.d fa2,fa4 fmv.d fa3,fa5 addi s1,s1,16 addi s0,s0,16 bgt s6,s2,.L41 addw s9,s10,s9 add s7,s7,s11 bgt s8,s9,.L42 .L36: addiw s5,s5,1 j .L38 .L67: fmv.d fa1,fa3 fmv.d fa0,fa2 fmv.d fa3,fs2 fmv.d fa2,fs3 sd a3,8(sp) call __muldc3@plt fmv.d fa4,fa0 fmv.d fa5,fa1 ld a3,8(sp) j .L40 .L66: sd a3,24(sp) fsd fa3,16(sp) fsd fa2,8(sp) call __muldc3@plt fmv.d ft1,fa0 fmv.d ft0,fa1 ld a3,24(sp) fld fa3,16(sp) fld fa2,8(sp) j .L39 .L65: ld s0,176(sp) .cfi_restore 8 ld s1,168(sp) .cfi_restore 9 ld s2,160(sp) .cfi_restore 18 ld s4,144(sp) .cfi_restore 20 ld s5,136(sp) .cfi_restore 21 ld s6,128(sp) .cfi_restore 22 ld s7,120(sp) .cfi_restore 23 ld s9,104(sp) .cfi_restore 25 ld s10,96(sp) .cfi_restore 26 ld s11,88(sp) .cfi_restore 27 fld fs0,72(sp) .cfi_restore 40 fld fs1,64(sp) .cfi_restore 41 fld fs2,56(sp) .cfi_restore 50 fld fs3,48(sp) .cfi_restore 51 fld fs4,40(sp) .cfi_restore 52 j .L55 .cfi_endproc .LFE25: .size fft, .-fft .section .rodata.str1.8,"aMS",@progbits,1 .align 3 .LC4: .string "Input:" .align 3 .LC5: .string "%lf + %lfi\n" .align 3 .LC6: .string "\nOutput:" .section .text.startup,"ax",@progbits .align 2 .globl main .type main, @function main: .LFB26: .cfi_startproc vsetivli zero,16,e64,m8,ta,ma lla a5,.LANCHOR0 vle64.v v8,0(a5) addi sp,sp,-192 .cfi_def_cfa_offset 192 sd s4,144(sp) .cfi_offset 20, -48 la s4,__stack_chk_guard sd s0,176(sp) sd s1,168(sp) sd s2,160(sp) sd s3,152(sp) sd ra,184(sp) .cfi_offset 8, -16 .cfi_offset 9, -24 .cfi_offset 18, -32 .cfi_offset 19, -40 .cfi_offset 1, -8 addi s3,sp,8 ld a5, 0(s4) sd a5, 136(sp) li a5, 0 vse64.v v8,0(s3) lla a0,.LC4 mv s0,s3 addi s2,sp,136 mv s1,s3 call puts@plt .L69: ld a3,8(s1) ld a2,0(s1) lla a1,.LC5 li a0,2 addi s1,s1,16 call __printf_chk@plt bne s1,s2,.L69 mv a0,s3 li a1,8 call fft lla a0,.LC6 call puts@plt .L70: ld a3,8(s0) ld a2,0(s0) lla a1,.LC5 li a0,2 addi s0,s0,16 call __printf_chk@plt bne s0,s2,.L70 ld a4, 136(sp) ld a5, 0(s4) xor a5, a4, a5 li a4, 0 bne a5,zero,.L78 ld ra,184(sp) .cfi_remember_state .cfi_restore 1 ld s0,176(sp) .cfi_restore 8 ld s1,168(sp) .cfi_restore 9 ld s2,160(sp) .cfi_restore 18 ld s3,152(sp) .cfi_restore 19 ld s4,144(sp) .cfi_restore 20 li a0,0 addi sp,sp,192 .cfi_def_cfa_offset 0 jr ra .L78: .cfi_restore_state call __stack_chk_fail@plt .cfi_endproc .LFE26: .size main, .-main .section .rodata.cst8,"aM",@progbits,8 .align 3 .LC1: .word 0 .word 1072693248 .align 3 .LC2: .word 0 .word -2147483648 .align 3 .LC3: .word 1413754136 .word -1072094725 .section .rodata .align 3 .set .LANCHOR0,. + 0 .LC0: .word 0 .word 1072693248 .word 0 .word 0 .word 0 .word 1072693248 .word 0 .word 0 .word 0 .word 1072693248 .word 0 .word 0 .word 0 .word 1072693248 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .ident "GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0" .section .note.GNU-stack,"",@progbits
time qemu-riscv64 output
real	0m0.600s
user	0m0.405s
sys	0m0.203s

Then we improve the code.

.file "fft.c" .option pic .attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0" .attribute unaligned_access, 0 .attribute stack_align, 16 .text .align 2 .globl logint .type logint, @function logint: .LFB23: .cfi_startproc clz a5,a0 li a4,31 sub a0,a4,a5 ret .cfi_endproc .LFE23: .size logint, .-logint .align 2 .globl reverse .type reverse, @function reverse: .LFB24: .cfi_startproc bne a0, zero, .L24 ret .L24: li a2, 1 li a6, 0 .L9: mv a5, a0 li a4, 0 .L13: srai a5, a5, 1 addi a4, a4, 1 bne a5, zero, .L13 ble a4, a2, .L25 srl a3, a1, a2 andi a3, a3, 1 beqz a3, .L12 sll a5, a6, a2 or a6, a6, a5 .L12: addi a2, a2, 1 j .L9 .L25: mv a0, a6 ret .cfi_endproc .LFE24: .size reverse, .-reverse .align 2 .globl fft .type fft, @function fft: .LFB25: .cfi_startproc addi sp, sp, -192 .cfi_def_cfa_offset 192 sd s3, 152(sp) sd s8, 112(sp) sd ra, 184(sp) .cfi_offset 19, -40 .cfi_offset 24, -80 .cfi_offset 1, -8 mv s8, a1 mv s3, a0 blez a1, .L27 # 若 a1 <= 0,跳轉到.L27 li a0, 0 li a6, 1 mv a7, a0 # 初始化指標 vsetivli zero, 2, e64, m1, ta, ma .L28: li a1, 0 # 初始化暫存變數 li a2, 1 .L33: mv a5, s8 li a4, 0 .L31: srai a5, a5, 1 addi a4, a4, 1 bnez a5, .L31 bge a2, a4, .L63 # 若 a2 >= a4,跳轉結束 srl a3, a0, a2 # 提取位元 andi a3, a3, 1 beqz a3, .L30 # 若該位為 0,跳過 OR 操作 sll a5, a6, a2 or a1, a1, a5 # 累加結果 .L30: addi a2, a2, 1 j .L33 .L63: ble a1, a0, .L32 # 若 a1 <= a0,跳轉到.L32 slli a5, a1, 4 # 計算偏移 add a5, s3, a5 vle64.v v2, (a5) # 加載向量 vle64.v v1, (a7) vse64.v v2, (a7) vse64.v v1, (a5) .L32: addi a0, a0, 1 addi a7, a7, 16 bne s8, a0, .L28 .L27: ld ra, 184(sp) ld s3, 152(sp) ld s8, 112(sp) addi sp, sp, 192 .cfi_def_cfa_offset 0 jr ra .cfi_endproc .LFE25: .size fft, .-fft .section .rodata.str1.8,"aMS",@progbits,1 .align 3 .LC4: .string "Input:" .align 3 .LC5: .string "%lf + %lfi\n" .align 3 .LC6: .string "\nOutput:" .section .text.startup,"ax",@progbits .align 2 .globl main .type main, @function main: .LFB26: .cfi_startproc addi sp, sp, -192 .cfi_def_cfa_offset 192 sd ra, 184(sp) sd s0, 176(sp) sd s1, 168(sp) sd s2, 160(sp) sd s3, 152(sp) sd s4, 144(sp) .cfi_offset 1, -8 .cfi_offset 8, -16 .cfi_offset 9, -24 .cfi_offset 18, -32 .cfi_offset 19, -40 .cfi_offset 20, -48 vsetivli zero, 16, e64, m8, ta, ma lla a5, .LANCHOR0 vle64.v v8, 0(a5) addi s3, sp, 8 vse64.v v8, 0(s3) la s4, __stack_chk_guard ld a5, 0(s4) sd a5, 136(sp) lla a0, .LC4 call puts@plt mv s0, s3 mv s1, s3 addi s2, sp, 136 .L69: ld a3, 8(s1) ld a2, 0(s1) lla a1, .LC5 li a0, 2 addi s1, s1, 16 call __printf_chk@plt bne s1, s2, .L69 mv a0, s3 li a1, 8 call fft lla a0, .LC6 call puts@plt .L70: ld a3, 8(s0) ld a2, 0(s0) lla a1, .LC5 li a0, 2 addi s0, s0, 16 call __printf_chk@plt bne s0, s2, .L70 ld a4, 136(sp) ld a5, 0(s4) xor a5, a4, a5 beqz a5, .L71 call __stack_chk_fail@plt j .L72 .L71: ld ra, 184(sp) ld s0, 176(sp) ld s1, 168(sp) ld s2, 160(sp) ld s3, 152(sp) ld s4, 144(sp) addi sp, sp, 192 .cfi_def_cfa_offset 0 jr ra .L72: .cfi_endproc .LFE26: .size main, .-main .section .rodata.cst8,"aM",@progbits,8 .align 3 .LC1: .word 0 .word 1072693248 .align 3 .LC2: .word 0 .word -2147483648 .align 3 .LC3: .word 1413754136 .word -1072094725 .section .rodata .align 3 .set .LANCHOR0,. + 0 .LC0: .word 0 .word 1072693248 .word 0 .word 0 .word 0 .word 1072693248 .word 0 .word 0 .word 0 .word 1072693248 .word 0 .word 0 .word 0 .word 1072693248 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .word 0 .ident "GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0" .section .note.GNU-stack,"",@progbits

Difference Between lla and la
lla (Load Address Absolute) : Only handles absolute addresses and does not account for a base address (e.g., PC-relative or GOT).
la (Load Address : Adjusts the address based on the target's access model, such as PC-relative or using the Global Offset Table (GOT).

We improved the code, reducing nearly 200 lines and shortening the execution time.

real	0m0.498s
user	0m0.373s
sys	0m0.132s

Result

The following is the result of the output.
In this task, we accelerated its execution time by using b-extension and RVV.

Input:
1.000000 + 0.000000i
1.000000 + 0.000000i
1.000000 + 0.000000i
1.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i

Output:
4.000000 + 0.000000i
1.000000 + -2.414214i
0.000000 + 0.000000i
1.000000 + -0.414214i
0.000000 + 0.000000i
1.000000 + 0.414214i
0.000000 + 0.000000i
1.000000 + 2.414214i

Reference