Accelerate Fast Fourier Transform

謝廷昇, 王晴文

Introduction of the task

Task : Research FFT and its improvement schemes.

Implement the following:
RV32I : Develop the base RV32I ISA implementation.
Using B Extension : Integrate and utilize the Bit-Manipulation (B) extension.
Using RVV : Incorporate and utilize the RISC-V Vector Extension (RVV).

The code should be validated using QEMU.

Finally, prepare a signal processing application that utilizes your improved FFT as a technical demonstration.

Introduction to Fourier transform

DFT (Discrete Fourier transform)

The discrete Fourier transform (DFT) is defined as follows:

Y (k) = \sum_{n = 0}^{N - 1} x (n) W_{N}^{n k} = \sum_{n = 0}^{N - 1} x (n) e^{- \frac{2 π i}{N} n k}

Where:

(x(n)) : Input sequence
(Y(k)) : Output sequence
$W_{N}^{n k} = e^{- \frac{2 π i}{N} n k}$ : Twiddle factor
$i = \sqrt{- 1}$ : Imaginary unit

The Discrete Fourier Transform (DFT) is a mathematical tool that transforms a signal from the time domain to the frequency domain. Its primary purpose is to analyze the spectral characteristics of a signal. DFT is widely used in various fields, including digital signal processing, image processing, sound analysis, and scientific and engineering applications.

Characteristics of DFT:

Spectrum Analysis: Decomposes the input signal into a superposition of different frequency components.
Inverse Transform (IDFT): Allows the reconstruction of the time-domain signal from the frequency-domain data.
High Complexity: The computational complexity of a standard DFT is O(
$N^{2}$ ) , which increases significantly with the length of the input signal.

Applications of DFT:

Signal Processing: Filtering, modulation, and spectrum analysis.
Image Processing: Edge detection and compression (e.g., discrete cosine transform in JPEG).
Engineering and Science: Seismic analysis and structural vibration studies.

Cooley-Tukey FFT (Fast Fourier Transform)

An efficient algorithm used for rapidly computing the Discrete Fourier Transform and its inverse . Fourier Transform is a core technique in signal processing for analyzing frequency-domain characteristics, and FFT reduces computational complexity, making the application of Fourier Transform more efficient.

Direct computation of DFT requires O(

N^{2}

) computational complexity (where N is the number of data points), while FFT reduces the complexity to O(

N l o g N

), significantly improving efficiency. Its applications include signal processing (e.g., spectrum analysis and filtering), image processing (e.g., compression and denoising), audio processing (e.g., music analysis and echo cancellation), and scientific computation (e.g., simulation and solving partial differential equations). The basic principle of FFT is based on the divide-and-conquer approach, where DFT is broken down into smaller sub-problems for recursive computation. By dividing the input data into even and odd parts for separate calculations and then combining the results, the process is greatly optimized. Among the common implementations, the Cooley-Tukey FFT algorithm is the most widely used.

The decomposition method of the Cooley-Tukey FFT algorithm can be expressed as:

Y (k) = \sum_{n = 0}^{N - 1} x (n) W_{N}^{n k} = \sum_{n = 0}^{N / 2 - 1} x (2 n) W_{N}^{2 n k} + \sum_{n = 0}^{N / 2 - 1} x (2 n + 1) W_{N}^{(2 n + 1) k}

= \sum_{n = 0}^{N / 2 - 1} f (n) W_{N / 2}^{n k} + \sum_{n = 0}^{N / 2 - 1} g (n) W_{N / 2}^{n k}

Where:

(f(n)) and (g(n)) are complex sequences of length
$N / 2$ .
$W_{N}^{n k} = e^{- \frac{2 π i}{N} n k}$ : Twiddle factor

PerfMPL-FFT

In terms of algorithm implementation, this method used the SIMD vectorization, and vector features of the RISC-V instruction set architecture to assemble and implement the optimized FFT algorithm manually and optimized the selection and sequencing of instructions to maximize and improve the performance of the FFT algorithm library based on the RISC-V architecture, and ﬁnally realize a high-performance algorithm library based on the RISC-V CPUs. The experimental results show that: on the RISC-V platform, the PerfMPL-FFT algorithm library implemented in this paper has a better performance improvement for DFT processing of single-precision and double-precision data compared with the open-source FFTW library.

The key contributions of this paper are summarized as follows:

Proposed optimization strategies: We introduced a set of optimization strategies and methods targeting the butterfly kernel in the Cooley-Tukey FFT algorithm.
Designed a technology system for RISC-V: Based on the characteristics of the RISC-V architecture, we developed an optimization technology system for the FFT algorithm, enhancing its performance on the RISC-V platform.
Implemented an efficient FFT library: We implemented a high-performance FFT algorithm library on the RISC-V platform, demonstrating significant performance advantages.

FFTW (Fast Fourier Transform in the West)

There are various fast Fourier transform (FFT) algorithms, including open-source libraries like FFTW, as well as FFT optimization libraries specifically designed by some chip companies for their own hardware, such as Intel’s MKL_FFT and Huawei’s KML_FFT.

FFTW (Fast Fourier Transform in the West) is an open-source library designed for fast Fourier transform (FFT). It is distributed under a free software license, similar to GPL, and you can download the latest version from its official website or code repository, following the licensing terms for usage. This library enables efficient numerical computations across various computing platforms.

By leveraging pre-decomposition algorithms and caching techniques, FFTW significantly accelerates FFT computations, saving several times the computation time compared to traditional FFT algorithms. FFTW is widely applicable, not only in scientific computing and signal processing but also in fields like image and audio processing. Its exceptional performance makes it a vital tool in numerous scientific and engineering applications.

Here are the URLs for FFTW:
Official Website : http://fftw.org/
GitHub Repository : https://github.com/FFTW/fftw3

QEMU

What is QEMU?

QEMU (Quick Emulator) is an open-source general-purpose machine emulator and virtualization tool that primarily supports the following modes:

User Mode Emulation: In this mode, QEMU allows programs compiled for one CPU architecture to run on a different CPU architecture. eg.qemu-riscv64
System Emulation: In this mode, QEMU provides a complete environment for running a guest system, including the CPU, memory, and peripheral devices. eg.qemu-system-riscv64

When emulating the RISC-V architecture, QEMU can simulate the virt virtual development board. While this board does not correspond to any physical hardware, it supports the emulation of various hardware devices, making it ideal for development and testing purposes.

QEMU for Linux

Install QEMU and Risc-V tool chain

sudo apt update
sudo apt install qemu qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils -y
git clone https://github.com/riscv/riscv-gnu-toolchain

Since we are working with the RISC-V architecture, we need to begin compiling U-Boot and OpenSBI.

Compile U-Boot

git clone https://github.com/u-boot/u-boot.git
cd u-boot
make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- qemu-riscv64_smode_defconfig
make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- -j$(nproc)

Next, the desired kernel will be generated as shown in the figure below.

OpenSBI

git clone https://github.com/riscv-software-src/opensbi.git
cd opensbi
make PLATFORM=generic CROSS_COMPILE=riscv64-linux-gnu- PLATFORM_RISCV_XLEN=64

Completing the above steps will successfully install OpenSBI.
Then navigate to the following path.

cd build/platform/generic/firmware

A fw_bin will then be generated, which is our firmware.

Download Ubuntu for RISC-V Platforms

Go to the following URL to download the corresponding IMG file (QEMU).
https://ubuntu.com/download/risc-v

Boot the system

Once the kernel and firmware are ready, pass them with the corresponding parameters to QEMU to boot the system.

qemu-system-riscv64 -machine virt -nographic -m 2048 -smp 4 -bios /home/annie/opensbi/build/platform/generic/firmware/fw_jump.bin -kernel /home/annie/u-boot/u-boot.bin -device virtio-net-device,netdev=eth0 -netdev user,id=eth0 -device virtio-rng-pci -drive file=/home/annie/ubuntu-24.10-preinstalled-server-riscv64.img,format=raw,if=virtio

The path should be adjusted to the correct path accordingly.

Basic Commands and Parameter Descriptions
-qemu-system-riscv64
Launches the RISC-V 64-bit version of the QEMU emulator.

-machine virt
Uses the virtual machine type (virt machine type), which is a generic virtualization hardware platform provided by QEMU for RISC-V.

-nographic
Disables graphical output, redirecting display output to the terminal.

-m 2048
Allocates 2048 MB of virtual memory to the emulator.

-smp 4
Configures the emulator with 4 virtual CPU cores.

-bios /home/annie/opensbi/build/platform/generic/firmware/fw_jump.bin
Specifies the OpenSBI firmware fw_jump.bin as the BIOS for booting U-Boot.

-kernel /home/annie/u-boot/u-boot.bin
Loads the U-Boot binary file as the kernel bootloader.

-device virtio-net-device,netdev=eth0
Configures a virtual network device using the VirtIO driver and connects it to the network device with ID eth0.

-netdev user,id=eth0
Simulates a network connection using "user mode networking" and assigns it the ID eth0.

-device virtio-rng-pci
Adds a PCI-based VirtIO random number generator device to provide entropy.

-drive file=/home/annie/ubuntu-24.10-preinstalled-server-riscv64.img,format=raw,if=virtio
Loads the specified Ubuntu system image file (ubuntu-24.10-preinstalled-server-riscv64.img) with the following options:

-format=raw: Reads the image file in raw format.
if=virtio: Attaches the disk as a VirtIO device to the virtual machine.

Then, wait for the system to boot.
When you see the message "Completed socket interaction for boot stage final", press Enter to log in.
The default username is: ubuntu
The default password is: ubuntu

Finally, check the system to confirm it is RISC-V:

cd /bin
uname -m
# It will show like this
ubuntu@ubuntu:/bin$ uname -m
riscv64

How to create a file and run it on QEMU

Writing a .s File

Create a simple hello.s. Below is an example program :

.data
msg: .asciz "Hello, QEMU!\n"

.text
.global _start

_start:
    # Write system call
    li a7, 64          # syscall number for write
    li a0, 1           # file descriptor (stdout)
    la a1, msg         # address of the message
    li a2, 13          # length of the message
    ecall

    # Exit system call
    li a7, 93          # syscall number for exit
    li a0, 0           # exit code
    ecall

The directive .global _start is necessary in RISC-V assembly code.

Program Entry Point:
When an executable file is loaded for execution, the operating system looks for the program's entry point. By default, the entry point is usually _start.
If _start is not defined as a global symbol, the linker may fail to locate it, resulting in the executable being unable to run properly.

Compiling the .s File

Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :

# Assemble the .s file into an object file
riscv64-linux-gnu-as -o hello.o hello.s

# Link the object file to create an executable
riscv64-linux-gnu-ld -o hello hello.o

Running with QEMU

Run the compiled program using QEMU :

qemu-riscv64 hello

After execution, the terminal should display :

Hello, QEMU!

How to tanslate c into riscv

First, locate the positions of stdio.h and libm.a.

ubuntu@ubuntu:~$ find /usr -name stdio.h
find /usr -name libm.a
/usr/include/riscv64-linux-gnu/bits/stdio.h
/usr/include/newlib/ssp/stdio.h
/usr/include/newlib/sys/stdio.h
/usr/include/newlib/stdio.h
/usr/include/stdio.h
/usr/lib/riscv64-linux-gnu/libm.a

The following instruction will translate C into RISC-V.

riscv64-unknown-elf-gcc -O2 -S fft.c -o fft.s \
-I/usr/include/newlib \
-L/usr/lib/riscv64-linux-gnu -lm

This instruction will translate c into elf file.

gcc -o fft.elf fft.c -lm

How to expand the virtual machine's space

The following code can expand the virtual machine's space.

qemu-img resize /home/annie/ubuntu-24.10-preinstalled-server-riscv64.img +15G

QEMU for Mac

Use $ to operate in terminal

environment: Macbook Air M3 Sonoma 14.6

$ riscv32-unknown-elf-gcc -march=rv32i_zba_zbb_zbc_zbs_v -mabi=ilp32 -S -o output.s input.c

Use Homebrew to install QEMU

$ brew install qemu

Check whether RISC-V has been supported

$ qemu-system-riscv32 --version

If we have installed correctly, we get the version below:
( version 9.2.0 in 2025/1 )

QEMU emulator version 9.2.0
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers

Then we compile RV32I/B extension/RVV via RISC-V GCC tools

Installation

$ brew tap riscv-software-src/homebrew-riscv
$ brew install riscv-tools

Verification

$ riscv64-unknown-elf-gcc --version

You may notice that the instruction in above is "riscv64".
It's normal.
Because most RISC-V toolchains target 64-bit architectures (riscv64) by default. However, these tools usually support 32-bit targets as well, by specifying the appropriate architecture flags at compile time.

If it shows that commend not found, checking whether it is in the environment variable.

$ echo $PATH

If it is't, we add it manually and make sure again.

$ export PATH="/opt/homebrew/opt/riscv-gnu-toolchain/bin:$PATH"
$ source ~/.zshrc

Compile the C file

$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -o test.elf test.c

-march=rv32i：specity architecture as RV32I
-mabi=ilp32：specify ABI（Application Binary Interface）as 32-bit integer

Issues encountered in QEMU

Unable to use ecall 1 in QEMU (unable to directly print numbers).

solution :
Convert a word to ASCII. This is used for all numbers greater than or equal to 0.

.data
buffer: .space 21               # String buffer for the converted number
newline: .asciz "\n"            # Newline character

.text
.global _start
_start:
int_to_string:
    addi sp, sp, -16            # Save stack frame
    sw ra, 12(sp)               # Save return address
    sw t0, 8(sp)                # Save t0
    sw t1, 4(sp)                # Save t1

    mv t0, a0                   # Copy input number (a0) to t0
    addi t1, a1, 16             # Point to the end of the buffer
    li a0, 0                    # Initialize the length counter

loop:
    li t3, 10                   # Load divisor (10)
    rem t2, t0, t3              # t2 = t0 % 10 (remainder)
    addi t2, t2, '0'            # Convert the remainder to ASCII character
    addi t1, t1, -1             # Move the buffer pointer back
    sb t2, 0(t1)                # Store the character in the buffer
    div t0, t0, t3              # t0 = t0 / 10 (integer division)
    addi a0, a0, 1              # Increment the length counter
    bnez t0, loop               # Repeat if t0 != 0

    mv a1, t1                   # Return the string starting address in a1
    lw ra, 12(sp)               # Restore return address
    lw t0, 8(sp)                # Restore t0
    lw t1, 4(sp)                # Restore t1
    addi sp, sp, 16             # Restore stack frame
    ret                         # Return to caller

Code

Part1 : logint

Functionality

The function calculates

\log_{2} (N)

, which determines the power of 2 closest to

N

Logic

Each iteration right-shifts
$N$ by 1 bit, effectively simulating a "divide by 2" operation.
For every right shift, the counter
$i$ increments, representing how many times
$N$ can be divided by 2.
The loop stops when
$k$ becomes 0. At this point,
$i$ contains
$\log_{2} (N) + 1$ . Subtracting 1 gives the correct result.

code :










/* Calculates the log2 of number */
int logint(int N)
{
    int k = N, i = 0;
    while (k) {
        k >>= 1;
        i++;
    }
    return i - 1;
}

Convert to RISC-V assembly:



















# Takes input N (a0), returns its log base 2 in a0
logint:
    addi sp, sp, -4
    sw t0, 0(sp)          		# store t0

    add t0, a0, zero	  		# k = N
    add a0, zero, zero    		# i = 0

logloop:
    beq t0, zero, logloop_end 		# Exit if k == 0
    srai t0, t0, 1 			# k >>= 1
    addi a0, a0, 1 			# i++
    j logloop

logloop_end:
    addi a0, a0, -1 			# Return i - 1 as the counter was incremented one extra time.
    lw t0, 0(sp)
    addi sp, sp, 4
    jr ra

Compiling the `.s` File in RV32I

Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :

riscv64-linux-gnu-as -o logint.o logint.s
riscv64-linux-gnu-ld -o logint logint.o
qemu-riscv64 logint

Using this instruction will see the time the file taken.

time qemu-riscv64 logint

for example :

real	0m0.139s
user	0m0.053s
sys	0m0.074s

B-extension










set_log2n:
    clz t0, a0             # Count leading zeros in a0
    li t1, 31              # Maximum index for a 32-bit register
    sub t1, t1, t0         # Compute log2(N)
    la t0, logsize         # Load address of 'logsize'
    sw t1, 0(t0) 

    jr ra                  # Return to caller

logsize: .word 0

Compiling the `.s` File in B extension

Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :

riscv64-linux-gnu-as -march=rv64imafdzbb -o logint.o logint.s
riscv64-linux-gnu-ld -o logint logint.o
time qemu-riscv64 logint

vector extension








# Calculate log base 2 of input
# Input:a0 = N
# Output: a0 = log2(N)
logint:
    clz a0, a0 # Count leading zeros of a0
    addi t0, zero, 31 # Load 31 (32-bit word size - 1)
    sub a0, t0, a0 
    jr ra# Return

Part2 : bitwise reverse

Functionality

The function reverse is used to reverse the binary representation of a number

n

based on the bit length determined by the maximum number

N

Input :
- $N$ : Used to determine the bit length (calculated using
  $\log_{2} (N)$ ).
- $n$ : The number whose binary representation needs to be reversed.
Output:
- Returns the reversed binary representation of
  $n$ .

Logic

Determine Bit Length :
- Use
  $\log_{2} (N)$ to determine the number of bits required for
  $N$ .
- For example, if
  $N = 18$ , then
  $\log_{2} (18) \approx 4$ , so the bit length is 4.
Check and Reverse Bits :
- Use a loop to check each bit of
  $n$ starting from the
  $logint (N) - j$ position.
- If the bit at this position is 1, set the corresponding position
  $(j - 1)$ in the reversed binary number to 1.
- This is done using bitwise operations & and |.
Return Result:
- Finally, return the reversed number
  $p$ .

code :











/* Bitwise reverses the number */
int reverse(int N, int n)
{
    int j, p = 0;
    for (j = 1; j <= logint(N); j++) {
        if (n & (1 << (logint(N) - j))) {
            p |= 1 << (j - 1);
        }
    }
    return p;
}

Convert to RISC-V assembly:












































# Takes inputs N(a0) and n(a1), reverses the number in binary
reverse:
    addi sp, sp, -28
    sw ra, 0(sp)
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw s2, 12(sp)
    sw s3, 16(sp)
    sw s4, 20(sp)
    sw s5, 24(sp)

    call logint                         # a0 = lg(N)
    addi s0, zero, 1                    # j = 1
    add s1, zero, zero                  # p = 0

forloop_reverse:
    bgt s0, a0, forloop_reverse_end     # if j > lg(N) then end loop
    sub s2, a0, s0
    addi s3, zero, 1
    sll s3, s3, s2
    and s3, a1, s3
    beq s3, zero, elses3                # If not, skip

ifs3:
    addi s4, s0, -1                     # s4 = j - 1
    addi s5, zero, 1
    sll s5, s5, s4
    or s1, s1, s5
elses3:
    addi s0, s0, 1
    j forloop_reverse

forloop_reverse_end:
    add a0, s1, zero # Return p

    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw s2, 12(sp)
    lw s3, 16(sp)
    lw s4, 20(sp)
    lw s5, 24(sp)
    addi sp, sp, 28
    jr ra

B-extension



























































# Function: reverse
# Reverses the binary digits of a 32-bit integer.
# Inputs:
# - a0: Input number to reverse.
# - a1: Number of significant bits to reverse (optional; default 32).
# Outputs:
# - a0: The reversed binary number.
# Clobbers:
# - t0, t1, t2

reverse:
    # Swap odd and even bits
    lui t0, 0x55555         # Load upper 20 bits of 0x55555555
    ori t0, t0, 0x555 # Complete loading 0x55555555
    srli t1, a0, 1 # v >> 1
    and t1, t1, t0 # (v >> 1) & 0x55555555
    and t2, a0, t0
    slli t2, t2, 1
    or a0, t1, t2 # Store result in a0

    # Swap consecutive pairs
    lui t0, 0x33333 # Load upper 20 bits of 0x33333333
    ori t0, t0, 0x333 # Complete loading 0x33333333
    srli t1, a0, 2 # v >> 2
    and t1, t1, t0 # (v >> 2) & 0x33333333
    and t2, a0, t0
    slli t2, t2, 2
    or a0, t1, t2 # Store result in a0

    # Swap nibbles
    lui t0, 0x0F0F0 # Load upper 20 bits of 0x0F0F0F0F
    ori t0, t0, 0x0F0 # Complete loading 0x0F0F0F0F
    srli t1, a0, 4
    and t1, t1, t0
    and t2, a0, t0
    slli t2, t2, 4
    or a0, t1, t2 # Store result in a0

    # Swap bytes
    lui t0, 0x00FF0# Load upper 20 bits of 0x00FF00FF
    ori t0, t0, 0x0FF# Complete loading 0x00FF00FF
    srli t1, a0, 8
    and t1, t1, t0
    and t2, a0, t0 
    slli t2, t2, 8 
    or a0, t1, t2 # Store result in a0

    # Swap 2-byte pairs
    srli t1, a0, 16
    slli t2, a0, 16
    ora0, t1, t2 # Final result in a0

    # Adjust number of significant bits
    addi t1, zero, 32 # Load immediate value 32
    sub t1, t1, a1 # Calculate shift amount
    srl a0, a0, t1 # Shift right to fit bit count

    jr ra # Return with result in a0

vector extension














































# Initialize the helper vector with sequential integers (0,1,2,...)
init_helper_vector:
    la t0, helperVector      # Load the base address of helperVector
    lw t1, size              # Load the size of the vector (N)
    vsetvli t2, t1, e32, m1  # Set the vector length configuration
    vid.v v0                 # Generate a vector of indices
    vse32.v v0, (t0)         # Store the generated vector in helperVector
    ret
    
# Bit-reverse elements of a vector
# Input:v29 = input vector, a0 = N
# Output: v29 = bit-reversed vector
vreverse:
    addi sp, sp, -4           # Save return address
    sw ra, 0(sp)
    
    call logint               # Compute log2(N) in a0
    
    lw ra, 0(sp)              # Restore return address
    addi sp, sp, 4
    
    li t1, 1                  # Constant 1 for bit manipulation
    li t0, 1                  # Initialize bit position counter j
    vmv.v.x v28, zero         # Clear result vector v28
    
vreverse_loop:            # Loop for j <= log2(N)
    bgt t0, a0, vreverse_end  # Break if j > log2(N)
    
    sub t2, a0, t0            # t2 = log2(N) - j
    sll t3, t1, t2            # t3 = (1 << (log2(N) - j))
    
    vand.vx v0, v29, t3       # v0 = v29 & t3 (check bit)
    # vmsne:Vector Mask Set Not Equa
    vmsne.vx v0, v0, zero     # v0 = 1 if set, else 0
    
    addi t4, t0, -1           # t4 = j - 1
    sll t3, t1, t4            # t3 = (1 << (j - 1))
    
    vor.vx v28, v28, t3, v0.t # Set bit in v28 where mask is true
    
    addi t0, t0, 1            # Increment bit position
    j vreverse_loop

vreverse_end:
    vmv.v.v v29, v28          # Store result in v29
    jr ra                     # Return

    The usage of vreverse would look something like this:
    ```c
    mv a0, a2 # a0 = N, input for reverse function. Overwriting a0 is fine
    vmv.v.v v29, v26 # v29 = v26 = <i>, used as input for vreverse

    call vreverse # v29 now contains rev(N, <i>), keep it for later use

la t0, helperVector:
This instruction loads the base address of helperVector into the register t0. helperVector is an array or buffer used to store the generated vector.
lw t1, size:
This instruction loads an integer value (the size of the vector N) from memory into the register t1. This value represents the number of elements in the vector to be generated.
vsetvli t2, t1, e32, m1:
This instruction sets the vector length and type. It sets the vector length to t1 (i.e., N), with each element being 32 bits (e32) and a multiplier factor of 1 (m1). This means that the subsequent vector operations will target N 32-bit elements.
vid.v v0:
This instruction generates a vector v0 with elements consisting of consecutive index values (0, 1, 2, …, N-1). It is an initialization vector typically used for testing or as indices for data.
vse32.v v0, (t0):
This instruction stores the data in vector v0 to the memory address pointed to by t0 (i.e., helperVector). This means the generated index vector will be written into helperVector.

Application to FFT

This code implements the Fast Fourier Transform (FFT) algorithm, which can be used to analyze the frequency components of a signal.

Given an input signal of [1, 1, 1, 1, 0, 0, 0, 0], the FFT result will display the spectrum of the signal. The real and imaginary parts of the result correspond to the magnitude and phase in the frequency domain.









































































#include <stdio.h>
#include <math.h>
#include <complex.h>
#define PI 3.14159265358979323846

/* Calculates the log2 of number */
int logint(int N) {
    int k = N, i = 0;
    while (k) {
        k >>= 1;
        i++;
    }
    return i - 1;
}

/* Bitwise reverses the number */
int reverse(int N, int n) {
    int j, p = 0;
    for (j = 1; j <= logint(N); j++) {
        if (n & (1 << (logint(N) - j))) {
            p |= 1 << (j - 1);
        }
    }
    return p;
}

/* Performs the FFT */
void fft(complex double *a, int N) {
    // Bit-reversal permutation
    for (int i = 0; i < N; i++) {
        int rev = reverse(N, i);
        if (i < rev) {
            complex double temp = a[i];
            a[i] = a[rev];
            a[rev] = temp;
        }
    }

    // FFT computation
    for (int s = 1; s <= logint(N); s++) {
        int m = 1 << s; // 2^s
        complex double wm = cexp(-2.0 * PI * I / m);
        for (int k = 0; k < N; k += m) {
            complex double w = 1.0;
            for (int j = 0; j < m / 2; j++) {
                complex double t = w * a[k + j + m / 2];
                complex double u = a[k + j];
                a[k + j] = u + t;
                a[k + j + m / 2] = u - t;
                w *= wm;
            }
        }
    }
}

int main() {
    int N = 8; // Length of input (must be a power of 2)
    complex double a[] = {1, 1, 1, 1, 0, 0, 0, 0}; // Example input

    printf("Input:\n");
    for (int i = 0; i < N; i++) {
        printf("%lf + %lfi\n", creal(a[i]), cimag(a[i]));
    }

    fft(a, N);

    printf("\nOutput:\n");
    for (int i = 0; i < N; i++) {
        printf("%lf + %lfi\n", creal(a[i]), cimag(a[i]));
    }

    return 0;
}

Using instruction translate to assembly.

riscv64-linux-gnu-gcc -O3 -march=rv64imafdzba_zbb_zbc_zbs_v -S -o output.s fft.c
riscv64-linux-gnu-as -march=rv64imafdzba_zbb_zbc_zbs_v -o output.o output.s
riscv64-linux-gnu-gcc -o output output.o -lm

When we encountered issues installing riscv32-linux-gnu-gcc, we decided to switch to using rv64i instead.







































































































































































































































































































































































































































































.file	"fft.c"
	.option pic
	.attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
	.attribute unaligned_access, 0
	.attribute stack_align, 16
	.text
	.align	2
	.globl	logint
	.type	logint, @function
logint:
.LFB23:
	.cfi_startproc
	mv	a5,a0
	beq	a0,zero,.L4
	li	a4,0
.L3:
	sraiw	a5,a5,1
	mv	a0,a4
	addiw	a4,a4,1
	bne	a5,zero,.L3
	ret
.L4:
	li	a0,-1
	ret
	.cfi_endproc
.LFE23:
	.size	logint, .-logint
	.align	2
	.globl	reverse
	.type	reverse, @function
reverse:
.LFB24:
	.cfi_startproc
	bne	a0,zero,.L24
	ret
.L24:
	li	a2,1
	mv	a7,a2
	li	a6,0
.L9:
	mv	a5,a0
	li	a4,0
.L13:
	sraiw	a5,a5,1
	addiw	a4,a4,1
	bne	a5,zero,.L13
	ble	a4,a2,.L25
	mv	a4,a0
.L11:
	sraiw	a4,a4,1
	mv	a3,a5
	addiw	a5,a5,1
	bne	a4,zero,.L11
	subw	a3,a3,a2
	sraw	a3,a1,a3
	andi	a3,a3,1
	beq	a3,zero,.L12
	addiw	a5,a2,-1
	sllw	a5,a7,a5
	or	a6,a6,a5
.L12:
	addiw	a2,a2,1
	j	.L9
.L25:
	mv	a0,a6
	ret
	.cfi_endproc
.LFE24:
	.size	reverse, .-reverse
	.align	2
	.globl	fft
	.type	fft, @function
fft:
.LFB25:
	.cfi_startproc
	addi	sp,sp,-192
	.cfi_def_cfa_offset 192
	sd	s3,152(sp)
	sd	s8,112(sp)
	sd	ra,184(sp)
	.cfi_offset 19, -40
	.cfi_offset 24, -80
	.cfi_offset 1, -8
	mv	s8,a1
	mv	s3,a0
	ble	a1,zero,.L27
	mv	a7,a0
	li	a6,1
	li	a0,0
	vsetivli	zero,2,e64,m1,ta,ma
.L28:
	li	a1,0
	li	a2,1
.L33:
	mv	a5,s8
	li	a4,0
.L31:
	sraiw	a5,a5,1
	addiw	a4,a4,1
	bne	a5,zero,.L31
	bge	a2,a4,.L63
	mv	a4,s8
.L29:
	sraiw	a4,a4,1
	mv	a3,a5
	addiw	a5,a5,1
	bne	a4,zero,.L29
	subw	a3,a3,a2
	sraw	a3,a0,a3
	andi	a3,a3,1
	beq	a3,zero,.L30
	addiw	a5,a2,-1
	sllw	a5,a6,a5
	or	a1,a1,a5
.L30:
	addiw	a2,a2,1
	j	.L33
.L63:
	ble	a1,a0,.L32
	slli	a5,a1,4
	add	a5,s3,a5
	vle64.v	v2,0(a5)
	vle64.v	v1,0(a7)
	vse64.v	v2,0(a7)
	vse64.v	v1,0(a5)
.L32:
	addiw	a0,a0,1
	addi	a7,a7,16
	bne	s8,a0,.L28
.L27:
	bne	s8,zero,.L64
.L55:
	ld	ra,184(sp)
	.cfi_remember_state
	.cfi_restore 1
	ld	s3,152(sp)
	.cfi_restore 19
	ld	s8,112(sp)
	.cfi_restore 24
	addi	sp,sp,192
	.cfi_def_cfa_offset 0
	jr	ra
.L64:
	.cfi_restore_state
	fsd	fs0,72(sp)
	fsd	fs1,64(sp)
	fsd	fs4,40(sp)
	.cfi_offset 40, -120
	.cfi_offset 41, -128
	.cfi_offset 52, -152
	fld	fs1,.LC2,a5
	fld	fs0,.LC3,a5
	fld	fs4,.LC1,a5
	sd	s5,136(sp)
	.cfi_offset 21, -56
	li	s5,1
	sd	s4,144(sp)
	sd	s0,176(sp)
	sd	s1,168(sp)
	sd	s2,160(sp)
	sd	s6,128(sp)
	sd	s7,120(sp)
	sd	s9,104(sp)
	sd	s10,96(sp)
	sd	s11,88(sp)
	fsd	fs2,56(sp)
	fsd	fs3,48(sp)
	.cfi_offset 20, -48
	.cfi_offset 8, -16
	.cfi_offset 9, -24
	.cfi_offset 18, -32
	.cfi_offset 22, -64
	.cfi_offset 23, -72
	.cfi_offset 25, -88
	.cfi_offset 26, -96
	.cfi_offset 27, -104
	.cfi_offset 50, -136
	.cfi_offset 51, -144
	mv	s4,s5
.L38:
	mv	a5,s8
	li	a4,0
.L43:
	sraiw	a5,a5,1
	addiw	a4,a4,1
	bne	a5,zero,.L43
	ble	a4,s5,.L65
	sllw	s10,s4,s5
	ble	s8,zero,.L36
	li	a5,1
	ble	s10,a5,.L36
	fcvt.d.w	fa1,s10
	srliw	s6,s10,31
	addw	s6,s6,s10
	fdiv.d	fa0,fs1,fa1
	sraiw	s6,s6,1
	slli	s11,s10,4
	mv	s7,s3
	li	s9,0
	fdiv.d	fa1,fs0,fa1
	call	cexp@plt
	fmv.d	fs3,fa0
	fmv.d	fs2,fa1
	slli	a3,s6,4
.L42:
	fmv.d.x	fa3,zero
	fmv.d	fa2,fs4
	add	s1,a3,s7
	mv	s0,s7
	li	s2,0
.L41:
	fld	fa1,8(s1)
	fld	fa0,0(s1)
	fmul.d	ft0,fa1,fa2
	fmul.d	ft1,fa1,fa3
	fmadd.d	ft0,fa0,fa3,ft0
	fmsub.d	ft1,fa0,fa2,ft1
	feq.d	a4,ft0,ft0
	feq.d	a5,ft1,ft1
	and	a5,a5,a4
	beq	a5,zero,.L66
.L39:
	fmul.d	fa5,fs3,fa3
	fmul.d	fa4,fs2,fa3
	fld	ft3,0(s0)
	fld	ft2,8(s0)
	fadd.d	fa0,ft1,ft3
	fadd.d	fa1,ft0,ft2
	fmadd.d	fa5,fs2,fa2,fa5
	fmsub.d	fa4,fs3,fa2,fa4
	fsub.d	ft3,ft3,ft1
	fsub.d	ft2,ft2,ft0
	fsd	fa0,0(s0)
	fsd	fa1,8(s0)
	feq.d	a4,fa5,fa5
	feq.d	a5,fa4,fa4
	fsd	ft3,0(s1)
	fsd	ft2,8(s1)
	and	a5,a5,a4
	beq	a5,zero,.L67
.L40:
	addiw	s2,s2,1
	fmv.d	fa2,fa4
	fmv.d	fa3,fa5
	addi	s1,s1,16
	addi	s0,s0,16
	bgt	s6,s2,.L41
	addw	s9,s10,s9
	add	s7,s7,s11
	bgt	s8,s9,.L42
.L36:
	addiw	s5,s5,1
	j	.L38
.L67:
	fmv.d	fa1,fa3
	fmv.d	fa0,fa2
	fmv.d	fa3,fs2
	fmv.d	fa2,fs3
	sd	a3,8(sp)
	call	__muldc3@plt
	fmv.d	fa4,fa0
	fmv.d	fa5,fa1
	ld	a3,8(sp)
	j	.L40
.L66:
	sd	a3,24(sp)
	fsd	fa3,16(sp)
	fsd	fa2,8(sp)
	call	__muldc3@plt
	fmv.d	ft1,fa0
	fmv.d	ft0,fa1
	ld	a3,24(sp)
	fld	fa3,16(sp)
	fld	fa2,8(sp)
	j	.L39
.L65:
	ld	s0,176(sp)
	.cfi_restore 8
	ld	s1,168(sp)
	.cfi_restore 9
	ld	s2,160(sp)
	.cfi_restore 18
	ld	s4,144(sp)
	.cfi_restore 20
	ld	s5,136(sp)
	.cfi_restore 21
	ld	s6,128(sp)
	.cfi_restore 22
	ld	s7,120(sp)
	.cfi_restore 23
	ld	s9,104(sp)
	.cfi_restore 25
	ld	s10,96(sp)
	.cfi_restore 26
	ld	s11,88(sp)
	.cfi_restore 27
	fld	fs0,72(sp)
	.cfi_restore 40
	fld	fs1,64(sp)
	.cfi_restore 41
	fld	fs2,56(sp)
	.cfi_restore 50
	fld	fs3,48(sp)
	.cfi_restore 51
	fld	fs4,40(sp)
	.cfi_restore 52
	j	.L55
	.cfi_endproc
.LFE25:
	.size	fft, .-fft
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align	3
.LC4:
	.string	"Input:"
	.align	3
.LC5:
	.string	"%lf + %lfi\n"
	.align	3
.LC6:
	.string	"\nOutput:"
	.section	.text.startup,"ax",@progbits
	.align	2
	.globl	main
	.type	main, @function
main:
.LFB26:
	.cfi_startproc
	vsetivli	zero,16,e64,m8,ta,ma
	lla	a5,.LANCHOR0
	vle64.v	v8,0(a5)
	addi	sp,sp,-192
	.cfi_def_cfa_offset 192
	sd	s4,144(sp)
	.cfi_offset 20, -48
	la	s4,__stack_chk_guard
	sd	s0,176(sp)
	sd	s1,168(sp)
	sd	s2,160(sp)
	sd	s3,152(sp)
	sd	ra,184(sp)
	.cfi_offset 8, -16
	.cfi_offset 9, -24
	.cfi_offset 18, -32
	.cfi_offset 19, -40
	.cfi_offset 1, -8
	addi	s3,sp,8
	ld	a5, 0(s4)
	sd	a5, 136(sp)
	li	a5, 0
	vse64.v	v8,0(s3)
	lla	a0,.LC4
	mv	s0,s3
	addi	s2,sp,136
	mv	s1,s3
	call	puts@plt
.L69:
	ld	a3,8(s1)
	ld	a2,0(s1)
	lla	a1,.LC5
	li	a0,2
	addi	s1,s1,16
	call	__printf_chk@plt
	bne	s1,s2,.L69
	mv	a0,s3
	li	a1,8
	call	fft
	lla	a0,.LC6
	call	puts@plt
.L70:
	ld	a3,8(s0)
	ld	a2,0(s0)
	lla	a1,.LC5
	li	a0,2
	addi	s0,s0,16
	call	__printf_chk@plt
	bne	s0,s2,.L70
	ld	a4, 136(sp)
	ld	a5, 0(s4)
	xor	a5, a4, a5
	li	a4, 0
	bne	a5,zero,.L78
	ld	ra,184(sp)
	.cfi_remember_state
	.cfi_restore 1
	ld	s0,176(sp)
	.cfi_restore 8
	ld	s1,168(sp)
	.cfi_restore 9
	ld	s2,160(sp)
	.cfi_restore 18
	ld	s3,152(sp)
	.cfi_restore 19
	ld	s4,144(sp)
	.cfi_restore 20
	li	a0,0
	addi	sp,sp,192
	.cfi_def_cfa_offset 0
	jr	ra
.L78:
	.cfi_restore_state
	call	__stack_chk_fail@plt
	.cfi_endproc
.LFE26:
	.size	main, .-main
	.section	.rodata.cst8,"aM",@progbits,8
	.align	3
.LC1:
	.word	0
	.word	1072693248
	.align	3
.LC2:
	.word	0
	.word	-2147483648
	.align	3
.LC3:
	.word	1413754136
	.word	-1072094725
	.section	.rodata
	.align	3
	.set	.LANCHOR0,. + 0
.LC0:
	.word	0
	.word	1072693248
	.word	0
	.word	0
	.word	0
	.word	1072693248
	.word	0
	.word	0
	.word	0
	.word	1072693248
	.word	0
	.word	0
	.word	0
	.word	1072693248
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.ident	"GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0"
	.section	.note.GNU-stack,"",@progbits

time qemu-riscv64 output
real	0m0.600s
user	0m0.405s
sys	0m0.203s

Then we improve the code.



























































































































































































































































.file	"fft.c"
	.option pic
	.attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
	.attribute unaligned_access, 0
	.attribute stack_align, 16
	.text
	.align	2
	.globl	logint
	.type	logint, @function
logint:
.LFB23:
	.cfi_startproc
	clz 	a5,a0
	li 	a4,31
	sub 	a0,a4,a5
	ret
	.cfi_endproc
.LFE23:
	.size	logint, .-logint
	.align	2
	.globl	reverse
	.type	reverse, @function
reverse:
.LFB24:
   	 .cfi_startproc
   	 bne     a0, zero, .L24      
 	 ret
.L24:
   	 li      a2, 1               
 	 li      a6, 0              
.L9:
        mv      a5, a0              
        li      a4, 0               
.L13:
    	srai    a5, a5, 1           
    	addi    a4, a4, 1           
    	bne     a5, zero, .L13      
    	ble     a4, a2, .L25       
    	srl     a3, a1, a2          
    	andi    a3, a3, 1           
    	beqz    a3, .L12           
        sll     a5, a6, a2        
    	or      a6, a6, a5         
.L12:
    	addi    a2, a2, 1           
    	j       .L9                
.L25:
    	mv      a0, a6             
    	ret
    	.cfi_endproc
.LFE24:
	.size	reverse, .-reverse
	.align	2
	.globl	fft
	.type	fft, @function

fft:
.LFB25:
	.cfi_startproc
    	addi    sp, sp, -192
    	.cfi_def_cfa_offset 192
    	sd      s3, 152(sp)
    	sd      s8, 112(sp)
    	sd      ra, 184(sp)
    	.cfi_offset 19, -40
    	.cfi_offset 24, -80
    	.cfi_offset 1, -8
    	mv      s8, a1
    	mv      s3, a0
    	blez    a1, .L27        # 若 a1 <= 0，跳轉到.L27
    	li      a0, 0
    	li      a6, 1
    	mv      a7, a0          # 初始化指標
    	vsetivli zero, 2, e64, m1, ta, ma
.L28:
    	li      a1, 0           # 初始化暫存變數
    	li      a2, 1
.L33:
    	mv      a5, s8
    	li      a4, 0
.L31:
    	srai    a5, a5, 1
    	addi    a4, a4, 1
    	bnez    a5, .L31
    	bge     a2, a4, .L63    # 若 a2 >= a4，跳轉結束
    	srl     a3, a0, a2      # 提取位元
    	andi    a3, a3, 1
    	beqz    a3, .L30        # 若該位為 0，跳過 OR 操作
    	sll     a5, a6, a2
    	or      a1, a1, a5      # 累加結果
.L30:
    	addi    a2, a2, 1
    	j       .L33
.L63:
    	ble     a1, a0, .L32    # 若 a1 <= a0，跳轉到.L32
    	slli    a5, a1, 4       # 計算偏移
    	add     a5, s3, a5
    	vle64.v v2, (a5)        # 加載向量
    	vle64.v v1, (a7)
    	vse64.v v2, (a7)
    	vse64.v v1, (a5)
.L32:
    	addi    a0, a0, 1
    	addi    a7, a7, 16
    	bne     s8, a0, .L28
.L27:
    	ld      ra, 184(sp)
    	ld      s3, 152(sp)
    	ld      s8, 112(sp)
    	addi    sp, sp, 192
    	.cfi_def_cfa_offset 0
    	jr      ra
    	.cfi_endproc
.LFE25:
	.size	fft, .-fft
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align	3
.LC4:
	.string	"Input:"
	.align	3
.LC5:
	.string	"%lf + %lfi\n"
	.align	3
.LC6:
	.string	"\nOutput:"
	.section	.text.startup,"ax",@progbits
	.align	2
	.globl	main
	.type	main, @function
main:
.LFB26:
	.cfi_startproc
    	addi    sp, sp, -192
    	.cfi_def_cfa_offset 192
    	sd      ra, 184(sp)
    	sd      s0, 176(sp)
    	sd      s1, 168(sp)
    	sd      s2, 160(sp)
    	sd      s3, 152(sp)
    	sd      s4, 144(sp)
    	.cfi_offset 1, -8
    	.cfi_offset 8, -16
    	.cfi_offset 9, -24
    	.cfi_offset 18, -32
    	.cfi_offset 19, -40
    	.cfi_offset 20, -48
    	vsetivli zero, 16, e64, m8, ta, ma
    	lla     a5, .LANCHOR0
    	vle64.v v8, 0(a5)
    	addi    s3, sp, 8
    	vse64.v v8, 0(s3)
    	la      s4, __stack_chk_guard
    	ld      a5, 0(s4)
    	sd      a5, 136(sp)
    	lla     a0, .LC4
    	call    puts@plt
    	mv      s0, s3
    	mv      s1, s3
    	addi    s2, sp, 136
.L69:
    	ld      a3, 8(s1)
    	ld      a2, 0(s1)
    	lla     a1, .LC5
    	li      a0, 2
    	addi    s1, s1, 16
    	call    __printf_chk@plt
    	bne     s1, s2, .L69
    	mv      a0, s3
    	li      a1, 8
    	call    fft
    	lla     a0, .LC6
    	call    puts@plt
.L70:
    	ld      a3, 8(s0)
    	ld      a2, 0(s0)
    	lla     a1, .LC5
    	li      a0, 2
    	addi    s0, s0, 16
    	call    __printf_chk@plt
    	bne     s0, s2, .L70
    	ld      a4, 136(sp)
   	ld      a5, 0(s4)
    	xor     a5, a4, a5
    	beqz    a5, .L71
    	call    __stack_chk_fail@plt
    	j       .L72
.L71:
    	ld      ra, 184(sp)
    	ld      s0, 176(sp)
    	ld      s1, 168(sp)
   	ld      s2, 160(sp)
    	ld      s3, 152(sp)
    	ld      s4, 144(sp)
    	addi    sp, sp, 192
    	.cfi_def_cfa_offset 0
    	jr      ra
.L72:
    	.cfi_endproc
.LFE26:
	.size	main, .-main
	.section	.rodata.cst8,"aM",@progbits,8
	.align	3
.LC1:
	.word	0
	.word	1072693248
	.align	3
.LC2:
	.word	0
	.word	-2147483648
	.align	3
.LC3:
	.word	1413754136
	.word	-1072094725
	.section	.rodata
	.align	3
	.set	.LANCHOR0,. + 0
.LC0:
	.word	0
	.word	1072693248
	.word	0
	.word	0
	.word	0
	.word	1072693248
	.word	0
	.word	0
	.word	0
	.word	1072693248
	.word	0
	.word	0
	.word	0
	.word	1072693248
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.ident	"GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0"
	.section	.note.GNU-stack,"",@progbits

Difference Between lla and la
lla (Load Address Absolute)　: Only handles absolute addresses and does not account for a base address (e.g., PC-relative or GOT).
la (Load Address　: Adjusts the address based on the target's access model, such as PC-relative or using the Global Offset Table (GOT).

We improved the code, reducing nearly 200 lines and shortening the execution time.

real	0m0.498s
user	0m0.373s
sys	0m0.132s

Result

The following is the result of the output.
In this task, we accelerated its execution time by using b-extension and RVV.

Input:
1.000000 + 0.000000i
1.000000 + 0.000000i
1.000000 + 0.000000i
1.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i

Output:
4.000000 + 0.000000i
1.000000 + -2.414214i
0.000000 + 0.000000i
1.000000 + -0.414214i
0.000000 + 0.000000i
1.000000 + 0.414214i
0.000000 + 0.000000i
1.000000 + 2.414214i

Accelerate Fast Fourier Transform

Introduction of the task

Introduction to Fourier transform

DFT (Discrete Fourier transform)

Characteristics of DFT:

Applications of DFT:

Cooley-Tukey FFT (Fast Fourier Transform)

PerfMPL-FFT

FFTW (Fast Fourier Transform in the West)

QEMU

What is QEMU?

QEMU for Linux

Install QEMU and Risc-V tool chain

Download Ubuntu for RISC-V Platforms

Boot the system

How to create a file and run it on QEMU

How to tanslate c into riscv

How to expand the virtual machine's space

QEMU for Mac

Issues encountered in QEMU

Code

Part1 : logint

Functionality

Logic

Convert to RISC-V assembly:

Compiling the .s File in RV32I

B-extension

Compiling the .s File in B extension

vector extension

Part2 : bitwise reverse

Functionality

Logic

Convert to RISC-V assembly:

B-extension

vector extension

Application to FFT

Result

Reference

Compiling the `.s` File in RV32I

Compiling the `.s` File in B extension