謝廷昇, 王晴文
Task : Research FFT and its improvement schemes.
Implement the following:
RV32I : Develop the base RV32I ISA implementation.
Using B Extension : Integrate and utilize the Bit-Manipulation (B) extension.
Using RVV : Incorporate and utilize the RISC-V Vector Extension (RVV).
The code should be validated using QEMU.
Finally, prepare a signal processing application that utilizes your improved FFT as a technical demonstration.
The discrete Fourier transform (DFT) is defined as follows:
Where:
The Discrete Fourier Transform (DFT) is a mathematical tool that transforms a signal from the time domain to the frequency domain. Its primary purpose is to analyze the spectral characteristics of a signal. DFT is widely used in various fields, including digital signal processing, image processing, sound analysis, and scientific and engineering applications.
An efficient algorithm used for rapidly computing the Discrete Fourier Transform and its inverse . Fourier Transform is a core technique in signal processing for analyzing frequency-domain characteristics, and FFT reduces computational complexity, making the application of Fourier Transform more efficient.
Direct computation of DFT requires O(
The decomposition method of the Cooley-Tukey FFT algorithm can be expressed as:
Where:
In terms of algorithm implementation, this method used the SIMD vectorization, and vector features of the RISC-V instruction set architecture to assemble and implement the optimized FFT algorithm manually and optimized the selection and sequencing of instructions to maximize and improve the performance of the FFT algorithm library based on the RISC-V architecture, and finally realize a high-performance algorithm library based on the RISC-V CPUs. The experimental results show that: on the RISC-V platform, the PerfMPL-FFT algorithm library implemented in this paper has a better performance improvement for DFT processing of single-precision and double-precision data compared with the open-source FFTW library.
The key contributions of this paper are summarized as follows:
There are various fast Fourier transform (FFT) algorithms, including open-source libraries like FFTW, as well as FFT optimization libraries specifically designed by some chip companies for their own hardware, such as Intel’s MKL_FFT and Huawei’s KML_FFT.
FFTW (Fast Fourier Transform in the West) is an open-source library designed for fast Fourier transform (FFT). It is distributed under a free software license, similar to GPL, and you can download the latest version from its official website or code repository, following the licensing terms for usage. This library enables efficient numerical computations across various computing platforms.
By leveraging pre-decomposition algorithms and caching techniques, FFTW significantly accelerates FFT computations, saving several times the computation time compared to traditional FFT algorithms. FFTW is widely applicable, not only in scientific computing and signal processing but also in fields like image and audio processing. Its exceptional performance makes it a vital tool in numerous scientific and engineering applications.
Here are the URLs for FFTW:
Official Website : http://fftw.org/
GitHub Repository : https://github.com/FFTW/fftw3
QEMU (Quick Emulator) is an open-source general-purpose machine emulator and virtualization tool that primarily supports the following modes:
User Mode Emulation: In this mode, QEMU allows programs compiled for one CPU architecture to run on a different CPU architecture. eg.qemu-riscv64
System Emulation: In this mode, QEMU provides a complete environment for running a guest system, including the CPU, memory, and peripheral devices. eg.qemu-system-riscv64
When emulating the RISC-V architecture, QEMU can simulate the virt virtual development board. While this board does not correspond to any physical hardware, it supports the emulation of various hardware devices, making it ideal for development and testing purposes.
sudo apt update
sudo apt install qemu qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils -y
git clone https://github.com/riscv/riscv-gnu-toolchain
Since we are working with the RISC-V architecture, we need to begin compiling U-Boot and OpenSBI.
git clone https://github.com/u-boot/u-boot.git
cd u-boot
make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- qemu-riscv64_smode_defconfig
make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- -j$(nproc)
Next, the desired kernel will be generated as shown in the figure below.
git clone https://github.com/riscv-software-src/opensbi.git
cd opensbi
make PLATFORM=generic CROSS_COMPILE=riscv64-linux-gnu- PLATFORM_RISCV_XLEN=64
Completing the above steps will successfully install OpenSBI.
Then navigate to the following path.
cd build/platform/generic/firmware
A fw_bin will then be generated, which is our firmware.
Go to the following URL to download the corresponding IMG file (QEMU).
https://ubuntu.com/download/risc-v
Once the kernel and firmware are ready, pass them with the corresponding parameters to QEMU to boot the system.
qemu-system-riscv64 -machine virt -nographic -m 2048 -smp 4 -bios /home/annie/opensbi/build/platform/generic/firmware/fw_jump.bin -kernel /home/annie/u-boot/u-boot.bin -device virtio-net-device,netdev=eth0 -netdev user,id=eth0 -device virtio-rng-pci -drive file=/home/annie/ubuntu-24.10-preinstalled-server-riscv64.img,format=raw,if=virtio
The path should be adjusted to the correct path accordingly.
Basic Commands and Parameter Descriptions
-qemu-system-riscv64
Launches the RISC-V 64-bit version of the QEMU emulator.
-machine virt
Uses the virtual machine type (virt machine type), which is a generic virtualization hardware platform provided by QEMU for RISC-V.
-nographic
Disables graphical output, redirecting display output to the terminal.
-m 2048
Allocates 2048 MB of virtual memory to the emulator.
-smp 4
Configures the emulator with 4 virtual CPU cores.
-bios
/home/annie/opensbi/build/platform/generic/firmware/fw_jump.bin
Specifies the OpenSBI firmware fw_jump.bin as the BIOS for booting U-Boot.
-kernel /home/annie/u-boot/u-boot.bin
Loads the U-Boot binary file as the kernel bootloader.
-device virtio-net-device,netdev=eth0
Configures a virtual network device using the VirtIO driver and connects it to the network device with ID eth0.
-netdev user,id=eth0
Simulates a network connection using "user mode networking" and assigns it the ID eth0.
-device virtio-rng-pci
Adds a PCI-based VirtIO random number generator device to provide entropy.
-drive file=/home/annie/ubuntu-24.10-preinstalled-server-riscv64.img,format=raw,if=virtio
Loads the specified Ubuntu system image file (ubuntu-24.10-preinstalled-server-riscv64.img) with the following options:
-format=raw: Reads the image file in raw format
.
if=virtio: Attaches the disk as a VirtIO device to the virtual machine.
Then, wait for the system to boot.
When you see the message "Completed socket interaction for boot stage final", press Enter
to log in.
The default username is: ubuntu
The default password is: ubuntu
Finally, check the system to confirm it is RISC-V:
cd /bin
uname -m
# It will show like this
ubuntu@ubuntu:/bin$ uname -m
riscv64
.s
FileCreate a simple hello.s
. Below is an example program :
.data
msg: .asciz "Hello, QEMU!\n"
.text
.global _start
_start:
# Write system call
li a7, 64 # syscall number for write
li a0, 1 # file descriptor (stdout)
la a1, msg # address of the message
li a2, 13 # length of the message
ecall
# Exit system call
li a7, 93 # syscall number for exit
li a0, 0 # exit code
ecall
The directive .global _start
is necessary in RISC-V assembly code.
_start
._start
is not defined as a global symbol, the linker may fail to locate it, resulting in the executable being unable to run properly..s
FileUse the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :
# Assemble the .s file into an object file
riscv64-linux-gnu-as -o hello.o hello.s
# Link the object file to create an executable
riscv64-linux-gnu-ld -o hello hello.o
Run the compiled program using QEMU :
qemu-riscv64 hello
After execution, the terminal should display :
Hello, QEMU!
First, locate the positions of stdio.h
and libm.a
.
ubuntu@ubuntu:~$ find /usr -name stdio.h
find /usr -name libm.a
/usr/include/riscv64-linux-gnu/bits/stdio.h
/usr/include/newlib/ssp/stdio.h
/usr/include/newlib/sys/stdio.h
/usr/include/newlib/stdio.h
/usr/include/stdio.h
/usr/lib/riscv64-linux-gnu/libm.a
The following instruction will translate C
into RISC-V
.
riscv64-unknown-elf-gcc -O2 -S fft.c -o fft.s \
-I/usr/include/newlib \
-L/usr/lib/riscv64-linux-gnu -lm
This instruction will translate c
into elf
file.
gcc -o fft.elf fft.c -lm
The following code can expand the virtual machine's space.
qemu-img resize /home/annie/ubuntu-24.10-preinstalled-server-riscv64.img +15G
Use $
to operate in terminal
environment: Macbook Air M3 Sonoma 14.6
$ riscv32-unknown-elf-gcc -march=rv32i_zba_zbb_zbc_zbs_v -mabi=ilp32 -S -o output.s input.c
Use Homebrew to install QEMU
$ brew install qemu
Check whether RISC-V has been supported
$ qemu-system-riscv32 --version
If we have installed correctly, we get the version below:
( version 9.2.0 in 2025/1 )
QEMU emulator version 9.2.0
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers
Then we compile RV32I/B extension/RVV via RISC-V GCC tools
Installation
$ brew tap riscv-software-src/homebrew-riscv
$ brew install riscv-tools
Verification
$ riscv64-unknown-elf-gcc --version
You may notice that the instruction in above is "riscv64".
It's normal.
Because most RISC-V toolchains target 64-bit architectures (riscv64) by default. However, these tools usually support 32-bit targets as well, by specifying the appropriate architecture flags at compile time.
If it shows that commend not found
, checking whether it is in the environment variable.
$ echo $PATH
If it is't, we add it manually and make sure again.
$ export PATH="/opt/homebrew/opt/riscv-gnu-toolchain/bin:$PATH"
$ source ~/.zshrc
Compile the C file
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -o test.elf test.c
-march=rv32i
:specity architecture as RV32I-mabi=ilp32
:specify ABI(Application Binary Interface)as 32-bit integersolution :
Convert a word to ASCII. This is used for all numbers greater than or equal to 0.
.data
buffer: .space 21 # String buffer for the converted number
newline: .asciz "\n" # Newline character
.text
.global _start
_start:
int_to_string:
addi sp, sp, -16 # Save stack frame
sw ra, 12(sp) # Save return address
sw t0, 8(sp) # Save t0
sw t1, 4(sp) # Save t1
mv t0, a0 # Copy input number (a0) to t0
addi t1, a1, 16 # Point to the end of the buffer
li a0, 0 # Initialize the length counter
loop:
li t3, 10 # Load divisor (10)
rem t2, t0, t3 # t2 = t0 % 10 (remainder)
addi t2, t2, '0' # Convert the remainder to ASCII character
addi t1, t1, -1 # Move the buffer pointer back
sb t2, 0(t1) # Store the character in the buffer
div t0, t0, t3 # t0 = t0 / 10 (integer division)
addi a0, a0, 1 # Increment the length counter
bnez t0, loop # Repeat if t0 != 0
mv a1, t1 # Return the string starting address in a1
lw ra, 12(sp) # Restore return address
lw t0, 8(sp) # Restore t0
lw t1, 4(sp) # Restore t1
addi sp, sp, 16 # Restore stack frame
ret # Return to caller
The function calculates
code :
/* Calculates the log2 of number */
int logint(int N)
{
int k = N, i = 0;
while (k) {
k >>= 1;
i++;
}
return i - 1;
}
# Takes input N (a0), returns its log base 2 in a0
logint:
addi sp, sp, -4
sw t0, 0(sp) # store t0
add t0, a0, zero # k = N
add a0, zero, zero # i = 0
logloop:
beq t0, zero, logloop_end # Exit if k == 0
srai t0, t0, 1 # k >>= 1
addi a0, a0, 1 # i++
j logloop
logloop_end:
addi a0, a0, -1 # Return i - 1 as the counter was incremented one extra time.
lw t0, 0(sp)
addi sp, sp, 4
jr ra
.s
File in RV32IUse the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :
riscv64-linux-gnu-as -o logint.o logint.s
riscv64-linux-gnu-ld -o logint logint.o
qemu-riscv64 logint
Using this instruction will see the time the file taken.
time qemu-riscv64 logint
for example :
real 0m0.139s
user 0m0.053s
sys 0m0.074s
set_log2n:
clz t0, a0 # Count leading zeros in a0
li t1, 31 # Maximum index for a 32-bit register
sub t1, t1, t0 # Compute log2(N)
la t0, logsize # Load address of 'logsize'
sw t1, 0(t0)
jr ra # Return to caller
logsize: .word 0
.s
File in B extensionUse the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :
riscv64-linux-gnu-as -march=rv64imafdzbb -o logint.o logint.s
riscv64-linux-gnu-ld -o logint logint.o
time qemu-riscv64 logint
# Calculate log base 2 of input
# Input:a0 = N
# Output: a0 = log2(N)
logint:
clz a0, a0 # Count leading zeros of a0
addi t0, zero, 31 # Load 31 (32-bit word size - 1)
sub a0, t0, a0
jr ra# Return
The function reverse
is used to reverse the binary representation of a number
Determine Bit Length :
Check and Reverse Bits :
&
and |
.Return Result:
code :
/* Bitwise reverses the number */
int reverse(int N, int n)
{
int j, p = 0;
for (j = 1; j <= logint(N); j++) {
if (n & (1 << (logint(N) - j))) {
p |= 1 << (j - 1);
}
}
return p;
}
# Takes inputs N(a0) and n(a1), reverses the number in binary
reverse:
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
call logint # a0 = lg(N)
addi s0, zero, 1 # j = 1
add s1, zero, zero # p = 0
forloop_reverse:
bgt s0, a0, forloop_reverse_end # if j > lg(N) then end loop
sub s2, a0, s0
addi s3, zero, 1
sll s3, s3, s2
and s3, a1, s3
beq s3, zero, elses3 # If not, skip
ifs3:
addi s4, s0, -1 # s4 = j - 1
addi s5, zero, 1
sll s5, s5, s4
or s1, s1, s5
elses3:
addi s0, s0, 1
j forloop_reverse
forloop_reverse_end:
add a0, s1, zero # Return p
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 28
jr ra
# Function: reverse
# Reverses the binary digits of a 32-bit integer.
# Inputs:
# - a0: Input number to reverse.
# - a1: Number of significant bits to reverse (optional; default 32).
# Outputs:
# - a0: The reversed binary number.
# Clobbers:
# - t0, t1, t2
reverse:
# Swap odd and even bits
lui t0, 0x55555 # Load upper 20 bits of 0x55555555
ori t0, t0, 0x555 # Complete loading 0x55555555
srli t1, a0, 1 # v >> 1
and t1, t1, t0 # (v >> 1) & 0x55555555
and t2, a0, t0
slli t2, t2, 1
or a0, t1, t2 # Store result in a0
# Swap consecutive pairs
lui t0, 0x33333 # Load upper 20 bits of 0x33333333
ori t0, t0, 0x333 # Complete loading 0x33333333
srli t1, a0, 2 # v >> 2
and t1, t1, t0 # (v >> 2) & 0x33333333
and t2, a0, t0
slli t2, t2, 2
or a0, t1, t2 # Store result in a0
# Swap nibbles
lui t0, 0x0F0F0 # Load upper 20 bits of 0x0F0F0F0F
ori t0, t0, 0x0F0 # Complete loading 0x0F0F0F0F
srli t1, a0, 4
and t1, t1, t0
and t2, a0, t0
slli t2, t2, 4
or a0, t1, t2 # Store result in a0
# Swap bytes
lui t0, 0x00FF0# Load upper 20 bits of 0x00FF00FF
ori t0, t0, 0x0FF# Complete loading 0x00FF00FF
srli t1, a0, 8
and t1, t1, t0
and t2, a0, t0
slli t2, t2, 8
or a0, t1, t2 # Store result in a0
# Swap 2-byte pairs
srli t1, a0, 16
slli t2, a0, 16
ora0, t1, t2 # Final result in a0
# Adjust number of significant bits
addi t1, zero, 32 # Load immediate value 32
sub t1, t1, a1 # Calculate shift amount
srl a0, a0, t1 # Shift right to fit bit count
jr ra # Return with result in a0
# Initialize the helper vector with sequential integers (0,1,2,...)
init_helper_vector:
la t0, helperVector # Load the base address of helperVector
lw t1, size # Load the size of the vector (N)
vsetvli t2, t1, e32, m1 # Set the vector length configuration
vid.v v0 # Generate a vector of indices
vse32.v v0, (t0) # Store the generated vector in helperVector
ret
# Bit-reverse elements of a vector
# Input:v29 = input vector, a0 = N
# Output: v29 = bit-reversed vector
vreverse:
addi sp, sp, -4 # Save return address
sw ra, 0(sp)
call logint # Compute log2(N) in a0
lw ra, 0(sp) # Restore return address
addi sp, sp, 4
li t1, 1 # Constant 1 for bit manipulation
li t0, 1 # Initialize bit position counter j
vmv.v.x v28, zero # Clear result vector v28
vreverse_loop: # Loop for j <= log2(N)
bgt t0, a0, vreverse_end # Break if j > log2(N)
sub t2, a0, t0 # t2 = log2(N) - j
sll t3, t1, t2 # t3 = (1 << (log2(N) - j))
vand.vx v0, v29, t3 # v0 = v29 & t3 (check bit)
# vmsne:Vector Mask Set Not Equa
vmsne.vx v0, v0, zero # v0 = 1 if set, else 0
addi t4, t0, -1 # t4 = j - 1
sll t3, t1, t4 # t3 = (1 << (j - 1))
vor.vx v28, v28, t3, v0.t # Set bit in v28 where mask is true
addi t0, t0, 1 # Increment bit position
j vreverse_loop
vreverse_end:
vmv.v.v v29, v28 # Store result in v29
jr ra # Return
The usage of vreverse would look something like this:
```c
mv a0, a2 # a0 = N, input for reverse function. Overwriting a0 is fine
vmv.v.v v29, v26 # v29 = v26 = <i>, used as input for vreverse
call vreverse # v29 now contains rev(N, <i>), keep it for later use
la t0, helperVector:
This instruction loads the base address of helperVector into the register t0. helperVector is an array or buffer used to store the generated vector.
lw t1, size:
This instruction loads an integer value (the size of the vector N) from memory into the register t1. This value represents the number of elements in the vector to be generated.
vsetvli t2, t1, e32, m1:
This instruction sets the vector length and type. It sets the vector length to t1 (i.e., N), with each element being 32 bits (e32) and a multiplier factor of 1 (m1). This means that the subsequent vector operations will target N 32-bit elements.
vid.v v0:
This instruction generates a vector v0 with elements consisting of consecutive index values (0, 1, 2, …, N-1). It is an initialization vector typically used for testing or as indices for data.
vse32.v v0, (t0):
This instruction stores the data in vector v0 to the memory address pointed to by t0 (i.e., helperVector). This means the generated index vector will be written into helperVector.
This code implements the Fast Fourier Transform (FFT) algorithm, which can be used to analyze the frequency components of a signal.
Given an input signal of [1, 1, 1, 1, 0, 0, 0, 0], the FFT result will display the spectrum of the signal. The real and imaginary parts of the result correspond to the magnitude and phase in the frequency domain.
#include <stdio.h>
#include <math.h>
#include <complex.h>
#define PI 3.14159265358979323846
/* Calculates the log2 of number */
int logint(int N) {
int k = N, i = 0;
while (k) {
k >>= 1;
i++;
}
return i - 1;
}
/* Bitwise reverses the number */
int reverse(int N, int n) {
int j, p = 0;
for (j = 1; j <= logint(N); j++) {
if (n & (1 << (logint(N) - j))) {
p |= 1 << (j - 1);
}
}
return p;
}
/* Performs the FFT */
void fft(complex double *a, int N) {
// Bit-reversal permutation
for (int i = 0; i < N; i++) {
int rev = reverse(N, i);
if (i < rev) {
complex double temp = a[i];
a[i] = a[rev];
a[rev] = temp;
}
}
// FFT computation
for (int s = 1; s <= logint(N); s++) {
int m = 1 << s; // 2^s
complex double wm = cexp(-2.0 * PI * I / m);
for (int k = 0; k < N; k += m) {
complex double w = 1.0;
for (int j = 0; j < m / 2; j++) {
complex double t = w * a[k + j + m / 2];
complex double u = a[k + j];
a[k + j] = u + t;
a[k + j + m / 2] = u - t;
w *= wm;
}
}
}
}
int main() {
int N = 8; // Length of input (must be a power of 2)
complex double a[] = {1, 1, 1, 1, 0, 0, 0, 0}; // Example input
printf("Input:\n");
for (int i = 0; i < N; i++) {
printf("%lf + %lfi\n", creal(a[i]), cimag(a[i]));
}
fft(a, N);
printf("\nOutput:\n");
for (int i = 0; i < N; i++) {
printf("%lf + %lfi\n", creal(a[i]), cimag(a[i]));
}
return 0;
}
Using instruction translate to assembly.
riscv64-linux-gnu-gcc -O3 -march=rv64imafdzba_zbb_zbc_zbs_v -S -o output.s fft.c
riscv64-linux-gnu-as -march=rv64imafdzba_zbb_zbc_zbs_v -o output.o output.s
riscv64-linux-gnu-gcc -o output output.o -lm
When we encountered issues installing riscv32-linux-gnu-gcc, we decided to switch to using rv64i instead.
.file "fft.c"
.option pic
.attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
.attribute unaligned_access, 0
.attribute stack_align, 16
.text
.align 2
.globl logint
.type logint, @function
logint:
.LFB23:
.cfi_startproc
mv a5,a0
beq a0,zero,.L4
li a4,0
.L3:
sraiw a5,a5,1
mv a0,a4
addiw a4,a4,1
bne a5,zero,.L3
ret
.L4:
li a0,-1
ret
.cfi_endproc
.LFE23:
.size logint, .-logint
.align 2
.globl reverse
.type reverse, @function
reverse:
.LFB24:
.cfi_startproc
bne a0,zero,.L24
ret
.L24:
li a2,1
mv a7,a2
li a6,0
.L9:
mv a5,a0
li a4,0
.L13:
sraiw a5,a5,1
addiw a4,a4,1
bne a5,zero,.L13
ble a4,a2,.L25
mv a4,a0
.L11:
sraiw a4,a4,1
mv a3,a5
addiw a5,a5,1
bne a4,zero,.L11
subw a3,a3,a2
sraw a3,a1,a3
andi a3,a3,1
beq a3,zero,.L12
addiw a5,a2,-1
sllw a5,a7,a5
or a6,a6,a5
.L12:
addiw a2,a2,1
j .L9
.L25:
mv a0,a6
ret
.cfi_endproc
.LFE24:
.size reverse, .-reverse
.align 2
.globl fft
.type fft, @function
fft:
.LFB25:
.cfi_startproc
addi sp,sp,-192
.cfi_def_cfa_offset 192
sd s3,152(sp)
sd s8,112(sp)
sd ra,184(sp)
.cfi_offset 19, -40
.cfi_offset 24, -80
.cfi_offset 1, -8
mv s8,a1
mv s3,a0
ble a1,zero,.L27
mv a7,a0
li a6,1
li a0,0
vsetivli zero,2,e64,m1,ta,ma
.L28:
li a1,0
li a2,1
.L33:
mv a5,s8
li a4,0
.L31:
sraiw a5,a5,1
addiw a4,a4,1
bne a5,zero,.L31
bge a2,a4,.L63
mv a4,s8
.L29:
sraiw a4,a4,1
mv a3,a5
addiw a5,a5,1
bne a4,zero,.L29
subw a3,a3,a2
sraw a3,a0,a3
andi a3,a3,1
beq a3,zero,.L30
addiw a5,a2,-1
sllw a5,a6,a5
or a1,a1,a5
.L30:
addiw a2,a2,1
j .L33
.L63:
ble a1,a0,.L32
slli a5,a1,4
add a5,s3,a5
vle64.v v2,0(a5)
vle64.v v1,0(a7)
vse64.v v2,0(a7)
vse64.v v1,0(a5)
.L32:
addiw a0,a0,1
addi a7,a7,16
bne s8,a0,.L28
.L27:
bne s8,zero,.L64
.L55:
ld ra,184(sp)
.cfi_remember_state
.cfi_restore 1
ld s3,152(sp)
.cfi_restore 19
ld s8,112(sp)
.cfi_restore 24
addi sp,sp,192
.cfi_def_cfa_offset 0
jr ra
.L64:
.cfi_restore_state
fsd fs0,72(sp)
fsd fs1,64(sp)
fsd fs4,40(sp)
.cfi_offset 40, -120
.cfi_offset 41, -128
.cfi_offset 52, -152
fld fs1,.LC2,a5
fld fs0,.LC3,a5
fld fs4,.LC1,a5
sd s5,136(sp)
.cfi_offset 21, -56
li s5,1
sd s4,144(sp)
sd s0,176(sp)
sd s1,168(sp)
sd s2,160(sp)
sd s6,128(sp)
sd s7,120(sp)
sd s9,104(sp)
sd s10,96(sp)
sd s11,88(sp)
fsd fs2,56(sp)
fsd fs3,48(sp)
.cfi_offset 20, -48
.cfi_offset 8, -16
.cfi_offset 9, -24
.cfi_offset 18, -32
.cfi_offset 22, -64
.cfi_offset 23, -72
.cfi_offset 25, -88
.cfi_offset 26, -96
.cfi_offset 27, -104
.cfi_offset 50, -136
.cfi_offset 51, -144
mv s4,s5
.L38:
mv a5,s8
li a4,0
.L43:
sraiw a5,a5,1
addiw a4,a4,1
bne a5,zero,.L43
ble a4,s5,.L65
sllw s10,s4,s5
ble s8,zero,.L36
li a5,1
ble s10,a5,.L36
fcvt.d.w fa1,s10
srliw s6,s10,31
addw s6,s6,s10
fdiv.d fa0,fs1,fa1
sraiw s6,s6,1
slli s11,s10,4
mv s7,s3
li s9,0
fdiv.d fa1,fs0,fa1
call cexp@plt
fmv.d fs3,fa0
fmv.d fs2,fa1
slli a3,s6,4
.L42:
fmv.d.x fa3,zero
fmv.d fa2,fs4
add s1,a3,s7
mv s0,s7
li s2,0
.L41:
fld fa1,8(s1)
fld fa0,0(s1)
fmul.d ft0,fa1,fa2
fmul.d ft1,fa1,fa3
fmadd.d ft0,fa0,fa3,ft0
fmsub.d ft1,fa0,fa2,ft1
feq.d a4,ft0,ft0
feq.d a5,ft1,ft1
and a5,a5,a4
beq a5,zero,.L66
.L39:
fmul.d fa5,fs3,fa3
fmul.d fa4,fs2,fa3
fld ft3,0(s0)
fld ft2,8(s0)
fadd.d fa0,ft1,ft3
fadd.d fa1,ft0,ft2
fmadd.d fa5,fs2,fa2,fa5
fmsub.d fa4,fs3,fa2,fa4
fsub.d ft3,ft3,ft1
fsub.d ft2,ft2,ft0
fsd fa0,0(s0)
fsd fa1,8(s0)
feq.d a4,fa5,fa5
feq.d a5,fa4,fa4
fsd ft3,0(s1)
fsd ft2,8(s1)
and a5,a5,a4
beq a5,zero,.L67
.L40:
addiw s2,s2,1
fmv.d fa2,fa4
fmv.d fa3,fa5
addi s1,s1,16
addi s0,s0,16
bgt s6,s2,.L41
addw s9,s10,s9
add s7,s7,s11
bgt s8,s9,.L42
.L36:
addiw s5,s5,1
j .L38
.L67:
fmv.d fa1,fa3
fmv.d fa0,fa2
fmv.d fa3,fs2
fmv.d fa2,fs3
sd a3,8(sp)
call __muldc3@plt
fmv.d fa4,fa0
fmv.d fa5,fa1
ld a3,8(sp)
j .L40
.L66:
sd a3,24(sp)
fsd fa3,16(sp)
fsd fa2,8(sp)
call __muldc3@plt
fmv.d ft1,fa0
fmv.d ft0,fa1
ld a3,24(sp)
fld fa3,16(sp)
fld fa2,8(sp)
j .L39
.L65:
ld s0,176(sp)
.cfi_restore 8
ld s1,168(sp)
.cfi_restore 9
ld s2,160(sp)
.cfi_restore 18
ld s4,144(sp)
.cfi_restore 20
ld s5,136(sp)
.cfi_restore 21
ld s6,128(sp)
.cfi_restore 22
ld s7,120(sp)
.cfi_restore 23
ld s9,104(sp)
.cfi_restore 25
ld s10,96(sp)
.cfi_restore 26
ld s11,88(sp)
.cfi_restore 27
fld fs0,72(sp)
.cfi_restore 40
fld fs1,64(sp)
.cfi_restore 41
fld fs2,56(sp)
.cfi_restore 50
fld fs3,48(sp)
.cfi_restore 51
fld fs4,40(sp)
.cfi_restore 52
j .L55
.cfi_endproc
.LFE25:
.size fft, .-fft
.section .rodata.str1.8,"aMS",@progbits,1
.align 3
.LC4:
.string "Input:"
.align 3
.LC5:
.string "%lf + %lfi\n"
.align 3
.LC6:
.string "\nOutput:"
.section .text.startup,"ax",@progbits
.align 2
.globl main
.type main, @function
main:
.LFB26:
.cfi_startproc
vsetivli zero,16,e64,m8,ta,ma
lla a5,.LANCHOR0
vle64.v v8,0(a5)
addi sp,sp,-192
.cfi_def_cfa_offset 192
sd s4,144(sp)
.cfi_offset 20, -48
la s4,__stack_chk_guard
sd s0,176(sp)
sd s1,168(sp)
sd s2,160(sp)
sd s3,152(sp)
sd ra,184(sp)
.cfi_offset 8, -16
.cfi_offset 9, -24
.cfi_offset 18, -32
.cfi_offset 19, -40
.cfi_offset 1, -8
addi s3,sp,8
ld a5, 0(s4)
sd a5, 136(sp)
li a5, 0
vse64.v v8,0(s3)
lla a0,.LC4
mv s0,s3
addi s2,sp,136
mv s1,s3
call puts@plt
.L69:
ld a3,8(s1)
ld a2,0(s1)
lla a1,.LC5
li a0,2
addi s1,s1,16
call __printf_chk@plt
bne s1,s2,.L69
mv a0,s3
li a1,8
call fft
lla a0,.LC6
call puts@plt
.L70:
ld a3,8(s0)
ld a2,0(s0)
lla a1,.LC5
li a0,2
addi s0,s0,16
call __printf_chk@plt
bne s0,s2,.L70
ld a4, 136(sp)
ld a5, 0(s4)
xor a5, a4, a5
li a4, 0
bne a5,zero,.L78
ld ra,184(sp)
.cfi_remember_state
.cfi_restore 1
ld s0,176(sp)
.cfi_restore 8
ld s1,168(sp)
.cfi_restore 9
ld s2,160(sp)
.cfi_restore 18
ld s3,152(sp)
.cfi_restore 19
ld s4,144(sp)
.cfi_restore 20
li a0,0
addi sp,sp,192
.cfi_def_cfa_offset 0
jr ra
.L78:
.cfi_restore_state
call __stack_chk_fail@plt
.cfi_endproc
.LFE26:
.size main, .-main
.section .rodata.cst8,"aM",@progbits,8
.align 3
.LC1:
.word 0
.word 1072693248
.align 3
.LC2:
.word 0
.word -2147483648
.align 3
.LC3:
.word 1413754136
.word -1072094725
.section .rodata
.align 3
.set .LANCHOR0,. + 0
.LC0:
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.ident "GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0"
.section .note.GNU-stack,"",@progbits
time qemu-riscv64 output
real 0m0.600s
user 0m0.405s
sys 0m0.203s
Then we improve the code.
.file "fft.c"
.option pic
.attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
.attribute unaligned_access, 0
.attribute stack_align, 16
.text
.align 2
.globl logint
.type logint, @function
logint:
.LFB23:
.cfi_startproc
clz a5,a0
li a4,31
sub a0,a4,a5
ret
.cfi_endproc
.LFE23:
.size logint, .-logint
.align 2
.globl reverse
.type reverse, @function
reverse:
.LFB24:
.cfi_startproc
bne a0, zero, .L24
ret
.L24:
li a2, 1
li a6, 0
.L9:
mv a5, a0
li a4, 0
.L13:
srai a5, a5, 1
addi a4, a4, 1
bne a5, zero, .L13
ble a4, a2, .L25
srl a3, a1, a2
andi a3, a3, 1
beqz a3, .L12
sll a5, a6, a2
or a6, a6, a5
.L12:
addi a2, a2, 1
j .L9
.L25:
mv a0, a6
ret
.cfi_endproc
.LFE24:
.size reverse, .-reverse
.align 2
.globl fft
.type fft, @function
fft:
.LFB25:
.cfi_startproc
addi sp, sp, -192
.cfi_def_cfa_offset 192
sd s3, 152(sp)
sd s8, 112(sp)
sd ra, 184(sp)
.cfi_offset 19, -40
.cfi_offset 24, -80
.cfi_offset 1, -8
mv s8, a1
mv s3, a0
blez a1, .L27 # 若 a1 <= 0,跳轉到.L27
li a0, 0
li a6, 1
mv a7, a0 # 初始化指標
vsetivli zero, 2, e64, m1, ta, ma
.L28:
li a1, 0 # 初始化暫存變數
li a2, 1
.L33:
mv a5, s8
li a4, 0
.L31:
srai a5, a5, 1
addi a4, a4, 1
bnez a5, .L31
bge a2, a4, .L63 # 若 a2 >= a4,跳轉結束
srl a3, a0, a2 # 提取位元
andi a3, a3, 1
beqz a3, .L30 # 若該位為 0,跳過 OR 操作
sll a5, a6, a2
or a1, a1, a5 # 累加結果
.L30:
addi a2, a2, 1
j .L33
.L63:
ble a1, a0, .L32 # 若 a1 <= a0,跳轉到.L32
slli a5, a1, 4 # 計算偏移
add a5, s3, a5
vle64.v v2, (a5) # 加載向量
vle64.v v1, (a7)
vse64.v v2, (a7)
vse64.v v1, (a5)
.L32:
addi a0, a0, 1
addi a7, a7, 16
bne s8, a0, .L28
.L27:
ld ra, 184(sp)
ld s3, 152(sp)
ld s8, 112(sp)
addi sp, sp, 192
.cfi_def_cfa_offset 0
jr ra
.cfi_endproc
.LFE25:
.size fft, .-fft
.section .rodata.str1.8,"aMS",@progbits,1
.align 3
.LC4:
.string "Input:"
.align 3
.LC5:
.string "%lf + %lfi\n"
.align 3
.LC6:
.string "\nOutput:"
.section .text.startup,"ax",@progbits
.align 2
.globl main
.type main, @function
main:
.LFB26:
.cfi_startproc
addi sp, sp, -192
.cfi_def_cfa_offset 192
sd ra, 184(sp)
sd s0, 176(sp)
sd s1, 168(sp)
sd s2, 160(sp)
sd s3, 152(sp)
sd s4, 144(sp)
.cfi_offset 1, -8
.cfi_offset 8, -16
.cfi_offset 9, -24
.cfi_offset 18, -32
.cfi_offset 19, -40
.cfi_offset 20, -48
vsetivli zero, 16, e64, m8, ta, ma
lla a5, .LANCHOR0
vle64.v v8, 0(a5)
addi s3, sp, 8
vse64.v v8, 0(s3)
la s4, __stack_chk_guard
ld a5, 0(s4)
sd a5, 136(sp)
lla a0, .LC4
call puts@plt
mv s0, s3
mv s1, s3
addi s2, sp, 136
.L69:
ld a3, 8(s1)
ld a2, 0(s1)
lla a1, .LC5
li a0, 2
addi s1, s1, 16
call __printf_chk@plt
bne s1, s2, .L69
mv a0, s3
li a1, 8
call fft
lla a0, .LC6
call puts@plt
.L70:
ld a3, 8(s0)
ld a2, 0(s0)
lla a1, .LC5
li a0, 2
addi s0, s0, 16
call __printf_chk@plt
bne s0, s2, .L70
ld a4, 136(sp)
ld a5, 0(s4)
xor a5, a4, a5
beqz a5, .L71
call __stack_chk_fail@plt
j .L72
.L71:
ld ra, 184(sp)
ld s0, 176(sp)
ld s1, 168(sp)
ld s2, 160(sp)
ld s3, 152(sp)
ld s4, 144(sp)
addi sp, sp, 192
.cfi_def_cfa_offset 0
jr ra
.L72:
.cfi_endproc
.LFE26:
.size main, .-main
.section .rodata.cst8,"aM",@progbits,8
.align 3
.LC1:
.word 0
.word 1072693248
.align 3
.LC2:
.word 0
.word -2147483648
.align 3
.LC3:
.word 1413754136
.word -1072094725
.section .rodata
.align 3
.set .LANCHOR0,. + 0
.LC0:
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.ident "GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0"
.section .note.GNU-stack,"",@progbits
Difference Between lla
and la
lla (Load Address Absolute) : Only handles absolute addresses and does not account for a base address (e.g., PC-relative or GOT).
la (Load Address : Adjusts the address based on the target's access model, such as PC-relative or using the Global Offset Table (GOT).
We improved the code, reducing nearly 200 lines and shortening the execution time.
real 0m0.498s
user 0m0.373s
sys 0m0.132s
The following is the result of the output.
In this task, we accelerated its execution time by using b-extension and RVV.
Input:
1.000000 + 0.000000i
1.000000 + 0.000000i
1.000000 + 0.000000i
1.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
0.000000 + 0.000000i
Output:
4.000000 + 0.000000i
1.000000 + -2.414214i
0.000000 + 0.000000i
1.000000 + -0.414214i
0.000000 + 0.000000i
1.000000 + 0.414214i
0.000000 + 0.000000i
1.000000 + 2.414214i