謝廷昇, 王晴文
GitHub
Introduction of the task
Task : Research FFT and its improvement schemes.
Implement the following:
RV32I : Develop the base RV32I ISA implementation.
Using B Extension : Integrate and utilize the Bit-Manipulation (B) extension.
Using RVV : Incorporate and utilize the RISC-V Vector Extension (RVV).
The code should be validated using QEMU.
Finally, prepare a signal processing application that utilizes your improved FFT as a technical demonstration.
The discrete Fourier transform (DFT) is defined as follows:
Where:
- (x(n)) : Input sequence
- (Y(k)) : Output sequence
- : Twiddle factor
- : Imaginary unit
The Discrete Fourier Transform (DFT) is a mathematical tool that transforms a signal from the time domain to the frequency domain. Its primary purpose is to analyze the spectral characteristics of a signal. DFT is widely used in various fields, including digital signal processing, image processing, sound analysis, and scientific and engineering applications.
Characteristics of DFT:
- Spectrum Analysis: Decomposes the input signal into a superposition of different frequency components.
- Inverse Transform (IDFT): Allows the reconstruction of the time-domain signal from the frequency-domain data.
- High Complexity: The computational complexity of a standard DFT is O( ) , which increases significantly with the length of the input signal.
Applications of DFT:
- Signal Processing: Filtering, modulation, and spectrum analysis.
Image Processing: Edge detection and compression (e.g., discrete cosine transform in JPEG).
- Engineering and Science: Seismic analysis and structural vibration studies.
An efficient algorithm used for rapidly computing the Discrete Fourier Transform and its inverse . Fourier Transform is a core technique in signal processing for analyzing frequency-domain characteristics, and FFT reduces computational complexity, making the application of Fourier Transform more efficient.
Direct computation of DFT requires O( ) computational complexity (where N is the number of data points), while FFT reduces the complexity to O( ), significantly improving efficiency. Its applications include signal processing (e.g., spectrum analysis and filtering), image processing (e.g., compression and denoising), audio processing (e.g., music analysis and echo cancellation), and scientific computation (e.g., simulation and solving partial differential equations). The basic principle of FFT is based on the divide-and-conquer approach, where DFT is broken down into smaller sub-problems for recursive computation. By dividing the input data into even and odd parts for separate calculations and then combining the results, the process is greatly optimized. Among the common implementations, the Cooley-Tukey FFT algorithm is the most widely used.
The decomposition method of the Cooley-Tukey FFT algorithm can be expressed as:
Where:
- (f(n)) and (g(n)) are complex sequences of length .
- : Twiddle factor
PerfMPL-FFT
In terms of algorithm implementation, this method used the SIMD vectorization, and vector features of the RISC-V instruction set architecture to assemble and implement the optimized FFT algorithm manually and optimized the selection and sequencing of instructions to maximize and improve the performance of the FFT algorithm library based on the RISC-V architecture, and finally realize a high-performance algorithm library based on the RISC-V CPUs. The experimental results show that: on the RISC-V platform, the PerfMPL-FFT algorithm library implemented in this paper has a better performance improvement for DFT processing of single-precision and double-precision data compared with the open-source FFTW library.
The key contributions of this paper are summarized as follows:
- Proposed optimization strategies: We introduced a set of optimization strategies and methods targeting the butterfly kernel in the Cooley-Tukey FFT algorithm.
- Designed a technology system for RISC-V: Based on the characteristics of the RISC-V architecture, we developed an optimization technology system for the FFT algorithm, enhancing its performance on the RISC-V platform.
- Implemented an efficient FFT library: We implemented a high-performance FFT algorithm library on the RISC-V platform, demonstrating significant performance advantages.
There are various fast Fourier transform (FFT) algorithms, including open-source libraries like FFTW, as well as FFT optimization libraries specifically designed by some chip companies for their own hardware, such as Intel’s MKL_FFT and Huawei’s KML_FFT.
FFTW (Fast Fourier Transform in the West) is an open-source library designed for fast Fourier transform (FFT). It is distributed under a free software license, similar to GPL, and you can download the latest version from its official website or code repository, following the licensing terms for usage. This library enables efficient numerical computations across various computing platforms.
By leveraging pre-decomposition algorithms and caching techniques, FFTW significantly accelerates FFT computations, saving several times the computation time compared to traditional FFT algorithms. FFTW is widely applicable, not only in scientific computing and signal processing but also in fields like image and audio processing. Its exceptional performance makes it a vital tool in numerous scientific and engineering applications.
Here are the URLs for FFTW:
Official Website : http://fftw.org/
GitHub Repository : https://github.com/FFTW/fftw3
QEMU
What is QEMU?
QEMU (Quick Emulator) is an open-source general-purpose machine emulator and virtualization tool that primarily supports the following modes:
-
User Mode Emulation: In this mode, QEMU allows programs compiled for one CPU architecture to run on a different CPU architecture. eg.qemu-riscv64
-
System Emulation: In this mode, QEMU provides a complete environment for running a guest system, including the CPU, memory, and peripheral devices. eg.qemu-system-riscv64
When emulating the RISC-V architecture, QEMU can simulate the virt virtual development board. While this board does not correspond to any physical hardware, it supports the emulation of various hardware devices, making it ideal for development and testing purposes.
QEMU for Linux
Since we are working with the RISC-V architecture, we need to begin compiling U-Boot and OpenSBI.
- Compile U-Boot
Next, the desired kernel will be generated as shown in the figure below.

- OpenSBI
Completing the above steps will successfully install OpenSBI.
Then navigate to the following path.
A fw_bin will then be generated, which is our firmware.

Go to the following URL to download the corresponding IMG file (QEMU).
https://ubuntu.com/download/risc-v
Boot the system
Once the kernel and firmware are ready, pass them with the corresponding parameters to QEMU to boot the system.
The path should be adjusted to the correct path accordingly.
Basic Commands and Parameter Descriptions
-qemu-system-riscv64
Launches the RISC-V 64-bit version of the QEMU emulator.
-machine virt
Uses the virtual machine type (virt machine type), which is a generic virtualization hardware platform provided by QEMU for RISC-V.
-nographic
Disables graphical output, redirecting display output to the terminal.
-m 2048
Allocates 2048 MB of virtual memory to the emulator.
-smp 4
Configures the emulator with 4 virtual CPU cores.
-bios
/home/annie/opensbi/build/platform/generic/firmware/fw_jump.bin
Specifies the OpenSBI firmware fw_jump.bin as the BIOS for booting U-Boot.
-kernel /home/annie/u-boot/u-boot.bin
Loads the U-Boot binary file as the kernel bootloader.
-device virtio-net-device,netdev=eth0
Configures a virtual network device using the VirtIO driver and connects it to the network device with ID eth0.
-netdev user,id=eth0
Simulates a network connection using "user mode networking" and assigns it the ID eth0.
-device virtio-rng-pci
Adds a PCI-based VirtIO random number generator device to provide entropy.
-drive file=/home/annie/ubuntu-24.10-preinstalled-server-riscv64.img,format=raw,if=virtio
Loads the specified Ubuntu system image file (ubuntu-24.10-preinstalled-server-riscv64.img) with the following options:
-format=raw: Reads the image file in raw format
.
if=virtio: Attaches the disk as a VirtIO device to the virtual machine.
Then, wait for the system to boot.
When you see the message "Completed socket interaction for boot stage final", press Enter
to log in.
The default username is: ubuntu
The default password is: ubuntu

Finally, check the system to confirm it is RISC-V:
How to create a file and run it on QEMU
- Writing a
.s
File
Create a simple hello.s
. Below is an example program :
The directive .global _start
is necessary in RISC-V assembly code.
- Program Entry Point:
When an executable file is loaded for execution, the operating system looks for the program's entry point. By default, the entry point is usually _start
.
If _start
is not defined as a global symbol, the linker may fail to locate it, resulting in the executable being unable to run properly.
- Compiling the
.s
File
Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :
- Running with QEMU
Run the compiled program using QEMU :
After execution, the terminal should display :
How to tanslate c into riscv
First, locate the positions of stdio.h
and libm.a
.
The following instruction will translate C
into RISC-V
.
This instruction will translate c
into elf
file.
How to expand the virtual machine's space
The following code can expand the virtual machine's space.
QEMU for Mac
Use $
to operate in terminal
environment: Macbook Air M3 Sonoma 14.6
$ riscv32-unknown-elf-gcc -march=rv32i_zba_zbb_zbc_zbs_v -mabi=ilp32 -S -o output.s input.c
Use Homebrew to install QEMU
Check whether RISC-V has been supported
If we have installed correctly, we get the version below:
( version 9.2.0 in 2025/1 )
Then we compile RV32I/B extension/RVV via RISC-V GCC tools
Installation
Verification
You may notice that the instruction in above is "riscv64".
It's normal.
Because most RISC-V toolchains target 64-bit architectures (riscv64) by default. However, these tools usually support 32-bit targets as well, by specifying the appropriate architecture flags at compile time.
If it shows that commend not found
, checking whether it is in the environment variable.
If it is't, we add it manually and make sure again.
Compile the C file
-march=rv32i
:specity architecture as RV32I
-mabi=ilp32
:specify ABI(Application Binary Interface)as 32-bit integer
Issues encountered in QEMU
- Unable to use ecall 1 in QEMU (unable to directly print numbers).
solution :
Convert a word to ASCII. This is used for all numbers greater than or equal to 0.
Code
Part1 : logint
Functionality
The function calculates , which determines the power of 2 closest to .
Logic
- Each iteration right-shifts by 1 bit, effectively simulating a "divide by 2" operation.
- For every right shift, the counter increments, representing how many times can be divided by 2.
- The loop stops when becomes 0. At this point, contains . Subtracting 1 gives the correct result.
code :
Convert to RISC-V assembly:
Compiling the .s
File in RV32I
Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :
Using this instruction will see the time the file taken.
for example :
B-extension
Compiling the .s
File in B extension
Use the riscv64-linux-gnu-as and riscv64-linux-gnu-ld to compile the .s file into an executable :
vector extension
Part2 : bitwise reverse
Functionality
The function reverse
is used to reverse the binary representation of a number based on the bit length determined by the maximum number .
- Input :
- : Used to determine the bit length (calculated using ).
- : The number whose binary representation needs to be reversed.
- Output:
- Returns the reversed binary representation of .
Logic
-
Determine Bit Length :
- Use to determine the number of bits required for .
- For example, if , then , so the bit length is 4.
-
Check and Reverse Bits :
- Use a loop to check each bit of starting from the position.
- If the bit at this position is 1, set the corresponding position in the reversed binary number to 1.
- This is done using bitwise operations
&
and |
.
-
Return Result:
- Finally, return the reversed number .
code :
Convert to RISC-V assembly:
reverse:
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
call logint
addi s0, zero, 1
add s1, zero, zero
forloop_reverse:
bgt s0, a0, forloop_reverse_end
sub s2, a0, s0
addi s3, zero, 1
sll s3, s3, s2
s3, a1, s3
beq s3, zero, elses3
ifs3:
addi s4, s0, -1
addi s5, zero, 1
sll s5, s5, s4
or s1, s1, s5
elses3:
addi s0, s0, 1
j forloop_reverse
forloop_reverse_end:
add a0, s1, zero # Return p
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 28
jr ra
B-extension
reverse:
lui t0, 0x55555
ori t0, t0, 0x555
srli t1, a0, 1
and t1, t1, t0
and t2, a0, t0
slli t2, t2, 1
or a0, t1, t2
lui t0, 0x33333
ori t0, t0, 0x333
srli t1, a0, 2
and t1, t1, t0
and t2, a0, t0
slli t2, t2, 2
or a0, t1, t2
lui t0, 0x0F0F0
ori t0, t0, 0x0F0
srli t1, a0, 4
and t1, t1, t0
and t2, a0, t0
slli t2, t2, 4
or a0, t1, t2
lui t0, 0x00FF0
ori t0, t0, 0x0FF
srli t1, a0, 8
and t1, t1, t0
and t2, a0, t0
slli t2, t2, 8
or a0, t1, t2
srli t1, a0, 16
slli t2, a0, 16
ora0, t1, t2
addi t1, zero, 32
sub t1, t1, a1
srl a0, a0, t1
jr ra
vector extension
-
la t0, helperVector:
This instruction loads the base address of helperVector into the register t0. helperVector is an array or buffer used to store the generated vector.
-
lw t1, size:
This instruction loads an integer value (the size of the vector N) from memory into the register t1. This value represents the number of elements in the vector to be generated.
-
vsetvli t2, t1, e32, m1:
This instruction sets the vector length and type. It sets the vector length to t1 (i.e., N), with each element being 32 bits (e32) and a multiplier factor of 1 (m1). This means that the subsequent vector operations will target N 32-bit elements.
-
vid.v v0:
This instruction generates a vector v0 with elements consisting of consecutive index values (0, 1, 2, …, N-1). It is an initialization vector typically used for testing or as indices for data.
-
vse32.v v0, (t0):
This instruction stores the data in vector v0 to the memory address pointed to by t0 (i.e., helperVector). This means the generated index vector will be written into helperVector.
Application to FFT
This code implements the Fast Fourier Transform (FFT) algorithm, which can be used to analyze the frequency components of a signal.
Given an input signal of [1, 1, 1, 1, 0, 0, 0, 0], the FFT result will display the spectrum of the signal. The real and imaginary parts of the result correspond to the magnitude and phase in the frequency domain.
Using instruction translate to assembly.
When we encountered issues installing riscv32-linux-gnu-gcc, we decided to switch to using rv64i instead.
.file "fft.c"
.option pic
.attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
.attribute unaligned_access, 0
.attribute stack_align, 16
.text
.align 2
.globl logint
.type logint, @function
logint:
.LFB23:
.cfi_startproc
mv a5,a0
beq a0,zero,.L4
li a4,0
.L3:
sraiw a5,a5,1
mv a0,a4
addiw a4,a4,1
bne a5,zero,.L3
ret
.L4:
li a0,-1
ret
.cfi_endproc
.LFE23:
.size logint, .-logint
.align 2
.globl reverse
.type reverse, @function
reverse:
.LFB24:
.cfi_startproc
bne a0,zero,.L24
ret
.L24:
li a2,1
mv a7,a2
li a6,0
.L9:
mv a5,a0
li a4,0
.L13:
sraiw a5,a5,1
addiw a4,a4,1
bne a5,zero,.L13
ble a4,a2,.L25
mv a4,a0
.L11:
sraiw a4,a4,1
mv a3,a5
addiw a5,a5,1
bne a4,zero,.L11
subw a3,a3,a2
sraw a3,a1,a3
andi a3,a3,1
beq a3,zero,.L12
addiw a5,a2,-1
sllw a5,a7,a5
or a6,a6,a5
.L12:
addiw a2,a2,1
j .L9
.L25:
mv a0,a6
ret
.cfi_endproc
.LFE24:
.size reverse, .-reverse
.align 2
.globl fft
.type fft, @function
fft:
.LFB25:
.cfi_startproc
addi sp,sp,-192
.cfi_def_cfa_offset 192
sd s3,152(sp)
sd s8,112(sp)
sd ra,184(sp)
.cfi_offset 19, -40
.cfi_offset 24, -80
.cfi_offset 1, -8
mv s8,a1
mv s3,a0
ble a1,zero,.L27
mv a7,a0
li a6,1
li a0,0
vsetivli zero,2,e64,m1,ta,ma
.L28:
li a1,0
li a2,1
.L33:
mv a5,s8
li a4,0
.L31:
sraiw a5,a5,1
addiw a4,a4,1
bne a5,zero,.L31
bge a2,a4,.L63
mv a4,s8
.L29:
sraiw a4,a4,1
mv a3,a5
addiw a5,a5,1
bne a4,zero,.L29
subw a3,a3,a2
sraw a3,a0,a3
andi a3,a3,1
beq a3,zero,.L30
addiw a5,a2,-1
sllw a5,a6,a5
or a1,a1,a5
.L30:
addiw a2,a2,1
j .L33
.L63:
ble a1,a0,.L32
slli a5,a1,4
add a5,s3,a5
vle64.v v2,0(a5)
vle64.v v1,0(a7)
vse64.v v2,0(a7)
vse64.v v1,0(a5)
.L32:
addiw a0,a0,1
addi a7,a7,16
bne s8,a0,.L28
.L27:
bne s8,zero,.L64
.L55:
ld ra,184(sp)
.cfi_remember_state
.cfi_restore 1
ld s3,152(sp)
.cfi_restore 19
ld s8,112(sp)
.cfi_restore 24
addi sp,sp,192
.cfi_def_cfa_offset 0
jr ra
.L64:
.cfi_restore_state
fsd fs0,72(sp)
fsd fs1,64(sp)
fsd fs4,40(sp)
.cfi_offset 40, -120
.cfi_offset 41, -128
.cfi_offset 52, -152
fld fs1,.LC2,a5
fld fs0,.LC3,a5
fld fs4,.LC1,a5
sd s5,136(sp)
.cfi_offset 21, -56
li s5,1
sd s4,144(sp)
sd s0,176(sp)
sd s1,168(sp)
sd s2,160(sp)
sd s6,128(sp)
sd s7,120(sp)
sd s9,104(sp)
sd s10,96(sp)
sd s11,88(sp)
fsd fs2,56(sp)
fsd fs3,48(sp)
.cfi_offset 20, -48
.cfi_offset 8, -16
.cfi_offset 9, -24
.cfi_offset 18, -32
.cfi_offset 22, -64
.cfi_offset 23, -72
.cfi_offset 25, -88
.cfi_offset 26, -96
.cfi_offset 27, -104
.cfi_offset 50, -136
.cfi_offset 51, -144
mv s4,s5
.L38:
mv a5,s8
li a4,0
.L43:
sraiw a5,a5,1
addiw a4,a4,1
bne a5,zero,.L43
ble a4,s5,.L65
sllw s10,s4,s5
ble s8,zero,.L36
li a5,1
ble s10,a5,.L36
fcvt.d.w fa1,s10
srliw s6,s10,31
addw s6,s6,s10
fdiv.d fa0,fs1,fa1
sraiw s6,s6,1
slli s11,s10,4
mv s7,s3
li s9,0
fdiv.d fa1,fs0,fa1
call cexp@plt
fmv.d fs3,fa0
fmv.d fs2,fa1
slli a3,s6,4
.L42:
fmv.d.x fa3,zero
fmv.d fa2,fs4
add s1,a3,s7
mv s0,s7
li s2,0
.L41:
fld fa1,8(s1)
fld fa0,0(s1)
fmul.d ft0,fa1,fa2
fmul.d ft1,fa1,fa3
fmadd.d ft0,fa0,fa3,ft0
fmsub.d ft1,fa0,fa2,ft1
feq.d a4,ft0,ft0
feq.d a5,ft1,ft1
and a5,a5,a4
beq a5,zero,.L66
.L39:
fmul.d fa5,fs3,fa3
fmul.d fa4,fs2,fa3
fld ft3,0(s0)
fld ft2,8(s0)
fadd.d fa0,ft1,ft3
fadd.d fa1,ft0,ft2
fmadd.d fa5,fs2,fa2,fa5
fmsub.d fa4,fs3,fa2,fa4
fsub.d ft3,ft3,ft1
fsub.d ft2,ft2,ft0
fsd fa0,0(s0)
fsd fa1,8(s0)
feq.d a4,fa5,fa5
feq.d a5,fa4,fa4
fsd ft3,0(s1)
fsd ft2,8(s1)
and a5,a5,a4
beq a5,zero,.L67
.L40:
addiw s2,s2,1
fmv.d fa2,fa4
fmv.d fa3,fa5
addi s1,s1,16
addi s0,s0,16
bgt s6,s2,.L41
addw s9,s10,s9
add s7,s7,s11
bgt s8,s9,.L42
.L36:
addiw s5,s5,1
j .L38
.L67:
fmv.d fa1,fa3
fmv.d fa0,fa2
fmv.d fa3,fs2
fmv.d fa2,fs3
sd a3,8(sp)
call __muldc3@plt
fmv.d fa4,fa0
fmv.d fa5,fa1
ld a3,8(sp)
j .L40
.L66:
sd a3,24(sp)
fsd fa3,16(sp)
fsd fa2,8(sp)
call __muldc3@plt
fmv.d ft1,fa0
fmv.d ft0,fa1
ld a3,24(sp)
fld fa3,16(sp)
fld fa2,8(sp)
j .L39
.L65:
ld s0,176(sp)
.cfi_restore 8
ld s1,168(sp)
.cfi_restore 9
ld s2,160(sp)
.cfi_restore 18
ld s4,144(sp)
.cfi_restore 20
ld s5,136(sp)
.cfi_restore 21
ld s6,128(sp)
.cfi_restore 22
ld s7,120(sp)
.cfi_restore 23
ld s9,104(sp)
.cfi_restore 25
ld s10,96(sp)
.cfi_restore 26
ld s11,88(sp)
.cfi_restore 27
fld fs0,72(sp)
.cfi_restore 40
fld fs1,64(sp)
.cfi_restore 41
fld fs2,56(sp)
.cfi_restore 50
fld fs3,48(sp)
.cfi_restore 51
fld fs4,40(sp)
.cfi_restore 52
j .L55
.cfi_endproc
.LFE25:
.size fft, .-fft
.section .rodata.str1.8,"aMS",@progbits,1
.align 3
.LC4:
.string "Input:"
.align 3
.LC5:
.string "%lf + %lfi\n"
.align 3
.LC6:
.string "\nOutput:"
.section .text.startup,"ax",@progbits
.align 2
.globl main
.type main, @function
main:
.LFB26:
.cfi_startproc
vsetivli zero,16,e64,m8,ta,ma
lla a5,.LANCHOR0
vle64.v v8,0(a5)
addi sp,sp,-192
.cfi_def_cfa_offset 192
sd s4,144(sp)
.cfi_offset 20, -48
la s4,__stack_chk_guard
sd s0,176(sp)
sd s1,168(sp)
sd s2,160(sp)
sd s3,152(sp)
sd ra,184(sp)
.cfi_offset 8, -16
.cfi_offset 9, -24
.cfi_offset 18, -32
.cfi_offset 19, -40
.cfi_offset 1, -8
addi s3,sp,8
ld a5, 0(s4)
sd a5, 136(sp)
li a5, 0
vse64.v v8,0(s3)
lla a0,.LC4
mv s0,s3
addi s2,sp,136
mv s1,s3
call puts@plt
.L69:
ld a3,8(s1)
ld a2,0(s1)
lla a1,.LC5
li a0,2
addi s1,s1,16
call __printf_chk@plt
bne s1,s2,.L69
mv a0,s3
li a1,8
call fft
lla a0,.LC6
call puts@plt
.L70:
ld a3,8(s0)
ld a2,0(s0)
lla a1,.LC5
li a0,2
addi s0,s0,16
call __printf_chk@plt
bne s0,s2,.L70
ld a4, 136(sp)
ld a5, 0(s4)
xor a5, a4, a5
li a4, 0
bne a5,zero,.L78
ld ra,184(sp)
.cfi_remember_state
.cfi_restore 1
ld s0,176(sp)
.cfi_restore 8
ld s1,168(sp)
.cfi_restore 9
ld s2,160(sp)
.cfi_restore 18
ld s3,152(sp)
.cfi_restore 19
ld s4,144(sp)
.cfi_restore 20
li a0,0
addi sp,sp,192
.cfi_def_cfa_offset 0
jr ra
.L78:
.cfi_restore_state
call __stack_chk_fail@plt
.cfi_endproc
.LFE26:
.size main, .-main
.section .rodata.cst8,"aM",@progbits,8
.align 3
.LC1:
.word 0
.word 1072693248
.align 3
.LC2:
.word 0
.word -2147483648
.align 3
.LC3:
.word 1413754136
.word -1072094725
.section .rodata
.align 3
.set .LANCHOR0,. + 0
.LC0:
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.ident "GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0"
.section .note.GNU-stack,"",@progbits
Then we improve the code.
.file "fft.c"
.option pic
.attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_v1p0_zicsr2p0_zba1p0_zbb1p0_zbc1p0_zbs1p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
.attribute unaligned_access, 0
.attribute stack_align, 16
.text
.align 2
.globl logint
.type logint, @function
logint:
.LFB23:
.cfi_startproc
clz a5,a0
li a4,31
sub a0,a4,a5
ret
.cfi_endproc
.LFE23:
.size logint, .-logint
.align 2
.globl reverse
.type reverse, @function
reverse:
.LFB24:
.cfi_startproc
bne a0, zero, .L24
ret
.L24:
li a2, 1
li a6, 0
.L9:
mv a5, a0
li a4, 0
.L13:
srai a5, a5, 1
addi a4, a4, 1
bne a5, zero, .L13
ble a4, a2, .L25
srl a3, a1, a2
a3, a3, 1
beqz a3, .L12
sll a5, a6, a2
or a6, a6, a5
.L12:
addi a2, a2, 1
j .L9
.L25:
mv a0, a6
ret
.cfi_endproc
.LFE24:
.size reverse, .-reverse
.align 2
.globl fft
.type fft, @function
fft:
.LFB25:
.cfi_startproc
addi sp, sp, -192
.cfi_def_cfa_offset 192
sd s3, 152(sp)
sd s8, 112(sp)
sd ra, 184(sp)
.cfi_offset 19, -40
.cfi_offset 24, -80
.cfi_offset 1, -8
mv s8, a1
mv s3, a0
blez a1, .L27
li a0, 0
li a6, 1
mv a7, a0
vsetivli zero, 2, e64, m1, ta, ma
.L28:
li a1, 0
li a2, 1
.L33:
mv a5, s8
li a4, 0
.L31:
srai a5, a5, 1
addi a4, a4, 1
bnez a5, .L31
bge a2, a4, .L63
srl a3, a0, a2
a3, a3, 1
beqz a3, .L30
sll a5, a6, a2
or a1, a1, a5
.L30:
addi a2, a2, 1
j .L33
.L63:
ble a1, a0, .L32
slli a5, a1, 4
add a5, s3, a5
vle64.v v2, (a5)
vle64.v v1, (a7)
vse64.v v2, (a7)
vse64.v v1, (a5)
.L32:
addi a0, a0, 1
addi a7, a7, 16
bne s8, a0, .L28
.L27:
ld ra, 184(sp)
ld s3, 152(sp)
ld s8, 112(sp)
addi sp, sp, 192
.cfi_def_cfa_offset 0
jr ra
.cfi_endproc
.LFE25:
.size fft, .-fft
.section .rodata.str1.8,"aMS",@progbits,1
.align 3
.LC4:
.string "Input:"
.align 3
.LC5:
.string "%lf + %lfi\n"
.align 3
.LC6:
.string "\nOutput:"
.section .text.startup,"ax",@progbits
.align 2
.globl main
.type main, @function
main:
.LFB26:
.cfi_startproc
addi sp, sp, -192
.cfi_def_cfa_offset 192
sd ra, 184(sp)
sd s0, 176(sp)
sd s1, 168(sp)
sd s2, 160(sp)
sd s3, 152(sp)
sd s4, 144(sp)
.cfi_offset 1, -8
.cfi_offset 8, -16
.cfi_offset 9, -24
.cfi_offset 18, -32
.cfi_offset 19, -40
.cfi_offset 20, -48
vsetivli zero, 16, e64, m8, ta, ma
lla a5, .LANCHOR0
vle64.v v8, 0(a5)
addi s3, sp, 8
vse64.v v8, 0(s3)
la s4, __stack_chk_guard
ld a5, 0(s4)
sd a5, 136(sp)
lla a0, .LC4
call puts@plt
mv s0, s3
mv s1, s3
addi s2, sp, 136
.L69:
ld a3, 8(s1)
ld a2, 0(s1)
lla a1, .LC5
li a0, 2
addi s1, s1, 16
call __printf_chk@plt
bne s1, s2, .L69
mv a0, s3
li a1, 8
call fft
lla a0, .LC6
call puts@plt
.L70:
ld a3, 8(s0)
ld a2, 0(s0)
lla a1, .LC5
li a0, 2
addi s0, s0, 16
call __printf_chk@plt
bne s0, s2, .L70
ld a4, 136(sp)
ld a5, 0(s4)
xor a5, a4, a5
beqz a5, .L71
call __stack_chk_fail@plt
j .L72
.L71:
ld ra, 184(sp)
ld s0, 176(sp)
ld s1, 168(sp)
ld s2, 160(sp)
ld s3, 152(sp)
ld s4, 144(sp)
addi sp, sp, 192
.cfi_def_cfa_offset 0
jr ra
.L72:
.cfi_endproc
.LFE26:
.size main, .-main
.section .rodata.cst8,"aM",@progbits,8
.align 3
.LC1:
.word 0
.word 1072693248
.align 3
.LC2:
.word 0
.word -2147483648
.align 3
.LC3:
.word 1413754136
.word -1072094725
.section .rodata
.align 3
.set .LANCHOR0,. + 0
.LC0:
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 1072693248
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.word 0
.ident "GCC: (Ubuntu 14.2.0-4ubuntu2) 14.2.0"
.section .note.GNU-stack,"",@progbits
Difference Between lla
and la
lla (Load Address Absolute) : Only handles absolute addresses and does not account for a base address (e.g., PC-relative or GOT).
la (Load Address : Adjusts the address based on the target's access model, such as PC-relative or using the Global Offset Table (GOT).
We improved the code, reducing nearly 200 lines and shortening the execution time.
Result
The following is the result of the output.
In this task, we accelerated its execution time by using b-extension and RVV.
Reference