# Digital signal processing and Optimization
> 林濤
## Problems
* Spike doesn't work. Professor suggested me rv32emu instead.
* When I try to using f_add and f_mul in assembly, there are many issues unsolved. So I tried using C instead but focus on swift and reverse function. (The code f_mul.s and f_add.s arestill provided below)
## Project Introduction
### Goals
Implementation of "Decimation-in-time" FFT algorithm by using C and RISC-V. Test performance using spike.
### What is FFT
Since the complexity of "[Discrete Fourier Transform](https://en.wikipedia.org/wiki/Discrete_Fourier_transform)" is $O(n^2)$, where $n$ is the number of data. The formular is shown below:
\begin{split}
X[k]=\sum^{N-1}_{n=0}x[n]e^{-j2\pi kn/N}
\end{split}
In order to have less complexity, [FFT(Fast Fourier Transform)](https://zh.wikipedia.org/zh-tw/%E5%BF%AB%E9%80%9F%E5%82%85%E9%87%8C%E5%8F%B6%E5%8F%98%E6%8D%A2) technic was introduced.
### What is Decimation-in-time FFT
Consider the definition of DFT mentioned above
\begin{split}
X[k]=\sum^{N-1}_{n=0}x[n]e^{-j2\pi kn/N}=X[k]=\sum^{N-1}_{n=0}x[n]W^{nk}_N\\
\end{split}
the series $x[n]$ can be written as the even part and odd part, that is, $\sum^{N-1}_{n=0}x[n]=\sum^{N/2-1}_{n=0}(x[2n]+x[2n+1])$. Hence, its corresponding DFT can be written as
\begin{split}
X[k]=&\sum^{N/2-1}_{n=0}x[2n]W^{2nk}_N+\sum^{N/2-1}_{n=0}x[2n+1]W^{(2n+1)k}_N\\=&\sum^{N/2-1}_{n=0}x[2n]W^{nk}_{N/2}+W^{k}_N\sum^{N/2-1}_{n=0}x[2n+1]W^{nk}_{N/2}\\=&G[k]+W^k_NH[k]
\end{split}
For example, consider $N=8$ points DFT, we can split it into two $N=4$ DFT than add its corresponding twiddle factor, $W^k_N$.
### Flow graph
Continue after above case. Take $k=0$, we only need to compute $G[0]+W^0_8H[0]$. It means that we only need compute half of DFT(4-points) rather than 8-points, then add up with its corresponding twiddle factor as previous illustration. Here is its flow chart.

Further step, we can decomposite $N/2$-point DFT into four $N/4$-point DFT, as below shown.

In 8 points case, finally, we can obtain 2 points DFT. Now we can also represent 2 points DFT in the form of twiddle factor.
Worthy to notice is that in the each stage(3 in 8 points), we can represent only one pair of twiddle factors. That is, take stage 2,
 $W^{k+N/2}_N=W^{N/2}_NW^k_N=-W^k_N$, for example, $W^6_N=-W^2_N$.
Thus, the above diagram can be simplified as

With this idea, we can decomposite $N$ points DFT(must be the power of 2) into $log_2(N)$ stage. And the complexity will reduce to $O(Nlog_2N)$
### What will I do in this project?
* Implemention FFT in C code, compiling it using spike.
* In above C code, replace funcions by RISC-V, compiling it using spike, and testing its performance.
## Code
### FFT algorithm in C
In this version, all function are implemented using C. The logint() and reverse() functions are refered to [Problem D, quize 5](https://hackmd.io/@sysprog/arch2024-quiz5). The major function, DIT_FFT, is shown below.
``` c
void DIT_FFT(int size, float *real, float *image) {
int i, j, k, stage, step;
int halfStep;
float tempReal, tempImage;
// data swift to new index
for (i = 0; i < size; i++) {
int revIndex = reverse(size, i);
if (i < revIndex) {
// Swap real parts
tempReal = real[i];
real[i] = real[revIndex];
real[revIndex] = tempReal;
// Swap imaginary parts
tempImage = image[i];
image[i] = image[revIndex];
image[revIndex] = tempImage;
}
}
// Perform FFT computation
for (stage = 1; stage <= logint(size); stage++) {
step = 1 << stage; // 2^stage
halfStep = step >> 1; // Step size / 2
for (k = 0; k < size; k += step) {
for (j = 0; j < halfStep; j++) {
// Compute twiddle factor
float angle = -2.0f * M_PI * (float)j / (float)step; // Angle for twiddle factor
float cosValue = cos(angle); // Replace with precomputed cosine lookup table for optimization
float sinValue = sin(angle); // Replace with precomputed sine lookup table for optimization
// Butterfly computation
int index1 = k + j;
int index2 = index1 + halfStep;
tempReal = cosValue * real[index2] - sinValue * image[index2];
tempImage = sinValue * real[index2] + cosValue * image[index2];
real[index2] = real[index1] - tempReal;
image[index2] = image[index1] - tempImage;
real[index1] += tempReal;
image[index1] += tempImage;
}
}
}
}
```
### Data swift in assembly code
Function of this assembly code, we can reordered index in the algorithm of FFT now.
#### Reverse & Logint
This code with some revices refer to quize5 problem D, C.A.2024.
```c
.global logint
logint:
addi sp, sp, -4
sw t0, 0(sp)
add t0, a0, zero # k = N
add a0, zero, zero # i = 0
logloop:
beq t0, zero, logloop_end # Exit if k == 0
srai t0, t0, 1 # k >>= 1
addi a0, a0, 1 # i++
j logloop
logloop_end:
addi a0, a0, -1 # Return i - 1
lw t0, 0(sp)
addi sp, sp, 4
jr ra
reverse:
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
jal logint # jump to logint and save position to ra
addi s0, zero, 1 # j = 1
add s1, zero, zero # p = 0
forloop_reverse:
bgt s0, a0, forloop_reverse_end
sub s2, a0, s0# s2 = a0 - s0
addi s3, zero, 1
sll s3, s3, s2
and s3, a1, s3
beq s3, zero, elses3 # If not, skip
ifs3:
addi s4, s0, -1 # s4 = j - 1
addi s5, zero, 1
sll s5, s5, s4
or s1,s1,s5
elses3:
addi s0, s0, 1 # j++
j forloop_reverse
forloop_reverse_end:
add a0, s1, zero # Return p
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 28
jr ra
```
#### Swift
```c
// a0=points
// a1=current index
// a2=address of real sequence
// a3=address of image sequence
swift:
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
addi s0, a1, 0 #s0 = a1, current index save to s0
# a0 = N(size), a1 = n(current index)
jal reverse # now a0 = reverse (a0)
slli s0, s0, 2 # s0 = s0 * 4 address shift of current index
slli s1, a0, 2 # s1 = a0 * 4 address shift of reversed index
# Deal with real part exchange
add s2, a2, s0 # s2 = address of current index
add s3, a2, s1 # s3 = address of exchange index
lw s4, 0(s2) # s4 = the value of current index
lw s5, 0(s3) # s5 = the value of reverse index
sw s5, 0(s2) # store reversed value to current index
sw s4, 0(s3) # store current value to reversed index
# Deal with image part exchange
add s2, a3, s0 # s2 = address of current index
add s3, a3, s1 # s3 = address of exchange index
lw s4, 0(s2) # s4 = the value of current index
lw s5, 0(s3) # s5 = the value of reverse index
sw s5, 0(s2) # store reversed value to current index
sw s4, 0(s3) # store current value to reversed index
#recover registers
lw ra,0(sp)
lw s0,4(sp)
lw s1,8(sp)
lw s2,12(sp)
lw s3,16(sp)
lw s4,20(sp)
lw s5,24(sp)
addi sp,sp,28
jr ra
```
### float point operation
In this assembly, I try to implement the addition operation of two float point numbers of IEEE 754 standard. Below are the steps how I realized this assembly.
* First, I to compare value of binary numbers of two value without the consideration of sign bit.
* Based on this, I take the larger one as the fixed number, trying bit shift operation on mantissa of smaller one to make sure having same exponent value both.
* With implicit number added in mantissa, summing up both mantissa, and using count leading non-zero position, I use bitwise operations on both exponent and mantissa that I can have exact value of exp and position of mantissa
* Finally with the sign of larger number, nomatter what sign of another float point, it dominate the final sign, put all result in a0 and return.
However, due to deadline, I still can't overcome problems about overflow, even some specil cases.
```assembly
.global f_add
f_add:
addi sp,sp,-24
sw ra, 0(sp)
sw s0,4(sp)
sw s1,8(sp)
sw s2,12(sp)
sw s3,16(sp)
sw s4,20(sp)
slli t0, a0, 1
slli t1, a1, 1 # ignore sign bit, compare which is larger
srli t0, t0, 1
srli t1, t1, 1
#####(now a0 >= a1)
lui t2 ,0x80000
and t2, a0, t2
and s0, a1, t2 #check sign difference, store in s0
bgt t1, t0, reorder
keep_add:
lui t0, 0x7F800 # mask for exp
lui t1, 0x80000 # mask for sign
and s1, a0, t0
and s2, a1, t0 # exp in s1, s2
lui t1, 0x007FF
li t0, 0xFFF
add t0, t0, t1 # mask for mantissa
srli s1,s1,23
srli s2,s2,23
and s3, t0, a0
and s4, t0, a1 # mantissa in s3, s4
lui t1, 0x00800 # implicit bit
sub s2, s1, s2 # exp s1-s2
add s3, s3, t1
add s4, s4, t1 # with implicit
bnez s0, sign_diff
# sign same
# mantissa +
# consider bit =? 0 at position24, if so, shift mantissa and exp -1
srl s4, s4, s2 # align exp
add s2, s3, s4 # result mantissa = s3
j clnz
sign_diff:
#mantissa a0- mantissa a1
# use sign of a0 as major
# case if all zero
srl s4, s4, s2 # align exp
sub s2, s3, s4 # result mantissa = s3
j clnz
reorder:
addi t0, a0, 0
addi a0, a1, 0
addi a1, t0, 0
j keep_add
mantissa_0:
li a0,0
j over
# used for exp shift
# s1 exp, s2, mantissa result with implicit
clnz:
li t0,32 #count non leading zero poosition
li t1,128
beqz s2, mantissa_0
bgt s2,t1, over
clnz_loop:
lui t1, 0x80000 # fitst bit mask
and t1, s2, t1 # check first bit
bnez t1, end_clnz
slli s2, s2,1
addi t0, t0, -1 # position -1
j clnz_loop
end_clnz:
addi t0, t0,-24 # bit position should at 24
lui t1, 0x00800 # implicit bit
bltz t0, clnz_exp_minus
add s1, s1, t0
srli s2, s2, 8
sub s2, s2, t1
slli s1, s1, 23
j result # return with mantissa(no implict)
clnz_exp_minus:
neg t0, t0
sub s1, s1, t0
srli s2, s2, 8 # shift implicit bit to 24th bit position
sub s2, s2, t1
slli s1, s1, 23
result:
# combine , s1 = exp, s2 = result mantissa
add s1, s1, s2
lui t0, 0x80000
and a0, t0, a0 # sign bit
or a0, a0, s1
over: #special case
lw ra,0(sp)
lw s0,4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4,20 (sp)
addi sp,sp,24
jr ra
```
#### float point multiplication
Below are my float point multiplication code.
```c
.global f_mul
f_mul:
addi sp,sp,-28
sw ra,0(sp)
sw s0,4(sp)
sw s1,8(sp)
sw s2,12(sp)
sw s3,16(sp)
sw s4,20(sp)
sw s5,24(sp)
srli s0, a0, 31 # s0 = a0 >> 31, sign bit of s0
srli s1, a1, 31 # s1 = a1 >> 31, sign bit of s1
li t0, 1
lui s2, 0x007FF
li s3, 0xFFF
add s3, s3, s2 # s3, mantissa mask
and s2, a0, s3 # s2 = a0 & 0x7FFFFF, mantissa of a0
and s3, a1, s3 # s3 = a1 & 0x7FFFFF, mantissa of a1
slli t0, t0, 23 # t0 = implicit 1, at bit position 24
li s4, 1
xor s4, s0, s1 # s4 = s0 ^ s1, sign of result
lui s5, 0x7F800 # s5 = 0x7F800000, exponent mask
add s2, s2, t0 # s2 = s2 + t0, mantissa of a0 with implicit 1
add s3, s3, t0 # s3 = s3 + t0, mantissa of a1 with implicit 1
jal mul_mantissa # s2, s3 input
and s1, a0, s5 # s1 = a0 & 0x7F800000, exponent of a0
and s2, a1, s5 # s2 = a1 & 0x7F800000, exponent of a1
slli s4,s4,31 # s4 = s4 << 31, sign of result
add s3, s2, s1 # s3 = s2 + s1, exponent of result
srli s3, s3, 23 # s3 = s3 >> 23, s3 = exponent of result
bgtz s3, normal_exp
addi s3, s3, -126 # s3 = s3 - 127, one bias, s3 = exponent of result
slli s3, s3, 23
or s4, s4, s0 # s4 = s4 + s0, sign + mantissa
or s4, s4, s3 # s4 = s4 + s3, sign + mantissa + exponent
j end_f_mul
normal_exp:
addi s3, s3, -127 # s3 = s3 - 127, one bias, s3 = exponent of result
slli s3, s3, 23 # s3 = s3 << 23, s3 = exponent of result
or s4, s4, s0 # s4 = s4 + s0, sign + mantissa
or s4, s4, s3 # s4 = s4 + s3, sign + mantissa + exponent
end_f_mul:
add a0, s4, zero # a0 = s4 + 0, return value
lw ra, 0(sp) # restore ra
lw s0, 4(sp) # restore s0
lw s1, 8(sp) # restore s1
lw s2, 12(sp) # restore s2
lw s3, 16(sp) # restore s3
lw s4, 20(sp) # restore s4
lw s5, 24(sp) # restore s5
addi sp, sp, 28 # restore sp
jr ra
mul_mantissa:
# mantissa s2, s3
srli s2, s2, 8
srli s3, s3, 8
lui t1, 0x00008 # bit position of implicit 1, start at 16
li t2, 0 # t2 = 0, temp value
add_mantissa:
lui t0, 0x80000 # bit position 32 check
and t3, s3, t1 # t3 = s3 & 0x000008, addition check bit
beqz t3, skip_mul_add # if this bit = 0, skip add
add t2,t2,s2 # t2 = t2 + s2
slli t2, t2, 1 # right shift temp value
slli s3,s3,1 # left shift new 16th bit of s3
and t0, t0, t2 # take out 32th bit of t2
bnez t0, end_mul_mantissa # if 32th bit is 1
j add_mantissa
skip_mul_add:
slli t2, t2, 1 # right shift temp value
slli s3,s3,1 # left shift new 16th bit of s3
and t0, t0, t2 # take out 32th bit of t2
bnez t0, end_mul_mantissa # if 32th bit is 1
j add_mantissa
end_mul_mantissa:
lui t1, 0x80000
sub t2, t2, t1
srli s0,t2,8 # right shift to right position of mantissa
jr ra
```
### Final Result and Summarize
#### When combining assembly code

#### When only using C code

## Reference
* [FFT](https://en.wikipedia.org/wiki/Fast_Fourier_transform)
* [Coooley-Turkey FFT algorithm](https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm)
* [Quiz5 2024](https://hackmd.io/@sysprog/arch2024-quiz5-sol)
* https://github.com/vicLin8712/C.A.project