# Digital signal processing and Optimization > 林濤 ## Problems * Spike doesn't work. Professor suggested me rv32emu instead. * When I try to using f_add and f_mul in assembly, there are many issues unsolved. So I tried using C instead but focus on swift and reverse function. (The code f_mul.s and f_add.s arestill provided below) ## Project Introduction ### Goals Implementation of "Decimation-in-time" FFT algorithm by using C and RISC-V. Test performance using spike. ### What is FFT Since the complexity of "[Discrete Fourier Transform](https://en.wikipedia.org/wiki/Discrete_Fourier_transform)" is $O(n^2)$, where $n$ is the number of data. The formular is shown below: \begin{split} X[k]=\sum^{N-1}_{n=0}x[n]e^{-j2\pi kn/N} \end{split} In order to have less complexity, [FFT(Fast Fourier Transform)](https://zh.wikipedia.org/zh-tw/%E5%BF%AB%E9%80%9F%E5%82%85%E9%87%8C%E5%8F%B6%E5%8F%98%E6%8D%A2) technic was introduced. ### What is Decimation-in-time FFT Consider the definition of DFT mentioned above \begin{split} X[k]=\sum^{N-1}_{n=0}x[n]e^{-j2\pi kn/N}=X[k]=\sum^{N-1}_{n=0}x[n]W^{nk}_N\\ \end{split} the series $x[n]$ can be written as the even part and odd part, that is, $\sum^{N-1}_{n=0}x[n]=\sum^{N/2-1}_{n=0}(x[2n]+x[2n+1])$. Hence, its corresponding DFT can be written as \begin{split} X[k]=&\sum^{N/2-1}_{n=0}x[2n]W^{2nk}_N+\sum^{N/2-1}_{n=0}x[2n+1]W^{(2n+1)k}_N\\=&\sum^{N/2-1}_{n=0}x[2n]W^{nk}_{N/2}+W^{k}_N\sum^{N/2-1}_{n=0}x[2n+1]W^{nk}_{N/2}\\=&G[k]+W^k_NH[k] \end{split} For example, consider $N=8$ points DFT, we can split it into two $N=4$ DFT than add its corresponding twiddle factor, $W^k_N$. ### Flow graph Continue after above case. Take $k=0$, we only need to compute $G[0]+W^0_8H[0]$. It means that we only need compute half of DFT(4-points) rather than 8-points, then add up with its corresponding twiddle factor as previous illustration. Here is its flow chart. ![image](https://hackmd.io/_uploads/r1dULfFw1e.png) Further step, we can decomposite $N/2$-point DFT into four $N/4$-point DFT, as below shown. ![image](https://hackmd.io/_uploads/BJVcSmKDkl.png) In 8 points case, finally, we can obtain 2 points DFT. Now we can also represent 2 points DFT in the form of twiddle factor.![image](https://hackmd.io/_uploads/BJnML7twyg.png) Worthy to notice is that in the each stage(3 in 8 points), we can represent only one pair of twiddle factors. That is, take stage 2, ![image](https://hackmd.io/_uploads/Skh6I7Kwye.png) $W^{k+N/2}_N=W^{N/2}_NW^k_N=-W^k_N$, for example, $W^6_N=-W^2_N$. Thus, the above diagram can be simplified as ![image](https://hackmd.io/_uploads/HJ_sD7Kvyl.png) With this idea, we can decomposite $N$ points DFT(must be the power of 2) into $log_2(N)$ stage. And the complexity will reduce to $O(Nlog_2N)$ ### What will I do in this project? * Implemention FFT in C code, compiling it using spike. * In above C code, replace funcions by RISC-V, compiling it using spike, and testing its performance. ## Code ### FFT algorithm in C In this version, all function are implemented using C. The logint() and reverse() functions are refered to [Problem D, quize 5](https://hackmd.io/@sysprog/arch2024-quiz5). The major function, DIT_FFT, is shown below. ``` c void DIT_FFT(int size, float *real, float *image) { int i, j, k, stage, step; int halfStep; float tempReal, tempImage; // data swift to new index for (i = 0; i < size; i++) { int revIndex = reverse(size, i); if (i < revIndex) { // Swap real parts tempReal = real[i]; real[i] = real[revIndex]; real[revIndex] = tempReal; // Swap imaginary parts tempImage = image[i]; image[i] = image[revIndex]; image[revIndex] = tempImage; } } // Perform FFT computation for (stage = 1; stage <= logint(size); stage++) { step = 1 << stage; // 2^stage halfStep = step >> 1; // Step size / 2 for (k = 0; k < size; k += step) { for (j = 0; j < halfStep; j++) { // Compute twiddle factor float angle = -2.0f * M_PI * (float)j / (float)step; // Angle for twiddle factor float cosValue = cos(angle); // Replace with precomputed cosine lookup table for optimization float sinValue = sin(angle); // Replace with precomputed sine lookup table for optimization // Butterfly computation int index1 = k + j; int index2 = index1 + halfStep; tempReal = cosValue * real[index2] - sinValue * image[index2]; tempImage = sinValue * real[index2] + cosValue * image[index2]; real[index2] = real[index1] - tempReal; image[index2] = image[index1] - tempImage; real[index1] += tempReal; image[index1] += tempImage; } } } } ``` ### Data swift in assembly code Function of this assembly code, we can reordered index in the algorithm of FFT now. #### Reverse & Logint This code with some revices refer to quize5 problem D, C.A.2024. ```c .global logint logint: addi sp, sp, -4 sw t0, 0(sp) add t0, a0, zero # k = N add a0, zero, zero # i = 0 logloop: beq t0, zero, logloop_end # Exit if k == 0 srai t0, t0, 1 # k >>= 1 addi a0, a0, 1 # i++ j logloop logloop_end: addi a0, a0, -1 # Return i - 1 lw t0, 0(sp) addi sp, sp, 4 jr ra reverse: addi sp, sp, -28 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) sw s4, 20(sp) sw s5, 24(sp) jal logint # jump to logint and save position to ra addi s0, zero, 1 # j = 1 add s1, zero, zero # p = 0 forloop_reverse: bgt s0, a0, forloop_reverse_end sub s2, a0, s0# s2 = a0 - s0 addi s3, zero, 1 sll s3, s3, s2 and s3, a1, s3 beq s3, zero, elses3 # If not, skip ifs3: addi s4, s0, -1 # s4 = j - 1 addi s5, zero, 1 sll s5, s5, s4 or s1,s1,s5 elses3: addi s0, s0, 1 # j++ j forloop_reverse forloop_reverse_end: add a0, s1, zero # Return p lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) addi sp, sp, 28 jr ra ``` #### Swift ```c // a0=points // a1=current index // a2=address of real sequence // a3=address of image sequence swift: addi sp, sp, -28 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) sw s4, 20(sp) sw s5, 24(sp) addi s0, a1, 0 #s0 = a1, current index save to s0 # a0 = N(size), a1 = n(current index) jal reverse # now a0 = reverse (a0) slli s0, s0, 2 # s0 = s0 * 4 address shift of current index slli s1, a0, 2 # s1 = a0 * 4 address shift of reversed index # Deal with real part exchange add s2, a2, s0 # s2 = address of current index add s3, a2, s1 # s3 = address of exchange index lw s4, 0(s2) # s4 = the value of current index lw s5, 0(s3) # s5 = the value of reverse index sw s5, 0(s2) # store reversed value to current index sw s4, 0(s3) # store current value to reversed index # Deal with image part exchange add s2, a3, s0 # s2 = address of current index add s3, a3, s1 # s3 = address of exchange index lw s4, 0(s2) # s4 = the value of current index lw s5, 0(s3) # s5 = the value of reverse index sw s5, 0(s2) # store reversed value to current index sw s4, 0(s3) # store current value to reversed index #recover registers lw ra,0(sp) lw s0,4(sp) lw s1,8(sp) lw s2,12(sp) lw s3,16(sp) lw s4,20(sp) lw s5,24(sp) addi sp,sp,28 jr ra ``` ### float point operation In this assembly, I try to implement the addition operation of two float point numbers of IEEE 754 standard. Below are the steps how I realized this assembly. * First, I to compare value of binary numbers of two value without the consideration of sign bit. * Based on this, I take the larger one as the fixed number, trying bit shift operation on mantissa of smaller one to make sure having same exponent value both. * With implicit number added in mantissa, summing up both mantissa, and using count leading non-zero position, I use bitwise operations on both exponent and mantissa that I can have exact value of exp and position of mantissa * Finally with the sign of larger number, nomatter what sign of another float point, it dominate the final sign, put all result in a0 and return. However, due to deadline, I still can't overcome problems about overflow, even some specil cases. ```assembly .global f_add f_add: addi sp,sp,-24 sw ra, 0(sp) sw s0,4(sp) sw s1,8(sp) sw s2,12(sp) sw s3,16(sp) sw s4,20(sp) slli t0, a0, 1 slli t1, a1, 1 # ignore sign bit, compare which is larger srli t0, t0, 1 srli t1, t1, 1 #####(now a0 >= a1) lui t2 ,0x80000 and t2, a0, t2 and s0, a1, t2 #check sign difference, store in s0 bgt t1, t0, reorder keep_add: lui t0, 0x7F800 # mask for exp lui t1, 0x80000 # mask for sign and s1, a0, t0 and s2, a1, t0 # exp in s1, s2 lui t1, 0x007FF li t0, 0xFFF add t0, t0, t1 # mask for mantissa srli s1,s1,23 srli s2,s2,23 and s3, t0, a0 and s4, t0, a1 # mantissa in s3, s4 lui t1, 0x00800 # implicit bit sub s2, s1, s2 # exp s1-s2 add s3, s3, t1 add s4, s4, t1 # with implicit bnez s0, sign_diff # sign same # mantissa + # consider bit =? 0 at position24, if so, shift mantissa and exp -1 srl s4, s4, s2 # align exp add s2, s3, s4 # result mantissa = s3 j clnz sign_diff: #mantissa a0- mantissa a1 # use sign of a0 as major # case if all zero srl s4, s4, s2 # align exp sub s2, s3, s4 # result mantissa = s3 j clnz reorder: addi t0, a0, 0 addi a0, a1, 0 addi a1, t0, 0 j keep_add mantissa_0: li a0,0 j over # used for exp shift # s1 exp, s2, mantissa result with implicit clnz: li t0,32 #count non leading zero poosition li t1,128 beqz s2, mantissa_0 bgt s2,t1, over clnz_loop: lui t1, 0x80000 # fitst bit mask and t1, s2, t1 # check first bit bnez t1, end_clnz slli s2, s2,1 addi t0, t0, -1 # position -1 j clnz_loop end_clnz: addi t0, t0,-24 # bit position should at 24 lui t1, 0x00800 # implicit bit bltz t0, clnz_exp_minus add s1, s1, t0 srli s2, s2, 8 sub s2, s2, t1 slli s1, s1, 23 j result # return with mantissa(no implict) clnz_exp_minus: neg t0, t0 sub s1, s1, t0 srli s2, s2, 8 # shift implicit bit to 24th bit position sub s2, s2, t1 slli s1, s1, 23 result: # combine , s1 = exp, s2 = result mantissa add s1, s1, s2 lui t0, 0x80000 and a0, t0, a0 # sign bit or a0, a0, s1 over: #special case lw ra,0(sp) lw s0,4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4,20 (sp) addi sp,sp,24 jr ra ``` #### float point multiplication Below are my float point multiplication code. ```c .global f_mul f_mul: addi sp,sp,-28 sw ra,0(sp) sw s0,4(sp) sw s1,8(sp) sw s2,12(sp) sw s3,16(sp) sw s4,20(sp) sw s5,24(sp) srli s0, a0, 31 # s0 = a0 >> 31, sign bit of s0 srli s1, a1, 31 # s1 = a1 >> 31, sign bit of s1 li t0, 1 lui s2, 0x007FF li s3, 0xFFF add s3, s3, s2 # s3, mantissa mask and s2, a0, s3 # s2 = a0 & 0x7FFFFF, mantissa of a0 and s3, a1, s3 # s3 = a1 & 0x7FFFFF, mantissa of a1 slli t0, t0, 23 # t0 = implicit 1, at bit position 24 li s4, 1 xor s4, s0, s1 # s4 = s0 ^ s1, sign of result lui s5, 0x7F800 # s5 = 0x7F800000, exponent mask add s2, s2, t0 # s2 = s2 + t0, mantissa of a0 with implicit 1 add s3, s3, t0 # s3 = s3 + t0, mantissa of a1 with implicit 1 jal mul_mantissa # s2, s3 input and s1, a0, s5 # s1 = a0 & 0x7F800000, exponent of a0 and s2, a1, s5 # s2 = a1 & 0x7F800000, exponent of a1 slli s4,s4,31 # s4 = s4 << 31, sign of result add s3, s2, s1 # s3 = s2 + s1, exponent of result srli s3, s3, 23 # s3 = s3 >> 23, s3 = exponent of result bgtz s3, normal_exp addi s3, s3, -126 # s3 = s3 - 127, one bias, s3 = exponent of result slli s3, s3, 23 or s4, s4, s0 # s4 = s4 + s0, sign + mantissa or s4, s4, s3 # s4 = s4 + s3, sign + mantissa + exponent j end_f_mul normal_exp: addi s3, s3, -127 # s3 = s3 - 127, one bias, s3 = exponent of result slli s3, s3, 23 # s3 = s3 << 23, s3 = exponent of result or s4, s4, s0 # s4 = s4 + s0, sign + mantissa or s4, s4, s3 # s4 = s4 + s3, sign + mantissa + exponent end_f_mul: add a0, s4, zero # a0 = s4 + 0, return value lw ra, 0(sp) # restore ra lw s0, 4(sp) # restore s0 lw s1, 8(sp) # restore s1 lw s2, 12(sp) # restore s2 lw s3, 16(sp) # restore s3 lw s4, 20(sp) # restore s4 lw s5, 24(sp) # restore s5 addi sp, sp, 28 # restore sp jr ra mul_mantissa: # mantissa s2, s3 srli s2, s2, 8 srli s3, s3, 8 lui t1, 0x00008 # bit position of implicit 1, start at 16 li t2, 0 # t2 = 0, temp value add_mantissa: lui t0, 0x80000 # bit position 32 check and t3, s3, t1 # t3 = s3 & 0x000008, addition check bit beqz t3, skip_mul_add # if this bit = 0, skip add add t2,t2,s2 # t2 = t2 + s2 slli t2, t2, 1 # right shift temp value slli s3,s3,1 # left shift new 16th bit of s3 and t0, t0, t2 # take out 32th bit of t2 bnez t0, end_mul_mantissa # if 32th bit is 1 j add_mantissa skip_mul_add: slli t2, t2, 1 # right shift temp value slli s3,s3,1 # left shift new 16th bit of s3 and t0, t0, t2 # take out 32th bit of t2 bnez t0, end_mul_mantissa # if 32th bit is 1 j add_mantissa end_mul_mantissa: lui t1, 0x80000 sub t2, t2, t1 srli s0,t2,8 # right shift to right position of mantissa jr ra ``` ### Final Result and Summarize #### When combining assembly code ![image](https://hackmd.io/_uploads/Syngeok_yg.png) #### When only using C code ![image](https://hackmd.io/_uploads/BJdVesy_kx.png) ## Reference * [FFT](https://en.wikipedia.org/wiki/Fast_Fourier_transform) * [Coooley-Turkey FFT algorithm](https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm) * [Quiz5 2024](https://hackmd.io/@sysprog/arch2024-quiz5-sol) * https://github.com/vicLin8712/C.A.project