Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [kkkkk1109](https://github.com/kkkkk1109) > ###### tags: `RISC-V`, `jserv` ## Using Newton-Raphson to find square root The Newton-Raphson method is a numerical iteration technique used to approximate the root of a function. The method starts with an initial guess for the root and then uses the tangent line at that guess to find the x-intercept. This x-intercept is taken as the next estimated root, and the process is repeated iteratively to obtain an approximate value for the root. \begin{gather*}x_{n+1}=x_n-\dfrac{f(x_n)}{f^{'}(x_n)}\end{gather*} To find the square root of a integer number, the function can be defined as \begin{gather*}{f(x_n)=x_n^2-a}\end{gather*} where $a$ is the integer number The Newton-Raphson technique can be redefined as \begin{gather*}x_{n+1}=x_n-\dfrac{x_n^2-a}{2x_n}\end{gather*} On Simplification \begin{gather*}x_{n+1}=0.5x_n+\dfrac{a}{xn}\end{gather*} ## Utilizing CLZ in Newton-Raphson In Newton-Raphson method, the first guess is play an important part in speed of convergence and iteration times. With **CLZ** (count leading zeros), we can easily find the digits of the root, which leads us to take the guess that closer the actual root, therefore reducing the iteration times and increas the speed of convergence. ## Implementation You can find the source code [here](https://github.com/kkkkk1109/Implement-square-root-using-CLZ). Feel free to fork and modify it. ### C code In this assignment, we are prohibited from utilizing the M or F/U extension in RISC-V. Since the square root of an integer may result in a floating-point value, I employ the `uint_to_float` function to convert the data type from integer to float, adhering to the IEEE 754 standard. ```clike= float uint_to_float(uint32_t u){ unsigned int exp=127+31-count_leading_zeros(u); while ((u & (1 << 23)) == 0) { u <<= 1; } uint32_t flt= (exp << 23) | (u & 0x7FFFFF); return * (float *) &flt; } ``` Since the number we are addressing is positive , the sign bit is always zero. Use the CLZ to find the exponential, and use left shift to allign the integer to the significand. In `division`, for the sake of simplicity, several scenarios are not taken into account, including Infinity, NaN, overflow, underflow, division by zero, and so on. Since the input is a positive integer, the minimum possible result is 1 (the square root of 1). Additionally, in square root calculations, it is highly unlikely to encounter overflow or underflow issues because the largest possible number in this program is limited by the input integer. Therefore, these overlooked cases are not expected to arise, contributing to the reduced complexity of the division algorithm. ```clike= float division(float a,float b){ //change float into int int32_t ia = *(int32_t *)& a; int32_t ib = *(int32_t *)& b; //get the exponential int32_t expa = (ia >> 23) & 0xff ; int32_t expb = (ib >> 23) & 0xff; // parameter used in rounded output uint8_t odd, rnd, sticky; //get the significand uint32_t siga= ((ia & 0x7fffff) | 0x800000); uint32_t sigb= ((ib & 0x7fffff) | 0x800000); //exponential output int32_t expout= expa - expb +127; //allign two numbers' significand if(siga < sigb){ siga = siga << 1; expout--; } // r means result uint32_t r = 0; // star division for(int i= 0 ; i < 25 ; i++){ r= r << 1; if(siga >= sigb){ siga = siga - sigb; r = r | 1; } siga = siga << 1 ; } // sticky to see sticky =(siga != 0); rnd =(r & 1); odd =(r & 2) != 0; r=(r>>1)+ (rnd & (sticky | odd)); int32_t sigout=r & 0x7fffff; int32_t out= ((expout & 0xff) << 23) | (sigout); return *(float *) &out; } ``` In `addition`, cases ignored in division are also ignored. ```clike= float addition(float a,float b){ //change float into int int32_t ia = *(int32_t *)& a; int32_t ib = *(int32_t *)& b; //get the exponential int32_t expa = (ia >> 23) & 0xff ; int32_t expb = (ib >> 23) & 0xff; //get the significand uint32_t siga= ((ia & 0x7fffff) | 0x800000); uint32_t sigb= ((ib & 0x7fffff) | 0x800000); //exponential output int32_t expout=0; int dif= expa - expb; if(dif >= 0){ sigb = sigb >> dif; expout = expa; } else{ dif = -dif; siga = siga >> dif; expout = expb; } uint32_t sigout = siga + sigb; if(sigout >> 24 == 1){ sigout = sigout >>1; expout = expout + 1; } int32_t out=(0 & 1 << 31) | ((expout & 0xff) << 23) | ((sigout) & 0x7fffff); return *(float *) &out; } ``` In main, the steps can be described as follwing 1. Calculate the count of leading zeros (CLZ) for the input. 2. Compute the initial approximation of the square root using the input's digits. 3. Right-shift the input to obtain the first guess. 4. Initiate an iterative process. 5. Exit the loop when the solution converges. ```clike= int main(){ uint32_t r=500; lz =count_leading_zeros(r); lzc = (32 - lz) / 2; ans=uint_to_float(r >> lzc); float t,c; for (i = 0 ; i < iteration ; i++){ t = division(r,ans); t = addition(t,ans); t = division(t,2); ans=t; if(ans == c){ break; } c = ans; } printf("After %d iterations, the square root of %d is %f",i+1,r,ans); return 0; } ``` ### Assembly code In this example, I don't handle addition and multiplication overflow. So please ensure that the result of `mul` at line 33 and `add` at line 34 won't exceed the range of valid value ($-2^{31}$ to $2^{31} - 1$). ```clike= .data intput: .word 500 str1: .string "After " str2: .string " iterations, the square root of " str3: .string " is " .text # s0 = integer(input) # s1 = float(input) # a5 = square root of input main: lw s0, intput # r = test data; mv a1, s0 # utf(a1) jal ra, utf # goto utf mv s1, a1 # store float(input) to s1 jal ra, clz # jump to clz and get a0 =clz(r) addi t0, x0, 32 # t0 = 32 sub t0, t0, a0 # 32 -lz srli t0, t0, 1 # t0 = lzc =(32 - lz)/2 srl a1, s0, t0 # a1 = r >> lzc jal ra, utf # jump to utf and get ans (a1 = utf(a1)) mv a0, x0 # i =0 addi a2, x0, 5 # iteration times in 5 newton: bge a0, a2, end mv a3, s1 # dividend = s0 mv a4, a1 # division = a1 jal ra, division # go to division (r / ans) mv a6, a5 # a6 = t mv a3, a6 # a3 = a6 mv a4, a1 # a4 = ans jal ra, addition # go to addition (t + ans) addi t0, x0, 1 # t0 = 1 slli t0, t0, 23 # t0 = << 1 sub a5, a5, t0 # a5 - t0 (exp-1) do the divide 2 mv a1, a5 # ans = t beq s2, a1 end # if last est == now est mv s2, a1 # c = ans addi a0, a0, 1 # i++ beq x0, x0 newton end: addi t0, a0, 1 la a0, str1 # after li a7, 4 ecall mv a0, t0 #i li a7, 1 ecall la a0, str2 # iteration, the square root of : li a7, 4 ecall mv a0, s0 # input li a7, 1 ecall la a0, str3 # iteration, the square root of : li a7, 4 ecall mv a0, a5 li a7, 2 ecall li a7, 10 ecall clz: mv t0, s0 # t0 = x mv t1, t0 # t1 = t0 srli t0, t0, 1 # x >> 1 or t0, t0, t1 # x = x | x >> 1 srli t1, t0, 2 # x >> 2 or t0, t0, t1 # x = x | x >> 2 srli t1, t0, 4 # x >> 4 or t0, t0, t1 # x = x | x >> 4 srli t1, t0, 8 # x >> 8 or t0, t0, t1 # x = x | x >> 8 # count ones (population count) srli t1, t0, 1 # x >> 1 li t6, 0x55555555 # load mask 0x55555555 and t1, t1, t6 # (x >> 1) & 0x55555555 sub t0, t0, t1 # x -= ((x >> 1) & 0x55555555) srli t1, t0, 2 # x >> 2 li t6, 0x33333333 # load mask 0x33333333 and t1, t1, t6 # (x >> 2) & 0x33333333 and t0, t0, t6 # x & 0x33333333 add t0, t0, t1 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) srli t1, t0, 4 # x >> 4 add t0, t0, t1 # x + (x >> 4) li t6, 0x0f0f0f0f # load mask 0x0f0f0f0f and t0, t0, t6 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t1, t0, 8 # x >> 8 add t0, t1, t0 # x + (x >> 8) srli t1, t0, 16 # x >> 16 add t0, t1, t0 # x + (x >> 16) andi t0, t0, 0x7f # x & 0x7f addi t1, x0, 32 # t1 = 32 sub a0, t1, t0 # return (32 - (x & 0x7f)) jr ra utf: addi sp, sp, -12 # make space on stack sw ra, 8(sp) # save ra sw s0, 4(sp) # save s0 sw a0, 0(sp) # save a0 mv s0, a1 jal ra, clz mv t0, a1 # t0 = a1 = u ,a0 = clz(u) addi t1, x0, 158 # t1 = 127 + 31 sub t1, t1, a0 # exp = 158 - clz(u) addi t3, x0, 1 # t3 = 1 slli t3, t3, 23 # t3 = t3 << 23 Loop: and t2, t0, t3 # t2 = u & (1 << 23) bne t2, x0, outloop # u & (1 << 23) != 0, jump out of loop slli t0, t0, 1 # u = u << 1 beq x0, x0, Loop # back to loop outloop: slli t1, t1, 23 # exp << 23 li t6, 0x7fffff # load mask 0x7fffff and t0, t0, t6 # u & 0x7fffff or a1, t1, t0 # a1= (exp << 23) | (u & 0x7FFFFF); lw ra, 8(sp) # get return adress lw s0, 4(sp) # get s0 lw a0, 0(sp) # get a0 jr ra # back to main #/////////////////division #a3= dividend #a4= divisor #a5= quotient #/////////////// #t0 = expout; #t1 = siga #t2 = sigb #t3 = i #t4 = 25 division: srli t0, a3, 23 # ia >> 23 andi t0, t0, 0xff # (ia >> 23) & 0xff expa srli t1, a4, 23 # ib >> 23 andi t1, t1, 0xff # (ib >> 23) & 0xff expb sub t0, t0, t1 # (expa - expb) addi t0, t0, 127 # expout = expa -expb +127 li t6, 0x7fffff # load mask 0x7fffff and t1, a3, t6 # ia & 0x7fffff siga and t2, a4, t6 # ib & 0x7fffff sigb li t6, 0x800000 # load mask 0x7fffff or t1, t1, t6 # siga = (ia & 0x7fffff ) | 0x800000 or t2, t2, t6 # sigb = (ib & 0x7fffff ) | 0x800000 div1: bge t1, t2, div2 # siga > sigb go to div2 slli t1, t1, 1 # siga = siga << 1 addi t0, t0, -1 # expout-- mv a5, x0 # r = 0 beq x0, x0, div1 # back to div1 div2: mv t3, x0 # i = 0 addi t4, x0, 25 # t4 = 25 shift_sub: bge t3, t4, div3 # if i >= 25 go to div3 slli a5, a5, 1 # r = r << 1 bltu t1, t2, next # if (siga < sigb) go to next sub t1, t1, t2 # siga = siga - sigb ori a5, a5, 1 # r = r | 1 next: slli t1, t1, 1 # siga = siga << 1 addi t3, t3, 1 # i = i + 1 beq x0, x0, shift_sub # back to shift_sub div3: mv t3, x0 # sticky = 0 bne t1, x0, neq # if siga != 0 go to neq beq x0, x0, round # skip eq and go to round #t3 = sticky #t4 = rnd #t5 = odd #t6 = lui neq: addi t3, x0, 1 # (siga !=0 ) is true round: andi t4, a5, 1 # rnd = (r & 1) andi t5, a5, 2 # r & 2 bne t5, x0, odd # (r & 2) != 0 odd beq x0, x0, out # go to out odd: addi t5, x0, 1 # (r & 2) != 0 is true out: or t3, t3, t5 # sticky | odd and t3, t3, t4 # rnd & (sticky | odd) srli a5, a5, 1 # r = r >> 1 or a5, a5, t3 # r = r >> 1 + (rnd & (sticky | odd)) li t6, 0x7fffff # load mask 0x7fffff and a5, a5, t6 # r & 0x7fffff andi t0, t0, 0xff # expout & 0xff slli t0, t0, 23 # expout = expout << 23 or a5, a5, t0 # out= ((expout & 0xff) << 23) | (sigout) jr ra # back to main #///////////////////addition #a3= a #a4= b #a5= a+b #/////////////// #t0 = expa / expout; #t1 = expb / sigout #t2 = dif #t3 = siga #t4 = sigb addition: srli t0, a3, 23 # ia >> 23 andi t0, t0, 0xff # expa = (ia >> 23) & 0xff srli t1, a4, 23 # ib >> 23 andi t1, t1, 0xff # expb = (ib >> 23) & 0xff sub t2, t0, t1 # dif = expa - expb li t6, 0x7fffff # load mask 0x7fffff and t3, a3, t6 # ia & 0x7fffff and t4, a4, t6 # ib & 0x7fffff li t6, 0x800000 # load mask 0x800000 or t3, t3, t6 # siga = ((ia & 0x7fffff) | 0x800000) or t4, t4, t6 # sigb = ((ib & 0x7fffff) | 0x800000) blt t2, x0, out_b # if dif < 0 , expout = expb, else expout=expa srl t4, t4, t2 # sigb = sigb >> dif beq x0, x0, plus # go to plus out_b: sub t2, x0, t2 # dif= - dif change dif to positive srl t3, t3, t2 # siga = siga >> dif mv t0, t1 # expout = expb plus: add t1, t3, t4 # sigout= siga + sigb srli t2, t1, 24 # sigout >> 24 addi t3, x0, 1 # t3 = 1 bne t2, t3, add_out # if (sigout >> 24) != 1 end addition srli t1, t1, 1 # sigout = sigout >> 1 addi t0, t0, 1 # expout ++ add_out: andi t0, t0, 0xff # expout & 0xff slli t0, t0, 23 # (expout & 0xff) << 23 li t6, 0x7fffff # load mask 0x7fffff and t1, t1, t6 # sigout& 0x7fffff or a5, t0, t1 # out = ((expout & 0xff) <<23) | ((sigout) & 0x7fffff) jr ra ``` ## Analysis I test my code using [Ripes](https://github.com/mortbopet/Ripes) simulator. ### Pseudo instruction The translated code looks like: ``` 00000000 <main>: 0: 10000417 auipc x8 0x10000 4: 00042403 lw x8 0 x8 8: 00040593 addi x11 x8 0 c: 150000ef jal x1 336 <utf> 10: 00058493 addi x9 x11 0 14: 0c0000ef jal x1 192 <clz> 18: 02000293 addi x5 x0 32 1c: 40a282b3 sub x5 x5 x10 20: 0012d293 srli x5 x5 1 24: 005455b3 srl x11 x8 x5 28: 134000ef jal x1 308 <utf> 2c: 00000513 addi x10 x0 0 30: 00500613 addi x12 x0 5 00000034 <newton>: 34: 04c55063 bge x10 x12 64 <end> 38: 00048693 addi x13 x9 0 3c: 00058713 addi x14 x11 0 40: 17c000ef jal x1 380 <division> 44: 00078813 addi x16 x15 0 48: 00080693 addi x13 x16 0 4c: 00058713 addi x14 x11 0 50: 22c000ef jal x1 556 <addition> 54: 00100293 addi x5 x0 1 58: 01729293 slli x5 x5 23 5c: 405787b3 sub x15 x15 x5 60: 00078593 addi x11 x15 0 64: 00b90863 beq x18 x11 16 <end> 68: 00058913 addi x18 x11 0 6c: 00150513 addi x10 x10 1 70: fc0002e3 beq x0 x0 -60 <newton> 00000074 <end>: 74: 00150293 addi x5 x10 1 78: 10000517 auipc x10 0x10000 7c: f8c50513 addi x10 x10 -116 80: 00400893 addi x17 x0 4 84: 00000073 ecall 88: 00028513 addi x10 x5 0 8c: 00100893 addi x17 x0 1 90: 00000073 ecall 94: 10000517 auipc x10 0x10000 98: f7750513 addi x10 x10 -137 9c: 00400893 addi x17 x0 4 a0: 00000073 ecall a4: 00040513 addi x10 x8 0 a8: 00100893 addi x17 x0 1 ac: 00000073 ecall b0: 10000517 auipc x10 0x10000 b4: f7c50513 addi x10 x10 -132 b8: 00400893 addi x17 x0 4 bc: 00000073 ecall c0: 00078513 addi x10 x15 0 c4: 00200893 addi x17 x0 2 c8: 00000073 ecall cc: 00a00893 addi x17 x0 10 d0: 00000073 ecall 000000d4 <clz>: d4: 00040293 addi x5 x8 0 d8: 00028313 addi x6 x5 0 dc: 0012d293 srli x5 x5 1 e0: 0062e2b3 or x5 x5 x6 e4: 0022d313 srli x6 x5 2 e8: 0062e2b3 or x5 x5 x6 ec: 0042d313 srli x6 x5 4 f0: 0062e2b3 or x5 x5 x6 f4: 0082d313 srli x6 x5 8 f8: 0062e2b3 or x5 x5 x6 fc: 0012d313 srli x6 x5 1 100: 55555fb7 lui x31 0x55555 104: 555f8f93 addi x31 x31 1365 108: 01f37333 and x6 x6 x31 10c: 406282b3 sub x5 x5 x6 110: 0022d313 srli x6 x5 2 114: 33333fb7 lui x31 0x33333 118: 333f8f93 addi x31 x31 819 11c: 01f37333 and x6 x6 x31 120: 01f2f2b3 and x5 x5 x31 124: 006282b3 add x5 x5 x6 128: 0042d313 srli x6 x5 4 12c: 006282b3 add x5 x5 x6 130: 0f0f1fb7 lui x31 0xf0f1 134: f0ff8f93 addi x31 x31 -241 138: 01f2f2b3 and x5 x5 x31 13c: 0082d313 srli x6 x5 8 140: 005302b3 add x5 x6 x5 144: 0102d313 srli x6 x5 16 148: 005302b3 add x5 x6 x5 14c: 07f2f293 andi x5 x5 127 150: 02000313 addi x6 x0 32 154: 40530533 sub x10 x6 x5 158: 00008067 jalr x0 x1 0 0000015c <utf>: 15c: ff410113 addi x2 x2 -12 160: 00112423 sw x1 8 x2 164: 00812223 sw x8 4 x2 168: 00a12023 sw x10 0 x2 16c: 00058413 addi x8 x11 0 170: f65ff0ef jal x1 -156 <clz> 174: 00058293 addi x5 x11 0 178: 09e00313 addi x6 x0 158 17c: 40a30333 sub x6 x6 x10 180: 00100e13 addi x28 x0 1 184: 017e1e13 slli x28 x28 23 00000188 <Loop>: 188: 01c2f3b3 and x7 x5 x28 18c: 00039663 bne x7 x0 12 <outloop> 190: 00129293 slli x5 x5 1 194: fe000ae3 beq x0 x0 -12 <Loop> 00000198 <outloop>: 198: 01731313 slli x6 x6 23 19c: 00800fb7 lui x31 0x800 1a0: ffff8f93 addi x31 x31 -1 1a4: 01f2f2b3 and x5 x5 x31 1a8: 005365b3 or x11 x6 x5 1ac: 00812083 lw x1 8 x2 1b0: 00412403 lw x8 4 x2 1b4: 00012503 lw x10 0 x2 1b8: 00008067 jalr x0 x1 0 000001bc <division>: 1bc: 0176d293 srli x5 x13 23 1c0: 0ff2f293 andi x5 x5 255 1c4: 01775313 srli x6 x14 23 1c8: 0ff37313 andi x6 x6 255 1cc: 406282b3 sub x5 x5 x6 1d0: 07f28293 addi x5 x5 127 1d4: 00800fb7 lui x31 0x800 1d8: ffff8f93 addi x31 x31 -1 1dc: 01f6f333 and x6 x13 x31 1e0: 01f773b3 and x7 x14 x31 1e4: 00800fb7 lui x31 0x800 1e8: 01f36333 or x6 x6 x31 1ec: 01f3e3b3 or x7 x7 x31 000001f0 <div1>: 1f0: 00735a63 bge x6 x7 20 <div2> 1f4: 00131313 slli x6 x6 1 1f8: fff28293 addi x5 x5 -1 1fc: 00000793 addi x15 x0 0 200: fe0008e3 beq x0 x0 -16 <div1> 00000204 <div2>: 204: 00000e13 addi x28 x0 0 208: 01900e93 addi x29 x0 25 0000020c <shift_sub>: 20c: 03de5063 bge x28 x29 32 <div3> 210: 00179793 slli x15 x15 1 214: 00736663 bltu x6 x7 12 <next> 218: 40730333 sub x6 x6 x7 21c: 0017e793 ori x15 x15 1 00000220 <next>: 220: 00131313 slli x6 x6 1 224: 001e0e13 addi x28 x28 1 228: fe0002e3 beq x0 x0 -28 <shift_sub> 0000022c <div3>: 22c: 00000e13 addi x28 x0 0 230: 00031463 bne x6 x0 8 <neq> 234: 00000463 beq x0 x0 8 <round> 00000238 <neq>: 238: 00100e13 addi x28 x0 1 0000023c <round>: 23c: 0017fe93 andi x29 x15 1 240: 0027ff13 andi x30 x15 2 244: 000f1463 bne x30 x0 8 <odd> 248: 00000463 beq x0 x0 8 <out> 0000024c <odd>: 24c: 00100f13 addi x30 x0 1 00000250 <out>: 250: 01ee6e33 or x28 x28 x30 254: 01de7e33 and x28 x28 x29 258: 0017d793 srli x15 x15 1 25c: 01c7e7b3 or x15 x15 x28 260: 00800fb7 lui x31 0x800 264: ffff8f93 addi x31 x31 -1 268: 01f7f7b3 and x15 x15 x31 26c: 0ff2f293 andi x5 x5 255 270: 01729293 slli x5 x5 23 274: 0057e7b3 or x15 x15 x5 278: 00008067 jalr x0 x1 0 0000027c <addition>: 27c: 0176d293 srli x5 x13 23 280: 0ff2f293 andi x5 x5 255 284: 01775313 srli x6 x14 23 288: 0ff37313 andi x6 x6 255 28c: 406283b3 sub x7 x5 x6 290: 00800fb7 lui x31 0x800 294: ffff8f93 addi x31 x31 -1 298: 01f6fe33 and x28 x13 x31 29c: 01f77eb3 and x29 x14 x31 2a0: 00800fb7 lui x31 0x800 2a4: 01fe6e33 or x28 x28 x31 2a8: 01feeeb3 or x29 x29 x31 2ac: 0003c663 blt x7 x0 12 <out_b> 2b0: 007edeb3 srl x29 x29 x7 2b4: 00000863 beq x0 x0 16 <plus> 000002b8 <out_b>: 2b8: 407003b3 sub x7 x0 x7 2bc: 007e5e33 srl x28 x28 x7 2c0: 00030293 addi x5 x6 0 000002c4 <plus>: 2c4: 01de0333 add x6 x28 x29 2c8: 01835393 srli x7 x6 24 2cc: 00100e13 addi x28 x0 1 2d0: 01c39663 bne x7 x28 12 <add_out> 2d4: 00135313 srli x6 x6 1 2d8: 00128293 addi x5 x5 1 000002dc <add_out>: 2dc: 0ff2f293 andi x5 x5 255 2e0: 01729293 slli x5 x5 23 2e4: 00800fb7 lui x31 0x800 2e8: ffff8f93 addi x31 x31 -1 2ec: 01f37333 and x6 x6 x31 2f0: 0062e7b3 or x15 x5 x6 2f4: 00008067 jalr x0 x1 0 ``` ### 5-stage pipelined processor We now have the option to select a processor for executing this code. Ripes offers four processor options to choose from, which include the **single-cycle processor**, **5-stage processor**, **5-stage processor with hazard detection**, and **5-stage processor with forward and hazard detection**. In this case, we will opt for the 5-stage processor. The following is the diagram: ![](https://hackmd.io/_uploads/ry4L9ml-a.png) The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are: 1. **IF** (Instruction Fetch): * Fetch the instruction from memory. * Read the program counter (PC), which points to the memory address of the next instruction. * Load the instruction into the instruction fetch buffer. 2. **ID** (Instruction Decode): * Decode the fetched instruction. * Determine the operation to be performed. * Identify source and destination registers. * Calculate any immediate values required. * Generate control signals based on the instruction's opcode. 3. **EX** (Execution): * Perform the actual execution of the instruction. * Execute arithmetic and logic operations. * Calculate memory addresses. * Resolve branch instructions. * Update the program counter if needed. 4. **MEM** (Memory Access): * Access memory as required. * Handle load and store instructions. * Load data from memory (for load instructions). * Store data to memory (for store instructions). * Pass the result of a load instruction to the next stage. 5. **WB** (Write-Back): * Write back the results of the executed instruction to the register file. * Update register values if the instruction modified any registers. * Store the updated value in the appropriate register file location. In Ripes, we can see that each stage are handling the different code ![](https://hackmd.io/_uploads/rkhmU4eWa.png) <style> .text-center{ text-align: center; //文字置中 } .text-left{ text-align: left; //文字靠左 } .text-right{ text-align: right; //文字靠右 } </style> <p class="text-center"> excutable code </p> ![](https://hackmd.io/_uploads/Bk3EUExWp.png) </style> <p class="text-center"> Pipeline </p> Let's take `addi x8, x11, 0` for example ### **IF - Instruction Fetch** ![](https://hackmd.io/_uploads/HkomKElZa.png) In this stage, we can see now the program counter is `0x00000150`, and load the instruction `0x00058413`. Also, the next program counter `0x00000154` are showed on the left side. ### **ID - Instruction Decode** ![](https://hackmd.io/_uploads/r1Z0nEeZT.png) Next, the instruction is decoded to `addi` operation. Get the value `0x000001f4` from register **x11**, and immediate `0x00000000`, then passing to next stage. ### **EX - Execution** ![](https://hackmd.io/_uploads/r1d2pEgbp.png) Here, the ALU performs the `add` instruction ,add `0x000001f4` and `0x00000000`, output the result to next stage. ### **MEM - Memory Access** ![](https://hackmd.io/_uploads/SJcMxBebp.png) The result `0x000001f4` is loaded, and the destination register is `0x000001f4`, which is **x8**. Since we are using `addi` operation, there is no need to load or store to memory. ### **WB - Write Bcak** ![](https://hackmd.io/_uploads/H1ZVeBlb6.png) Last stage, the result data `0x000001f4` is written to the resigter **x8** and is the end of the instruction `addi x8, x11, 0`. After all these stage are done, the register is updated like this: ![](https://hackmd.io/_uploads/B1-3JmZ-T.png) ## Result The test numbers are **500** ,**1160030**, **36** ![](https://hackmd.io/_uploads/rykCfreZT.png) ## Performance Performing the square root of 500 ![](https://hackmd.io/_uploads/ryISaEWWT.png) ## Improvement ##### 1. `uint_to float` In `uint_to_float`, we cam allign the integer to the significand using the clz instead of using a while loop. ```clike= float uint_to_float(uint32_t u){ int c = count_leading_zeros(u); unsigned int exp=127+31-c; u <<= (c-8); uint32_t flt= (exp << 23) | (u & 0x7FFFFF); return * (float *) &flt; } ``` ![](https://hackmd.io/_uploads/rywupEbba.png) Reduce the cycle from `2172` to `1960` ##### 2. `division` In `division`, I found that the square root result with and without rounded in division is almost the same, so I take the `rounded` part away from `division`. ```clike= float division(float a,float b){ //change float into int int32_t ia = *(int32_t *)& a; int32_t ib = *(int32_t *)& b; //get the exponential int32_t expa = (ia >> 23) & 0xff ; int32_t expb = (ib >> 23) & 0xff; // parameter used in rounded output uint8_t odd, rnd, sticky; //get the significand uint32_t siga= ((ia & 0x7fffff) | 0x800000); uint32_t sigb= ((ib & 0x7fffff) | 0x800000); //exponential output int32_t expout= expa - expb +127; //allign two numbers' significand if(siga < sigb){ siga = siga << 1; expout--; } // r means result uint32_t r = 0; // star division for(int i= 0 ; i < 25 ; i++){ r= r << 1; if(siga >= sigb){ siga = siga - sigb; r = r | 1; } siga = siga << 1 ; } r=(r >> 1); int32_t sigout=r & 0x7fffff; int32_t out= ((expout & 0xff) << 23) | (sigout); return *(float *) &out; } ``` ![](https://hackmd.io/_uploads/ry-2pVWbp.png) Reduce the cycle from `1960` to `1890` ## Future Work * Find more way to improve the code ## Reference * [Power of CLZ instruction in numerical computations](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8290400&isnumber=8290323) * [Float division](https://cs.stackexchange.com/questions/80668/simple-algorithm-for-ieee-754-division-on-8-bit-cpu) * [RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) * [RISC-V Green Card](https://www.cl.cam.ac.uk/teaching/1617/ECAD+Arch/files/docs/RISCVGreenCardv8-20151013.pdf)