# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [kkkkk1109](https://github.com/kkkkk1109) >
###### tags: `RISC-V`, `jserv`
## Using Newton-Raphson to find square root
The Newton-Raphson method is a numerical iteration technique used to approximate the root of a function. The method starts with an initial guess for the root and then uses the tangent line at that guess to find the x-intercept. This x-intercept is taken as the next estimated root, and the process is repeated iteratively to obtain an approximate value for the root.
\begin{gather*}x_{n+1}=x_n-\dfrac{f(x_n)}{f^{'}(x_n)}\end{gather*}
To find the square root of a integer number, the function can be defined as
\begin{gather*}{f(x_n)=x_n^2-a}\end{gather*}
where $a$ is the integer number
The Newton-Raphson technique can be redefined as
\begin{gather*}x_{n+1}=x_n-\dfrac{x_n^2-a}{2x_n}\end{gather*}
On Simplification
\begin{gather*}x_{n+1}=0.5x_n+\dfrac{a}{xn}\end{gather*}
## Utilizing CLZ in Newton-Raphson
In Newton-Raphson method, the first guess is play an important part in speed of convergence and iteration times. With **CLZ** (count leading zeros), we can easily find the digits of the root, which leads us to take the guess that closer the actual root, therefore reducing the iteration times and increas the speed of convergence.
## Implementation
You can find the source code [here](https://github.com/kkkkk1109/Implement-square-root-using-CLZ). Feel free to fork and modify it.
### C code
In this assignment, we are prohibited from utilizing the M or F/U extension in RISC-V. Since the square root of an integer may result in a floating-point value, I employ the `uint_to_float` function to convert the data type from integer to float, adhering to the IEEE 754 standard.
```clike=
float uint_to_float(uint32_t u){
unsigned int exp=127+31-count_leading_zeros(u);
while ((u & (1 << 23)) == 0)
{
u <<= 1;
}
uint32_t flt= (exp << 23) | (u & 0x7FFFFF);
return * (float *) &flt;
}
```
Since the number we are addressing is positive , the sign bit is always zero. Use the CLZ to find the exponential, and use left shift to allign the integer to the significand.
In `division`, for the sake of simplicity, several scenarios are not taken into account, including Infinity, NaN, overflow, underflow, division by zero, and so on. Since the input is a positive integer, the minimum possible result is 1 (the square root of 1). Additionally, in square root calculations, it is highly unlikely to encounter overflow or underflow issues because the largest possible number in this program is limited by the input integer. Therefore, these overlooked cases are not expected to arise, contributing to the reduced complexity of the division algorithm.
```clike=
float division(float a,float b){
//change float into int
int32_t ia = *(int32_t *)& a;
int32_t ib = *(int32_t *)& b;
//get the exponential
int32_t expa = (ia >> 23) & 0xff ;
int32_t expb = (ib >> 23) & 0xff;
// parameter used in rounded output
uint8_t odd, rnd, sticky;
//get the significand
uint32_t siga= ((ia & 0x7fffff) | 0x800000);
uint32_t sigb= ((ib & 0x7fffff) | 0x800000);
//exponential output
int32_t expout= expa - expb +127;
//allign two numbers' significand
if(siga < sigb){
siga = siga << 1;
expout--;
}
// r means result
uint32_t r = 0;
// star division
for(int i= 0 ; i < 25 ; i++){
r= r << 1;
if(siga >= sigb){
siga = siga - sigb;
r = r | 1;
}
siga = siga << 1 ;
}
// sticky to see
sticky =(siga != 0);
rnd =(r & 1);
odd =(r & 2) != 0;
r=(r>>1)+ (rnd & (sticky | odd));
int32_t sigout=r & 0x7fffff;
int32_t out= ((expout & 0xff) << 23) | (sigout);
return *(float *) &out;
}
```
In `addition`, cases ignored in division are also ignored.
```clike=
float addition(float a,float b){
//change float into int
int32_t ia = *(int32_t *)& a;
int32_t ib = *(int32_t *)& b;
//get the exponential
int32_t expa = (ia >> 23) & 0xff ;
int32_t expb = (ib >> 23) & 0xff;
//get the significand
uint32_t siga= ((ia & 0x7fffff) | 0x800000);
uint32_t sigb= ((ib & 0x7fffff) | 0x800000);
//exponential output
int32_t expout=0;
int dif= expa - expb;
if(dif >= 0){
sigb = sigb >> dif;
expout = expa;
}
else{
dif = -dif;
siga = siga >> dif;
expout = expb;
}
uint32_t sigout = siga + sigb;
if(sigout >> 24 == 1){
sigout = sigout >>1;
expout = expout + 1;
}
int32_t out=(0 & 1 << 31) | ((expout & 0xff) << 23) | ((sigout) & 0x7fffff);
return *(float *) &out;
}
```
In main, the steps can be described as follwing
1. Calculate the count of leading zeros (CLZ) for the input.
2. Compute the initial approximation of the square root using the input's digits.
3. Right-shift the input to obtain the first guess.
4. Initiate an iterative process.
5. Exit the loop when the solution converges.
```clike=
int main(){
uint32_t r=500;
lz =count_leading_zeros(r);
lzc = (32 - lz) / 2;
ans=uint_to_float(r >> lzc);
float t,c;
for (i = 0 ; i < iteration ; i++){
t = division(r,ans);
t = addition(t,ans);
t = division(t,2);
ans=t;
if(ans == c){
break;
}
c = ans;
}
printf("After %d iterations, the square root of %d is %f",i+1,r,ans);
return 0;
}
```
### Assembly code
In this example, I don't handle addition and multiplication overflow. So please ensure that the result of `mul` at line 33 and `add` at line 34 won't exceed the range of valid value ($-2^{31}$ to $2^{31} - 1$).
```clike=
.data
intput: .word 500
str1: .string "After "
str2: .string " iterations, the square root of "
str3: .string " is "
.text
# s0 = integer(input)
# s1 = float(input)
# a5 = square root of input
main:
lw s0, intput # r = test data;
mv a1, s0 # utf(a1)
jal ra, utf # goto utf
mv s1, a1 # store float(input) to s1
jal ra, clz # jump to clz and get a0 =clz(r)
addi t0, x0, 32 # t0 = 32
sub t0, t0, a0 # 32 -lz
srli t0, t0, 1 # t0 = lzc =(32 - lz)/2
srl a1, s0, t0 # a1 = r >> lzc
jal ra, utf # jump to utf and get ans (a1 = utf(a1))
mv a0, x0 # i =0
addi a2, x0, 5 # iteration times in 5
newton:
bge a0, a2, end
mv a3, s1 # dividend = s0
mv a4, a1 # division = a1
jal ra, division # go to division (r / ans)
mv a6, a5 # a6 = t
mv a3, a6 # a3 = a6
mv a4, a1 # a4 = ans
jal ra, addition # go to addition (t + ans)
addi t0, x0, 1 # t0 = 1
slli t0, t0, 23 # t0 = << 1
sub a5, a5, t0 # a5 - t0 (exp-1) do the divide 2
mv a1, a5 # ans = t
beq s2, a1 end # if last est == now est
mv s2, a1 # c = ans
addi a0, a0, 1 # i++
beq x0, x0 newton
end:
addi t0, a0, 1
la a0, str1 # after
li a7, 4
ecall
mv a0, t0 #i
li a7, 1
ecall
la a0, str2 # iteration, the square root of :
li a7, 4
ecall
mv a0, s0 # input
li a7, 1
ecall
la a0, str3 # iteration, the square root of :
li a7, 4
ecall
mv a0, a5
li a7, 2
ecall
li a7, 10
ecall
clz:
mv t0, s0 # t0 = x
mv t1, t0 # t1 = t0
srli t0, t0, 1 # x >> 1
or t0, t0, t1 # x = x | x >> 1
srli t1, t0, 2 # x >> 2
or t0, t0, t1 # x = x | x >> 2
srli t1, t0, 4 # x >> 4
or t0, t0, t1 # x = x | x >> 4
srli t1, t0, 8 # x >> 8
or t0, t0, t1 # x = x | x >> 8
# count ones (population count)
srli t1, t0, 1 # x >> 1
li t6, 0x55555555 # load mask 0x55555555
and t1, t1, t6 # (x >> 1) & 0x55555555
sub t0, t0, t1 # x -= ((x >> 1) & 0x55555555)
srli t1, t0, 2 # x >> 2
li t6, 0x33333333 # load mask 0x33333333
and t1, t1, t6 # (x >> 2) & 0x33333333
and t0, t0, t6 # x & 0x33333333
add t0, t0, t1 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
srli t1, t0, 4 # x >> 4
add t0, t0, t1 # x + (x >> 4)
li t6, 0x0f0f0f0f # load mask 0x0f0f0f0f
and t0, t0, t6 # x = ((x >> 4) + x) & 0x0f0f0f0f
srli t1, t0, 8 # x >> 8
add t0, t1, t0 # x + (x >> 8)
srli t1, t0, 16 # x >> 16
add t0, t1, t0 # x + (x >> 16)
andi t0, t0, 0x7f # x & 0x7f
addi t1, x0, 32 # t1 = 32
sub a0, t1, t0 # return (32 - (x & 0x7f))
jr ra
utf:
addi sp, sp, -12 # make space on stack
sw ra, 8(sp) # save ra
sw s0, 4(sp) # save s0
sw a0, 0(sp) # save a0
mv s0, a1
jal ra, clz
mv t0, a1 # t0 = a1 = u ,a0 = clz(u)
addi t1, x0, 158 # t1 = 127 + 31
sub t1, t1, a0 # exp = 158 - clz(u)
addi t3, x0, 1 # t3 = 1
slli t3, t3, 23 # t3 = t3 << 23
Loop:
and t2, t0, t3 # t2 = u & (1 << 23)
bne t2, x0, outloop # u & (1 << 23) != 0, jump out of loop
slli t0, t0, 1 # u = u << 1
beq x0, x0, Loop # back to loop
outloop:
slli t1, t1, 23 # exp << 23
li t6, 0x7fffff # load mask 0x7fffff
and t0, t0, t6 # u & 0x7fffff
or a1, t1, t0 # a1= (exp << 23) | (u & 0x7FFFFF);
lw ra, 8(sp) # get return adress
lw s0, 4(sp) # get s0
lw a0, 0(sp) # get a0
jr ra # back to main
#/////////////////division
#a3= dividend
#a4= divisor
#a5= quotient
#///////////////
#t0 = expout;
#t1 = siga
#t2 = sigb
#t3 = i
#t4 = 25
division:
srli t0, a3, 23 # ia >> 23
andi t0, t0, 0xff # (ia >> 23) & 0xff expa
srli t1, a4, 23 # ib >> 23
andi t1, t1, 0xff # (ib >> 23) & 0xff expb
sub t0, t0, t1 # (expa - expb)
addi t0, t0, 127 # expout = expa -expb +127
li t6, 0x7fffff # load mask 0x7fffff
and t1, a3, t6 # ia & 0x7fffff siga
and t2, a4, t6 # ib & 0x7fffff sigb
li t6, 0x800000 # load mask 0x7fffff
or t1, t1, t6 # siga = (ia & 0x7fffff ) | 0x800000
or t2, t2, t6 # sigb = (ib & 0x7fffff ) | 0x800000
div1:
bge t1, t2, div2 # siga > sigb go to div2
slli t1, t1, 1 # siga = siga << 1
addi t0, t0, -1 # expout--
mv a5, x0 # r = 0
beq x0, x0, div1 # back to div1
div2:
mv t3, x0 # i = 0
addi t4, x0, 25 # t4 = 25
shift_sub:
bge t3, t4, div3 # if i >= 25 go to div3
slli a5, a5, 1 # r = r << 1
bltu t1, t2, next # if (siga < sigb) go to next
sub t1, t1, t2 # siga = siga - sigb
ori a5, a5, 1 # r = r | 1
next:
slli t1, t1, 1 # siga = siga << 1
addi t3, t3, 1 # i = i + 1
beq x0, x0, shift_sub # back to shift_sub
div3:
mv t3, x0 # sticky = 0
bne t1, x0, neq # if siga != 0 go to neq
beq x0, x0, round # skip eq and go to round
#t3 = sticky
#t4 = rnd
#t5 = odd
#t6 = lui
neq:
addi t3, x0, 1 # (siga !=0 ) is true
round:
andi t4, a5, 1 # rnd = (r & 1)
andi t5, a5, 2 # r & 2
bne t5, x0, odd # (r & 2) != 0 odd
beq x0, x0, out # go to out
odd:
addi t5, x0, 1 # (r & 2) != 0 is true
out:
or t3, t3, t5 # sticky | odd
and t3, t3, t4 # rnd & (sticky | odd)
srli a5, a5, 1 # r = r >> 1
or a5, a5, t3 # r = r >> 1 + (rnd & (sticky | odd))
li t6, 0x7fffff # load mask 0x7fffff
and a5, a5, t6 # r & 0x7fffff
andi t0, t0, 0xff # expout & 0xff
slli t0, t0, 23 # expout = expout << 23
or a5, a5, t0 # out= ((expout & 0xff) << 23) | (sigout)
jr ra # back to main
#///////////////////addition
#a3= a
#a4= b
#a5= a+b
#///////////////
#t0 = expa / expout;
#t1 = expb / sigout
#t2 = dif
#t3 = siga
#t4 = sigb
addition:
srli t0, a3, 23 # ia >> 23
andi t0, t0, 0xff # expa = (ia >> 23) & 0xff
srli t1, a4, 23 # ib >> 23
andi t1, t1, 0xff # expb = (ib >> 23) & 0xff
sub t2, t0, t1 # dif = expa - expb
li t6, 0x7fffff # load mask 0x7fffff
and t3, a3, t6 # ia & 0x7fffff
and t4, a4, t6 # ib & 0x7fffff
li t6, 0x800000 # load mask 0x800000
or t3, t3, t6 # siga = ((ia & 0x7fffff) | 0x800000)
or t4, t4, t6 # sigb = ((ib & 0x7fffff) | 0x800000)
blt t2, x0, out_b # if dif < 0 , expout = expb, else expout=expa
srl t4, t4, t2 # sigb = sigb >> dif
beq x0, x0, plus # go to plus
out_b:
sub t2, x0, t2 # dif= - dif change dif to positive
srl t3, t3, t2 # siga = siga >> dif
mv t0, t1 # expout = expb
plus:
add t1, t3, t4 # sigout= siga + sigb
srli t2, t1, 24 # sigout >> 24
addi t3, x0, 1 # t3 = 1
bne t2, t3, add_out # if (sigout >> 24) != 1 end addition
srli t1, t1, 1 # sigout = sigout >> 1
addi t0, t0, 1 # expout ++
add_out:
andi t0, t0, 0xff # expout & 0xff
slli t0, t0, 23 # (expout & 0xff) << 23
li t6, 0x7fffff # load mask 0x7fffff
and t1, t1, t6 # sigout& 0x7fffff
or a5, t0, t1 # out = ((expout & 0xff) <<23) | ((sigout) & 0x7fffff)
jr ra
```
## Analysis
I test my code using [Ripes](https://github.com/mortbopet/Ripes) simulator.
### Pseudo instruction
The translated code looks like:
```
00000000 <main>:
0: 10000417 auipc x8 0x10000
4: 00042403 lw x8 0 x8
8: 00040593 addi x11 x8 0
c: 150000ef jal x1 336 <utf>
10: 00058493 addi x9 x11 0
14: 0c0000ef jal x1 192 <clz>
18: 02000293 addi x5 x0 32
1c: 40a282b3 sub x5 x5 x10
20: 0012d293 srli x5 x5 1
24: 005455b3 srl x11 x8 x5
28: 134000ef jal x1 308 <utf>
2c: 00000513 addi x10 x0 0
30: 00500613 addi x12 x0 5
00000034 <newton>:
34: 04c55063 bge x10 x12 64 <end>
38: 00048693 addi x13 x9 0
3c: 00058713 addi x14 x11 0
40: 17c000ef jal x1 380 <division>
44: 00078813 addi x16 x15 0
48: 00080693 addi x13 x16 0
4c: 00058713 addi x14 x11 0
50: 22c000ef jal x1 556 <addition>
54: 00100293 addi x5 x0 1
58: 01729293 slli x5 x5 23
5c: 405787b3 sub x15 x15 x5
60: 00078593 addi x11 x15 0
64: 00b90863 beq x18 x11 16 <end>
68: 00058913 addi x18 x11 0
6c: 00150513 addi x10 x10 1
70: fc0002e3 beq x0 x0 -60 <newton>
00000074 <end>:
74: 00150293 addi x5 x10 1
78: 10000517 auipc x10 0x10000
7c: f8c50513 addi x10 x10 -116
80: 00400893 addi x17 x0 4
84: 00000073 ecall
88: 00028513 addi x10 x5 0
8c: 00100893 addi x17 x0 1
90: 00000073 ecall
94: 10000517 auipc x10 0x10000
98: f7750513 addi x10 x10 -137
9c: 00400893 addi x17 x0 4
a0: 00000073 ecall
a4: 00040513 addi x10 x8 0
a8: 00100893 addi x17 x0 1
ac: 00000073 ecall
b0: 10000517 auipc x10 0x10000
b4: f7c50513 addi x10 x10 -132
b8: 00400893 addi x17 x0 4
bc: 00000073 ecall
c0: 00078513 addi x10 x15 0
c4: 00200893 addi x17 x0 2
c8: 00000073 ecall
cc: 00a00893 addi x17 x0 10
d0: 00000073 ecall
000000d4 <clz>:
d4: 00040293 addi x5 x8 0
d8: 00028313 addi x6 x5 0
dc: 0012d293 srli x5 x5 1
e0: 0062e2b3 or x5 x5 x6
e4: 0022d313 srli x6 x5 2
e8: 0062e2b3 or x5 x5 x6
ec: 0042d313 srli x6 x5 4
f0: 0062e2b3 or x5 x5 x6
f4: 0082d313 srli x6 x5 8
f8: 0062e2b3 or x5 x5 x6
fc: 0012d313 srli x6 x5 1
100: 55555fb7 lui x31 0x55555
104: 555f8f93 addi x31 x31 1365
108: 01f37333 and x6 x6 x31
10c: 406282b3 sub x5 x5 x6
110: 0022d313 srli x6 x5 2
114: 33333fb7 lui x31 0x33333
118: 333f8f93 addi x31 x31 819
11c: 01f37333 and x6 x6 x31
120: 01f2f2b3 and x5 x5 x31
124: 006282b3 add x5 x5 x6
128: 0042d313 srli x6 x5 4
12c: 006282b3 add x5 x5 x6
130: 0f0f1fb7 lui x31 0xf0f1
134: f0ff8f93 addi x31 x31 -241
138: 01f2f2b3 and x5 x5 x31
13c: 0082d313 srli x6 x5 8
140: 005302b3 add x5 x6 x5
144: 0102d313 srli x6 x5 16
148: 005302b3 add x5 x6 x5
14c: 07f2f293 andi x5 x5 127
150: 02000313 addi x6 x0 32
154: 40530533 sub x10 x6 x5
158: 00008067 jalr x0 x1 0
0000015c <utf>:
15c: ff410113 addi x2 x2 -12
160: 00112423 sw x1 8 x2
164: 00812223 sw x8 4 x2
168: 00a12023 sw x10 0 x2
16c: 00058413 addi x8 x11 0
170: f65ff0ef jal x1 -156 <clz>
174: 00058293 addi x5 x11 0
178: 09e00313 addi x6 x0 158
17c: 40a30333 sub x6 x6 x10
180: 00100e13 addi x28 x0 1
184: 017e1e13 slli x28 x28 23
00000188 <Loop>:
188: 01c2f3b3 and x7 x5 x28
18c: 00039663 bne x7 x0 12 <outloop>
190: 00129293 slli x5 x5 1
194: fe000ae3 beq x0 x0 -12 <Loop>
00000198 <outloop>:
198: 01731313 slli x6 x6 23
19c: 00800fb7 lui x31 0x800
1a0: ffff8f93 addi x31 x31 -1
1a4: 01f2f2b3 and x5 x5 x31
1a8: 005365b3 or x11 x6 x5
1ac: 00812083 lw x1 8 x2
1b0: 00412403 lw x8 4 x2
1b4: 00012503 lw x10 0 x2
1b8: 00008067 jalr x0 x1 0
000001bc <division>:
1bc: 0176d293 srli x5 x13 23
1c0: 0ff2f293 andi x5 x5 255
1c4: 01775313 srli x6 x14 23
1c8: 0ff37313 andi x6 x6 255
1cc: 406282b3 sub x5 x5 x6
1d0: 07f28293 addi x5 x5 127
1d4: 00800fb7 lui x31 0x800
1d8: ffff8f93 addi x31 x31 -1
1dc: 01f6f333 and x6 x13 x31
1e0: 01f773b3 and x7 x14 x31
1e4: 00800fb7 lui x31 0x800
1e8: 01f36333 or x6 x6 x31
1ec: 01f3e3b3 or x7 x7 x31
000001f0 <div1>:
1f0: 00735a63 bge x6 x7 20 <div2>
1f4: 00131313 slli x6 x6 1
1f8: fff28293 addi x5 x5 -1
1fc: 00000793 addi x15 x0 0
200: fe0008e3 beq x0 x0 -16 <div1>
00000204 <div2>:
204: 00000e13 addi x28 x0 0
208: 01900e93 addi x29 x0 25
0000020c <shift_sub>:
20c: 03de5063 bge x28 x29 32 <div3>
210: 00179793 slli x15 x15 1
214: 00736663 bltu x6 x7 12 <next>
218: 40730333 sub x6 x6 x7
21c: 0017e793 ori x15 x15 1
00000220 <next>:
220: 00131313 slli x6 x6 1
224: 001e0e13 addi x28 x28 1
228: fe0002e3 beq x0 x0 -28 <shift_sub>
0000022c <div3>:
22c: 00000e13 addi x28 x0 0
230: 00031463 bne x6 x0 8 <neq>
234: 00000463 beq x0 x0 8 <round>
00000238 <neq>:
238: 00100e13 addi x28 x0 1
0000023c <round>:
23c: 0017fe93 andi x29 x15 1
240: 0027ff13 andi x30 x15 2
244: 000f1463 bne x30 x0 8 <odd>
248: 00000463 beq x0 x0 8 <out>
0000024c <odd>:
24c: 00100f13 addi x30 x0 1
00000250 <out>:
250: 01ee6e33 or x28 x28 x30
254: 01de7e33 and x28 x28 x29
258: 0017d793 srli x15 x15 1
25c: 01c7e7b3 or x15 x15 x28
260: 00800fb7 lui x31 0x800
264: ffff8f93 addi x31 x31 -1
268: 01f7f7b3 and x15 x15 x31
26c: 0ff2f293 andi x5 x5 255
270: 01729293 slli x5 x5 23
274: 0057e7b3 or x15 x15 x5
278: 00008067 jalr x0 x1 0
0000027c <addition>:
27c: 0176d293 srli x5 x13 23
280: 0ff2f293 andi x5 x5 255
284: 01775313 srli x6 x14 23
288: 0ff37313 andi x6 x6 255
28c: 406283b3 sub x7 x5 x6
290: 00800fb7 lui x31 0x800
294: ffff8f93 addi x31 x31 -1
298: 01f6fe33 and x28 x13 x31
29c: 01f77eb3 and x29 x14 x31
2a0: 00800fb7 lui x31 0x800
2a4: 01fe6e33 or x28 x28 x31
2a8: 01feeeb3 or x29 x29 x31
2ac: 0003c663 blt x7 x0 12 <out_b>
2b0: 007edeb3 srl x29 x29 x7
2b4: 00000863 beq x0 x0 16 <plus>
000002b8 <out_b>:
2b8: 407003b3 sub x7 x0 x7
2bc: 007e5e33 srl x28 x28 x7
2c0: 00030293 addi x5 x6 0
000002c4 <plus>:
2c4: 01de0333 add x6 x28 x29
2c8: 01835393 srli x7 x6 24
2cc: 00100e13 addi x28 x0 1
2d0: 01c39663 bne x7 x28 12 <add_out>
2d4: 00135313 srli x6 x6 1
2d8: 00128293 addi x5 x5 1
000002dc <add_out>:
2dc: 0ff2f293 andi x5 x5 255
2e0: 01729293 slli x5 x5 23
2e4: 00800fb7 lui x31 0x800
2e8: ffff8f93 addi x31 x31 -1
2ec: 01f37333 and x6 x6 x31
2f0: 0062e7b3 or x15 x5 x6
2f4: 00008067 jalr x0 x1 0
```
### 5-stage pipelined processor
We now have the option to select a processor for executing this code. Ripes offers four processor options to choose from, which include the **single-cycle processor**, **5-stage processor**, **5-stage processor with hazard detection**, and **5-stage processor with forward and hazard detection**. In this case, we will opt for the 5-stage processor. The following is the diagram:

The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are:
1. **IF** (Instruction Fetch):
* Fetch the instruction from memory.
* Read the program counter (PC), which points to the memory address of the next instruction.
* Load the instruction into the instruction fetch buffer.
2. **ID** (Instruction Decode):
* Decode the fetched instruction.
* Determine the operation to be performed.
* Identify source and destination registers.
* Calculate any immediate values required.
* Generate control signals based on the instruction's opcode.
3. **EX** (Execution):
* Perform the actual execution of the instruction.
* Execute arithmetic and logic operations.
* Calculate memory addresses.
* Resolve branch instructions.
* Update the program counter if needed.
4. **MEM** (Memory Access):
* Access memory as required.
* Handle load and store instructions.
* Load data from memory (for load instructions).
* Store data to memory (for store instructions).
* Pass the result of a load instruction to the next stage.
5. **WB** (Write-Back):
* Write back the results of the executed instruction to the register file.
* Update register values if the instruction modified any registers.
* Store the updated value in the appropriate register file location.
In Ripes, we can see that each stage are handling the different code

<style>
.text-center{
text-align: center; //文字置中
}
.text-left{
text-align: left; //文字靠左
}
.text-right{
text-align: right; //文字靠右
}
</style>
<p class="text-center">
excutable code
</p>

</style>
<p class="text-center">
Pipeline
</p>
Let's take `addi x8, x11, 0` for example
### **IF - Instruction Fetch**

In this stage, we can see now the program counter is `0x00000150`, and load the instruction `0x00058413`. Also, the next program counter `0x00000154` are showed on the left side.
### **ID - Instruction Decode**

Next, the instruction is decoded to `addi` operation. Get the value `0x000001f4` from register **x11**, and immediate `0x00000000`, then passing to next stage.
### **EX - Execution**

Here, the ALU performs the `add` instruction ,add `0x000001f4` and `0x00000000`, output the result to next stage.
### **MEM - Memory Access**

The result `0x000001f4` is loaded, and the destination register is `0x000001f4`, which is **x8**.
Since we are using `addi` operation, there is no need to load or store to memory.
### **WB - Write Bcak**

Last stage, the result data `0x000001f4` is written to the resigter **x8** and is the end of the instruction `addi x8, x11, 0`.
After all these stage are done, the register is updated like this:

## Result
The test numbers are **500** ,**1160030**, **36**

## Performance
Performing the square root of 500

## Improvement
##### 1. `uint_to float`
In `uint_to_float`, we cam allign the integer to the significand using the clz instead of using a while loop.
```clike=
float uint_to_float(uint32_t u){
int c = count_leading_zeros(u);
unsigned int exp=127+31-c;
u <<= (c-8);
uint32_t flt= (exp << 23) | (u & 0x7FFFFF);
return * (float *) &flt;
}
```

Reduce the cycle from `2172` to `1960`
##### 2. `division`
In `division`, I found that the square root result with and without rounded in division is almost the same, so I take the `rounded` part away from `division`.
```clike=
float division(float a,float b){
//change float into int
int32_t ia = *(int32_t *)& a;
int32_t ib = *(int32_t *)& b;
//get the exponential
int32_t expa = (ia >> 23) & 0xff ;
int32_t expb = (ib >> 23) & 0xff;
// parameter used in rounded output
uint8_t odd, rnd, sticky;
//get the significand
uint32_t siga= ((ia & 0x7fffff) | 0x800000);
uint32_t sigb= ((ib & 0x7fffff) | 0x800000);
//exponential output
int32_t expout= expa - expb +127;
//allign two numbers' significand
if(siga < sigb){
siga = siga << 1;
expout--;
}
// r means result
uint32_t r = 0;
// star division
for(int i= 0 ; i < 25 ; i++){
r= r << 1;
if(siga >= sigb){
siga = siga - sigb;
r = r | 1;
}
siga = siga << 1 ;
}
r=(r >> 1);
int32_t sigout=r & 0x7fffff;
int32_t out= ((expout & 0xff) << 23) | (sigout);
return *(float *) &out;
}
```

Reduce the cycle from `1960` to `1890`
## Future Work
* Find more way to improve the code
## Reference
* [Power of CLZ instruction in numerical computations](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8290400&isnumber=8290323)
* [Float division](https://cs.stackexchange.com/questions/80668/simple-algorithm-for-ieee-754-division-on-8-bit-cpu)
* [RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf)
* [RISC-V Green Card](https://www.cl.cam.ac.uk/teaching/1617/ECAD+Arch/files/docs/RISCVGreenCardv8-20151013.pdf)