# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`Yao1201`](https://github.com/Yao1201) >
## Convolution by using bfloat16
### 1.Convolution
Algebraically, the discrete convolution of two sequences $x[n]$ and $h[n]$ is defined as:
$$y[n]=x[n]*h[n]=\sum_{k=-\infty}^{\infty} x[k]h[n-k]$$
It is also a crucial concept in digital signal processing: the output $y[n]$ of an LTI system is the convolution of the input signal $x[n]$ with the system's impulse response $h[n]$.
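As a worked example, assume both sequences have three samples indexed 0 to 2 (matching the 3-element arrays used in the implementation) and are zero elsewhere. The infinite sum then reduces to $3+3-1=5$ output samples:

$$
\begin{aligned}
y[0] &= x[0]h[0] \\
y[1] &= x[0]h[1] + x[1]h[0] \\
y[2] &= x[0]h[2] + x[1]h[1] + x[2]h[0] \\
y[3] &= x[1]h[2] + x[2]h[1] \\
y[4] &= x[2]h[2]
\end{aligned}
$$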
### 2.bfloat16
The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory. It is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format. It retains all 8 exponent bits, preserving the dynamic range, but keeps only 7 fraction bits, which reduces its precision.
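The bit layout described above can be sketched in C. The type `bf16_fields` and the helpers `bf16_unpack` and `fp32_bits_truncate_to_bf16` are hypothetical names introduced only for illustration; the sketch assumes the value is held as a raw 16-bit pattern.

```c
#include <stdint.h>

/* Fields of a bfloat16 value (hypothetical helper type for illustration). */
typedef struct {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 8 bits, biased by 127 as in FP32 */
    unsigned fraction;  /* 7 bits */
} bf16_fields;

/* Split a raw bfloat16 bit pattern into sign, exponent, and fraction. */
bf16_fields bf16_unpack(uint16_t bits)
{
    bf16_fields f;
    f.sign     = (bits >> 15) & 0x1;
    f.exponent = (bits >> 7) & 0xFF;
    f.fraction = bits & 0x7F;
    return f;
}

/* bfloat16 is the upper half of the FP32 encoding, so plain truncation
   (no rounding) is a 16-bit right shift of the FP32 bit pattern. */
uint16_t fp32_bits_truncate_to_bf16(uint32_t fp32_bits)
{
    return (uint16_t)(fp32_bits >> 16);
}
```

For example, the FP32 pattern `0x40140000` (2.3125) truncates to the bfloat16 pattern `0x4014`. The code in this note keeps bf16 values in the upper 16 bits of a 32-bit word instead of a separate 16-bit type.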

## Implementation
You can find the source code [here](https://github.com/Yao1201/Convolution-by-using-bfloat16).
### C code
```clike=
#include <stdio.h>
float fp32_to_bf16(float x);
int main(){
float a[3]={1.2,1.203125,2.31}; // test data
float b[3]={2.3125,3.46,3.453125}; // from quiz1 (problem 2)
float c[5];
for(int k=0;k<3;k++){
a[k]=fp32_to_bf16(a[k]);
b[k]=fp32_to_bf16(b[k]);
}
for (int i = 0; i < 5; i++) {
c[i] = 0.0;
for (int j = 0; j < 3; j++) {
if ((i - j) >= 0 && (i - j) < 3) {
// Main calculation
c[i] = c[i] + a[j] * b[i - j];
}
}
printf("%f\t", c[i] );
}
}
float fp32_to_bf16(float x)
{
float y = x;
int *p = (int *) &y;
unsigned int exp = *p & 0x7F800000;
unsigned int man = *p & 0x007FFFFF;
if (exp == 0 && man == 0) /* zero */
return x;
if (exp == 0x7F800000) /* infinity or NaN */
return x;
/* Normalized number */
/* round to nearest */
float r = x;
int *pr = (int *) &r;
*pr &= 0xFF800000; /* r has the same exp as x */
r /= 0x100;
y = x + r;
*p &= 0xFFFF0000;
return y;
}
```
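The rounding step in `fp32_to_bf16` adds `r / 0x100`, which is $2^{e-8}$ for an input with exponent $e$. In units of the 23-bit FP32 mantissa this is `0x8000`, so for normalized inputs the same result can be obtained with integer operations only. The function below is a minimal sketch under that assumption; `fp32_bits_to_bf16_bits` is a hypothetical name, and unlike the float version it also rounds subnormal inputs.

```c
#include <stdint.h>

/* Integer-only sketch of the rounding done by fp32_to_bf16().
   Input and output are raw FP32 bit patterns; the bf16 result
   occupies the upper 16 bits and the lower 16 bits are cleared. */
uint32_t fp32_bits_to_bf16_bits(uint32_t bits)
{
    uint32_t exp = bits & 0x7F800000u;
    uint32_t man = bits & 0x007FFFFFu;

    if (exp == 0 && man == 0)      /* zero: returned unchanged */
        return bits;
    if (exp == 0x7F800000u)        /* infinity or NaN: returned unchanged */
        return bits;

    /* Add half of the bf16 unit in the last place, then drop the low bits.
       A carry out of the mantissa correctly increments the exponent. */
    return (bits + 0x8000u) & 0xFFFF0000u;
}
```

With the first test vector, `0x3f99999a` (1.2) becomes `0x3f9a0000` and `0x4013d70a` (2.31) becomes `0x40140000`.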
### Assembly code
In this example, I need to implement bfloat16 addition (`fadd`) and multiplication (`bf16mul`).
1. **Store arrays to memory**
Input the test data.
1. **Transfer fp32 to bf16**
Convert the input data to the bf16 format.
1. **Loops**
Indices i and j determine which array elements take part in the calculation.
1. **Multiplication & Addition**
Because the F extension is prohibited, floating-point addition and multiplication must be written as functions.
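The software addition in the last step can be modelled in C before writing it in assembly. This is a simplified sketch, not a full IEEE 754 adder: `bf16_add_sketch` is a hypothetical name, both inputs are assumed to be non-negative and normalized (or zero), and the result is truncated instead of rounded.

```c
#include <stdint.h>

/* Add two bfloat16 values kept in the upper 16 bits of 32-bit words.
   Assumes non-negative, normalized (or zero) inputs; truncates the result. */
uint32_t bf16_add_sketch(uint32_t a, uint32_t b)
{
    if (a == 0) return b;
    if (b == 0) return a;

    uint32_t exp_a = (a >> 23) & 0xFF;
    uint32_t exp_b = (b >> 23) & 0xFF;
    /* 7 fraction bits plus the implicit leading 1 give an 8-bit significand. */
    uint32_t sig_a = ((a >> 16) & 0x7F) | 0x80;
    uint32_t sig_b = ((b >> 16) & 0x7F) | 0x80;

    /* Align the operand with the smaller exponent to the larger one. */
    uint32_t exp;
    if (exp_a >= exp_b) {
        uint32_t shift = exp_a - exp_b;
        sig_b = (shift < 8) ? (sig_b >> shift) : 0;
        exp = exp_a;
    } else {
        uint32_t shift = exp_b - exp_a;
        sig_a = (shift < 8) ? (sig_a >> shift) : 0;
        exp = exp_b;
    }

    /* Add significands; renormalize if the sum carries into bit 8. */
    uint32_t sum = sig_a + sig_b;
    if (sum & 0x100) {
        sum >>= 1;
        exp += 1;
    }

    return (exp << 23) | ((sum & 0x7F) << 16);
}
```

For instance, adding `0x3FC00000` (1.5) and `0x40000000` (2.0) yields `0x40600000` (3.5).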
```clike=
.data
#test 1 (from problem2)
arr1: .word 0x3f99999a, 0x3f9a0000, 0x4013d70a #1.2 1.203125 2.31
arr2: .word 0x40140000, 0x405d70a4, 0x405d0000 #2.3125 3.46 3.453125
#test 2
#arr1: .word 0x3f6b851f, 0x417ccccd, 0x3df5c28f #0.92,15.8,0.12
#arr2: .word 0x3f451eb8, 0x40a7ae14, 0x41200000 #0.77,5.24,10
#test 3
#arr1: .word 0x404ccccd, 0x3f4ccccd, 0x4019999a #3.2,0.8,2.4
#arr2: .word 0x40f00000, 0x3fbae148, 0xc0066666 #7.5,1.46,1.2
arr3: .word 0, 0, 0
len: .word 3 #array length=3
str1: .string "y[n]=["
str2: .string " end program"
str3: .string "]"
space: .string " "
#a1,a2 --> x[],h[]
#a3 --bf16-->[ x h ]
#s3 ---> result
.text
#-----------------------------------------------------------#
main:
la s1, arr1 #s1=x[]
la s2, arr2 #s2=h[]
la s3, arr3 #s3=y[]
lw s4, len
add s5, x0, x0 #s5=0 transfer counter
jal x_fp32tobf16 #transfer to bf16
jal h_fp32tobf16
jal convolution #convolution and print
end_program:
la a0, str3
li a7, 4
ecall
la a0, str2
li a7, 4
ecall
li a7,10
ecall
#---------------------------fp32tobf16-----------------------------------#
x_fp32tobf16:
lw a1,0(s1)
addi s1, s1, 4 #x[] next element
li t0, 0x7F800000 #t0=0x7F800000
and s6, a1, t0 #s6=exp (*p&0x7F800000)
li t1, 0x007FFFFF #t1=0x007FFFFF
and s7, a1, t1 #s7=man (*p&0x007FFFFF)
or t1, s6, s7
beqz t1,eqzero
beq s6, t0, infinity_NaN
li t0,0xFF800000
and t1, t0, a1 #keep sign+exp
srli t0, s7, 8 #man >>8
li t2, 0x8000
or t0, t0, t2
add t0, t0, s7
or t0, t0, t1 #assemble sign + exp + man
li t1,0xFFFF0000 #truncate redundant bits
and t0, t0, t1
sw t0, 0(a3) #store bf16 in a3[]=
addi a3, a3, 4
addi s5, s5, 1 #counter++
blt s5, s4, x_fp32tobf16 #counter< 3 loop
addi s5, x0, 0
ret
#------------------------------------------------------------------------#
h_fp32tobf16:
lw a2,0(s2)
addi s2, s2, 4 #h[] next element
li t0, 0x7F800000 #t0=0x7F800000
and s6, a2, t0 #s6=exp (*p&0x7F800000)
li t1, 0x007FFFFF #t1=0x007FFFFF
and s7, a2, t1 #s7=man (*p&0x007FFFFF)
or t1, s6, s7
beqz t1,eqzero
beq s6, t0, infinity_NaN
li t0,0xFF800000
and t1, t0, a2 #keep sign+exp
srli t0, s7, 8 #man >>8
li t2, 0x8000
or t0, t0, t2
add t0, t0, s7
or t0, t0, t1 #assemble sign + exp + man
li t1,0xFFFF0000 #truncate redundant bits
and t0, t0, t1
sw t0, 0(a3) #store bf16 in a3[]
addi a3, a3, 4
addi s5, s5, 1 #counter++
blt s5, s4, h_fp32tobf16 #counter< 3 loop
addi a3, a3, -24 #correction a3
ret
eqzero:
jal return_val
infinity_NaN:
jal return_val
return_val:
li a7,2
ecall
end:
li a7,10
ecall
#---------------------------convolution-------------------------------#
convolution:
addi a5, x0, 0 #i=0 counter
loop_i:
add s11, x0, x0
addi t6, x0, 5 #2*len-1=5
addi a6, x0, 0 #j=0 counter
loop_j:
blt a5, a6, end_j #
addi t3, a6, 3 #j+3
bge a5, t3, end_j
#---bf16 mul----#
sub t1, a5, a6 #i-j
add t1, t1, t1
add t1, t1, t1 #4*(i-j)
add s6, a3, t1
lw a2, 12(s6) #h[]
add t1, a6, a6
add t1, t1, t1 #4*j
add s5, a3, t1
lw a1,0(s5) #x[]
jal t6, bf16mul ## t1 = mul_result
add a1, t1, x0
add a2, x0, x0
jal t6,fadd #t0 =fadd_result
add a1, t0, x0
add a2, s11, x0
jal t6,fadd
add s11, x0, t0 #y=y+x*h
end_j:
addi a6,a6, 1 #j++
blt a6, s4, loop_j #if j < 3 -> loop_j
sw s11, 0(s3)
addi s3, s3, 4
mv a0,s11 #print result
li a7,2
ecall
la a0, space
li a7, 4
ecall
addi a5, a5, 1 #i++
addi t6, x0, 5 #2*len-1=5
blt a5, t6, loop_i #i<5 --> loop_i
jal end_program
#----------------------bf16 addition------------------------#
fadd:
# fadd(a1 , a2)
bnez a2, fadd!=z #anyone =0 --> mv
mv t0, a1
jalr t6
fadd!=z:
srli t0, a1, 23 #a1=val_1
andi t0, t0, 0xFF #t0=exp_1 assume result exp
srli t1, a2, 23 #a2=val_2
andi t1, t1, 0xFF #t1=exp_2
srli t2, a1, 16
andi t2, t2, 0x7F #man_1
srli t3, a2, 16
andi t3, t3, 0x7F #man_2
li t4, 0x80 #mask of significand
or t2, t2, t4 #t2=sig_1
or t3, t3, t4 #t3=sig_2
sub t4, t0, t1 #diff between exp_1,2
blt t4, x0, swap #if exp_1<exp_2 change result exp
srl t3, t3, t4 #sig_2 >> diff
jal fadd_1
swap:
add t0, t1, x0 #change exp_2 to result exp
sub t4, x0, t4 #diff=-diff
srl t2, t2, t4 #sig_1 >> diff
fadd_1:
add t4, t2, t3 #t4=sig_1+sig_2 result sig
srli t5, t4, 8 #check
addi t1, x0, 1
bne t5, t1, fadd_result
srli t4, t4,1
addi t0, t0, 1 #t0=result_exp=exp+1
fadd_result:
slli t0, t0, 23 #result exp<<23
andi t4, t4, 0x7F #result man
slli t4, t4, 16 #man <<16
or t0, t0, t4 #result
jalr t6
#--------------------bf16 multiplication------------------------#
bf16mul:
##bf16mul(a1, a2)
srli s5, a1, 31 #s5= sign_1
srli s6, a2, 31 #s6= sign_2
li t4, 0x80 #significand mask
srli s7, a1, 16
andi s7, s7, 0x7F
srli s8, a2, 16
andi s8, s8, 0x7F
or s7, s7, t4 #s7= sig_1 7+1bits
or s8, s8, t4 #s8= sig_2
srli s9, a1, 23
srli s10, a2, 23
andi s9, s9, 0xFF #s9= exp_1 8bits
andi s10, s10, 0xFF #s10=exp_2
add t4, x0, x0 #imul result
add t0, x0, x0 #i counter
imul:
addi t1, x0, 8 #t1=8
bgt t0, t1, fmul_1
getbit:
srl t2, s8, t0
andi t2, t2, 1
addi t0, t0, 1 #i++
beqz t2, imul
addi t0,t0 ,-1
sll t3, s7, t0 #t3=sig_1<<i
add t4, t3, t4 #r+=a64<<i
addi t0, t0, 1 #i++
jal imul
fmul_1:
srli t4 ,t4, 7 #imul32>>23 ##
srli t5, t4, 8 #getbit(t4,24)
andi t5, t5, 1 #sig shift mshift
srl t4, t4, t5 #t4= result sig
add t0,s9, s10 #ea+eb
addi t0, t0, -127 #-127 ertmp er
bnez t5, inc
jal fmul_2 #er=ertmp
inc:
#mask lowest zero
#--mask &= (mask << 1) | 0x1--#
slli t2, t0, 1 #ori mask=t0
ori t2, t2, 0x1
and t1, t0, t2
#--mask &= (mask << 2) | 0x3--#
slli t2, t1, 2
ori t2, t2, 0x3
and t1, t1, t2
#--mask &= (mask << 4) | 0xF--#
slli t2, t1, 4
ori t2, t2, 0xF
and t1, t1, t2
#--mask &= (mask << 8) | 0xFF--#
slli t2, t1, 8
ori t2, t2, 0xFF
and t1, t1, t2
#--mask &= (mask << 16) | 0xFFFF--#
slli t2, t1, 16
li t3, 0xFFFF
or t2, t2, t3
and t1, t1, t2
#--mask &= (mask << 32) | 0xFFFFFFFF--#
li t3, 0x20
sll t2, t1, t3
li t3, 0xFFFFFFFF
or t2, t2, t3
and t1, t1, t2 #return mask = t1
#--z1 = mask ^ ((mask << 1) | 1)--#
slli t2, t1, 1
ori t2, t2, 1
xor t2, t1, t2 #z1
#--return (x & ~mask) | z1--#
xor t1, t1, t3 #~mask
and t1, t0, t1 #x&~mask
or t0, t1, t2 #inc return = t0
fmul_2:
#--result = (sr << 31) | ((er & 0xFF) << 23) | (mr & 0x7FFFFF)--#
xor t1, s5, s6 #sign result
slli t1, t1, 31 #sign<<31
andi t0, t0, 0xFF
slli t0, t0,23 #exp result
andi t4, t4, 0x7F #man result
slli t4, t4, 16 #man <<16
or t1, t1, t0
or t1, t1, t4 #t1= result
jalr t6
```
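The structure of `bf16mul` (sign XOR, exponent addition with the bias removed, and a shift-and-add product of the significands) can be cross-checked against a small C model. This is a sketch under stated assumptions, not a translation of every instruction: `bf16_mul_sketch` is a hypothetical name, inputs are assumed finite and normalized (or zero), the product is truncated, and exponent overflow or underflow is not handled.

```c
#include <stdint.h>

/* Multiply two bfloat16 values kept in the upper 16 bits of 32-bit words.
   Assumes finite, normalized (or zero) inputs; truncates the result. */
uint32_t bf16_mul_sketch(uint32_t a, uint32_t b)
{
    uint32_t sign = (a ^ b) & 0x80000000u;

    /* A zero operand gives a (signed) zero result. */
    if ((a & 0x7FFFFFFFu) == 0 || (b & 0x7FFFFFFFu) == 0)
        return sign;

    uint32_t exp_a = (a >> 23) & 0xFF;
    uint32_t exp_b = (b >> 23) & 0xFF;
    uint32_t sig_a = ((a >> 16) & 0x7F) | 0x80;   /* 8-bit significand */
    uint32_t sig_b = ((b >> 16) & 0x7F) | 0x80;

    /* Shift-and-add integer multiply, as needed without the M extension. */
    uint32_t prod = 0;
    for (int i = 0; i < 8; i++) {
        if ((sig_b >> i) & 1)
            prod += sig_a << i;
    }

    /* The 16-bit product lies in [2^14, 2^16); normalize it back to 8 bits. */
    uint32_t exp = exp_a + exp_b - 127;
    if (prod & 0x8000) {
        prod >>= 8;
        exp += 1;
    } else {
        prod >>= 7;
    }

    return sign | ((exp & 0xFF) << 23) | ((prod & 0x7F) << 16);
}
```

For example, `0x3FC00000` (1.5) times `0x3FC00000` (1.5) gives `0x40100000` (2.25), which exercises the normalization branch.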
### Analysis
I test my code using the [Ripes](https://github.com/mortbopet/Ripes) simulator.
#### Pseudo instruction
```clike=
00000000 <main>:
0: 10000497 auipc x9 0x10000
4: 00048493 addi x9 x9 0
8: 10000917 auipc x18 0x10000
c: 00490913 addi x18 x18 4
10: 10000997 auipc x19 0x10000
14: 00898993 addi x19 x19 8
18: 10000a17 auipc x20 0x10000
1c: 00ca2a03 lw x20 12 x20
20: 00000ab3 add x21 x0 x0
24: 024000ef jal x1 36 <x_fp32tobf16>
28: 084000ef jal x1 132 <y_fp32tobf16>
2c: 0fc000ef jal x1 252 <convolution>
00000030 <end_program>:
30: 10000517 auipc x10 0x10000
34: ffb50513 addi x10 x10 -5
38: 00400893 addi x17 x0 4
3c: 00000073 ecall
40: 00a00893 addi x17 x0 10
44: 00000073 ecall
00000048 <x_fp32tobf16>:
48: 0004a583 lw x11 0 x9
4c: 00448493 addi x9 x9 4
50: 7f8002b7 lui x5 0x7f800
54: 0055fb33 and x22 x11 x5
58: 00800337 lui x6 0x800
5c: fff30313 addi x6 x6 -1
60: 0065fbb3 and x23 x11 x6
64: 017b6333 or x6 x22 x23
68: 0a030463 beq x6 x0 168 <eqzero>
6c: 0a5b0463 beq x22 x5 168 <infinity_NaN>
70: ff8002b7 lui x5 0xff800
74: 00b2f333 and x6 x5 x11
78: 008bd293 srli x5 x23 8
7c: 000083b7 lui x7 0x8
80: 0072e2b3 or x5 x5 x7
84: 017282b3 add x5 x5 x23
88: 0062e2b3 or x5 x5 x6
8c: ffff0337 lui x6 0xffff0
90: 0062f2b3 and x5 x5 x6
94: 0056a023 sw x5 0 x13
98: 00468693 addi x13 x13 4
9c: 001a8a93 addi x21 x21 1
a0: fb4ac4e3 blt x21 x20 -88 <x_fp32tobf16>
a4: 00000a93 addi x21 x0 0
a8: 00008067 jalr x0 x1 0
000000ac <y_fp32tobf16>:
ac: 00092603 lw x12 0 x18
b0: 00490913 addi x18 x18 4
b4: 7f8002b7 lui x5 0x7f800
b8: 00567b33 and x22 x12 x5
bc: 00800337 lui x6 0x800
c0: fff30313 addi x6 x6 -1
c4: 00667bb3 and x23 x12 x6
c8: 017b6333 or x6 x22 x23
cc: 04030263 beq x6 x0 68 <eqzero>
d0: 045b0263 beq x22 x5 68 <infinity_NaN>
d4: ff8002b7 lui x5 0xff800
d8: 00c2f333 and x6 x5 x12
dc: 008bd293 srli x5 x23 8
e0: 000083b7 lui x7 0x8
e4: 0072e2b3 or x5 x5 x7
e8: 017282b3 add x5 x5 x23
ec: 0062e2b3 or x5 x5 x6
f0: ffff0337 lui x6 0xffff0
f4: 0062f2b3 and x5 x5 x6
f8: 0056a023 sw x5 0 x13
fc: 00468693 addi x13 x13 4
100: 001a8a93 addi x21 x21 1
104: fb4ac4e3 blt x21 x20 -88 <y_fp32tobf16>
108: fe868693 addi x13 x13 -24
10c: 00008067 jalr x0 x1 0
00000110 <eqzero>:
110: 008000ef jal x1 8 <return_val>
00000114 <infinity_NaN>:
114: 004000ef jal x1 4 <return_val>
00000118 <return_val>:
118: 00200893 addi x17 x0 2
11c: 00000073 ecall
00000120 <end>:
120: 00a00893 addi x17 x0 10
124: 00000073 ecall
00000128 <convolution>:
128: 00000793 addi x15 x0 0
0000012c <loop_i>:
12c: 00000db3 add x27 x0 x0
130: 00500f93 addi x31 x0 5
134: 00000813 addi x16 x0 0
00000138 <loop_j>:
138: 0507c863 blt x15 x16 80 <end_j>
13c: 00380e13 addi x28 x16 3
140: 05c7d463 bge x15 x28 72 <end_j>
144: 41078333 sub x6 x15 x16
148: 00630333 add x6 x6 x6
14c: 00630333 add x6 x6 x6
150: 00668b33 add x22 x13 x6
154: 00cb2603 lw x12 12 x22
158: 01080333 add x6 x16 x16
15c: 00630333 add x6 x6 x6
160: 00668ab3 add x21 x13 x6
164: 000aa583 lw x11 0 x21
168: 0dc00fef jal x31 220 <bf16mul>
16c: 000305b3 add x11 x6 x0
170: 00000633 add x12 x0 x0
174: 05000fef jal x31 80 <fadd>
178: 000285b3 add x11 x5 x0
17c: 000d8633 add x12 x27 x0
180: 04400fef jal x31 68 <fadd>
184: 00500db3 add x27 x0 x5
00000188 <end_j>:
188: 00180813 addi x16 x16 1
18c: fb4846e3 blt x16 x20 -84 <loop_j>
190: 01b9a023 sw x27 0 x19
194: 00498993 addi x19 x19 4
198: 000d8513 addi x10 x27 0
19c: 00200893 addi x17 x0 2
1a0: 00000073 ecall
1a4: 10000517 auipc x10 0x10000
1a8: e9450513 addi x10 x10 -364
1ac: 00400893 addi x17 x0 4
1b0: 00000073 ecall
1b4: 00178793 addi x15 x15 1
1b8: 00500f93 addi x31 x0 5
1bc: f7f7c8e3 blt x15 x31 -144 <loop_i>
1c0: e71ff0ef jal x1 -400 <end_program>
000001c4 <fadd>:
1c4: 00061663 bne x12 x0 12 <fadd!=z>
1c8: 00058293 addi x5 x11 0
1cc: 000f80e7 jalr x1 x31 0
000001d0 <fadd!=z>:
1d0: 0175d293 srli x5 x11 23
1d4: 0ff2f293 andi x5 x5 255
1d8: 01765313 srli x6 x12 23
1dc: 0ff37313 andi x6 x6 255
1e0: 0105d393 srli x7 x11 16
1e4: 07f3f393 andi x7 x7 127
1e8: 01065e13 srli x28 x12 16
1ec: 07fe7e13 andi x28 x28 127
1f0: 08000e93 addi x29 x0 128
1f4: 01d3e3b3 or x7 x7 x29
1f8: 01de6e33 or x28 x28 x29
1fc: 40628eb3 sub x29 x5 x6
200: 000ec663 blt x29 x0 12 <swap>
204: 01de5e33 srl x28 x28 x29
208: 010000ef jal x1 16 <fadd_1>
0000020c <swap>:
20c: 000302b3 add x5 x6 x0
210: 41d00eb3 sub x29 x0 x29
214: 01d3d3b3 srl x7 x7 x29
00000218 <fadd_1>:
218: 01c38eb3 add x29 x7 x28
21c: 008edf13 srli x30 x29 8
220: 00100313 addi x6 x0 1
224: 006f1663 bne x30 x6 12 <fadd_result>
228: 001ede93 srli x29 x29 1
22c: 00128293 addi x5 x5 1
00000230 <fadd_result>:
230: 01729293 slli x5 x5 23
234: 07fefe93 andi x29 x29 127
238: 010e9e93 slli x29 x29 16
23c: 01d2e2b3 or x5 x5 x29
240: 000f80e7 jalr x1 x31 0
00000244 <bf16mul>:
244: 01f5da93 srli x21 x11 31
248: 01f65b13 srli x22 x12 31
24c: 08000e93 addi x29 x0 128
250: 0105db93 srli x23 x11 16
254: 07fbfb93 andi x23 x23 127
258: 01065c13 srli x24 x12 16
25c: 07fc7c13 andi x24 x24 127
260: 01dbebb3 or x23 x23 x29
264: 01dc6c33 or x24 x24 x29
268: 0175dc93 srli x25 x11 23
26c: 01765d13 srli x26 x12 23
270: 0ffcfc93 andi x25 x25 255
274: 0ffd7d13 andi x26 x26 255
278: 00000eb3 add x29 x0 x0
27c: 000002b3 add x5 x0 x0
00000280 <imul>:
280: 00800313 addi x6 x0 8
284: 02534463 blt x6 x5 40 <fmul_1>
00000288 <getbit>:
288: 005c53b3 srl x7 x24 x5
28c: 0013f393 andi x7 x7 1
290: 00128293 addi x5 x5 1
294: fe0386e3 beq x7 x0 -20 <imul>
298: fff28293 addi x5 x5 -1
29c: 005b9e33 sll x28 x23 x5
2a0: 01de0eb3 add x29 x28 x29
2a4: 00128293 addi x5 x5 1
2a8: fd9ff0ef jal x1 -40 <imul>
000002ac <fmul_1>:
2ac: 007ede93 srli x29 x29 7
2b0: 008edf13 srli x30 x29 8
2b4: 001f7f13 andi x30 x30 1
2b8: 01eedeb3 srl x29 x29 x30
2bc: 01ac82b3 add x5 x25 x26
2c0: f8128293 addi x5 x5 -127
2c4: 000f1463 bne x30 x0 8 <inc>
2c8: 074000ef jal x1 116 <fmul_2>
000002cc <inc>:
2cc: 00129393 slli x7 x5 1
2d0: 0013e393 ori x7 x7 1
2d4: 0072f333 and x6 x5 x7
2d8: 00231393 slli x7 x6 2
2dc: 0033e393 ori x7 x7 3
2e0: 00737333 and x6 x6 x7
2e4: 00431393 slli x7 x6 4
2e8: 00f3e393 ori x7 x7 15
2ec: 00737333 and x6 x6 x7
2f0: 00831393 slli x7 x6 8
2f4: 0ff3e393 ori x7 x7 255
2f8: 00737333 and x6 x6 x7
2fc: 01031393 slli x7 x6 16
300: 00010e37 lui x28 0x10
304: fffe0e13 addi x28 x28 -1
308: 01c3e3b3 or x7 x7 x28
30c: 00737333 and x6 x6 x7
310: 02000e13 addi x28 x0 32
314: 01c313b3 sll x7 x6 x28
318: fff00e13 addi x28 x0 -1
31c: 01c3e3b3 or x7 x7 x28
320: 00737333 and x6 x6 x7
324: 00131393 slli x7 x6 1
328: 0013e393 ori x7 x7 1
32c: 007343b3 xor x7 x6 x7
330: 01c34333 xor x6 x6 x28
334: 0062f333 and x6 x5 x6
338: 007362b3 or x5 x6 x7
0000033c <fmul_2>:
33c: 016ac333 xor x6 x21 x22
340: 01f31313 slli x6 x6 31
344: 0ff2f293 andi x5 x5 255
348: 01729293 slli x5 x5 23
34c: 07fefe93 andi x29 x29 127
350: 010e9e93 slli x29 x29 16
354: 00536333 or x6 x6 x5
358: 01d36333 or x6 x6 x29
35c: 000f80e7 jalr x1 x31 0
```
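In the listing above, each `li` with a constant wider than 12 bits is expanded into a `lui`/`addi` pair; for example, `li t1, 0x007FFFFF` becomes `lui x6 0x800` followed by `addi x6 x6 -1`. Because `addi` sign-extends its 12-bit immediate, the upper part has to be adjusted whenever bit 11 of the constant is set. The split can be sketched in C as follows, where `li_parts` and `split_li` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Immediates of the lui/addi pair that loads a 32-bit constant. */
typedef struct {
    uint32_t upper20;  /* immediate for lui  */
    int32_t  lower12;  /* immediate for addi */
} li_parts;

/* Split a 32-bit constant so that (upper20 << 12) + lower12 == value. */
li_parts split_li(uint32_t value)
{
    li_parts p;

    /* Low 12 bits, sign-extended the way addi interprets them. */
    int32_t lo = (int32_t)(value & 0xFFF);
    if (lo & 0x800)
        lo -= 0x1000;
    p.lower12 = lo;

    /* Removing the signed low part leaves a multiple of 4096 (mod 2^32). */
    p.upper20 = ((value - (uint32_t)lo) >> 12) & 0xFFFFF;
    return p;
}
```

When `lower12` is zero the assembler emits only the `lui` (as for `0x7F800000`), and when `upper20` is zero it emits only the `addi` (as for `0xFFFFFFFF`, which becomes `addi x28 x0 -1`).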
### 5-stage pipelined processor
Ripes provides several processor models, such as the **single-cycle processor**, **5-stage processor**, **5-stage processor with hazard detection**, and **5-stage processor with forwarding and hazard detection**. Here I choose the **5-stage processor**, which divides instruction execution into 5 stages. I will use an example to explain the function of each stage.

The 5-stage processor includes:
* **Instruction fetch (IF)**
At this point, the CPU reads instructions from memory at the address represented by the value of the Program Counter (PC).
* **Instruction decode (ID)**
The decode stage decodes the fetched instruction, reads the source registers, and produces the control signals for the later stages.
* **Execute (EX)**
In this stage, calculations are performed. The ALU processes the operation selected by the decoded instruction, performing logical and arithmetic shifts as well as arithmetic and logical operations such as ADD, SUB, AND, OR, and XOR.
* **Memory access (MEM)**
Load and store instructions access data memory in this stage: a load reads a value from memory, and a store writes a register value to memory. Instructions that do not touch memory pass through unchanged.
* **Writeback (WB)**
In this stage, the calculated or loaded value is written back to the destination register specified in the instruction.
Now, I use an I-type instruction, **addi x5, x6, 20**, as an example.
1. **Instruction fetch (IF)**

The IF stage uses the PC to fetch the instruction from instruction memory, and the PC is incremented by 4 to point to the next instruction.
1. **Instruction decode (ID)**

In ID, the instruction is decoded into its fields (opcode, rd, rs1, rs2). For an I-type instruction, the immediate is used in place of rs2.
The register file reads the value of rs1 (x6), and the instruction is decoded as an `addi` operation with the immediate value `0x00000014`.
1. **Execute (EX)**

Here, the ALU performs the `addi` operation, adding `0x00000000` (the value of x6) and `0x00000014`.
1. **Memory access (MEM)**

Because the `addi` instruction does not load from or store to memory, it just passes the result to the next stage.
1. **Writeback (WB)**

Finally, the result `0x00000014` is written to register x5, which completes the instruction `addi x5, x6, 20`.
After all these stages finish, register x5 holds `0x00000014`.
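The decode and execute steps of this walk-through can be mimicked in C. The sketch assumes the standard RV32I I-type field layout, under which `addi x5, x6, 20` is encoded as `0x01430293`; `itype_fields`, `decode_itype`, and `execute_addi` are hypothetical names for illustration and do not correspond to anything inside Ripes.

```c
#include <stdint.h>

/* Fields of an RV32I I-type instruction. */
typedef struct {
    uint32_t opcode;  /* bits 6:0   */
    uint32_t rd;      /* bits 11:7  */
    uint32_t funct3;  /* bits 14:12 */
    uint32_t rs1;     /* bits 19:15 */
    int32_t  imm;     /* bits 31:20, sign-extended */
} itype_fields;

/* ID stage: split the instruction word into its fields. */
itype_fields decode_itype(uint32_t insn)
{
    itype_fields f;
    f.opcode = insn & 0x7F;
    f.rd     = (insn >> 7) & 0x1F;
    f.funct3 = (insn >> 12) & 0x7;
    f.rs1    = (insn >> 15) & 0x1F;

    int32_t imm = (int32_t)((insn >> 20) & 0xFFF);
    if (imm & 0x800)          /* sign-extend the 12-bit immediate */
        imm -= 0x1000;
    f.imm = imm;
    return f;
}

/* EX stage for addi: the ALU adds the rs1 value and the immediate. */
uint32_t execute_addi(uint32_t rs1_value, int32_t imm)
{
    return rs1_value + (uint32_t)imm;
}
```

Decoding `0x01430293` gives opcode `0x13`, rd 5, rs1 6, and immediate 20; with x6 equal to 0, `execute_addi(0, 20)` returns `0x14`, the value written back to x5.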

## Result
| test1 | 1 | 2 | 3 |
| ------ |:------:|:--------:|:--------:|
| $x[n]$ | 1.2 | 1.203125 | 2.31 |
| $h[n]$ | 2.3125 | 3.46 | 3.453125 |

Convolution by using bfloat16

Convolution by using FP32

| test2 | 1 | 2 | 3 |
| ------ |:------:|:--------:|:--------:|
| $x[n]$ | 0.92 | 15.8 | 0.12 |
| $h[n]$ | 0.77 | 5.24 | 10 |

Convolution by using bfloat16

Convolution by using FP32

| test3 | 1 | 2 | 3 |
| ------ |:------:|:--------:|:--------:|
| $x[n]$ | 3.2 | 0.8 | 2.4 |
| $h[n]$ | 7.5 | 1.46 | 1.2 |

Convolution by using bfloat16

Convolution by using FP32

We can see that the two floating-point formats differ slightly in precision.
## Performance
Performance of test 1

## Reference
* [Design and development of a 5-stage Pipelined RISC processor](https://www.irjet.net/archives/V9/i10/IRJET-V9I10149.pdf)
* [bfloat16 floating-point format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
* [IEEE 754 floating point operation](https://www.cs.auckland.ac.nz/compsci210s1c/lectures/Cprog/Floatingpoint.pdf)