
# Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < [`Yao1201`](https://github.com/Yao1201) >

## Convolution by using bfloat16

### 1. Convolution

Algebraically, the discrete convolution of two sequences $x[n]$ and $h[n]$ is defined as:

$$y[n]=x[n]*h[n]=\sum_{k=-\infty}^{\infty} x[k]h[n-k]$$

It is also a crucial concept in digital signal processing. For example, the output $y[n]$ of an LTI system is the convolution of the input signal $x[n]$ with the impulse response $h[n]$ of the system.

### 2. bfloat16

The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory. It is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format. It retains the 8 exponent bits, which preserves the dynamic range, but supports only 7 fraction bits, which reduces its precision.

![reference link](https://hackmd.io/_uploads/Sk35vDGW6.png)

## Implementation

You can find the source code [here](https://github.com/Yao1201/Convolution-by-using-bfloat16).
### C code

```clike=
#include <stdio.h>

float fp32_to_bf16(float x);

int main() {
    float a[3] = {1.2, 1.203125, 2.31};      /* test data */
    float b[3] = {2.3125, 3.46, 3.453125};   /* from quiz1 (problem2) */
    float c[5];

    /* Convert both inputs to bfloat16 precision */
    for (int k = 0; k < 3; k++) {
        a[k] = fp32_to_bf16(a[k]);
        b[k] = fp32_to_bf16(b[k]);
    }

    for (int i = 0; i < 5; i++) {
        c[i] = 0.0;
        for (int j = 0; j < 3; j++) {
            if ((i - j) >= 0 && (i - j) < 3) {
                /* Main calculation */
                c[i] = c[i] + a[j] * b[i - j];
            }
        }
        printf("%f\t", c[i]);
    }
    return 0;
}

float fp32_to_bf16(float x)
{
    float y = x;
    int *p = (int *) &y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    if (exp == 0 && man == 0) /* zero */
        return x;
    if (exp == 0x7F800000) /* infinity or NaN */
        return x;

    /* Normalized number */
    /* round to nearest */
    float r = x;
    int *pr = (int *) &r;
    *pr &= 0xFF800000;  /* r has the same exp as x */
    r /= 0x100;
    y = x + r;

    *p &= 0xFFFF0000;

    return y;
}
```

### Assembly code

In this example, I need to implement bfloat16 addition (`fadd`) and multiplication (`bf16mul`).

1. **Store arrays to memory**
   Load the test data.
1. **Transfer fp32 to bf16**
   Convert the input data to the bfloat16 format.
1. **Loops**
   Indices i and j determine which array elements take part in the calculation.
1. **Multiplication & Addition**
   Because the F extension is not allowed, floating-point addition and multiplication have to be written as functions.
```clike=
.data
#test 1 (from problem2)
arr1: .word 0x3f99999a, 0x3f9a0000, 0x4013d70a    #1.2 1.203125 2.31
arr2: .word 0x40140000, 0x405d70a4, 0x405d0000    #2.3125 3.46 3.453125
#test 2
#arr1: .word 0x3f6b851f, 0x417ccccd, 0x3df5c28f   #0.92,15.8,0.12
#arr2: .word 0x3f451eb8, 0x40a7ae14, 0x41200000   #0.77,5.24,10
#test 3
#arr1: .word 0x404ccccd, 0x3f4ccccd, 0x4019999a   #3.2,0.8,2.4
#arr2: .word 0x40f00000, 0x3fbae148, 0xc0066666   #7.5,1.46,1.2
arr3: .word 0, 0, 0
len:  .word 3                  #array length=3
str1: .string "y[n]=["
str2: .string " end program"
str3: .string "]"
space: .string " "
#a1,a2 --> x[],h[]
#a3 --bf16-->[ x h ]
#s3 ---> result
.text
#-----------------------------------------------------------#
main:
    la s1, arr1                #s1=x[]
    la s2, arr2                #s2=h[]
    la s3, arr3                #s3=y[]
    lw s4, len
    add s5, x0, x0             #s5=0 transfer counter
    jal x_fp32tobf16           #transfer to bf16
    jal h_fp32tobf16
    jal convolution            #convolution and print
end_program:
    la a0, str3
    li a7, 4
    ecall
    la a0, str2
    li a7, 4
    ecall
    li a7, 10
    ecall
#---------------------------fp32tobf16-----------------------------------#
x_fp32tobf16:
    lw a1, 0(s1)
    addi s1, s1, 4             #x[] next element
    li t0, 0x7F800000          #t0=0x7F800000
    and s6, a1, t0             #s6=exp (*p&0x7F800000)
    li t1, 0x007FFFFF          #t1=0x007FFFFF
    and s7, a1, t1             #s7=man (*p&0x007FFFFF)
    or t1, s6, s7
    beqz t1, eqzero
    beq s6, t0, infinity_NaN
    li t0, 0xFF800000
    and t1, t0, a1             #keep sign+exp
    srli t0, s7, 8             #man >>8
    li t2, 0x8000
    or t0, t0, t2
    add t0, t0, s7
    or t0, t0, t1              #assemble sign + exp + man
    li t1, 0xFFFF0000          #truncate redundant bits
    and t0, t0, t1
    sw t0, 0(a3)               #store bf16 in a3[]
    addi a3, a3, 4
    addi s5, s5, 1             #counter++
    blt s5, s4, x_fp32tobf16   #counter< 3 loop
    addi s5, x0, 0
    ret
#------------------------------------------------------------------------#
h_fp32tobf16:
    lw a2, 0(s2)
    addi s2, s2, 4             #h[] next element
    li t0, 0x7F800000          #t0=0x7F800000
    and s6, a2, t0             #s6=exp (*p&0x7F800000)
    li t1, 0x007FFFFF          #t1=0x007FFFFF
    and s7, a2, t1             #s7=man (*p&0x007FFFFF)
    or t1, s6, s7
    beqz t1, eqzero
    beq s6, t0, infinity_NaN
    li t0, 0xFF800000
    and t1, t0, a2             #keep sign+exp
    srli t0, s7, 8             #man >>8
    li t2, 0x8000
    or t0, t0, t2
    add t0, t0, s7
    or t0, t0, t1              #assemble sign + exp + man
    li t1, 0xFFFF0000          #truncate redundant bits
    and t0, t0, t1
    sw t0, 0(a3)               #store bf16 in a3[]
    addi a3, a3, 4
    addi s5, s5, 1             #counter++
    blt s5, s4, h_fp32tobf16   #counter< 3 loop
    addi a3, a3, -24           #correction a3
    ret
eqzero:
    jal return_val
infinity_NaN:
    jal return_val
return_val:
    li a7, 2
    ecall
end:
    li a7, 10
    ecall
#---------------------------convolution-------------------------------#
convolution:
    addi a5, x0, 0             #i=0 counter
loop_i:
    add s11, x0, x0
    addi t6, x0, 5             #2*len-1=5
    addi a6, x0, 0             #j=0 counter
loop_j:
    blt a5, a6, end_j          #skip if i<j
    addi t3, a6, 3             #j+3
    bge a5, t3, end_j          #skip if i>=j+3
    #---bf16 mul----#
    sub t1, a5, a6             #i-j
    add t1, t1, t1
    add t1, t1, t1             #4*(i-j)
    add s6, a3, t1
    lw a2, 12(s6)              #h[]
    add t1, a6, a6
    add t1, t1, t1             #4*j
    add s5, a3, t1
    lw a1, 0(s5)               #x[]
    jal t6, bf16mul            #t1 = mul_result
    add a1, t1, x0
    add a2, x0, x0
    jal t6, fadd               #t0 = fadd_result
    add a1, t0, x0
    add a2, s11, x0
    jal t6, fadd
    add s11, x0, t0            #y=y+x*h
end_j:
    addi a6, a6, 1             #j++
    blt a6, s4, loop_j         #if j < 3 -> loop_j
    sw s11, 0(s3)
    addi s3, s3, 4
    mv a0, s11                 #print result
    li a7, 2
    ecall
    la a0, space
    li a7, 4
    ecall
    addi a5, a5, 1             #i++
    addi t6, x0, 5             #2*len-1=5
    blt a5, t6, loop_i         #i<5 --> loop_i
    jal end_program
#----------------------bf16 addition------------------------#
fadd:                          #fadd(a1, a2)
    bnez a2, fadd!=z           #if a2 == 0, return a1
    mv t0, a1
    jalr t6
fadd!=z:
    srli t0, a1, 23            #a1=val_1
    andi t0, t0, 0xFF          #t0=exp_1 assume result exp
    srli t1, a2, 23            #a2=val_2
    andi t1, t1, 0xFF          #t1=exp_2
    srli t2, a1, 16
    andi t2, t2, 0x7F          #man_1
    srli t3, a2, 16
    andi t3, t3, 0x7F          #man_2
    li t4, 0x80                #mask of significand
    or t2, t2, t4              #t2=sig_1
    or t3, t3, t4              #t3=sig_2
    sub t4, t0, t1             #diff between exp_1,2
    blt t4, x0, swap           #if exp_1<exp_2 change result exp
    srl t3, t3, t4             #sig_2 >> diff
    jal fadd_1
swap:
    add t0, t1, x0             #change exp_2 to result exp
    sub t4, x0, t4             #diff=-diff
    srl t2, t2, t4             #sig_1 >> diff
fadd_1:
    add t4, t2, t3             #t4=sig_1+sig_2 result sig
    srli t5, t4, 8             #check carry out of bit 7
    addi t1, x0, 1
    bne t5, t1, fadd_result
    srli t4, t4, 1
    addi t0, t0, 1             #t0=result_exp=exp+1
fadd_result:
    slli t0, t0, 23            #result exp<<23
    andi t4, t4, 0x7F          #result man
    slli t4, t4, 16            #man <<16
    or t0, t0, t4              #result
    jalr t6
#--------------------bf16 multiplication------------------------#
bf16mul:                       #bf16mul(a1, a2)
    srli s5, a1, 31            #s5= sign_1
    srli s6, a2, 31            #s6= sign_2
    li t4, 0x80                #significand mask
    srli s7, a1, 16
    andi s7, s7, 0x7F
    srli s8, a2, 16
    andi s8, s8, 0x7F
    or s7, s7, t4              #s7= sig_1 7+1bits
    or s8, s8, t4              #s8= sig_2
    srli s9, a1, 23
    srli s10, a2, 23
    andi s9, s9, 0xFF          #s9= exp_1 8bits
    andi s10, s10, 0xFF        #s10=exp_2
    add t4, x0, x0             #imul result
    add t0, x0, x0             #i counter
imul:
    addi t1, x0, 8             #t1=8
    bgt t0, t1, fmul_1
getbit:
    srl t2, s8, t0
    andi t2, t2, 1
    addi t0, t0, 1             #i++
    beqz t2, imul
    addi t0, t0, -1
    sll t3, s7, t0             #t3=sig_1<<i
    add t4, t3, t4             #r+=sig_1<<i
    addi t0, t0, 1             #i++
    jal imul
fmul_1:
    srli t4, t4, 7             #imul result >> 7
    srli t5, t4, 8             #get bit 8 of t4
    andi t5, t5, 1             #sig shift mshift
    srl t4, t4, t5             #t4= result sig
    add t0, s9, s10            #ea+eb
    addi t0, t0, -127          #-127 ertmp er
    bnez t5, inc
    jal fmul_2                 #er=ertmp
inc:                           #mask lowest zero
    #--mask &= (mask << 1) | 0x1--#
    slli t2, t0, 1             #ori mask=t0
    ori t2, t2, 0x1
    and t1, t0, t2
    #--mask &= (mask << 2) | 0x3--#
    slli t2, t1, 2
    ori t2, t2, 0x3
    and t1, t1, t2
    #--mask &= (mask << 4) | 0xF--#
    slli t2, t1, 4
    ori t2, t2, 0xF
    and t1, t1, t2
    #--mask &= (mask << 8) | 0xFF--#
    slli t2, t1, 8
    ori t2, t2, 0xFF
    and t1, t1, t2
    #--mask &= (mask << 16) | 0xFFFF--#
    slli t2, t1, 16
    li t3, 0xFFFF
    or t2, t2, t3
    and t1, t1, t2
    #--mask &= (mask << 32) | 0xFFFFFFFF--#
    li t3, 0x20
    sll t2, t1, t3
    li t3, 0xFFFFFFFF
    or t2, t2, t3
    and t1, t1, t2             #return mask = t1
    #--z1 = mask ^ ((mask << 1) | 1)--#
    slli t2, t1, 1
    ori t2, t2, 1
    xor t2, t1, t2             #z1
    #--return (x & ~mask) | z1--#
    xor t1, t1, t3             #~mask
    and t1, t0, t1             #x&~mask
    or t0, t1, t2              #inc return = t0
fmul_2:
    #--result = (sr << 31) | ((er & 0xFF) << 23) | (mr & 0x7FFFFF)--#
    xor t1, s5, s6             #sign result
    slli t1, t1, 31            #sign<<31
    andi t0, t0, 0xFF
    slli t0, t0, 23            #exp result
    andi t4, t4, 0x7F          #man result
    slli t4, t4, 16            #man <<16
    or t1, t1, t0
    or t1, t1, t4              #t1= result
    jalr t6
```

### Analysis

I tested my code using the [Ripes](https://github.com/mortbopet/Ripes) simulator.

#### Pseudo instruction

```clike=
00000000 <main>:
    0:        10000497        auipc x9 0x10000
    4:        00048493        addi x9 x9 0
    8:        10000917        auipc x18 0x10000
    c:        00490913        addi x18 x18 4
    10:       10000997        auipc x19 0x10000
    14:       00898993        addi x19 x19 8
    18:       10000a17        auipc x20 0x10000
    1c:       00ca2a03        lw x20 12 x20
    20:       00000ab3        add x21 x0 x0
    24:       024000ef        jal x1 36 <x_fp32tobf16>
    28:       084000ef        jal x1 132 <y_fp32tobf16>
    2c:       0fc000ef        jal x1 252 <convolution>

00000030 <end_program>:
    30:       10000517        auipc x10 0x10000
    34:       ffb50513        addi x10 x10 -5
    38:       00400893        addi x17 x0 4
    3c:       00000073        ecall
    40:       00a00893        addi x17 x0 10
    44:       00000073        ecall

00000048 <x_fp32tobf16>:
    48:       0004a583        lw x11 0 x9
    4c:       00448493        addi x9 x9 4
    50:       7f8002b7        lui x5 0x7f800
    54:       0055fb33        and x22 x11 x5
    58:       00800337        lui x6 0x800
    5c:       fff30313        addi x6 x6 -1
    60:       0065fbb3        and x23 x11 x6
    64:       017b6333        or x6 x22 x23
    68:       0a030463        beq x6 x0 168 <eqzero>
    6c:       0a5b0463        beq x22 x5 168 <infinity_NaN>
    70:       ff8002b7        lui x5 0xff800
    74:       00b2f333        and x6 x5 x11
    78:       008bd293        srli x5 x23 8
    7c:       000083b7        lui x7 0x8
    80:       0072e2b3        or x5 x5 x7
    84:       017282b3        add x5 x5 x23
    88:       0062e2b3        or x5 x5 x6
    8c:       ffff0337        lui x6 0xffff0
    90:       0062f2b3        and x5 x5 x6
    94:       0056a023        sw x5 0 x13
    98:       00468693        addi x13 x13 4
    9c:       001a8a93        addi x21 x21 1
    a0:       fb4ac4e3        blt x21 x20 -88 <x_fp32tobf16>
    a4:       00000a93        addi x21 x0 0
    a8:       00008067        jalr x0 x1 0

000000ac <y_fp32tobf16>:
    ac:       00092603        lw x12 0 x18
    b0:       00490913        addi x18 x18 4
    b4:       7f8002b7        lui x5 0x7f800
    b8:       00567b33        and x22 x12 x5
    bc:       00800337        lui x6 0x800
    c0:       fff30313        addi x6 x6 -1
    c4:       00667bb3        and x23 x12 x6
    c8:       017b6333        or x6 x22 x23
    cc:       04030263        beq x6 x0 68 <eqzero>
    d0:       045b0263        beq x22 x5 68 <infinity_NaN>
    d4:       ff8002b7        lui x5 0xff800
    d8:       00c2f333        and x6 x5 x12
    dc:       008bd293        srli x5 x23 8
    e0:       000083b7        lui x7 0x8
    e4:       0072e2b3        or x5 x5 x7
    e8:       017282b3        add x5 x5 x23
    ec:       0062e2b3        or x5 x5 x6
    f0:       ffff0337        lui x6 0xffff0
    f4:       0062f2b3        and x5 x5 x6
    f8:       0056a023        sw x5 0 x13
    fc:       00468693        addi x13 x13 4
    100:      001a8a93        addi x21 x21 1
    104:      fb4ac4e3        blt x21 x20 -88 <y_fp32tobf16>
    108:      fe868693        addi x13 x13 -24
    10c:      00008067        jalr x0 x1 0

00000110 <eqzero>:
    110:      008000ef        jal x1 8 <return_val>

00000114 <infinity_NaN>:
    114:      004000ef        jal x1 4 <return_val>

00000118 <return_val>:
    118:      00200893        addi x17 x0 2
    11c:      00000073        ecall

00000120 <end>:
    120:      00a00893        addi x17 x0 10
    124:      00000073        ecall

00000128 <convolution>:
    128:      00000793        addi x15 x0 0

0000012c <loop_i>:
    12c:      00000db3        add x27 x0 x0
    130:      00500f93        addi x31 x0 5
    134:      00000813        addi x16 x0 0

00000138 <loop_j>:
    138:      0507c863        blt x15 x16 80 <end_j>
    13c:      00380e13        addi x28 x16 3
    140:      05c7d463        bge x15 x28 72 <end_j>
    144:      41078333        sub x6 x15 x16
    148:      00630333        add x6 x6 x6
    14c:      00630333        add x6 x6 x6
    150:      00668b33        add x22 x13 x6
    154:      00cb2603        lw x12 12 x22
    158:      01080333        add x6 x16 x16
    15c:      00630333        add x6 x6 x6
    160:      00668ab3        add x21 x13 x6
    164:      000aa583        lw x11 0 x21
    168:      0dc00fef        jal x31 220 <bf16mul>
    16c:      000305b3        add x11 x6 x0
    170:      00000633        add x12 x0 x0
    174:      05000fef        jal x31 80 <fadd>
    178:      000285b3        add x11 x5 x0
    17c:      000d8633        add x12 x27 x0
    180:      04400fef        jal x31 68 <fadd>
    184:      00500db3        add x27 x0 x5

00000188 <end_j>:
    188:      00180813        addi x16 x16 1
    18c:      fb4846e3        blt x16 x20 -84 <loop_j>
    190:      01b9a023        sw x27 0 x19
    194:      00498993        addi x19 x19 4
    198:      000d8513        addi x10 x27 0
    19c:      00200893        addi x17 x0 2
    1a0:      00000073        ecall
    1a4:      10000517        auipc x10 0x10000
    1a8:      e9450513        addi x10 x10 -364
    1ac:      00400893        addi x17 x0 4
    1b0:      00000073        ecall
    1b4:      00178793        addi x15 x15 1
    1b8:      00500f93        addi x31 x0 5
    1bc:      f7f7c8e3        blt x15 x31 -144 <loop_i>
    1c0:      e71ff0ef        jal x1 -400 <end_program>

000001c4 <fadd>:
    1c4:      00061663        bne x12 x0 12 <fadd!=z>
    1c8:      00058293        addi x5 x11 0
    1cc:      000f80e7        jalr x1 x31 0

000001d0 <fadd!=z>:
    1d0:      0175d293        srli x5 x11 23
    1d4:      0ff2f293        andi x5 x5 255
    1d8:      01765313        srli x6 x12 23
    1dc:      0ff37313        andi x6 x6 255
    1e0:      0105d393        srli x7 x11 16
    1e4:      07f3f393        andi x7 x7 127
    1e8:      01065e13        srli x28 x12 16
    1ec:      07fe7e13        andi x28 x28 127
    1f0:      08000e93        addi x29 x0 128
    1f4:      01d3e3b3        or x7 x7 x29
    1f8:      01de6e33        or x28 x28 x29
    1fc:      40628eb3        sub x29 x5 x6
    200:      000ec663        blt x29 x0 12 <swap>
    204:      01de5e33        srl x28 x28 x29
    208:      010000ef        jal x1 16 <fadd_1>

0000020c <swap>:
    20c:      000302b3        add x5 x6 x0
    210:      41d00eb3        sub x29 x0 x29
    214:      01d3d3b3        srl x7 x7 x29

00000218 <fadd_1>:
    218:      01c38eb3        add x29 x7 x28
    21c:      008edf13        srli x30 x29 8
    220:      00100313        addi x6 x0 1
    224:      006f1663        bne x30 x6 12 <fadd_result>
    228:      001ede93        srli x29 x29 1
    22c:      00128293        addi x5 x5 1

00000230 <fadd_result>:
    230:      01729293        slli x5 x5 23
    234:      07fefe93        andi x29 x29 127
    238:      010e9e93        slli x29 x29 16
    23c:      01d2e2b3        or x5 x5 x29
    240:      000f80e7        jalr x1 x31 0

00000244 <bf16mul>:
    244:      01f5da93        srli x21 x11 31
    248:      01f65b13        srli x22 x12 31
    24c:      08000e93        addi x29 x0 128
    250:      0105db93        srli x23 x11 16
    254:      07fbfb93        andi x23 x23 127
    258:      01065c13        srli x24 x12 16
    25c:      07fc7c13        andi x24 x24 127
    260:      01dbebb3        or x23 x23 x29
    264:      01dc6c33        or x24 x24 x29
    268:      0175dc93        srli x25 x11 23
    26c:      01765d13        srli x26 x12 23
    270:      0ffcfc93        andi x25 x25 255
    274:      0ffd7d13        andi x26 x26 255
    278:      00000eb3        add x29 x0 x0
    27c:      000002b3        add x5 x0 x0

00000280 <imul>:
    280:      00800313        addi x6 x0 8
    284:      02534463        blt x6 x5 40 <fmul_1>

00000288 <getbit>:
    288:      005c53b3        srl x7 x24 x5
    28c:      0013f393        andi x7 x7 1
    290:      00128293        addi x5 x5 1
    294:      fe0386e3        beq x7 x0 -20 <imul>
    298:      fff28293        addi x5 x5 -1
    29c:      005b9e33        sll x28 x23 x5
    2a0:      01de0eb3        add x29 x28 x29
    2a4:      00128293        addi x5 x5 1
    2a8:      fd9ff0ef        jal x1 -40 <imul>

000002ac <fmul_1>:
    2ac:      007ede93        srli x29 x29 7
    2b0:      008edf13        srli x30 x29 8
    2b4:      001f7f13        andi x30 x30 1
    2b8:      01eedeb3        srl x29 x29 x30
    2bc:      01ac82b3        add x5 x25 x26
    2c0:      f8128293        addi x5 x5 -127
    2c4:      000f1463        bne x30 x0 8 <inc>
    2c8:      074000ef        jal x1 116 <fmul_2>

000002cc <inc>:
    2cc:      00129393        slli x7 x5 1
    2d0:      0013e393        ori x7 x7 1
    2d4:      0072f333        and x6 x5 x7
    2d8:      00231393        slli x7 x6 2
    2dc:      0033e393        ori x7 x7 3
    2e0:      00737333        and x6 x6 x7
    2e4:      00431393        slli x7 x6 4
    2e8:      00f3e393        ori x7 x7 15
    2ec:      00737333        and x6 x6 x7
    2f0:      00831393        slli x7 x6 8
    2f4:      0ff3e393        ori x7 x7 255
    2f8:      00737333        and x6 x6 x7
    2fc:      01031393        slli x7 x6 16
    300:      00010e37        lui x28 0x10
    304:      fffe0e13        addi x28 x28 -1
    308:      01c3e3b3        or x7 x7 x28
    30c:      00737333        and x6 x6 x7
    310:      02000e13        addi x28 x0 32
    314:      01c313b3        sll x7 x6 x28
    318:      fff00e13        addi x28 x0 -1
    31c:      01c3e3b3        or x7 x7 x28
    320:      00737333        and x6 x6 x7
    324:      00131393        slli x7 x6 1
    328:      0013e393        ori x7 x7 1
    32c:      007343b3        xor x7 x6 x7
    330:      01c34333        xor x6 x6 x28
    334:      0062f333        and x6 x5 x6
    338:      007362b3        or x5 x6 x7

0000033c <fmul_2>:
    33c:      016ac333        xor x6 x21 x22
    340:      01f31313        slli x6 x6 31
    344:      0ff2f293        andi x5 x5 255
    348:      01729293        slli x5 x5 23
    34c:      07fefe93        andi x29 x29 127
    350:      010e9e93        slli x29 x29 16
    354:      00536333        or x6 x6 x5
    358:      01d36333        or x6 x6 x29
    35c:      000f80e7        jalr x1 x31 0
```

### 5-stage pipelined processor

Ripes provides several processor models, such as the **single-cycle processor**, the **5-stage processor**, the **5-stage processor with hazard detection**, and the **5-stage processor with forward and hazard detection**. Here I choose the **5-stage processor**, which is divided into five stages, and I will use an example to explain the function of each stage.

![](https://hackmd.io/_uploads/H1x_ftGWT.png)

The 5-stage processor includes:

* **Instruction fetch (IF)**
  At this point, the CPU reads the instruction from memory at the address held in the Program Counter (PC).
* **Instruction decode (ID)**
  The decode stage decodes the fetched instruction and passes it to the control unit.
* **Execute (EX)**
  In this stage, calculations are performed. The ALU processes the operation selected by the Execute Command input.
  The ALU performs shift operations, both logical and arithmetic, as well as arithmetic and logical operations including ADD, SUB, AND, OR, NOR, and XOR.
* **Memory access (MEM)**
  Load and store instructions access the data memory in this stage: a load reads a value from memory, and a store writes a register value to memory. Other instructions simply pass their result on to the next stage.
* **Writeback (WB)**
  In this stage, the calculated or loaded value is written back to the destination register specified in the instruction.

Now, I use the I-type instruction **addi x5, x6, 20** as an example.

1. **Instruction fetch (IF)**
   ![](https://hackmd.io/_uploads/SJUzGcz-T.png)
   The IF stage uses the PC as the address into the instruction memory to fetch the instruction. The PC is then incremented by 4 to fetch the next instruction.
1. **Instruction decode (ID)**
   ![](https://hackmd.io/_uploads/Bk3sf5fbT.png)
   In ID, the instruction is decoded into the opcode, rd, rs1, and rs2 fields. Because the opcode indicates an I-type instruction, the immediate is used instead of rs2. The register file reads the value of rs1, and the instruction is decoded as an `addi` operation with the immediate value `0x00000014`.
1. **Execute (EX)**
   ![](https://hackmd.io/_uploads/ryhgVqf-p.png)
   Here, the ALU performs the `addi` operation and adds `0x00000000` and `0x00000014`.
1. **Memory access (MEM)**
   ![](https://hackmd.io/_uploads/ryg4IiG-6.png)
   Because the `addi` instruction does not load from or store to memory, this stage just passes the data to the next stage.
1. **Writeback (WB)**
   ![](https://hackmd.io/_uploads/SkHVwizZp.png)
   Finally, the result `0x00000014` is written to register x5, which completes the instruction `addi x5, x6, 20`.

After all these stages have finished, the register is updated as follows:

![](https://hackmd.io/_uploads/B1RHKjzZ6.png)

## Result

| test1  |   1    |    2     |    3     |
| ------ |:------:|:--------:|:--------:|
| $x[n]$ |  1.2   | 1.203125 |   2.31   |
| $h[n]$ | 2.3125 |   3.46   | 3.453125 |

Convolution by using bfloat16
![](https://hackmd.io/_uploads/SkraTsMZ6.png)

Convolution by using FP32
![](https://hackmd.io/_uploads/ByxL0iMZa.png)

| test2  |  1   |  2   |  3   |
| ------ |:----:|:----:|:----:|
| $x[n]$ | 0.92 | 15.8 | 0.12 |
| $h[n]$ | 0.77 | 5.24 |  10  |

Convolution by using bfloat16
![](https://hackmd.io/_uploads/B1O_khzb6.png)

Convolution by using FP32
![](https://hackmd.io/_uploads/H1e1g2fWp.png)

| test3  |  1  |  2   |  3  |
| ------ |:---:|:----:|:---:|
| $x[n]$ | 3.2 | 0.8  | 2.4 |
| $h[n]$ | 7.5 | 1.46 | 1.2 |

Convolution by using bfloat16
![](https://hackmd.io/_uploads/BkgCgnGZp.png)

Convolution by using FP32
![](https://hackmd.io/_uploads/SkPJW2zZa.png)

We can see that the results of the two floating-point formats differ slightly because of their different precision.

## Performance

Performance of test 1
![](https://hackmd.io/_uploads/ry_rznMba.png)

## Reference

* [Design and development of a 5-stage Pipelined RISC processor](https://www.irjet.net/archives/V9/i10/IRJET-V9I10149.pdf)
* [bfloat16 floating-point format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
* [IEEE 754 floating point operation](https://www.cs.auckland.ac.nz/compsci210s1c/lectures/Cprog/Floatingpoint.pdf)