Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < `P76131254` > ## Problem B The bfloat16 format is a 16-bit floating-point representation, designed to provide a wide dynamic range by using a floating radix point. ``` ┌ sign │ │ ┌ exponent │ │ │ │ ┌ mantissa │ │ │ │┌──┴───┐┌─┴───┐ 0b0000000000000000 bfloat16 ``` BF16 offers several advantages like: Memory Efficiency: BF16 uses only 16 bits, which means it requires half the memory compared to full 32-bit floating-point numbers (FP32). Faster Computation: Because BF16 representations are smaller, they allow for faster computation. The reduced precision speeds up matrix multiplications and other intensive operations. ### bf16 to fp32 ```c static inline float bf16_to_fp32(bf16_t h) { union { float f; uint32_t i; } u = {.i = (uint32_t)h.bits << 16}; return u.f; } ``` ### fp32 to bf16 ```c static inline bf16_t fp32_to_bf16(float s) { bf16_t h; union { float f; uint32_t i; } u = {.f = s}; if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */ h.bits = (u.i >> 16) | 64; /* force to quiet */ return h; } h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10; return h; } ``` ## Bilinear Interpolation by using BF16 and related arithmetic operations Bilinear interpolation is a fundamental resampling technique in computer vision and image processing, also known as bilinear filtering or bilinear texture mapping. It is commonly applied in image rotation and augmentation tasks. However, these operations can significantly increase computational complexity. To address this, I am exploring the use of BF16, which offers advantages like faster computation and reduced memory usage, to enhance the efficiency of bilinear interpolation. ### Bilinear Interpolation ![BilinearInterpolationV2](https://hackmd.io/_uploads/HynCqq8JJg.svg) Suppose that we want to find the value of the point (x, y). It is assumed that we know the value of f at the four points $Q11 = (x1, y1)$, $Q12 = (x1, y2)$, $Q21 = (x2, y1)$, and $Q22 = (x2, y2)$. First, we calculate linear interpolation in the x-direction. $$ f(x, y_1) = \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}), $$ $$ f(x, y_2) = \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}). $$ Then, we calculate by interpolating in the y-direction to obtain the desired estimate: $$ f(x, y) = \frac{y_2 - y}{y_2 - y_1} f(x, y_1) + \frac{y - y_1}{y_2 - y_1} f(x, y_2) $$ :::danger Improve the English writing. ::: ## Implement To use bilinear interpolation to compute the value at point `(1,1)` based on the values at `(0,0)`, `(2,0)`, `(0,2)`, and`(2,2)` Start by performing linear interpolation caculate point `(1,1)` and `(2,1)` ``` ------------------------- ------------------------- y=0 | 0.235 | 0.0 | 0.272 | | 0.235 | 0.252 | 0.272 | |------------------------ |------------------------ y=1 | 0.0 | 0.0 | 0.0 | --> | 0.0 | 0.0 | 0.0 | |------------------------ |------------------------ y=2 | 0.333 | 0.0 | 0.916 | | 0.333 | 0.621 | 0.916 | ------------------------- ------------------------- x=0 x=1 x=2 x=0 x=1 x=2 ``` Here, we interpolated the values at `x=1` for `y=0` and `y=2` to obtain the values at points `(1,0)` and `(1,2)`. Next, we use these interpolated values to compute the value at `(1,1)` using bilinear interpolation, based on the values we calculated in the previous step. ``` ------------------------- ------------------------- y=0 | 0.235 | 0.252 | 0.272 | | 0.235 | 0.252 | 0.272 | |------------------------ |------------------------ y=1 | 0.0 | 0.0 | 0.0 | --> | 0.0 | 0.435 | 0.0 | |------------------------ |------------------------ y=2 | 0.333 | 0.621 | 0.916 | | 0.333 | 0.621 | 0.916 | ------------------------- ------------------------- x=0 x=1 x=2 x=0 x=1 x=2 ``` The result value of```(1,1)``` is 0.435. I use the BF16 data type to store data and perform addition and multiplication operations. ## C code ### `bf16_t` Data Type ```c typedef struct { uint32_t bits; } bf16_t; ``` ### convert `FP32` to `BF16` ```c static inline bf16_t fp32_to_bf16(float s) { bf16_t h; union { float f; uint32_t i; } u = {.f = s}; if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */ h.bits = (u.i >> 16) | 64; /* force to quiet */ return h; } ``` ### convert `BF16` to `FP32` ```c static inline float bf16_to_fp32(bf16_t h) { union { float f; uint32_t i; } u = {.i = (uint32_t)h.bits << 16}; return u.f; } ``` ### add of `BF16` The `bf16_add` uses `my_clz` to find how many leading zeros are present in the mantissa. ```c static inline int my_clz(uint16_t x) { int count = 0; for (int i = 15; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } bf16_t bf16_add(bf16_t a, bf16_t b) { uint16_t ia = a.bits; uint16_t ib = b.bits; /* sign */ int sa = ia >> 15; int sb = ib >> 15; /* mantissa */ int16_t ma = (ia & 0x7F) | 0x80; int16_t mb = (ib & 0x7F) | 0x80; /* exponent */ int16_t ea = (ia >> 7) & 0xFF; int16_t eb = (ib >> 7) & 0xFF; // sub_exp and mb shift if (ea > eb) { mb >>= (ea - eb); eb = ea; } else if (eb > ea) { ma >>= (eb - ea); ea = eb; } // int16_t mrtmp; if (sa == sb) { // add mrtmp = ma + mb; } else { // sub if (ma >= mb) { mrtmp = ma - mb; } else { mrtmp = mb - ma; sa = sb; } } //use my clz int clz = my_clz(mrtmp); printf("%d",clz); int16_t shift = 0; if (clz <= 8) { shift = 8 - clz; mrtmp >>= shift; ea += shift; } else { shift = clz - 8; mrtmp <<= shift; ea -= shift; } uint16_t mr = mrtmp & 0x7F; uint16_t er = ea & 0xFF; uint16_t r = (sa << 15) | (er << 7) | mr; bf16_t result = {.bits = r}; return result; } ``` ### multiplication of `BF16` ```c bf16_t bf16_mul(bf16_t a, bf16_t b) { uint16_t ia = a.bits, ib = b.bits; int sa = ia >> 15, sb = ib >> 15; int16_t ma = (ia & 0x7F) | 0x80, mb = (ib & 0x7F) | 0x80; int16_t ea = (ia >> 7) & 0xFF, eb = (ib >> 7) & 0xFF; int32_t mrtmp = imul16(ma, mb); int mshift = (mrtmp >> 15) & 1; uint16_t mr = mrtmp >> (mshift + 7); int16_t ertmp = ea + eb - 127; int16_t er = mshift ? ertmp + 1 : ertmp; int sr = sa ^ sb; uint16_t r = (sr << 15) | ((er & 0xFF) << 7) | (mr & 0x007F); return (bf16_t){.bits = r}; } ``` ### Bilinear_interpolation Automatically run the three tests for verification. :::danger Have you considered the corner cases? ::: ```c int main() { //use linear interpolation to find point[1][1] float points[3][3][3] = { {{0.235 , 0 , 0.272}, {0 , 0 , 0 }, {0.333 , 0 , 0.916}}, {{0.358 , 0 , 27610}, {0 , 0 , 0 }, {3399 , 0 , 8.1225}}, {{-3556 , 0 , 3.3365}, {0 , 0 , 0 }, {11.6782, 0 , 9881}} }; for(int i = 0; i<3 ; i++){ bf16_t p01 = bf16_add(bf16_mul(fp32_to_bf16((float)0.5),fp32_to_bf16(points[i][0][0])),bf16_mul(fp32_to_bf16((float)0.5),fp32_to_bf16(points[i][0][2]))); // printf("result1 (Float): %f\n", bf16_to_fp32(p01)); bf16_t p21 = bf16_add(bf16_mul(fp32_to_bf16((float)0.5),fp32_to_bf16(points[i][2][0])),bf16_mul(fp32_to_bf16((float)0.5),fp32_to_bf16(points[i][2][2]))); // printf("result1 (Float): %f\n", bf16_to_fp32(p21)); bf16_t p11 = bf16_add(bf16_mul(fp32_to_bf16((float)0.5),p01),bf16_mul(fp32_to_bf16((float)0.5),p21)); printf("result1 (Float): %f\n", bf16_to_fp32(p11)); } return 0; } ``` The complete C code is as follow [Bilinear_interpolation.c](https://github.com/handsomesDavid11/CA_hw1/blob/main/Bilinear_interpolation.c). ## Assembly Code v1 ### fp32_to_bf16 ```javascript= fp32_to_bf16: mv t0, a0 # exp li t6, 0x7F800000 and t1, t0, t6 # mantissa li t6, 0x007FFFFF and t2, t0, t6 # remove the other bit srli t2, t2, 16 slli t2, t2, 16 # t0 = sign, t1 = exp, t2 = mantissa srli t0, t0, 31 slli t0, t0, 31 or t0, t0, t1 or t0, t0, t2 mv a0, t0 ret ``` ### bf16_mul The ```bf16_mul``` function uses ```imul16``` to calculate the result of multiplying the mantissas. ```javascript= bf16_mul: # input a4, a5 addi sp, sp, -48 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) sw s6, 24(sp) sw s7, 28(sp) sw s8, 32(sp) sw s9, 36(sp) sw s10, 40(sp) sw ra, 44(sp) # a5, a4 input mv s0, a5 mv s1, a4 # mantissa s2, s3 li t0, 0x7fffff li t1, 0x800000 and t2, s0, t0 or t2, t2, t1 srli s2, t2, 16 and t2, s1, t0 or t2, t2, t1 srli s3, t2, 16 # exp s4, s5 li t0, 23 li t1, 0xff srl t2, s0, t0 and s4, t2, t1 srl t2, s1, t0 and s5, t2, t1 add s4, s4, s5 addi s4, s4, -127 mv a0, s2 mv a1, s3 call imul16 mv s6, a0 srli t0, s6, 15 andi s7, t0, 1 beq s7, x0, no_shift srli s6, s6, 1 addi s4, s4, 1 no_shift: srli s6, s6, 7 andi s6, s6, 0x7f li s8, 0 slli s4, s4, 23 slli s6, s6, 16 or s8, s4, s8 or s8, s6, s8 mv a0, s8 lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) lw s4, 16(sp) lw s5, 20(sp) lw s6, 24(sp) lw s7, 28(sp) lw s8, 32(sp) lw s9, 36(sp) lw s10, 40(sp) lw ra, 44(sp) addi sp, sp, 48 ret ``` ### imul16 The loop in ```imul16``` only needs to run 16 times, which is half the number of loop cycles compared to the ```imul``` in a typical ```FP32``` multiplication. This can significantly improve the computation speed of ```BF16```. ```javascript= imul16: addi sp, sp, -16 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw ra, 12(sp) mv s0, a0 mv s1, a1 mv s2, x0 li t2, 16 loop_imul: andi t0, s1, 1 beq t0, x0, no_add slli t1, s0, 16 add s2, s2, t1 no_add: srli s1, s1, 1 addi t2, t2, -1 srli s2, s2, 1 beq t2, x0, done_imul j loop_imul done_imul: mv a0, s2 lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw ra, 12(sp) addi sp, sp, 16 ret ``` ### bf16_add The ```bf16_add``` function uses ```my_clz``` to find how many leading zeros there are, and based on that, ```bf16_add``` determines how to align to the correct position. ```javascript= bf16_add: # input a4, a5 addi sp, sp, -48 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) sw s6, 24(sp) sw s7, 28(sp) sw s8, 32(sp) sw s9, 36(sp) sw s10, 40(sp) sw ra, 44(sp) add s0, a5, x0 add s1, a4, x0 li t0, 0x7fffffff and t1, s0, t0 and t2, s1, t0 blt t2, t1, noswap mv t0, s0 mv s0, s1 mv s1, t0 noswap: # sign s2 s3 srli s2, s0, 31 srli s3, s1, 31 # mantissa s4 s5 li t0, 0x7fffff li t1, 0x800000 and t2, s0, t0 or s4, t2, t1 and t2, s1, t0 or s5, t2, t1 # exp s6 s7 li t0, 23 li t1, 0xff srl t2, s0, t0 and s6, t2, t1 srl t2, s1, t0 and s7, t2, t1 # exp diff s8 sub s8, s6, s7 srl s5, s5, s8 # sub or add or t0, s2, s3 bne t0, x0, sub add s4, s4, s5 j ma_exit sub: sub s4, s4, s5 ma_exit: mv a0, s4 call my_clz # s9 = my_clz output mv s9, a0 mv s10, x0 li t0, 8 blt t0, s9, shift_sub # shift_add li t0, 8 sub s10, t0, s9 srl s4, s4, s10 add s6, s6, s10 j shift_exit shift_sub: li t0, 8 sub s10, s9, t0 sll s4, s4, s10 sub s6, s6, s10 shift_exit: li t0, 0x80000000 and t0, s0, t0 li t1, 23 sll t1, s6, t1 li t2, 0x7fffff and t2, s4, t2 or t0, t0, t1 or a0, t0, t2 lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) lw s4, 16(sp) lw s5, 20(sp) lw s6, 24(sp) lw s7, 28(sp) lw s8, 32(sp) lw s9, 36(sp) lw s10, 40(sp) lw ra, 44(sp) addi sp, sp, 48 ret ``` ### my_clz ```javascript= my_clz: # input a0 li t0, 0 li t1, 31 li t2, 1 clz_loop: sll t3, t2, t1 and t3, a0, t3 bne t3, x0, clz_done addi t0, t0, 1 addi t1, t1, -1 bge t1, x0, clz_loop clz_done: mv a0, t0 ret ``` ### Bilinear_interpolation ```javascript= Bilinear_interpolation: addi sp, sp, -52 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) sw s6, 24(sp) sw s7, 28(sp) sw s8, 32(sp) sw s9, 36(sp) sw s10, 40(sp) sw s11, 44(sp) sw ra, 48(sp) mv s0, a0 mv s1, a1 #-------------------(0,0)-------------------# lw a0, 0(s0) call fp32_to_bf16 mv a4, a0 lw a0, 0(s1) call fp32_to_bf16 mv a5, a0 jal ra, bf16_mul mv s2, a0 #-------------------(2,0)-------------------# lw a0, 4(s0) call fp32_to_bf16 mv a4, a0 lw a0, 0(s1) call fp32_to_bf16 mv a5, a0 jal ra, bf16_mul mv s3, a0 #-------------------(1,0)-------------------# mv a4, s2 mv a5, s3 call bf16_add mv s8, a0 #-------------------(0,2)-------------------# lw a0, 8(s0) call fp32_to_bf16 mv a4, a0 lw a0, 0(s1) call fp32_to_bf16 mv a5, a0 jal ra, bf16_mul mv s2, a0 #-------------------(2,2)-------------------# lw a0, 12(s0) call fp32_to_bf16 mv a4, a0 lw a0, 0(s1) call fp32_to_bf16 mv a5, a0 jal ra, bf16_mul mv s3, a0 #-------------------(1,2)-------------------# mv a4, s2 mv a5, s3 call bf16_add mv s9, a0 #-------------------(0,1)-------------------# mv a0, s8 call fp32_to_bf16 mv a4, a0 lw a0, 0(s1) call fp32_to_bf16 mv a5, a0 jal ra, bf16_mul mv s10, a0 #-------------------(1,1)-------------------# mv a0, s9 call fp32_to_bf16 mv a4, a0 lw a0, 0(s1) call fp32_to_bf16 mv a5, a0 call bf16_mul mv s11, a0 mv a4, s10 mv a5, s11 call bf16_add lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) lw s4, 16(sp) lw s5, 20(sp) lw s6, 24(sp) lw s7, 28(sp) lw s8, 32(sp) lw s9, 36(sp) lw s10, 40(sp) lw s11, 44(sp) lw ra, 48(sp) addi sp, sp, 52 ret ``` :::danger Use fewer instructions. ::: The complete assembly code is as follow [Bilinear_interpolation_v1.s](https://github.com/handsomesDavid11/CA_hw1/blob/main/Bilinear_interpolation_v1.s). ### Test data ``` .data test0: .word 0x3e70a3d7, 0x3e8b4396, 0x3eaa7efa, 0x3f6a7efa # 0.235 0.272 0.333 0.916 test1: .word 0x3eb74bc7, 0x3fb62824, 0x40d8b6ae, 0x4101f5c3 test2: .word 0x3de353f8, 0x40558937, 0x413ad9e8, 0x408a4674 ``` ### Result ``` result1: 0.436523 result2: 7480 result3: 3352 ``` :::danger Do not use screenshots for plain text content, as this is inaccessible to visually impaired users. ::: ### discussion on bilinear interpolation corner case ``` ------------------------- y=0 | 139550| 0.0 | 0.111 | |------------------------ y=1 | 0.0 | 0.0 | 0.0 | |------------------------ y=2 | 0.653 | 0.0 | 0.323 | ------------------------- x=0 x=1 x=2 ``` ``` result:34816.2 ``` In this example, one of the interpolation points is significantly higher than the other three. While interpolation can still proceed, the result may encounter certain issues: ```Distortion of the interpolated value```: The significantly higher point may disproportionately influence the result. When one value is much larger, the symmetry is broken, and the interpolated value is skewed towards the higher point, even if it is equidistant from lower values. As a result, the higher value "dominates" the interpolation, reducing both accuracy and smoothness in the transitions between points. ```Accuracy degradation```: The large disparity between the highest value and the other points can reduce the overall accuracy. The high value strongly affects the result, making the interpolation less representative of the other data points. In this case, we can see that the result is heavily influenced by the significantly higher point at ```(0,0)```, with little to no impact from the other three points. The precision of the other three values could be calculated up to the third decimal place, but after interpolation, the result is only shown to the first decimal, greatly reducing the accuracy. ## Improvement Although BF16 multiplication uses imul16 to reduce the loop cycles by half, during the actual design, it was found that the mantissa is only 8 bits. In fact, using imul8 can complete the mantissa calculation, and the loop cycles for imul8 would be halved again, further improving the computation speed. ### imul8 assembly ```javascript= imul8: addi sp, sp, -16 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw ra, 12(sp) mv s0, a0 mv s1, a1 mv s2, x0 li t2, 8 loop_imul: andi t0, s1, 1 beq t0, x0, no_add slli t1, s0, 8 add s2, s2, t1 no_add: srli s1, s1, 1 addi t2, t2, -1 srli s2, s2, 1 beq t2, x0, done_imul j loop_imul done_imul: mv a0, s2 lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw ra, 12(sp) addi sp, sp, 16 ret ``` The modified assembly code is here [Bilinear_interpolation_v2.s](https://github.com/handsomesDavid11/CA_hw1/blob/main/Bilinear_interpolation_v2.s). ### The result of the first version ![螢幕擷取畫面 2024-10-12 160126](https://hackmd.io/_uploads/SyJy62vykl.png) cycles = 7134 ### Second version ![image](https://hackmd.io/_uploads/B1wbuM_ykg.png) Cycles = 5664 The results indicate that modifying the code can reduce 7134 - 5664 = 1470 cycles. ## Analysis This program is simulation on the 5-stage pipelined processor which includes IF,ID,EX,MEM and WB stages. **Instruction Fetch**: Retrieves the instruction to be executed from memory. **Instruction Decode:** Decodes the fetched instruction to determine what the instruction is supposed to do during this cycle. **Execute**: The stage where various computations are performed, depending on the specific instruction. **Memory Access**: The stage where data can be either written to or read from memory. **Register Write-back**: Writes the result of the instruction back to the registers. This pipelining technique is used to improve their performance by allowing multiple instructions to be in various stages of execution simultaneously like the example below. ![image](https://hackmd.io/_uploads/HJdvrQu1kl.png) Let's use the instruction ```addi x6 x0 255``` as an example to explain how it works in a 5-stage RISC-V CPU. ![image](https://hackmd.io/_uploads/Bk0VoXOy1l.png) ### IF stage * ```addi x6 x0 255```instruction is located at address 0x00000280 in the instruction memory. * mechine code about ```addi x6 x0 255``` is 0x0ff00313. * I-type Format: ``` | 31 20 | 19 15 | 14 12 | 11 07 | 06 00 | +----------------------+---------+------------+--------+------------+ | immediate[11:0] | rs1 | funct3 | rd | opcode | +----------------------+---------+------------+--------+------------+ | 000011111111 | 00000 | 000 | 00110 | 010011 | +----------------------+---------+------------+--------+------------+ ``` * The opcode is 010011 and the funct3 is 000, indicating an I-type instruction. * The source register (rs1) is x0, the destination register (rd) is x6, and the immediate value (imm) is 255. ![image](https://hackmd.io/_uploads/SkhWpXO1yl.png) ### ID stage * R1 idx = 0x00 * rd = 0x06 * imm = 0x000000ff ![image](https://hackmd.io/_uploads/rkRIH0F1kx.png) ![image](https://hackmd.io/_uploads/SyFAFm9yye.png) ### EXE stage * Op1 = ```0x00000000``` from x0 * Op2 = ```0x000000ff``` from Imm * Res = ```0x00000002``` (Op1 + Op2) ![image](https://hackmd.io/_uploads/SJ1crAYkJl.png) ### MEM stage * I-type instr don't need to read or write memory , pass Res(0x00000006) through this stage and go to WB stage. ![image](https://hackmd.io/_uploads/rkb6H0tykx.png) ### WB stage * The multiplexer select Res from ALU as final output. The output value is ```0x000000ff```. * The output value and rd are send back to registers block. Finally, the value ```0x000000ff```will be write into x6 register. ![image](https://hackmd.io/_uploads/HyGRHAty1e.png) * After all these stage are done, the register is updated. ```x6``` = ```0x000000ff``` ![image](https://hackmd.io/_uploads/BkZm375kkx.png) ## Reference * [2024 Quiz 1 Problem B](https://hackmd.io/@sysprog/arch2024-quiz1-sol) * [2024 Quiz 1 Problem C](https://hackmd.io/@sysprog/arch2024-quiz1-sol) * [Bilinear interpolation - Wikipedia](https://en.wikipedia.org/wiki/Bilinear_interpolation)