# Assignment 1: RISC-V Assembly and Instruction Pipeline

contributed by < [`jimmylu890303`](https://github.com/jimmylu890303) >

## Implementations of Float32 multiplication & BFloat16 multiplication

### 1. Format of Float32 and BFloat16

| Type     | sign bit | exponent bits | significand bits |
| -------- |:--------:|:-------------:|:----------------:|
| Float32  | 1        | 8             | 23               |
| BFloat16 | 1        | 8             | 7                |

- Disadvantage: BFloat16 has only 7 significand bits, so it offers lower precision than Float32.
- Advantage: BFloat16 halves storage requirements and speeds up machine-learning workloads, while keeping the same 8-bit exponent range as Float32.

### 2. Difference in Design When Implementing Multiplication

A simple algorithm for IEEE 754 32-bit multiplication involves the following steps:

1. Check if either operand (A or B) is zero (early exit).
2. Check for potential exponent overflow and raise the corresponding overflow error.
3. Compute the sign: Csign = Asign XOR Bsign.
4. Compute the exponent: Cexponent = Aexponent + Bexponent - 127.
5. Compute the mantissa: Cmantissa = Amantissa * Bmantissa (24-bit integer multiplication, including the implicit leading 1) and round the result according to the current rounding mode.
6. If Cmantissa overflows, normalize the result.

When multiplying two Float32 numbers, step 5 **needs two registers** to store the result of Amantissa * Bmantissa, and the shift-and-add multiplication must **iterate 24 times** (24 bits * 24 bits, where 1 bit is the implicit integer part and 23 bits are the mantissa). With BFloat16, step 5 **requires only one register** for the product, and the multiplication needs to **iterate just 8 times** (8 bits * 8 bits, where 1 bit is the implicit integer part and 7 bits are the mantissa).
In conclusion, BFloat16 is **more hardware-friendly (requiring only one register)** and **more time-efficient (fewer iterations)** when performing multiplication.

### 3. Implement Float32 multiplication with C

- The code below is an implementation of float multiplication.

```code=
#include <stdint.h>

int64_t unsign32_mul(int32_t A, int32_t B);

float float_mul(float A, float B)
{
    int32_t *A_int_ptr = (int32_t *) &A;
    int32_t *B_int_ptr = (int32_t *) &B;

    /* Check if either of the operands (A and B) is zero */
    if (*A_int_ptr == (int32_t) 0x80000000 || *A_int_ptr == 0x00000000)
        return A;
    if (*B_int_ptr == (int32_t) 0x80000000 || *B_int_ptr == 0x00000000)
        return B;

    /* sign bit */
    int32_t sign = 0x80000000;
    int32_t a_sign = *A_int_ptr & sign;
    int32_t b_sign = *B_int_ptr & sign;
    int32_t r_sign = a_sign ^ b_sign;

    /* exponent bits */
    int32_t expo = 0x7F800000;
    int32_t a_exp = *A_int_ptr & expo;
    int32_t b_exp = *B_int_ptr & expo;
    int32_t bias = 0x3F800000;
    int32_t r_exp = a_exp + b_exp - bias;

    /* mantissa bits (restore the implicit leading 1) */
    int32_t man = 0x007FFFFF;
    int32_t a_man = (*A_int_ptr & man) | 0x00800000;
    int32_t b_man = (*B_int_ptr & man) | 0x00800000;
    int64_t r = unsign32_mul(a_man, b_man);

    /* normalization */
    int64_t needNorm = r & 0x800000000000LL;
    r = r >> 23;
    int64_t r_man = r & 0x7FFFFF;
    if (needNorm) {
        r_exp = r_exp + 0x00800000; /* bump the exponent field by 1 */
        r = r >> 1;
        r_man = r & 0x7FFFFF;
    }
    int32_t result = r_sign + r_exp + (int32_t) r_man;
    return *(float *) &result;
}
```

- The code below is an implementation of the mantissa multiplication used in the floating-point multiplication above.

```code=
int64_t unsign32_mul(int32_t A, int32_t B)
{
    int32_t LSB = 1;
    int64_t multiplicand = (int64_t) A;
    int64_t result = 0;
    int loopCounter = 24;
    while (loopCounter > 0) {
        int32_t lsb = B & LSB;
        /* if lsb == 1, do addition */
        if (lsb)
            result = result + multiplicand;
        multiplicand = multiplicand << 1;
        B = B >> 1;
        loopCounter--;
    }
    return result;
}
```

From the above code, we can see that a 64-bit variable is needed to store the product of two 24-bit mantissas.
We also need to iterate 24 times to complete the multiplication.

- The assembly implementation of float multiplication can be found on [my GitHub](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw1/FloatMul.s).

### 4. Implement BFloat16 multiplication with C

- The code below is an implementation of bfloat multiplication.

```code=
#include <stdint.h>

int32_t unsign8_mul(int32_t A, int32_t B);

float bfloat_mul(float A, float B)
{
    int32_t *A_int_ptr = (int32_t *) &A;
    int32_t *B_int_ptr = (int32_t *) &B;

    /* Check if either of the operands (A and B) is zero */
    if (*A_int_ptr == (int32_t) 0x80000000 || *A_int_ptr == 0x00000000)
        return A;
    if (*B_int_ptr == (int32_t) 0x80000000 || *B_int_ptr == 0x00000000)
        return B;

    /* sign bit */
    int32_t sign = 0x80000000;
    int32_t a_sign = *A_int_ptr & sign;
    int32_t b_sign = *B_int_ptr & sign;
    int32_t r_sign = a_sign ^ b_sign;

    /* exponent bits */
    int32_t expo = 0x7F800000;
    int32_t a_exp = *A_int_ptr & expo;
    int32_t b_exp = *B_int_ptr & expo;
    int32_t bias = 0x3F800000;
    int32_t r_exp = a_exp + b_exp - bias;

    /* mantissa bits (restore the implicit leading 1) */
    int32_t man = 0x007F0000;
    int32_t a_man = ((*A_int_ptr & man) | 0x00800000) >> 16;
    int32_t b_man = ((*B_int_ptr & man) | 0x00800000) >> 16;
    int32_t r = unsign8_mul(a_man, b_man);

    /* normalization */
    int32_t needNorm = r & 0x00008000;
    r = r >> 7;
    int32_t r_man = r & 0x7F;
    if (needNorm) {
        r_exp = r_exp + 0x00800000; /* bump the exponent field by 1 */
        r = r >> 1;
        r_man = r & 0x7F;
    }
    r_man = r_man << 16;
    int32_t result = r_sign + r_exp + r_man;
    return *(float *) &result;
}
```

- The code below is an implementation of the mantissa multiplication used in the BFloat16 multiplication above.
```code=
int32_t unsign8_mul(int32_t A, int32_t B)
{
    int32_t LSB = 1;
    int32_t multiplicand = A;
    int32_t result = 0;
    int loopCounter = 8;
    while (loopCounter > 0) {
        int32_t lsb = B & LSB;
        /* if lsb == 1, do addition */
        if (lsb)
            result = result + multiplicand;
        multiplicand = multiplicand << 1;
        B = B >> 1;
        loopCounter--;
    }
    return result;
}
```

From the above code, we can see that a single 32-bit variable is enough to store the product of two 8-bit mantissas, and only 8 iterations are needed to complete the multiplication.

- The assembly implementation of bfloat multiplication can be found on [my GitHub](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw1/BFloatMul.s).

### 5. Compare and analyze the results of Float32 multiplication and BFloat16 multiplication

##### The number of cycles required to execute the multiplication

- Float32 multiplication: [GitHub branch](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/float32-mul-branch/assignment/hw1/main.s)
  ![](https://hackmd.io/_uploads/ByZGLQhe6.png)
- BFloat16 multiplication (using pre-defined data): [GitHub branch](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/bfloat16-mul/assignment/hw1/main.s)
  ![](https://hackmd.io/_uploads/r1RZc73lp.png)
- Convert two Float32 numbers to BFloat16 and then perform BFloat16 multiplication (by invoking the fp32_to_bf16 function twice): [GitHub branch](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/convert-float-and-bloat16-mul/assignment/hw1/main.s)
  ![](https://hackmd.io/_uploads/S1CDL7nx6.png)

We can clearly see that using BFloat16 for multiplication yields a significant speed-up.

##### The precision of the results of multiplication
- The result of Float32 multiplication
  ![](https://hackmd.io/_uploads/H1zE8m2eT.png)
  0xc5664300 = -3684.1875
- The result of BFloat16 multiplication
  ![](https://hackmd.io/_uploads/HyHtdX3xp.png)
  0xc5660000 = -3680.0

Although we sacrifice some computational precision, the performance and storage benefits make the trade-off worthwhile.

## Implementation of Quiz1 B problem

- C code

```code=
float fp32_to_bf16(float x)
{
    float y = x;
    int *p = (int *) &y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    if (exp == 0 && man == 0) /* zero */
        return x;
    if (exp == B01 /* Fill this! */) /* infinity or NaN */
        return x;
    /* Normalized number */
    /* round to nearest */
    float r = x;
    int *pr = (int *) &r;
    *pr &= 0xFF800000; /* r has the same exp as x */
    r /= B02 /* Fill this! */;
    y = x + r;
    *p &= 0xFFFF0000;
    return y;
}
```

- Assembly code

```code=
fp32_to_bf16:
    # s0 = y = x
    # s1 = exp
    # s2 = man
    addi sp, sp, -20
    sw ra, 0(sp)
    sw a0, 4(sp)
    sw s0, 8(sp)
    sw s1, 12(sp)
    sw s2, 16(sp)

    # Load y = x into s0
    mv s0, a0

    # Get the actual exp bits in s1
    # Load exp mask
    la s1, exp_mask
    lw s1, 0(s1)
    # Get exp bits
    and s1, s1, s0

    # Get the actual man bits in s2
    # Load man mask
    la s2, man_mask
    lw s2, 0(s2)
    # Get man bits
    and s2, s2, s0

    # /* zero */
    # Check if exp == 0: t0 = 1, else t0 = 0
    mv a0, s1
    jal ra, checkZero
    mv t0, a0
    # Check if man == 0: t1 = 1, else t1 = 0
    mv a0, s2
    jal ra, checkZero
    mv t1, a0
    # If (exp == 0 and man == 0), return x
    and t2, t0, t1
    bnez t2, returnX

    # /* infinity or NaN */
    la t0, nan_mask
    lw t0, 0(t0)
    beq s1, t0, returnX

    # /* Normalized number */
    # /* round to nearest */
    # r = r / 256
    la t0, div_mask
    lw t0, 0(t0)
    # y = x + r
    add s0, s0, t0
    # result
    la t0, result_mask
    lw t0, 0(t0)
    and s0, s0, t0

    mv a0, s0 # return value
    lw s0, 8(sp)
    lw s1, 12(sp)
    lw s2, 16(sp)
    lw ra, 0(sp)
    addi sp, sp, 20
    ret
```

## Full implementation of my project

Follow [my GitHub](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw1/main.s) to check my full implementation.
## Analysis

The code was tested using the [Ripes simulator](https://github.com/mortbopet/Ripes/releases). Here are the first few lines of the disassembled executable code:

```code=
00000000 <main>:
     0:    170000ef    jal x1 368 <test>
     4:    10000517    auipc x10 0x10000
     8:    06850513    addi x10 x10 104
     c:    00400893    addi x17 x0 4
    10:    00000073    ecall
    14:    10000517    auipc x10 0x10000
    18:    07250513    addi x10 x10 114
    1c:    00400893    addi x17 x0 4
    20:    00000073    ecall
    24:    42488537    lui x10 0x42488
    28:    c29305b7    lui x11 0xc2930
    2c:    2cc000ef    jal x1 716 <float32_mul>
    30:    02200893    addi x17 x0 34
    34:    00000073    ecall
    38:    10000517    auipc x10 0x10000
    3c:    03450513    addi x10 x10 52
    40:    00400893    addi x17 x0 4
    44:    00000073    ecall
    48:    42488537    lui x10 0x42488
    4c:    04c000ef    jal x1 76 <fp32_to_bf16>
    50:    00050413    addi x8 x10 0
    54:    c2930537    lui x10 0xc2930
    58:    040000ef    jal x1 64 <fp32_to_bf16>
    5c:    00050493    addi x9 x10 0
    60:    10000517    auipc x10 0x10000
    64:    04450513    addi x10 x10 68
    68:    00400893    addi x17 x0 4
    6c:    00000073    ecall
    70:    00040513    addi x10 x8 0
    74:    00048593    addi x11 x9 0
    78:    400000ef    jal x1 1024 <bfloat16_mul>
    7c:    02200893    addi x17 x0 34
    80:    00000073    ecall
    84:    10000517    auipc x10 0x10000
    88:    fe850513    addi x10 x10 -24
    8c:    00400893    addi x17 x0 4
    90:    00000073    ecall
    94:    5300006f    jal x0 1328 <End>

00000170 <test>:
   170:    ffc10113    addi x2 x2 -4
   174:    00112023    sw x1 0 x2
```

### Instruction Pipeline

Let's take a look at how the `jal x1 368 <test>` instruction at address `0x0` works during execution.

#### IF Stage

- At the beginning, PC is `0x0`, meaning that we are going to fetch the instruction at `0x0`, `jal x1 368 <test>`.
- PC will be updated to `0x4` in the next cycle.
- From instruction memory, we get the instruction's machine code (`0x170000ef`).

![](https://hackmd.io/_uploads/S1TpsW2x6.png)

#### ID Stage

- The machine code `0x170000ef` is decoded by the decoder.
- `0x00 (zero)` and `0x10 (a0)` are sent to the register circuit to read values.
- `0x01 (ra)` will be sent to the next stage.
- Since the opcode corresponds to `jal`, the offset of the label `test` is decoded, and its value is `0x170`.

![](https://hackmd.io/_uploads/SJHP2Z2lp.png)

#### EX Stage

- The ALU takes `0x0 (PC)` and `0x170` as inputs and sums them to obtain the target address `0x170` (label: `test`).
- `a0` and `ra` will be sent to the next stage.

![](https://hackmd.io/_uploads/r1hKx72e6.png)

#### MEM Stage

- The 2 instructions after the `jal` are flushed to NOPs.
- The IF stage loads the new instruction at the address `test+4`.
- `0x4` (the return address) and `0x1` (the `ra` index) will be sent to the next stage.

![](https://hackmd.io/_uploads/H15kzXhep.png)

#### WB Stage

- Write `0x4` (the return address) into `0x1` (the `ra` register).

![](https://hackmd.io/_uploads/Hk-CMQnx6.png)

## Assembly Optimization

I compare the performance of the optimized code with the [previous code](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/convert-float-and-bloat16-mul/assignment/hw1/main.s), which converts two Float32 numbers to BFloat16 and then performs BFloat16 multiplication (by invoking the fp32_to_bf16 function twice for conversion). The complete optimized code is in [main_optimized.s](https://github.com/jimmylu890303/2023_Computer_Architecture/blob/main/assignment/hw1/main_optimized.s).
### Optimize fp32_to_bf16 (by reducing jumps)

- A segment from the original version:

```code=
# /* zero */
# Check if exp == 0: t0 = 1, else t0 = 0
mv a0, s1
jal ra, checkZero
mv t0, a0
# Check if man == 0: t1 = 1, else t1 = 0
mv a0, s2
jal ra, checkZero
mv t1, a0
# If (exp == 0 and man == 0), return x
and t2, t0, t1
bnez t2, returnX
```

- After reducing the jump instructions:

```code=
# /* zero */
or t0, s1, s2
# If (exp == 0 and man == 0), return x
beqz t0, returnX
```

- The performance before optimization
![](https://hackmd.io/_uploads/S1CDL7nx6.png)
- The performance after optimization
![](https://hackmd.io/_uploads/H1uukB2lT.png)

We can observe that there is no longer any need to flush the two instructions after each `jal ra, checkZero`, resulting in a reduction in the total number of cycles.

### Optimize BFloat16 multiplication (using loop unrolling)

- A segment from the original version:

```code=
# Loop to perform multiplication
bf16_mul_loop:
    mv t3, s0               # multiplicand
    mv t4, s1               # multiplier
    andi t5, t4, 1          # extract LSB from multiplier
    beqz t5, bf16_no_add    # if LSB = 0, skip addition
    sll t5, t3, t6          # left shift multiplicand by n bits (t6)
    add t0, t0, t5          # lower 32 bits += shifted multiplicand
bf16_no_add:
    srli s1, s1, 1          # right shift multiplier
    addi t6, t6, 1          # shift counter++
    addi t2, t2, -1         # decrement the loop counter
    bnez t2, bf16_mul_loop  # continue until all bits are processed
```

- After loop unrolling:

```code=
# Loop to perform multiplication
bf16_mul_loop:
    mv t3, s0               # multiplicand
    mv t4, s1               # multiplier
    andi t5, t4, 1          # extract LSB from multiplier
    beqz t5, bf16_no_add    # if LSB = 0, skip addition
    sll t5, t3, t6          # left shift multiplicand by n bits (t6)
    add t0, t0, t5          # lower 32 bits += shifted multiplicand
bf16_no_add:
    srli s1, s1, 1          # right shift multiplier
    addi t6, t6, 1          # shift counter++
    mv t3, s0               # multiplicand
    mv t4, s1               # multiplier
    andi t5, t4, 1          # extract LSB from multiplier
    beqz t5, bf16_no_add1   # if LSB = 0, skip addition
    sll t5, t3, t6          # left shift multiplicand by n bits (t6)
    add t0, t0, t5          # lower 32 bits += shifted multiplicand
bf16_no_add1:
    srli s1, s1, 1          # right shift multiplier
    addi t6, t6, 1          # shift counter++
    addi t2, t2, -2         # decrement the loop counter by 2
    bnez t2, bf16_mul_loop  # continue until all bits are processed
```

- The performance before optimization
![](https://hackmd.io/_uploads/H1uukB2lT.png)
- The performance after optimization
![](https://hackmd.io/_uploads/ryCUjE2e6.png)

We can observe that loop unrolling halves the number of taken loop branches and therefore reduces the number of flushed instructions.