Computer Architecture HW1

# Computer Architecture HW1: Rectangle Area ## Quiz 1 Problem B Inspired by quiz1 problem B and Leetcode problem [223. Rectangle Area ](https://leetcode.com/problems/rectangle-area/description/) . We reduce the data type of input from fp32 to Bfloat16 to reduce storage requirements and enhance the calculation speed from quiz 1 problem B we can easly convert fp32 data type to bf16 data type ```c= static inline bf16_t fp32_to_bf16(float s) { bf16_t h; union { float f; uint32_t i; } u = {.f = s}; if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */ h.bits = (u.i >> 16) | 64; /* force to quiet */ return h; } h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10; return h; } ``` and I implement it in RV32I assembly ```asm= fp32_to_bf16: # copy s to t0 # t1 is 0x7fffffff mv t0, a0 li t1, 0x7fffffff and t4, t0, t1 li t5, 0x7f800000 bgt t4, t5, handle_nan # ? (u.as_bits & 0x7fffffff) > 0x7f800000 # h = (u.as_bits + (0x7fff + ((u.as_bits >> 16) & 1))) >> 16 srli t2, t0, 16 # t2 = (u.as_bits >> 16) andi t2, t2, 1 # t2 = ((u.as_bits >> 16) & 1) li t3, 0x7fff add t2, t2, t3 # t2 = (0x7fff + ((u.as_bits >> 16) & 1)) add t2, t2, t0 # t2 = (u.as_bits + (0x7fff + ((u.as_bits >> 16) & 1))) srli t2, t2, 16 mv a0, t2 ret # (u.as_bits & 0x7fffffff) > 0x7f800000 handle_nan: # s is in t0 srli t2, t0, 16 #(u.as_bits >> 16) ori t2, t2, 64 #(u.as_bits >> 16) | 64; /* force to quiet */ mv a0, t2 ret ``` ### FP32 vs BF16 ![圖片](https://hackmd.io/_uploads/HJ47NCo11e.png) ![圖片](https://hackmd.io/_uploads/BkZ-4Ask1g.png) Advantages of using BF16 over FP32: 1. Reduced Memory Usage: BF16 uses only 16 bits compared to FP32's 32 bits, which significantly reduces memory consumption. This is particularly beneficial in environments with limited memory resources, such as edge devices or GPUs with smaller memory capacity. 2. Faster Computation: With smaller bit width, BF16 operations can often be executed faster than FP32 on hardware that supports BF16 arithmetic. This can lead to improved performance, especially in large-scale matrix multiplications or deep learning inference tasks. 3. Lower Power Consumption: Due to the smaller data size, BF16 computations typically require less energy than FP32. This is advantageous for power-sensitive applications, such as mobile devices and embedded systems. 4. Hardware Efficiency: Modern processors, GPUs, and AI accelerators increasingly support BF16, making it more attractive for deep learning workloads. With BF16, you can leverage more efficient hardware resources, enabling better parallelism and throughput. 5. Sufficient Precision for Many Applications: In many cases, the precision provided by BF16 is sufficient for training and inference in deep learning models. BF16 keeps the same exponent range as FP32, meaning it can represent very large and very small numbers, which is crucial for certain algorithms. ## Rectangle Area The source code would be in [simple_rectangle.c and simple rectangle.s](https://github.com/rhway666/Computer-Architecture-HW/tree/main/hw1) with some function test for bf16 operations. I've also tried to implement the original leetcode 223. Rectangle Area in recnew.c and rec_new.s but still debugging, I also got the idea of matye we can implemnt the BF16 operation in a SIMD way, but currently it stays in the future work. ![圖片](https://hackmd.io/_uploads/rJ9QSd5Jyx.png) I changed the task to calculate the area of a rectangle by providing the FP32 coordinates of the top-right and bottom-left corners. The C code is as follows, and I provided three sets of test data. ```c= #include <stdio.h> #include <stdint.h> #include <math.h> // define bfloat16（BF16）struct typedef uint16_t bf16_t; typedef union { uint32_t as_bits; float as_value; } float_bits_union; static inline float bits_to_fp32(uint32_t w) { float_bits_union fp32 = {.as_bits = w}; return fp32.as_value; } static inline uint32_t fp32_to_bits(float f) { float_bits_union fp32 = {.as_value = f}; return fp32.as_bits; } static inline bf16_t fp32_to_bf16(float s) { bf16_t h; float_bits_union u = {.as_value = s}; if ((u.as_bits & 0x7fffffff) > 0x7f800000) { /* NaN */ h = (u.as_bits >> 16) | 64; /* force to quiet */ return h; } h = (u.as_bits + (0x7fff + ((u.as_bits >> 16) & 1))) >> 16; return h; } static inline float bf16_to_fp32(bf16_t h) { float_bits_union u = {.as_bits = (uint32_t)h << 16}; return u.as_value; } // BF16 addition bf16_t bf16_add(bf16_t a, bf16_t b) { float fa = bf16_to_fp32(a); float fb = bf16_to_fp32(b); float result = fa + fb; return fp32_to_bf16(result); } // BF16 subtraction bf16_t bf16_sub(bf16_t a, bf16_t b) { float fa = bf16_to_fp32(a); float fb = bf16_to_fp32(b); float result = fa - fb; return fp32_to_bf16(result); } // BF16 multiplication bf16_t bf16_mul(bf16_t a, bf16_t b) { float fa = bf16_to_fp32(a); float fb = bf16_to_fp32(b); float result = fa * fb; return fp32_to_bf16(result); } float compute_area(float ax1, float ay1, float ax2, float ay2) { bf16_t bax1 = fp32_to_bf16(ax1); //s0 bf16_t bay1 = fp32_to_bf16(ay1); //s1 bf16_t bax2 = fp32_to_bf16(ax2); //s2 bf16_t bay2 = fp32_to_bf16(ay2); //s3 bf16_t width = bf16_sub(bax2, bax1); bf16_t highet = bf16_sub(bay2, bay1); bf16_t area = bf16_mul(width, highet); printf("area in bf16 = %x \n", area); return bf16_to_fp32(area); } void print_fp32_hex(float ax1, float ay1, float ax2, float ay2) { printf("Rectangle A:\n"); printf(" (ax1, ay1): 0x%08X, 0x%08X\n", fp32_to_bits(ax1), fp32_to_bits(ay1)); printf(" (ax2, ay2): 0x%08X, 0x%08X\n", fp32_to_bits(ax2), fp32_to_bits(ay2)); } int main() { float ax1 = -8.0f, ay1 = -8.0f, ax2 = 8.0f, ay2 = 8.0f; print_fp32_hex(ax1, ay1, ax2, ay2); float result1 = compute_area(ax1, ay1, ax2, ay2); printf("Total Area in fp32 : %.6f\n", result1); printf("Total Area in fp32 hex : 0x%08X \n", fp32_to_bits(result1)); float bx1 = -4.123456f, by1 = -5.841368f, bx2 = 17.666666f, by2 = 1.022222f; print_fp32_hex(bx1, by1, bx2, by2); float result2 = compute_area(bx1, by1, bx2, by2); printf("Total Area in fp32 : %.6f\n", result2); printf("Total Area in fp32 hex : 0x%08X\n", fp32_to_bits(result2)); float cx1 = -4.0f, cy1 = -4.0f, cx2 = 4.0f, cy2 = 4.0f; print_fp32_hex(cx1, cy1, cx2, cy2); float result3 = compute_area(cx1, cy1, cx2, cy2); printf("Total Area in fp32 : %.6f\n", result3); printf("Total Area in fp32 hex: 0x%08X\n", fp32_to_bits(result3)); return 0; } ``` ```output= Rectangle A: (ax1, ay1): 0xC1000000, 0xC1000000 (ax2, ay2): 0x41000000, 0x41000000 area in bf16 = 4380 Total Area in fp32 : 256.000000 Total Area in fp32 hex : 0x43800000 Rectangle B: (ax1, ay1): 0xC083F35A, 0xC0BAEC7D (ax2, ay2): 0x418D5555, 0x3F82D82C area in bf16 = 4312 Total Area in fp32 : 150.000000 Total Area in fp32 hex : 0x43140000 Rectangle C: (ax1, ay1): 0xC0800000, 0xC0800000 (ax2, ay2): 0x40800000, 0x40800000 area in bf16 = 4280 Total Area in fp32 : 64.000000 Total Area in fp32 hex: 0x42800000 ``` To meet the assignment requirements, I need to simulate floating-point arithmetic using only the RV32I instruction set. The following functions are required: 1. BF16 Addition Simulation: This function will simulate the addition of two BF16 numbers using RV32I integer instructions. 2. BF16 Subtraction Simulation: This function will simulate the subtraction of two BF16 numbers using RV32I integer instructions. 3. BF16 Multiplication Simulation: This function will simulate the multiplication of two BF16 numbers using RV32I integer instructions. 4. BF16 Comparison Simulation: This function will simulate the comparison of two BF16 numbers (greater than, less than, or equal) using RV32I integer instructions. ### bf16_mul in RV32I ``` asm bf16_mul: addi sp, sp, -16 # allocate 4 s register sw s2, 0(sp) # save s2 sw s3, 4(sp) # save s3 sw s4, 8(sp) # save s4 sw s5, 12(sp) # save s5 mv t0, a0 mv t1, a1 # sign bit srli t2, t0, 15 srli t3, t1, 15 xor t2 ,t2, t3 #signbit 在t2 # exp srli t3, t0, 7 srli t4, t1, 7 andi t3, t3, 0xff # mask sign bit andi t4, t4, 0xff add t3, t3, t4 addi t3, t3, -127 #(e1 - 127) + (e2 - 127) shift twice 127 # t3 = exp # frac andi t4, t0, 0x7f andi t5, t1, 0x7f ori t4, t4, 0x80 # add the 1. back ori t5, t5, 0x80 addi t6, x0, 0 # ans addi s2, x0, 8 # iter 7 times frac_mul_loop: beqz s2, end_frac_mul_loop andi s3, t4, 1 # lsb beqz s3, skip_add # 0->skip add add t6, t6, t5 # t6: mul result skip_add: slli t5, t5, 1 # baychensu srli t4, t4, 1 addi s2, s2, -1 # iter -1 j frac_mul_loop end_frac_mul_loop: srli s4, t6, 15 #check highest bit = 1? andi s4, s4, 0x1 #highest bit in s4 beqz s4, highest_bit_zero srli s4, t6, 8 # highestbit is one 15~9 bit addi t3, t3, 1 # exp + 1 andi s4, s4, 0x7f #final frac in s4 j combine_result highest_bit_zero: # 14~8 bit srli s4, t6, 7 #shift 7bit andi s4, s4, 0x7f #final frac in s4 combine_result: # signbit at t2, exp at t3, frac at s4 slli t2, t2, 15 slli t3, t3, 7 or t3, t3, t2 or t3, t3, s4 #final result in t3 lw s2, 0(sp) # s2 lw s3, 4(sp) # s3 lw s4, 8(sp) # s4 lw s5, 12(sp) # s5 addi sp, sp, 16 mv a0, t3 ret ``` ### bf16_add ```asm= bf16_add: addi sp, sp, -36 # allocate space for 4 saved registers (s2 - s6) sw s2, 0(sp) # save s2 sw s3, 4(sp) # save s3 sw s4, 8(sp) # save s4 sw s5, 12(sp) # save s5 sw s6, 16(sp) sw s7, 20(sp) sw s8, 24(sp) sw s9, 28(sp) sw s10, 32(sp) mv t0, a0 # t0 = bf16_a mv t1, a1 # t1 = bf16_b # Step 1: get signbit srli t2, t0, 15 # signbit t0 -> t2 srli t3, t1, 15 # signbit t1 -> t3 # Step 2: get exp srli s2, t0, 7 # s2 = a exp andi s2, s2, 0xFF srli s3, t1, 7 # s3 = b exp andi s3, s3, 0xFF # step3: get frac andi t4, t0, 0x7F # t4 = a frac ori t4, t4, 0x80 andi t5, t1, 0x7F # t5 = b frac ori t5, t5, 0x80 compare_exp: # align small frac bge s2, s3, align_b # align a frac sub t6, s3, s2 # t6 = exp_b - exp_a (b > a) mv s6, s3 # s6 = copy big exp(b) mv s7, t3 # final sign bit in s7 same with b srl t4, t4, t6 # align a frac j sign_check align_b: beq s2, s3, exp_equal sub t6, s2, s3 # t6 = exp_a - exp_b (a > b) mv s6, s2 # s6 = copy big exp(a) mv s7, t2 # final sign bit in s7 same with a srl t5, t5, t6 # align b frac j sign_check exp_equal: mv s6, s2 # s6 = copy big exp(a) or exp b is both ok # compare frac blt t4, t5, b_frac_bigger beq t4, t5, same_num # a > b mv s7, t2 # final sign bit in s7 same with a j sign_check same_num: beq t2, t3 , add_name_num # sub_name_num -> ans = 0 mv a0, x0 lw s2, 0(sp) lw s3, 4(sp) lw s4, 8(sp) lw s5, 12(sp) lw s6, 16(sp) lw s7, 20(sp) lw s8, 24(sp) lw s9, 28(sp) lw s10, 32(sp) addi sp, sp, 36 ret add_name_num: mv s7, t2 # final sign bit in s7 same with a or b is both ok j sign_check b_frac_bigger: mv s7, t3 # final sign bit in s7 same with b # a frac in t4 # b frac in t5 # bigger exp in s6 sign_check: xor s10, t2, t3 # s10 = t2 ^ t3 (check sign bit same?) beqz s10, add_mantissas # branch if the same sub_mantissas: # align small frac bge s2, s3, a_minus_b # s2>=s3 sub t4, t5, t4 # frac_b -aligned_frac_a j count_leading_zero a_minus_b: beq s2, s3, frac_minus_on_same_exp sub t4, t4, t5 # frac_a -aligned_frac_b j count_leading_zero frac_minus_on_same_exp: bge t4, t5 frac_a_greater # a>=b # a < b sub t4, t5, t4 # frac_b -aligned_frac_a j count_leading_zero frac_a_greater: beq t4, t5, zero # a = b # a > b sub t4, t4, t5 # frac_a -aligned_frac_b j count_leading_zero zero: # sub_name_num -> ans = 0 mv a0, x0 lw s2, 0(sp) lw s3, 4(sp) lw s4, 8(sp) lw s5, 12(sp) lw s6, 16(sp) lw s7, 20(sp) lw s8, 24(sp) lw s9, 28(sp) lw s10, 32(sp) addi sp, sp, 36 ret count_leading_zero: li s5, 0 # s5 count highest bit li s9, 7 # Loop from bit 7 to bit 0 check_bit_loop: srl t6, t4, s9 # Shift t4 right by s9 to check bit s9 andi t6, t6, 0x1 # Isolate the bit bnez t6, found_highest_frac # If the bit is 1, branch to found_highest_frac addi s5, s5, 1 # Increment shift count if bit is not 1 addi s9, s9, -1 # Decrement bit position bgez s9, check_bit_loop # Continue loop if s9 >= 0 found_highest_frac: # s6 = copy big exp(a) # final sign bit in s7 same with a # t4 frac sub s6, s6, s5 sll t4, t4, s5 mv s4, t4 # move final frac to s4 # Combine the final result # s7: sign bit, s6: exponent, t4: fraction j combine_bit add_mantissas: # a frac in t4 # b frac in t5 # bigger exp in s6 # t4 = frac a add s4, t4, t5 # s4 = frac add mv s7, t2 #final sign bit in s7 # Step 4: 正規化 srli s5, s4, 8 # check 9bit bnez s5, normalize_right j combine_bit normalize_right: srli s4, s4, 1 # frac shift addi s6, s6, 1 # exp 加 1 combine_bit: # s7: sign bit, s6: exponent, t4: fraction # sign bit is t2 (use the sign of larger exponent value) slli s7, s7, 15 slli s6, s6, 7 andi s4, s4, 0x7F or t3, s7, s6 or a0, t3, s4 lw s2, 0(sp) lw s3, 4(sp) lw s4, 8(sp) lw s5, 12(sp) lw s6, 16(sp) lw s7, 20(sp) lw s8, 24(sp) lw s9, 28(sp) lw s10, 32(sp) addi sp, sp, 36 ret ``` and based on bf16_add we can simply reverse the sign bit of the subtraction ### bf16_sub ```asm= bf16_sub: addi sp, sp, -8 # allocate space for 4 saved registers (s10) sw s10, 0(sp) # caller save sw ra, 4(sp) li s10, 0x8000 xor a1, a1, s10 jal ra, bf16_add lw s10, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret ``` ![圖片](https://hackmd.io/_uploads/BJ4wZnqyJx.png) ## Result ![圖片](https://hackmd.io/_uploads/SJi1CC9yye.png) ![圖片](https://hackmd.io/_uploads/Bk4rpk2Jkl.png) For addition and subtraction results, there are occasional discrepancies between the C code and the assembly bf16_sub, with a difference of 0x0001. This happens with many different test sets, and I am guessing the possible reason is Rounding Differences: In my C code, when converting from FP32 to BF16, i'm using a rounding method that correctly handles rounding to the nearest even (also known as round half to even). ```c h = (u.as_bits + (0x7FFF + ((u.as_bits >> 16) & 1))) >> 16; ``` This code adds 0x7FFF plus an extra 1 if the least significant bit of the high half is 1. This ensures correct rounding according to the IEEE 754 standard for BF16. But in my assembly code, if the bf16_add function doesn't implement rounding in the same way, I didn't notice this problem when i was implementing bf16_add,it cause the final result to differ by 1 in the BF16 representation. like in this example ```asm= Rectangle A: (ax1, ay1): 0x40833333, 0x40AB3333 (ax2, ay2): 0x40C851EC, 0x41180000 width in bf16 = 400a highet in bf16 = 4085 area in bf16 = 410f Total Area in fp32 : 8.937500 Total Area in fp32 hex: 0x410F0000 ``` the result of bf16_sub in highet will be ![圖片](https://hackmd.io/_uploads/ByTwbCoJJe.png) this would cause the final result in bf16_mul to be different ## Instruction Pipeline take `add t2, t2, t0` in my fp32_to_bf16 implement as an example li t3, 0x7fff add t2, t2, t3 # t2 = (0x7fff + ((u.as_bits >> 16) & 1)) <span style="color: red;"> add t2, t2, t0 srli t2, t2, 16 mv a0, t2 ret ![圖片](https://hackmd.io/_uploads/H1_bOln1Jx.png) ![圖片](https://hackmd.io/_uploads/B1-SLg3yJe.png) ![IMG_1214](https://hackmd.io/_uploads/B1TcvWh1kx.jpg) ### Instruction Fetch (IF) ![圖片](https://hackmd.io/_uploads/HyN5aenkJl.png) PC (Program Counter): 1. The PC points to the memory address 0x0000011C, which contains the add x7, x7, x5 instruction. 2. The PC will then increment to the next instruction address, which is 0x0000011C. This prepares the system to fetch the next instruction in the following cycle. ### Instruction Decode (ID) ![圖片](https://hackmd.io/_uploads/BJLZkZ2ykl.png) #### 1.Decoding instruction The instruction add x7, x7, x5 (0x005383b3) is decoded. ``` 0x005383b3 → 0000 0000 0101 0011 1000 0011 1011 0011 ``` The opcode, function code (funct3 and funct7), and register fields (source and destination registers) are extracted: ![圖片](https://hackmd.io/_uploads/H1_bOln1Jx.png) so it will look like this ``` func7 rs2 rs1 func3 rd opcode 0000000 000101 00111 000 00111 0110011 ``` Opcode: 0110011 (indicates it is an R-type instruction). Funct3: 000 (indicates this is an addition operation). Funct7: 0000000 (part of the addition specification). Source Registers: x7 and x5. Destination Register: x7. #### 2. Register Read: The values from the source registers x7 and x5 are read from the register file. x7: 0x00000000 (this is the current value of x7). x5: 0x40c851ec Value is being fetched and passed into the pipeline. These values will be forwarded to the next stage (the Execute stage). #### 3. Control Signal Generation: The control signals are generated based on the instruction type (R-type in this case). The control unit sets up the ALU control signals for performing an addition in the upcoming Execute stage. The register write enable (Wr En) signal for the destination register x7 is activated, meaning that after the addition, the result will be written back to x7. #### 4. Immediate and Operand Preparation: Since this is an R-type instruction, no immediate value is used (the immediate section would be inactive in this stage). The source operands (x7 and x5) are prepared to be sent to the ALU for execution. ### 3. Instruction Execute (IE) ![圖片](https://hackmd.io/_uploads/BkAxnb2yke.png) #### Data hazard ![IMG_1216](https://hackmd.io/_uploads/HJAoV72kyx.jpg) ```asm= add t2, t2, t3 add t2, t2, t0 srli t2, t2, 16 ``` There is a data hazard in t2 becuase the result of add t2, t2, t3 hasn't be writed back , so to solve this question we can use forwarding. ![IMG_1217](https://hackmd.io/_uploads/Sk8PB721ye.jpg) which is the yellow line here, will forward the EX result of add t2, t2, t3 into op1 ![圖片](https://hackmd.io/_uploads/ByYx7X3yJx.png) #### 1. ALU Operation: The ALU is responsible for executing the arithmetic operation. The values from the source registers x5 and from forwarding are passed into the ALU: Operand 1 (from forwarding): 0x00007fff Operand 2 (from x5): 0x40c851ec The control signals indicate that this is an addition operation (add), so the ALU will perform the addition: ```clike= 0x00007fff + 0x40c851ec = 0x40c8d1eb ``` The result of the ALU operation, 0x40c8d1eb, is computed and stored in the ALU's Result (Res) register. #### 2. Forwarding the Result: The computed result 0x40c8d1eb is forwarded to the next pipeline stage, which is the Memory Access (MEM) stage,and also passed to the next op1 because srli t2, t2, 16 still have data hazard. Since the add instruction does not involve memory access, this stage will bypass memory operations and proceed directly to the write-back stage. #### 3. Control Signals: Control signals are active, enabling the ALU to perform the addition operation. You can see that the correct values are forwarded to the ALU, and the result is computed successfully. **No branch is taken**, as this is an arithmetic operation and not a conditional branch. ### 4. Memory Access (MEM) ![圖片](https://hackmd.io/_uploads/H1tBYX3Jyl.png) Key Control Signals to Track in MEM: MemRead = 0: No read operation from memory. MemWrite = 0: No write operation to memory. ALU Result: The result is forwarded to the next stage (WB) without modification. Since add x7, x7, x5 is an arithmetic instruction, there is no memory access required in the MEM stage. The memory access stage is bypassed for this instruction. The result of the addition, which was computed in the ALU during the Execute (EX) stage, is forwarded directly to the next stage, Write-Back (WB). ### 5. Write Back (WB) ![圖片](https://hackmd.io/_uploads/Hy23cXhJJl.png) Write to Register File: The result from the ALU (forwarded from the MEM stage) is written back into the destination register (x7). Control Signals: The write enable (Wr En) signal for the register file is activated to allow the result to be stored in x7. The correct destination register index (x7) is selected for writing the result. Final Result in Register: After this stage, the new value for x7 is updated in the register file. The value that was computed by the ALU (0x40c8d1eb or similar based on the values of x7 and x5) is now stored in x7.