arch2025-homework1

# Assignment 1 - RISC-V ## SUMMARY - **Opening** - **UTF8 Encoding/Decoding** - **Study of the Processor** - **BF16 All Operations (add/sub/mul/div/sqrt)** --- # Opening Before reading this work, I want to clarify a few things: - First, I'm not entirely certain what the professor expects, which is why this report explains each part of the code in detail, even if it becomes borring. (I will adjust the approach for the next homework if needed) - The more I worked on this assignment, the more I wanted to refactor everything. Unfortunately, there are constraints: - **Better naming conventions than "skip" and "loop" (currently quite unreadable)** - **More comprehensive explanations throughout** - **This document reads more like a development journal than a formal report , I may make statements and revise them shortly after. I apologize for any inconsistencies.** # UTF8 Encoding/Decoding ## Overview & Summary Implementation of an encoder/decoder for the UTF8 format in RISC-V assembly. - **Part 1: Pure C to RISC-V Translation** (encode & decode functions) - **Part 2: Validation & Testing** (test suite and error handling) - **Part 3: Performance Analysis** (instruction count and optimization) --- # Part 1: Pure C to RISC-V Translation **Main registers convention:** - `x1`: input value / addresses - `x2`: return result - `x3-x19`: working registers - `x20`: saved `ra` (return address) - `x21`: stack pointer (`sp`) --- ## encode_uf8 Function This section shows the step-by-step translation from C to RISC-V assembly. ### Save and Initialize ```riscv encode_uf8: addi x21, x21, -4 sw x20, 0(x21) jal x20, init_registers_encode ``` - Saves `ra` on stack for function call preservation ```riscv init_registers_encode: li x2, 0 # return value li x3, 16 # comparison threshold li x4, 32 # CLZ counter li x5, 16 # CLZ step jr x20 ``` - Initializes working registers with constants ```riscv blt x1, x3, skip_encode ``` If value < 16, no encoding is needed; the value is returned directly. ### CLZ (Count Leading Zeros) ```riscv add x6, x1, x0 loop: beq x5, x0, end_loop srl x7, x6, x5 beq x7, x0, skip2 sub x4, x4, x5 addi x6, x7, 0 skip2: li x10, 1 srl x5, x5, x10 j loop ``` **Binary search algorithm** to count leading zeros: - `x4`: counter - `x5`: shift step - `x6`: working value - Shifts and tests to find the first 1 bit ### MSB Calculation (Most Significant Bit) ```riscv sub x3, x4, x6 li x4, 31 sub x4, x4, x3 ``` - `x3` = lz (leading zeros) - `x4` = MSB position (31 - LZ) ### Exponent Calculation ```riscv li x5, 0 blt x4, x11, skip3 sub x5, x4, x12 blt x5, x13, skip4 li x5, 15 ``` - If MSB < 5: exponent = 0 - else: exponent = MSB - 4 - maximum value: 15 ### Overflow Construction ```riscv loop3: bge x7, x5, loop3_end sll x6, x6, x10 addi x6, x6, 16 addi x7, x7, 1 j loop3 ``` overflow = (16 << exp) - 16 ### Overflow Adjustment ```riscv loop4: blt x5, x10, loop4_end blt x6, x1, loop4_end beq x6, x1, loop4_end addi x6, x6, -16 srl x6, x6, x10 addi x5, x5, -1 j loop4 ``` Reduces exponent if overflow is too large compared to the input value. ### Overflow Expansion ```riscv loop5: bge x5, x15, loop5_end sll x7, x6, x10 addi x7, x7, 16 bge x7, x1, loop5_end addi x6, x7, 0 addi x5, x5, 1 j loop5 ``` Increases exponent if possible to maximize precision. ### Mantissa Calculation ```riscv sub x7, x1, x6 srl x7, x7, x5 sll x2, x5, x14 or x2, x2, x7 ``` - Mantissa = (value - overflow) >> exponent - result = (exponent << 4) | mantissa ### Return ```riscv skip_encode: add x2, x0, x1 # if value < 16, just return value encode_return: lw x20, 0(x21) addi x21, x21, 4 jr x20 ``` Uses the stack to handle multiple values (for testing purposes). --- ## decode_uf8 Function ### Register Initialization (decode) ```riscv init_decode: li x2, 0 # return value li x3, 0 # mantissa li x4, 0 # exponent li x6, 0x0F # mantissa mask li x7, 4 # shift amount li x8, 15 # max exponent li x9, 0x7FFF # offset mask jr x20 ``` ### Field Extraction ```riscv and x3, x1, x6 # mantissa = value & 0x0F srl x4, x1, x7 # exponent = value >> 4 ``` Separates 4 bits of mantissa and 4 bits of exponent. ### Offset Calculation ```riscv sub x5, x8, x4 # 15 - exponent srl x5, x9, x5 # 0x7FFF >> (15 - exponent) sll x5, x5, x7 # << 4 ``` Calculates offset = `((0x7FFF >> (15 - exp)) << 4)`. ### Final Decoding ```riscv sll x2, x3, x4 # mantissa << exponent add x2, x2, x5 # + offset ``` Result = `(mantissa << exponent) + offset`. --- # Part 2: Validation & Testing ## Test Data Structure ### Test Values ```riscv initial_value1: .word 976 initial_value2: .word 15 initial_value3: .word 0 ``` **Test values**: 976, 15, and 0 to verify functionality. These values test three different code paths: - **0 and 15**: pass through without calculation (value < 16, direct return via `skip_encode`) - **976**: goes through the full encode/decode process and should return exactly the same value **Note**: Different numbers can be used, but keep in mind there will be an error rate for certain values due to the lossy nature of UF8 encoding. ### Result Management ```riscv test1_pass: la a0, msg_pass li a7, 4 ecall la x10, test_results li x11, 1 sw x11, 0(x10) ``` Prints "PASS" and stores 1 in `test_results[0]`. If test fails: ```riscv la a0, msg_fail li a7, 4 ecall la x10, test_results sw x0, 0(x10) ``` Prints "FAIL" and stores 0 in `test_results[0]`. ### Final Summary ```riscv final_summary: la x10, test_results lw x11, 0(x10) lw x12, 4(x10) lw x13, 8(x10) and x14, x11, x12 and x14, x14, x13 beqz x14, failed ``` - Loads all 3 test results - Performs logical AND: all must be 1 to pass - If x14 = 0, at least one test failed ```riscv la a0, msg_all_pass li a7, 4 ecall ``` If all tests passed, prints "All tests PASSED!". ```riscv failed: la a0, msg_final_fail li a7, 4 ecall ``` If any test failed, prints "Some tests FAILED!". --- # Part 3: Performance Analysis ## Instruction Count Analysis To measure performance, **976** was chosen as a representative value that exercises the full encoding path (976 > 16). ### Methodology: 1. Count instructions in the RISC-V implementation using Ripes step-by-step execution 2. Estimate instructions in the C implementation 3. Identify optimization opportunities ### encode_uf8 function (value = 976): **Total: 102 instructions** - **Initialization**: 10 instructions - **CLZ**: 35 instructions (binary search loop, 5 iterations) - **MSB calculation**: 3 instructions - **Exponent and overflow setup**: 11 instructions - **Overflow construction loops**: 26 instructions (loop3 + loop4 + loop5) - **Mantissa calculation and combine**: 6 instructions - **Return**: 3 instructions - **Other**: 8 instructions (branches, misc, etc.) The analysis reveals that CLZ is expensive, as are the various loops. ### decode_uf8 function (value = 976 encoded): **Total: 23 instructions** - **Initialization**: 15 instructions - **Decode formula**: 8 instructions Initialization is the most costly component. ## C vs RISC-V Comparison | Operation | RISC-V | C language | |-----------|--------|---------------| | Encode | 102 | 77-90 | | Decode | 23 | 12 | Note: These are estimated values for **ONE NUMBER ONLY**. ## Optimization Opportunities ### 1. CLZ Optimization **Current**: Simple translation from the C program. **Potential improvements**: - Use built-in RISC-V modules (not allowed in this assignment) - Use a lookup table stored in memory ### 2. Loop Fusion **Current**: Three separate loops (loop3, loop4, loop5) **Attempted optimization**: Merge into a single loop with combined conditions Multiple fusion attempts were made, but no significant improvement was achieved, so the original version was retained. One attempt to merge loop4 and loop5 saved only 3 instructions: ```riscv merge_loop: # Loop4 blt x6, x1, skip00 # if overflow < value, try expand beq x6, x1, skip3 # if overflow == value, done # Overflow too big, decrease addi x6, x6, -16 # overflow -= 16 srl x6, x6, x10 # overflow >>= 1 addi x5, x5, -1 # exponent-- j merge_loop skip00: # Loop5 bge x5, x15, skip3 # if exponent >= 15, done sll x7, x6, x10 # next_overflow = overflow << 1 addi x7, x7, 16 # next_overflow += 16 bge x7, x1, skip3 # if next_overflow >= value, done addi x6, x7, 0 # overflow = next_overflow addi x5, x5, 1 # exponent++ j merge_loop # continue adjusting ``` ## Optimized Performance Estimate | Optimization | Encode (before) | Encode (after) | Improvement | |--------------|-----------------|----------------|-------------| | Loop merging | 102 | 98 | Minimal | No improvements were found for decode; initialization remains the most expensive component and cannot be easily optimized. ## Conclusion After these unsuccessful optimization attempts, research was conducted on C compilation and efficiency. The key finding is that **cycle count is more important than instruction count for efficiency**. The next section compares C assembly with RISC-V assembly to better understand this principle. --- # Processor Study with Ripes Using the Ripes simulator with RISC-V 32-bit 5-stage processor, the flow of instructions through the pipeline and control signals at the hardware level are analyzed (based on the understanding of question 3 in homework 1). ### Pipeline Configuration: - **Processor**: RISC-V 32-bit Single-cycle / 5-stage pipeline - **ISA**: RV32I - **Stages**: IF → ID → EX → MEM → WB ### Pipeline Stages Overview: 1. **IF** (Instruction Fetch): Fetch instruction from instruction memory using PC 2. **ID** (Instruction Decode): Decode instruction, read source registers from register file 3. **EX** (Execute): Perform ALU operation or calculate memory address 4. **MEM** (Memory Access): Access data memory for load/store instructions 5. **WB** (Write Back): Write result back to destination register --- ## Selected Instructions for Analysis Two representative instructions covering big part of instruction are analyzed: `lw`, `sw`. --- ## Instruction 1: Load Word - `lw x1, 0(x2)` **Context**: Loading test value 976 from memory at the beginning of main program **Purpose**: Read the test value from data memory into register x1 for encoding ### Initial State ![image](https://hackmd.io/_uploads/SyLo6zLTxl.png) ![image](https://hackmd.io/_uploads/BJ2fTMIpeg.png) - Memory address `0x10000000`: contains **976** (0x000003D0) - Register x1: contains **0x00000000** - Register x2: contains address **0x10000000** --- ### Stage 1: Instruction Fetch (IF) ![image](https://hackmd.io/_uploads/r1eXJXUage.png) **What happens**: - PC outputs current instruction address: ![image](https://hackmd.io/_uploads/rymrkmUaxx.png) - Instruction Memory fetches the encoded `lw` instruction: ![image](https://hackmd.io/_uploads/HJbyZ7Lale.png) - 0x00012083 = 0000000000000001001000001**0000011** ![image](https://hackmd.io/_uploads/BJLnxQ8plx.png) --- ### Stage 2: Instruction Decode (ID) ![image](https://hackmd.io/_uploads/HJ_MGQL6gg.png) **What happens**: Control Unit decodes instruction ![image](https://hackmd.io/_uploads/BJP14786gl.png) - `0x02` source address (2) - `0x00` add offset (nothing to add) - `LW` instruction - `0x01` destination register (1) --- ### Stage 3: Execute (EX) ![image](https://hackmd.io/_uploads/H1IvSm86gl.png) **What happens**: - ALU performs address calculation: base_address + offset ![image](https://hackmd.io/_uploads/BytmUQ86gg.png) Calculation: 0x10000000 + 0 = 0x10000000 Result is the memory address to access. --- ### Stage 4: Memory Access (MEM) ![image](https://hackmd.io/_uploads/BywdS7U6lg.png) **What happens**: - Data Memory receives address from ALU (0x10000000) ![image](https://hackmd.io/_uploads/r1GewmI6lx.png) - Signal Write red → Data memory reading only - Data at address 0x10000000 is read: **976** (0x000003D0) ![image](https://hackmd.io/_uploads/B1c3DQUpgg.png) **Control Signals Active**: - **WR en = 0** (inactive), indicating read operation --- ### Stage 5: Write Back (WB) ![image](https://hackmd.io/_uploads/rJQYHmUall.png) **What happens**: ![image](https://hackmd.io/_uploads/rJrWYQLpel.png) - Wr data: 976 (0x000003d0) - Wr idx: 0x01 (register x1) - Wr En: 1 (enable) will write to the register ![image](https://hackmd.io/_uploads/Hy8BF78Tlx.png) ### Result - **Register x1 now contains the test value for use throughout the program** ![image](https://hackmd.io/_uploads/HkOusQ86gx.png) --- ## Instruction 2: Store Word - `sw x20, 0(x21)` Now that the process is understood, this analysis will proceed more quickly. **Context**: Storing encoded result back to memory after encoding **Purpose**: Store the return address (RA) in the stack (this technique is used when there are nested function calls) ### Initial State ![image](https://hackmd.io/_uploads/SyY2wBUTel.png) Expected state: - Register x20: contains the return address - Register x21: contains the current stack address --- ### Stage 1: IF ![image](https://hackmd.io/_uploads/ByCx4BL6le.png) --- ![image](https://hackmd.io/_uploads/BJqKVSIalg.png) - PC: line number 0x00000200 ![image](https://hackmd.io/_uploads/SJ56EB8Tgl.png) - Instructions: 0x014aa023=0000000101001010101000000**0100011** ![image](https://hackmd.io/_uploads/HyglrSU6xe.png) ![image](https://hackmd.io/_uploads/HJUAmHU6ee.png) --- ### Stage 2: ID ![image](https://hackmd.io/_uploads/rJC-4B86eg.png) ![image](https://hackmd.io/_uploads/rkSO8rUpgg.png) **What happens**: - `0x15` Reg1 destination address (x21) - `0x14` Reg2 source address (x20) - `SW` instruction - `0x00` offset (0) ![image](https://hackmd.io/_uploads/HkazDHUTgl.png) - Reading register contents: - Reg 1: 0x00010000 - Reg 2: 0x00000028 ![image](https://hackmd.io/_uploads/Sy61dHITge.png) --- ### Stage 3: EX ![image](https://hackmd.io/_uploads/rJIfEHLpge.png) **What happens**: ![image](https://hackmd.io/_uploads/SkVTOB8pgl.png) - ALU calculates memory address: x21 + 0 --- ### Stage 4: MEM ![image](https://hackmd.io/_uploads/BkRMESITeg.png) **What happens**: ![image](https://hackmd.io/_uploads/HknEKrLplx.png) - Addr: x21 (location to write) - Data in: x20 (data to write) **Control Signals Active**: - **Wr en = 1 (green) = x20 → x21** ![image](https://hackmd.io/_uploads/SJVN9rLalg.png) --- ### Stage 5: WB ![image](https://hackmd.io/_uploads/B13XNS8agx.png) **What happens**: - Data is written to the correct location! ![image](https://hackmd.io/_uploads/r1isqB86lx.png) - No register will be updated - Wr en = 0 (inactive) ![image](https://hackmd.io/_uploads/BkAq9S8axe.png) --- ## Signal Summary Table To avoid repetition, here is a summary table of all instruction types: | Instruction Type | Reg Wr | Mem Wr | Mem to Reg | ALU | Branch | |-----------------|----------|----------|----------|--------|--------| | `lw` (Load) | 1 | 0 | 1 | 1 | 0 | | `sw` (Store) | 0 | 1 | X | 1 | 0 | | `sub` (R-type) | 1 | 0 | 0 | 1 | 0 | | `beq` (B-type) | 0 | 0 | X | 0 | 1 | | `jal` (Jump) | 1 | 0 | 1 | X | 0 | **Legend**: 1 = active, 0 = inactive, X = don't care --- ## Processor Hazard Handling ### Data Hazard Resolution This section was not originally planned, but an important observation was made during the SW instruction presentation. This differs from the processor configuration used at my home institution. The code was initially modified to present SW in an ideal situation: ```riscv encode_uf8: addi x21, x21, 0 sw x20, 0(x21) ``` But the actual code is: ```riscv encode_uf8: addi x21, x21, -4 sw x20, 0(x21) ``` This modifies the stack pointer to avoid overwriting data (standard stack behavior). **Problem**: The instruction needs the updated value of x21 before it has been written back. **Solution in Ripes**: ![image](https://hackmd.io/_uploads/Hkek5ZU8aeg.png) - The `addi x21 x21 -4` instruction is in the MEM stage, so the new value of x21 is already computed - The ALU accepts the forwarding value from the next pipeline stage - The old value from the ID/EX pipeline is bypassed --- ### Branch Hazard Resolution Further investigation revealed another type of hazard handling in the processor. **Code sequence**: ```riscv beq x5, x0, end_loop srl x7, x6, x5 ``` **Problem**: Branch decision is not made until the EX stage, but the processor has already fetched the next instruction. **Solution**: - The next instruction is fetched to prepare for both cases: ![image](https://hackmd.io/_uploads/B1M7EULTxg.png) - If the branch is taken, the entire pipeline is flushed: ![image](https://hackmd.io/_uploads/H1AmEULpgx.png) --- # BF16 Operations ## SUMMARY - **BF16 to F32 Conversion** - **F32 to BF16 Conversion** - **BF16 Addition** - **BF16 Subtraction** - **BF16 Multiplication** - **BF16 Division** - **BF16 Square Root** --- # Part 1: Pure C to RISC-V Translation ## BF16 Format Overview - **16 bits total: 1 sign bit + 8 exponent bits + 7 mantissa bits** - Special values: Inf (exp=0xFF, mant=0), NaN (exp=0xFF, mant≠0), Zero (exp=0, mant=0) (Unlike UF8, these special values exist in BF16) **Register Convention:** - `x1`: result value - `x2`: return address (ra) - `x3`: stack pointer (sp) - `x4-x9`: extracted information (sign, exp, mantissa) - `x10-x15`: temporaries - `x16-x22`: working registers *The register convention has been updated from the UF8 implementation for improved clarity.* --- ## 1. BF16 to F32 Conversion ### Code ```riscv main: la x1, bfloat16 lh x1, 0(x1) slli x4, x1, 16 la x7, float32 sw x4, 0(x7) j end ``` **Explanation:** FP32 = BF16 << 16. --- ## 2. F32 to BF16 Conversion ### Handle Special Cases ```riscv f32_to_bf16: la x1, bfloat32 lw x1, 0(x1) srli x4, x1, 23 andi x4, x4, 0x7FF li x5, 0xFF bne x4, x5, normal_case srli x6, x1, 16 andi x6, x6, 0xFFFF j return_bf16 ``` ### Code ```riscv normal_case: srli x6, x1, 16 andi x6, x6, 1 addi x6, x6, 0x7FFF add x6, x1, x6 srli x6, x6, 16 ``` --- ## 3. BF16 Addition ### Initialize ```riscv init: srli x4, x14, 15 # sign_a srli x5, x15, 15 # sign_b srli x6, x14, 7 # exp_a andi x6, x6, 0xFF srli x7, x15, 7 # exp_b andi x7, x7, 0xFF andi x8, x14, 0x7F # mant_a andi x9, x15, 0x7F # mant_b jr x2 ``` *This initialization pattern is reused for most operations.* ### Handle Special Cases ```riscv add: jal x2, init li x16, 0xFF bne x6, x16, skip1 bne x8, x0, return_a # a is NaN bne x7, x16, skip2 bne x9, x0, return_b # b is NaN beq x4, x5, return_b # Inf + Inf (same sign) li x1, 0xFFC0 # Inf + (-Inf) = NaN j return skip2: return_a: add x1, x14, x0 j return skip1: li x16, 0xFF beq x7, x16, return_b # b is Inf/NaN or x16, x6, x8 beq x16, x0, return_b # a is zero or x16, x7, x9 beq x16, x0, return_a # b is zero ``` ### Add Implicit Bit ```riscv beq x6, x0, skip3 ori x8, x8, 0x80 # add implicit 1 skip3: beq x7, x0, skip4 ori x9, x9, 0x80 # add implicit 1 skip4: ``` ### Align Exponents ```riscv sub x10, x6, x7 # exp_diff blez x10, skip5 add x12, x6, x0 # result_exp = exp_a li x16, 8 bge x10, x16, return_a # diff too large srl x9, x9, x10 # align mant_b j skip7 skip5: bgez x10, skip6 add x12, x7, x0 # result_exp = exp_b li x16, -8 blt x10, x16, return_b # diff too large sub x16, x0, x10 srl x8, x8, x16 # align mant_a j skip7 skip6: add x12, x6, x0 # exponents equal skip7: ``` ### Add/Sub Mantissas ```riscv bne x4, x5, skip8 # different signs add x11, x4, x0 # result_sign add x13, x8, x9 # add mantissas andi x16, x13, 0x100 beq x16, x0, skip9 srli x13, x13, 1 # normalize overflow addi x12, x12, 1 li x16, 0xFF blt x12, x16, skip9 slli x1, x11, 15 # overflow to Inf li x16, 0x7F80 or x1, x1, x16 j return skip9: j combine_result skip8: blt x8, x9, skip10 add x11, x4, x0 sub x13, x8, x9 j skip11 skip10: add x11, x5, x0 sub x13, x9, x8 skip11: beq x13, x0, return_zero skip12: andi x16, x13, 0x80 bne x16, x0, skip13 slli x13, x13, 1 # normalize addi x12, x12, -1 blez x12, return_zero j skip12 skip13: ``` ### Combine Result ```riscv combine_result: slli x1, x11, 15 # sign << 15 andi x16, x12, 0xFF slli x16, x16, 7 # exp << 7 or x1, x1, x16 andi x16, x13, 0x7F # mantissa or x1, x1, x16 j return ``` --- ## 4. BF16 Subtraction ### Code ```riscv sub: li x16, 0x8000 xor x15, x15, x16 # flip sign of b j add # reuse addition ``` **Context:** Subtraction is in the same program as addition (selected during test execution). **Explanation:** Subtraction is implemented as addition with the sign of the second operand flipped. --- ## 5. BF16 Multiplication ### Initialize ```riscv init: srli x4, x10, 15 # sign_a srli x5, x11, 15 # sign_b srli x6, x10, 7 andi x6, x6, 0xFF # exp_a srli x7, x11, 7 andi x7, x7, 0xFF # exp_b andi x8, x10, 0x7F # mant_a andi x9, x11, 0x7F # mant_b xor x12, x4, x5 # result_sign jr x2 ``` ### Handle Special Cases ```riscv mul: jal x2, init li x16, 0xFF bne x6, x16, skip1 bnez x8, return_a # a is NaN or x17, x7, x9 beqz x17, return_NAN # Inf * 0 = NaN slli x1, x12, 15 li x16, 0x7F80 or x1, x1, x16 # return Inf j return skip1: li x16, 0xFF bne x7, x16, skip2 bnez x9, return_b # b is NaN or x17, x6, x8 beqz x17, return_NAN # 0 * Inf = NaN slli x1, x12, 15 li x16, 0x7F80 or x1, x1, x16 # return Inf j return skip2: or x17, x6, x8 or x16, x7, x9 bnez x17, skip3 slli x1, x12, 15 # return signed zero j return skip3: bnez x16, skip4 slli x1, x12, 15 j return skip4: ``` ### Normalize ```riscv li x13, 0 # exponent adjustment bnez x6, skip5 normalize_a: andi x16, x8, 0x80 bnez x16, skip6 slli x8, x8, 1 addi x13, x13, -1 j normalize_a skip6: li x6, 1 j skip7 skip5: ori x8, x8, 0x80 # add implicit bit skip7: bnez x7, skip8 normalize_b: andi x16, x9, 0x80 bnez x16, skip9 slli x9, x9, 1 addi x13, x13, -1 j normalize_b skip9: li x7, 1 j skip10 skip8: ori x9, x9, 0x80 skip10: ``` ### Multiply Mantissas ```riscv li x14, 0 li x17, 0 mul_loop: # result_mant = mant_a * mant_b li x18, 8 beq x17, x18, mul_end andi x18, x9, 1 beqz x18, mul_skip add x14, x14, x8 mul_skip: slli x8, x8, 1 srli x9, x9, 1 addi x17, x17, 1 j mul_loop mul_end: add x15, x6, x7 li x16, 127 sub x15, x15, x16 # result_exp = exp_a + exp_b - 127 add x15, x15, x13 ``` ### Normalize Result ```riscv li x16, 0x8000 and x17, x14, x16 beqz x17, skip11 srli x14, x14, 8 andi x14, x14, 0x7F addi x15, x15, 1 # bit 15 set, shift right j skip12 skip11: srli x14, x14, 7 # bit 15 not set andi x14, x14, 0x7F skip12: ``` ### Handle Overflow/Underflow ```riscv li x16, 0xFF blt x15, x16, skip13 slli x1, x12, 15 li x16, 0x7F80 or x1, x1, x16 # overflow to Inf j return skip13: bgtz x15, skip14 li x17, -6 blt x15, x17, skip15 li x16, 1 sub x16, x16, x15 srl x14, x14, x16 # denormalize li x15, 0 j skip14 skip15: slli x1, x12, 15 # underflow to zero j return skip14: ``` ### Combine Result ```riscv slli x1, x12, 15 andi x16, x15, 0xFF slli x16, x16, 7 or x1, x1, x16 andi x16, x14, 0x7F or x1, x1, x16 j return ``` --- ## 6. BF16 Division ### Initialize ```riscv init: srli x4, x10, 15 srli x5, x11, 15 srli x6, x10, 7 andi x6, x6, 0xFF srli x7, x11, 7 andi x7, x7, 0xFF andi x8, x10, 0x7F andi x9, x11, 0x7F xor x12, x4, x5 # result_sign jr x2 ``` ### Handle Special Cases ```riscv div: jal x2, init li x16, 0xFF bne x7, x16, skip1 bnez x9, return_b # b is NaN li x16, 0xFF bne x6, x16, skip1_1 bnez x8, skip1_1 li x1, 0x7FC0 # Inf / Inf = NaN j return skip1_1: slli x1, x12, 15 # x / Inf = 0 j return skip1: or x17, x7, x9 bnez x17, skip2 or x17, x6, x8 bnez x17, skip2_1 li x1, 0x7FC0 # 0 / 0 = NaN j return skip2_1: slli x1, x12, 15 li x16, 0x7F80 or x1, x1, x16 # x / 0 = Inf j return skip2: li x16, 0xFF bne x6, x16, skip3 bnez x8, return_a # a is NaN slli x1, x12, 15 li x16, 0x7F80 or x1, x1, x16 # Inf / x = Inf j return skip3: or x17, x6, x8 bnez x17, skip4 slli x1, x12, 15 # 0 / x = 0 j return skip4: ``` ### Add Implicit Bit ```riscv bnez x6, skip5 ori x8, x8, 0x80 j skip6 skip5: ori x8, x8, 0x80 skip6: bnez x7, skip7 ori x9, x9, 0x80 j skip8 skip7: ori x9, x9, 0x80 skip8: ``` ### Integer Division ```riscv slli x13, x8, 15 # dividend = mant_a << 15 mv x14, x9 # divisor = mant_b li x15, 0 # quotient = 0 li x18, 0 # loop counter loop: li x19, 16 beq x18, x19, end_loop slli x15, x15, 1 li x20, 15 sub x20, x20, x18 sll x21, x14, x20 bltu x13, x21, skip9 sub x13, x13, x21 ori x15, x15, 1 skip9: addi x18, x18, 1 j loop end_loop: ``` ### Calculate Exponent ```riscv sub x22, x6, x7 addi x22, x22, 127 # result_exp = exp_a - exp_b + 127 bnez x6, skip10 addi x22, x22, -1 # adjust for denormal a skip10: bnez x7, skip11 addi x22, x22, 1 # adjust for denormal b skip11: ``` ### Normalize Result ```riscv li x16, 0x8000 and x17, x15, x16 beqz x17, normalize_loop_start srli x15, x15, 8 # already normalized j skip12 normalize_loop_start: loop_2: li x16, 0x8000 and x17, x15, x16 bnez x17, end_loop_2 li x19, 1 ble x22, x19, end_loop_2 slli x15, x15, 1 addi x22, x22, -1 j loop_2 end_loop_2: srli x15, x15, 8 skip12: andi x15, x15, 0x7F ``` ### Handle Overflow/Underflow ```riscv li x16, 0xFF blt x22, x16, skip13 slli x1, x12, 15 li x16, 0x7F80 or x1, x1, x16 # overflow to Inf j return skip13: bgtz x22, skip14 slli x1, x12, 15 # underflow to zero j return skip14: ``` ### Combine Result ```riscv slli x1, x12, 15 andi x22, x22, 0xFF slli x22, x22, 7 or x1, x1, x22 andi x15, x15, 0x7F or x1, x1, x15 j return ``` --- ## 7. BF16 Square Root **Disclaimer:** AI assistance was used for this implementation due to difficulties encountered. However, even with AI support, challenges remained because both the AI and I struggled with certain RISC-V assembly aspects. A working version using the RV32M extension (multiplication/division instructions) is presented here, as it may be useful for future assignments. **Note: If you wish to review the non-working pure RV32I version with infinite loops, please check the GitHub repository.** ### Initialize ```riscv init: srli x4, x10, 15 # sign srli x5, x10, 7 andi x5, x5, 0xFF # exponent andi x6, x10, 0x7F # mantissa jr x2 ``` ### Handle Special Cases ```riscv sqrt: jal x2, init li x7, 0xFF bne x5, x7, skip1 bnez x6, return_a # NaN bnez x4, return_nan # sqrt(-Inf) j return_a # sqrt(+Inf) skip1: or x7, x5, x6 bnez x7, skip2 li x1, 0 # sqrt(0) = 0 j return skip2: bnez x4, return_nan # sqrt(negative) bnez x5, skip3 li x1, 0 # sqrt(denormal) ≈ 0 j return skip3: ``` ### Calculate Square Root ```riscv li x7, 127 sub x8, x5, x7 # e = exp - 127 ori x11, x6, 0x80 # m = mantissa with implicit bit andi x7, x8, 1 beqz x7, skip4 slli x11, x11, 1 # if e is odd, m *= 2 addi x8, x8, -1 srai x9, x8, 1 addi x9, x9, 127 # new_exp = (e-1)/2 + 127 j skip5 skip4: srai x9, x8, 1 addi x9, x9, 127 # new_exp = e/2 + 127 skip5: ``` ### Binary Search for Square Root ```riscv li x12, 90 # low = 90 li x13, 256 # high = 256 li x14, 128 # result_sqrt = 128 while_loop: bltu x13, x12, end_loop add x15, x12, x13 srli x15, x15, 1 # mid = (low + high) / 2 mul x18, x15, x15 li x7, 128 divu x18, x18, x7 # sq = mid^2 / 128 bltu x11, x18, skip6 mv x14, x15 addi x12, x15, 1 # low = mid + 1 j skip7 skip6: addi x13, x15, -1 # high = mid - 1 skip7: j while_loop end_loop: ``` ### Normalize Result ```riscv li x7, 256 blt x14, x7, skip8 srli x14, x14, 1 addi x9, x9, 1 j skip9 skip8: li x7, 128 bge x14, x7, skip9 loop_2: li x7, 128 bge x14, x7, skip10 li x16, 1 ble x9, x16, skip10 slli x14, x14, 1 addi x9, x9, -1 j loop_2 skip10: skip9: andi x14, x14, 0x7F ``` ### Handle Overflow/Underflow ```riscv li x7, 0xFF bge x9, x7, return_inf blez x9, return_zero ``` ### Combine Result ```riscv andi x9, x9, 0xFF slli x9, x9, 7 or x1, x9, x14 j return ``` --- # Part 2: Validation & Testing ## Test Program ```riscv .data test_a: .word 0x3F80 # 1.0 test_b: .word 0x4000 # 2.0 result_add1: .word 0 msg_test1: .string "Test Add 1.0+2.0: " msg_pass: .string "PASS\n" msg_fail: .string "FAIL\n" .text .global main main: la a0, msg_test1 li a7, 4 ecall la x14, test_a lw x14, 0(x14) la x15, test_b lw x15, 0(x15) jal x2, add la x7, result_add1 sw x1, 0(x7) li x16, 0x4040 beq x1, x16, test1_pass la a0, msg_fail li a7, 4 ecall j test2 test1_pass: la a0, msg_pass li a7, 4 ecall test2: ... ``` **Note: Eight different tests can be executed in the q1-bf16-test.s file (excluding sqrt).** --- # Part 3: Performance Analysis **Initially, this analysis was intended to simply compare instruction counts. However, the goal evolved to surpass C compiler performance by leveraging RISC-V architecture-specific optimizations.** This addresses the assignment's directive: *"Leverage the strengths of the RISC-V architecture for efficiency."* ## BF16 Addition Performance Comparison The following comparison uses a simple test case: - **Operation: result = a + b** - **a = 1.0 (BF16: 0x3F80)** - **b = 2.0 (BF16: 0x4000)** ### Initial Attempt: C with -O2 (Inline Disabled) The first approach was to examine the assembly generated with -O2 optimization. However, the compiler inlined the function by default. To obtain a fair comparison, inlining was disabled using compiler directives. The resulting x86-64 assembly (truncated for brevity): ```assembly bf16_add: .LFB19: .cfi_startproc movl %edi, %ecx movl %esi, %edx movl %edi, %r11d pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 shrw $7, %cx pushq %rbx .cfi_def_cfa_offset 24 .cfi_offset 3, -24 shrw $7, %dx movl %esi, %ebx movl %esi, %r10d movzbl %cl, %ecx movl %edi, %r8d movl %edi, %eax shrw $15, %r11w shrw $15, %bx movzbl %dl, %edx andl $127, %edi andl $127, %r10d cmpw $255, %cx je .L55 ... .cfi_endproc ``` ### Performance Results After counting instructions and estimating cycle counts with AI assistance and manual verification: | Implementation | Instructions | Cycles | |----------------|--------------|--------| | **RISC-V (manual)** 👑 | **21** | **37** | | C (-O2, no inline) | 51 | 81 | **The RISC-V implementation wins!** However, this comparison has significant caveats: 1. **Non-inline compilation** was forced for analysis purposes 2. **-O2 optimization** was used instead of maximum -O3 3. **x86-64 architecture** uses many registers and is designed for 64-bit operations with additional security features compared to the minimal RISC-V implementation ### Adding Inline Optimization | Implementation | Instructions | Cycles | |----------------|--------------|--------| | **RISC-V (manual)** 👑 | **21** | **37** | | C (-O2, no inline) | 51 | 81 | | C (-O2, inline) | 58 | 72 | **Interesting observation:** More instructions but fewer cycles due to better instruction scheduling and elimination of function call overhead. ### Maximum Optimization: -O3 | Implementation | Instructions | Cycles | |----------------|--------------|--------| | **RISC-V (manual)** 👑 | **21** | **37** | | C (-O2, no inline) | 51 | 81 | | C (-O2, inline) | 58 | 72 | | C (-O3) | 42 | 48 | **Why does -O3 perform better?** 1. **The compiler integrates the function directly into main**, eliminating all function call overhead 2. **Common case optimization**: The compiler reorders code to place the most frequent execution paths first, minimizing branch mispredictions ### Lessons Learned The performance advantage of the manual RISC-V implementation comes from: 1. **Minimal overhead**: No function call prologue/epilogue when integrated properly 2. **RISC-V simplicity**: The reduced instruction set means each operation is optimized for the architecture 3. **Hand-tuned paths**: Critical paths (common cases) can be manually optimized to avoid unnecessary branches However, to achieve truly superior performance, the implementation should: - **Place rare cases (like infinity/NaN handling) at the end** to avoid polluting the instruction cache - **Optimize for the common case first** (normal additions) with minimal branching - **Use pipeline-friendly instruction sequences** that avoid data hazards --- # Conclusion ## Summary of Achievements This assignment successfully implemented: 1. **UTF8 encoding/decoding** with complete test coverage 2. **Comprehensive processor pipeline analysis** demonstrating understanding of hazards and forwarding 3. **Complete BF16 floating-point operations** (conversion, arithmetic, and square root) ## Insights ### 1. Assembly Programming Complexity Writing assembly code revealed the complexity. Even simple operations require extensive error handling, normalization, and edge case management. ### 2. Performance Characteristics The analysis demonstrated that: - **Instruction count alone is misleading** - cycle count matter more - **Compiler optimizations are sophisticated** - achieving better performance than -O3 requires deep architectural knowledge (but not in tose simple case) - **Hand-optimization has limits** - while the RISC-V implementation was competitive, the margin diminishes with higher compiler optimization levels ### 3. Architecture-Specific Design The RISC-V architecture's strength lies in: - **Simple, uniform instruction format** enabling fast decode - **Large register file** reducing memory traffic - **Explicit pipeline visibility** allowing manual optimization of data hazards ## Future Improvements For future assignments: 1. **Code Organization**: Implement better naming conventions and modular structure 2. **Optimization Strategy**: Focus on common-case performance rather than exhaustive edge case 3. **Documentation**: Maintain clearer separation between implementation notes and formal analysis ## Final Reflection This assignment was enlightening. While initially frustrating (Due to simple translation of C to risc ,especially the sqrt implementation), it provided invaluable insight into low-level computation, compiler optimization, and computer architecture. The iterative process of writing, testing, and optimizing assembly code, though time-consuming, has deepened my understanding of low levels languages. The realization that manual RISC-V code could compete with (and occasionally exceed) compiled C code was unexpected, even if this advantage diminishes at higher optimization levels. I don't consider this homerwork as good (I mean that i didn't provided a pretty good homework) but this exercise has prepared me well for future work i think. --- **End of Report**