# Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < [`clare8151214`](https://github.com/clare8151214) >

# Problem B

> ref: [Quiz1 of Computer Architecture (2025 Fall) Problem-B](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B)

## Implementation

The number of cycles consumed by my implementation is about 50264:

![image](https://hackmd.io/_uploads/rJAJrAdalx.png)

> [my code Commit ff1556a](https://github.com/clare8151214/ca2025-quizzes/commit/ff1556a7fda1fe99955b8d99ae6ce8c25d3e1222#diff-7206aed52dcfe0c0fccc1d691f4e306e7eb87b5b8adb0978a09b605c1670a1f7)

Meanwhile, the code optimized with `gcc -O3` consumes only 29082 cycles:

![image](https://hackmd.io/_uploads/ryd8U0_peg.png)

> [gcc -O3 code](https://github.com/clare8151214/ca2025-quizzes/blob/main/q1-uf8_O3.s)

This clearly shows how powerful modern compiler optimizations can be. My implementation is slower mainly for two reasons, which I analyze below along with an optimization strategy for each.

## Reason 1: The Massive Overhead of Function Calls

This is the biggest cause of the performance difference, likely accounting for more than 80% of the gap.

In the `main_loop_start` loop, each iteration (256 in total) performs:

```assembly
jal ra, uf8_decode
jal ra, uf8_encode
```

And inside `uf8_encode`, another function call occurs:

```assembly
jal ra, clz
```

This means 3 function calls per iteration.

What is the cost of a function call? Each `jal` may look like a single instruction, but it hides a lot of overhead:

- **Prologue**: `addi sp, sp, -X`, `sw ra, ...`, `sw s0, ...`, etc., to save context before entering the function.
- **Epilogue**: `lw ra, ...`, `lw s0, ...`, `addi sp, sp, X`, `jr ra`, to restore state before returning.
- **Pipeline flush**: `jal` and `jr ra` are jump instructions. In a pipelined processor (such as the 5-stage model in Ripes), each taken jump flushes the pipeline, wasting several cycles.

Estimated cost:

- Total calls: 256 iterations × 3 calls per iteration = 768 function calls.
- Assuming ~8 extra instructions per call (prologue/epilogue): 768 × 8 = 6144 extra instructions.
- Adding the pipeline flush penalties, most CPU cycles are spent "preparing for function calls" and "returning," not on actual computation.

What does GCC -O3 do? The compiler uses function inlining. It notices that `uf8_decode` and `clz` are small and embeds their code directly into the caller instead of calling them, which completely eliminates the overhead described above.

---

### Optimization Strategy 1: Manual Function Inlining

This is the most impactful optimization. I can embed the `uf8_decode` and `clz` code directly where they are called.

1. Inline `uf8_decode` in `main`

#### Original Main loop

```assembly
main_loop_start:
    # for loop condition check
    bge s1, s5, loop_exit       # if (i >= 256), exit loop

    # Round-trip Test Logic
    # 1. Decode fl
    mv a0, s1                   # a0 = fl = i
    jal ra, uf8_decode
    mv s0, a0                   # s0 = value (result of decode)

    # 2. Encode value
    mv a0, s0                   # a0 = value (parameter for encode)
    jal ra, uf8_encode
    mv s3, a0                   # s3 = fl2 (result of encode)

    # 3. Check if (fl != fl2)
    beq s3, s1, check_monotonicity  # if (fl2 == fl), skip failure
    li s6, 0                    # if failed, passed = false
```

#### New Main loop

```assembly
main_loop_start:
    bge s1, s5, loop_exit       # if (i >= 256), exit loop

    mv a0, s1                   # Input: a0 = fl = i
    andi t0, a0, 0x0f           # t0 = mantissa
    srli t1, a0, 4              # t1 = exponent
    li t2, 0x7FFF
    li t3, 15
    sub t3, t3, t1              # t3 = 15 - exponent
    srl t4, t2, t3
    slli t4, t4, 4              # t4 = offset
    sll t0, t0, t1
    add s0, t0, t4              # s0 = value (decode result)

    # 2. Encode value
    mv a0, s0                   # a0 = value (parameter for encode)
    jal ra, uf8_encode
    mv s3, a0                   # s3 = fl2 (result of encode)

    # 3. Check if (fl != fl2)
    beq s3, s1, check_monotonicity
    li s6, 0                    # passed = false
```
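For reference, this is the decode logic the inlined block implements, written as C. The function name `uf8_decode_ref` and the variable names are mine; it is a minimal sketch that mirrors the offset formula used in the assembly above, not the handout's source:

```c
#include <stdint.h>

/* C view of the inlined decode above:
 * offset = (0x7FFF >> (15 - exponent)) << 4, value = (mantissa << exponent) + offset */
static inline uint32_t uf8_decode_ref(uint8_t fl)
{
    uint32_t mantissa = fl & 0x0F;                                  /* t0 */
    uint32_t exponent = fl >> 4;                                    /* t1 */
    uint32_t offset   = ((uint32_t)0x7FFF >> (15 - exponent)) << 4; /* t4 */
    return (mantissa << exponent) + offset;                         /* s0 */
}
```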
2. Inline `clz` inside `uf8_encode`

#### Original uf8_encode

`clz` is called as a separate function.

```assembly
uf8_encode:
    mv t0, a0                   # t0 = value
    li t6, 16
    bltu t0, t6, encode_small   # if value < 16 → return value

    addi sp, sp, -8             # push frame
    sw ra, 4(sp)                # save ra
    sw t0, 0(sp)                # save original value

    # call clz
    mv a0, t0
    jal ra, clz                 # a0 = lz

    # pop frame
    lw t0, 0(sp)                # restore original value
    lw ra, 4(sp)                # restore ra
    addi sp, sp, 8

    mv t1, a0                   # lz
    li t2, 31
    sub t2, t2, t1              # msb = 31 - lz
    li t3, 0                    # exponent
    li t4, 0                    # overflow
    li t6, 5
    blt t2, t6, exp_est_done    # if msb < 5 → skip estimate
    li t6, 4
    sub t3, t2, t6              # exponent = msb - 4
    li t6, 15
    blt t6, t3, cap_to_15
    j exp_est_loop_init
cap_to_15:
    mv t3, t6                   # exponent = 15
exp_est_loop_init:
    # overflow = (2^exp - 1) * 16
    beqz t3, exp_est_done
    li t4, 0
    li t5, 0
exp_make_ovl:
    # while (t5 < t3) { overflow = (overflow<<1)+16; t5++; }
    blt t5, t3, 1f
    j exp_make_ovl_done
1:
    slli t4, t4, 1
    addi t4, t4, 16
    addi t5, t5, 1
    j exp_make_ovl
exp_make_ovl_done:
    li t5, 0                    # reset, reused in adj_up_loop

    # backtrack: while (exp>0 && value < overflow)
exp_back_check:
    beqz t3, exp_est_done
    bltu t0, t4, exp_back_step
    j exp_est_done
exp_back_step:
    addi t4, t4, -16
    srli t4, t4, 1
    addi t3, t3, -1
    j exp_back_check
exp_est_done:
    # fine-tune exponent / overflow upward
adj_up_loop:
    li t6, 15
    beq t3, t6, adj_done
    slli t5, t4, 1              # next = (overflow << 1) + 16
    addi t5, t5, 16
    bltu t0, t5, adj_done
    mv t4, t5
    addi t3, t3, 1
    j adj_up_loop
adj_done:
    # form result
    sub t6, t0, t4              # value - overflow
    srl t6, t6, t3              # >> exponent
    slli t3, t3, 4              # exponent << 4
    or a0, t3, t6               # combine
    andi a0, a0, 0xFF           # keep 8 bits
    ret

encode_small:
    andi a0, t0, 0xFF           # direct return
    ret

clz:
    li t0, 32                   # n = 32
    li t1, 16                   # c = 16
clz_loop:
    srl t2, a0, t1              # y = x >> c
    beq t2, x0, clz_skip
    sub t0, t0, t1              # n -= c
    mv a0, t2                   # x = y
clz_skip:
    srli t1, t1, 1              # c >>= 1
    bne x0, t1, clz_loop
    sub a0, t0, a0              # return n - x
    ret
```
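The `clz` routine at the end of this listing is the standard binary-search count-leading-zeros. A rough C paraphrase of what it computes (my own sketch; `clz_sketch` is an illustrative name), which is useful to keep in mind when inlining it below:

```c
#include <stdint.h>

/* Binary-search count-leading-zeros, matching the assembly above.
 * After the loop, x is 0 or 1, so "n - x" is the leading-zero count. */
static unsigned clz_sketch(uint32_t x)
{
    unsigned n = 32;
    for (unsigned c = 16; c != 0; c >>= 1) {
        uint32_t y = x >> c;
        if (y != 0) {   /* a set bit lives in the upper half */
            n -= c;
            x = y;
        }
    }
    return n - x;       /* x == 1 whenever the input was nonzero */
}
```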
#### New uf8_encode

The `clz` logic is implemented inline, eliminating the function call overhead.

```assembly
uf8_encode:
    mv t0, a0                   # t0 = value
    li t6, 16
    bltu t0, t6, encode_small   # if (value < 16) return value

    mv a5, t0                   # a5 is our 'x' for the clz logic
    li t1, 32                   # t1 is 'n'
    li t2, 16                   # t2 is 'c'
clz_inline_loop:
    srl t3, a5, t2              # t3 = y = x >> c
    beq t3, x0, clz_inline_skip
    sub t1, t1, t2              # n -= c
    mv a5, t3                   # x = y
clz_inline_skip:
    srli t2, t2, 1
    bne x0, t2, clz_inline_loop
    sub t1, t1, a5              # t1 = lz = n - x

    li t2, 31
    sub t2, t2, t1              # msb = 31 - lz
    li t3, 0                    # exponent = 0
    li t4, 0                    # overflow = 0
    li t6, 5
    blt t2, t6, exp_est_done    # if (msb < 5)
    li t6, 4
    sub t3, t2, t6              # exponent = msb - 4
    li t6, 15
    bgeu t3, t6, cap_to_15      # if exponent >= 15
    j exp_est_loop_init
cap_to_15:
    li t3, 15                   # exponent = 15
exp_est_loop_init:
    beqz t3, exp_est_done
    li t4, 0
    li t5, 0
exp_make_ovl:
    bgeu t5, t3, exp_make_ovl_done
    slli t4, t4, 1
    addi t4, t4, 16
    addi t5, t5, 1
    j exp_make_ovl
exp_make_ovl_done:
    li t5, 0
exp_back_check:
    beqz t3, exp_est_done
    bltu t0, t4, exp_back_step
    j exp_est_done
exp_back_step:
    addi t4, t4, -16
    srli t4, t4, 1
    addi t3, t3, -1
    j exp_back_check
exp_est_done:
adj_up_loop:
    li t6, 15
    beq t3, t6, adj_done
    slli t5, t4, 1
    addi t5, t5, 16
    bltu t0, t5, adj_done
    mv t4, t5
    addi t3, t3, 1
    j adj_up_loop
adj_done:
    sub t6, t0, t4
    srl t6, t6, t3
    slli t3, t3, 4
    or a0, t3, t6
    andi a0, a0, 0xFF
    ret

encode_small:
    andi a0, t0, 0xFF
    ret
```

---

Our cycle count decreased from `50624` to `41076`:

![image](https://hackmd.io/_uploads/BJxfR0Opxg.png)

---

## Reason 2: Algorithm Structure – Generic Loops vs. Optimized Lookup

My `uf8_encode` function uses three independent loops to iteratively approximate the correct exponent and overflow. This is logically generic but inefficient: each loop requires comparisons and branches, which disrupt the CPU pipeline.

What does GCC -O3 do? The compiler abandons loops in favor of a decision tree and lookup table:

- It uses a series of `bleu` instructions to quickly determine the range of the input value (e.g., `value <= 15`, `value <= 47`, `value <= 111`).
- Once the range is known, it jumps directly to a code block with precomputed exponent and overflow values.

This "space-for-time" approach produces larger code but drastically shortens execution paths and eliminates loops.

---

### Optimization Strategy 2: Simplify the Algorithm and Reduce Loops

It is very difficult to replicate the compiler's entire decision tree manually, but we can borrow its idea by creating fast paths for the most common cases to reduce the number of loops and branch instructions. This also shows why compiler-optimized code runs much faster: not because the logic is fundamentally different, but because function call overhead and loop inefficiency are systematically removed.

#### Implementation Steps

1. **Determine the fast-path boundaries**: identify the threshold values where the exponent changes:
   * exponent = 0: value ≤ 15
   * exponent = 1: value ≤ 47 (one below the exponent-2 offset of 48)
   * exponent = 2: value ≤ 111 (one below the exponent-3 offset of 112)
   * exponent = 3: value ≤ 239 (one below the exponent-4 offset of 240)
2. **Create branches**: at the beginning of `uf8_encode`, add `bleu` instructions to check these boundary conditions.
3. **Write fast handling blocks**: for each exponent, implement a very short handling block. Since both exponent and overflow are known constants in these cases, calculating the mantissa requires only one subtraction and one shift instruction.
4. **Build the general fallback path**: convert the existing code (with the inlined `clz`) into a general fallback path that handles all larger values not caught by the fast-path branches (the full listing follows the sketch below).
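As a quick sanity check on these boundaries, here is a small C sketch (my own, reusing the offset formula from the decode code; the variable names and the loop bound of 3 are mine) that prints the largest value each exponent can represent with a 4-bit mantissa:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    for (int e = 0; e <= 3; e++) {
        uint32_t offset = ((uint32_t)0x7FFF >> (15 - e)) << 4; /* 0, 16, 48, 112 */
        uint32_t max    = offset + (16u << e) - 1;             /* largest value encodable with exponent e */
        printf("exp=%d  offset=%u  max=%u\n", e, offset, max); /* max: 15, 47, 111, 239 */
    }
    return 0;
}
```

With those constants confirmed, the fast-path version of `uf8_encode` looks like this: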
```assembly
uf8_encode:
    mv t0, a0                   # t0 = original value (preserved)
    li t6, 15
    bleu t0, t6, encode_exp_0_fast_path
    li t6, 47
    bleu t0, t6, encode_exp_1_fast_path
    li t6, 111
    bleu t0, t6, encode_exp_2_fast_path
    li t6, 239
    bleu t0, t6, encode_exp_3_fast_path
    j encode_generic_path

encode_exp_0_fast_path:
    # Handles values <= 15
    # For exp=0, the encoded form is the value itself.
    andi a0, t0, 0xFF
    ret

encode_exp_1_fast_path:
    # Handles values 16-47
    # For exp=1, overflow is 16. mantissa = (value - 16) >> 1
    addi t1, t0, -16            # value - 16
    srli t1, t1, 1              # >> 1
    ori a0, t1, 0x10            # | (1 << 4)
    ret

encode_exp_2_fast_path:
    # Handles values 48-111
    # For exp=2, overflow is 48. mantissa = (value - 48) >> 2
    addi t1, t0, -48            # value - 48
    srli t1, t1, 2              # >> 2
    ori a0, t1, 0x20            # | (2 << 4)
    ret

encode_exp_3_fast_path:
    # Handles values 112-239
    # For exp=3, overflow is 112. mantissa = (value - 112) >> 3
    addi t1, t0, -112           # value - 112
    srli t1, t1, 3              # >> 3
    ori a0, t1, 0x30            # | (3 << 4)
    ret

encode_generic_path:
    # Inlined clz
    mv a5, t0
    li t1, 32
    li t2, 16
clz_inline_loop:
    srl t3, a5, t2
    beq t3, x0, clz_inline_skip
    sub t1, t1, t2
    mv a5, t3
clz_inline_skip:
    srli t2, t2, 1
    bne x0, t2, clz_inline_loop
    sub t1, t1, a5              # t1 = lz

    li t2, 31
    sub t2, t2, t1
    li t3, 0
    li t4, 0
    li t6, 5
    blt t2, t6, exp_est_done
    li t6, 4
    sub t3, t2, t6
    li t6, 15
    bgeu t3, t6, cap_to_15
    j exp_est_loop_init
cap_to_15:
    li t3, 15
exp_est_loop_init:
    beqz t3, exp_est_done
    li t4, 0
    li t5, 0
exp_make_ovl:
    bgeu t5, t3, exp_make_ovl_done
    slli t4, t4, 1
    addi t4, t4, 16
    addi t5, t5, 1
    j exp_make_ovl
exp_make_ovl_done:
    li t5, 0
exp_back_check:
    beqz t3, exp_est_done
    bltu t0, t4, exp_back_step
    j exp_est_done
exp_back_step:
    addi t4, t4, -16
    srli t4, t4, 1
    addi t3, t3, -1
    j exp_back_check
exp_est_done:
adj_up_loop:
    li t6, 15
    beq t3, t6, adj_done
    slli t5, t4, 1
    addi t5, t5, 16
    bltu t0, t5, adj_done
    mv t4, t5
    addi t3, t3, 1
    j adj_up_loop
adj_done:
    sub t6, t0, t4
    srl t6, t6, t3
    slli t3, t3, 4
    or a0, t3, t6
    andi a0, a0, 0xFF
    ret
```

---

Our cycle count decreased from `41076` to `38866`:

![image](https://hackmd.io/_uploads/rJP0HkKTxl.png)

Although it still does not reach the level of GCC -O3 optimization, the gap has narrowed considerably.

---

# Problem C

> ref: [Quiz1 of Computer Architecture (2025 Fall) Problem-C](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C)

## Implementation

My hand-written code takes roughly 3350 cycles to run:

![image](https://hackmd.io/_uploads/SyCijTqTxx.png)

> [my original code Commit d91a170](https://github.com/clare8151214/ca2025-quizzes/commit/d91a17068a04c98151b88e04a7c621943cd6fc23)

The code compiled with `gcc -O3` finishes in about 1177 cycles:

![image](https://hackmd.io/_uploads/SJ4O0oqTll.png)

> [gcc -O3 code](https://github.com/clare8151214/ca2025-quizzes/blob/main/q1-bfloat16_O3.s)

> However, I have noticed that this optimization is unsafe: it is valid only within the narrow context of the current test run. If the inputs were to include negative numbers, the simplified version would produce incorrect results.

### How the compiler "thinks"

Based on the exact information it can see, the compiler follows one guiding principle: the **"as-if" rule**. It may apply **any** transformation to the code as long as the program's **observable behavior** remains *as if* the original source had been executed. Here, the only observable behavior is printing the correct **PASS / FAIL** messages.
With that in mind, the compiler's internal reasoning can be outlined like this:

#### For `bf16_sqrt`

> "The human wrote a generic binary-search algorithm to compute a square root, but I notice it is called only with the constants **4.0** and **9.0**.
> Since I already know at compile time that `sqrt(4.0)` is **2.0** and `sqrt(9.0)` is **3.0**, why should I keep the whole algorithm?
> The entire loop is dead code for this run, so I'll replace the function with a simple chain of `if` statements:
>
> ```c
> if (input == 4.0) return 2.0;
> else if (input == 9.0) return 3.0;
> ```
>
> That's exactly what the `-O3` build produces."

#### For `bf16_add`

> "The author implemented a fully-featured adder that handles NaNs, infinities, zeros, negative numbers and subnormals.
> In `test_arithmetic`, however, the function is called only with **1.0** and **2.0**, ordinary positive, normalized numbers.
> All branches that deal with special cases are therefore never taken; they are dead code in this context and can be eliminated.
> What's more, the function is now tiny. Calling it would cost more (saving/restoring registers, jumping) than executing its body, so I'll inline the simplified core path directly into `test_arithmetic`, eliminating the call overhead altogether."

Therefore I want to optimize my own implementation, bringing its speed as close as possible to the `-O3` build, **without** sacrificing its full, general-purpose correctness.

### Optimization roadmap – from *macro* to *micro*

---

#### Phase 1: Eliminate Memory-Access Overhead in Function Calls (Top Priority)

At the moment the biggest performance bottleneck is the load/store traffic generated inside each function. Repeated stack accesses (`lw`, `sw`) are expensive. I need to refactor the routines that violate a "high-performance ABI" style.

*Functions to tackle first*: **`bf16_eq`, `bf16_lt`**

##### Strategy

Rewrite both routines into the classic **prologue → body → epilogue** layout:

1. **Prologue**: save every incoming argument **once** to `s*` (saved) registers.
2. **Body**: operate *only* in registers; no further stack traffic.
3. **Epilogue**: restore the saved registers in one shot and return.

##### Implementation example – refactoring `bf16_eq`

*Before (slow version):*

```assembly
bf16_eq:
    addi sp, sp, -12
    sw ra, 8(sp)
    sw a0, 4(sp)
    sw a1, 0(sp)
    jal ra, bf16_isnan
    # ...
    lw a0, 0(sp)
    jal ra, bf16_isnan
    # ...
```

*After (high performance):*

```assembly
bf16_eq:
    # --- Prologue ---
    addi sp, sp, -16
    sw ra, 12(sp)
    sw s0, 8(sp)
    sw s1, 4(sp)
    mv s0, a0                   # Save original value of a0 into s0
    mv s1, a1                   # Save original value of a1 into s1

    # --- Body - Use registers only ---
    jal ra, bf16_isnan          # a0 holds the value from s0
    bnez a0, eq_false_opt
    mv a0, s1                   # Restore value of a1 from s1 into a0
    jal ra, bf16_isnan
    bnez a0, eq_false_opt

    mv a0, s0                   # Restore a0
    jal ra, bf16_iszero
    mv t0, a0
    mv a0, s1                   # Restore a1
    jal ra, bf16_iszero
    mv t1, a0
    and t2, t0, t1
    bnez t2, eq_true_opt

    sub t2, s0, s1              # Compare directly using s0 and s1
    seqz a0, t2
    j eq_exit_opt

eq_true_opt:
    li a0, 1
    j eq_exit_opt
eq_false_opt:
    li a0, 0
eq_exit_opt:
    # --- Epilogue ---
    lw ra, 12(sp)
    lw s0, 8(sp)
    lw s1, 4(sp)
    addi sp, sp, 16
    ret
```
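For clarity, this is the logic the refactored routine implements, expressed as C. The bodies of `bf16_isnan` and `bf16_iszero` below are the standard bfloat16 checks and are my assumption about what those helper routines do; `bf16_eq_ref` is an illustrative name, not the project's source:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t bf16_t;

/* Assumed helper behavior (standard bfloat16: 1 sign, 8 exponent, 7 mantissa bits). */
static bool bf16_isnan(bf16_t x)  { return ((x >> 7) & 0xFF) == 0xFF && (x & 0x7F) != 0; }
static bool bf16_iszero(bf16_t x) { return (x & 0x7FFF) == 0; }

bool bf16_eq_ref(bf16_t a, bf16_t b)
{
    if (bf16_isnan(a) || bf16_isnan(b))
        return false;                     /* eq_false_opt */
    if (bf16_iszero(a) && bf16_iszero(b))
        return true;                      /* eq_true_opt: +0 and -0 compare equal */
    return a == b;                        /* bit-pattern compare: sub + seqz */
}
```

The zero check is kept before the bit-pattern compare because +0 and -0 have different bit patterns but must still compare equal; that is why the assembly calls `bf16_iszero` on both operands first.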
*I need to perform the exact same refactoring for `bf16_lt`.*

---

#### Phase 2: Reduce Control Flow Overhead (Medium Priority)

Every jump in a loop (`j`, `bnez`) incurs a performance penalty. By reducing the number of loop iterations, I can significantly lower this overhead.

**Target Functions**: `integer_multiply`, `integer_divide_unsigned`

**Optimization Strategy**: **Loop Unrolling**. I will duplicate the loop body a few times and adjust the loop counter and step size accordingly.

**Implementation: partially unrolling `integer_multiply` (factor of 4)**

This is a more aggressive version that can achieve better performance.

```assembly
integer_multiply:
    mv t0, a0                   # Multiplicand
    mv t1, a1                   # Multiplier
    li a0, 0                    # Result
    li t3, 8                    # 32 bits / 4 iterations per loop = 8 loops
int_multiply_loop_unrolled:
    # Iteration 1
    andi t2, t1, 1
    beqz t2, bit0_zero
    add a0, a0, t0
bit0_zero:
    slli t0, t0, 1
    # Iteration 2
    andi t2, t1, 2
    beqz t2, bit1_zero
    add a0, a0, t0
bit1_zero:
    slli t0, t0, 1
    # Iteration 3
    andi t2, t1, 4
    beqz t2, bit2_zero
    add a0, a0, t0
bit2_zero:
    slli t0, t0, 1
    # Iteration 4
    andi t2, t1, 8
    beqz t2, bit3_zero
    add a0, a0, t0
bit3_zero:
    slli t0, t0, 1

    srli t1, t1, 4              # Shift multiplier by 4 bits at once
    addi t3, t3, -1
    bnez t3, int_multiply_loop_unrolled
int_multiply_end:
    ret
```

---

#### Phase 3: Instruction-Level Micro-Optimizations (Low Priority)

In the race for cycle counts, every saved instruction matters. I will review my code to find opportunities to accomplish the same work with fewer instructions.

**Target Functions**: All functions, especially the core paths of `add`, `mul`, `div`.

**Optimization Strategies**:

1. **Eliminate Redundant Instructions**: look for `mv` instructions that can be omitted or computations that can be merged.
2. **Prefer Immediate Forms**: for example, `andi` with an immediate avoids a separate `li` to materialize the mask in a register, so the same work takes one instruction instead of two.
3. **Optimize Branches**: reorganize `if-else` structures so that the most common case falls through with the fewest taken jumps (friendlier to static branch prediction).

**Implementation: micro-optimizing the normalization part of `bf16_mul`**

**Before:**

```assembly
    li t5, 0x8000
    and t5, t6, t5
    beqz t5, mul_no_carry
    # Carry path
    j mul_round_ok
mul_no_carry:
    # No carry path
```

**After (branch optimization):** Assuming the "no carry" case is more common, I can let it fall through and only have the "carry" case jump.

```assembly
    li t5, 0x8000
    and t5, t6, t5
    bnez t5, mul_handle_carry   # Only jump if there is a carry

    # --- No carry path (the common case) ---
    srli t6, t6, 7
    j mul_round_ok

mul_handle_carry:
    # --- Carry path ---
    srli t6, t6, 8
    addi s1, s1, 1
    # Fall through to mul_round_ok

mul_round_ok:
    ...
```

---

# Analysis

I test my code using the [Ripes](https://github.com/mortbopet/Ripes) simulator.

> [!Note] AI tools usage
> I used ChatGPT to help me prepare for Quiz 1 by reviewing code concepts, improving grammar, conducting background research, summarizing implementation details, and explaining how standard RISC-V instructions are applied.
>
> I used [Compiler Explorer](https://godbolt.org/) to generate the `gcc -O3` optimized code and then refactored it using GPT.

# References

- [Inline assembler - Wikipedia](https://en.wikipedia.org/wiki/Inline_assembler)
- [RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf)
- [Decision tree (決策樹)](https://ithelp.ithome.com.tw/articles/10271143?sc=hot)
- [The as-if rule](https://en.cppreference.com/w/cpp/language/as_if.html)