---
title: 'Assignment 1: RISC-V Assembly and Instruction Pipeline'
disqus: hackmd
---

Assignment 1: RISC-V Assembly and Instruction Pipeline
===

>[!Note] AI tools usage
>I used ChatGPT to assist with Quiz 1: code explanations, grammar revisions, pre-work research, code summaries, and explanations of standard RISC-V instruction usage.

## Problem B: uf8 - the 8-bit codec

### 1. Overview

**What is UF8?**

UF8 (microsecond float 8-bit) is a logarithmic encoding scheme that compresses 20-bit unsigned integers ([0, 1,015,792]) into 8-bit symbols with:

* 2.5:1 compression ratio (20 bits → 8 bits)
* ≤ 6.25% maximum relative error
* ≈ 3% average relative error

**Mathematical Foundation**

* Encoding formula:

$$E(v) = \begin{cases} v & v < 16 \\ 16e + \left\lfloor \frac{v - \text{offset}(e)}{2^e} \right\rfloor & v \geq 16 \end{cases}$$

where $\text{offset}(e) = (2^e - 1) \times 16$ and $e$ is the smallest exponent (capped at 15) with $v < \text{offset}(e + 1)$.

* Decoding formula:

$$D(b) = m \times 2^e + (2^e - 1) \times 16$$

where $e = \lfloor b/16 \rfloor$ is the exponent and $m = b \bmod 16$ is the mantissa.

### 2. The RV32I Assembly translation from the C program of uf8

* Original C code:

```c=
#include <stdint.h>

typedef uint8_t uf8;

/* clz(): count leading zeros, defined elsewhere in the quiz code */

/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint8_t exponent = fl >> 4;
    uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}

/* Encode uint32_t to uf8 */
uf8 uf8_encode(uint32_t value)
{
    /* Use CLZ for fast exponent calculation */
    if (value < 16)
        return value;

    /* Find appropriate exponent using CLZ hint */
    int lz = clz(value);
    int msb = 31 - lz;

    /* Start from a good initial guess */
    uint8_t exponent = 0;
    uint32_t overflow = 0;

    if (msb >= 5) {
        /* Estimate exponent - the formula is empirical */
        exponent = msb - 4;
        if (exponent > 15)
            exponent = 15;

        /* Calculate overflow for estimated exponent */
        for (uint8_t e = 0; e < exponent; e++)
            overflow = (overflow << 1) + 16;

        /* Adjust if estimate was off */
        while (exponent > 0 && value < overflow) {
            overflow = (overflow - 16) >> 1;
            exponent--;
        }
    }

    /* Find exact exponent */
    while (exponent < 15) {
        uint32_t next_overflow = (overflow << 1) + 16;
        if (value < next_overflow)
            break;
        overflow = next_overflow;
        exponent++;
    }

    uint8_t mantissa = (value - overflow) >> exponent;
    return (exponent << 4) | mantissa;
}
```

**The Compiler Explorer code and Ripes Execution Info:**

> Full Compiler Explorer uf8 RV32I version: [Link](https://github.com/scarlett910/ca2025-quizzes/blob/28b215784036ae93465d461da938ffdaa7aa6415/uf8_compiler_gen.s)

The processor execution result:
![uf8_compiler_cycle](https://hackmd.io/_uploads/rkb6bm56gl.png)

**My Approach:**

* **uf8 decoding:** I simplified the decoding arithmetic with basic exponent math instead of magic numbers and bit shifts.

Original calculation:
```c=
offset = (0x7FFF >> (15 - exponent)) << 4;
result = (mantissa << exponent) + offset;
```
Simplified:
```c=
offset = ((1 << exponent) - 1) * 16;
result = (mantissa << exponent) + offset;
```

* **uf8 encoding:** I rewrote the logic to remove the unnecessary branches, function calls, and stack operations that the compiler-generated code introduced. The compiler's version used a very complex control structure with multiple nested loops and condition blocks (.L8, .L12, .L14, .L19, etc.).
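Restated in C, the control flow I implement in assembly reduces to a single linear threshold loop. Below is a sketch of that approach (`uf8_encode_simple` is an illustrative name, not a function from the quiz code):

```c=
#include <stdint.h>

typedef uint8_t uf8;

/* Sketch of the simplified encoder: one linear search for the exponent.
 * The running threshold follows offset(e + 1) = 2 * offset(e) + 16, which
 * matches offset(e) = (2^e - 1) * 16 for successive e. */
uf8 uf8_encode_simple(uint32_t value)
{
    if (value < 16)
        return (uf8) value;

    uint32_t exponent = 0;
    uint32_t offset = 0;   /* offset(exponent) */
    uint32_t next = 16;    /* offset(exponent + 1) */

    while (exponent < 15 && value >= next) {
        offset = next;
        next = (next << 1) + 16;
        exponent++;
    }

    uint32_t mantissa = (value - offset) >> exponent;
    return (uf8) ((exponent << 4) | (mantissa & 0x0F));
}
```

Each iteration is one shift, one add, and one compare with a single backward branch, which is what keeps the pipeline full in the assembly version described next.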
I tried to optimize the uf8 encoding with the following approaches:

🔹 Linear Loop Optimization

In my version, I replaced the compiler's nested conditional structure with a simple linear loop that grows the threshold by left-shifting and adding a constant:
```asm=
slli a5, a5, 1
addi a5, a5, 16
```
This grows the offset in a predictable way until it reaches the range limit. Because it is one simple loop with a single branch back to the start, it avoids the multiple entry and exit conditions the compiler generated. As a result, the CPU pipeline stays full and the branch predictor works more efficiently.

🔹 Register-Based Computation

All intermediate results are computed and kept directly in registers (a4, a5, etc.), instead of spilling local variables to memory as the compiler did:
```asm=
sw a5, -24(s0)
lw a5, -24(s0)
```
By keeping everything in registers, my code eliminates this memory latency and significantly reduces the total instruction count.

🔹 Removed Function Call Overhead

The compiler's code calls an external clz (count leading zeros) function:
```asm=
call clz
```
Each call pushes the return address and local variables onto the stack. My version avoids the call entirely: the exponent is counted inline by the same shift-and-add loop that grows the threshold, so the whole calculation happens within one function — no call, no return, no stack frame changes.

🔹 Reduced Branch Complexity

Instead of the compiler's multi-branch conditions such as:
```asm=
bgtu a4, a5, .L8
bleu a4, a5, .L19
beq  a5, zero, .L16
```
I structured the control flow as a straightforward sequence in which a single condition controls the loop. This simplified design reduces the number of jumps and labels the CPU must handle.

**My code for uf8 codec and the Ripes execution:**

```asm=
# uf8 Decode function
uf8_decode:
    andi t0, a0, 0x0F      # Extract mantissa (lower 4 bits)
    srli t1, a0, 4         # Extract exponent (upper 4 bits)
    li   t2, 1
    sll  t2, t2, t1        # Calculate 2^exponent
    addi t2, t2, -1        # (2^exponent - 1)
    slli t2, t2, 4         # offset = (2^exponent - 1) * 16
    sll  t0, t0, t1        # mantissa_value = mantissa << exponent
    add  a0, t0, t2        # final_value = mantissa_value + offset
    ret

# uf8 encode function
uf8_encode:
    # handle small values (0-15) that don't need compression
    li   t0, 16
    blt  a0, t0, small
    # initialize variables for exponent search
    li   t1, 0
    li   t2, 0
    li   t4, 15
loop_exp:
    # calculate next threshold value for current exponent
    add  t3, t2, t0
    bgt  t3, a0, done_exp
    mv   t2, t3
    slli t0, t0, 1
    addi t1, t1, 1
    blt  t1, t4, loop_exp
done_exp:
    # calculate mantissa from remaining value
    sub  t3, a0, t2        # value - base_offset
    srl  t3, t3, t1        # mantissa = (value - base_offset) >> exponent
    andi t3, t3, 0x0F      # ensure mantissa fits in 4 bits
    # combine exponent and mantissa into final byte
    slli t1, t1, 4         # shift exponent to upper 4 bits
    or   a0, t1, t3        # combine with mantissa in lower 4 bits
    ret
small:
    # for values < 16, no encoding needed
    ret
```

The processor execution result:
![uf8_mine_cycle](https://hackmd.io/_uploads/HJStfw5agx.png)

Compiler-generated code:
* Cycles: 57,075
* Instructions: 39,774

My code:
* Cycles: 2,761
* Instructions: 1,912

**Conclusion:** The direct translation from the C code executes roughly 20× more cycles (57,075 vs 2,761) and instructions (39,774 vs 1,912) than my hand-written version.
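As a correctness sanity check, here is a worked trace of the encode loop (hand-computed, not from the test suite). For $v = 100$ the threshold grows $16 \to 48 \to 112$; the loop stops because $112 > 100$, leaving $e = 2$ and $\text{offset} = 48$. The mantissa is $\lfloor (100 - 48)/2^2 \rfloor = 13$, so the code byte is $16 \times 2 + 13 = 45 = \texttt{0x2D}$. Decoding returns $13 \times 2^2 + (2^2 - 1) \times 16 = 100$, an exact round trip. A neighboring value such as $v = 101$ maps to the same byte and also decodes to 100, a relative error of about 1%, inside the 6.25% bound.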
### 3. Use code: LeetCode 912. Sort an Array

* **Problem context**

LeetCode 912 asks us to sort an array of integers in ascending order, where:
* Array length: 1 ≤ length ≤ 10000
* Value range: -50000 ≤ A[i] ≤ 50000
* Must use O(n log n) time complexity

Normally, this means:
* Each integer takes 4 bytes (one 32-bit word) in memory
* Sorting algorithms compare and swap full 32-bit values
* Memory bandwidth becomes a bottleneck for large arrays
* Cache misses occur frequently with large data structures

* **Why UF8 helps**

1. Reduced memory footprint
   Each value takes 4 bytes as a standard int32; with UF8 encoding, each value compresses to 1 byte (for values in [0, 1,015,792]). This is a 75% reduction in memory usage, so four times as many elements fit in the same cache line.
2. Faster sorting operations
   * Comparisons and swaps operate on 8-bit codes instead of 32-bit words, with fewer memory cycles each
   * Better cache utilization during the sorting phase
3. Controlled precision trade-off
   * For values < 16: exact representation (no loss)
   * For values ≥ 16: ~6.25% maximum relative error
   * UF8 decoding is strictly increasing in the code byte, so sorting the encoded bytes yields the same order as sorting the original values (verified exhaustively after the Ripes result below)

* **Implementation Details**

> Sort an array RV32I code:

```asm=
.data
test_header: .string "\n=== Array Sort with UF8 ===\n"
before:      .string "Before: "
after:       .string "After: "
space:       .string " "
newline:     .string "\n"

# Test: [5, 2, 3, 1]
test_array:  .word 5, 2, 3, 1

.text
.globl main
main:
    # Initialize stack
    li   sp, 0x10000000
    addi sp, sp, -256

    # Print header
    la   a0, test_header
    li   a7, 4
    ecall

    # Print "Before: "
    la   a0, before
    li   a7, 4
    ecall

    # Print original
    la   s0, test_array
    li   s1, 4
    li   s2, 0
print1:
    lw   a0, 0(s0)
    li   a7, 1
    ecall
    la   a0, space
    li   a7, 4
    ecall
    addi s0, s0, 4
    addi s2, s2, 1
    blt  s2, s1, print1
    la   a0, newline
    li   a7, 4
    ecall

    # Encode values
    la   s0, test_array
    li   s1, 4
    li   s2, 0
encode:
    lw   a0, 0(s0)
    jal  ra, uf8_encode
    sw   a0, 0(s0)
    addi s0, s0, 4
    addi s2, s2, 1
    blt  s2, s1, encode

    # Bubble sort on the encoded values
    li   s3, 3              # outer limit
    li   s4, 0              # outer counter
outer:
    bge  s4, s3, sorted
    sub  s5, s3, s4         # inner limit
    li   s6, 0              # inner counter
inner:
    bge  s6, s5, next_outer
    la   s0, test_array
    slli t0, s6, 2
    add  s0, s0, t0
    lw   t1, 0(s0)
    lw   t2, 4(s0)
    ble  t1, t2, no_swap
    sw   t2, 0(s0)
    sw   t1, 4(s0)
no_swap:
    addi s6, s6, 1
    j    inner
next_outer:
    addi s4, s4, 1
    j    outer

sorted:
    # Decode values
    la   s0, test_array
    li   s1, 4
    li   s2, 0
decode:
    lw   a0, 0(s0)
    jal  ra, uf8_decode
    sw   a0, 0(s0)
    addi s0, s0, 4
    addi s2, s2, 1
    blt  s2, s1, decode

    # Print "After: "
    la   a0, after
    li   a7, 4
    ecall

    # Print sorted
    la   s0, test_array
    li   s1, 4
    li   s2, 0
print2:
    lw   a0, 0(s0)
    li   a7, 1
    ecall
    la   a0, space
    li   a7, 4
    ecall
    addi s0, s0, 4
    addi s2, s2, 1
    blt  s2, s1, print2
    la   a0, newline
    li   a7, 4
    ecall

    # Exit
    li   a7, 10
    ecall

# UF8 Encode
uf8_encode:
    li   t0, 16
    blt  a0, t0, small
    li   t1, 0
    li   t2, 0
    li   t4, 15
loop_exp:
    add  t3, t2, t0
    bgt  t3, a0, done_exp
    mv   t2, t3
    slli t0, t0, 1
    addi t1, t1, 1
    blt  t1, t4, loop_exp
done_exp:
    sub  t3, a0, t2
    srl  t3, t3, t1
    andi t3, t3, 0x0F
    slli t1, t1, 4
    or   a0, t1, t3
    ret
small:
    ret

# UF8 Decode
uf8_decode:
    andi t0, a0, 0x0F
    srli t1, a0, 4
    li   t2, 1
    sll  t2, t2, t1
    addi t2, t2, -1
    slli t2, t2, 4
    sll  t0, t0, t1
    add  a0, t0, t2
    ret
```

* **RV32I Code Result in Ripes:**

![RV32I Leetcode912](https://hackmd.io/_uploads/SJXLjq5Tgx.png)
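The ordering claim above can be verified exhaustively, since there are only 256 codes. Below is a minimal host-side check in C (`uf8_decode_ref` is my own throwaway helper using the same formula as the assembly decoder):

```c=
#include <stdint.h>
#include <stdio.h>

/* Reference decoder: same formula as the uf8_decode routine above. */
static uint32_t uf8_decode_ref(uint8_t b)
{
    uint32_t e = b >> 4;                      /* exponent */
    uint32_t m = b & 0x0F;                    /* mantissa */
    return (m << e) + (((1u << e) - 1) << 4); /* m * 2^e + (2^e - 1) * 16 */
}

/* Check that decode is strictly increasing over all 256 codes, which is
 * exactly the property that lets us bubble-sort the encoded bytes. */
int main(void)
{
    for (int b = 0; b + 1 < 256; b++) {
        if (uf8_decode_ref((uint8_t) b) >= uf8_decode_ref((uint8_t) (b + 1))) {
            printf("ordering violated at code %d\n", b);
            return 1;
        }
    }
    printf("uf8 decode is strictly monotonic over all 256 codes\n");
    return 0;
}
```

Because decode is strictly increasing, comparing encoded bytes directly is safe; only distinct inputs that collide to the same code become indistinguishable after the round trip, which is the approximate-sorting trade-off noted in the conclusion below.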
**Conclusion:**

```
Memory Efficiency Metrics (example: array of 1000 integers):
- Standard int32: 1000 × 4 bytes = 4,000 bytes
- UF8 compressed: 1000 × 1 byte  = 1,000 bytes
- Memory savings: 3,000 bytes (75% reduction)

Performance Benefits:
✓ 75% memory reduction
✓ Faster comparisons (8-bit vs 32-bit)
✓ Maintained sort ordering correctness
✓ Suitable for approximate sorting applications
```

(Note: the demo above stores each code in a 32-bit word for simplicity; realizing the 75% saving and byte-wide comparisons would require packing the codes into a byte array.)

## Problem C: bfloat16

### 1. Overview

**Why BFloat16?**

BFloat16 (Brain Floating Point 16) was developed by Google Brain specifically for machine learning workloads. It offers several critical advantages:

| Feature       | Float32                  | BFloat16 | Benefit                             |
| ------------- | ------------------------ | -------- | ----------------------------------- |
| Total bits    | 32                       | 16       | 50% memory reduction                |
| Exponent bits | 8                        | 8        | Same dynamic range                  |
| Mantissa bits | 23                       | 7        | Lower precision (acceptable for ML) |
| Range         | ±1.18×10⁻³⁸ to ±3.4×10³⁸ | Same     | No range loss                       |

Bit layout:
```
┌─────────┬──────────────┬──────────────┐
│Sign (1) │ Exponent (8) │ Mantissa (7) │
└─────────┴──────────────┴──────────────┘
 15        14          7   6           0

S: Sign bit (0 = positive, 1 = negative)
E: Exponent bits (8 bits, bias = 127)
M: Mantissa/fraction bits (7 bits)
```

The value $v$ of a BFloat16 number is calculated as:

$$v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right)$$

where:
* $S \in \{0, 1\}$ is the sign bit
* $E \in [1, 254]$ is the biased exponent
* $M \in [0, 127]$ is the mantissa value

Special cases:
* Zero: $E = 0, M = 0 \Rightarrow v = (-1)^S \times 0$
* Infinity: $E = 255, M = 0 \Rightarrow v = (-1)^S \times \infty$
* NaN: $E = 255, M \neq 0 \Rightarrow v = \text{NaN}$
* Denormals: not supported (flush to zero)

Square root:

$$\sqrt{a} = \sqrt{2^{e_a} \times m_a} = 2^{e_a/2} \times \sqrt{m_a}$$

The bf16_sqrt implementation uses pure integer arithmetic without floating-point library dependencies:

1. Special case handling (IEEE-754 compliant):
   * $\sqrt{+0} = +0$
   * $\sqrt{-0} = -0$
   * $\sqrt{+\infty} = +\infty$
   * $\sqrt{-\infty} = \text{NaN}$
   * $\sqrt{\text{NaN}} = \text{NaN}$
   * $\sqrt{x} = \text{NaN}$ for all $x < 0$
2. Exponent processing:
   For even exponents ($e_a$ even):
   $$e_r = \frac{e_a}{2}, \quad m' = m_a$$
   For odd exponents ($e_a$ odd):
   $$e_r = \frac{e_a - 1}{2}, \quad m' = 2 \times m_a$$
   This keeps the adjusted mantissa in the normalized range $[1, 2)$ or $[2, 4)$ for processing.
3. Mantissa square root via binary search:
   The algorithm finds $r$ such that $r^2 \approx m' \times 128$ using binary search. The scaling factor 128 represents the implicit 1.0 in the mantissa representation.
4. Normalization:
   * Ensure the result mantissa is in range $[128, 256)$
   * Adjust the exponent if necessary
   * Extract the 7-bit mantissa (removing the implicit 1)

**Mathematical Properties:**
* Monotonic: $x < y \Rightarrow \sqrt{x} \leq \sqrt{y}$
* Identity: $\sqrt{x^2} = x$ for $x \geq 0$
* Precision: maximum relative error < 0.5% across the BF16 range
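To make steps 2–4 concrete, here is a minimal integer-only C sketch of the normal-path square root (my illustrative code, not the assembly; it assumes step 1 has already filtered out zeros, negatives, infinities, and NaNs):

```c=
#include <stdint.h>

typedef uint16_t bf16_t;

/* Square root for a normal, positive BF16 input, using only integer ops. */
static bf16_t bf16_sqrt_sketch(bf16_t x)
{
    int e = ((x >> 7) & 0xFF) - 127;  /* unbiased exponent */
    uint32_t m = 128 + (x & 0x7F);    /* mantissa with implicit 1, scaled by 128 */

    /* Step 2: make the exponent even by folding one factor of 2 into
     * the mantissa, so m ends up in [128, 512). */
    if (e & 1) {
        m <<= 1;
        e -= 1;
    }
    int e_r = e / 2;

    /* Step 3: binary search for r = floor(sqrt(m * 128)); r is the result
     * mantissa at the same 128x scale and lands in [128, 256). */
    uint32_t target = m << 7;  /* m * 128 */
    uint32_t lo = 128, hi = 255, r = 128;
    while (lo <= hi) {
        uint32_t mid = (lo + hi) >> 1;
        if (mid * mid <= target) {
            r = mid;
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }

    /* Step 4: r is already in [128, 256), so strip the implicit 1 and
     * reassemble sign 0, biased exponent, and 7-bit mantissa. */
    return (bf16_t) (((uint16_t) (e_r + 127) << 7) | (uint16_t) (r - 128));
}
```

The sketch truncates where a real implementation may round, but it is exact for perfect squares: for example it maps 9.0 (0x4110) to 3.0 (0x4040), matching the sqrt test cases reported in the next section.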
### 2. The RV32I Assembly translation from the C program of bfloat16

> Full bfloat16 RV32I version (too long to add here ¯\\_(ツ)_/¯): [Github link](https://github.com/scarlett910/ca2025-quizzes/blob/a8ccb45a364bf7c12484bedeacf2734e5092af2c/hw1_bf16_full.s)

**Implementation details:**

Core BF16 operations:
* Conversion: `bf16_to_f32`, `f32_to_bf16`
* Arithmetic functions implemented:
  * Addition (`bf16_add`)
  * Subtraction (`bf16_sub`)
  * Multiplication (`bf16_mul`)
  * Division (`bf16_div`)
  * Square root approximation (`bf16_sqrt`)
* Special value handling: `bf16_isnan`, `bf16_iszero`
* Comparison operations: `bf16_eq`, `bf16_lt`, `bf16_gt`

**Testing result analysis and Ripes processor execution info:**

Testing result:
```
==== BF16 Addition Tests ====
Test 1: 1.0 + 2.0 = 3.0 ✓ PASS
Test 2: 3.5 + 2.25 = 5.75 (expected 0x40B8, got 0x4050) X FAIL
Test 3: 5.5 + 0.0 = 5.5 ✓ PASS
Test 4: -2.5 + 2.5 = 0.0 ✓ PASS

==== BF16 Subtraction Tests ====
Test 1: 5.0 - 3.0 = 2.0 ✓ PASS
Test 2: 10.0 - 5.0 = 5.0 ✓ PASS
Test 3: 3.0 - 3.0 = 0.0 ✓ PASS

==== BF16 Multiplication Tests ====
Test 1: 3.0 × 4.0 = 12.0 ✓ PASS
Test 2: 2.0 × 3.0 = 6.0 ✓ PASS
Test 3: 1.5 × 2.0 = 3.0 ✓ PASS
Test 4: -2.0 × 3.0 = -6.0 ✓ PASS
Test 5: 0.5 × 4.0 = 2.0 ✓ PASS

==== BF16 Division Tests ====
Test 1: 8.0 ÷ 2.0 = 4.0 ✓ PASS
Test 2: 9.0 ÷ 3.0 = 3.0 ✓ PASS
Test 3: 1.0 ÷ 2.0 = 0.5 ✓ PASS

==== Special Values Tests ====
Test 1: Infinity + 5.0 = Infinity ✓ PASS
Test 2: Infinity - Infinity = NaN ✓ PASS
Test 3: 2.0 × Infinity = Infinity ✓ PASS
Test 4: 0.0 ÷ 0.0 = NaN ✓ PASS
Test 5: 5.0 ÷ 0.0 = Infinity ✓ PASS

==== Edge Cases Tests ====
Test 1: tiny + 0.0 = 0.0 ✓ PASS
Test 2: huge × huge = Infinity ✓ PASS
Test 3: +0.0 + -0.0 = +0.0 (expected 0x0000, got 0x8000) X FAIL

==== Rounding Tests ====
Test 1: 1.0 + tiny = 1.0 (expected 0x3F80, got 0x3F81) X FAIL
Test 2: 1.5 × 1.5 = 2.25 ✓ PASS

==== Comparison Tests ====
Test 1: 1.0 == 1.0 = true ✓ PASS
Test 2: 1.0 == 2.0 = false ✓ PASS
Test 3: 1.0 < 2.0 = true ✓ PASS
Test 4: 2.0 < 1.0 = false ✓ PASS
Test 5: 2.0 > 1.0 = true ✓ PASS
Test 6: NaN == NaN = false ✓ PASS

==== Square Root Tests ====
Test 1: sqrt(4.0) = 2.0 ✓ PASS
Test 2: sqrt(9.0) = 3.0 ✓ PASS
Test 3: sqrt(16.0) = 4.0 ✓ PASS
Test 4: sqrt(25.0) = 5.0 ✓ PASS
Test 5: sqrt(0.0) = 0.0 ✓ PASS
Test 6: sqrt(-2.0) = NaN ✓ PASS

==== Test Summary ====
Total tests: 37
Passed: 34
Failed: 3
```

Failed tests (rounding and sign handling issues):

❌ Test 2 (Addition): 3.5 + 2.25 = 5.75
   Expected: 0x40B8, Got: 0x4050
   Cause: 5.75 is exactly representable in BF16 (0x40B8), and 0x4050 decodes to 3.25, so this points to a mantissa-alignment bug in `bf16_add` rather than a precision limit

❌ Test 3 (Edge Cases): +0.0 + -0.0 = +0.0
   Expected: 0x0000, Got: 0x8000 (negative zero)
   Cause: sign bit handling in zero-result cases

❌ Test 1 (Rounding): 1.0 + tiny = 1.0
   Expected: 0x3F80, Got: 0x3F81
   Cause: rounding bias when adding a value far below the mantissa's resolution

Ripes processor execution:
![bf16_ripes](https://hackmd.io/_uploads/rksG8K9Txx.png)

Performance metrics:

| Metric               | Value  | Analysis                              |
| -------------------- | ------ | ------------------------------------- |
| Total Cycles         | 11,846 | Includes all arithmetic operations    |
| Instructions Retired | 8,803  | Actual instructions executed          |
| CPI                  | 1.35   | Good for a software FP implementation |

### 3. Use code: LeetCode 311. Sparse Matrix Multiplication

* **Problem context**

LeetCode 311 asks us to compute the product of two sparse matrices, where most elements are zeros. Normally, this means:
* We only multiply a small fraction of entries.
* Each nonzero multiplication involves floating-point math or large integers.
* The main bottleneck is memory bandwidth and cache access, not the arithmetic itself.
* **Why bfloat16 helps**

1. Reduced memory bandwidth
   - Each matrix element in float32 takes 4 bytes; in bfloat16 it takes 2 bytes — exactly half. Twice as many matrix elements fit in cache, and each load/store from memory moves less data.
   - The CPU/FPGA/accelerator performs fewer memory cycles.
   - For sparse matrix multiplication, where accessing nonzero entries is memory-heavy, this drastically cuts data traffic.
2. Preserved dynamic range
   - Unlike fp16, bfloat16 keeps the same exponent width (8 bits) as float32, so it handles large and small values without overflow/underflow.
   - Keeping nearly the same range makes it well suited to accumulating approximate results.

* **Implementation Details**

Idea in C: optimize the matrix multiplication by skipping zero entries in both operands:

```c=
#include <string.h>  /* memset */

/* bf16_t and the bf16_* helpers come from the BF16 library above. */

// ============================================================================
// Matrix Structure
// ============================================================================
typedef struct {
    int rows;
    int cols;
    bf16_t *data;
} Matrix;

// ============================================================================
// Sparse Matrix Multiplication
// ============================================================================
void sparse_matrix_multiply(Matrix *A, Matrix *B, Matrix *C,
                            int *total_ops, int *skipped_ops)
{
    *total_ops = 0;
    *skipped_ops = 0;

    // Initialize C to zeros
    memset(C->data, 0, C->rows * C->cols * sizeof(bf16_t));

    // Optimized sparse matrix multiplication
    for (int i = 0; i < A->rows; i++) {
        for (int k = 0; k < A->cols; k++) {
            bf16_t a_ik = A->data[i * A->cols + k];

            // Skip if A[i][k] is zero
            if (bf16_iszero(a_ik)) {
                *skipped_ops += B->cols;
                continue;
            }

            for (int j = 0; j < B->cols; j++) {
                bf16_t b_kj = B->data[k * B->cols + j];

                // Skip if B[k][j] is zero
                if (bf16_iszero(b_kj)) {
                    (*skipped_ops)++;
                    continue;
                }

                // C[i][j] += A[i][k] * B[k][j]
                bf16_t product = bf16_mul(a_ik, b_kj);
                bf16_t current = C->data[i * C->cols + j];
                C->data[i * C->cols + j] = bf16_add(current, product);
                (*total_ops)++;
            }
        }
    }
}
```

> Sparse Matrix Multiplication RV32I code (too long to add here ¯\\_(ツ)_/¯): [Github link](https://github.com/scarlett910/ca2025-quizzes/blob/a8ccb45a364bf7c12484bedeacf2734e5092af2c/bf16_sparse_matrix_mul.s)

* **RV32I Code Result in Ripes:**

![RV32I Leetcode311-cycle](https://hackmd.io/_uploads/HkiPTFq6gx.png)
![RV32I Leetcode311](https://hackmd.io/_uploads/Byy10Kq6el.png)

**Conclusion:**

```
Efficiency = Skipped Operations / Total Possible Operations
           = 15 / 18 = 83.33%

BF16 Memory Savings vs FP32:
- FP32: 21 × 4 bytes = 84 bytes
- BF16: 21 × 2 bytes = 42 bytes
- Savings: 50%
```

## Reference

* [Quiz 1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Quiz1-of-Computer-Architecture-2025-Fall)
* [LeetCode 311. Sparse Matrix Multiplication](https://www.cnblogs.com/grandyang/p/5282959.html)
* [LeetCode 912. Sort an Array](https://www.cnblogs.com/grandyang/p/11483981.html)