Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by <[ianrwan](https://github.com/ianrwan/CA2025_HW1)> ## ProblemB: Optimize 20-bit Unsigned Integer to uf8 Encoding by Using Binary Search ### Overview ProblemB is about 20-bit Unsigned Integer and uf8 mapping. * The format of uf8 ``` ------------------------------------- | Exponent(4bits) | Mantissa(4bits) | ------------------------------------- 0 3 4 7 ``` * Encoding Equation: 20-bit Unsigned Integer to uf8 $$ E(v) = \begin{cases} v, & \text{if } v < 16 \\ 16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise} \end{cases} $$ Where $e$ is exponent and $\text{offset}(e) = (2^e - 1) \cdot 16$ is mantissa. * Decoding Equation: uf8 to 20-bit Unsigned Integer $$ D(b) = m \cdot 2^e + (2^e - 1) \cdot 16 $$ Where $e = \lfloor b/16 \rfloor$ is exponent and $m = b \bmod 16$ is mantissa. ### Range Calculation According to the decoding formula, $e$ is between $[0,15]$, $m$ is between $[0,15]$. We can get 20-bit unsigned integer range in each exponents: - **min**: $(2^e - 1) × 16$ - **max**: $31 × 2^e - 16$ - **step_size**: $2^e$ | Exponent | uf8 Expression | 20-bit Unsigned Range | Step Size | |----------|----------|----------|----------| | 0 | 0x0-0xF | [0, 15] | 1 | | 1 | 0x10-0x1F | [16, 46] | 2 | | 2 | 0x20-0x2F | [48, 108] | 4 | | 3 | 0x30-0x3F | [112, 232] | 8 | | 4 | 0x40-0x4F | [240, 480] | 16 | | 5 | 0x50-0x5F | [496, 976] | 32 | | 6 | 0x60-0x6F | [1008, 1968] | 64 | | 7 | 0x70-0x7F | [2032, 3952] | 128 | | 8 | 0x80-0x8F | [4080, 7920] | 256 | | 9 | 0x90-0x9F | [8176, 15856] | 512 | | 10 | 0xA0-0xAF | [16368, 31728] | 1024 | | 11 | 0xB0-0xBF | [32752, 63472] | 2048 | | 12 | 0xC0-0xCF | [65520, 126960] | 4096 | | 13 | 0xD0-0xDF | [131056, 253936] | 8192 | | 14 | 0xE0-0xEF | [262128, 507888] | 16384 | | 15 | 0xF0-0xFF | [524272, 1015792] | 32768 | ### The Problem of Encode Part in Original C code The following code block is the encode part in original C code. #### Original C code ```c= /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; /* Calculate overflow for estimated exponent */ for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16; /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` The following code block is the encode part in RV32I assembly which is translated from the original C code. #### RV32I Assembly Translated from The Original C Code ```riscv= # clz function # a0 = x clz: li t0, 32 # t0 = n = 32 li t1, 16 # t1 = c = 16 clz_while: srl t2, a0, t1 # t2 = y = x >> c beq t2, zero, clz_y_false # if y == 0 jump sub t0, t0, t1 # t0 = n = n-c mv a0, t2 # a0 = x = y clz_y_false: srli t1, t1, 1 # t1 = c = c >> 1 bne t1, zero, clz_while # if t1 != 0 continue sub a0, t0, a0 # a0 = n-x jalr ra # uf8_encode_function # a0 = value uf8_encode: addi sp, sp, -20 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) mv s0, a0 # s0 = a0 li t0, 16 # t0 = 16 bge a0, t0, uf8_encode_value_bge_16 # if value >= 16 jump jal zero, uf8_encode_return # return value (a0 doesn't change, a0 is value) uf8_encode_value_bge_16: jal ra, clz # clz(value) (a0 doesn't change, a0 is value) mv t0, a0 # t0 = lz = clz(value) (a0 change to clz(value)) li t1, 31 # t1 = 31 sub s1, t1, t0 # s1 = msb = 31-lz li s2, 0 # s2 = exponent = 0 li s3, 0 # s3 = overflow = 0 li t0, 5 # t0 = 5 ble s1, t0, uf8_encode_while_find_exact_exponent # if msb < 5 jump addi s2, s1, -4 # s2 = exponent = msb-4 li t0, 16 # t0 = 16 ble s2, t0, uf8_encode_for_overflow_init # if exponent < 16 jump (exponent <= 15) li s2, 15 # s2 = exponent = 15 uf8_encode_for_overflow_init: li t0, 0 # t0 = e = 0 uf8_encode_for_overflow: bge t0, s2, uf8_encode_while_adjust # if e >= exponent jump addi t0, t0, 1 # t0 = e = e + 1 slli s3, s3, 1 # s3 = overflow << 1 addi s3, s3, 16 # s3 = s3 + 16 jal zero, uf8_encode_for_overflow # continue uf8_encode_for_overflow uf8_encode_while_adjust: li t0, 1 # t0 = 1 ble s2, t0, uf8_encode_while_find_exact_exponent # if exponent < 1 jump (exponent <= 0) bge s0, s3, uf8_encode_while_find_exact_exponent # if value >= overflow jump addi s3, s3, -16 # s3 = overflow - 16 srli s3, s3, 1 # s3 = (overflow - 16) >> 1 addi s2, s2, -1 # s2 = exponent = exponent - 1 jal zero, uf8_encode_while_adjust # continue uf8_encode_while_adjust uf8_encode_while_find_exact_exponent: li t0, 15 # t0 = 15 bge s2, t0, uf8_encode_return # if exponent >= 15 jump slli t0, s3, 1 # t0 = next_overflow =(overflow << 1) addi t0, t0, 16 # t0 = next_overflow = (overflow << 1 )+16 ble a0, t0, uf8_encode_return # if value < next_overflow break mv s3, t0 # s3 = overflow = next_overflow addi s2, s2, 1 # s2 = exponent = exponent+1 jal zero, uf8_encode_while_find_exact_exponent uf8_encode_return: sub t0, s0, s3 # t0 = mantissa = (value-overflow) srl t0, t0, s2 # t0 = mantissa = (value-overflow) >> exponent slli a0, s2, 4 # a0 = exponent << 4 or a0, a0, t0 # a0 = (exponent) << 4 lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) addi sp, sp, 20 jalr ra ``` We can see in the original C code, `line 23 to 24` and `line 32 to 40`. The code use loop to find out the correct $e$ which may let the worst case take $O(e)$ time. It's possible to calculate $e$ and the range directly, so we can make a table of $e$ and `20-bit unsigned integer range lower bound`. ### Optimize the Encode Part First, we can make the table of $e$ and `20-bit unsigned integer range lower bound`. ```c= static const uint32_t uf8_lower_bound[16] = { // 每個指數 e 對應的最小值 = (2^e - 1) * 16 0, // e=0: (1-1)*16 = 0 16, // e=1: (2-1)*16 = 16 48, // e=2: (4-1)*16 = 48 112, // e=3: (8-1)*16 = 112 240, // e=4: (16-1)*16 = 240 496, // e=5: (32-1)*16 = 496 1008, // e=6: (64-1)*16 = 1008 2032, // e=7: (128-1)*16 = 2032 4080, // e=8: (256-1)*16 = 4080 8176, // e=9: (512-1)*16 = 8176 16368, // e=10: (1024-1)*16 = 16368 32752, // e=11: (2048-1)*16 = 32752 65520, // e=12: (4096-1)*16 = 65520 131056, // e=13: (8192-1)*16 = 131056 262128, // e=14: (16384-1)*16 = 262128 524272 // e=15: (32768-1)*16 = 524272 }; ``` The translated RV32I assembly code: ```riscv= .data uf8_lower_bound: .word 0 # e=0 .word 16 # e=1 .word 48 # e=2 .word 112 # e=3 .word 240 # e=4 .word 496 # e=5 .word 1008 # e=6 .word 2032 # e=7 .word 4080 # e=8 .word 8176 # e=9 .word 16368 # e=10 .word 32752 # e=11 .word 65520 # e=12 .word 131056 # e=13 .word 262128 # e=14 .word 524272 # e=15 ``` Second, we can use binary search to search this table via the via the lower bound to get the corrent $e$. ```c= uf8 uf8_encode_optimized(uint32_t value) { if (value < 16) return value; uint8_t low = 1; uint8_t high = 15; uint8_t exponent = 0; while (low <= high) { uint8_t mid = (low + high) / 2; if (value >= uf8_lower_bound[mid]) { exponent = mid; low = mid + 1; } else { high = mid - 1; } } uint32_t offset = uf8_lower_bound[exponent]; uint8_t mantissa = (value - offset) >> exponent; printf("mantissa %x\n", mantissa); return (exponent << 4) | mantissa; } ``` We can see in `uf8_encode_optimized` funciton, line `10` to `22` is a loop which is made for binary search, and the worst case may be $O(\log e).$ The translated RV32I assembly code: ```riscv= # uf8_encode_optimized_function # a0 = value uf8_encode_optimized: addi sp, sp, -20 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) mv s0, a0 # s0 = value li t0, 16 # t0 = 16 bge s0, t0, uf8_encode_optimized_value_bge_16 # if value >= 16 jump jal zero, uf8_encode_optimized_return # return value (a0 is still value) uf8_encode_optimized_value_bge_16: li s1, 1 # s1 = low = 1 li s2, 15 # s2 = high = 15 li s3, 0 # s3 = exponent = 0 uf8_encode_optimized_while: addi t0, s2, 1 # t0 = high + 1 bge s1, t0, uf8_encode_optimized_end_while # if low >= high + 1 jump add t1, s1, s2 # t1 = (low+high) srli t1, t1 , 1 #　t1 = mid = (low+high) / 2 la t2, uf8_lower_bound # t2 = uf8_lower_bound slli t3, t1, 2 # t3 = t1 * 4 add t2, t2 ,t3 # t2 = uf8_lower_bound + mid lw t2, 0(t2) # t2 = uf8_lower_bound[mid] ble s0, t2, uf8_encode_optimized_else # if value < uf8_lower_bound[mid] mv s3, t1 # s3 = exponent = mid addi s1, t1, 1 # s1 = low = mid+1 jal zero, uf8_encode_optimized_while_continue # end if uf8_encode_optimized_else: addi s2, t1, -1 # s2 = high = mid-1 uf8_encode_optimized_while_continue: jal zero, uf8_encode_optimized_while # while uf8_encode_optimized_end_while: la t0, uf8_lower_bound # t0 = uf8_lower_bound slli t4, s3, 2 # t4 = exponent * 4 (translate to bite address) add t1, t0 , t4 # t1 = uf8_lower_bound + exponent lw t2, 0(t1) # t2 = offset = uf8_lower_bound[exponent] sub s0, s0, t2 # s0 = value - offset srl s0, s0, s3 # s0 = mantissa = (value - offset) >> exponent slli s3, s3, 4 # s3 = exponent << 4 or a0, s3, s0 # a0 = (exponent << 4) | mantissa uf8_encode_optimized_return: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) addi sp, sp, 20 jalr ra ``` ### The Comparison between The Original Code and The Optimized Code We ran the both of the codes in assembly via Ripes, single cycle machine. The input test datas information is in the following: * Initialized Data: `0x64` * d: `47` * n: `4` #### Original Code After we ran the original code, the result we got: | Cycles | CPI | |----------|----------| | 781 | 1 | #### Optimized Code After we ran the optimized code, the result we got: | Cycles | CPI | |----------|----------| | 628 | 1 | #### Result According the the reducing cycles, we can comfirm that the optimized code is better than the original code.