# Assignment1: RISC-V Assembly and Instruction Pipeline contributed < [`vestata`](https://github.com/vestata) > ## Problem B in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) ### C code **fp32_to_bf16** ```c static inline bf16_t fp32_to_bf16(float s) { bf16_t h; union { float f; uint32_t i; } u = {.f = s}; if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */ h.bits = (u.i >> 16) | 64; /* force to quiet */ return h; } h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10; return h; } ``` This code is a function written in C that converts a 32-bit floating-point number (FP32) into a 16-bit floating-point format called bfloat16 (BF16). Recall that: ```c ┌ sign │ │ ┌ exponent │ │ │ │ ┌ mantissa │ │ │ │┌──┴───┐┌─┴───┐ 0b0000000000000000 bfloat16 ``` **Handling special case**: In the bfloat16 format, a value is considered NaN (Not a Number) when the exponent is all 1s (`11111111` in binary), and the mantissa (fraction) is not entirely zero. This is similar to how NaNs are represented in the IEEE 754 floating-point formats. If the exponent is all 1s but the mantissa is completely zero, the value represents infinity (positive or negative, depending on the sign bit). To distinguish NaN from infinity, at least one bit in the mantissa must be set to 1. In particular, setting the first bit of the mantissa to 1 by `| 64` ensures that the value is treated as NaN instead of infinity. [Single-precision_floating-point wiki](https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Exponent_encoding) **In normal case**: The expression `(u.i >> 0x10) & 1` checks whether the bfloat16 result would be even or odd. It does this by examining the least significant bit of the lower 16 bits that will be discarded during the conversion.If the value is **odd** (i.e., the result of the check is 1), the code adds an extra 1 to `u.i` to help with rounding.After this adjustment, the value is right-shifted by 16 bits (`>> 0x10`), effectively converting the 32-bit float to bfloat16 by keeping the upper 16 bits. This ensures that the conversion properly handles rounding by following the "round-to-nearest-even" rule. The bfloat16 is stored in the lower 16 bits of a 32-bit word. ### RISC-V Assembly Implementation handling union in risc-v ```c fp32_to_bf16: # Input: a0 = 32-bit float value # Output: a0 = 16-bit bfloat16 value in the lower section of uint32_t add t0, x0, a0 # special case li t1, 0x7fffffff # set t1 to 0x7fffffff and t2, t0, t1 # set t2 to u.i & 0x7fffffff li t3, 0x7f800000 # set t3 to 0x7f800000 bltu t3, t2, handle_nan # normal case srli t5, a0, 0x10 # (u.i >> 0x10) andi t5, t5, 1 # & 1 li t6, 0x7fff # set t6 to 0x7fff add t5, t5, t6 # (0x7fff + ((u.i >> 0x10) & 1)) add a0, a0, t5 # apply round-to-nearest-even srli a0, a0, 16 # move bf16 to less significant 16 bit jr ra handle_nan: add t4, x0, a0 # set t4 to a0 srli t4, t4, 16 # u.i >> 16 ori t4, t4, 64 # | 64, /* force to quiet */ add a0, x0, t4 # set a0 to return jr ra ``` <!-- :::warning I initially aimed to implement bfloat16 vector dot product, but the C code implementation turned out to be quite complex. I'm not confident enough to translate it into assembly from scratch or meet the requirements for this assignment. As a result, I decided to switch to a different topic. If you're interested, you can check it out on my [GitHub](https://github.com/vestata/Computer-Architecture-2024.git). ::: --> ## Problem C in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) ### version 1 #### C code ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` #### Assembly code ```assembly main: li a0, 0x13 jal ra, my_clz li a7, 10 ecall my_clz: # Input: a0 = uint32_t # Output: a0 = clz li t0, 0 # count = 0 li t1, 31 # i = 31 clz_loop: li t2, 1 sll t2, t2, t1 # (1U << i) and t2, t2, a0 # x & (1U << i) bnez t2, clz_end # if (x & (1U << i)) break addi t0, t0, 1 # count++ addi t1, t1, -1 # i-- bgez t1, clz_loop # for loop clz_end: mv a0, t0 # set output ret ``` | Info | Value | |:----------------|:--------| | Cycles | 267 | | Instrs. retired | 201 | | CPI | 1.33 | | IPC | 0.753 | | Clock rate | 0 Hz | ### version 2 Use bitmask to implement clz is an easy way to boost the speed of the code. #### C code ```c int my_clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; } ``` #### Assembly ```assembly my_clz: # Input: a0 = uint32_t # Output: a0 = clz beqz a0, clz_zero li t0, 0 # n = 0 li t1, 0x0000FFFF blt t1, a0, shift16_skip addi t0, t0, 16 # n += 16; slli a0, a0, 16 # x <<= 16; shift16_skip: li t1, 0x00FFFFFF blt t1, a0, shift8_skip addi t0, t0, 8 # n += 8; slli a0, a0, 8 # x <<= 8; shift8_skip: li t1, 0x0FFFFFFF blt t1, a0, shift4_skip addi t0, t0, 4 # n += 4; slli a0, a0, 4 # x <<= 4; shift4_skip: li t1, 0x3FFFFFFF blt t1, a0, shift2_skip addi t0, t0, 2 # n += 2; slli a0, a0, 2 # x <<= 2; shift2_skip: li t1, 0x7FFFFFFF blt t1, a0, clz_done addi t0, t0, 1 # n += 1; slli a0, a0, 1 # x <<= 1; j clz_done clz_zero: li a0, 32 ret clz_done: li a0, t0 ret ``` | Info | Value | |:----------------|:--------| | Cycles | 46 | | Instrs. retired | 32 | | CPI | 1.44 | | IPC | 0.696 | | Clock rate | 0 Hz | ## [3307. Find the K-th Character in String Game II](https://leetcode.com/problems/find-the-k-th-character-in-string-game-ii/) >**Description**: Alice and Bob are playing a game. Initially, Alice has a string `word = "a"`. >You are given a positive integer `k`. You are also given an integer array `operations`, where `operations[i]` represents the type of the `ith` operation. >Now Bob will ask Alice to perform all operations in sequence: If `operations[i] == 0`, append a copy of word to itself. If `operations[i] == 1`, generate a new string by changing each character in `word` to its next character in the English alphabet, and append it to the original `word`. For example, performing the operation on `"c"` generates `"cd"` and performing the operation on `"zb"` generates `"zbac"`. Return the value of the `kth` character in `word` after performing all the operations. >Note that the character `'z'` can be changed to `'a'` in the second type of operation. **Solution**: So we need to find how many times `a` changes to reach the input position `k`, the first step is to calculate the word length of the final operation. Since the word initially starts with `a`, in the first iteration, the word length is two. In the second iteration, the word length becomes four. The word length grows exponentially. Therefore, to determine the word size at position `k`, it can be represented as $2^{\log_2(k)+1}$, which corresponds to $\log_2(k)+1$ operations. The next step is to determine in which half of the string position `k` is located at each operation. If `k` is in the right half of the word, we increase the change count by one if `operation[i] == 1`. Then, we subtract the position of `k` to find its parent character. Use a for loop to go through all the operations. In the final step, apply the accumulated changes to the character `a`. ```c char kthCharacter(long long k, int* operations, int operationsSize) { int mutations = 0; for (int op = log2(k); op >= 0; --op) if (k > 1LL << op) { k -= 1LL << op; mutations += operations[op]; } return 'a' + mutations % 26; } ``` The relationship between `ilog2` (integer log base 2) and `clz` (count leading zeros) stems from the fact that finding the logarithm base 2 of an integer can be directly related to identifying the position of the highest set bit in the binary representation of the number. The `log2()` function can be defined by custom function, `my_clz()`. ```c int ilog2(int n){ return 31 - my_clz(n); } ``` Notice that the input of this question is in `long long`, so we hath to modify `my_clz()` and `ilog2()` to support 64-bit format. ```c int my_clz64(uint64_t x) { if (x == 0) return 64; int n = 0; if (x <= 0x00000000FFFFFFFF) { n += 32; x <<= 32; } if (x <= 0x0000FFFFFFFFFFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFFFFFFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFFFFFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFFFFFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFFFFFFFFFF) { n += 1; x <<= 1; } return n; } int ilog2(long long n){ return 63 - my_clz64(n); } ``` The full code is right [here](https://github.com/vestata/Computer-Architecture-2024/tree/main/hw1). Since we have to handle `long long`, which is 64-bit data, we need to modify the `rv32i` code to use two registers to handle `long long`. Following the C code, we have defined three `long long` operations: `shift_left_64`, `blt_64`, and `sub_64`. ```assembly my_clz64: # Input: # a0 (lower 32 bits of x) # a1 (upper 32 bits of x) # Output: # a0 (clz) or t0, a0, a1 beqz t0, zero_case # if x == 0 li t1, 0 # set n = 0 bnez a1, shift32_skip addi t1, t1, 32 # n += 32 mv a1, a0 # x <<= 32 li a0, 0 # x_low = 0 shift32_skip: # if (x_hi <= 0x0000ffff) li t2, 0xffff sltu t3, t2, a1 # t3(flag) = (0xfff < x_hi) bnez t3, shift16_skip # if (x_hi <= 0000ffff) addi t1, t1, 16 # n += 16 slli a1, a1, 16 # x_hi <<= 16 mv t3, a0 srli t3, t3, 16 or a1, a1, t3 # x_hi = (x_hi << 16) | (x_lo >> 16) slli a0, a0, 16 # shift16_skip: # if (x_hi <= 0x00ffffff) li t2, 0xffffff sltu t3, t2, a1 bnez t3, shift8_skip addi t1, t1, 8 # n += 8 slli a1, a1, 8 # x_hi <<= 8 mv t3, a0 # set a extra register to preserve the bits # being shifted srli t3, t3, 8 or a1, a1, t3 # x_hi = (x_hi << 8) | (x_lo >> 8) slli a0, a0, 8 # shift8_skip: # if (x_hi <= 0x0fffffff) li t2, 0xfffffff sltu t3, t2, a1 bnez t3, shift4_skip addi t1, t1, 4 # n += 4 slli a1, a1, 4 # x_hi <<= 4 mv t3, a0 srli t3, t3, 4 or a1, a1, t3 # x_hi = (x_hi << 4) | (x_lo >> 4) slli a0, a0, 4 # shift4_skip: # if (x_hi <= 0x3fffffff) li t2, 0x3fffffff sltu t3, t2, a1 bnez t3, shift2_skip addi t1, t1, 2 # n += 2 slli a1, a1, 2 # x_hi <<= 2 mv t3, a0 srli t3, t3, 2 or a1, a1, t3 # x_hi = (x_hi << 2) | (x_lo >> 2) slli a0, a0, 2 # shift2_skip: # if (x_hi <= 0x7fffffff) li t2, 0x7fffffff sltu t3, t2, a1 bnez t3, clz_end addi t1, t1, 1 # n += 1 slli a1, a1, 1 # x_hi <<= 1 mv t3, a0 srli t3, t3, 1 or a1, a1, t3 # x_hi = (x_hi << 1) | (x_lo >> 1) slli a0, a0, 1 # clz_end: mv a0, t1 # return n ret zero_case: li a0 64 # retrun 64 ret ilog2: addi sp, sp, -4 sw ra, 0(sp) jal ra, my_clz64 lw ra, 0(sp) addi sp, sp, 4 li t0, 63 sub a0, t0, a0 ret kthCharacter: # Inputs: # a0: k_low (lower 32 bits of k) # a1: k_high (upper 32 bits of k) # a2: operations (pointer to array) # a3: operationsSize # Output: # a0: resulting character addi sp, sp, -28 sw ra, 24(sp) # save return address sw s0, 20(sp) # save mutations sw s1, 16(sp) # save op sw s2, 12(sp) # save k_low sw s3, 8(sp) # save k_high sw s4, 4(sp) # save operations address sw s5, 0(sp) # save operationsSize mv s4, a2 mv s5, s3 li s0, 0 # set s0 to mutation = 0 mv s2, a0 mv s3, a1 jal ra, ilog2 # call ilog2, result in a0 mv s1, a0 # set s1 = op loop: bltz s1, loop_end # branch if op < 0 mv a0, s1 # a0 = op jal ra, shift_left_64 # return a0, a1 for 64 bit data mv t0, a0 # set t0 = shift_reult_low mv t1, a1 # set t1 = shift_reult_high mv a0, t0 # set a0 = shift_result_low mv a1, t1 # set a1 = shift_result_high mv a2, s2 # set a2 = k_low mv a3, s3 # set a3 = k_high jal ra, blt_64 # blt 1LL << op, k, return boolean result # in a0 beqz a0, skip_if mv a0, s2 # a0 = k_low mv a1, s3 # a1 = k_high mv a2, t0 # a2 = shift_result_low mv a3, t1 # a3 = shift_result_high jal ra, sub_64 # return result in a0, a1 # update k mv s2, a0 # s2 = k_low mv s3, a1 # s3 = k_high # operation[op] slli t0, s1, 2 add t1, s4, t0 lw t2, 0(t1) add s0, s0, t2 # s0(mutations) += operations[op] skip_if: addi s1, s1, -1 # op-- j loop loop_end: li t0, 26 rem t1, s0, t0 # set t1 = mutations % 26 li t0, 0x61 # set t0 = ascii of 'a' add a0, t0, t1 # a0 = 'a' + (mutations % 26) lw s5, 0(sp) lw s4, 4(sp) lw s3, 8(sp) lw s2, 12(sp) lw s1, 16(sp) lw s0, 20(sp) lw ra, 24(sp) addi sp, sp, 28 ret # To handle left shift for 64-bit long long shift_left_64: # Inputs(long long): # a0: shift amount(op) # Outputs(long long): # a0: lower 32 bits # a1: upper 32 bits li t4, 32 blt a0, t4, shift_less_32 # check if shift over the # representation of 32-bit # left shift more than 31bits addi a0, a0, -32 # li a1, 1 # sll a1, a1, a0 # set x_hi = 1 << (op - 32) li a0, 0 # set x_low = 0 j end shift_less_32: # left shift more than 32 bits li a1, 1 # temp set 1 sll a0, a1, a0 # set x_low = 1LL << op li a1, 0 # set x_hi = 0 j end end: ret # To handle comaprison of two long long data blt_64: # Inputs(two long long): # a0: x1_low # a1: x1_high # a2: x2_low # a3: x2_high # Outputs(long long): # a0: boolean bltu a1, a3, true beq a1, a3, equal false: li a0, 0 # x1_high > x2_high, return false ret true: # x1 < x2, return true li a0, 1 ret equal: # if x1_high = x2_high, compare lower bltu a0, a2, true # x1_low < x2_low, return true j false # To handle substraction of two long long data sub_64: # Inputs(two long long): # a0: x1_low # a1: x1_high # a2: x2_low # a3: x2_high # Outputs(long long): # a0: r_low # a1: r_high sltu t0, a0, a2 # t0 = (a0 < a2) ? 1 : 0, check if borrow sub a0, a0, a2 # a0 = a0 - a2 sub a1, a1, a3 # a1 = a1 - a3 sub a1, a1, t0 # a1 = a1 - t0 ret ``` ### Test cases **Test case 1:** >Input: k = 12145134613, operations = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1] Output: "i" (0x69) **Test case 2:** >Input: k = 10, operations = [0, 1, 0, 1] Output: "b" (0x62) **Test case 3:** >Input: k = 25, operations = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0] Output: "b" (0x62) The test result as below: ```c All tests pass! Program exited with code: 0 ``` The execution info as below: | Info | Value | |-------------------|--------| | Cycles | 2687 | | Instrs. retired | 1709 | | CPI | 1.57 | | IPC | 0.636 | | Clock rate | 0 Hz | ## Analysis ### Pseudo instruction ```assembly 00000000 <main>: 0: 10000297 auipc x5 0x10000 4: 08c28293 addi x5 x5 140 8: 0002a503 lw x10 0 x5 c: 10000317 auipc x6 0x10000 10: 08430313 addi x6 x6 132 14: 00032583 lw x11 0 x6 18: 10000617 auipc x12 0x10000 1c: fe860613 addi x12 x12 -24 20: 10000397 auipc x7 0x10000 24: 07438393 addi x7 x7 116 28: 0003a683 lw x13 0 x7 2c: 1e0000ef jal x1 480 <kthCharacter> 30: 10000e17 auipc x28 0x10000 34: 068e0e13 addi x28 x28 104 38: 000e2e83 lw x29 0 x28 3c: 09c000ef jal x1 156 <check_result> 40: 10000297 auipc x5 0x10000 44: 06c28293 addi x5 x5 108 48: 0002a503 lw x10 0 x5 4c: 10000317 auipc x6 0x10000 50: 06430313 addi x6 x6 100 54: 00032583 lw x11 0 x6 58: 10000617 auipc x12 0x10000 5c: 04460613 addi x12 x12 68 60: 10000397 auipc x7 0x10000 64: 05438393 addi x7 x7 84 68: 0003a683 lw x13 0 x7 6c: 1a0000ef jal x1 416 <kthCharacter> 70: 10000e17 auipc x28 0x10000 74: 048e0e13 addi x28 x28 72 78: 000e2e83 lw x29 0 x28 7c: 05c000ef jal x1 92 <check_result> 80: 10000297 auipc x5 0x10000 84: 06428293 addi x5 x5 100 88: 0002a503 lw x10 0 x5 8c: 10000317 auipc x6 0x10000 90: 05c30313 addi x6 x6 92 94: 00032583 lw x11 0 x6 98: 10000617 auipc x12 0x10000 9c: 02460613 addi x12 x12 36 a0: 10000397 auipc x7 0x10000 a4: 04c38393 addi x7 x7 76 a8: 0003a683 lw x13 0 x7 ac: 160000ef jal x1 352 <kthCharacter> b0: 10000e17 auipc x28 0x10000 b4: 040e0e13 addi x28 x28 64 b8: 000e2e83 lw x29 0 x28 bc: 01c000ef jal x1 28 <check_result> c0: 00400893 addi x17 x0 4 c4: 10000517 auipc x10 0x10000 c8: 03050513 addi x10 x10 48 cc: 00000073 ecall d0: 00a00893 addi x17 x0 10 d4: 00000073 ecall 000000d8 <check_result>: d8: 01d51463 bne x10 x29 8 <fail> dc: 00008067 jalr x0 x1 0 000000e0 <fail>: e0: 00400893 addi x17 x0 4 e4: 10000517 auipc x10 0x10000 e8: 02150513 addi x10 x10 33 ec: 00000073 ecall f0: 00a00893 addi x17 x0 10 f4: 00000073 ecall 000000f8 <my_clz64>: f8: 00b562b3 or x5 x10 x11 fc: 0e028463 beq x5 x0 232 <zero_case> 100: 00000313 addi x6 x0 0 104: 00059863 bne x11 x0 16 <shift32_skip> 108: 02030313 addi x6 x6 32 10c: 00050593 addi x11 x10 0 110: 00000513 addi x10 x0 0 00000114 <shift32_skip>: 114: 000103b7 lui x7 0x10 118: fff38393 addi x7 x7 -1 11c: 00b3be33 sltu x28 x7 x11 120: 000e1e63 bne x28 x0 28 <shift16_skip> 124: 01030313 addi x6 x6 16 128: 01059593 slli x11 x11 16 12c: 00050e13 addi x28 x10 0 130: 010e5e13 srli x28 x28 16 134: 01c5e5b3 or x11 x11 x28 138: 01051513 slli x10 x10 16 0000013c <shift16_skip>: 13c: 010003b7 lui x7 0x1000 140: fff38393 addi x7 x7 -1 144: 00b3be33 sltu x28 x7 x11 148: 000e1e63 bne x28 x0 28 <shift8_skip> 14c: 00830313 addi x6 x6 8 150: 00859593 slli x11 x11 8 154: 00050e13 addi x28 x10 0 158: 008e5e13 srli x28 x28 8 15c: 01c5e5b3 or x11 x11 x28 160: 00851513 slli x10 x10 8 00000164 <shift8_skip>: 164: 100003b7 lui x7 0x10000 168: fff38393 addi x7 x7 -1 16c: 00b3be33 sltu x28 x7 x11 170: 000e1e63 bne x28 x0 28 <shift4_skip> 174: 00430313 addi x6 x6 4 178: 00459593 slli x11 x11 4 17c: 00050e13 addi x28 x10 0 180: 004e5e13 srli x28 x28 4 184: 01c5e5b3 or x11 x11 x28 188: 00451513 slli x10 x10 4 0000018c <shift4_skip>: 18c: 400003b7 lui x7 0x40000 190: fff38393 addi x7 x7 -1 194: 00b3be33 sltu x28 x7 x11 198: 000e1e63 bne x28 x0 28 <shift2_skip> 19c: 00230313 addi x6 x6 2 1a0: 00259593 slli x11 x11 2 1a4: 00050e13 addi x28 x10 0 1a8: 002e5e13 srli x28 x28 2 1ac: 01c5e5b3 or x11 x11 x28 1b0: 00251513 slli x10 x10 2 000001b4 <shift2_skip>: 1b4: 800003b7 lui x7 0x80000 1b8: fff38393 addi x7 x7 -1 1bc: 00b3be33 sltu x28 x7 x11 1c0: 000e1e63 bne x28 x0 28 <clz_end> 1c4: 00130313 addi x6 x6 1 1c8: 00159593 slli x11 x11 1 1cc: 00050e13 addi x28 x10 0 1d0: 001e5e13 srli x28 x28 1 1d4: 01c5e5b3 or x11 x11 x28 1d8: 00151513 slli x10 x10 1 000001dc <clz_end>: 1dc: 00030513 addi x10 x6 0 1e0: 00008067 jalr x0 x1 0 000001e4 <zero_case>: 1e4: 04000513 addi x10 x0 64 1e8: 00008067 jalr x0 x1 0 000001ec <ilog2>: 1ec: ffc10113 addi x2 x2 -4 1f0: 00112023 sw x1 0 x2 1f4: f05ff0ef jal x1 -252 <my_clz64> 1f8: 00012083 lw x1 0 x2 1fc: 00410113 addi x2 x2 4 200: 03f00293 addi x5 x0 63 204: 40a28533 sub x10 x5 x10 208: 00008067 jalr x0 x1 0 0000020c <kthCharacter>: 20c: fe410113 addi x2 x2 -28 210: 00112c23 sw x1 24 x2 214: 00812a23 sw x8 20 x2 218: 00912823 sw x9 16 x2 21c: 01212623 sw x18 12 x2 220: 01312423 sw x19 8 x2 224: 01412223 sw x20 4 x2 228: 01512023 sw x21 0 x2 22c: 00060a13 addi x20 x12 0 230: 00098a93 addi x21 x19 0 234: 00000413 addi x8 x0 0 238: 00050913 addi x18 x10 0 23c: 00058993 addi x19 x11 0 240: fadff0ef jal x1 -84 <ilog2> 244: 00050493 addi x9 x10 0 00000248 <loop>: 248: 0604c063 blt x9 x0 96 <loop_end> 24c: 00048513 addi x10 x9 0 250: 08c000ef jal x1 140 <shift_left_64> 254: 00050293 addi x5 x10 0 258: 00058313 addi x6 x11 0 25c: 00028513 addi x10 x5 0 260: 00030593 addi x11 x6 0 264: 00090613 addi x12 x18 0 268: 00098693 addi x13 x19 0 26c: 0a0000ef jal x1 160 <blt_64> 270: 02050863 beq x10 x0 48 <skip_if> 274: 00090513 addi x10 x18 0 278: 00098593 addi x11 x19 0 27c: 00028613 addi x12 x5 0 280: 00030693 addi x13 x6 0 284: 0a8000ef jal x1 168 <sub_64> 288: 00050913 addi x18 x10 0 28c: 00058993 addi x19 x11 0 290: 00249293 slli x5 x9 2 294: 005a0333 add x6 x20 x5 298: 00032383 lw x7 0 x6 29c: 00740433 add x8 x8 x7 000002a0 <skip_if>: 2a0: fff48493 addi x9 x9 -1 2a4: fa5ff06f jal x0 -92 <loop> 000002a8 <loop_end>: 2a8: 01a00293 addi x5 x0 26 2ac: 02546333 rem x6 x8 x5 2b0: 06100293 addi x5 x0 97 2b4: 00628533 add x10 x5 x6 2b8: 00012a83 lw x21 0 x2 2bc: 00412a03 lw x20 4 x2 2c0: 00812983 lw x19 8 x2 2c4: 00c12903 lw x18 12 x2 2c8: 01012483 lw x9 16 x2 2cc: 01412403 lw x8 20 x2 2d0: 01812083 lw x1 24 x2 2d4: 01c10113 addi x2 x2 28 2d8: 00008067 jalr x0 x1 0 000002dc <shift_left_64>: 2dc: 02000e93 addi x29 x0 32 2e0: 01d54c63 blt x10 x29 24 <shift_less_32> 2e4: fe050513 addi x10 x10 -32 2e8: 00100593 addi x11 x0 1 2ec: 00a595b3 sll x11 x11 x10 2f0: 00000513 addi x10 x0 0 2f4: 0140006f jal x0 20 <end> 000002f8 <shift_less_32>: 2f8: 00100593 addi x11 x0 1 2fc: 00a59533 sll x10 x11 x10 300: 00000593 addi x11 x0 0 304: 0040006f jal x0 4 <end> 00000308 <end>: 308: 00008067 jalr x0 x1 0 0000030c <blt_64>: 30c: 00d5e863 bltu x11 x13 16 <true> 310: 00d58a63 beq x11 x13 20 <equal> 00000314 <false>: 314: 00000513 addi x10 x0 0 318: 00008067 jalr x0 x1 0 0000031c <true>: 31c: 00100513 addi x10 x0 1 320: 00008067 jalr x0 x1 0 00000324 <equal>: 324: fec56ce3 bltu x10 x12 -8 <true> 328: fedff06f jal x0 -20 <false> 0000032c <sub_64>: 32c: 00c532b3 sltu x5 x10 x12 330: 40c50533 sub x10 x10 x12 334: 40d585b3 sub x11 x11 x13 338: 405585b3 sub x11 x11 x5 33c: 00008067 jalr x0 x1 0 ``` ### 5-stage pipelined processor Now we can choose a processor to run this code. Ripes provide four kinds of processor for us to choose, including **single cycle processor**, **5-stage processor**, **5-stage with hazard detection** and **5-stage with forward and hazard detection**. Here we choose the **5 stage processor**. Its block diagram look like this: ![image](https://hackmd.io/_uploads/SkIN77HJ1e.png) The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are: 1. Instruction fetch (IF) 2. Instruction decode and register fetch (ID) 3. Execute (EX) 4. Memory access (MEM) 5. Register write back (WB) You can see that each stage is separated by pipeline registers (the rectangle block with stage names on its each side) in the block diagram. #### I-Format load Detailed Analysis of `lw t2, 0(t1)` in RISC-V Pipeline: >instruction format : `lw t2, 0(t1)` The register `t1` stores the address of the input operation array plus the offset of the variable `op`. We can then check whether the data of operation array is `0` or `1`, and store the result in `t2`. #### 1. Instruction fetch (IF) ![image](https://hackmd.io/_uploads/HJtGOXBy1l.png =40%x) - Follow [online tool for RISC-V Instruction Encoder/Decoder](https://luplab.gitlab.io/rvcodecjs/) we can find the machine code to `lw t2, 0(t1)` is 0x00032383. - PC will increment by 4 automatically using the above adder. - Because there is no branch occur, next instruction will be at PC + 4, so the multiplexer before PC choose input come from adder. #### Instruction decode and register fetch (ID) ![image](https://hackmd.io/_uploads/r1B3tXBkye.png =40%x) | imm[11:0] | rs1[5] | funct3[3] | rd[5] | opcode[7] | Instruction | | ------------ | ------ | --------- | ----- | --------- | ------------- | | imm[11:0] | rs1 | 010 | rd | 0000011 | lw | | 000000000000 | 00110 | 010 | 00111 | 0000011 | lw(in binary) | | 0x00000000 | 0x06 | 0x02 | 0x07 | 0x03 | lw(in hex) | - Recall I-format load, we can split `lw t2, 0(t1)` into five parts. Which is shown above. - `R2 idx` is not used in `lw` so it is `0x00` - Current PC value (`0x00000298`) and next PC value (`0x0000029c`) are just send through this stage, we don't use them. #### 3. Execute (EX) ![image](https://hackmd.io/_uploads/BkxohQHyJx.png =40%x) - Since this is an I-type instruction, the first-level multiplexers (which normally select values from register 1 and register 2) are not used. These register values are filtered out by the second-level multiplexers, as the instruction does not rely on register-to-register operations. - `Reg 1` and `Reg 2` are send to branch block, but no branch is taken. - The second-level multiplexers select the current value of the program counter (PC) and the immediate value (`0`) to be used as the operands for the ALU. - The ALU performs an addition operation between the base address in `0x06` and the immediate value (`0`). The result is the effective memory address, which is then passed to the memory access stage (EX/MEM). #### 4. Memory access (MEM) ![image](https://hackmd.io/_uploads/rJgQlNHJye.png =40%x) - Use as data memory address. Memory read data at address `0x10000084`, so Read `out` is equal to `0x00000001`. The table below denotes the data section of memory. - ![image](https://hackmd.io/_uploads/H1VAMVHyJe.png) - Next PC value (`0x0000029c`) and `Wr idx` (`0x07`) are just send through this stage, we don't use them. - `Reg 2` is send to `Data in`, but memory doesn't enable writing. #### 5. Register write back (WB) ![image](https://hackmd.io/_uploads/SyT9V4HJyg.png =20%x) - The multiplexer choose `Read out` from data memory as final output. So the output value is `0x00000001`. - The output value and `Wr idx` are send back to registers block. Finally, the value `0x00000001` will be write into `x7` register, whose ABI name is `t2`. After all these stage are done, the register is updated like this: ![image](https://hackmd.io/_uploads/B1TBHVry1l.png =20%x) <!-- ## Matrix multiplication The range of representation for 32-bit floating point numbers. | Type | Exp | Fraction | Sign | | -------------------- | ------- | -------- | ---- | | Positive Zero | 0 | 0 | 0 | | Negative Zero | 0 | 0 | 1 | | Denormalised numbers | 0 | non zero | 0 | | -Denormalised numbers | 0 | non zero | 1 | | Normalised numbers | 1...254 | any | any | | Infinities | $2^8$-1 | 0 | any | | NaN | $2^8$-1 | non zero | any | So, we need to handle cases where floating-point calculations result in underflow and overflow. Following previous work, both floating-point addition and floating-point multiplication are required to perform matrix multiplication. Some robust implementations can be referenced from [com-arch2023-quiz](https://hackmd.io/@sysprog/arch2023-quiz1#Problem-C). <!-- --- ```c float f32vecdot(float *A, float *B, int n) { float ret = 0.0f; for (int i = 0; i < n; i++) { ret += A[i] * B[i]; } return ret; } void f32tobf16vecdot(float *A, int n, uint32_t *result) { int packed_size = n >> 1; for (int i = 0; i < packed_size; i++) { bf16_t bf16_a = fp32_to_bf16(A[i << 1]); bf16_t bf16_b = fp32_to_bf16(A[i << 1 + 1]); packed_result[i] = bf16_encode(bf16_a, bf16_b); } } ``` --- ### `f_mul` ```c float fmul32(float a, float b) { /* TODO: Special values like NaN and INF */ int32_t ia = *(int32_t *) &a, ib = *(int32_t *) &b; /* sign */ int sa = ia >> 31; int sb = ib >> 31; /* mantissa */ int32_t ma = (ia & 0x7FFFFF) | 0x800000; int32_t mb = (ib & 0x7FFFFF) | 0x800000; /* exponent */ int32_t ea = ((ia >> 23) & 0xFF); int32_t eb = ((ib >> 23) & 0xFF); // handle zeros if(ea == 0 || eb == 0){ int32_t f_zero = 0 | (sa ^ sb) << 31; return *(float *) &f_zero; } // handle /* 'r' = result */ int64_t mrtmp = imul32(ma, mb) >> 23; int mshift = getbit(mrtmp, C01); int64_t mr = mrtmp >> mshift; int32_t ertmp = ea + eb - C02; int32_t er = mshift ? inc(ertmp) : ertmp; /* TODO: Overflow ^ */ int sr = sa ^ sb; int32_t r = (sr << C03) | ((er & 0xFF) << C04) | (mr & 0x7FFFFF); return *(float *) &r; } ``` --> <!-- Since we are using ripes for analysis, so we will focus at single processor. The data will focus on bfloat16 so the bottleneck of the matrix multiplication will be the the memory access. --> <!-- matrix multiplication 主要分三個步驟 1. 從 main memory load 到 register 2. A * B 3. 累加到 C --> <!-- The first thing we have to deal with is floating point manipulation, since no M or F/D extensions are allowed. To instruction have to be written from scratch, which are `f_add` and `f_mul`. ### `f_add` ```c # Floating-point addition function (f_add) f_add: # Extract significand, exponent, and sign of A li t6, 0x7FFFFF and t0, a0, t6 # t0 = Significand of A (23 bits) srli t2, a0, 23 # t2 = Exponent of A (8 bits) andi t2, t2, 0xFF # Mask the exponent srli t4, a0, 31 # t4 = Sign of A (1 bit) # Extract significand, exponent, and sign of B and t1, a1, t6 # t1 = Significand of B (23 bits) srli t3, a1, 23 # t3 = Exponent of B (8 bits) andi t3, t3, 0xFF # Mask the exponent srli t5, a1, 31 # t5 = Sign of B (1 bit) # Add implicit leading one to significands li t6, 0x800000 # t6 = 1 << 23 or t0, t0, t6 # A's significand with implicit 1 or t1, t1, t6 # B's significand with implicit 1 # Compare exponents and align significands bge t2, t3, align_B # If A's exponent >= B's exponent, align B # Else, align A sub t6, t3, t2 # t6 = Exponent difference srl t0, t0, t6 # Shift A's significand mv t2, t3 # Set exponent to the larger one j add_significands align_B: sub t6, t2, t3 # t6 = Exponent difference srl t1, t1, t6 # Shift B's significand # Exponent t2 is already the larger one add_significands: # Handle sign and add/subtract significands beq t4, t5, same_sign # If signs are the same, add # Signs are different; subtract smaller from larger blt t0, t1, swap_operands sub t0, t0, t1 # t0 = t0 - t1 j normalize swap_operands: sub t0, t1, t0 # t0 = t1 - t0 mv t4, t5 # Result sign is sign of B j normalize same_sign: add t0, t0, t1 # Add significands normalize: # Normalize result li t6, 0x800000 # MSB position normalize_loop: blt t0, t6, shift_left j check_overflow shift_left: slli t0, t0, 1 # Shift left addi t2, t2, -1 # Decrease exponent j normalize_loop check_overflow: srli t3, t0, 24 # Check for overflow beqz t3, finalize srli t0, t0, 1 # Shift right addi t2, t2, 1 # Increase exponent j check_overflow finalize: # Pack result into IEEE 754 format li t6, 0x7FFFFF and t0, t0, t6 # Mask significand to 23 bits slli t2, t2, 23 # Position exponent or t0, t0, t2 # Combine significand and exponent slli t4, t4, 31 # Position sign bit or a0, t0, t4 # Combine all parts jr ra # Return to caller ``` --> <!-- ### fp32 matmul --> <!-- The naive way for baseline. ```c void matmul(float* A, float* B, float* C, int M, int K, int N) { for (int i = 0; i < M; i++) { for (int j = 0; j < N; j++) { for (int k = 0; k < K; k++) { C[i * N + j] += A[i * K + k] * B[k * N + j]; } } } } ``` ```c matmul: # input # a0 - float* A # a1 - float* B # a2 - float* C # a3 - int M # a4 - int K # a5 - int N # stack register addi sp, sp, -24 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) mv s3, a3 # set s3 = M mv s4, a4 # set s4 = K mv s5, a5 # set s5 = N # initial i = 0 li s0, 0 # set s0 = i Loop_i: bge s0, s3, Loop_i_end # if i >= M, go to Loop_i_end # initial j = 0 li s1, 0 # set s1 = j Loop_j: bge s1, s5, Loop_j_end # if j >= N, go to Loop_j_end mul t0, s0, s5 # t0 = i * N add t0, t0, s1 # t0 = i * N + j slli t0, t0, 2 # t0 = (i * N + j) * 4 add t0, t0, a2 # t0 = address of C[i * N + j] lw t0, 0(t0) # t0 = C[i * N + j] # initial k = 0 li s2, 0 # s2 = k Loop_k: bge s2, s4, Loop_k_end # if k >= K, go to Loop_k_end mul t1, s0, s4 # t1 = i * K add t1, t1, s2 # t1 = i * K + k slli t1, t1, 2 # t1 = (i * K + k) * 4 add t1, t1, a0 # t1 = address of A[i * K + k] lw t1, 0(t1) # t1 = A[i * K + k] mul t2, s2, s5 # t2 = k * N add t2, t2, s1 # t2 = k * N + j slli t2, t2, 2 # t2 = (k * N + j) * 4 add t2, t2, a1 # t2 = address of B[k * N + j] lw t2, 0(t2) # t2 = B[k * N + j] mul t3, t1, t2 # t3 = t1 * t2 add t0, t0, t3 # t0 = t0 + t3 addi s2, s2, 1 # k++ j Loop_k # back to Loop_k Loop_k_end: # store the result to C[i * N + j] sw t0, 0(t0) # C[i * N + j] = t0 addi s1, s1, 1 # j++ j Loop_j # back to Loop_j Loop_j_end: addi s0, s0, 1 # i++ j Loop_i # back to Loop_i Loop_i_end: lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) lw s4, 16(sp) lw s5, 20(sp) addi sp, sp, 24 ret # return ``` --> ## TODO :::warning 1. Check if popcount version clz can go faster 2. Improve `blt_64`, `shift_left_64`, `sub_64` ::: ## Reference [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) [Leetcode 3307. Find the K-th Character in String Game II](https://leetcode.com/problems/find-the-k-th-character-in-string-game-ii/description/) [online tool for RISC-V Instruction Encoder/Decoder](https://luplab.gitlab.io/rvcodecjs/)