Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`randyuncle`](https://github.com/randyuncle) > ## Problem B ### Problem statement This problem goals us to implement a uf8 encoding and decoding assembly program with RV32I instructions based on the given C source code in Quiz 1 Problem B. It indicates that the implementation should not contain any floating point extensions provided by RISC-V although the encode and decode algorithms in uf8 format requires floating point calculations. ### The Considered RISC-V Architecture ### The RV32I Assembly Implementations In Quiz 1 Problem B, the uf8 encode and decode algorithm is implemented with following functions. #### `clz`: count the number of leading zeros It calculates the number of leading zeros by the given unsigned 32-bit integer `x`. It is implemented with bit operations that act as a binary search for iteratively searching the MSB bit of the leading zeros. ```c int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; ``` This function is integrated with function `uf8_encode` in the following for calculating the mapped `uf8` exponent field of the given unsigned 32-bit integer. It is relevent with uf8 encoding algorithm since if we observe the mapping table of the range of unsigned 20-bit integer to uf8 exponent field, the lower bound of each range would be $$ \lfloor Range_{i} \rfloor = i * 16 + \lfloor Range_{i - 1} \rfloor , \, \forall \; 15 \ge i \ge 1, \, i \in \mathbb{I}$$ by substracting exponent with variable `i` in the equation. Also, the encoding algorithm of uf8 will reserve the range of unsigned 20-bit integer `[0 ~ 15]`, which is also the range when `exponent = 0`, as their original value after encoding, which means that there are 4 bits space of encoding mapping is one-to-one mapping. Therefore, we could count the leading zeros to find the MSB bit of the leading zeros from given unsigned 20-bit integer, and substract the result with 4 to avoid from mixing in the one-to-one mapping cases. ```c ... /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; ... ``` From the C source code implementation of `clz` function, it is written in a do-while loop to repeatedly find the number of leading zeros. If we just straightfowardly transfer it into RV32I assembly code, we might use conditional branch instructions to perform a loop. However, it requires branch predition, which might affect the efficiency of RISC-V CPU with 5-stage pipeline while counting leading zeros. Thus, I consider to implement it in branchless version. The branchless version of `clz` could also be refered to the code implementation of finding log2, which is listed in the [Bit Twiddling Hacks](https://graphics.stanford.edu/~seander/bithacks.html) website. - [ ] Case study: Implement log2 in C To study the way to build branchless clz, we could refer to a work: [finding the log2 of a given integer in C](https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious). In general, if we want to implement it in C, we can make it in a obvious way by finding the MSB N set: ```c // ref: https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious unsigned int v; // 32-bit word to find the log base 2 of unsigned int r = 0; // r will be lg(v) while (v >>= 1) // unroll for more speed... { r++; } ``` The above implementation requires $O(n)$ in worst case scenario. Instead, we can write it with the use of `clz` function Quiz 1 Problem b, making the worst case search complexity be O(lg(n)). - [ ] log2 calculation with the use of `clz` ```c // This only considered 32-bit case unsigned int v; // 32-bit word to find the log base 2 of unsigned int r; // r will be lg(v) r = 31 - clz(v); ``` Even with this optimization, can we try to make it even faster? - [ ] Optimize log2 calculation with bit-wise binary search `clz` This would need more adjustments to the implementation of the `clz` source code. Suppose that we only considered the 32-bit case, we could optimize the iterative binary search in the Quiz 1 Problem b with the use of bit-wise operations: ```c // Source: https://hackmd.io/rdTVGkmxSzyTGV9j05qZvw?view int clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; } ``` Then, we transform it into RV32I assembly code manually for furthur testings. ```asm # In here, I use `sltiu` to perform a branchless implementation of the if-else statement clz: addi t1, a0, 0 # t1 = a0 (transfering the input value) addi a0, x0, 0 # a0 = 0 (now a0 is prepared as return value of the function) clz_sh16: srli t2, t1, 16 # t2 = t1 >> 16 sltiu t2, t2, 1 # check if `y` exist (y < 1?) slli t2, t2, 4 # t2 (= 0 or 1) << 4 (= 0 or 16) sll t1, t1, t2 # t1 = t1 << t2 add a0, a0, t2 # a0 -> becomes a leading zero counter clz_sh8: srli t2, t1, 24 sltiu t2, t2, 1 slli t2, t2, 3 # t2 (= 0 or 1) << 3 (= 0 or 8) sll t1, t1, t2 add a0, a0, t2 clz_sh4: srli t2, t1, 28 sltiu t2, t2, 1 slli t2, t2, 2 # t2 (= 0 or 1) << 2 (= 0 or 4) sll, t1, t1, t2 add a0, a0, t2 clz_sh2: srli t2, t1, 30 sltiu t2, t2, 1 slli t2, t2, 1 # t2 (= 0 or 1) << 1 (= 0 or 2) sll t1, t1, t2 add a0, a0, t2 clz_sh1: srli t2, t1, 31 sub a0, a0, t2 addi a0, a0, 1 jalr x0, ra, 0 # function return ``` - [ ] Optimization log2 calculation with the use of multiply and lookup table Nevertheless, from the website [Bit Twiddling Hacks](https://graphics.stanford.edu/~seander/bithacks.html), it lists various of bit-wise operations in C that can fasten up the specific tasks, including finding the log base 2 of an N-bit integer. In the website, it shown a method that [only requires bit shifting and table lookup](https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn), without the use of if-else statement. The method requires the use of [de Bruijn sequence](https://pure.tue.nl/ws/files/4442708/597473.pdf).  ```c // Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup // Source: https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn uint32_t v; // find the log base 2 of 32-bit v int r; // result goes here static const int MultiplyDeBruijnBitPosition[32] = { 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31 }; v |= v >> 1; // first round down to one less than a power of 2 v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; r = MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27]; ``` Since RV32I instruction set doesn't include instructions that perform multiplication (e.g., MUL), we need to change the calculation of `v * 0x07C4ACDDU` into bit shifts and adds operations. This leads me to consider the use of Harley's algorithm. - [ ] Harley's algorithm This algorithm is as similar as the method in the above, instead of the chosen de Brujin sequence to use as the multiplier. In the algotithm, the author chose the sequence `0x06EB14F9` since this sequence could be expanded into $x \times 7 \times 255^3$, where both $x \times 7$ and $x \times 255$ could be apllied via bit shifting and substraction. The following is my implementation of Harley's algorithm in RISC-V with the C source code in [Hackers Delight](https://github.com/hcs0/Hackers-Delight/blob/master/nlz.c.txt). ```asm .data ... tab32: # table array with 64 entries (32 unused entries is marked as `0xFF`) .byte 32, 31, 0xFF, 16, 0xFF, 30, 3, 0xFF, 15, 0xFF, 0xFF, 0xFF, 29, 10, 2, 0xFF, 0xFF, 0xFF, 12, 14, 21, 0xFF, 19, 0xFF, 0xFF, 28, 0xFF, 25, 0xFF, 9, 1, 0xFF,17, 0xFF, 4, 0xFF, 0xFF, 0xFF, 11, 0xFF, 13, 22, 20, 0xFF, 26, 0xFF, 0xFF, 18, 5, 0xFF, 0xFF, 23, 0xFF, 27, 0xFF, 6, 0xFF, 24, 7, 0xFF, 8, 0xFF, 0, 0xFF ... .text ... # Counting the leading zeros with Harley's algorithm # # The implementation of Harley's algorithm in RV32I instructions, attempting to be # compared with branchless binary search technique. # # The original source code in C refers to the Hackers Delight's `nlz.c.txt`: # https://github.com/hcs0/Hackers-Delight/blob/master/nlz.c.txt .globl clz clz: addi sp, sp, -8 sw ra, 0(sp) sw s0, 4(sp) lui s0, %hi(tab32) addi s0, s0, %lo(tab32) # load the 64-entry table # Round up to the next highest power of 2 # For more details: https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2 srli t0, a0, 1 # temp = x >> 1 or a0, a0, t0 # x = temp (= x >> 1) | x srli t0, a0, 2 or a0, a0, t0 srli t0, a0, 4 or a0, a0, t0 srli t0, a0, 8 or a0, a0, t0 srli t0, a0, 16 or a0, a0, t0 # Multiply with de Brujin sequence 0x06EB14F9 # # Since rv32i doesn't include MUL related instructions, # I choose Harley's algorithm since its debrujin sequence # is easy to calculate under shift and substraction. slli t0, a0, 3 # sub a0, t0, a0 # multiply by 7 slli t0, a0, 8 # sub a0, t0, a0 # multiply by 255 slli t0, a0, 8 sub a0, t0, a0 slli t0, a0, 8 sub a0, t0, a0 # 255^3 srli t0, a0, 26 add s0, s0, t0 # locate to the specific entry of the table lb a0, 0(s0) # load the table entry lw ra, 0(sp) lw s0, 4(sp) jalr x0, ra, 0 # function return ``` - [ ] Simple testings with Ripes - [ ] References [[1] Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup](https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn) [[2] Count the consecutive zero bits (trailing) on the right with multiply and lookup](https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup) [[3] Round up to the next highest power of 2](https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2) [[4] C03: clz](https://hackmd.io/rdTVGkmxSzyTGV9j05qZvw?view) [[5] 0xff07 的筆記](https://hackmd.io/@0xff07/rkTYU0vKx?type=view) #### `uf8_encode`: encodes a 20-bit unsigned integer into a 8-bit symbol #### `uf8_decode`: decodes a 8-bit symbols into a 20-bit unsigned integer ### Test and Verifications #### Test cases ``` # case 1: 0, 1, 2, 3, 4, 5, 6, 7, 8 # case 2: 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000 # case 3: 0, 6, 135, 1024, 2048, 8192, 16384, 33989, 46024 ``` #### Results ## Problem C