Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by : <[`tyt017`](https://github.com/tyt017/2024arch/tree/main)> (Ya-Tong Tsai) ## Quiz1 Problem C ### test cases | number | input | output | |:------:|:-----:|:--------:| | NaN | 0x7FFF | 0x7FFFFFFF | | -Inifinity | 0xFC00 | 0xFF800000 | | +Inifinity | 0x7C00 | 0x7F800000 | | -0 | 0x8000 | 0x80000000 | | +0 | 0x0000 | 0x00000000 | | Denormalized | 0x0001 | 0x33800000 | | Normalized | 0x3C00 | 0x3F800000 | ### The origin version #### C code ```c= static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & 0x80000000; const uint32_t nonsign = w & 0x7FFFFFFF; uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & 0x7F800000; const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` #### RISC-V Assembly code ```riscv= .data input_data: .word 0x7FFF answer: .word 0x7FFFFFFF str1: .string "TRUE" str2: .string "FALSE" .text input: la s0, input_data # address of input lw t0, 0(s0) slli t0, t0 16 #extend 16-bit to 32-bit li t1, 0x80000000 #sign-bit mask and t1, t0, t1 #sign bit li t2, 0x7FFFFFFF #nonsign-bit mask and t2, t0, t2 #nonsign part addi t3, x0, 31 #loop counter i my_clz: addi t4, x0, 1 #load 1 sll t4, t4, t3 # 1 << i and t4, t2, t4 # nonsign & (1 << i) blt x0, t4, renorm # finish counting addi a0, a0, 1 # count++ addi t3, t3, -1 #--i blt t3, x0, renorm beqz t4, my_clz # keep counting renorm: add t5, a0, x0 # load #of leading zero addi t4, x0, 5 sub t5, t5, t4 bgt t5, x0, result # if (renorm - 5) > 0 add t5, x0, x0 # t5 = renorm_shift result: li t0, 0x04000000 add s1, t2, t0 srai s1, s1, 8 li t0, 0x7F800000 and s1, s1, t0 # s1 = inf_nan_mask addi s2, t2, -1 srai s2, s2, 31 # s2 = zero_mask sll t2, t2, t5 srli t2, t2, 3 li t6, 0x70 sub t6, t6, t5 slli t6, t6, 23 add t2, t2, t6 or t2, t2, s1 xori s2, s2, -1 and t2, t2, s2 or a0, t1, t2 # the result is stored in a0 la s3, answer lw t1, 0(s3) beq a0, t1, true la a0, str2 # the result is wrong li a7, 4 ecall true: # the result is the same with the answer la a0, str1 li a7, 4 ecall ``` :::success After testing, the code above can output the correct result. ::: ## Branchless CLZ To make it more friendly for pipeline system, we can use bitwise AND, OR and SHIFT operators to implement `branchless clz`. > I was going to use the function clz5(x) on [wiki page](https://en.wikipedia.org/wiki/Find_first_set#CLZ), but it still have to use conditionals to judge the next step. In order to achieve "branchless", I change to use the thought of population count which is referenced from [stackoverflow](https://stackoverflow.com/questions/10439242/count-leading-zeroes-in-an-int32). ### C code ```c= int my_clz2(uint32_t x) { x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; //count the ones x -= x >> 1 & 0x55555555; x = (x >> 2 & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0F0F0F0F; x += x >> 8; x += x >> 16; return 32 - (x & 0x3F); } ``` ### RISC-V Assembly code ```riscv= .data input_data: .word 0x0000 answer: .word 0x00000000 str1: .string "TRUE" str2: .string "FALSE" .text input: la s0, input_data # address of input lw t0, 0(s0) slli t0, t0 16 #extend 16-bit to 32-bit li t1, 0x80000000 #sign-bit mask and t1, t0, t1 #sign bit li t2, 0x7FFFFFFF #nonsign-bit mask and t2, t0, t2 #nonsign part add t4, t2, x0 my_clz: srli t3, t4, 1 or t4, t4, t3 # x |= x >> 1 srli t3, t4, 2 or t4, t4, t3 # x |= x >> 2 srli t3, t4, 4 or t4, t4, t3 # x |= x >> 4 srli t3, t4, 8 or t4, t4, t3 # x |= x >> 8 srli t3, t4, 16 or t4, t4, t3 # x |= x >> 16 li t5, 0x55555555 srli t3, t4, 1 and t3, t3, t5 sub t4, t4, t3 # x -= x >> 1 & 0x55555555 li t5, 0x33333333 srli t3, t4, 2 and t3, t3, t5 and t5, t4, t5 add t4, t3, t5 # x = (x >> 2 & 0x33333333) + (x & 0x33333333) li t5, 0x0F0F0F0F srli t3, t4, 4 add t4, t3, t4 and t4, t4, t5 # x = ((x >> 4) + x) & 0x0F0F0F0F srli t3, t4, 8 add t4, t3, t4 # x += x >> 8 srli t3, t4, 16 add t4, t3, t4 # x += x >> 16 li t5, 0x3F and t4, t4, t5 addi t3, x0, 32 sub t3, t3, t4 # 32 - (x & 0x3F) add a0, x0, t3 # put the result into a0 renorm: add t5, a0, x0 # load #of leading zero addi t4, x0, 5 sub t5, t5, t4 bgt t5, x0, result # if (renorm - 5) > 0 add t5, x0, x0 # t5 = renorm_shift result: li t0, 0x04000000 add s1, t2, t0 srai s1, s1, 8 li t0, 0x7F800000 and s1, s1, t0 # s1 = inf_nan_mask addi s2, t2, -1 srai s2, s2, 31 # s2 = zero_mask sll t2, t2, t5 srli t2, t2, 3 li t6, 0x70 sub t6, t6, t5 slli t6, t6, 23 add t2, t2, t6 or t2, t2, s1 xori s2, s2, -1 and t2, t2, s2 or s1, t1, t2 # the result is stored in s1 la s3, answer lw t1, 0(s3) beq s1, t1, true la a0, str2 li a7, 4 ecall j exit true: la a0, str1 li a7, 4 ecall exit: add x0, x0, x0 # nop ``` ## Elias Gamma Encoding Elias Gamma encoding is a variable-length, lossless entropy encoding method used to encode non-negative integers efficiently. It is particularly useful when smaller integers are more frequent because it produces shorter encodings for smaller numbers. This makes it a practical choice for compressing sequences of small integers, such as in certain data compression algorithms. The encoding flow is showed in the following graph. First, calculate L = $\lfloor \log_2 x \rfloor$, L is the number of leading zero for encoded result, and then the encoded result contains L zeros and the binary form of x. For example, `x = 278`, then `L = 8` and the binary form is `100010110`, so the encoded result is `00000000100010110`. ``` mermaid graph TD; A["Input int number x"]-->B["compute L = lg(x)"]; B["compute L = lg(x)"]-->C["Output L zeros + the binary form of x"]; ``` ### Test cases |#| input | output | |:-:|:-----:|:------:| | 1 | 5 | 00101 | | 2 | 16 | 000010000 | | 3 | 278 | 00000000100010110 | ### Motivation Since "Counting leading Zero (CLZ)" can help us not only to immediately calculate $\lfloor \log_2 x \rfloor$, but also to get the MSB in the binary form of the input, I think it can be implemented into this problem to speedup the performance. In the following part, I will implement the solution with non-branchless CLZ and the other one with branchless CLZ. ### C code (No CLZ) ```c= #include <stdio.h> #include <stdint.h> int binary_length(uint32_t n) { int length = 0; while (n > 0) { length++; n >>= 1; } return length; } // Elias Gamma encoding void elias_gamma_encode(uint32_t n) { if (n == 0) { printf("0\n"); return; } int L = binary_length(n); for (int i = 0; i < L - 1; i++) { printf("0"); } for (int i = L - 1; i >= 0; i--) { printf("%d", (n >> i) & 1); } printf("\n"); } int main() { printf("Elias Gamma encoding:\n"); elias_gamma_encode(5); // output: 00101 elias_gamma_encode(7); // output: 00111 elias_gamma_encode(278); // output: 00000000100010110 return 0; } ``` ### Solution (C code with CLZ) ```c= #include <stdio.h> #include <stdint.h> static inline int log_with_clz(uint32_t x) { int count = 0; if (x == 0) return 32; while ((x & (1U << 31)) == 0) { x <<= 1; count++; } return 32 - count; } // Elias Gamma encoding void elias_gamma_encode(uint32_t n) { if (n == 0) { printf("0\n"); return; } int L = log_with_clz(n); for (int i = 0; i < L - 1; i++) { printf("0"); } for (int i = L - 1; i >= 0; i--) { printf("%d", (n >> i) & 1); } printf("\n"); } int main() { printf("Elias Gamma encoding:\n"); elias_gamma_encode(5); // output: 00101 elias_gamma_encode(16); // output: 000010000 elias_gamma_encode(278); // output: 00000000100010110 return 0; } ``` ### Solution (RISC-V Assembly code with CLZ) The code will directly output the result into console. ```riscv= .data input_data: .word 278 str0: .string "0" str1: .string "1" .text intput: la s0, input_data lw t0, 0(s0) addi t3, x0, 31 my_clz: addi t4, x0, 1 #load 1 sll t4, t4, t3 # 1 << i and t4, t0, t4 # x & (1 << i) blt x0, t4, out_my_clz # finish counting addi a0, a0, 1 # count++ addi t3, t3, -1 #--i blt t3, x0, out_my_clz beqz t4, my_clz # keep counting out_my_clz: li t1, 32 sub t1, t1, a0 add t2, x0, t1 # loop counter for L-1 zeros addi t3, t1, -1 add t5, t1, x0 # loop counter for output_binary output: addi t2, t2, -1 bgt t2, x0, output_zero j output_binary output_zero: # print 0 la a0, str0 li a7, 4 ecall j output output_binary: srl t4, t0, t3 # shift to the MSB andi t1, t4, 1 addi t3, t3, -1 addi t5, t5, -1 blt t5, x0, exit beqz t1, output_zero output_one: # print 1 la a0, str1 li a7, 4 ecall bgt t5, x0, output_binary exit: add x0, x0, x0 # nop ``` :::success After testing, the code above can output the correct result. ::: ### Solution (C code with branchless CLZ) Since the last line of the original branchless CLZ is `32 - (x & 0x3F)`, and $\log_{2}{x}$ is `32 - clz(x)`, I reduce the last line of log_with_clz2(x) to `(x & 0x3F)`. ```c= #include <stdio.h> #include <stdint.h> int log_with_clz2(uint32_t x) { x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; //count the ones x -= x >> 1 & 0x55555555; x = (x >> 2 & 0x33333333) + (x & 0x33333333); x = (x >> 4) + x & 0x0F0F0F0F; x += x >> 8; x += x >> 16; return (x & 0x3F); } // Elias Gamma encoding void elias_gamma_encode(uint32_t n) { if (n == 0) { printf("0\n"); return; } int L = log_with_clz2(n); for (int i = 0; i < L - 1; i++) { printf("0"); } printf("1"); for (int i = L - 2; i >= 0; i--) { printf("%d", (n >> i) & 1); } printf("\n"); } int main() { printf("Elias Gamma encoding:\n"); elias_gamma_encode(5); // output: 00101 elias_gamma_encode(16); // output: 000010000 elias_gamma_encode(278); // output: 00000000100010110 return 0; } ``` ### Solution (RISC-V Assembly code with branchless CLZ) The code will directly output the result into console. ```riscv= .data input_data: .word 278 str0: .string "0" str1: .string "1" .text intput: la s0, input_data lw t0, 0(s0) add t4, t0, x0 my_clz: srli t3, t4, 1 or t4, t4, t3 # x |= x >> 1 srli t3, t4, 2 or t4, t4, t3 # x |= x >> 2 srli t3, t4, 4 or t4, t4, t3 # x |= x >> 4 srli t3, t4, 8 or t4, t4, t3 # x |= x >> 8 srli t3, t4, 16 or t4, t4, t3 # x |= x >> 16 li t5, 0x55555555 srli t3, t4, 1 and t3, t3, t5 sub t4, t4, t3 # x -= x >> 1 & 0x55555555 li t5, 0x33333333 srli t3, t4, 2 and t3, t3, t5 and t5, t4, t5 add t4, t3, t5 # x = (x >> 2 & 0x33333333) + (x & 0x33333333) li t5, 0x0F0F0F0F srli t3, t4, 4 add t4, t3, t4 and t4, t4, t5 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t3, t4, 8 add t4, t3, t4 # x += x >> 8 srli t3, t4, 16 add t4, t3, t4 # x += x >> 16 li t5, 0x3F and t4, t4, t5 add a0, x0, t4 # put the result into a0 out_my_clz: add t2, x0, a0 # loop counter for L-1 zeros addi t3, a0, -1 addi t5, a0, 0 # loop counter for output_binary output: addi t2, t2, -1 bgt t2, x0, output_zero j output_binary output_zero: la a0, str0 li a7, 4 ecall j output output_binary: srl t4, t0, t3 # shift to the MSB andi t1, t4, 1 addi t3, t3, -1 addi t5, t5, -1 blt t5, x0, exit beqz t1, output_zero output_one: la a0, str1 li a7, 4 ecall bgt t5, x0, output_binary exit: add x0, x0, x0 # nop ``` :::success After testing, the code above can output the correct result. ::: ## Result As the result showed in the following figures, the program using branchless CLZ computes faster than the original one. The reason is that if there is a loop or branch in C code, pipeline processor may have to deal with control hazard caused by mispredicted branch. In this situation, to make the program work correctly, pipeline processor needs to flush the instructions that have been load in to previous stages (before the stage deciding branch or not). ### test case 1 (5) Performance of the original code (with CLZ) ![image](https://hackmd.io/_uploads/HywBFItyJx.png) Performance of the optimal code (with branchless CLZ) ![image](https://hackmd.io/_uploads/SJlKdIKJkl.png) ### test case 2 (16) Performance of the original code (with CLZ) ![image](https://hackmd.io/_uploads/ryT5FItJkx.png) Performance of the optimal code (with branchless CLZ) ![image](https://hackmd.io/_uploads/By10uUtkkx.png) ### test case 3 (278) Performance of the original code (with CLZ) ![image](https://hackmd.io/_uploads/rkih_4ty1l.png) Performance of the optimal code (with branchless CLZ) ![image](https://hackmd.io/_uploads/HkCWt4tyJl.png) ## Analysis ### Pseudo Instruction for Elias Gamma Encoding with Branchless CLZ ``` 00000000 <intput>: 0: 10000417 auipc x8 0x10000 4: 00040413 addi x8 x8 0 8: 00042283 lw x5 0 x8 c: 00028eb3 add x29 x5 x0 00000010 <my_clz>: 10: 001ede13 srli x28 x29 1 14: 01ceeeb3 or x29 x29 x28 18: 002ede13 srli x28 x29 2 1c: 01ceeeb3 or x29 x29 x28 20: 004ede13 srli x28 x29 4 24: 01ceeeb3 or x29 x29 x28 28: 008ede13 srli x28 x29 8 2c: 01ceeeb3 or x29 x29 x28 30: 010ede13 srli x28 x29 16 34: 01ceeeb3 or x29 x29 x28 38: 55555f37 lui x30 0x55555 3c: 555f0f13 addi x30 x30 1365 40: 001ede13 srli x28 x29 1 44: 01ee7e33 and x28 x28 x30 48: 41ce8eb3 sub x29 x29 x28 4c: 33333f37 lui x30 0x33333 50: 333f0f13 addi x30 x30 819 54: 002ede13 srli x28 x29 2 58: 01ee7e33 and x28 x28 x30 5c: 01eeff33 and x30 x29 x30 60: 01ee0eb3 add x29 x28 x30 64: 0f0f1f37 lui x30 0xf0f1 68: f0ff0f13 addi x30 x30 -241 6c: 004ede13 srli x28 x29 4 70: 01de0eb3 add x29 x28 x29 74: 01eefeb3 and x29 x29 x30 78: 008ede13 srli x28 x29 8 7c: 01de0eb3 add x29 x28 x29 80: 010ede13 srli x28 x29 16 84: 01de0eb3 add x29 x28 x29 88: 03f00f13 addi x30 x0 63 8c: 01eefeb3 and x29 x29 x30 90: 01d00533 add x10 x0 x29 00000094 <out_my_clz>: 94: 00a003b3 add x7 x0 x10 98: fff50e13 addi x28 x10 -1 9c: 00050f13 addi x30 x10 0 000000a0 <output>: a0: fff38393 addi x7 x7 -1 a4: 00704463 blt x0 x7 8 <output_zero> a8: 0180006f jal x0 24 <output_binary> 000000ac <output_zero>: ac: 10000517 auipc x10 0x10000 b0: f5850513 addi x10 x10 -168 b4: 00400893 addi x17 x0 4 b8: 00000073 ecall bc: fe5ff06f jal x0 -28 <output> 000000c0 <output_binary>: c0: 01c2deb3 srl x29 x5 x28 c4: 001ef313 andi x6 x29 1 c8: fffe0e13 addi x28 x28 -1 cc: ffff0f13 addi x30 x30 -1 d0: 000f4e63 blt x30 x0 28 <exit> d4: fc030ce3 beq x6 x0 -40 <output_zero> 000000d8 <output_one>: d8: 10000517 auipc x10 0x10000 dc: f2e50513 addi x10 x10 -210 e0: 00400893 addi x17 x0 4 e4: 00000073 ecall e8: fde04ce3 blt x0 x30 -40 <output_binary> 000000ec <exit>: ec: 00000033 add x0 x0 x0 ``` ## 5-stages Pipeline Processor ![image](https://hackmd.io/_uploads/S1W9C4YJyg.png) The five stages for the 5-stages pipeline processor are: | Stage Name | Function | | :--------: | :--------: | | IF | Instruction Fetch | | ID | Instruction Decode & Register Read | | EX | Execution or Address Calculation | | MEM | Data Memory Access | | WB | Write Back | Each stage is separated by the pipeline register. Instruction in different type of format will go through 5 stages with different signal turned on. Take R-type as example. #### R-type format Take `add x29 x5 x0` as example >`add rd, rs1, rs2` >add instruction is used to calculate the result of the two registers (rs1 + rs2) and put the result into destination register (rd) Let's see how it go through each stage. **1. Instruction Fetch (IF)** ![image](https://hackmd.io/_uploads/SylRDDt1ke.png) - the machine code of this instruction is `00028eb3`, so `instr` is equal to `00028eb3`. - Its PC address is `0x0000000c`, so the next PC address is `0x00000010` . - PC will increment by 4 automatically using the above adder. - Because there is no branch occur, next instruction will be at PC + 4, so the multiplexer before PC choose input come from adder. **2. Instruction Decode and register fetch (ID)** ![image](https://hackmd.io/_uploads/rJST_PKyJx.png) - instruction is decoded to four part: - `opcode` = `add` - `Wr idx` = `0x1d` - `R1 idx` = `0x05` - `R2 idx` = `0x00` - Because the previous instruction of this one is `lw`, and we reuse the register `t0 (x5)`, there is a **Read After Write hazard**. Therefore, the processor will insert a `nop` between `lw` and `add`. - ![image](https://hackmd.io/_uploads/ryn92Pt1Jx.png) - This way, `R1` register will not get the correct value at this stage. - Only `R2` register loads the correct value from register memory which is `0x00000000`. - The current PC value `0x0000000c` and next PC value `0x00000010` are just send through this stage, we don't use them. **3. Execute (EX)** ![image](https://hackmd.io/_uploads/BkTaaDYJ1x.png) - First level multiplexers choose the value come from `Forwarding path` and `R2`. - After inserting a `nop`, at this stage `R1` register can get the correct value `0x00000116` which is forwarded from WB stage. - Second level multiplexers still choose the value from rigisters and the value is sent to ALU, instead of the next PC address and the immediate value. - ALU gets `op1` = `0x00000116` and `op2` = `0x00000000`, and add them together. - The result `0x00000116` is output to `Res`. - The result `0x00000116` and `Wr idx (0x1d)` are just send through this stage, we don't use them. **4. Memory access (MEM)** ![image](https://hackmd.io/_uploads/By5iH_tkye.png) - Res from ALU is send to 3 ways: - Pass through this stage and go to WB stage - Send back to EXE stage for next instruction to use - Use as data memory address. - `R2` is sent to Data in, but memory is not able to writing (`Wr en` is unset). - The ALU result `0x00000116` and `Wr idx (0x01d)` are just sent through this stage, we don't use them. **5. Register write back (WB)** ![image](https://hackmd.io/_uploads/B1hFM_t1Jx.png) - The multiplexer choose Res from ALU as final output. So the output value is `0x00000116`. - The output value and `Wr idx (0x1d)` are sent back to registers block. Finally, the value `0x00000116` will be write into `x5` register, whose ABI name is `t0` and register address is `0x1d`. After finishing 5-stage, the value is written into the destination register, as the following figure. ![image](https://hackmd.io/_uploads/ryczIOtJJe.png) ## Reference [Quiz1 of Computer Architecture(2024)](https://hackmd.io/@sysprog/arch2024-quiz1-sol) [Find first set - CLZ](https://en.wikipedia.org/wiki/Find_first_set#CLZ) [Count leading zeroes in an Int32](https://stackoverflow.com/questions/10439242/count-leading-zeroes-in-an-int32) [The Aggregate Magic Algorithms - Population Count](http://aggregate.org/MAGIC/#Population%20Count%20(Ones%20Count)) [Quiz1 of Computer Architecture(2023)](https://hackmd.io/@sysprog/arch2023-quiz1-sol) [Elias Gamma Encoding](https://en.wikipedia.org/wiki/Elias_gamma_coding)