Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < `ChengRay` > ## Problem C in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) In problem C, there are three functions: `fabsf`, `my_clz`, and `uint32_t fp16_to_fp32`. ### fabsf >##### IEEE754 >The format of IEEE 754 single precision(32 bits) is as follows: ![IEEE_754](https://hackmd.io/_uploads/rynXb6Xy1x.png) >* `Sign`: 1 bit, it indicates positive or negative. >* `Exponent`: 8 bits, it is represented in a biased format to accommodate both positive and negative exponents. The actual exponent is calculated by subtracting the bias(127). >* `fraction`: 23 bits, the fraction part represents the significant digits of the number, with an implicit leading bit of 1 for normalized number. The purpose is to compute the **absolute value** of a float number. `uint32_t i` & `0x7FFFFFFF` sets the leftmost bit to 0 and retains the original values of the remaining bits. ##### C code ```c #include<stdio.h> #include<stdint.h> static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } int main(){ float test1 = -5.125; float test2 = 1024.25; float test3 = 0; printf("test1: %f" , fabsf(test1)); printf("\ntest2: %f" , fabsf(test2)); printf("\ntest3: %f" , fabsf(test3)); return 0; } ``` ##### RISC-V ```c .data test1: .word 0xC0A40000 # -5.125 in IEEE 754 test2: .word 0x44800800 # 1024.25 in IEEE 754 test3: .word 0x00000000 # 0 in IEEE 754 str1: .string "test1: " str2: .string "\ntest2: " str3: .string "\ntest3: " .text main: #test1 la a0, str1 # Load the address of the str1 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test1 # Load address of test1 lw a0, 0(t0) # Load the 32-bit float into a0 jal ra, fabsf # Call fabsf li a7, 2 # Print the answer ecall #test2 la a0, str2 # Load the address of the str2 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test2 # Load address of test2 lw a0, 0(t0) # Load the 32-bit float into a0 jal ra, fabsf # Call fabsf li a7, 2 # Print the answer ecall #test3 la a0, str3 # Load the address of the str3 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test3 # Load address of test3 lw a0, 0(t0) # Load the 32-bit float into a0 jal ra, fabsf # Call fabsf li a7, 2 # Print the answer ecall #Exit li a0, 0 # Exit code 0 li a7, 93 # Syscall for exit ecall # Make the syscall fabsf: li t0, 0X7FFFFFFF and t1, a0, t0 # Clear the sign bit mv a0, t1 jr x1 ``` ###### Output console ![](https://i.imgur.com/MNSGavT.png) :::danger Do not use screenshots for plain text content, as this is inaccessible to visually impaired users. ::: :::success ChengRay: The linking path has been changed. ::: ### my_clz The function is counts **the number of leading zero bits** in the binary representation of an unsigned **32-bit** integer, starting from the most significant bit. >[!Note] The special case: when the input is **`0`**, the output is **`undefined`**. #### C code_version1 ```c #include <stdio.h> #include <stdint.h> static inline int my_clz(uint32_t x){ if(x==0){ printf("Undefined"); return -1; } int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int main(){ int test1 = 16; int test2 = 33; int test3 = 0; int result=0; printf("test1: "); result = my_clz(test1); if(result>0)printf("%d" , result); printf("\ntest2: "); result = my_clz(test2); if(result>0)printf("%d" , result); printf("\ntest3: "); result = my_clz(test3); if(result>0)printf("%d\n" , result); return 0; } ``` #### RISC-V version1 ```c .data test1: .word 16 test2: .word 33 test3: .word 0 str1: .string "test1: " str2: .string "\ntest2: " str3: .string "\ntest3: " str4: .string "undefined" .text main: # test1 la a0, str1 # Load the address of the str1 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test1 # Load address of test1 lw a0, 0(t0) # Load the test1 into a0 jal ra, my_clz # Call my_clz # test2 la a0, str2 # Load the address of the str2 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test2 # Load address of test2 lw a0, 0(t0) # Load the test2 into a0 jal ra, my_clz # Call my_clz # test3 la a0, str3 # Load the address of the str3 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test3 # Load address of test3 lw a0, 0(t0) # Load the test3 into a0 jal ra, my_clz # Call my_clz # Exit li a0, 0 # Exit code 0 li a7, 93 # Syscall for exit ecall # Make the syscall my_clz: beqz a0, return0 # if a0 == 0, jump. li t0, 1 # load 1 into t0 li t1, 31 # load 31 into t1 cal: sll t2, t0, t1 # shift bgeu a0, t2, return # input > t2, get numbers of shift. addi t1, t1, -1 # shift number - 1 j cal # loop return: xori t0, t1, 0x1F # 1' complement mv a0, t0 # print value li a7, 1 # Print the answer ecall jr x1 return0: li a0, -1 # return -1 la a0, str4 # Load the address of the str4 li a7, 4 # System call code for printing a string ecall jr x1 ``` ###### output result ![](https://i.imgur.com/AhsClTm.png) ![](https://i.imgur.com/HEtnLcv.png) I have an idea to modify the code of problem A in [2023_Quiz1](https://hackmd.io/@sysprog/arch2023-quiz1-sol). ProblemA in 2023_Quiz1 is the function that counts leading zero for unsigned **64-bit** integers. Thus, I can modify the code into the fuction that calculates for 32-bit integer. The `my_clz` function has a loop, but the modified code does not contain any loops. This may reduce the number of `nop` instructions due to the branch instruction. *Let’s see which version performs the best.* #### C code_version2 ```c #include <stdio.h> #include <stdint.h> int16_t my_clz(uint32_t x){ if(x==0){ printf("test is 0, Undefined\n"); return -1; } x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x7f)); } int main(){ int test1 = 16; int test2 = 33; int test3 = 0; int result = 0; result = my_clz(test1); if(result>0)printf("test1: %d\n" , result); result = my_clz(test2); if(result>0)printf("test2: %d\n" , result); result = my_clz(test3); if(result>0)printf("test3: %d\n" , result); return 0; } ``` #### RISC-V version2 ```c .data test1: .word 16 test2: .word 33 test3: .word 0 str1: .string "test1: " str2: .string "\ntest2: " str3: .string "\ntest3: " str4: .string "undefined" .text main: # test1 la a0, str1 # Load the address of the str1 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test1 # Load address of test1 lw a0, 0(t0) # Load the test1 into a0 jal ra, my_clz2 # Call my_clz # test2 la a0, str2 # Load the address of the str2 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test2 # Load address of test2 lw a0, 0(t0) # Load the test2 into a0 jal ra, my_clz2 # Call my_clz # test3 la a0, str3 # Load the address of the str3 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test3 # Load address of test3 lw a0, 0(t0) # Load the test3 into a0 jal ra, my_clz2 # Call my_clz # Exit li a0, 0 # Exit code 0 li a7, 93 # Syscall for exit ecall # Make the syscall my_clz2: beqz a0, return0 # if a0 == 0, jump. mv t0, a0 srli t1, t0, 1 # x |= (x >> 1); or t0, t0, t1 srli t1, t0, 2 # x |= (x >> 2); or t0, t0, t1 srli t1, t0, 4 # x |= (x >> 4); or t0, t0, t1 srli t1, t0, 8 # x |= (x >> 8); or t0, t0, t1 srli t1, t0, 16 # x |= (x >> 16); or t0, t0, t1 li t2, 0x55555555 srli t1, t0, 1 # x -= ((x >> 1) & 0x55555555); li t3, 0xFFFFFFFF and t1, t1, t2 # t2: 0x55555555 addi t0, t0, 1 # t0+xor(t1)+1 -> I calculate t0+1 first. xor t1, t1, t3 # negative t1, 2' complement.(xor 0xFFFFFFFF) add t0, t0, t1 li t3, 0x33333333 srli t1, t0, 2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333); and t2, t0, t3 # t3: 0x33333333 and t1, t1, t3 add t0, t1, t2 li t3, 0x0F0F0F0F srli t1, t0, 4 # x = ((x >> 4) + x) & 0x0f0f0f0f; add t0, t1, t0 and t0, t0, t3 # t3: 0x0F0F0F0F srli t1, t0, 8 # x += (x >> 8); add t0, t0, t1 srli t1, t0, 16 # x += (x >> 16); add t0, t0, t1 andi t0, t0, 0x7F #　return (32 - (x & 0x7f)); xori t0, t0, 0x1F addi a0, t0, 1 li a7, 1 ecall jr x1 return0: la a0, str4 # Load the address of the str4 li a7, 4 # System call code for printing a string ecall li a0, -1 # return -1 jr x1 ``` ###### output result ![](https://i.imgur.com/NtsPlPV.png) ![](https://i.imgur.com/9rI09LE.png) #### Performance According to the performance of both. Version 1 has branch instruction, which results in a worse cycle count compared to Version 2. Based on other data, version 2's performance is evidently superior. | | version1(loop) | version2(without loop) | |:--------------- | --------------:| ----------------------:| | Cycles: | 408 | 148 | | Instrs.retired: | 266 | 116 | | CPI: | 1.53 | 1.28 | | IPC: | 0.652 | 0.784 | | Clock rate: | 152.75Hz | 143.83Hz | ### uint32_t fp16_to_fp32 ##### C code version1 ```c #include <stdio.h> #include <stdint.h> int16_t my_clz(uint32_t x){ x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x7f)); } static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } int main(){ int test1 = 0x4BE0;//15.75: 0100 1011 1110 0000 int test2 = 0xC540;//-5.25: 1100 0101 0100 0000 int test3 = 0x0; test1 = fp16_to_fp32(test1); float *result1 = (float*)&test1; test2 = fp16_to_fp32(test2); float *result2 = (float*)&test2; test3 = fp16_to_fp32(test3); float *result3 = (float*)&test3; printf("1: %.2f\n" , *result1);//Output: 15.75 printf("2: %.2f\n" , *result2);//Output: -5.25 printf("3: %.2f\n" , *result3);//Output: 0.00 return 0; } ``` `zero_mask = (int32_t)(nonsign - 1) >> 31` involves a lot of operations and will result in at least four instructions. If I check whether the input is zero at the very beginning, it will generate at most three cycles, which is clearly better than the original, and I can also exit the function earlier. ##### C code version2 ```c #include <stdio.h> #include <stdint.h> int16_t my_clz(uint32_t x){ x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x7f)); } static inline uint32_t fp16_to_fp32(uint16_t h) { if(h==0)return 0; const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); //remove const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask)); } int main(){ int test1 = 0x4BE0;//15.75: 0100 1011 1110 0000 int test2 = 0xC540;//-5.25: 1100 0101 0100 0000 int test3 = 0x0; test1 = fp16_to_fp32(test1); float *result1 = (float*)&test1; test2 = fp16_to_fp32(test2); float *result2 = (float*)&test2; test3 = fp16_to_fp32(test3); float *result3 = (float*)&test3; printf("1: %.2f\n" , *result1);//Output: 15.75 printf("2: %.2f\n" , *result2);//Output: -5.25 printf("3: %.2f\n" , *result3);//Output: 0.00 return 0; } ``` ##### RISC-V ```c .data test1: .word 0x4BE0 # 15.75: 0100 1011 1110 0000 test2: .word 0xc540 # -5.25: 1100 0101 0100 0000 test3: .word 0 # 0 str1: .string "test1: " str2: .string "\ntest2: " str3: .string "\ntest3: " .text main: # test1 la a0, str1 # Load the address of the str1 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test1 # Load address of test1 lw a0, 0(t0) # Load the test1 into a0 jal ra, fp16to32 # Call fp16to32 # test2 la a0, str2 # Load the address of the str2 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test2 # Load address of test2 lw a0, 0(t0) # Load the test2 into a0 jal ra, fp16to32 # Call fp16to32 # test3 la a0, str3 # Load the address of the str3 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test3 # Load address of test3 lw a0, 0(t0) # Load the test3 into a0 jal ra, fp16to32 # Call fp16to32 # Exit li a0, 0 # Exit code 0 li a7, 93 # Syscall for exit ecall # Make the syscall fp16to32: beqz a0, return0 # if a0 == 0, jump. sw ra, -4(sp) # a0 is unuseful after `slli t0, t1, 16` # Thus, I didn't store to stack. slli t0, a0, 16 # w = (uint32_t) h << 16; # t0: w li s0, 0x80000000 # sign = w & UINT32_C(0x80000000); and s1, t0, s0 # s1: sign li t1, 0x7FFFFFFF # nonsign = w & UINT32_C(0x7FFFFFFF); and a0, t0, t1 # a0: nonsign addi t4, a0, 0 # t4: nonsign jal ra, my_clz # execute my_clz li t2, 5 addi t0, a0, 0 # renorm_shift = my_clz(nonsign); # t0: renorm_shift blt t0, t2, ZERO # renorm_shift = renorm_shift > 5? renorm_shift - 5 : 0; addi t0, t0, -5 # t0: renorm_shift continue: li t3, 0x04000000 # inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & li t2, 0x7F800000 # INT32_C(0x7F800000); add t1, t4, t3 srli t1, t1, 8 and t1, t1, t2 # t1: inf_nan_mask sll t4, t4, t0 srli t4, t4, 3 # (nonsign << renorm_shift >> 3) addi t0, t0, 0b110010000 # Original: (0x70 - renorm_shift) << 23) xori t0, t0, 0b111111111 # First, calculate renorm_shift - 0x70, then take the two's complement. addi t0, t0, 1 # 2'complement of 0x70 is 110010000 in 9 bit. slli t0, t0, 23 add t0, t0, t4 # add: Both of above or t0, t0, t1 # | inf_nan_mask or a0, t0, s1 li a7, 2 ecall lw ra, -4(sp) # load return address jr x1 ZERO: li t0, 0 # t0: renorm_shift = 0; j continue my_clz: mv t0, a0 srli t1, t0, 1 # x |= (x >> 1); or t0, t0, t1 srli t1, t0, 2 # x |= (x >> 2); or t0, t0, t1 srli t1, t0, 4 # x |= (x >> 4); or t0, t0, t1 srli t1, t0, 8 # x |= (x >> 8); or t0, t0, t1 srli t1, t0, 16 # x |= (x >> 16); or t0, t0, t1 li t2, 0x55555555 srli t1, t0, 1 # x -= ((x >> 1) & 0x55555555); li t3, 0xFFFFFFFF and t1, t1, t2 # t2: 0x55555555 addi t0, t0, 1 # t0+xor(t1)+1 -> I calculate t0+1 first. xor t1, t1, t3 # negative t1, 2' complement.(xor 0xFFFFFFFF) add t0, t0, t1 li t3, 0x33333333 srli t1, t0, 2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333); and t2, t0, t3 # t3: 0x33333333 and t1, t1, t3 add t0, t1, t2 li t3, 0x0F0F0F0F srli t1, t0, 4 # x = ((x >> 4) + x) & 0x0f0f0f0f; add t0, t1, t0 and t0, t0, t3 # t3: 0x0F0F0F0F srli t1, t0, 8 # x += (x >> 8); add t0, t0, t1 srli t1, t0, 16 # x += (x >> 16); add t0, t0, t1 andi t0, t0, 0x7F # return (32 - (x & 0x7f)); xori t0, t0, 0x1F addi a0, t0, 1 jr x1 return0: li a7, 2 ecall jr x1 ``` ##### partial code ```c addi t0, t0, 0b110010000 # Original: (0x70 - renorm_shift) << 23) xori t0, t0, 0b111111111 # First, calculate renorm_shift - 0x70, then take the two's complement. addi t0, t0, 1 # 2'complement of 0x70 is 110010000 in 9 bit. slli t0, t0, 23 ``` In the above code details, the result will be shifted by 23 bits in the end, so the previous operations only need to focus on the 32-23=`9 bits`. Therefore, I use a 9-bit value with I-type instructions instead of exceeding 12 bits, which would require loading into a register before execution. This approach improves performance. ###### output result ![](https://i.imgur.com/q2Xr5CQ.png) ![](https://i.imgur.com/osA4gjT.png) ## Minimum One Bit Operations to Make Integers Zero([Leetcode 1611](https://leetcode.com/problems/minimum-one-bit-operations-to-make-integers-zero/description/)) ### Description >Given an integer n, you must transform it into 0 using the following operations any number of times: >###### 1. Change the rightmost (0th) bit in the binary representation of n. >###### 2. Change the ith bit in the binary representation of n if the (i-1)th bit is set to 1 and the (i-2)th through 0th bits are set to 0. >Return the minimum number of operations to transform n into 0. ### Example >* `example1` Input: n = 3 Output: 2 Explanation: The binary representation of 3 is "11". "11" -> "01" with the 2nd operation since the 0th bit is 1. "01" -> "00" with the 1st operation. >* `example2` Input: n = 6 Output: 4 Explanation: The binary representation of 6 is "110". "110" -> "010" with the 2nd operation since the 1st bit is 1 and 0th through 0th bits are 0. "010" -> "011" with the 1st operation. "011" -> "001" with the 2nd operation since the 0th bit is 1. "001" -> "000" with the 1st operation. ### Solution #### Idea for problem solving First, I observed a pattern as described below: | Input | 0 | 1 | 2 | 4 | 8 | | ---------- | ---- | ---- | ---- | ---- | ---- | | binary |`0000`|`0001`|`0010`|`0100`|`1000`| | - | - | - | - | - | - | | round1 | |`0000`| 0011 | 0101 | 1001 | | round2 | | |`0001`| 0111 | 1011 | | round3 | | |`0000`| 0110 | 1010 | | round4 | | | |`0010`| 1110 | | round5 | | | | 0011 | 1111 | | round6 | | | |`0001`| 1101 | | round7 | | | |`0000`| 1100 | | round8 | | | | |`0100`| | round9 | | | | | 0101 | | round10 | | | | | 0111 | | round11 | | | | | 0110 | | round12 | | | | |`0010`| | round13 | | | | | 0011 | | round14 | | | | |`0001`| | round15 | | | | |`0000`| | $2^n$ -> $2^{n-1}$ | move | $2^n$ -> 0 | move | | ------------------ | ------- | ---------- |:-------- | | 1-> 0 | 1 move | 1-> 0 | 1 move | | 10-> 01 | 2 moves | 10-> 00 | 3 moves | | 100-> 010 | 4 moves | 100-> 000 | 7 moves | | 1000->0100 | 8 moves | 1000->0000 | 15 moves | :::danger Use LaTeX annotations when possible. e.g., use $2^n$ instead of `2^n`. ::: :::success ChengRay: I get it. ::: Thus, I can observe that when the input is $2^n$, the result is $2^{n+1}$ - 1. However, if any bit right of the leftmost bit is 1, then it will take fewer moves. For instance: `1100`: Compared to `1000`, `1100` has already eliminated the leftmost `1`, reducing many steps. Therefore, I need to calculate how many steps it takes to go from `1000` to `1100`. I find that the number of steps from `1000` to `1100` is the same as from `0000` to `0100`. I reached a conclusion: subtract when encountering a 1 on the right. But, what if the `1` I subtract has another `1` on its right... does it become a process formed by the `1` I subtracted? In other words, does the process decrease again? A clear example is as follows: ```math Input n = 15, binary is 1111. f(1111) =f(1000)-f(0111) =f(1000)-(f(0100)-f(0011)) =f(1000)-(f(0100)-(f(0010)-f(0001))) =f(1000)-f(0100)+f(0010)-f(0001) ``` Thus, I derived a logic: "As long as the input encounter a bit `1`, the operation is a cycle of one addition(+) and one subtraction(-)." ### C Programming ##### First version ```c int minimumOneBitOperations(int n){ if(n == 0)return 0; int bits = (int)log2(n)+1; int flop = -1; int moves = pow(2, bits)-1; for(int i=bits-2 ; i>=0 ; i--){ if(n>>i & 0b00000001){ moves += flop * (pow(2, i + 1) - 1); flop *= -1; } } return moves; } ``` Considering the conversion to RISC-V, I want to replace the function that use the [math.h](https://www.runoob.com/cprogramming/c-standard-library-math-h.html) library. #### pow() During the modification process, I realized that in `pow(2, bits)`, the base 2 is **fixed**, and since I've learned about ***left shift***, I directly replaced the original `pow(2, bits)` function with `1 << bits`. #### log2() The `log2()` purpose is to get the length of the binary. Write a function `BinaryLen()` as shown below. ##### log() version1 ```c int BinaryLen(int n){ int ans=0; while(n > 0){ ans++; n = n>>1; } return ans; } ``` One loop will generate many `nop` instructions after the branch. The previously mentioned `count_leading_zero` function calculates how many leading zeros there are in a binary number. What I need to compute here is the total length of the binary number. Since the result of this function is obtained by subtracting the calculated value from `32`, I can simply use this result directly, which will give me the length of the binary number. The previous comparison results indicate that the version **without loops** has fewer cycles and provides better performance. ##### log() version2 ```c uint16_t BinaryLen(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (x & 0x7f); //move 32 from my_clz } ``` After optimizing `log()` and `pow()`, the final result of the code is as follows: ```c uint16_t BinaryLen(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (x & 0x7f); } int minimumOneBitOperations(int n){ if(n == 0)return 0; int bits = BinaryLen(n); int flop = -1; int moves = (1<<(bits))-1; for(int i=bits-2 ; i>=0 ; i--){ if(n>>i & 0b00000001){ moves += flop * ((1<<(i + 1)) - 1); flop *= -1; } } return moves; } ``` ##### Full C code ```c #include <stdio.h> #include <stdint.h> uint16_t BinaryLen(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (x & 0x7f); } int minimumOneBitOperations(int n){ if(n == 0)return 0; int bits = BinaryLen(n); int flop = -1; int moves = (1<<(bits))-1; for(int i=bits-2 ; i>=0 ; i--){ if(n>>i & 0b00000001){ moves += flop * ((1<<(i + 1)) - 1); flop *= -1; } } return moves; } int main(){ int test1 = 0; int test2 = 64; int test3 = 15; printf("test1 moves %d\n" , minimumOneBitOperations(test1)); //output 0 printf("test2 moves %d\n" , minimumOneBitOperations(test2)); //output 127 printf("test3 moves %d" , minimumOneBitOperations(test3)); //output 10 return 0; } ``` ##### RISC-V ```c .data test1: .word 16 test2: .word 33 test3: .word 0 str1: .string "test1: " str2: .string "\ntest2: " str3: .string "\ntest3: " .text main: # test1 la a0, str1 # Load the address of the str1 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test1 # Load address of test1 lw a0, 0(t0) # Load the test1 into a0 jal ra, MOBO # Call MOBO # test2 la a0, str2 # Load the address of the str2 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test2 # Load address of test2 lw a0, 0(t0) # Load the test2 into a0 jal ra, MOBO # Call MOBO # test3 la a0, str3 # Load the address of the str3 li a7, 4 # System call code for printing a string ecall # Print the string la t0, test3 # Load address of test3 lw a0, 0(t0) # Load the test3 into a0 jal ra, MOBO # Call MOBO # Exit li a0, 0 # Exit code 0 li a7, 93 # Syscall for exit ecall # Make the syscall MOBO: beqz a0, return0 # if a0 == 0, jump. addi sp, sp, -8 # MOBO can call BinaryLen sw ra, 0(sp) sw a0, 4(sp) jal ra, BinaryLen # a0 is already set, so directly use jal. addi t5, a0, 0 # set return value to t5. #bits = BinaryLen(n); li t4, 1 lw a0, 4(sp) # Restore the original parameters. li t1, 1 # int flop = -1; # Here, I set 1 to represent a negative number. # set 0 to represent a positive number. sll t2, t4, t5 # int moves = (1<<(bits))-1; addi t2, t2, -1 addi t3, t5, -1 # i=bits-1 li t5, 0xFFFFFFFF loop: addi t3, t3, -1 bltz t3, return srl t0, a0, t3 # store n >> i in t0 andi t6, t0, 1 beqz t6, loop addi t0, t3, 1 # (i + 1) sll t0, t4, t0 # (1<<(i + 1)) addi t0, t0, -1 # (1<<(i + 1)) - 1 beqz t1, addmove # if flop is Positive, jump addmove xor t0, t0, t5 # negative number addi t0, t0, 1 addmove: add t2, t2, t0 # moves + flop * ((1<<(i + 1)) - 1); xori t1, t1, 1 j loop return: lw ra, 0(sp) addi sp, sp, 8 addi a0, t2, 0 # return moves; li a7, 1 ecall jr x1 return0: li a0, 0 # input = 0, return 0 li a7, 1 ecall jr x1 BinaryLen: mv t0, a0 srli t1, t0, 1 # x |= (x >> 1); or t0, t0, t1 srli t1, t0, 2 # x |= (x >> 2); or t0, t0, t1 srli t1, t0, 4 # x |= (x >> 4); or t0, t0, t1 srli t1, t0, 8 # x |= (x >> 8); or t0, t0, t1 srli t1, t0, 16 # x |= (x >> 16); or t0, t0, t1 li t2, 0x55555555 srli t1, t0, 1 # x -= ((x >> 1) & 0x55555555); li t3, 0xFFFFFFFF and t1, t1, t2 # t2: 0x55555555 addi t0, t0, 1 # t0+xor(t1)+1 -> I calculate t0+1 first. xor t1, t1, t3 # negative t1, 2' complement.(xor 0xFFFFFFFF) add t0, t0, t1 li t3, 0x33333333 srli t1, t0, 2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333); and t2, t0, t3 # t3: 0x33333333 and t1, t1, t3 add t0, t1, t2 li t3, 0x0F0F0F0F srli t1, t0, 4 # x = ((x >> 4) + x) & 0x0f0f0f0f; add t0, t1, t0 and t0, t0, t3 # t3: 0x0F0F0F0F srli t1, t0, 8 # x += (x >> 8); add t0, t0, t1 srli t1, t0, 16 # x += (x >> 16); add t0, t0, t1 andi a0, t0, 0x7F #　return (32 - (x & 0x7f)); jr x1 ``` ###### output result ![](https://i.imgur.com/tG9UOas.png) ![](https://i.imgur.com/Ql6qrup.png) ### Analysis #### (1) Five stage > Description: The 5-stage pipeline is a common architectural design used in modern processors, including RISC-V, to **improve instruction throughput and overall performance.** The concept of pipelining allows multiple instructions to be in different stages of execution at the same time, thereby maximizing the utilization of the CPU's resources. Each stage of the pipeline performs a specific function, and together they work in a streamlined manner. ##### Instruction Fetch (IF) In this stage, the processor fetches the instruction from memory based on the current Program Counter (PC). The PC is then updated to point to the next instruction. This stage is crucial for retrieving the instructions that will be executed in subsequent stages. * Show `lw a0, 0(t0)` for example, the instruction is converted to `lw x10 0 x5`. PC is `0x00000018`. Next PC is `0x0000001C`. Fetch the instruction from instruction memory using the `PC:0x00000018`. `0x0002a503` is represents the machine code. ![](https://i.imgur.com/uPaJ9zE.png) ##### Instruction Decode (ID) During the ID stage, the fetched instruction is decoded to determine its operation and the operands required. The control signals are generated, and the source registers are read. This stage ensures that the necessary data is available for execution in the next stage. * Show `mv t0, a0` for example, it's coverted to `addi x5 x10 0`. `Reg 1` is `0x00000010`；`Reg 2` is `0x00000000`, values are delivered to the next stage. ![](https://i.imgur.com/2ttPyvw.png) ##### Execute (EX) In the EX stage, the actual operation specified by the instruction is performed. This can involve arithmetic or logical operations, address calculations for memory access, or branch target calculations. The results of this operation are prepared to be written back to the registers or memory. * To continue the result of the previous instruction. The **selector A** has three input, they are respectively from the `Reg1` of `ID` stage and from the forwarding technique, which comes from EX/MEM and MEM/WB. The **selector B** is the same as selector A. The **selector C** has two input, they are respectively from the output of selector A and from the `PC`, which is needed to calculate the next instruction. The **selector D** has two input, they are respectively from the output of selector B and immediate value of `ID` stage. ![](https://i.imgur.com/LEymLcc.png) ##### Memory Access (MEM) If the instruction involves memory operations (such as load or store), this stage accesses the memory. For load instructions, data is retrieved from memory and prepared for writing back to the register file. For store instructions, data from the register is written to the specified memory location. * Data memory have two input, address and data value. This is for use with the `lw` and `sw` instructions. The `lw` instruction requires an address to read the value(read out). The `sw` instruction requires a value and an address to store the value at that address. ![](https://i.imgur.com/yTHAQ7B.png) ##### Write Back (WB) The final stage is where the results of the executed instruction are written back to the register file. This ensures that the updated values are available for subsequent instructions. The WB stage completes the instruction cycle, allowing the processor to continue with the next instruction. * In WB stage, there have one selector with three input. they are respectively PC+4, output of ALU and readout of data memory.The light in the middle is on, indicating that the value is from output of ALU. ![](https://i.imgur.com/Cekq157.png) --- #### (2) Challenge ##### Data Hazards Data hazards occur when an instruction depends on the result of a previous instruction. This can cause pipeline stalls because subsequent instructions must wait for the required data to become available. Data hazards can be classified into three types: 1. **RAW** (Read After Write): Occurs when an instruction needs to read data that has not yet been written by a previous instruction. * In my code, there are lots of RAW (Read After Write) hazards. Ripes provides forwarding technology to prevent stalls from occurring. 2. **WAR** (Write After Read): Occurs when an instruction's write operation happens after another instruction's read operation, potentially causing errors. * This hazard does not occur in the pipeline but happens in scenarios with `early write and late read` or `out-of-order execution`. Therefore, it won't occur in the 5-stage pipeline. 3. **WAW** (Write After Write): Occurs when two instructions attempt to write to the same register. * Solution: renaming register. ##### Control Hazards Control hazards arise from branch instructions or jumps. When there is a conditional branch instruction in the pipeline, it may be unclear which instruction should be executed next, leading to incorrect instruction fetching. This may necessitate stalls or branch prediction to **minimize** the impact. * The following is the conditional branch that appears in the code. * **`beqz a0, return0`**: The input test data contains at most one *zero*, and the rest are *non-zero*. Therefore, using `beqz a0, return0` will cause at most one jump, which only happens when the input is zero. * **`bltz t3, return`**: Most of the time, when executing this instruction, it will not jump; it will only jump in the last iteration of the loop. * **`beqz t6, loop`**: This conditional branch cannot be determined; whether to jump is based on the input, so it's impossible to predict the probability of the branch being taken. * **`beqz t1, addmove`**: Since the values added by ***move*** will be one positive and one negative, the probability of this conditional branch jumping is 50%. ##### Structural Hazards Structural hazards occur when hardware resources are insufficient to support multiple operations simultaneously. For example, if the processor cannot access both instruction memory and data memory at the same time, structural hazards may arise, requiring certain instructions to be queued or stalled. * This involves whether the memory is separated or unified. Based on executing one instruction at a time, I did not observe any stalls caused by structural hazards. ##### Branch Prediction Misses If the branch prediction mechanism fails to accurately predict the outcome of branch instructions, it can lead to flushing the pipeline and reloading, which adds additional delays and performance loss. * Using static prediction(always not taken). I designed code where branches have a lower probability of being taken. * If I use dynamic branch prediction, the number of `nop` instructions will be about the same, so I will choose static because dynamic prediction requires additional hardware support. ##### Ecall `ecall` is a RISC-V instruction used to trigger a system call. Place the call number in register `a7`, then execute `ecall`, transferring control to the operating system, which executes the requested service and then returns to the original program. * In order for the instruction to operate correctly, the two preceding instructions must complete the `MEM` and `WB` stages before the ecall executes in the `EX` stage, which results in a delay of two cycles compared to normal operation. * It's inevitable that there are two stalls before `ecall` execution. ## Reference IEEE 754 graph: https://en.wikipedia.org/wiki/IEEE_754 arch2024_Quiz1: https://hackmd.io/@sysprog/arch2024-quiz1-sol arch2023_Quiz1: https://hackmd.io/@sysprog/arch2023-quiz1-sol Leetcode1611: https://leetcode.com/problems/minimum-one-bit-operations-to-make-integers-zero/description/ Libary math.h: https://www.runoob.com/cprogramming/c-standard-library-math-h.html