Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < `王彥珽` > :::danger Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data. ::: ## Quiz 1 Problem C **Problem C** from **Quiz1** has been selected for this assignment to solve a problem from Leetcode. Below are the corresponding code and assembly code with functions in Proble C. I will write c code and assembly code for each function and vertify them. ### C code and test data for `fabsf` ```c static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` ### assembly code for `fabsf` ```c .data test0: .word 0xD8800000 test1: .word 0xCB000000 test2: .word 0x72000000 mask: .word 0x7fffffff str0: .string "Yes\n" str1: .string "No\n" .text main: lw t0, test0 #Load test0 into t0 lw s0, mask #Load mask into s0 jal ra, fabsf lw t0, test1 #Load test1 into t0 jal ra, fabsf lw t0, test2 #Load test2 into t0 jal ra, fabsf j exit fabsf: and t2,t0,s0 #i &= 0x7FFFFFFF; beq t2,t0, print_no #if(b == -a) la a0 str0 #print Yes li a7, 4 ecall jr ra #back to next testdata print_no: la a0, str1 #print No li a7, 4 ecall jr ra #back to next testdata exit: li a7, 10 #exit syscall ecall ``` The output of assembly code will be the table below: | input to `fabsf` | output from `fabsf` | |:--------------:|:-----------------:| | 0xD8800000 | 0x58800000 | | 0xCB000000 | 0xCB000000 | | 0x72000000 | 0x72000000 | ### C code and test data for `fp16_to_fp32` ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` ### assembly code for `fp16_to_fp32` ```c .data test0: .word 0x00003857 test1: .word 0x00005555 test2: .word 0x00008877 mask0: .word 0x80000000 mask1: .word 0x7fffffff mask2: .word 0x04000000 mask3: .word 0x7f800000 str0: .string "\n" .text main: lw s0, test0 #Load test0 into s0 jal ra, fp16_to_fp32 lw s0, test1 #Load test1 into s0 jal ra, fp16_to_fp32 lw s0, test2 #Load test2 into s0 jal ra, fp16_to_fp32 j exit fp16_to_fp32: slli s1, s0, 16 # s1 = w, (uint32_t) h << 16; lw t0, mask0 and s2, s1, t0 # s2 = sign, w & UINT32_C(0x80000000); lw t0, mask1 and s3, s1, t0 # s3 = nonsign, w & UINT32_C(0x7FFFFFFF); mv s0, s3 # s0 = nonsign, x = nonsign mv s11, ra jal ra, my_clz # my_clz(x) mv ra, s11 mv s1, a0 # s1 = renorm, renorm_shift = my_clz and -5 lw t0, mask2 add s0, s3, t0 lw t0, mask3 srli s0, s0, 8 and s0, s0 t0 #s0 = inf_nan_mask andi s4, s3, -1 srli s4, s4, 31 #s4 = zero_mask sll s3, s3, s1 srli s3, s3, 3 #s3 = nonsign << renorm_shift >> 3 li t0, 112 #0x70 sub s1, t0, s1 #(0x70 - renorm_shift) slli s1, s1, 23 #s1 = (0x70 - renorm_shift) << 23 add s1, s1, s3 xori s4, s4, -1 #~zero_mask or s1, s1, s0 and s1, s1, s4 or s2, s2, s1 mv a0, s2 mv s11, ra jal ra, print mv ra, s11 jr ra my_clz: li a0, 0 #count = 0 li t0, 31 #i = 31 li s10, 5 for_clz: li t1, 1 sll t1, t1, t0 #1U << i and t2, s0, t1 #x & (1U << i) bne t2, x0, condition_clz #if (x == 1) addi a0, a0, 1 #count += 1 addi t0, t0, -1 #--i bge t0, x0, for_clz #Repeat while i >= 0 condition_clz: blt a0, s10 , false addi a0,a0,-5 jr ra false: sub a0,a0,a0 jr ra print: li a7, 34 ecall la a0, str0 li a7, 4 ecall jr ra exit: li a7, 10 #exit syscall ecall ``` The output of assembly code will be the table below: | input to `fp16_to_fp32` | output from `fp16_to_fp32` | |:-----------------------:|:--------------------------:| | 0x00003857 | 0x3f0ae000 | | 0x00005555 | 0x42aaa000 | | 0x00008877 | 0xb90ee000 | ## Selected LeetCode Problem > LeetCode [762. Prime Number of Set Bits in Binary Representation](https://leetcode.com/problems/prime-number-of-set-bits-in-binary-representation/description/) Given two integers `left` and `right`, return the **count** of numbers in the **inclusive** range `[left, right]` having a **prime number of set bits** in their binary representation. Exanple: Input: `left = 6, right = 10` Output: `4` Explanation: `6 -> 110 (2 set bits, 2 is prime)` `7 -> 111 (3 set bits, 3 is prime)` `8 -> 1000 (1 set bit, 1 is not prime)` `9 -> 1001 (2 set bits, 2 is prime)` `10 -> 1010 (2 set bits, 2 is prime)` `4 numbers have a prime number of set bits.` ## Solution By utilizing the `my_clz` function from Quiz1 Problem 3, we can first determine the number of set bits for each number. Next, we check if this number is a prime. Finally, we sum up the prime numbers and output the result as the final answer. ### C code ```c #include<stdio.h> #include<stdint.h> #include<stdbool.h> static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)){ break; } count++; } return count; } bool is_prime(int a){ if(a < 2){ return false; } for(int i=2 ; i*i<=a ; i++){ if(a%i == 0){ return false; } } return true; } int main() { int left = 10, right = 15, ans = 0, clz_cnt = 0; uint32_t num; for(int i=left ; i <= right ; i++){ int set_bit_cnt = 0; num = (uint32_t)i; while((clz_cnt = my_clz(num)) != 32){ //if my_clz return 32, mean num == 0 num = num << (clz_cnt+1); set_bit_cnt++; //count set bit } if(is_prime(set_bit_cnt)){ ans++; } } printf("The total number of prime number of set bits is: %d ", ans); return 0; } ``` ### Assembly code #### version 1 ```c .data left0: .word 6 right0: .word 10 left1: .word 10 right1: .word 15 left2: .word 20 right2: .word 100 str0: .string "The total number of prime number of set bits is: " str1: .string "\n" .text main: lw s0, left0 # s0 = left0 lw s1, right0 # s1 = right0 li s2, 0 # s2 = ans = 0 li s3, 0 # s3 = set_bit_cnt = 0 li s4, 32 # s4 = 32 jal ra, store_ra lw s0, left1 lw s1, right1 jal ra, store_ra lw s0, left2 lw s1, right2 jal ra store_ra j exit store_ra: mv s11, ra #store ra loop_main: bgt s0, s1, print #left > right , print ans mv t0, s0 # input left to my_clz while: jal ra, my_clz beq a0, s4, is_prime #clz_cnt == 32, check s3 is prime addi s3, s3, 1 #set_bit_cnt++ addi a0, a0, 1 #clz_cnt++ sll t0, t0, a0 #num = num << (clz_cnt+1); j while my_clz: li t1, 0 # count = 0 li t2, 31 # i = 31 clz_for: li t3, 1 sll t3, t3, t2 # 1U << i and t4, t0, t3 # x & (1U << i) bne t4, x0, clz_done # if (x == 0), exit loop addi t1, t1, 1 # count += 1 addi t2, t2, -1 # --i bge t2, x0, clz_for # Repeat while i >= 0 clz_done: mv a0, t1 jr ra is_prime: li t0, 2 # t0 = 2 blt s3, t0, false # set_bit_cnt < 2, not a prime prime_loop: li t1, 0 #count i*i li t3, 0 #counter for 0-i mul: add t1, t1, t0 # t2 = t2 + i addi t3, t3, 1 # t3++ bge t3,t0 , check_prime #t3 = t0, finish i*i j mul check_prime: bgt t1, s3, true # i*i > a, return true mv t1, s3 mod: sub t1, t1, t0 # a = a - i bge t1, t0, mod # a >= i, keep sub beqz t1, false # a == 0, not prime addi t0, t0, 1 # i++ j prime_loop false: li s3, 0 # reset set_bit_cnt addi s0, s0, 1 # left++ j loop_main # another data true: li s3, 0 addi s2, s2, 1 # ans++ addi s0, s0, 1 j loop_main print: la a0, str0 li a7, 4 # print str0 ecall mv a0, s2 li a7, 1 #print ans ecall la a0, str1 li a7, 4 # print \n ecall li s2, 0 # s2 = ans = 0 mv ra, s11 #back to main jr ra exit: li a7, 10 # exit ecall ``` I optimized the `is_prime` function to limit checks to primes less than 32, rather than evaluating each number individually. This change reduces the overall instruction count and decrease the number of branches. :::danger Instead of saying "optimized," prove it! ::: #### version 2 ```c .data left0: .word 6 right0: .word 10 left1: .word 10 right1: .word 15 left2: .word 20 right2: .word 100 prime: .word 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 33 str0: .string "The total number of prime number of set bits is: " str1: .string "\n" .text main: lw s0, left0 # s0 = left0 lw s1, right0 # s1 = right0 li s2, 0 # s2 = ans = 0 li s4, 32 # s4 = 32 jal ra, store_ra lw s0, left1 lw s1, right1 jal ra, store_ra lw s0, left2 lw s1, right2 jal ra store_ra j exit store_ra: mv s11, ra #store ra loop_main: li s3, 0 la a1, prime li a2, 12 bgt s0, s1, print #left > right , print ans mv t0, s0 # input left to my_clz while: jal ra, my_clz beq a0, s4, is_prime #clz_cnt == 32, check s3 is prime addi s3, s3, 1 #set_bit_cnt++ addi a0, a0, 1 #clz_cnt++ sll t0, t0, a0 #num = num << (clz_cnt+1); j while my_clz: li t1, 0 # count = 0 li t2, 31 # i = 31 clz_for: li t3, 1 sll t3, t3, t2 # 1U << i and t4, t0, t3 # x & (1U << i) bne t4, x0, clz_done # if (x == 0), exit loop addi t1, t1, 1 # count += 1 addi t2, t2, -1 # --i bge t2, x0, clz_for # Repeat while i >= 0 clz_done: mv a0, t1 jr ra is_prime: lw a3, 0(a1) addi a1, a1, 4 addi a2, a2, -1 beq s3, a3, true bgt s3, a3, is_prime #bge a2, x0, is_prime addi s0, s0, 1 # left++ j loop_main # another data true: addi s2, s2, 1 # ans++ addi s0, s0, 1 j loop_main print: la a0, str0 li a7, 4 # print str0 ecall mv a0, s2 li a7, 1 #print ans ecall la a0, str1 li a7, 4 # print \n ecall li s2, 0 # s2 = ans = 0 mv ra, s11 #back to main jr ra exit: li a7, 10 # exit ecall ``` Furthermore, I refactored the `my_clz` function to a loop-free version. While this increased the instruction count slightly, it significantly reduced the total cycle count, improving overall performance. version 3 ```c .data left0: .word 6 right0: .word 10 left1: .word 10 right1: .word 15 left2: .word 20 right2: .word 100 prime: .word 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 33 str0: .string "The total number of prime number of set bits is: " str1: .string "\n" .text main: lw s0, left0 # s0 = left0 lw s1, right0 # s1 = right0 li s2, 0 # s2 = ans = 0 li s4, 32 # s4 = 32 jal ra, loop_main lw s0, left1 lw s1, right1 jal ra, loop_main lw s0, left2 lw s1, right2 jal ra loop_main li a7, 10 # exit ecall true: addi s2, s2, 1 # ans++ addi s0, s0, 1 loop_main: li s3, 0 # s3 = set_bit_cnt = 0 la a1, prime li a2, 12 bgt s0, s1, print # left > right , print ans mv t0, s0 # input left to my_clz while: #my_clz srli t1, t0, 1 #x |= (x >> 1); or t2, t0, t1 srli t1, t2, 2 #x |= (x >> 2); or t2, t2, t1 srli t1, t2, 4 #x |= (x >> 4); or t2, t2, t1 srli t1, t2, 8 #x |= (x >> 8); or t2, t2, t1 srli t1, t2, 16 #x |= (x >> 16); or t2, t2, t1 srli t1, t2, 1 #x -= ((x >> 1) & 0x55555555); li t3, 0x55555555 and t1, t1, t3 sub t2, t2, t1 srli t1, t2, 2 #x = ((x >> 2) & 0x33333333) + (x & 0x33333333); li t3, 0x33333333 and t1, t1, t3 and t2, t2, t3 add t2, t2, t1 li t3, 0x0f0f0f0f #x = ((x >> 4) + x) & 0x0f0f0f0f; srli t1, t2, 4 add t2, t1, t2 and t2, t2, t3 srli t1, t2, 8 #x += (x >> 8); add t2, t2, t1 srli t1, t2, 16 #x += (x >> 16); add t2, t2, t1 li t3, 0x0000007f #(32 - (x & 0x7f)) and t2, t2, t3 sub a0, s4, t2 beq a0, s4, is_prime #if clz_cnt == 32, check s3 is prime addi s3, s3, 1 #set_bit_cnt++ addi a0, a0, 1 #clz_cnt++ sll t0, t0, a0 #num = num << (clz_cnt+1); j while is_prime: lw a3, 0(a1) addi a1, a1, 4 addi a2, a2, -1 beq s3, a3, true bgt s3, a3, is_prime addi s0, s0, 1 # left++ j loop_main # another data print: la a0, str0 li a7, 4 # print str0 ecall mv a0, s2 li a7, 1 #print ans ecall la a0, str1 li a7, 4 # print \n ecall li s2, 0 # s2 = ans = 0 ret ``` :::danger Use fewer instructions. ::: ## Result All version outputs: ![output](https://hackmd.io/_uploads/Bkwx9BL11l.png) Execution info among 3 versoins: |version 1|version 2| versoin 3 | |:---------:|:---------:|:---------:| |![with_loop](https://hackmd.io/_uploads/SkZj5BLJye.png)|![v2](https://hackmd.io/_uploads/BJGtl79yJe.png)|![v3](https://hackmd.io/_uploads/H1_tgQ9yJx.png)| From the three images above, it is clear that version 3 executes fewer instructions and contains fewer branches compared to version 1 and version 2, because it uses a loopless `my_clz` function.Though difference between version 1 and 2 is the `is_prime` function ,there is a slight reduction in cycles except from code size. ## Assembly Optimization Analysis ### `is_prime` function Since we're only dealing with 32-bit numbers, I modified the `is_prime` function's logic to directly check for primes less than 32 instead of using a loop. While there is no significant difference with smaller numbers, a noticeable difference emerges as the number of inputs increases. before: ```c is_prime: li t0, 2 # t0 = 2 blt s3, t0, false # set_bit_cnt < 2, not a prime prime_loop: li t1, 0 #count i*i li t3, 0 #counter for 0-i mul: add t1, t1, t0 # t2 = t2 + i addi t3, t3, 1 # t3++ bge t3,t0 , check_prime #t3 = t0, finish i*i j mul check_prime: bgt t1, s3, true # i*i > a, return true mv t1, s3 mod: sub t1, t1, t0 # a = a - i bge t1, t0, mod # a >= i, keep sub beqz t1, false # a == 0, not prime addi t0, t0, 1 # i++ j prime_loop false: li s3, 0 # reset set_bit_cnt addi s0, s0, 1 # left++ j loop_main # another data true: li s3, 0 addi s2, s2, 1 # ans++ addi s0, s0, 1 j loop_main ``` after: ```c .data prime: .word 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 33 .text la a1, prime li a2, 12 is_prime: lw a3, 0(a1) addi a1, a1, 4 addi a2, a2, -1 beq s3, a3, true bgt s3, a3, is_prime addi s0, s0, 1 # left++ j loop_main # another data ``` with a smaller range of numbers(**100 numbers**) | before | after | |:------:|:-----:| |![no_100](https://hackmd.io/_uploads/ByYY6mqJJe.png)|![yes_100](https://hackmd.io/_uploads/SyBqa7cy1x.png)| Based on the two images above, the cycles were reduced by approximately 0.98%. with a larger range of numbers(**100000 numbers**) | before | after | |:------:|:-----:| |![nc_100000](https://hackmd.io/_uploads/BJTT2XcyJx.png)|![yes_100000](https://hackmd.io/_uploads/ByG-Ym5kke.png)| Based on the two images above, the cycles were reduced by approximately 2.36%. From the two sets of tests, one with 100 data points and a cycle reduction of 0.98%, and the other with 100,000 data points and a cycle reduction of 2.36%, we can conclude that **the performance improvements become more pronounced as the data size increases**. This suggests that the optimization has a greater impact on larger datasets, likely due to better efficiency scaling over more iterations. ### `my_clz` function In addition, referring to last year's [Quiz 1](https://hackmd.io/@sysprog/arch2023-quiz1-sol), I replaced `my_clz` with a loopless version. When comparing versions 2 and 3, we observed a significant reduction in cycles, dropping from over 60,000 to nearly 20,000, while IPC improved from 0.7 to 0.9. The key difference between the two versions lies in the implementation of `my_clz`: one uses a loop while the other does not. It is reasonable to infer that the program's main computations are concentrated in `my_clz`. Therefore, if we can minimize branch evaluations within `my_clz`, we can significantly reduce the number of flushed instructions, thereby enhancing performance. Although this results in an increase in code length, the reduction in cycle count makes it a worthwhile trade-off. :::danger Explain and discuss ::: function `count_leading_zeros` modified to count 32-bit in last year's Quiz 1 in c ```c #include <stdint.h> uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x7f)); } ``` assembly code to `count_leading_zeros` which without loop ```c #my_clz srli t1, t0, 1 #x |= (x >> 1); or t2, t0, t1 srli t1, t2, 2 #x |= (x >> 2); or t2, t2, t1 srli t1, t2, 4 #x |= (x >> 4); or t2, t2, t1 srli t1, t2, 8 #x |= (x >> 8); or t2, t2, t1 srli t1, t2, 16 #x |= (x >> 16); or t2, t2, t1 srli t1, t2, 1 #x -= ((x >> 1) & 0x55555555); li t3, 0x55555555 and t1, t1, t3 sub t2, t2, t1 srli t1, t2, 2 #x = ((x >> 2) & 0x33333333) + (x & 0x33333333); li t3, 0x33333333 and t1, t1, t3 and t2, t2, t3 add t2, t2, t1 li t3, 0x0f0f0f0f #x = ((x >> 4) + x) & 0x0f0f0f0f; srli t1, t2, 4 add t2, t1, t2 and t2, t2, t3 srli t1, t2, 8 #x += (x >> 8); add t2, t2, t1 srli t1, t2, 16 #x += (x >> 16); add t2, t2, t1 li t3, 0x0000007f #(32 - (x & 0x7f)) and t2, t2, t3 sub a0, s4, t2 ``` assembly code to `my_clz` which with loop ```c my_clz: li t1, 0 # count = 0 li t2, 31 # i = 31 clz_for: li t3, 1 sll t3, t3, t2 # 1U << i and t4, t0, t3 # x & (1U << i) bne t4, x0, clz_done # if (x == 0), exit loop addi t1, t1, 1 # count += 1 addi t2, t2, -1 # --i bge t2, x0, clz_for # Repeat while i >= 0 clz_done: mv a0, t1 jr ra ``` | `my_clz` with loop | `my_clz` without loop | | -------- | -------- | | ![yes_100](https://hackmd.io/_uploads/ryQwGN91ye.png) | ![v3_100](https://hackmd.io/_uploads/Hk12G45kkg.png) | Based on the two images above, the cycles were reduced by approximately 70.9%. From the two sets of test results, we can conclude that there is a significant performance improvement in the latter. The drastic reduction in cycle count indicates that the optimizations made in the second test lead to more efficient execution, suggesting that the changes have effectively enhanced the overall performance of the code. ## Analysis We use [Ripes](https://github.com/mortbopet/Ripes) to simulate RISC-V processor Ripes is a 5-stage pipeline instruciton simulator. ![ripes](https://hackmd.io/_uploads/ByhNRBUJ1l.png) | Stage | Description | |:-----:|:-----------:| | IF |The instruction is retrieved from memory based on the program counter (PC)| | ID |The instruction is decoded to determine its type and required operands| | EXE |The ALU performs the specified operation or computes memory addresses| | MEM |Data is read from or written to memory if required by the instruction| | WB |The result is written back to the destination register or memory| Below, I will use a single instruction as an example: ### IF ![IF](https://hackmd.io/_uploads/S19KZ8UJye.png) 1. In the **IF stage**, the program counter (PC) initially holds the value `0x00000000` and is used to fetch the instruction from instruction memory. 2. The instruction memory returns the instruction `0x10000417`, and the PC is updated to `0x00000004` to point to the next instruction. 3. The fetched instruction `0x10000417` is then sent to the instruction decode stage (IFID), where it can be processed for further decoding. ### ID ![ID](https://hackmd.io/_uploads/rkaxM8Ukke.png) The Decoder will decode the instruction `0x10000417` into: | immediate\[31:12] | rd\[11:7] | opcode\[6:0] | |:-------------------- |:---------:|:------------:| | 00010000000000000000 | 01000 | 0010111 | Result in immediate = `0x10000000`, rd = `0x08` and opcode = `AUIPC`. This instruction will operate` rd = immediate + PC`. ### EXE ![EXE](https://hackmd.io/_uploads/SJfbf8U1Jl.png) The ALU adds `PC` and `immediate` ,then output the result to the next stage. ### MEM ![MEM](https://hackmd.io/_uploads/ByewbfIUkye.png) The **MEM** stage passes `alu_res` and `write_back_address` into the next stage. ### WB ![WB](https://hackmd.io/_uploads/H1h-GI8J1e.png) The **WB** stage writes the `alu_res` back to register file which located at `0x08`. ## Reference * [Leetcode 762. Prime Number of Set Bits in Binary Representation](https://leetcode.com/problems/prime-number-of-set-bits-in-binary-representation/description/) * [Quiz1 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz1-sol) * [Quiz1 of Computer Architecture (2023 Fall)](https://hackmd.io/@sysprog/arch2023-quiz1-sol) * [Lab1: RV32I Simulator](https://hackmd.io/@sysprog/H1TpVYMdB) :::danger Always refer to primary sources, such as official RISC-V documentation. :::