Computer Architecture

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`阮祈翰`](https://github.com/akkk74159/computer_architecture)> ## [Problem C](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C) In Problem C, we have three functions: `fabsf`, `my_clz`, `fp16_to_fp32` ## fabsf ### C code ```c= static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` #### Test case for C code | Input in decimal | Input in Hex | Output in decimal | Output in Hex | |-------------------|---|---|---| |0|0x00000000|0|0x00000000| |-0|0x80000000|0|0x00000000| |Normalized number: -1.15625|0xbf940000|1.15625|0x3f940000| |Denormalized number: -1.39047e-39|0x800f2411|1.39047e-39|0x000f2411| |NaN|0xffffffff|nan|0x7fffffff| |inf|0x7f800000|inf|0x7f800000| |-inf|0xff800000|inf|0x7f800000| ### Assembly ```c= .data case1: .word 0x00000000 # 0.0 case2: .word 0x80000000 # -0.0 case3: .word 0xbf940000 # -1.15625 case4: .word 0x800f2411 # -1.39047e-39 case5: .word 0xffffffff # nan case6: .word 0x7f800000 # inf case7: .word 0xff800000 # -inf output1: .word 0x00000000 # 0.0 output2: .word 0x00000000 # 0.0 output3: .word 0x3f940000 # 1.15625 output4: .word 0x000f2411 # 1.39047e-39 output5: .word 0x7fffffff # nan output6: .word 0x7f800000 # inf output7: .word 0x7f800000 # inf str1: .string "pass " str2: .string "fail " .text main: li s0, 7 # s0 stores amount of test cases la t0, case1 # t0 = address of case1 loop: # run case 1~7 lw a0, 0(t0) jal ra, fabsf lw a1, 28(t0) jal ra, print_result addi t0, t0, 4 # move to next case addi s0, s0, -1 # case counter -1 bnez s0, loop #exit li a7, 10 ecall fabsf: li t1, 0x7FFFFFFF and a0, a0, t1 # x and 0x7fffffff jr ra # print result (pass or fail) print_result: bne a0, a1, fail la a0, str1 # print "pass" li a7, 4 ecall j print_end fail: la a0, str2 # print "fail" li a7, 4 ecall print_end: jr ra # return ``` #### Test case for Assembly | Test case 1 | Test case 2 | Test case 3 | Test case 4 | Test case 5 | Test case 6 | Test case 7 | |----|----|----|----|----|----|----| |pass|pass|pass|pass|pass|pass|pass| ![fabsf](https://hackmd.io/_uploads/Sy0cU1u11e.png) ## my_clz ### C code ```c= static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` #### Test case for C code | Input in decimal | Input in Hex | Output | |-------------------|---|---| |0|0x00000000|32| |1537|0x00000601|21| |4294967295|0xffffffff|0| ### Assembly ```c= .data case1: .word 0 case2: .word 1537 case3: .word 4294967295 output1: .word 32 output2: .word 21 output3: .word 0 str1: .string "pass " str2: .string "fail " .text main: li s0, 3 # s0 stores amount of test cases la t0, case1 # t0 = address of case1 loop: # run case 1~3 lw a0, 0(t0) jal ra, my_clz lw a1, 12(t0) jal ra, print_result addi t0, t0, 4 # move to next case addi s0, s0, -1 # case counter -1 bnez s0, loop #exit li a7, 10 ecall # count leading zero function my_clz: li t2, 1 # init t3 = 1U li t3, 0 # init count = 0 li t4, 31 # init i = 31 my_clz_loop: sll t5, t2, t4 # t5 = 1U << i and t5, t5, a0 # t5 = (x & (1U << i)) bne t5, x0, clz_end # if (t5 != 0), exit the loop addi t3, t3, 1 # count++ addi t4, t4, -1 # --i bge t4, zero, my_clz_loop # if (i >= 0), stay in the loop clz_end: mv a0, t3 # a0 = count jr ra # print result (pass or fail) print_result: bne a0, a1, fail la a0, str1 # print "pass" li a7, 4 ecall j print_end fail: la a0, str2 # print "fail" li a7, 4 ecall print_end: jr ra ``` ### Test case for Assembly | Test case 1 | Test case 2 | Test case 3 | |-------------|-------------|-------------| |pass|pass|pass| ![image](https://hackmd.io/_uploads/ry85woD1yg.png) ## fp16_to_fp32 #### C code ```c= static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` #### Test case for C code | Input in decimal | Iuput in Hex | Output in decimal | Output in Hex | |-------------------|---|---|---| |0|0x0000|0|0x00000000| |-0|0x8000|-0|0x80000000| |Normalized number: -1.157|0xbca0|-1.15625|0xbf940000| |Denormalized number: -0.00000871|0x8092|-0.000008702278|0xb7120000| |NaN|0xfc01|nan|0xff802000| |inf|0x7c00|inf|0x7f800000| |-inf|0xfc00|-inf|0xff800000| #### Assembly ```c= .data case1: .word 0x0000 # 0.0 case2: .word 0x8000 # -0.0 case3: .word 0xbca0 # -1.157 case4: .word 0x8092 # -0.00000871 case5: .word 0xfc01 # nan case6: .word 0x7c00 # inf case7: .word 0xfc00 # -inf output1: .word 0x00000000 # 0.0 output2: .word 0x80000000 # -0.0 output3: .word 0xbf940000 # -1.15625 output4: .word 0xb7120000 # -0.000008702278 output5: .word 0xff802000 # nan output6: .word 0x7f800000 # inf output7: .word 0xff800000 # -inf str1: .string "pass " str2: .string "fail " .text main: li s0, 7 # s0 stores amount of test cases la t0, case1 # t0 = address of case1 loop: # run case 1~7 lw a0, 0(t0) jal ra, fp16_to_fp32 lw a1, 28(t0) jal ra, print_result addi t0, t0, 4 # move to next case addi s0, s0, -1 # case counter -1 bnez s0, loop #exit li a7, 10 ecall fp16_to_fp32: # s1 = w slli s1, a0, 16 # s1 = h << 16 # s2 = sign li s7, 0x80000000 and s2, s1, s7 # s2 = w & 0x80000000 # s3 = nonsign li s7, 0x7FFFFFFF and s3, s1, s7 # s3 = w & 0x7FFFFFFF # s4 = renorm_shift my_clz: li t2, 1 # init t3 = 1U li t3, 0 # init count = 0 li t4, 31 # init i = 31 my_clz_loop: sll t5, t2, t4 # t5 = 1U << i and t5, t5, s3 # t5 = (x & (1U << i)) bne t5, x0, clz_end # if (t5 != 0), exit the loop addi t3, t3, 1 # count++ addi t4, t4, -1 # --i bge t4, zero, my_clz_loop # if (i >= 0), stay in the loop clz_end: mv s4, t3 # s4 = my_clz(nonsign) addi s4, s4, -5 bge s4, zero, gt5 # if (s4 > 5), s4 = (s4 - 5) add s4, zero, zero # else s4 = 0 gt5: # s5 = inf_nan_mask li s7, 0x04000000 add s5, s3, s7 # s5 = nonsign + 0x04000000 srai s5, s5, 8 # s5 = (nonsign + 0x04000000) >> 8 li s7, 0x7F800000 and s5, s5, s7 # s5 = ((nonsign + 0x04000000) >> 8) & (0x7F800000) # s6 = ~zero_mask addi s6, s3, -1 # s6 = nonsign - 1 srai s6, s6, 31 # s6 = (nonsign - 1) >> 31 li s7, 0xFFFFFFFF xor s6, s6, s7 # s6 = ~zero_mask sll s3, s3, s4 # s3 = nonsign << renorm_shift srli s3, s3, 3 # s3 = s3 >> 3 li s7, 0x70 sub s4, s7, s4 # s4 = 0x70 - renorm_shift slli s4, s4, 23 # s4 = s4 << 23 add s3, s3, s4 # s3 = (nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23) or s3, s3, s5 # s3 = s3 | inf_nan_mask and s3, s3, s6 # s3 = s3 & ~zero_mask or a0, s2, s3 # a0 = sign | s3 jr ra # return # print result (pass or fail) print_result: bne a0, a1, fail la a0, str1 # print "pass" li a7, 4 ecall j print_end fail: la a0, str2 # print "fail" li a7, 4 ecall print_end: jr ra # return ``` #### Test case for Assembly | Test case 1 | Test case 2 | Test case 3 | Test case 4 | Test case 5 | Test case 6 | Test case 7 | |----|----|----|----|----|----|----| |pass|pass|pass|pass|pass|pass|pass| ![fp16tofp32](https://hackmd.io/_uploads/r15Fwbty1l.png) ## LeetCode : 3226. Number of Bit Changes to Make Two Integers Equal [LeetCode : 3226. Number of Bit Changes to Make Two Integers Equal](https://leetcode.com/problems/number-of-bit-changes-to-make-two-integers-equal/) Difficulty : easy Problem description: You are given two positive integers `n` and `k`. You can choose **any** bit in the **binary representation** of n that is equal to 1 and change it to 0. Return the *number of changes* needed to make `n` equal to `k`. If it is impossible, return `-1`. ### Example 1: >Input: n = 13, k = 4 > >Output: 2 > >Explanation: >Initially, the binary >representations of n and k are n = (1101)~2~ and k = (0100)~2~. >We can change the first and fourth >bits of n. The resulting integer is n = (0100)2 = k. ### Example 2: >Input: n = 21, k = 21 > >Output: 0 > >Explanation: >n and k are already equal, so no changes are needed. ### Example 3: >Input: n = 14, k = 13 > >Output: -1 > >Explanation: >It is not possible to make n equal to k. ## Implementation You can check the source code [here](https://github.com/akkk74159/computer_architecture) Using the **my_clz** (counting leading zero) function from `Problem C in Quiz 1` to calculate the number of leading zeros of `n XOR k` (assume it is **num**). We then compare bit difference between `n` and `k` starting from least significant bit. After at most `32 - num` times of shifting bits we can find the number of bit changes to make `n` and `k` equal. ### C code The function of `my_clz`: ```c= static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) // 1U = unsigned int 1 break; count++; } return count; } ``` The function `bitChanges` compare `n` and `k` bits by bits starting from LSB, then return the number of bit to change. ```c= int bitChanges(int n, int k) { int num = n ^ k; num = my_clz(num); int result = 0; for(int i = 0;i < 32 - num;i++){ if((n & 1) == 1 && (k & 1) == 0){ result = result + 1; } else if((n & 1) == 0 && (k & 1) == 1){ result = -1; break; } n >>= 1; k >>= 1; } return result; } ``` ```c= int main(){ int n1 = 13, k1 = 4; int n2 = 21, k2 = 21; int n3 = 14, k3 = 13; int output1 = bitChanges(n1, k1); int output2 = bitChanges(n2, k2); int output3 = bitChanges(n3, k3); printf("%d\n", output1); printf("%d\n", output2); printf("%d\n", output3); return 0; } ``` ### Assembly code ```c= .data n1: .word 13 k1: .word 4 output1: .word 2 n2: .word 21 k2: .word 21 output2: .word 0 n3: .word 14 k3: .word 13 output3: .word -1 str1: .string "pass " str2: .string "fail " .text main: li s0, 3 # s0 stores how many test cases la s1, n1 # s1 stores address of n1 loop: lw a0, 0(s1) # a0 = n lw a1, 4(s1) # a1 = k lw a2, 8(s1) # a2 = correct output addi sp, sp, -4 sw ra, 0(sp) jal ra, bitChanges # goto bitChanges function lw ra, 0(sp) addi sp, sp, 4 jal ra, print_result # goto print result addi s1, s1, 12 # move to the next test case addi s0, s0, -1 # case counter -1 bnez s0, loop #exit li a7, 10 ecall #bitChanges function bitChanges: mv t0, a0 # t0 = n mv t1, a1 # t1 = k xor a0, t0, t1 # a0 = n xor k addi sp, sp, -4 sw ra, 0(sp) jal ra, my_clz # goto my_clz function lw ra, 0(sp) addi sp, sp, 4 li t2, 0 # init i = 0 li t3, 32 # let t3 = 32 sub t3, t3, a0 # let t3 = (32 - num) li t4, 0 # init result = 0 bne t3, x0, bitChanges_loop # if (32 - num) != 0 then start looping li a0, 0 # if (32 - num) == 0, the result is 0 j bitChanges_end bitChanges_loop: andi s2, t0, 1 # s2 = (n & 1) andi s3, t1, 1 # s3 = (k & 1) beq s2, zero, skip_add # if s2 == 0, goto skip_add bne s3, zero, skip_add # if s3 == 1, goto skip_add addi t4, t4, 1 # if(s2 == 1) && (s3 == 0), then result++ skip_add: bne s2, zero, shift # if s2 == 1, goto shift beq s3, zero, shift # if s3 == 0, goto shift li a0, -1 # if (s2 == 0) && (s3 == 1), then result = -1 j bitChanges_end shift: srli t0, t0, 1 # n >>= 1 srli t1, t1, 1 # k >>= 1 addi t2, t2, 1 # i++ bne t2, t3, bitChanges_loop # if(i < (32 - num)), stay in the loop mv a0, t4 # else, let a0 = result bitChanges_end: jr ra # return #clz function my_clz: li t2, 1 # init t2 = 1U li t3, 0 # init count = 0 li t4, 31 # init i = 31 my_clz_loop: sll t5, t2, t4 # t5 = 1U << i and t5, t5, a0 # t5 = (x & (1U << i)) bne t5, x0, my_clz_end # if (t5 != 0), exit the loop addi t3, t3, 1 # count++ addi t4, t4, -1 # --i bge t4, zero, my_clz_loop # if (i >= 0), stay in the loop my_clz_end: mv a0, t3 # a0 = count jr ra # return # print result (pass or fail) print_result: bne a0, a2, fail la a0, str1 # print "pass" li a7, 4 ecall j print_end fail: la a0, str2 # print "fail" li a7, 4 ecall print_end: jr ra # return ``` ## Result ### Test cases: #### Case 1 >Input: n = 13, k = 4 > >Output: 2 #### Case 2 >Input: n = 21, k = 21 > >Output: 0 #### Case 3 >Input: n = 14, k = 13 > >Output: -1 ### C program output ||n|k|output| |-|-|-|-| |Case1|13|4|2| |Case2|21|21|0| |Case3|14|13|-1| ### Assembly output ||n|k|output| |-|-|-|-| |Case1|13|4|pass| |Case2|21|21|pass| |Case3|14|13|pass| ### Performance | Execution info|| | -------------- | --- | | Cycles | 981 | | Instrs. retired | 713 | | CPI | 1.38 | | IPC | 0.727 | | Clock rate | 10.87Hz | ## Second version We implement the **branchless** `my_clz`, meaning there will be no branch instruction in `my_clz` function. ### C code ```c= int blclz(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return 32 - (x & 0x7f); //move 32 from my_clz } ``` ### Assembly ```c= .data n1: .word 13 k1: .word 4 output1: .word 2 n2: .word 21 k2: .word 21 output2: .word 0 n3: .word 14 k3: .word 13 output3: .word -1 str1: .string "pass " str2: .string "fail " .text main: li s0, 3 # s0 stores how many test cases la s1, n1 # s1 stores address of n1 loop: lw a0, 0(s1) # a0 = n lw a1, 4(s1) # a1 = k lw a2, 8(s1) # a2 = correct output addi sp, sp, -4 sw ra, 0(sp) jal ra, bitChanges # goto bitChanges function lw ra, 0(sp) addi sp, sp, 4 jal ra, print_result # goto print result addi s1, s1, 12 # move to the next test case addi s0, s0, -1 # case counter -1 bnez s0, loop #exit li a7, 10 ecall #bitChanges function bitChanges: mv t0, a0 # t0 = n mv t1, a1 # t1 = k xor a0, t0, t1 # a0 = n xor k addi sp, sp, -4 sw ra, 0(sp) jal ra, blclz # goto my_clz function lw ra, 0(sp) addi sp, sp, 4 li t2, 0 # init i = 0 li t3, 32 # let t3 = 32 sub t3, t3, a0 # let t3 = (32 - num) li t4, 0 # init result = 0 bne t3, x0, bitChanges_loop # if (32 - num) != 0 then start looping li a0, 0 # if (32 - num) == 0, the result is 0 j bitChanges_end bitChanges_loop: andi s2, t0, 1 # s2 = (n & 1) andi s3, t1, 1 # s3 = (k & 1) beq s2, zero, skip_add # if s2 == 0, goto skip_add bne s3, zero, skip_add # if s3 == 1, goto skip_add addi t4, t4, 1 # if(s2 == 1) && (s3 == 0), then result++ skip_add: bne s2, zero, shift # if s2 == 1, goto shift beq s3, zero, shift # if s3 == 0, goto shift li a0, -1 # if (s2 == 0) && (s3 == 1), then result = -1 j bitChanges_end shift: srli t0, t0, 1 # n >>= 1 srli t1, t1, 1 # k >>= 1 addi t2, t2, 1 # i++ bne t2, t3, bitChanges_loop # if(i < (32 - num)), stay in the loop mv a0, t4 # else, let a0 = result bitChanges_end: jr ra # return #clz function blclz: srli t2, a0, 1 # x >> 1 or a0, a0, t2 # x |= (x >> 1) srli t2, a0, 2 # x >> 2 or a0, a0, t2 # x |= (x >> 2) srli t2, a0, 4 # x >> 4 or a0, a0, t2 # x |= (x >> 4) srli t2, a0, 8 # x >> 8 or a0, a0, t2 # x |= (x >> 8) srli t2, a0, 16 # x >> 16 or a0, a0, t2 # x |= (x >> 16) srli t2, a0, 1 # x >> 1 li t4, 0x55555555 and t2, t2, t4 # t2 = ((x >> 1) & 0x55555555) sub a0, a0, t2 # x = x - t2 srli t2, a0, 2 # x >> 2 li t4, 0x33333333 and t2, t2, t4 # t2 = ((x >> 2) & 0x33333333) and t4, a0, t4 # t4 = x & 0x33333333 add a0, t2, t4 # x = t2 & t4 srli t2, a0, 4 # x >> 4 add t2, t2, a0 # t2 = (x >> 4) + x li t4, 0x0f0f0f0f and a0, t2, t4 # x = t2 & 0x0f0f0f0f srli t2, a0, 8 # x >> 8 add a0, a0, t2 # x += (x >> 8) srli t2, a0, 16 # x >> 16 add a0, a0, t2 # x += (x >> 16) andi a0, a0, 0x7F # x & 0x7f li t2, 32 sub a0, t2, a0 # a0 = 32 - (x & 0x7f) clz_end: jr ra # return # print result (pass or fail) print_result: bne a0, a2, fail la a0, str1 # print "pass" li a7, 4 ecall j print_end fail: la a0, str2 # print "fail" li a7, 4 ecall print_end: jr ra # return ``` ### Performance We transform the original `my_clz` function to `blclz` function, resulting less cycles and instructions for the program to run. That indicates a better improvement for the code. | Execution info|Version1|Version2| | ------------- | ------ | ------ | | Cycles |981| 340 | | Instrs. retired|713| 254 | | CPI |1.38| 1.34 | | IPC |7.27| 0.747 | | Clock rate |10.87Hz| 10.99Hz | ## Analysis We use [Ripe](https://github.com/mortbopet/Ripes) to simulate RISC-V processor. ### Pseudo Instruction Ripes will automatically convert the pseudo instructions to the Executable instructions. ``` 00000000 <main>: 0: 00300413 addi x8 x0 3 4: 10000497 auipc x9 0x10000 8: ffc48493 addi x9 x9 -4 0000000c <loop>: c: 0004a503 lw x10 0 x9 10: 0044a583 lw x11 4 x9 14: 0084a603 lw x12 8 x9 18: ffc10113 addi x2 x2 -4 1c: 00112023 sw x1 0 x2 20: 024000ef jal x1 36 <bitChanges> 24: 00012083 lw x1 0 x2 28: 00410113 addi x2 x2 4 2c: 118000ef jal x1 280 <print_result> 30: 00c48493 addi x9 x9 12 34: fff40413 addi x8 x8 -1 38: fc041ae3 bne x8 x0 -44 <loop> 3c: 00a00893 addi x17 x0 10 40: 00000073 ecall 00000044 <bitChanges>: 44: 00050293 addi x5 x10 0 48: 00058313 addi x6 x11 0 4c: 0062c533 xor x10 x5 x6 50: ffc10113 addi x2 x2 -4 54: 00112023 sw x1 0 x2 58: 064000ef jal x1 100 <blclz> 5c: 00012083 lw x1 0 x2 60: 00410113 addi x2 x2 4 64: 00000393 addi x7 x0 0 68: 02000e13 addi x28 x0 32 6c: 40ae0e33 sub x28 x28 x10 70: 00000e93 addi x29 x0 0 74: 000e1663 bne x28 x0 12 <bitChanges_loop> 78: 00000513 addi x10 x0 0 7c: 03c0006f jal x0 60 <bitChanges_end> 00000080 <bitChanges_loop>: 80: 0012f913 andi x18 x5 1 84: 00137993 andi x19 x6 1 88: 00090663 beq x18 x0 12 <skip_add> 8c: 00099463 bne x19 x0 8 <skip_add> 90: 001e8e93 addi x29 x29 1 00000094 <skip_add>: 94: 00091863 bne x18 x0 16 <shift> 98: 00098663 beq x19 x0 12 <shift> 9c: fff00513 addi x10 x0 -1 a0: 0180006f jal x0 24 <bitChanges_end> 000000a4 <shift>: a4: 0012d293 srli x5 x5 1 a8: 00135313 srli x6 x6 1 ac: 00138393 addi x7 x7 1 b0: fdc398e3 bne x7 x28 -48 <bitChanges_loop> b4: 000e8513 addi x10 x29 0 000000b8 <bitChanges_end>: b8: 00008067 jalr x0 x1 0 000000bc <blclz>: bc: 00155393 srli x7 x10 1 c0: 00756533 or x10 x10 x7 c4: 00255393 srli x7 x10 2 c8: 00756533 or x10 x10 x7 cc: 00455393 srli x7 x10 4 d0: 00756533 or x10 x10 x7 d4: 00855393 srli x7 x10 8 d8: 00756533 or x10 x10 x7 dc: 01055393 srli x7 x10 16 e0: 00756533 or x10 x10 x7 e4: 00155393 srli x7 x10 1 e8: 55555eb7 lui x29 0x55555 ec: 555e8e93 addi x29 x29 1365 f0: 01d3f3b3 and x7 x7 x29 f4: 40750533 sub x10 x10 x7 f8: 00255393 srli x7 x10 2 fc: 33333eb7 lui x29 0x33333 100: 333e8e93 addi x29 x29 819 104: 01d3f3b3 and x7 x7 x29 108: 01d57eb3 and x29 x10 x29 10c: 01d38533 add x10 x7 x29 110: 00455393 srli x7 x10 4 114: 00a383b3 add x7 x7 x10 118: 0f0f1eb7 lui x29 0xf0f1 11c: f0fe8e93 addi x29 x29 -241 120: 01d3f533 and x10 x7 x29 124: 00855393 srli x7 x10 8 128: 00750533 add x10 x10 x7 12c: 01055393 srli x7 x10 16 130: 00750533 add x10 x10 x7 134: 07f57513 andi x10 x10 127 138: 02000393 addi x7 x0 32 13c: 40a38533 sub x10 x7 x10 00000140 <clz_end>: 140: 00008067 jalr x0 x1 0 00000144 <print_result>: 144: 00c51c63 bne x10 x12 24 <fail> 148: 10000517 auipc x10 0x10000 14c: edc50513 addi x10 x10 -292 150: 00400893 addi x17 x0 4 154: 00000073 ecall 158: 0140006f jal x0 20 <print_end> 0000015c <fail>: 15c: 10000517 auipc x10 0x10000 160: ece50513 addi x10 x10 -306 164: 00400893 addi x17 x0 4 168: 00000073 ecall 0000016c <print_end>: 16c: 00008067 jalr x0 x1 0 ``` ### 5-Stage RISC-V Processor Ripes provides different processors to execute the code. In my case, I selected 5-stage processer to run my code. ![](https://hackmd.io/_uploads/BkQ-R7-11x.png) I choose `add x10 x10 x7` as an example to demonstrate how instruction works in each stage, including `IF`, `ID`, `EX`, `MEM`, `WB`. 1. **IF (Instrction Fetch)** - Fetch the instructions from the instruction memory. 2. **ID (Instruction Decode)** - Decode the instruction to determine the type of operation, such as addition, subtraction, or load/store. - Read the values of source registers from the register file. 3. **EX (Execute)** - The ALU (Arithmetic Logic Unit) executes arithmetic (e.g., add, sub) and logical operations (e.g., AND, OR, XOR). - For branch instructions, the EX stage calculates whether the branch condition is true or false, based on the values of the source registers. 4. **MEM (Memory access)** - Load instructions: Reading data from memory to be stored in a register. - Store instructions: Writing data from a register into a memory location. 5. **WB (Write back)** - For ALU operations (e.g., add, sub), the result computed by the ALU in the EX stage is written back to the destination register. - For load instructions (e.g., lw), the value read from memory in the MEM stage is written to the destination register. --- ### 1. Instruction Fetch ![IF](https://hackmd.io/_uploads/H14a14ck1x.png) - The current program counter is `0x00000128`, which means that `add x10 x10 x7` instruction is located at memory address `000000128`. The processor will **fetch the instruction** which stored at the current address from the instruction memory. In this case, the instruction will be translated to machine code `0x00750533` - The program counter will then be added by 4(`0x0000012c`) for the next instruction to be fetched. ### 2. Instruction Decode ![ID](https://hackmd.io/_uploads/SyBngEcyye.png) - The decoder will decode the machine code into several parts. - `opcode` which indicate this instruction is `ADD` - `rs1` which is `0x0a` (x10) - `rs2` which is `0x07` (x7) - `rd` which is `0x0a` (x10) - Reg1 will hold the data of `rs1`(0x00000004). - Reg2 will hold the data of `rs2`(0x00000004). - `rd` will be sent to next stage. ### 3. Execute ![EX](https://hackmd.io/_uploads/ryozfN9Jkx.png) - In this stage, `Op1` will hold the data from `rs1`, `Op2` will hold the data from `rs2`. ALU then operates `ADD` instruction of the two inputs, `Op1`(0x00000004) and `Op2`(0x00000000), the result will be stored in `Res` which is `0x00000004`. >The reason why Op2 turned into 0x00000000 is because of data forwarding (We will talk about this later). - `rd` will be sent to next stage. ### 4. Memory Access ![MEM](https://hackmd.io/_uploads/BkfzQ4q1Jx.png) - `ADD` instruction uses the R-type format. In this stage, R-type format no need to load data from memory or store data to memory. - The `Wr en` is set to 0 since `ADD` don't store data to memory. - `rd` will be sent to next stage. ### 5. Write Back ![WB](https://hackmd.io/_uploads/Bytw74cJJe.png) - In this stage, `RES`(0x00000004) will be stored in `Wr data` and written back to the destination register, which is `rd`(0x0a), represented as `Wr idx` in the `ID` stage. ## Memory The data is stored in the memory start from address `0x1000_0000`. We want to load data to registers. The assembler uses `auipc` and `program counter` to set the higher 20 bits, then uses `addi` to set the lower 12 bits to get the memory address. ``` .data n1: .word 13 #0x1000 0000 k1: .word 4 #0x1000 0004 output1: .word 2 #0x1000 0008 n2: .word 21 #0x1000 000c k2: .word 21 #0x1000 0010 output2: .word 0 #0x1000 0014 n3: .word 14 #0x1000 0018 k3: .word 13 #0x1000 001c output3: .word -1 #0x1000 0020 ``` Store address of `n1` to s1 register ``` la s1, n1 ``` Pseudo Instruction ``` 4: 10000497 auipc x9 0x10000 #x9 == 0x10000004 8: ffc48493 addi x9 x9 -4 #x9 == 0x10000000 ``` Load `n1` to a0 register, `k1` to a1 register, `output1` to a2 register ``` #At this point, s1 == 0x10000000 lw a0, 0(s1) # a0 = n1 lw a1, 4(s1) # a1 = k1 lw a2, 8(s1) # a2 = output1 ``` Move to next test case ``` addi s1, s1, 12 ``` Load `n2` to a0 register, `k2` to a1 register, `output2` to a2 register ``` #At this point, s1 == 0x1000000c lw a0, 0(s1) # a0 = n2 lw a1, 4(s1) # a1 = k2 lw a2, 8(s1) # a2 = output2 ``` Move to next test case ``` addi s1, s1, 12 ``` Load `n3` to a0 register, `k3` to a1 register, `output3` to a2 register ``` #At this point, s1 == 0x10000018 lw a0, 0(s1) # a0 = n3 lw a1, 4(s1) # a1 = k3 lw a2, 8(s1) # a2 = output3 ``` ## Pipeline data hazard Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Most commonly seen data hazard is usually caused by `read after write(RAW)` dependency. In the Ripes simulator, we can employ [data forwarding](https://en.wikipedia.org/wiki/Operand_forwarding) to deal with it. ### RAW data hazard ``` addi x5 x10 0 addi x6 x11 0 xor x10 x5 x6 ``` ![](https://hackmd.io/_uploads/r1_-T0xyyl.png) In this case, `x5` and `x6` has not been writen to register in this cycle. For `xor` instruction, the data it read from the register is still the original data. Thus, we have to get the newest data from the `EX/MEM` and `MEM/WB`. By employing data forwarding, we can get the latest `x5` data from `MEM/WB` stage and the latest `x6` data from `EX/MEM` stage. They will then be `op1` and `op2` respectively for the ALU operation. ## Reference * [Quiz1 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz1-sol) * [LeetCode : 3226. Number of Bit Changes to Make Two Integers Equal](https://leetcode.com/problems/number-of-bit-changes-to-make-two-integers-equal/description/) * [RISC-V 指令集架構介紹 - RV32I](https://tclin914.github.io/16df19b4/)