Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [dennis356811](https://github.com/dennis356811) > ## Quiz1 Problem C: Counting Leading Zero The counting leading zeros (CLZ) function is used to determine how many zero bits are presented at the beginning (most significant bits) of a binary number, before the first occurrence of a '1'. ### C Code #### Version 1: From quiz 1 ```cpp static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` This function counts the leading zeros of an unsigned 32-bit integer by recursively checking each bit starting from the most significant one. It begins with the index ```i = 31``` and shifts the bit to determine how many leading zeros there are. After each shift, it performs a bitwise AND operation with ```x``` to check the value of the bit in the ```32-i``` position. Once it encounters the first non-zero bit, the loop terminates, and the count of leading zeros is returned. :::info The version has a big disadvatage that when the number is small, that is, has a lot of leading zero, the function will suffer from going through many loops. ::: So, I come up with another solution. #### Version 2: Binary search clz ```cpp int clz_bs(uint32_t x) { int count = 0; if (x == 0) { return 32; } if (x <= 0x0000FFFF) { count += 16; x <<= 16; } if (x <= 0x00FFFFFF) { count += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { count += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { count += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { count += 1; } return count; } ``` This version separates 32 bits into 6 cases using binary search alike method. It means the program only need to checks 6 condition for every number. In the first case, when x = 0, just returns 32. In the other cases, x is compared to a specific constant. When x are smaller than it, it means that x at least has that much leading zero corresponding to the cases. After every checking, shift x and start next checking until every cases are done. #### Test Program ```cpp #include <stdint.h> #include <stdio.h> static inline int clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int clz_bs(uint32_t x) { int count = 0; if (x == 0) { return 32; } if (x <= 0x0000FFFF) { count += 16; x <<= 16; } if (x <= 0x00FFFFFF) { count += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { count += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { count += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { count += 1; } return count; } int main() { const int size = 8; uint32_t arr[size] = {64, 128, 1584367, 2147483647, 4294967295, 300, 878780422, 920105}; int error = 0; printf("test original clz\n"); for (int i = 0; i < size; i++) { printf("answer: %d, your result: %d", __builtin_clz(arr[i]), clz(arr[i])); if (__builtin_clz(arr[i]) == clz(arr[i])) { printf(" ,pass !\n"); } else { printf(" ,wrong !\n"); error++; } } printf("test binary search clz\n"); for (int i = 0; i < size; i++) { printf("answer: %d, your result: %d", __builtin_clz(arr[i]), clz_bs(arr[i])); if (__builtin_clz(arr[i]) == clz_bs(arr[i])) { printf(" ,pass !\n"); } else { printf(" ,wrong !\n"); error++; } } if (error != 0) { printf("Total has %d error !\n", error); } else { printf("All pass !\n"); } } ``` The test program's compares the value returned between the two clz function I designed with the __builtin_clz(), which is in gcc's standard library. :::info According to GCC manual, for __builtin_clz(x), it is not defined when x = 0, that is, it won't return 32. That's why I didn't choose 0 to test. ::: #### Output ```cpp test original clz answer: 25, your result: 25 ,pass ! answer: 24, your result: 24 ,pass ! answer: 11, your result: 11 ,pass ! answer: 1, your result: 1 ,pass ! answer: 0, your result: 0 ,pass ! answer: 23, your result: 23 ,pass ! answer: 2, your result: 2 ,pass ! answer: 12, your result: 12 ,pass ! test binary search clz answer: 25, your result: 25 ,pass ! answer: 24, your result: 24 ,pass ! answer: 11, your result: 11 ,pass ! answer: 1, your result: 1 ,pass ! answer: 0, your result: 0 ,pass ! answer: 23, your result: 23 ,pass ! answer: 2, your result: 2 ,pass ! answer: 12, your result: 12 ,pass ! All pass ! ``` ### Assembly Code #### Version 1: From quiz 1 ```c .data num_test: .word 10 test: .word 64, 128, 0, 2147483647, 4294967295, -1, -2147483647, 300, 878780422, 920105 answer: .word 25, 24, 32, 1, 0, 0, 0, 23, 2, 12 output: .string "your result -> " output2: .string " , answer -> " next_line: .string "\n" output_pass: .string "All pass\n" output_error: .string "Fail, total has " output_error2: .string " error" .text main: lw s0, num_test # loading number of test la s1, test # loading test data address la s2, answer # loading answer data address li s3, 0 # i = 0 li s6, 0 # int error = 0 bge s3, s0, main_for_loop_exit main_for_loop: slli s4, s3, 2 # set s4 = i * 4 bytes add s5, s4, s1 # set s5 = *arr[i] lw a0, 0(s5) # loading test data from memory ## do clz function jal ra, clz ## verification add s5, s2, s4 # set s5 = *answer[i] lw a1, 0(s5) # loading answer data from memory beq a0, a1, pass addi s6, s6, 1 pass: jal ra, print ## i++, check i < num_test addi s3, s3, 1 blt s3, s0, main_for_loop main_for_loop_exit: ## program finish beq s6, x0, all_pass la a0, output_error li a7, 4 ecall add a0, s6, x0 li a7, 1 ecall la a0, output_error2 li a7, 4 ecall jal x0, exit all_pass: la a0, output_pass li a7, 4 ecall exit: li a7, 10 ecall clz: li t0, 0 # int count = 0 li t1, 31 # int i = 31 li t2, 1 # set t2 = 1U blt t1, x0, loop_exit # if i < 0, go outside loop for_loop: sll t3, t2, t1 # set t3 = (1U << i) and t3, t3, a0 # set t3 = (x & (1U << i)) bne t3, x0, loop_exit # if (x & (1U << i) == 1): break addi t0, t0, 1 # count++ ## condition check (j >= 0) addi t1, t1, -1 # --i bge t1, x0, for_loop # if (i >= 0) go loop again loop_exit: add a0, t0, x0 # return count jalr x0, ra, 0 print: ## your result: add a2, a0, x0 # backup a0 result la a0, output li a7, 4 ecall add a0, a2, x0 li a7, 1 ecall ## answer: la a0, output2 li a7, 4 ecall add a0, a1, x0 li a7, 1 ecall la a0, next_line li a7, 4 ecall jalr x0, ra, 0 ``` #### Version 2: Binary search clz ```cpp .data num_test: .word 10 test: .word 64, 128, 0, 2147483647, 4294967295, -1, -2147483647, 300, 878780422, 920105 answer: .word 25, 24, 32, 1, 0, 0, 0, 23, 2, 12 output: .string "your result -> " output2: .string " , answer -> " next_line: .string "\n" output_pass: .string "All pass\n" output_error: .string "Fail, total has " output_error2: .string " error" .text main: lw s0, num_test # loading number of test la s1, test # loading test data address la s2, answer # loading answer data address li s3, 0 # i = 0 li s6, 0 # int error = 0 bge s3, s0, main_for_loop_exit main_for_loop: slli s4, s3, 2 # set s4 = i * 4 bytes add s5, s4, s1 # set s5 = *arr[i] lw a0, 0(s5) # loading test data from memory ## do clz function jal ra, clz_bs ## verification add s5, s2, s4 # set s5 = *answer[i] lw a1, 0(s5) # loading answer data from memory beq a0, a1, pass addi s6, s6, 1 pass: jal ra, print ## i++, check i < num_test addi s3, s3, 1 blt s3, s0, main_for_loop main_for_loop_exit: ## program finish beq s6, x0, all_pass la a0, output_error li a7, 4 ecall add a0, s6, x0 li a7, 1 ecall la a0, output_error2 li a7, 4 ecall jal x0, exit all_pass: la a0, output_pass li a7, 4 ecall exit: li a7, 10 ecall clz_bs: ## save ra addi sp, sp, -4 sw ra, 0(sp) add t0, x0, x0 # int count = 0 beq a0, x0, x_zero # if (x == 0): li t1, 0x0000FFFF bgtu a0, t1, check_8 # if (x <= 0x0000FFFF): addi t0, t0, 16 # count += 16 slli a0, a0, 16 # x <<= 16 check_8: li t1, 0x00FFFFFF bgtu a0, t1, check_4 # if (x <= 0x00FFFFFF): addi t0, t0, 8 # count += 8 slli a0, a0, 8 # x <<= 8 check_4: li t1, 0x0FFFFFFF bgtu a0, t1, check_2 # if (x <= 0x0FFFFFFF): addi t0, t0, 4 # count += 4 slli a0, a0, 4 # x <<= 4 check_2: li t1, 0x3FFFFFFF bgtu a0, t1, check_1 # if (x <= 0x0000FFFF): addi t0, t0, 2 # count += 2 slli a0, a0, 2 # x <<= 2 check_1: li t1, 0x7FFFFFFF bgtu a0, t1, end_clz # if (x <= 0x0000FFFF): addi t0, t0, 1 # count += 1 slli a0, a0, 1 # x <<= 1 end_clz: add a0, t0, x0 # return count ## Retrieved ra lw ra, 0(sp) addi sp, sp, 4 jalr x0, ra, 0 x_zero: li a0, 32 # return 32 ## Retrieved ra lw ra, 0(sp) addi sp, sp, 4 jalr x0, ra, 0 print: ## your result: add a2, a0, x0 # backup a0 result la a0, output li a7, 4 ecall add a0, a2, x0 li a7, 1 ecall ## answer: la a0, output2 li a7, 4 ecall add a0, a1, x0 li a7, 1 ecall la a0, next_line li a7, 4 ecall jalr x0, ra, 0 ``` #### Output ```cpp your result -> 25 , answer -> 25 your result -> 24 , answer -> 24 your result -> 32 , answer -> 32 your result -> 1 , answer -> 1 your result -> 0 , answer -> 0 your result -> 0 , answer -> 0 your result -> 0 , answer -> 0 your result -> 23 , answer -> 23 your result -> 2 , answer -> 2 your result -> 12 , answer -> 12 All pass Program exited with code: 0 ``` #### Comparison | | clz (v1) | clz_bs (v2) | | --- | --- | --- | | Cycles | 1609 | 861 | | Instructions | 1117 | 557 | | CPI | 1.44 | 1.55 | | IPC | 0.694 | 0.647 | From above table, it is obviously that the second version of clz has fewer execution cycles and instruction. ## Improving shift-and-add multiplication using clz From Quiz 2 Problem A, it implements the multiplication by using shift-and-add technique. In this approach, the multiplicand ```x``` is repeatedly added to itself according to the value of the multiplier ```y``` , effectively performing ```x``` multiplied by ```y```. ### Assembly Code #### Version 1: From quiz2 ```cpp .data num_test: .word 14 multiplier: .word 9, 9, -9, -9, 0, 10, 0, -10, 0, 1024, 32768, -32768, 32768, -32768 multiplicand: .word 7, -7, 7, -7, 0, 0, 10, 0, -10, -1024, 32768, -32768, -32768, 32768 answer: .word 63, -63, -63, 63, 0, 0, 0, 0, 0, -1048576, 1073741824, 1073741824, -1073741824, -1073741824 next_line: .string "\n" output: .string "your result -> " output2: .string " , answer -> " output_pass: .string "All pass\n" output_error: .string "Fail, total has " output_error2: .string " error" .text main: la s0, multiplier # load multiplier address la s1, multiplicand # load multiplicand address la s2, answer # Load answer address la s3, num_test # Load number of test lw s3, 0(s3) li s5, 0 # int i = 0 li s6, 0 # int error = 0 main_for_loop: li s4, 0 # Initialize accumulator ## get test data slli t0, s5, 2 # size = i * 4 byte add t1, t0, s0 # get multiplier[i] address lw t1, 0(t1) # set t1 = mulitplier[i] add t2, t0, s1 # get multiplicand[i] address lw t2, 0(t2) # set t1 = mulitplicand[i] ## make multipler positvie srai t3, t1, 31 # t3 = 0xFFFFFFFF if different signed bge t1, x0, multiplier_positive # If multiplier positive (#A02) neg t1, t1 # Make multiplier positive multiplier_positive: ## use clz to initialize bit counter (t4) add a0, t1, x0 # pass arguement li t4, 32 # t4 = 32 ## do shift_and_add_loop shift_and_add_loop: beq t4, x0, end_shift_and_add # Exit if bit count is zero andi t5, t1, 1 # Check least significant bit beq t5, x0, skip_add # Skip add if bit is 0 add s4, s4, t2 # Add to accumulator skip_add: srai t1, t1, 1 # Right shift multiplier slli t2, t2, 1 # Left shift multiplicand addi t4, t4, -1 # Decrease bit counter jal x0, shift_and_add_loop # Repeat loop end_shift_and_add: xor s4, s4, t3 andi t3, t3, 1 add s4, s4, t3 jal ra, print addi s5, s5, 1 blt s5, s3, main_for_loop main_for_loop_exit: beq s6, x0, pass la a0, output_error li a7, 4 ecall add a0, s6, x0 li a7, 1 ecall la a0, output_error2 li a7, 4 ecall jal x0, exit pass: la a0, output_pass li a7, 4 ecall exit: li a7, 10 ecall print: la a0, output li a7, 4 ecall add a0, s4, x0 li a7, 1 ecall la a0, output2 li a7, 4 ecall add a0, t0, s2 lw a0, 0(a0) li a7, 1 ecall beq a0, s4, correct addi s6, s6, 1 correct: la a0, next_line li a7, 4 ecall jalr x0, ra, 0 ``` In this approach, I fixed the bugs of dealing with negative number, and add some testing data for verification. Before the shifting routine, I first check the signed bit of multiplier, ensuring it is positive, so that it can correctly checks its least significant bit ```(LSB)``` to know whether the multiplicand needed to be added. After adding is performed, it then checks the next ```LSB``` of multipler by shifting right one bit, and shifting multiplicand left one bit. In the original one, it totally performs 32 times of shifting in shift-and-add technique, while the actual number don't need that much of shifting. :::info By using clz to multiplier in advance, it won't need 32 times of shifting for every number, increasing the efficiency of the program. ::: #### Version 2: Improving shift-and-add by clz from quiz1 ```cpp .data num_test: .word 14 multiplier: .word 9, 9, -9, -9, 0, 10, 0, -10, 0, 1024, 32768, -32768, 32768, -32768 multiplicand: .word 7, -7, 7, -7, 0, 0, 10, 0, -10, -1024, 32768, -32768, -32768, 32768 answer: .word 63, -63, -63, 63, 0, 0, 0, 0, 0, -1048576, 1073741824, 1073741824, -1073741824, -1073741824 next_line: .string "\n" output: .string "your result -> " output2: .string " , answer -> " output_pass: .string "All pass\n" output_error: .string "Fail, total has " output_error2: .string " error" .text main: la s0, multiplier # load multiplier address la s1, multiplicand # load multiplicand address la s2, answer # Load answer address la s3, num_test # Load number of test lw s3, 0(s3) li s5, 0 # int i = 0 li s6, 0 # int error = 0 main_for_loop: li s4, 0 # Initialize accumulator ## get test data slli t0, s5, 2 # size = i * 4 byte add t1, t0, s0 # get multiplier[i] address lw t1, 0(t1) # set t1 = mulitplier[i] add t2, t0, s1 # get multiplicand[i] address lw t2, 0(t2) # set t1 = mulitplicand[i] ## make multipler positvie srai t3, t1, 31 # t3 = 0xFFFFFFFF if different signed bge t1, x0, multiplier_positive # If multiplier positive (#A02) neg t1, t1 # Make multiplier positive multiplier_positive: ## use clz to initialize bit counter (t4) add a0, t1, x0 # pass arguement ## Caller save addi sp, sp, -16 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) sw t3, 12(sp) jal ra, clz # return a0 = clz(multiplier) ## Retrieved caller save lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) lw t3, 12(sp) addi sp, sp, 16 li t4, 32 # t4 = 32 sub t4, t4, a0 # t4 = 32 - clz(multiplier) ## do shift_and_add_loop shift_and_add_loop: beq t4, x0, end_shift_and_add # Exit if bit count is zero andi t5, t1, 1 # Check least significant bit beq t5, x0, skip_add # Skip add if bit is 0 add s4, s4, t2 # Add to accumulator skip_add: srai t1, t1, 1 # Right shift multiplier slli t2, t2, 1 # Left shift multiplicand addi t4, t4, -1 # Decrease bit counter jal x0, shift_and_add_loop # Repeat loop end_shift_and_add: xor s4, s4, t3 andi t3, t3, 1 add s4, s4, t3 jal ra, print addi s5, s5, 1 blt s5, s3, main_for_loop main_for_loop_exit: beq s6, x0, pass la a0, output_error li a7, 4 ecall add a0, s6, x0 li a7, 1 ecall la a0, output_error2 li a7, 4 ecall jal x0, exit pass: la a0, output_pass li a7, 4 ecall exit: li a7, 10 ecall ## function clz: ## used register: t0~t3 clz: li t0, 0 # int count = 0 li t1, 31 # int i = 31 li t2, 1 # set t2 = 1U blt t1, x0, loop_exit # if i < 0, go outside loop for_loop: sll t3, t2, t1 # set t3 = (1U << i) and t3, t3, a0 # set t3 = (x & (1U << i)) bne t3, x0, loop_exit # if (x & (1U << i) == 1): break addi t0, t0, 1 # count++ ## condition check (j >= 0) addi t1, t1, -1 # --i bge t1, x0, for_loop # if (i >= 0) go loop again loop_exit: add a0, t0, x0 # return count jalr x0, ra, 0 print: la a0, output li a7, 4 ecall add a0, s4, x0 li a7, 1 ecall la a0, output2 li a7, 4 ecall add a0, t0, s2 lw a0, 0(a0) li a7, 1 ecall beq a0, s4, correct addi s6, s6, 1 correct: la a0, next_line li a7, 4 ecall jalr x0, ra, 0 ``` In this approach, instead of shifting ```32``` times for every number, I use clz function to count how many leading zero ```clz(x)``` multiplier has, which means only ```32 - clz(x)``` times of shifting needs to be done. #### Version 3 : Improving shift-and-add by binary searching clz ```cpp .data num_test: .word 14 multiplier: .word 9, 9, -9, -9, 0, 10, 0, -10, 0, 1024, 32768, -32768, 32768, -32768 multiplicand: .word 7, -7, 7, -7, 0, 0, 10, 0, -10, -1024, 32768, -32768, -32768, 32768 answer: .word 63, -63, -63, 63, 0, 0, 0, 0, 0, -1048576, 1073741824, 1073741824, -1073741824, -1073741824 next_line: .string "\n" output: .string "your result -> " output2: .string " , answer -> " output_pass: .string "All pass\n" output_error: .string "Fail, total has " output_error2: .string " error" .text main: la s0, multiplier # load multiplier address la s1, multiplicand # load multiplicand address la s2, answer # Load answer address la s3, num_test # Load number of test lw s3, 0(s3) li s5, 0 # int i = 0 li s6, 0 # int error = 0 main_for_loop: li s4, 0 # Initialize accumulator ## get test data slli t0, s5, 2 # size = i * 4 byte add t1, t0, s0 # get multiplier[i] address lw t1, 0(t1) # set t1 = mulitplier[i] add t2, t0, s1 # get multiplicand[i] address lw t2, 0(t2) # set t1 = mulitplicand[i] ## make multipler positvie srai t3, t1, 31 # t3 = 0xFFFFFFFF if different signed bge t1, x0, multiplier_positive # If multiplier positive (#A02) neg t1, t1 # Make multiplier positive multiplier_positive: ## use clz to initialize bit counter (t4) add a0, t1, x0 # pass arguement ## Caller save addi sp, sp, -8 sw t0, 0(sp) sw t1, 4(sp) jal ra, clz_bs # return a0 = clz(multiplier) ## Retrieved caller save lw t0, 0(sp) lw t1, 4(sp) addi sp, sp, 8 li t4, 32 # t4 = 32 sub t4, t4, a0 # t4 = 32 - clz(multiplier) ## do shift_and_add_loop shift_and_add_loop: beq t4, x0, end_shift_and_add # Exit if bit count is zero andi t5, t1, 1 # Check least significant bit beq t5, x0, skip_add # Skip add if bit is 0 add s4, s4, t2 # Add to accumulator skip_add: srai t1, t1, 1 # Right shift multiplier slli t2, t2, 1 # Left shift multiplicand addi t4, t4, -1 # Decrease bit counter jal x0, shift_and_add_loop # Repeat loop end_shift_and_add: xor s4, s4, t3 andi t3, t3, 1 add s4, s4, t3 jal ra, print addi s5, s5, 1 blt s5, s3, main_for_loop main_for_loop_exit: beq s6, x0, pass la a0, output_error li a7, 4 ecall add a0, s6, x0 li a7, 1 ecall la a0, output_error2 li a7, 4 ecall jal x0, exit pass: la a0, output_pass li a7, 4 ecall exit: li a7, 10 ecall ## function clz: ## used register: t0~t3 clz_bs: add t0, x0, x0 # int count = 0 beq a0, x0, x_zero # if (x == 0): li t1, 0x0000FFFF bgtu a0, t1, check_8 # if (x <= 0x0000FFFF): addi t0, t0, 16 # count += 16 slli a0, a0, 16 # x <<= 16 check_8: li t1, 0x00FFFFFF bgtu a0, t1, check_4 # if (x <= 0x00FFFFFF): addi t0, t0, 8 # count += 8 slli a0, a0, 8 # x <<= 8 check_4: li t1, 0x0FFFFFFF bgtu a0, t1, check_2 # if (x <= 0x0FFFFFFF): addi t0, t0, 4 # count += 4 slli a0, a0, 4 # x <<= 4 check_2: li t1, 0x3FFFFFFF bgtu a0, t1, check_1 # if (x <= 0x0000FFFF): addi t0, t0, 2 # count += 2 slli a0, a0, 2 # x <<= 2 check_1: li t1, 0x7FFFFFFF bgtu a0, t1, end_clz # if (x <= 0x0000FFFF): addi t0, t0, 1 # count += 1 slli a0, a0, 1 # x <<= 1 end_clz: add a0, t0, x0 # return count jalr x0, ra, 0 x_zero: li a0, 32 # return 32 jalr x0, ra, 0 print: la a0, output li a7, 4 ecall add a0, s4, x0 li a7, 1 ecall la a0, output2 li a7, 4 ecall add a0, t0, s2 lw a0, 0(a0) li a7, 1 ecall beq a0, s4, correct addi s6, s6, 1 correct: la a0, next_line li a7, 4 ecall jalr x0, ra, 0 ``` In this approach, I use the binary searching clz instead of the looped one from quiz1, futher improving the performance of multiplication #### Output ```cpp your result -> 63 , answer -> 63 your result -> -63 , answer -> -63 your result -> -63 , answer -> -63 your result -> 63 , answer -> 63 your result -> 0 , answer -> 0 your result -> 0 , answer -> 0 your result -> 0 , answer -> 0 your result -> 0 , answer -> 0 your result -> 0 , answer -> 0 your result -> -1048576 , answer -> -1048576 your result -> 1073741824 , answer -> 1073741824 your result -> 1073741824 , answer -> 1073741824 your result -> -1073741824 , answer -> -1073741824 your result -> -1073741824 , answer -> -1073741824 All pass ``` For the test data, I randomly choose some calculation between small number, big number , negative number, postive number, and zero. #### Comparison | | Quiz2 (v1) | with clz (v2) | with bs clz (v3) | | --- | --- | --- | --- | | Cycles | 5772 | 5082 | 2438 | | Instructions | 3708 | 3644 | 1648 | | CPI | 1.56 | 1.39 | 1.48 | | IPC | 0.642 | 0.717 | 0.676 | In the table, we can find out clz indeed improves the performance of shift-and-add technique. But in version 2, its clz still has a lot of loops to check, so its improvement is not that obvious. On the other hand, in version 3, it significantly decreases the number of cycles and instructions. ## RISC-V 5-stage pipeline processor Pipeline is a technique which separate the execution of an instruction into different stage, in order to do several instruction at the same time. In convenience, I will use the following program to introduce each stage. ```cpp 0000009c <clz>: 9c: 00000293 addi x5 x0 0 a0: 01f00313 addi x6 x0 31 a4: 00100393 addi x7 x0 1 a8: 00034e63 blt x6 x0 28 <loop_exit> 000000ac <for_loop>: ac: 00639e33 sll x28 x7 x6 b0: 00ae7e33 and x28 x28 x10 b4: 000e1863 bne x28 x0 16 <loop_exit> b8: 00128293 addi x5 x5 1 bc: fff30313 addi x6 x6 -1 c0: fe0356e3 bge x6 x0 -20 <for_loop> 000000c4 <loop_exit>: c4: 00028533 add x10 x5 x0 c8: 00008067 jalr x0 x1 0 ``` Take ```bc: fff30313 addi x6 x6 -1``` for example. The five stage includes: ### IF (Instruction Fetch) ![image](https://hackmd.io/_uploads/Bkhx_Yt1kg.png) This stage fetches instruction from instruction memory corresponding to the ``PC``. As seen from the picture, when ```pc``` is ```0x000000bc```, the corresponding address in memory is ```0xfff30313```, that is, ```addi x6, x6, -1``` in machine lauguage. ### ID (Instruction Decode) ![image](https://hackmd.io/_uploads/r1_eFFKy1l.png) This stage decodes the instruction from the last stage, and recognized the register and immediate being use in the instruction. As seen from the picture, the decoded register is ```0x06```, that is ```x6```, and the stored data is ```0x0000001f```. ```Imm``` blocks generate ```0xffffffff``` corresponding to ```-1``` in the instruction. ### EXE (Execution) ![image](https://hackmd.io/_uploads/rkc55Kt1Jl.png) This stage executes the instruction calculation using ```ALU``` unit. As seen from the picture, two input ```0x0000001f``` and ```0xffffffff``` are added and the output is ```0x0000001e```. ### MEM (Memory) ![image](https://hackmd.io/_uploads/ry76stY11l.png) This stage loads or store data from memory. Because in the previous example ```bc: fff30313 addi x6 x6 -1``` doesn't involve memory reading/writing, I use another example to demonstrate the effect. In instruction ```sw x1 0 x2```, it stores data in ```x1``` register into the address stored in ```x2``` register corresponding to the memory. As seen the picture, ```0x00000034``` is stored in ```x1``` register, and will be writed into ```0x7ffffff0``` stored in ```x2``` register in data memory. ### WB (Write back) ![image](https://hackmd.io/_uploads/rJpd6FFJ1x.png) ![image](https://hackmd.io/_uploads/rJUK6FFkyg.png) This stage writes data back to the register. As seen in the picture, the instruction eventually stores the result into ```x5``` register. The writing data is ```0x00000002``` and writing register is ```x5```.