# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by <`hsuhsiang`> (Hsu, Han-Hsiang) :::danger Check the title, making it consistent with the sample pages. ::: ## Quiz1 Probem C ### C code :::danger Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data. ::: ```c= static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` ```c= static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` ```c= static inline uint32_t fp16_to_fp32(uint16_t h) { /* * Extends the 16-bit half-precision floating-point number to 32 bits * by shifting it to the upper half of a 32-bit word: * +---+-----+------------+-------------------+ * | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 31 26-30 16-25 0-15 * * S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits. */ const uint32_t w = (uint32_t) h << 16; /* * Isolates the sign bit from the input number, placing it in the most * significant bit of a 32-bit word: * * +---+----------------------------------+ * | S |0000000 00000000 00000000 00000000| * +---+----------------------------------+ * Bits 31 0-31 */ const uint32_t sign = w & UINT32_C(0x80000000); /* * Extracts the mantissa and exponent from the input number, placing * them in bits 0-30 of the 32-bit word: * * +---+-----+------------+-------------------+ * | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 30 27-31 17-26 0-16 */ const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); /* * The renorm_shift variable indicates how many bits the mantissa * needs to be shifted to normalize the half-precision number. * For normalized numbers, renorm_shift will be 0. For denormalized * numbers, renorm_shift will be greater than 0. Shifting a * denormalized number will move the mantissa into the exponent, * normalizing it. */ uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; /* * If the half-precision number has an exponent of 15, adding a * specific value will cause overflow into bit 31, which converts * the upper 9 bits into ones. Thus: * inf_nan_mask == * 0x7F800000 if the half-precision number is * NaN or infinity (exponent of 15) * 0x00000000 otherwise */ const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); /* * If nonsign equals 0, subtracting 1 will cause overflow, setting * bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result * propagates bit 31 across all bits in zero_mask. Thus: * zero_mask == * 0xFFFFFFFF if the half-precision number is * zero (+0.0h or -0.0h) * 0x00000000 otherwise */ const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; /* * 1. Shifts nonsign left by renorm_shift to normalize it (for denormal * inputs). * 2. Shifts nonsign right by 3, adjusting the exponent to fit in the * 8-bit exponent field and moving the mantissa into the correct * position within the 23-bit mantissa field of the single-precision * format. * 3. Adds 0x70 to the exponent to account for the difference in bias * between half-precision and single-precision. * 4. Subtracts renorm_shift from the exponent to account for any * renormalization that occurred. * 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input * was NaN or infinity. * 6. ANDs with the inverted zero_mask to set the mantissa and exponent * to zero if the input was zero. * 7. Combines everything with the sign bit of the input number. */ return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } } ``` Implement the fp16_to_fp32 function, which converts a 16-bit floating-point number in IEEE half-precision format (bit representation) to a 32-bit floating-point number in IEEE single-precision format (bit representation). This implementation will avoid using any floating-point arithmetic and will utilize the clz function discussed earlier. ### Assembly code ```c= .data num_test: .word 10 test: .word 0x1234,0x5678,0x4876,0xF123,0xFC00,0x0400,0x3555,0x0022,0x0001,0x00FF answer: .word 0x3A468000,0x42cf0000,0x410ec000,0xc6246000,0xFF800000,0x38800000,0x3eaaa000,0x36080000,0x33800000,0x377f0000 statement1: .string "All Test Pass" statement2: .string "Wrong, Check Again" .text main: la t0, num_test lw a6, 0(t0) # a0 = num_test la s1, test la t3, answer fp16tofp32: lw t1, 0(s1) # t1 = number lw a2, 0(t3) # a2 = answer slli t0, t1, 16 li t6, 0x80000000 and t4, t0, t6 li t6, 0x7FFFFFFF and t5, t0, t6 ########### call function clz ############# addi sp, sp, -16 sw a1, 12(sp) sw t2, 8(sp) sw t1, 4(sp) sw t0, 0(sp) mv a0, t5 # a0 = num li a1, 0x55af jal ra, clz # a0 = leading zero lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) lw a1, 12(sp) ########################################## sltiu t6, a0, 6 bne t6, x0, lessthan5 addi a0, a0, -5 j inf_nan_mask lessthan5: li a0, 0 inf_nan_mask: li t6, 0x04000000 add t6, t6, t5 srai t6, t6, 8 li a1, 0x7F800000 and a1, t6, a1 addi a5, t5, -1 srai a5, a5, 31 sub a5, x0, a5 sll a3, t5, a0 srli a3, a3, 3 li a4, 0x70 sub a4, a4, a0 slli a4, a4, 23 add a3, a3, a4 or a3, a3, a1 or a3, a3, a5 or t2, a3, t4 next: bne t2, a2, wrong addi t3, t3, 4 addi s1, s1, 4 addi a6, a6, -1 bne a6, x0, fp16tofp32 return: la a0, statement1 addi a7, zero, 4 ecall j fin wrong: la a0, statement2 addi a7, zero, 4 ecall fin: li a7, 10 ecall ########################################## ###############Loop through############### clz: li t2, 0 li s3, 31 li t1, 1 loop: sll t0, t1, s3 and t0, a0, t0 bne t0, x0, stop addi t2, t2, 1 addi s3, s3, -1 bge s3, x0, loop stop: mv a0, t2 ret ########################################## ``` ## Count leading zero Count the leading zeros starting from the most significant bit (MSB). It returns an integer representing the number of zero bits that precede the first '1' in the binary representation of the number. :::info In C, according to the GCC manual, the behavior of __builtin_clz(x) is undefined when x = 0, meaning the output is unpredictable. However, in my implementation of the count leading zero function, when the input is x = 0, the output is explicitly defined to be 32. ::: 1. **Example 1**: For x = 16 * Binary representation: `00000000 00000000 00000000 00010000` * The number of leading zero bits is `27`(as there are `27` zeros before the first `1`). * Output: `clz(16)` returns `27`. 1. **Example 2**: For x = -1 * Binary representation: `11111111 11111111 11111111 111111111` * The number of leading zero bits is `0`(as there are `0` zeros before the first `1`). * Output: `clz(-1)` returns `0`. 1. **Example 3**: For x = 0 * Binary representation: `00000000 00000000 00000000 00000000` * The number of leading zero bits is `32`(as x is zero). * Output: `clz(0)` returns `32`. ### C code from quiz1 problem C, consider the C implementation of `my_clz`: #### first version (Naive loop through bits) ```c= int clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` This approach works by scanning each bit individually from left to right, stopping at the first occurrence of 1. Although this method is simple and easy to understand, it can be inefficient for certain inputs, especially when there are very few or no leading zeros, as the loop will have to examine all 32 bits in the worst case. :::danger Show the full C source code corresponding to the original problem set. ::: #### second version (Conditional checks) ```c= int clz(uint32_t x) { if (x == 0) return 32; int count = 0; if (x <= 0x0000FFFF) { count += 16; x <<= 16; } if (x <= 0x00FFFFFF) { count += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { count += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { count += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { count += 1; x <<= 1; } return count; } ``` This approach uses conditional checks to progressively determine the number of leading zeros in the input number. It works by shifting the bits and updating the count in chunks—16, 8, 4, and so on—until the total number of leading zeros is found. #### third version (Bitwise operations without branches) ```c= int clz(uint32_t x) { int count = 0, temp; temp = (x < 0x00010000) << 4; count += temp; x <<= temp; temp = (x < 0x01000000) << 3; count += temp; x <<= temp; temp = (x < 0x10000000) << 2; count += temp; x <<= temp; temp = (x < 0x40000000) << 1; count += temp; x <<= temp; temp = (x < 0x80000000); count += temp; x <<= temp; count += (x == 0); return count; } ``` This approach implements the same algorithm in second version but without using any branch instructions. Instead of conditional checks, it relies on bitwise operations to progressively determine the number of leading zeros. #### fourth version (Lookup table optimization) ```c= int clz(uint32_t x) { int count = 0, temp; temp = (x < 0x00010000) << 4; count += temp; x <<= temp; temp = (x < 0x01000000) << 3; count += temp; x <<= temp; temp = (x < 0x10000000) << 2; count += temp; x <<= temp; temp = (x >> 27) & 0x1e; count += (0x55af >> temp) & 3; count += (x == 0); return count; } ``` For the final two steps of the third version, we adopt a lookup table approach to efficiently determine the number of leading zeros in the upper 4 bits of x. ### Assembly code :::danger Use fewer instructions when possible. ::: #### first version (Naive loop through bits) ```c= .data num_test: .word 12 test: .word -1,0,1,2147483647,-2147483648,48763,54321,123456789,16,256,65536,65535 answer: .word 0,32,31,1,0,16,16,5,27,23,15,16 statement1: .string "All Test Pass" statement2: .string "Wrong, Check Again" .text main: la t0, num_test lw a0, 0(t0) # a0 = num_test la t1, test la t6, answer count: lw a1, 0(t1) # a1 = number lw a2, 0(t6) # a2 = answer li t2, 0 li t3, 31 li t4, 1 loop: sll t5, t4, t3 and t5, a1, t5 bne t5, x0, next addi t2, t2, 1 addi t3, t3, -1 bge t3, x0, loop next: bne t2, a2, wrong addi t6, t6, 4 addi t1, t1, 4 addi a0, a0, -1 bne a0, x0, count return: la a0, statement1 addi a7, zero, 4 ecall j fin wrong: la a0, statement2 addi a7, zero, 4 ecall fin: li a7, 10 ecall ``` #### second version (Conditional checks) ```c= .data num_test: .word 12 constant: .word 0x0000FFFF,0x00FFFFFF,0x0FFFFFFF,0x3FFFFFFF,0x7FFFFFFF test: .word -1,0,1,2147483647,-2147483648,48763,54321,123456789,16,256,65536,65535 answer: .word 0,32,31,1,0,16,16,5,27,23,15,16 statement1: .string "All Test Pass" statement2: .string "Wrong, Check Again" .text main: la t0, num_test lw a0, 0(t0) # a0 = num_test la t1, test la t4, answer count: lw a1, 0(t1) # a1 = number lw a2, 0(t4) # a2 = answer li t2, 0 # t2 = count li t6, 32 la t5, constant bne a1, x0, loop li t2, 32 j next loop: lw t3, 0(t5) srli t6, t6, 1 beq t6, x0, next addi t5, t5, 4 bltu t3, a1, loop add t2, t2, t6 sll a1, a1, t6 j loop next: bne t2, a2, wrong addi t4, t4, 4 addi t1, t1, 4 addi a0, a0, -1 bne a0, x0, count return: la a0, statement1 addi a7, zero, 4 ecall j fin wrong: la a0, statement2 addi a7, zero, 4 ecall fin: li a7, 10 ecall ``` #### third version (Bitwise operations without branches) ```c= .data num_test: .word 12 constant: .word 0x00010000,0x01000000,0x10000000,0x40000000,0x80000000 test: .word -1,0,1,2147483647,-2147483648,48763,54321,123456789,16,256,65536,65535 answer: .word 0,32,31,1,0,16,16,5,27,23,15,16 statement1: .string "All Test Pass" statement2: .string "Wrong, Check Again" .text main: la t0, num_test lw a0, 0(t0) # a0 = num_test la t1, test la t3, answer count: la t5, constant lw a1, 0(t1) # a1 = number lw a2, 0(t3) # a2 = answer li t2, 0 # t2 = count li t6, 4 loop: lw a3, 0(t5) sltu t4, a1, a3 sll t4, t4, t6 add t2, t2, t4 sll a1, a1, t4 addi t5, t5, 4 addi t6, t6, -1 bge t6, x0, loop sltiu t4, a1, 1 add t2, t2, t4 next: bne t2, a2, wrong addi t3, t3, 4 addi t1, t1, 4 addi a0, a0, -1 bne a0, x0, count return: la a0, statement1 addi a7, zero, 4 ecall j fin wrong: la a0, statement2 addi a7, zero, 4 ecall fin: li a7, 10 ecall ``` #### fourth version (Lookup table optimization and loop unrolling) ```c= .data num_test: .word 12 test: .word -1,0,1,2147483647,-2147483648,48763,54321,123456789,16,256,65536,65535 answer: .word 0,32,31,1,0,16,16,5,27,23,15,16 statement1: .string "All Test Pass" statement2: .string "Wrong, Check Again" .text main: la t0, num_test lw a0, 0(t0) # a0 = num_test la t1, test la t3, answer li s0, 0x00010000 li s1, 0x01000000 li s2, 0x10000000 li s3, 0x55af count: lw a1, 0(t1) # a1 = number lw a2, 0(t3) # a2 = answer li t2, 0 # t2 = count sltu t4, a1, s0 slli t4, t4, 4 add t2, t2, t4 sll a1, a1, t4 sltu t4, a1, s1 slli t4, t4, 3 add t2, t2, t4 sll a1, a1, t4 sltu t4, a1, s2 slli t4, t4, 2 add t2, t2, t4 sll a1, a1, t4 srli t4, a1, 27 andi t4, t4, 0x1e srl t4, s3, t4 andi t4, t4, 3 add t2, t2, t4 sltiu t4, a1, 1 add t2, t2, t4 next: bne t2, a2, wrong addi t3, t3, 4 addi t1, t1, 4 addi a0, a0, -1 bne a0, x0, count return: la a0, statement1 addi a7, zero, 4 ecall j fin wrong: la a0, statement2 addi a7, zero, 4 ecall fin: li a7, 10 ecall ``` :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: ### Comparison Comparison of the above four clz function: | | 1st | 2nd | 3rd | 4th (best)| | --- | --- | --- | --- | --- | |cycles|1675|725|838|375| |Insts |1259|537|650|343| |IPC |0.752|0.741|0.776|0.915| |CPI |1.33|1.35|1.29|1.09| From the table, we can find the fourth one has the least execution cycles. For Quiz1 Problem C, we can replace original loop through clz with branchless and optimized one: ```c= #########branchless and optimized######## clz: li t0, 0 li t1, 0x00010000 sltu t2, a0, t1 slli t2, t2, 4 add t0, t0, t2 sll a0, a0, t2 li t1, 0x01000000 sltu t2, a0, t1 slli t2, t2, 3 add t0, t0, t2 sll a0, a0, t2 li t1, 0x10000000 sltu t2, a0, t1 slli t2, t2, 2 add t0, t0, t2 sll a0, a0, t2 srli t1, a0, 27 andi t1, t1, 0x1e srl t1, a1, t1 andi t1, t1, 3 add t0, t0, t1 sltiu t1, a0, 1 add t0, t0, t1 mv a0, t0 ret ########################################## ``` Comparison of the Original and Revised clz Functions in Quiz1 Problem C: | | loop through | branchless and optimized | | --- | --- | --- | |cycles|1014|815| |Insts |836|727| |IPC |0.806|0.892| |CPI |1.24|1.43| Based on the data in the table, we can observe that the revised implementation reduces execution cycles by approximately 20%. ## Int to Float conversion ### Problem The goal of this problem is to convert a 32-bit signed integer into a single-precision floating-point number, following the IEEE 754 standard for floating-point representation. This involves translating the integer into its equivalent floating-point bit representation, which consists of three components: the sign bit, the exponent, and the mantissa. 1. **Example**: For $x = -1 = -1\times 1\times2^0$ * |Binary representation|: `00000000 00000000 00000000 00000001` * The sign bit is `1` * The exponent is `127 + 0 = 127` * The mantissa is `00000000 00000000 0000000` * The round bit is `0` * So output will be `1 01111111 00000000000000000000000 = 0xBF800000` 2. **Example**: For $x = 48763 = -1\times 1.48813\times2^{15}$ * |Binary representation|: `00000000 00000000 10111110 01111011` * The sign bit is `0` * The exponent is `127 + 15 = 142` * The mantissa is `01111100 11110110 0000000` * The round bit is `0` * So output will be `0 10001110 01111100111101100000000= 0x473E7B00` 3. **Example**: For $x = 134217736 = 1\times (1+2^{-24})\times2^{27}$ * |Binary representation|: `00001000 00000000 00000000 00001000` * The sign bit is `0` * The exponent is `127 + 27 = 154` * The mantissa is `00000000 00000000 0000000` * The round bit is `1` * But the last bit of mantissa is `0` * For round-to-even, round bit is modified to be `0` * So output will be `0 10011010 00000000000000000000000 = 0x4D000000` 4. **Example**: For $x = 134217752 = 1\times (1+3 \times 2^{-24})\times2^{27}$ * |Binary representation|: `00001000 00000000 00000000 00011000` * The sign bit is `0` * The exponent is `127 + 27 = 154` * The mantissa is `00000000 00000000 0000001` * The round bit is `1` * For round-to-even, round bit is still to be `1` * So output will be `0 10011010 00000000000000000000010 = 0x4D000002` 5. **Example**: For $x = 0$ * For `x = 0`, the output is defined as `0x00000000` * So output will be `0 00000000 00000000000000000000000 = 0x00000000` reference: https://onestepcode.com/int-to-float-c/ https://evanw.github.io/float-toy/ ### C code ```c= uint32_t IntToFloat(int num) { int leading_zero, exponent, mantissa, sign = 0, abs_num = num; int shift, round_bit, temp, last_bit; sign = num >> 31; // If num is 0, the function directly returns 0 if (num == 0) { return 0; } // The absolute value of num is taken to proceed with the conversion. if (num < 0) { abs_num = -num; } // calculates the number of leading zeros leading_zero = clz(abs_num); shift = 31 - leading_zero; // The exponent in IEEE 754 format is stored with a bias of 127 // So exponent = 127 + (31 - leading_zero); exponent = 127 + shift; mantissa = abs_num ^ (1 << shift); if (shift > 23) { // Round to closest round_bit = (mantissa >> (shift - 24)) & 0x00000001; // Round to even modification last_bit = (mantissa >> (shift - 23)) & 0x00000001; temp = (1 << (shift - 24)) - 1; temp = mantissa & temp; // if the round part is XXX.5 // in C, .5 will be round to closest even // if last_bit == 1, then round_bit = 1 // if last_bit == 0, then round_bit = 0 if (temp == 0 && round_bit == 1) { round_bit = last_bit; } mantissa = mantissa >> shift - 23; mantissa = mantissa + round_bit; } else { mantissa = mantissa << 23 - shift; } // The function assembles the single-precision floating-point number by packing: // The sign bit into the most significant bit 31, // The 8-bit exponent into bits 30-23, // The 23-bit mantissa into bits 22-0. uint32_t value = ((sign << 31) | (exponent << 23)) + mantissa; //internal validation assert(value == fp32_to_bits((float)(num))); return value; } ``` Convert Int type to single-precision floating-point bit representation. ```c= uint32_t fp32_to_bits(float f) { union { float as_value; uint32_t as_bits; } fp32 = {.as_value = f}; return fp32.as_bits; } ``` Then convert float type to bit representation ```c= int main() { srand(time(NULL)); int num, y, corner_case[7] = {0, -1, 1, -2147483648, 2147483647, 0x08000008,0x08000018 }; for (int i = 0; i < 7; i++) { y = IntToFloat(corner_case[i]); assert(y == fp32_to_bits((float)(corner_case[i]))); } for (int i = 0; i < 10000; i++) { num = ((rand() << 17) | rand()); y = IntToFloat(num); assert(y == fp32_to_bits((float)(num))); } printf("All test pass!!\n"); return 0; } ``` For the testing program, I started by testing 7 corner cases: * -1 * 0 * 1 * 2147483647 (INT_MAX) * -2147483648 (INT_MIN) * 0x08000008 (to test round-to-even behavior) * 0x08000018 (to test round-to-even behavior) Afterward, I generated 10,000 random values to further test the program. I used assert statements to verify that the function's output matched the result of the built-in C int-to-float conversion function. :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: :::success modified in the function ::: ### Assembly code ```c= .data num_test: .word 13 test: .word -1,0,1,2147483647,-2147483648,0x08000008,0x08000018,48763,-48763,123456789,537530813,278410568,1509961956 answer: .word 0xbf800000,0,0x3f800000,0x4f000000,0xcf000000,0x4d000000,0x4d000002,0x473e7b00,0xc73e7b00,0x4ceb79a3,0x4e002847,0x4d84c1aa,0x4eb40062 statement1: .string "All Test Pass" statement2: .string "Wrong, Check Again" .text main: la s0, num_test lw t0, 0(s0) # t0 = num_test la s1, test la t2, answer loop: lw t1, 0(s1) # t1 = num lw a2, 0(t2) # a2 = answer beq t1, x0, next sign: srli t3, t1, 31 bgt t1, x0, exponent sub t1, x0, t1 exponent: ########### call function clz ############# addi sp, sp, -12 sw t2, 8(sp) sw t1, 4(sp) sw t0, 0(sp) mv a0, t1 # a0 = num li a1, 0x55af jal ra, clz # a0 = leading zero lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) ########################################## sub a0, x0, a0 addi s5, a0, 31 addi t4, s5, 127 slt s6, s5, x0 mantissa: li s8, 1 sll s7, s8, s5 xor t5, t1, s7 round: li s7, 23 ble s5, s7, noround addi s9, s5, -24 srl s7, t5, s9 andi t6, s7, 1 addi s7, s5, -23 srl s7, t5, s7 andi s6, s7, 1 sll s8, s8, s9 addi s8, s8, -1 and s8, s8, t5 bne s8, x0, noroundeven beq t6, x0, noroundeven mv t6, s6 noroundeven: addi s9, s5, -23 srl t5, t5, s9 add t5, t5, t6 j merge noround: sub s5, x0, s5 addi s7, s5, 23 sll t5, t5, s7 merge: slli t3, t3, 31 slli t4, t4, 23 or a0, t3, t4 add t1, a0, t5 next: bne t1, a2, wrong addi t2, t2, 4 addi s1, s1, 4 addi t0, t0, -1 bne t0, zero, loop return: la a0, statement1 addi a7, zero, 4 ecall j fin wrong: la a0, statement2 addi a7, zero, 4 ecall fin: li a7, 10 ecall ########################################## clz: li t0, 0 li t1, 0x00010000 sltu t2, a0, t1 slli t2, t2, 4 add t0, t0, t2 sll a0, a0, t2 li t1, 0x01000000 sltu t2, a0, t1 slli t2, t2, 3 add t0, t0, t2 sll a0, a0, t2 li t1, 0x10000000 sltu t2, a0, t1 slli t2, t2, 2 add t0, t0, t2 sll a0, a0, t2 srli t1, a0, 27 andi t1, t1, 0x1e srl t1, a1, t1 andi t1, t1, 3 add t0, t0, t1 sltiu t1, a0, 1 add t0, t0, t1 mv a0, t0 ret ########################################## ``` :::danger Use fewer instructions. ::: ## pipeline explanation A 5-stage pipeline CPU is a common architecture used to improve instruction throughput by dividing the execution of an instruction into five stages. Each stage handles a different part of instruction processing, allowing multiple instructions to be processed simultaneously at different stages. #### Take add x7 x7 x29 for example: #### Instruction Fetch (IF): The CPU retrieves the next instruction from memory, based on the Program Counter (PC). The PC is then updated for the next cycle. ![image](https://hackmd.io/_uploads/rkDsGOACR.png) In this case, the current PC is 60, and the instruction is fetched from the instruction memory address 60 where the preloaded instruction is stored. #### Instruction Decode (ID): The fetched instruction is decoded to understand what operation is required. The operands are also fetched from registers if needed. ![image](https://hackmd.io/_uploads/HJ_CUORCC.png) The instruction is decoded as follows: * The operation is ADD * R1 index is 7 * R2 index is 29 * The values 10 and 0 are fetched from the corresponding registers. #### Execution (EXE): The CPU performs the actual computation or operation specified by the instruction. This could be an arithmetic or logic operation, or calculating an address for memory access. ![image](https://hackmd.io/_uploads/rkwUDdCA0.png) ![image](https://hackmd.io/_uploads/BkL5P_A00.png) For the value of rs1 (x7), which is 10, it directly enters the ALU for computation. However, for rs2 (x29), based on the instruction execution order, x29 will be written by the previous instruction, causing a Read-After-Write (RAW) hazard. Therefore, the mux below selects the forwarded value to provide the ALU with the correct data for calculation. #### Memory Access (MEM): If the instruction requires data to be read from or written to memory, this is done in the MEM stage. For load/store instructions, this is where data is fetched or saved in memory. ![image](https://hackmd.io/_uploads/ryoq_uA00.png) Since the ADD instruction does not involve memory, the computed value is passed directly to the next stage. #### Write Back (WB): The result of the instruction execution (from EXE or MEM stage) is written back to the register file, completing the instruction. ![image](https://hackmd.io/_uploads/HkL-tdRRC.png) The result of the ADD computation, which is 10, is written back to the register file.