Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by <[`amos0256`](https://github.com/amos0256/Computer-Aritecture-2024-Fall)> ###### tags: `Computer Architecture 2024 Fall` ## [Problem `C`](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C) ### `fabsf` The `fabsf` function is used to compute the absolute value of a single-precision floating-point number by performing a bitwise AND operation with the mask `0x7FFFFFFF` to clear the sign bit. ```c static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` ```c fabsf: li t0, 0x7FFFFFFF and a0, a0, t0 ret ``` ### `my_clz` The `my_clz` function counts the number of leading zeros in a 32-bit unsigned integer. It iterates through the bits of the input `x` from the MSB to the LSB, incrementing a counter until it encounters the first set bit, and then returns the count of leading zeros. ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` ```c my_clz: li t0, 0 # counter = 0 li t1, 31 # i counter li t2, 1 my_clz_loop: bltz t1, end_my_clz # if t1 < 0, goto end_my_clz sll t3, t2, t1 # t3 = 1 << i and t4, a0, t3 # t4 = x & (1 << i) bnez t4, end_my_clz # if t4 == 1, goto end_my_clz addi t0, t0, 1 # count++ addi t1, t1, -1 # i-- j my_clz_loop end_my_clz: mv a0, t0 # return the count ret ``` ### `fp16_to_fp32` The `fp16_to_fp32` function converts a 16-bit half-precision floating-point number into a 32-bit single-precision floating-point number. It extends the representation by shifting the half-precision number into the upper half of a 32-bit word, with the appropriate placement of the sign bit, exponent, and mantissa. The function handles both normalized and denormalized numbers by determining how much to shift the mantissa for normalization, and it also accounts for special cases such as infinity and NaN. Finally, the function combines all components—sign, adjusted exponent, and mantissa—into a single 32-bit output, ensuring accurate conversion between the two floating-point formats. ```c static inline uint32_t fp16_to_fp32(uint16_t h) { /* * Extends the 16-bit half-precision floating-point number to 32 bits * by shifting it to the upper half of a 32-bit word: * +---+-----+------------+-------------------+ * | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 31 26-30 16-25 0-15 * * S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits. */ const uint32_t w = (uint32_t) h << 16; /* * Isolates the sign bit from the input number, placing it in the most * significant bit of a 32-bit word: * * +---+----------------------------------+ * | S |0000000 00000000 00000000 00000000| * +---+----------------------------------+ * Bits 31 0-31 */ const uint32_t sign = w & UINT32_C(0x80000000); /* * Extracts the mantissa and exponent from the input number, placing * them in bits 0-30 of the 32-bit word: * * +---+-----+------------+-------------------+ * | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 30 27-31 17-26 0-16 */ const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); /* * The renorm_shift variable indicates how many bits the mantissa * needs to be shifted to normalize the half-precision number. * For normalized numbers, renorm_shift will be 0. For denormalized * numbers, renorm_shift will be greater than 0. Shifting a * denormalized number will move the mantissa into the exponent, * normalizing it. */ uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; /* * If the half-precision number has an exponent of 15, adding a * specific value will cause overflow into bit 31, which converts * the upper 9 bits into ones. Thus: * inf_nan_mask == * 0x7F800000 if the half-precision number is * NaN or infinity (exponent of 15) * 0x00000000 otherwise */ const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); /* * If nonsign equals 0, subtracting 1 will cause overflow, setting * bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result * propagates bit 31 across all bits in zero_mask. Thus: * zero_mask == * 0xFFFFFFFF if the half-precision number is * zero (+0.0h or -0.0h) * 0x00000000 otherwise */ const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; /* * 1. Shifts nonsign left by renorm_shift to normalize it (for denormal * inputs). * 2. Shifts nonsign right by 3, adjusting the exponent to fit in the * 8-bit exponent field and moving the mantissa into the correct * position within the 23-bit mantissa field of the single-precision * format. * 3. Adds 0x70 to the exponent to account for the difference in bias * between half-precision and single-precision. * 4. Subtracts renorm_shift from the exponent to account for any * renormalization that occurred. * 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input * was NaN or infinity. * 6. ANDs with the inverted zero_mask to set the mantissa and exponent * to zero if the input was zero. * 7. Combines everything with the sign bit of the input number. */ return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` There are three fp16 test cases in the RISC V assembly code: * `0x7C00` represent $+ \infty$ and `0x7F800000` in fp32 * `0x0001` represent $5.96 \times 10^{-8}$ and `0x33800000` in fp32 * `0x3C00` represent `1.0` and `0x3F800000` in fp32 ``` .data examples: .word 0x7C00 .word 0x0001 .word 0x3C00 expected_res: .word 0x7F800000 .word 0x33800000 .word 0x3F800000 correct_msg: .asciz "Correct" wrong_msg: .asciz "Wrong" newline: .asciz ".\n" .text .global main main: la t0, examples # t0 = address of examples la t1, expected_res # t1 = address of expected_res li t2, 3 # t2 = number of examples li t3, 0 # t3 = for loop counter loop_example: beq t2, t3, end_main # if counter == 3, goto end_main lw a0, 0(t0) # a0 = example jal ra, fp16_to_fp32 lw a1, 0(t1) # a1 = expected result jal ra, print_result addi t0, t0, 4 # move to the next example addi t1, t1, 4 # move to the next expected result addi t3, t3, 1 # increment loop counter j loop_example end_main: # exit program li a7, 10 # syscall exit ecall fp16_to_fp32: # preserve ra register addi sp, sp, -4 sw ra, 0(sp) # store return address # preserve t0-t3 registers addi sp, sp, -16 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) sw t3, 12(sp) slli a0, a0, 16 # shift left by 16 bits to extend to 32-bit # sign bit li t0, 0x80000000 # load sign mask and t1, a0, t0 # t1 is sign = w & 0x80000000 # nonsign bit li t0, 0x7FFFFFFF # load nonsign mask and t2, a0, t0 # t2 is nonsign = w & 0x7FFFFFFF # normalize the number, use my_clz to count leading zeros mv a0, t2 jal ra, my_clz li t0, 5 blt t0, a0, renorm # if renorm_shift > 5, goto renorm li a0, 0 # renorm_shift = 0 j continue renorm: sub a0, a0, t0 # renorm_shift -= 5 continue: # handle inf_nan_mask li t0, 0x04000000 add t3, t2, t0 # inf_nan_mask = nonsign + 0x04000000 srai t3, t3, 8 # inf_nan_mask >>= 8 li t0, 0x7F800000 and t3, t3, t0 # inf_nan_mask # handle zero_mask addi t4, t2, -1 # zero_mask = nonsign - 1 srai t4, t4, 31 # zero_mask >> 31 # normalize and adjust exponent sll t5, t2, a0 # shift nonsign left by renorm_shift srai t5, t5, 3 # shift nonsign right by 3 li t0, 0x70 sub t0, t0, a0 # 0x70 - renorm_shift slli t0, t0, 23 # adjust exponent and place in proper position add t5, t5, t0 # add the adjust exponent to the result # combine inf_nan_mask and handle zero case or t5, t5, t3 # or with inf_nan_mask not t4, t4 # invert zero_mask and t5, t5, t4 # zero out result if zero_mask is set # combine sign and result or a0, t1, t5 # combine with sign bit # restore t0-t3 registers lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) lw t3, 12(sp) addi sp, sp, 16 # restore t0-t3 registers lw ra, 0(sp) addi sp, sp, 4 ret # my_clz function my_clz: # preserve t1 and t2 registers addi sp, sp, -8 sw t1, 0(sp) sw t2, 4(sp) li t0, 0 # counter = 0 li t1, 31 # i counter li t2, 1 my_clz_loop: bltz t1, end_my_clz # if t1 < 0, goto end_my_clz sll t3, t2, t1 # t3 = 1 << i and t4, a0, t3 # t4 = x & (1 << i) bnez t4, end_my_clz # if t4 == 1, goto end_my_clz addi t0, t0, 1 # count++ addi t1, t1, -1 # i-- j my_clz_loop end_my_clz: mv a0, t0 # return the count # preserve t1 and t2 registers lw t1, 0(sp) lw t2, 4(sp) addi sp, sp, 8 ret # print result function print_result: bne a0, a1, wrong_case # if a0 != a1, goto wrong_case li a7, 4 ecall la a0, correct_msg li a7, 4 ecall j end_print wrong_case: la a0, wrong_msg li a7, 4 ecall end_print: la a0, newline li a7, 4 ecall ret ``` ## Minimum Flips to Make a OR b Equal to c > [LeetCode 1318](https://leetcode.com/problems/minimum-flips-to-make-a-or-b-equal-to-c/description/) Description: Given 3 positives numbers `a`, `b` and `c`. Return the minimum flips required in some bits of `a` and `b` to make ( `a` OR `b` == `c` ). (bitwise OR operation). Flip operation consists of change any single bit 1 to 0 or change the bit 0 to 1 in their binary representation. Constraint: * `1 <= a <= 10^9` * `1 <= b <= 10^9` * `1 <= c <= 10^9` ## Implementation ### Idle of solving problem 1. The result of bitwise OR (`a | b`) is 1 if either bit is 1, and 0 if both bits are 0. 2. For each bit position, there are two cases: * If the corresponding bit in `c` is 0, both `a` and `b` must have 0 at that bit position. The total number of required flips is equal to the count 1s at the corresponding positions in `a` and `b`, which is the sum of the bit values in `a` and `b`. $$ \begin{array}{cccccc} {} & {a} & {1} & {\underrightarrow{flip}} & {0} & {} \\ {} & {b} & {1} & {\underrightarrow{flip}} & {0} & {} \\ \hline {} & {c} & {} & {} & {0} & {} \end{array} \\ \Rightarrow total \ flips \ times = \# \ of \ 1s $$ * If the corresponding bit in `c` is 1, at least one of `a` or `b` needs to be 1. A flip is needed only when both `a` and `b` are 0, and the number of required flips is 1. $$ \begin{array}{cccccc} {} & {a} & {0} & {\underrightarrow{flip}} & {1} & {} \\ {} & {b} & {0} & {} & {0} & {} \\ \hline {} & {c} & {} & {} & {1} & {} \end{array} \\ \Rightarrow total \ flips \ times = 1 $$ ### C program #### First version In the first version, I used an intuitive approach, starting from the 32nd bit to compare the corresponding bits of the three numbers a, b, and c. The loop calculated the minimum number of required flips based on the ideas mentioned above until it arrived at the 1st bit. ```c int minFlips(int a, int b, int c){ int flips = 0; for (int i = 0; i < 32; i++) { int bitA = (a >> i) & 1; int bitB = (b >> i) & 1; int bitC = (c >> i) & 1; if (bitC == 0) { flips += bitA + bitB; } else { if (bitA == 0 && bitB == 0) { flips += 1; } } } return flips; } ``` #### Second version When the numbers `a`, `b`, and `c` are small, unnecessary iterations can occur due to comparisons involving 0. Therefore, we can first identify the leading 1 in the binary representation of the largest number among `a`, `b` and `c`. This enables to calculate the number of leading zeros for the maximum number, allowing to skip these unnecessary comparisons. * without counting leading zero $$ \begin{array} {} & {a} & {\overbrace{000...0}^{30 \ bits} 01} & {} \\ {} & {b} & {\overbrace{000...0}^{30 \ bits} 10} & {} \\ \hline {} & {c} & {\overbrace{000...0}^{30 \ bits} 11} & {} \end{array} \\ $$ It can observe that only the last two bits need to be compared. * with counting leading zero $$ \begin{array} {} & {a} & {01} & {} \\ {} & {b} & {10} & {} \\ \hline {} & {c} & {11} & {} \end{array} \\ $$ When calculating the counting leading zero of the three numbers `a`, `b`, and `c`, we avoid using a comparison of their sizes to find the maximum value. Since we only care about the position of the first occurrence of 1, performing an OR operation on the three numbers will not change the position of the first 1 that appears. ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int minFlips(int a, int b, int c) { int flips = 0; int maxBit = 31 - my_clz(a | b | c); for (int i = 0; i <= maxBit; i++) { int bitA = (a >> i) & 1; int bitB = (b >> i) & 1; int bitC = (c >> i) & 1; if (bitC == 0) { flips += bitA + bitB; } else { if (bitA == 0 && bitB == 0) { flips += 1; } } } return flips; } ``` :::danger Replace `else if (b >= a && b >= c)` with `if (b >= a && b >= c)`. Shorten the above. ::: ### RISC V Assembly #### First version ``` .data examples: .word 1, 2, 3 .word 51041, 65280, 716177407 .word 143165576, 715827882, 1 expected_res: .word 0 .word 14 .word 23 correct_msg: .asciz "Correct" wrong_msg: .asciz "Wrong" newline: .asciz ".\n" .text .global main main: la t0, examples # t0 = address of examples la t1, expected_res # t1 = address of expected_res li t2, 3 # t2 = number of examples li t3, 0 # t3 = for loop counter loop_example: beq t2, t3, end_main # if counter == 3, goto end_main # load the current example's value lw a0, 0(t0) lw a1, 4(t0) lw a2, 8(t0) jal ra, min_flips # call min_flips lw a1, 0(t1) # load the current expected result jal ra, print_result # call print_result addi t0, t0, 12 # move to the next example addi t1, t1, 4 # move to the next expected result addi t3, t3, 1 # increment loop counter j loop_example end_main: # exit program li a7, 10 # syscall exit ecall # min flips function min_flips: # preserve t0-t3 registers addi sp, sp, -16 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) sw t3, 12(sp) li t0, 0 # flips = 0 li t1, 31 # i = 31, 31 bits loop: bltz t1, end_loop # if i < 0, goto end_loop srl t2, a0, t1 # bitA = (a >> i) & 1 andi t2, t2, 1 srl t3, a1, t1 # bitB = (b >> i) & 1 andi t3, t3, 1 srl t4, a2, t1 # bitC = (c >> i) & 1 andi t4, t4, 1 beqz t4, add_flip # if bitC == 0, goto add_flip beqz t2, check_b # if bitA == 0, goto check_b j continue_loop add_flip: add t0, t0, t2 # flips += bitA add t0, t0, t3 # flips += bitB j continue_loop check_b: bnez t3, continue_loop # if bitB != 0, goto continue_loop addi t0, t0, 1 # flips += 1 continue_loop: addi t1, t1, -1 # i-- j loop end_loop: mv a0, t0 # restore return value # restore t0-t3 registers lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) lw t3, 12(sp) addi sp, sp, 16 ret # print result function print_result: bne a0, a1, wrong_case # if a0 != a1, goto wrong_case la a0, correct_msg li a7, 4 ecall j end_print wrong_case: la a0, wrong_msg li a7, 4 ecall end_print: la a0, newline li a7, 4 ecall ret ``` #### Second version ``` .data examples: .word 1, 2, 3 .word 51041, 65280, 716177407 .word 143165576, 715827882, 1 expected_res: .word 0 .word 14 .word 23 correct_msg: .asciz "Correct" wrong_msg: .asciz "Wrong" newline: .asciz ".\n" .text .global main main: la t0, examples # t0 = address of examples la t1, expected_res # t1 = address of expected_res li t2, 3 # t2 = number of examples li t3, 0 # t3 = for loop counter loop_example: beq t2, t3, end_main # if counter == 3, goto end_main # load the current example's value lw a0, 0(t0) lw a1, 4(t0) lw a2, 8(t0) jal ra, min_flips # call min_flips lw a1, 0(t1) # load the current expected result jal ra, print_result # call print_result addi t0, t0, 12 # move to the next example addi t1, t1, 4 # move to the next expected result addi t3, t3, 1 # increment loop counter j loop_example end_main: # exit program li a7, 10 # syscall exit ecall # my_clz function my_clz: li t0, 0 # counter = 0 li t1, 31 # i counter li t2, 1 my_clz_loop: bltz t1, end_my_clz # if t1 < 0, goto end_my_clz sll t3, t2, t1 # t3 = 1 << i and t4, a0, t3 # t4 = x & (1 << i) bnez t4, end_my_clz # if t4 == 1, goto end_my_clz addi t0, t0, 1 # count++ addi t1, t1, -1 # i-- j my_clz_loop end_my_clz: mv a0, t0 # return the count ret # min flips function min_flips: # preserve t0-t3 registers addi sp, sp, -16 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) sw t3, 12(sp) # push the input parameters into the stack addi sp, sp, -16 # allocate space on the stack sw ra, 12(sp) sw a0, 0(sp) sw a1, 4(sp) sw a2, 8(sp) or a1, a1, a2 # b OR c or a0, a0, a1 # a OR b OR c jal ra, my_clz # compute the leading zero li t0, 0 # flips = 0 li t1, 31 sub t1, t1, a0 # i = 31 - number of leading zero # restore the input parameters from the stack lw a0, 0(sp) lw a1, 4(sp) lw a2, 8(sp) loop: bltz t1, end_loop # if i < 0, goto end_loop srl t2, a0, t1 # bitA = (a >> i) & 1 andi t2, t2, 1 srl t3, a1, t1 # bitB = (b >> i) & 1 andi t3, t3, 1 srl t4, a2, t1 # bitC = (c >> i) & 1 andi t4, t4, 1 beqz t4, add_flip # if bitC == 0, goto add_flip beqz t2, check_b # if bitA == 0, goto check_b j continue_loop add_flip: add t0, t0, t2 # flips += bitA add t0, t0, t3 # flips += bitB j continue_loop check_b: bnez t3, continue_loop # if bitB != 0, goto continue_loop addi t0, t0, 1 # flips += 1 continue_loop: addi t1, t1, -1 # i-- j loop end_loop: mv a0, t0 # restore return value lw ra, 12(sp) addi sp, sp, 16 # restore t0-t3 registers lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) lw t3, 12(sp) addi sp, sp, 16 ret # print result function print_result: bne a0, a1, wrong_case # if a0 != a1, goto wrong_case la a0, correct_msg li a7, 4 ecall j end_print wrong_case: la a0, wrong_msg li a7, 4 ecall end_print: la a0, newline li a7, 4 ecall ret ``` :::danger Use fewer instructions. ::: ## Analysis ### Execution info. | | Cycles | instrs. retired | CPI | IPC | | -------- | -------- | -------- | -------- | -------- | | minFlips without `clz` | 1945 | 1355 | 1.44 | 0.697 | | minFlips with `clz` | 1692 | 1220 | 1.39 | 0.721 | Through the improvement of counting leading zeros (clz), it can significantly accelerate test case with a large number of leading zeros. By comparing the first test case, which `a = 1`, `b = 2` and `c = 3`, we can evaluate how many clock cycles can be reduced before and after using the optimized counting leading zero method. | | Cycles | | -------- | -------- | | without `clz` | 676 | | with `clz` | 405 | ### Single-cycle processor ![image](https://hackmd.io/_uploads/r1F7eYDy1l.png) A single-cycle processor completes instruction execution, including IF, ID, EX, MEM, and WB, in one clock cycle. Every instruction, regardless of complexity, is executed within this cycle, requiring N cycles for N instructions, fewer than a 5-stage processor. However, the clock cycle time must accommodate the slowest instruction, like memory access, which extends the cycle and slows down the processor despite using fewer cycles. | | Cycles | instrs. retired | CPI | IPC | | -------- | -------- | -------- | -------- | -------- | | minFlips with `clz` | 1220 | 1220 | 1 | 1 | The output from Ripes shows that the number of cycles is equal to the instructions retired, indicating that each instruction is completed in a single cycle. ### 5-stage processor ![image](https://hackmd.io/_uploads/H1cjLtvk1x.png) A pipelined processor divides instruction execution into stages, IF, ID, EX, MEM, and WB, with each stage taking one clock cycle. This allows different instructions to be processed simultaneously at various stages, a process known as pipelining. Although a 5-stage pipeline processor may require more total clock cycles than a single-cycle processor due to potential stalls from data or control hazards, its clock cycle time is shorter. This is because each stage only performs part of the work, minimizing the wait for the longest operation. Consequently, despite using more clock cycles, the shorter cycle time and instruction overlap enhance throughput and performance. ### Pipeline Using the second instruction in `my_clz_loop`, which is `sll t3, t2, t1`, allows for an explanation of the pipeline #### IF, Instruction Fetch The image illustrates the pipeline for the `sll x28, x7, x6` instruction, showing the flow from the Program Counter (PC) holding the instruction address `0x00000060` to the Instruction Memory, which fetches the instruction `0x00639e33`. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/BJt1cYD1Jg.png" height=500px> </div> #### ID, Instruction Decode The image illustrates the decode stage for the `sll x28, x7, x6` instruction, showing the decoding of the instruction `0x00639e33` and its opcode, along with the relevant register indices for `x7`, `x6`, and `x28`. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/r17IjKPyyg.png" height=500px> </div> #### EX, Exection The image illustrates the execution stage of the `sll x28 x7 x6` instruction, where the ALU performs a left shift on the value in register `x7` by the number of bits specified in `x6`, and the result is stored in `x28`. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/BJHthtDk1g.png" height=500px> </div> #### MEM, Memory Access The image illustrates the memory access stage of the `sll x28 x7 x6` instruction, the data memory block handles data read and write operations, with the write enable signal active. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/BJNMk5vJJl.png" height=500px> </div> #### WB, Write Back The image illustrates the write stage of the `sll x28 x7 x6` instruction, the shifted result from the ALU is written to register x28. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/ByDT-qDkJx.png" height=500px> </div> ## Reference * [Quiz1 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz1-sol) * [1318. Minimum Flips to Make a OR b Equal to c](https://leetcode.com/problems/minimum-flips-to-make-a-or-b-equal-to-c/description/) * [RISC-V Instruction Set Specifications](https://msyksphinz-self.github.io/riscv-isadoc/html/index.html)