Assignment1: RISC-V Assembly and Instruction Pipeline

Assignment1: RISC-V Assembly and Instruction Pipeline === contributed by < [李皓翔]() > ## Problem C in [Quiz1] ### fabsf fabsf calculates the absolute value of a floating-point number using bit manipulation.The mask 0x7FFFFFFF has all bits set to 1 except the most significant bit (the sign bit), which is set to 0.This operation clears the sign bit of the float, taking the absolute value. #### c code ```c static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` #### assembly code ```s .data input_num: .word 0x007f .text .globl _start _start: la t0, input_num lw t0, 0(t0) # store x in t0 li t1, 0x7FFFFFFF and t0, t0, t1 li a7,1 ecall ``` ### clz The __builtin_clz function is a GCC built-in function that counts the number of leading zero bits in the binary representation of an unsigned integer, starting from the most significant bit (MSB). It returns an integer representing how many zero bits precede the first 1 in the binary representation of the number. There also provided function called my_clz to achieve the result of __builtin_clz. Below is the original code. #### c code ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` This function use a for loop iterates from 31 down to 0. The index i represents the bit position in the integer: * Bit position 31 is the most significant bit (MSB). * Bit position 0 is the least significant bit (LSB). Inside the loop, the expression `x & (1U << i)` checks if the i-th bit of x is equals 1: * `1U << i` creates a bitmask with a 1 at the i-th position and 0s elsewhere. * The bitwise AND (&) operation with x determines if that bit in x is 1. If the i-th bit is 1, the loop breaks, meaning no more leading zeros are present.If the i-th bit is 0, the count is incremented. #### assambly code ```s .data input_num: .word 0x00000001 .text .globl _start _start: la t0, input_num # t0存放input_num lw t0, 0(t0) li a1,1 li a2,31 li a5,0 clz_loop: sll a3,a1,a2 #a3 1U left shift value and a4,t1,a3 #a4 compare MSB to LSB bnez a4 ,exit addi a5,a5,1 addi a2,a2,-1 bgez a2,clz_loop mv a0,t5 li a7,10 # a7 = 10, syscall for exit ecall ``` According to the last year quiz 1, we can achieve the same effect using different bitmasks to replace the for loop structure. Below is the improved code. ```c uint16_t count_leading_zeros(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555 ); x = ((x >> 2) & 0x33333333) + (x & 0x33333333 ); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } ``` ### fp16 to fp32 fp16_to_fp32, converts a 16-bit floating-point value (also known as fp16 or half-precision) into a 32-bit floating-point value (fp32 or single-precision). #### fp32 ![image](https://hackmd.io/_uploads/rknQrcLkyg.png) * **Sign:** The leftmost 1 bit represents the sign. If it's a positive number, the sign is 0; otherwise, it's 1. * **Exponent:** The middle 8 bits represent the exponent after normalization, using the excess-127 format. For example, 3 is added to 127, resulting in 130. * **Fraction:** The rightmost 23 bits contain the fractional part. For a number like 1.0001, you remove the "1." and store "0001" as the fractional part. The reason why exponent part added 127 bias is adding 127 effectively shifts all numbers to positive values, making comparison easier. Without the +127 bias, exponents would include both positive and negative values, leading to signed bits. When comparing exponents using 8 bits as unsigned integers, negative values (which have a leading 1) would be considered larger than positive values. We want floating-point numbers to be compared directly as unsigned integers, so the range of exponents originally from -126 to +127 becomes 1 (the smallest normalized value) to 254 (the largest normalized value). When computing fp32 neeed to consider 3 condition: 1. **Normalized Mode** express as 1.fraction *2^exponent^ 1. **Denormalized Mode** express as 0.fraction *2^-126^ 1. **NaN and INF Mode** By using denormalized (subnormal) floating-point numbers, the distance between the smallest normalized number and 0 can be made smoother, achieving gradual underflow. During the process of approaching 0, we can use denormalized floating-point numbers to store even smaller values. | floating point number | sign | exponent | fraction | value| | -------- | -------- | -------- | --------|--------| | second smallest normalized number | 0 | 00000001 | 00000000000000000000001| 2^-126^+2^-149^ | smallest normalized number | 0 | 00000001 | 00000000000000000000000| 2^-126^ | biggest denormalized number | 0 | 00000000 | 11111111111111111111111| 2^-126^-2^-149^ | smallest denormalized number | 0 | 00000000 | 00000000000000000000001| 2^-149^ The difference between the smallest and the second smallest normalized number is 2^-149^. The smallest positive normalized number would differ from 0 by 2^-126^, creating a sudden gap of 2^23^ between them. Denormalized numbers fill this gap, allowing for much smaller values to be represented and providing a smoother transition towards zero, which avoids the sudden underflow that would otherwise occur. #### fp16 ![image](https://hackmd.io/_uploads/ByunziLyJx.png) * **Sign:** The leftmost 1 bit represents the sign. If it's a positive number, the sign is 0; otherwise, it's 1. * **Exponent:** The next 5 bits represent the exponent after normalization, using the excess-15 format. For example, if the exponent is 3, it is added to 15, resulting in 18. * **Fraction:** The remaining 10 bits contain the fractional part. For a number like 1.0001, you remove the "1." and store "0001" as the fractional part. Compare with fp32 ,fp16 Uses 16 bits to store a floating-point number. It can represent a smaller range of numbers and with less precision compared to fp32. fp16 also requires half the memory of fp32, making it more efficient for memory storage and bandwidth. Using fp16 can lead to faster computations because less data is being processed and transferred, which can speed up both training and inference for deep learning models. #### C code ```c static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) +((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` #### assembly code ```s .data input_num: .word 0x007f .text .globl _start _start: la t0, input_num lw t1, 0(t0) li a1,1 li a2,31 li a5,0 mv t2,t1 slli t2,t2,16 # w = (uint32_t) h << 16; li t0,0x80000000 and t3,t2,t0 #sign = w & UINT32_C(0x80000000); li t0,0x7FFFFFFF and t4,t2,t0 #nonsign = w & UINT32_C(0x7FFFFFFF); my_clz: sll a3,a1,a2 and a4,t4,a3 bnez a4 ,shift addi a5,a5,1 addi a2,a2,-1 bgez a2,my_clz shift: addi a5,a5,-5 bge a5,a1,mask add a5,x0,x0 mask: li t0,0x04000000 add t5,t4,t0 srai t5,t5,8 li t0,0x7f800000 and t5,t5,t0 #if_nan_mask zero_mask: addi t6,t4,-1 srai t6,t6,31 exit: sll t4,t4,a5 srli t4,t4,3 addi t0,x0,0x70 sub t0,t0,a5 slli t0,t0,23 add t4,t4,t0 or t4,t4,t5 not t6,t6 and t4,t4,t6 or t4,t4,t3 mv a0,t4 li a7,10 ecall ``` ## 397. Integer Replacement Given a positive integer n, you can apply one of the following operations: If n is even, replace n with n / 2. If n is odd, replace n with either n + 1 or n - 1. Return the minimum number of operations needed for n to become 1. Example 1: >Input: n = 8 Output: 3 Explanation: 8 -> 4 -> 2 -> 1 Example 2: >Input: n = 7 Output: 4 Explanation: 7 -> 8 -> 4 -> 2 -> 1 or 7 -> 6 -> 3 -> 2 -> 1 Example 3: Input: n = 4 Output: 2 **Constraints:** 1 <= n <= 2^31^ - 1 ### c code version1 In this program, I will convert the clz function from problem C into cto and ctz functions, which calculate the number of consecutive zeros or ones starting from the least significant bit. This will help determine the number of divisions needed. ```c static inline int my_ctz(uint32_t x) { int count = 0; for (int i = 0; i < 32; ++i) { if ((x & (1U << i))) break; count++; } return count; } static inline int my_cto(uint32_t x) { int count = 0; for (int i = 0; i < 32; ++i) { if (!(x & (1U << i))) break; count++; } return count; } int integerReplacement(int n) { uint32_t x =*(uint32_t *)&n; int count = 0; while(x != 1){ if (x & 1){ int count_tailing_one = my_cto(x); if (count_tailing_one >1 && x > 3 ){ x++; x >>= count_tailing_one; count = count + count_tailing_one + 1; }else if(count_tailing_one == 32){ return 33; } else{ x--; x >> = 1; count = count + 2; } }else{ int count_tailing_zero = my_ctz(x); x >>= count_tailing_zero; count = count + count_tailing_zero; } } return count; } int main() { int n = 16773120; int n1 = 2; int n2 = 2130708480; int result = integerReplacement(n); int result1 = integerReplacement(n1); int result2 = integerReplacement(n2); printf("The number of steps to reduce %d to 1 is: %d\n", n, result); printf("The number of steps to reduce %d to 1 is: %d\n", n, result1); printf("The number of steps to reduce %d to 1 is: %d\n", n, result2); return 0; } ``` ### c code version2 In the following program, I referenced last year's clz implementation and replaced the for loop with a masking method. ```c #include <stdio.h> #include <stdint.h> uint16_t count_trailing_ones(uint32_t x) { x = ~x; x |= (x << 1); x |= (x << 2); x |= (x << 4); x |= (x << 8); x |= (x << 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } uint16_t count_trailing_zeros(uint32_t x) { x |= (x << 1); x |= (x << 2); x |= (x << 4); x |= (x << 8); x |= (x << 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } int integerReplacement(int n) { uint32_t x = *(uint32_t *)&n; int count = 0; while (x != 1) { if (x & 1) { int count_trailing_one = count_trailing_ones(x); if (count_trailing_one > 1 && x > 3) { x++; x >>= count_trailing_one; count = count + count_trailing_one + 1; } else if (count_trailing_one == 32) { return 33; } else { x--; x >>= 1; count = count + 2; } } else { int count_trailing_zero = count_trailing_zeros(x); x >>= count_trailing_zero; count = count + count_trailing_zero; } } return count; } int main() { int n = 16773120; int n1 = 2; int n2 = 2130708480; int result = integerReplacement(n); int result1 = integerReplacement(n1); int result2 = integerReplacement(n2); printf("The number of steps to reduce %d to 1 is: %d\n", n, result); printf("The number of steps to reduce %d to 1 is: %d\n", n, result1); printf("The number of steps to reduce %d to 1 is: %d\n", n, result2); return 0; } ``` ### assembly code version1 ```s .data input_num1: .word 0x00fff000 # First input input_num2: .word 0x00000002 # Second input input_num3: .word 0x7f000800 # Third input answer: .word 0x19,0x1,0x21 newline: .asciz "\n" # Newline character for printing str1: .string " answer is " # First part of the output message str2: .string " .Actual answer is " # Second part of the output message str3: .string "Input number is " .text .globl _start _start: # Handle first input add t6,x0,x0 la t0, input_num1 # Load address of first input lw t1, 0(t0) # Load the first input into t1 mv t3,t1 jal call_process # Call the process loop jal call_print # Print the result addi t6,x0,4 # Handle second input la t0, input_num2 # Load address of second input lw t1, 0(t0) mv t3,t1 # Load the second input into t1 jal call_process # Call the process loop jal call_print # Print the result addi t6,t6,4 # Handle third input la t0, input_num3 # Load address of third input lw t1, 0(t0) mv t3,t1 # Load the third input into t1 jal call_process # Call the process loop jal call_print # Print the result li a7, 10 # Syscall for exit ecall # Process loop call_process: addi a1, x0, 1 # Initialize a1 to 1 addi a6, x0, 31 # Initialize a6 to 31 add t0, x0, x0 # Initialize result t5 to 0 while_loop: li a2,0 #for i value li a5,0 #count number of ctz or cto beq t1,a1,exit andi t2,t1,1 bnez t2,cto_loop j ctz_loop ctz_loop: sll a3,a1,a2 #a3 1U left shift value and a4,t1,a3 #a4 compare MSB to LSB bnez a4 ,else_case addi a5,a5,1 addi a2,a2,1 blt a2,a6,ctz_loop else_case: srl t1,t1,a5 add t0,t0,a5 j while_loop cto_loop: sll a3,a1,a2 #a3 1U left shift value and a4,t1,a3 #a4 compare MSB to LSB beqz a4 ,if_case addi a5,a5,1 addi a2,a2,1 blt a2,a6,cto_loop if_case: blt a1,a5,if_case2 else_if_case: beq a5,a6,exit_special else_case2: addi t1,t1,-1 srli t1,t1,1 addi t0,t0,2 j while_loop if_case2: addi t4,x0,3 blt t4,t1,count_mul_1 j else_if_case count_mul_1: addi t1,t1 1 srl t1,t1,a5 add t0,t0,a5 addi t0,t0,1 j while_loop exit_special: addi a5,x0,33 ret exit: add a5,x0,t0 # Return from call_process ret # Print function call_print: la a0,str3 li a7,4 ecall mv a0, t3 # Move result to a0 for printing li a7, 1 # Syscall for print integer ecall la a0,str1 li a7,4 ecall mv a0, a5 # Move result to a0 for printing li a7, 1 # Syscall for print integer ecall la a0,str2 li a7,4 ecall la t0,answer add t0,t0,t6 lw a0,0(t0) li a7,1 ecall la a0, newline # Load address of newline character li a7, 4 # Syscall for print string ecall ret # Return from call_print ``` ![image](https://hackmd.io/_uploads/r1ucoku1Je.png) ### assembly code version2 ```s .data input_num1: .word 0x00fff000 # First input input_num2: .word 0x00000002 # Second input input_num3: .word 0x7f000800 # Third input answer: .word 0x19,0x1,0x21 newline: .asciz "\n" # Newline character for printing str1: .string " answer is " # First part of the output message str2: .string " .Actual answer is " # Second part of the output message str3: .string "Input number is " .text .globl _start _start: # Handle first input add t6,x0,x0 la t0, input_num1 # Load address of first input lw t1, 0(t0) # Load the first input into t1 mv a2,t1 jal call_process # Call the process loop jal call_print # Print the result addi t6,x0,4 # Handle second input la t0, input_num2 # Load address of second input lw t1, 0(t0) mv a2,t1 # Load the second input into t1 jal call_process # Call the process loop jal call_print # Print the result addi t6,t6,4 # Handle third input la t0, input_num3 # Load address of third input lw t1, 0(t0) # Load the third input into t1 mv a2,t1 jal call_process # Call the process loop jal call_print # Print the result li a7, 10 # Syscall for exit ecall # Process loop call_process: addi a1, x0, 1 # Initialize a1 to 1 addi a6, x0, 31 # Initialize a6 to 31 add t5, x0, x0 # Initialize result t5 to 0 while_loop: li a5,0 #count number of ctz or cto li a4,0 #judge cto or ctz add t2,x0,t1 beq t1,a1,exit andi t3,t1,1 bnez t3,cto_loop j ctz_loop cto_loop: not t2, t1 addi a4,x0,1 ctz_loop: slli t0,t2,1 or t2,t0,t2 slli t0,t2,2 or t2,t0,t2 slli t0,t2,4 or t2,t0,t2 slli t0,t2,8 or t2,t0,t2 slli t0,t2,16 or t2,t0,t2 srli t0,t2,1 li t3,0x55555555 and t0,t0,t3 sub t2,t2,t0 srli t0,t2,2 li t3,0x33333333 and t0,t0,t3 and t2,t2,t3 add t2,t0,t2 srli t0,t2,4 add t2,t2,t0 li t3,0x0f0f0f0f and t2,t2,t3 srli t3,t2,8 add t2,t2,t3 srli t3,t2,16 add t2,t2,t3 andi t2,t2,0x3f addi t0,x0,32 sub a5,t0,t2 bnez a4,if_case else_case: srl t1,t1,a5 add t5,t5,a5 j while_loop if_case: blt a1,a5,if_case2 else_if_case: beq a5,a6,exit_special else_case2: addi t1,t1,-1 srli t1,t1,1 addi t5,t5,2 j while_loop if_case2: addi t4,x0,3 blt t4,t1,count_mul_1 j else_if_case count_mul_1: addi t1,t1 1 srl t1,t1,a5 add t5,t5,a5 addi t5,t5,1 j while_loop exit_special: addi a5,x0,33 ret # a7 = 10, syscall for exit exit: add a5,x0,t5 # Return from call_process ret # Print function call_print: la a0,str3 li a7,4 ecall mv a0, a2 # Move result to a0 for printing li a7, 1 # Syscall for print integer ecall la a0,str1 li a7,4 ecall mv a0, a5 # Move result to a0 for printing li a7, 1 # Syscall for print integer ecall la a0,str2 li a7,4 ecall la t0,answer add t0,t0,t6 lw a0,0(t0) li a7,1 ecall la a0, newline # Load address of newline character li a7, 4 # Syscall for print string ecall ret ``` ![image](https://hackmd.io/_uploads/SkuAnJuyke.png) ### Ripes Simulation Ripes provides various pipeline forms, and this time we are using a **Five-stage RISC-V Processor**. ![image](https://hackmd.io/_uploads/BkDHsQtyJx.png) The basic implementation of the RISC instruction set can be divided into five parts, which are: * **Instruction Fetch (IF):** Fetches the instruction from memory. * **Instruction Decode (ID):** Decodes the fetched instruction and reads the registers. * **Execute (EX):** Performs the operation specified by the instruction. * **Memory Access (MEM):** Accesses memory for load or store instructions. * **Write Back (WB):** Writes the result back to the register file. 1. **Instruction Fetch (IF):** ![image](https://hackmd.io/_uploads/SyXkRQtJkl.png) >+ **PC (Program Counter):** Stores the address of the next instruction to be executed. In each clock cycle, the PC automatically increments or jumps based on branch calculations. >+ **Multiplexer:** The first multiplexer determines the source for updating the PC, especially in cases where a branch operation is required, while the second multiplexer selects the offset values for the PC. >+ **Instruction Memory:** Retrieves the corresponding machine instruction from instruction memory based on the address provided by the PC. >+ **Compressed Decoder:** Handles the compressed instruction set. RISC-V supports compressed instructions, and this component is responsible for decoding those compressed instructions. >+ **IF/ID (Instruction Fetch/Instruction Decode Pipeline Register):** This is the first pipeline register, used to pass the fetched instruction to the next stage (decode stage) and retain the results from the previous stage. 2.**Instruction Decode (ID):** ![image](https://hackmd.io/_uploads/Byxtm4Y1Jl.png) >+ **Decode (Instruction Decode)：** This area is responsible for decoding the instruction, identifying the operands, opcode, and relevant control signals, and passing the operands to the subsequent execution stage. >+ **Registers (Register File)：** This component includes a register file that can read data from specified registers based on the instruction and write results back to the registers during the write-back stage. >+ **Imm (Immediate):** Handles the retrieval of immediate values. Some instructions include immediate values as operands, and this component is responsible for decoding and providing those immediate values from the instruction >+ **ID/EX (Instruction Decode/Execute Pipeline Register):** This is the second pipeline register, responsible for passing the decoded data to the execution stage and retaining the results from the previous stage. 3.**Execute (EX):** ![image](https://hackmd.io/_uploads/r19FEEtJJe.png) >+ **ALU (Arithmetic Logic Unit):** Performs arithmetic and logic operations. This is one of the core components of the processor, responsible for carrying out calculations based on the instruction (such as addition, subtraction, logical operations). >+ **Branch Unit:** Responsible for calculating branch conditions. This component determines whether to take a branch based on the ALU results and returns the result to the PC to update the instruction address. >+ **EX/MEM (Execute/Memory Pipeline Register):** This is the third pipeline register, responsible for passing the results from the execution stage to the memory stage and retaining the results from the previous stage. 4.**MEM (Memory Access):** >+ **Data Memory:** Responsible for accessing data memory. Memory operations, such as reading or writing data (e.g., load/store instructions), are performed during this stage. >+ **MEM/WB (Memory/Writeback Pipeline Register):** This is the final pipeline register, responsible for passing the results from the memory stage to the write-back stage. 5.**WB (Write Back):** >+ Writes the computed results or data retrieved from memory back to the register file, completing the final step of instruction execution. ![image](https://hackmd.io/_uploads/HyBorEKkyx.png) Inserting pipeline registers between each pipeline stage allows for the parallel execution of instructions and ensures that each stage can correctly store and pass data. Next, I will use the first line of assembly code as an example to explain the execution flow. ![image](https://hackmd.io/_uploads/rJRQxN9JJe.png) The first instruction of the assembly code is `add x31, x0, x0`, with the machine code being `0x00000fb3` and the address being `0x00000000`.First , the current PC value is `0x00000000`, indicating that the processor is set to execute the instruction at address `0x00000000`. The instruction fetched from instruction memory at this address is `0x00000fb3`. IF/ID register stores the instruction obtained from the instruction fetch stage and passes it to the next stage (the decode stage). The instruction 0x00000fb3 will be sent to the decode unit in the next clock cycle. ![image](https://hackmd.io/_uploads/S1D77V9kye.png) IF/ID register store and pass the instruction to the Instruction Decode stage .In Instrtuction Decode stage , deocde identifies this as an add instruction based on the opcode. It also distinguishes the registers for rs1, rs2, and rd. In this case, rs1 is x0, rs2 is x0, and rd is x31.The register is reading rs1 and rs2 values and passing them to the next execution stage. At the same time, the pipeline will fetch the next instruction in the IF stage. ![image](https://hackmd.io/_uploads/BJD7wVcJ1g.png) In Excute stage , The ALU (Arithmetic Logic Unit) receives the two operands from the register file, which are the values of x0 and x0. Since both values are 0, the ALU will perform the addition of 0 + 0. ![image](https://hackmd.io/_uploads/HJhqGB9kkl.png) This instruction does not operate on memory data; then, it will return the result to the register. ### Execution result #### assembly code v1 ![image](https://hackmd.io/_uploads/BywookuJJg.png) #### assembly code v2 ![image](https://hackmd.io/_uploads/ry8TnkuJyx.png) Compare to the assembly code v1 result, the v2 result has the better performance, v2 has fewer cycles per instruction retired and a lower CPI, indicating that its execution speed is faster. V2 can execute more instructions in the same amount of time, resulting in better performance.