# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by <[劉哲維](https://github.com/uuuwei0504)> ## Quiz1 Problem C my_clz function: The concept of the my_clz function is usually related to the calculation of "leading zeros" in bitwise operations, which is very common in low-level programming, numerical processing, and performance optimization. my_clz in [Quiz1 Probelm C](https://hackmd.io/@sysprog/arch2024-quiz1-sol) code: ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` Function: Counts the number of leading zeros in a 32-bit unsigned integer x. Logic: * 1. Start checking from the most significant bit (bit 31). * 2. If a 1 is encountered, stop the check and return the count of leading zeros. * 3. If it's 0, increment the counter count and continue checking the next bit. Return Value: The number of leading zeros. ### Next is the assembly code ``` .data n_value: .word 0x08000000 # 測試數據 result: .word 0 .text .global _start _start: la t0,n_value lw t1,0(t0) #x load 0x08000000 li t2,31 #int i=31 li t3,0 #int count=0 clz_loop: li t4,1 # Store 1U sll t4,t4,t2 # Store the result of 1U << i in t4 and t5,t1,t4 # Store x AND t4 in t5 beq t5 ,x0,continue # If t5 is equal to 0, continue the clz process (i.e., i-- and count++) j break # If t5 is not equal to 0, found a 1 at this bit, if condition is true continue: addi t3,t3,1 addi t2,t2,-1 bge t2,zero,clz_loop break: mv t5,t3 ``` ### fp16 to fp32 function ```c static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0 const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` **FP16 (Half-Precision) and FP32 (Single-Precision)** are different precision representations of floating-point numbers in the IEEE 754 standard. FP16 uses 16 bits to represent a floating-point number, while FP32 uses 32 bits. In order to represent FP16 in the higher precision FP32, we need to appropriately extend and rearrange the individual bits of FP16 (sign bit, exponent bits, and mantissa bits) according to the IEEE standard. ![image](https://hackmd.io/_uploads/HkgHvN1Y1Jl.png) ### Analysis * 1. Sign Bit: The sign bit occupies the 15th bit in FP16 and the 31st bit in FP32. Since it is always the first bit in both formats, it is straightforward to handle. The program only needs a temporary variable to store the sign bit, as the rules for handling it are the same in both formats. The program then directly moves the sign bit to the correct position in the 32-bit format. * 2. Exponent Bits: The exponent bits in FP16 occupy bits 10 to 14, while in FP32 they occupy bits 23 to 30. The program extends the exponent from FP16 to fit into the FP32 range by appropriately shifting the bits and adjusting for the bias difference (the bias for FP16 is 15, while for FP32 it is 127). * 3. Mantissa Bits: The mantissa in FP16 occupies the lower 10 bits, while in FP32 it occupies the lower 23 bits. Therefore, the program shifts the mantissa bits from FP16 to the correct position in FP32. * 4.Special Case Handling: 1. NaN and Infinity: If the exponent of the input value is equal to 15 (the maximum exponent in FP16), these cases require special handling. The program converts them to their corresponding representations in FP32. 2. Zero: When FP16 represents 0 or -0, the program ensures that the result remains as zero in FP32. ### Conclusion This type of conversion is highly useful in applications such as graphics processing, machine learning, and any scenario that requires efficient floating-point computations, as FP16 precision is sufficient in certain cases and saves memory space. ### fp16_fp32 assembly code: ``` .data n_value: .word 0x08000000 # 測試數據 result: .word 0 .text .global _start _start: la t0, n_value lw t1, 0(t0) # Load the 16-bit half-precision number into t1 slli t2, t1, 16 # Shift left by 16 bits: w = (uint32_t) h << 16 # Extract sign bit li t0, 0x80000000 and t3, t2, t0 # sign = w & 0x80000000 # Extract nonsign part (exponent and mantissa) li t0, 0x7FFFFFFF and t4, t2, t0 # nonsign = w & 0x7FFFFFFF # Count leading zeros for renormalization shift li a5, 0 # renorm_shift counter li a2, 31 # Initialize bit position for my_clz loop my_clz: li a1, 1 sll a3, a1, a2 # Generate mask: 1U << i and a4, t4, a3 # Test if bit i is set bnez a4, shift # Break if found first 1 bit addi a5, a5, 1 # Increment leading zero count addi a2, a2, -1 # Move to the next bit bgez a2, my_clz # Repeat until bit 0 is reached # Shift adjustment for renormalized numbers shift: addi a5, a5, -5 # Subtract 5 to adjust for mantissa position bge a5, zero_shift, mask_correction # If shift is positive, skip zeroing addi a5, x0, 0 # If shift < 5, set shift to 0 mask_correction: li t0, 0x04000000 add t5, t4, t0 # Add for NaN/Infinity check srai t5, t5, 8 li t0, 0x7F800000 and t5, t5, t0 # inf_nan_mask creation zero_shift: addi t6, t4, -1 # zero_mask creation srai t6, t6, 31 # Final result assembly exit: sll t4, t4, a5 # Shift mantissa based on renormalization srli t4, t4, 3 # Align exponent and mantissa to FP32 layout li t0, 0x70 # FP32 exponent bias adjustment sub t0, t0, a5 slli t0, t0, 23 # Adjust exponent to FP32 position add t4, t4, t0 # Add adjusted exponent to result or t4, t4, t5 # Handle NaN/Infinity not t6, t6 and t4, t4, t6 # Apply zero mask or t4, t4, t3 # Add sign bit to final result mv a0, t4 # Move result to a0 for output # Exit system call li a7, 10 ecall ``` --- ## Optimize the problem of LeetCode 190: Reverse Bits #### Reverse bits of a given 32 bits unsigned integer. #### Problem overview: I am using the my_clz function (Quiz1 problem C) to optimize LeetCode 190: Reverse Bits. Example 1: Input: n = 00000010100101000001111010011100 Output: 964176192 (00111001011110000010100101000000) Example 2: Input: n = 11111111111111111111111111111101 Output: 3221225471 (10111111111111111111111111111111) ---- ## Solution We need to: implement a my_clz function and revert to the original first version of the C code, which doesn't use the my_clz function, to observe how much memory usage is reduced. ---- ## C program ### First version(without using the clz_function) Problem-solving approach: 1.Goal: The objective of this function is to reverse the bits of a given 32-bit unsigned integer. To achieve this, the function processes the input integer bit by bit and reverses the order of its bits. 2.Initialize the result: We start by initializing a variable result to 0, which will store the reversed bits as we process the input integer. 3.Iterate through all 32 bits: Since the input is a 32-bit unsigned integer, we need to iterate exactly 32 times (using a for loop) to process each bit. 4.Shift the result to the left: During each iteration, we shift the bits in the result variable to the left by 1 position to make space for the next bit we extract from n. 5.Extract and append the least significant bit of n: We use the bitwise AND operation (n & 1) to extract the least significant bit (LSB) of n. This bit is then appended to result using the bitwise OR operation (result |=). 6.Shift the input n to the right: After extracting the LSB, we shift the bits of n to the right by 1 position (n >>= 1) to process the next bit during the next iteration. 7.Return the result: After result, which is ```c uint32_t reverseBits(uint32_t n) { uint32_t result=0; for(int i=0;i<32;i++){ result<<=1; result |= (n & 1); n>>=1; } return result; } ``` ### Second version in previous problem, if the input value is not large, there may be more leading zeros, causing the loop to run unnecessarily many times. Therefore, I use the my_clz function to calculate the leading zeros, reducing the number of excessive and unnecessary iterations. ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } uint32_t reverseBits(uint32_t n) { uint32_t result = 0; int ret = my_clz(n); // n >>= ret; // for (int i = 0; i < (32-ret); i++) { result <<= 1; result |= (n & 1); n >>= 1; } return result; } ``` ### main() ```c int main() { uint32_t n1 = 43261596; // binary: 00000010100101000001111010011100 uint32_t result1 = reverseBits(n1); printf("Input: %u, Reversed: %u\n", n1, result1); uint32_t n2 = 4294967293; // binary: 11111111111111111111111111111101 uint32_t result2 = reverseBits(n2); printf("Input: %u, Reversed: %u\n", n2, result2); return 0; } ``` --- ## assembly code: ### c code for function clz: ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` ### test assembly code for function clz: We use 0x08000000 as an example, and after using the clz function, we expect an output of 4. ``` .data n_value: .word 0x08000000 # 測試數據 result: .word 0 .text .global _start _start: la t0,n_value lw t1,0(t0) #x load 0x08000000 li t2,31 #int i=31 li t3,0 #int count=0 clz_loop: li t4,1 # Store 1U sll t4,t4,t2 # Store the result of 1U << i in t4 and t5,t1,t4 # Store x AND t4 in t5 beq t5 ,x0,continue # If t5 is equal to 0, continue the clz process (i.e., i-- and count++) j break # If t5 is not equal to 0, found a 1 at this bit, if condition is true continue: addi t3,t3,1 addi t2,t2,-1 bge t2,zero,clz_loop break: mv t5,t3 ``` ### Ripes Compilation Results ![image](https://hackmd.io/_uploads/BJ6C8h7kke.png) We expect the answer to be 4, and t5 will also store 4 Another version to check (0x8000) ![image](https://hackmd.io/_uploads/S1iRD2Xyye.png) --- ### LeetCode: 190. Reverse Bits in RISC-V using the clz function ``` .data n_value: .word 0x08000000 # 測試數據 result: .word 0 .text .global _start _start: la t0,n_value lw t1,0(t0) #x load 0x08000000 li t2,31 #int i=31 li t3,0 #int count=0 clz_loop: li t4,1 # Store 1U sll t4,t4,t2 # Store the result of 1U << i in t4 and t5,t1,t4 # Store x AND t4 in t5 beq t5 ,x0,continue # If t5 is equal to 0, continue the clz process (i.e., i-- and count++) j break # If t5 is not equal to 0, found a 1 at this bit, if condition is true continue: addi t3,t3,1 addi t2,t2,-1 bge t2,zero,clz_loop break: mv t5,t3 #----------------------------reverbits function------ reverbits: lw t0,0(t0) t0=n_value li a3,32 sub t1,a3,t5//t1 effectibe_bits li,t2,0 #t2=j li t3,0 #t3=result li a0,0 #i=0 for_loop: slt a1, a0, t1 # Compare a0 with t1 (t1 is effective_bits), if i < effective_bits, a1 will be set to 1 beq a1, zero, break_for_loop # If a1 is 1, continue the loop; the condition i < effective_bits is correct slli t3, t3, 1 # t3 is result, shift t3 left by 1 andi a3, t0, 1 # a3 is a temporary register to store n & 1 or t3, t3, a3 # t3 stores the result of t3 OR (n & 1) srli t0, t0, 1 # n is shifted right by 1 addi a0, a0, 1 # i++ j for_loop # Continue the loop break_for_loop: sll t3, t3, t5 # result << the result of clz mv a5, t3 # a5 is the final reversed value ``` ### Ripes Compilation Results We input 0x08000000, and after reversing, the expected result is 0x10. The a5 register also outputs the same value ![image](https://hackmd.io/_uploads/rkGlrTQJJl.png) Test with other values, e.g., 0xFFFFFFFD, the expected result is 0xBFFFFFFF ![image](https://hackmd.io/_uploads/By9gLTXk1g.png) ### Result: using clz function ![image](https://hackmd.io/_uploads/HyuXQltyke.png) without using clz : ![image](https://hackmd.io/_uploads/rktGzlFy1g.png) | A Comparison of Non-Loop Unrolling and Using clz Function Implementations | Non-Loop Unrolling Assembly | Using clz Function Assembly Code | |----------------------------------------------------------------------------|----------------------------|----------------------------------| | **Cycles** | 380 | 338 | | **Instrs. retired** | 246 | 268 | | **CPI** | 1.53 | 1.26 | | **IPC** | 0.647 | 0.793 | | **Clock rate** | 0 Hz | 0 Hz | ### Conclusion The implementation with the clz function performs better, mainly in the following aspects: * It requires fewer total cycles (338 cycles vs. 380 cycles). * The CPI is lower, indicating that each instruction is executed faster. * The IPC is higher, meaning that more instructions are executed per cycle. # Pipeline 5 Stage Explanation ![image](https://hackmd.io/_uploads/Sk7SmH5J1e.png) ## Overview The 5-stage pipeline is an optimization technique used in modern processors. By dividing the execution of each instruction into five steps, it enables parallel processing of instructions. This allows the processor to handle multiple instructions simultaneously in different stages of execution, significantly improving performance. The five stages are: 1. **Instruction Fetch (IF)** 2. **Instruction Decode (ID)** 3. **Execution (EX)** 4. **Memory Access (MEM)** 5. **Write Back (WB)** ## Detailed Explanation of the 5 Stages ### 1. Instruction Fetch (IF) In the Instruction Fetch (IF) stage, the processor retrieves the next instruction from memory based on the value of the Program Counter (PC) and passes it to the Instruction Decoder for processing. The PC is updated during each clock cycle to ensure continuous fetching of instructions. ### 2. Instruction Decode (ID) The Instruction Decode (ID) stage is responsible for decoding the fetched instruction, parsing the opcode and operands, and reading the necessary data from the register file. In this stage, any immediate values (Immediates) are also extracted for use in the subsequent Execution stage. ### 3. Execution (EX) The Execution (EX) stage uses the Arithmetic Logic Unit (ALU) to perform the required operations. Depending on the type of instruction, the ALU carries out arithmetic (e.g., addition, subtraction) or logical operations (e.g., shifting, comparison) and produces either a result or a memory access address. ### 4. Memory Access (MEM) In the Memory Access (MEM) stage, the processor reads or writes data to memory depending on the instruction type. For Load instructions, data is retrieved from memory and passed to the next stage. For Store instructions, data is written to memory. ### 5. Write Back (WB) The Write Back (WB) stage stores the result of the ALU operation or the data fetched from memory back into the register file. This is the final stage of instruction execution, after which the processor is ready to execute the next instruction. ## Registers Registers are small, fast memory units within the processor used to store data and intermediate results. In the 5-stage pipeline, the register file can be accessed by multiple stages simultaneously. Common registers include: - **x0 to x31**: General-purpose registers, where `x0` is a special register that is always zero. - **Special-purpose registers**: These include the Program Counter (PC), Status Registers, and others. ## Instruction Memory The Instruction Memory holds the machine code instructions that the processor will execute. In each clock cycle, the Program Counter (PC) fetches an instruction from memory and sends it to the decoder for execution. For example, at address `0x0`, there could be an instruction like `auipc x5, 0x0`. ## Data Flow and Dependencies In a 5-stage pipeline, Data Hazards are a common issue. These occur when an instruction depends on data from a previous instruction that hasn't yet been written back. To address this, the processor employs a technique called Forwarding. Forwarding allows the processor to directly pass the result from the ALU or memory stage to subsequent instructions without waiting for the Write Back stage to complete. ## Example Instruction Flow Consider the instruction `add x5, x0, x1`, which adds the values from registers `x0` and `x1`, and stores the result in `x5`: 1. **IF Stage**: The instruction `add x5, x0, x1` is fetched from instruction memory. ![image](https://hackmd.io/_uploads/Sk0lorqy1l.png) 3. **ID Stage**: The instruction is decoded, and the values of `x0` and `x1` are read. ![image](https://hackmd.io/_uploads/BkkUoB5Jyl.png) 5. **EX Stage**: The ALU adds the values of `x0` and `x1`. ![image](https://hackmd.io/_uploads/H14bhH9k1l.png) 7. **MEM Stage**: Since this is an arithmetic operation, no memory access is required. ![image](https://hackmd.io/_uploads/HkYihScy1e.png) 9. **WB Stage**: The result is written back to register `x5`. ![image](https://hackmd.io/_uploads/S1m4pr5J1g.png)