# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <[劉哲維](https://github.com/uuuwei0504)>
## Quiz1 Problem C
my_clz function:
The concept of the my_clz function is usually related to the calculation of "leading zeros" in bitwise operations, which is very common in low-level programming, numerical processing, and performance optimization.
my_clz in [Quiz1 Probelm C](https://hackmd.io/@sysprog/arch2024-quiz1-sol) code:
```c
static inline int my_clz(uint32_t x)
{
int count = 0;
for (int i = 31; i >= 0; --i)
{
if (x & (1U << i))
break;
count++;
}
return count;
}
```
Function: Counts the number of leading zeros in a 32-bit unsigned integer x.
Logic:
* 1. Start checking from the most significant bit (bit 31).
* 2. If a 1 is encountered, stop the check and return the count of leading zeros.
* 3. If it's 0, increment the counter count and continue checking the next bit.
Return Value: The number of leading zeros.
### Next is the assembly code
```
.data
n_value: .word 0x08000000 # 測試數據
result: .word 0
.text
.global _start
_start:
la t0,n_value
lw t1,0(t0) #x load 0x08000000
li t2,31 #int i=31
li t3,0 #int count=0
clz_loop:
li t4,1 # Store 1U
sll t4,t4,t2 # Store the result of 1U << i in t4
and t5,t1,t4 # Store x AND t4 in t5
beq t5 ,x0,continue # If t5 is equal to 0, continue the clz process (i.e., i-- and count++)
j break # If t5 is not equal to 0, found a 1 at this bit, if condition is true
continue:
addi t3,t3,1
addi t2,t2,-1
bge t2,zero,clz_loop
break:
mv t5,t3
```
### fp16 to fp32 function
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
**FP16 (Half-Precision) and FP32 (Single-Precision)** are different precision representations of floating-point numbers in the IEEE 754 standard. FP16 uses 16 bits to represent a floating-point number, while FP32 uses 32 bits. In order to represent FP16 in the higher precision FP32, we need to appropriately extend and rearrange the individual bits of FP16 (sign bit, exponent bits, and mantissa bits) according to the IEEE standard.

### Analysis
* 1. Sign Bit: The sign bit occupies the 15th bit in FP16 and the 31st bit in FP32. Since it is always the first bit in both formats, it is straightforward to handle. The program only needs a temporary variable to store the sign bit, as the rules for handling it are the same in both formats. The program then directly moves the sign bit to the correct position in the 32-bit format.
* 2. Exponent Bits: The exponent bits in FP16 occupy bits 10 to 14, while in FP32 they occupy bits 23 to 30. The program extends the exponent from FP16 to fit into the FP32 range by appropriately shifting the bits and adjusting for the bias difference (the bias for FP16 is 15, while for FP32 it is 127).
* 3. Mantissa Bits: The mantissa in FP16 occupies the lower 10 bits, while in FP32 it occupies the lower 23 bits. Therefore, the program shifts the mantissa bits from FP16 to the correct position in FP32.
* 4.Special Case Handling:
1. NaN and Infinity: If the exponent of the input value is equal to 15 (the maximum exponent in FP16), these cases require special handling. The program converts them to their corresponding representations in FP32.
2. Zero: When FP16 represents 0 or -0, the program ensures that the result remains as zero in FP32.
### Conclusion
This type of conversion is highly useful in applications such as graphics processing, machine learning, and any scenario that requires efficient floating-point computations, as FP16 precision is sufficient in certain cases and saves memory space.
### fp16_fp32 assembly code:
```
.data
n_value: .word 0x08000000 # 測試數據
result: .word 0
.text
.global _start
_start:
la t0, n_value
lw t1, 0(t0) # Load the 16-bit half-precision number into t1
slli t2, t1, 16 # Shift left by 16 bits: w = (uint32_t) h << 16
# Extract sign bit
li t0, 0x80000000
and t3, t2, t0 # sign = w & 0x80000000
# Extract nonsign part (exponent and mantissa)
li t0, 0x7FFFFFFF
and t4, t2, t0 # nonsign = w & 0x7FFFFFFF
# Count leading zeros for renormalization shift
li a5, 0 # renorm_shift counter
li a2, 31 # Initialize bit position for my_clz loop
my_clz:
li a1, 1
sll a3, a1, a2 # Generate mask: 1U << i
and a4, t4, a3 # Test if bit i is set
bnez a4, shift # Break if found first 1 bit
addi a5, a5, 1 # Increment leading zero count
addi a2, a2, -1 # Move to the next bit
bgez a2, my_clz # Repeat until bit 0 is reached
# Shift adjustment for renormalized numbers
shift:
addi a5, a5, -5 # Subtract 5 to adjust for mantissa position
bge a5, zero_shift, mask_correction # If shift is positive, skip zeroing
addi a5, x0, 0 # If shift < 5, set shift to 0
mask_correction:
li t0, 0x04000000
add t5, t4, t0 # Add for NaN/Infinity check
srai t5, t5, 8
li t0, 0x7F800000
and t5, t5, t0 # inf_nan_mask creation
zero_shift:
addi t6, t4, -1 # zero_mask creation
srai t6, t6, 31
# Final result assembly
exit:
sll t4, t4, a5 # Shift mantissa based on renormalization
srli t4, t4, 3 # Align exponent and mantissa to FP32 layout
li t0, 0x70 # FP32 exponent bias adjustment
sub t0, t0, a5
slli t0, t0, 23 # Adjust exponent to FP32 position
add t4, t4, t0 # Add adjusted exponent to result
or t4, t4, t5 # Handle NaN/Infinity
not t6, t6
and t4, t4, t6 # Apply zero mask
or t4, t4, t3 # Add sign bit to final result
mv a0, t4 # Move result to a0 for output
# Exit system call
li a7, 10
ecall
```
---
## Optimize the problem of LeetCode 190: Reverse Bits
#### Reverse bits of a given 32 bits unsigned integer.
#### Problem overview:
I am using the my_clz function (Quiz1 problem C) to optimize LeetCode 190: Reverse Bits.
Example 1:
Input: n = 00000010100101000001111010011100
Output: 964176192 (00111001011110000010100101000000)
Example 2:
Input: n = 11111111111111111111111111111101
Output: 3221225471 (10111111111111111111111111111111)
----
## Solution
We need to:
implement a my_clz function and revert to the original first version of the C code, which doesn't use the my_clz function, to observe how much memory usage is reduced.
----
## C program
### First version(without using the clz_function)
Problem-solving approach:
1.Goal:
The objective of this function is to reverse the bits of a given 32-bit unsigned integer. To achieve this, the function processes the input integer bit by bit and reverses the order of its bits.
2.Initialize the result:
We start by initializing a variable result to 0, which will store the reversed bits as we process the input integer.
3.Iterate through all 32 bits:
Since the input is a 32-bit unsigned integer, we need to iterate exactly 32 times (using a for loop) to process each bit.
4.Shift the result to the left:
During each iteration, we shift the bits in the result variable to the left by 1 position to make space for the next bit we extract from n.
5.Extract and append the least significant bit of n:
We use the bitwise AND operation (n & 1) to extract the least significant bit (LSB) of n. This bit is then appended to result using the bitwise OR operation (result |=).
6.Shift the input n to the right:
After extracting the LSB, we shift the bits of n to the right by 1 position (n >>= 1) to process the next bit during the next iteration.
7.Return the result:
After result, which is
```c
uint32_t reverseBits(uint32_t n) {
uint32_t result=0;
for(int i=0;i<32;i++){
result<<=1;
result |= (n & 1);
n>>=1;
}
return result;
}
```
### Second version
in previous problem, if the input value is not large, there may be more leading zeros, causing the loop to run unnecessarily many times. Therefore, I use the my_clz function to calculate the leading zeros, reducing the number of excessive and unnecessary iterations.
```c
static inline int my_clz(uint32_t x)
{
int count = 0;
for (int i = 31; i >= 0; --i)
{
if (x & (1U << i))
break;
count++;
}
return count;
}
uint32_t reverseBits(uint32_t n)
{
uint32_t result = 0;
int ret = my_clz(n); //
n >>= ret; //
for (int i = 0; i < (32-ret); i++)
{
result <<= 1;
result |= (n & 1);
n >>= 1;
}
return result;
}
```
### main()
```c
int main()
{
uint32_t n1 = 43261596; // binary: 00000010100101000001111010011100
uint32_t result1 = reverseBits(n1);
printf("Input: %u, Reversed: %u\n", n1, result1);
uint32_t n2 = 4294967293; // binary: 11111111111111111111111111111101
uint32_t result2 = reverseBits(n2);
printf("Input: %u, Reversed: %u\n", n2, result2);
return 0;
}
```
---
## assembly code:
### c code for function clz:
```c
static inline int my_clz(uint32_t x)
{
int count = 0;
for (int i = 31; i >= 0; --i)
{
if (x & (1U << i))
break;
count++;
}
return count;
}
```
### test assembly code for function clz:
We use 0x08000000 as an example, and after using the clz function, we expect an output of 4.
```
.data
n_value: .word 0x08000000 # 測試數據
result: .word 0
.text
.global _start
_start:
la t0,n_value
lw t1,0(t0) #x load 0x08000000
li t2,31 #int i=31
li t3,0 #int count=0
clz_loop:
li t4,1 # Store 1U
sll t4,t4,t2 # Store the result of 1U << i in t4
and t5,t1,t4 # Store x AND t4 in t5
beq t5 ,x0,continue # If t5 is equal to 0, continue the clz process (i.e., i-- and count++)
j break # If t5 is not equal to 0, found a 1 at this bit, if condition is true
continue:
addi t3,t3,1
addi t2,t2,-1
bge t2,zero,clz_loop
break:
mv t5,t3
```
### Ripes Compilation Results

We expect the answer to be 4, and t5 will also store 4
Another version to check (0x8000)

---
### LeetCode: 190. Reverse Bits in RISC-V using the clz function
```
.data
n_value: .word 0x08000000 # 測試數據
result: .word 0
.text
.global _start
_start:
la t0,n_value
lw t1,0(t0) #x load 0x08000000
li t2,31 #int i=31
li t3,0 #int count=0
clz_loop:
li t4,1 # Store 1U
sll t4,t4,t2 # Store the result of 1U << i in t4
and t5,t1,t4 # Store x AND t4 in t5
beq t5 ,x0,continue # If t5 is equal to 0, continue the clz process (i.e., i-- and count++)
j break # If t5 is not equal to 0, found a 1 at this bit, if condition is true
continue:
addi t3,t3,1
addi t2,t2,-1
bge t2,zero,clz_loop
break:
mv t5,t3
#----------------------------reverbits function------
reverbits:
lw t0,0(t0) t0=n_value
li a3,32
sub t1,a3,t5//t1 effectibe_bits
li,t2,0 #t2=j
li t3,0 #t3=result
li a0,0 #i=0
for_loop:
slt a1, a0, t1 # Compare a0 with t1 (t1 is effective_bits), if i < effective_bits, a1 will be set to 1
beq a1, zero, break_for_loop # If a1 is 1, continue the loop; the condition i < effective_bits is correct
slli t3, t3, 1 # t3 is result, shift t3 left by 1
andi a3, t0, 1 # a3 is a temporary register to store n & 1
or t3, t3, a3 # t3 stores the result of t3 OR (n & 1)
srli t0, t0, 1 # n is shifted right by 1
addi a0, a0, 1 # i++
j for_loop # Continue the loop
break_for_loop:
sll t3, t3, t5 # result << the result of clz
mv a5, t3 # a5 is the final reversed value
```
### Ripes Compilation Results
We input 0x08000000, and after reversing, the expected result is 0x10. The a5 register also outputs the same value

Test with other values, e.g., 0xFFFFFFFD, the expected result is 0xBFFFFFFF

### Result:
using clz function

without using clz :

| A Comparison of Non-Loop Unrolling and Using clz Function Implementations | Non-Loop Unrolling Assembly | Using clz Function Assembly Code |
|----------------------------------------------------------------------------|----------------------------|----------------------------------|
| **Cycles** | 380 | 338 |
| **Instrs. retired** | 246 | 268 |
| **CPI** | 1.53 | 1.26 |
| **IPC** | 0.647 | 0.793 |
| **Clock rate** | 0 Hz | 0 Hz |
### Conclusion
The implementation with the clz function performs better, mainly in the following aspects:
* It requires fewer total cycles (338 cycles vs. 380 cycles).
* The CPI is lower, indicating that each instruction is executed faster.
* The IPC is higher, meaning that more instructions are executed per cycle.
# Pipeline 5 Stage Explanation

## Overview
The 5-stage pipeline is an optimization technique used in modern processors. By dividing the execution of each instruction into five steps, it enables parallel processing of instructions. This allows the processor to handle multiple instructions simultaneously in different stages of execution, significantly improving performance.
The five stages are:
1. **Instruction Fetch (IF)**
2. **Instruction Decode (ID)**
3. **Execution (EX)**
4. **Memory Access (MEM)**
5. **Write Back (WB)**
## Detailed Explanation of the 5 Stages
### 1. Instruction Fetch (IF)
In the Instruction Fetch (IF) stage, the processor retrieves the next instruction from memory based on the value of the Program Counter (PC) and passes it to the Instruction Decoder for processing. The PC is updated during each clock cycle to ensure continuous fetching of instructions.
### 2. Instruction Decode (ID)
The Instruction Decode (ID) stage is responsible for decoding the fetched instruction, parsing the opcode and operands, and reading the necessary data from the register file. In this stage, any immediate values (Immediates) are also extracted for use in the subsequent Execution stage.
### 3. Execution (EX)
The Execution (EX) stage uses the Arithmetic Logic Unit (ALU) to perform the required operations. Depending on the type of instruction, the ALU carries out arithmetic (e.g., addition, subtraction) or logical operations (e.g., shifting, comparison) and produces either a result or a memory access address.
### 4. Memory Access (MEM)
In the Memory Access (MEM) stage, the processor reads or writes data to memory depending on the instruction type. For Load instructions, data is retrieved from memory and passed to the next stage. For Store instructions, data is written to memory.
### 5. Write Back (WB)
The Write Back (WB) stage stores the result of the ALU operation or the data fetched from memory back into the register file. This is the final stage of instruction execution, after which the processor is ready to execute the next instruction.
## Registers
Registers are small, fast memory units within the processor used to store data and intermediate results. In the 5-stage pipeline, the register file can be accessed by multiple stages simultaneously. Common registers include:
- **x0 to x31**: General-purpose registers, where `x0` is a special register that is always zero.
- **Special-purpose registers**: These include the Program Counter (PC), Status Registers, and others.
## Instruction Memory
The Instruction Memory holds the machine code instructions that the processor will execute. In each clock cycle, the Program Counter (PC) fetches an instruction from memory and sends it to the decoder for execution. For example, at address `0x0`, there could be an instruction like `auipc x5, 0x0`.
## Data Flow and Dependencies
In a 5-stage pipeline, Data Hazards are a common issue. These occur when an instruction depends on data from a previous instruction that hasn't yet been written back. To address this, the processor employs a technique called Forwarding. Forwarding allows the processor to directly pass the result from the ALU or memory stage to subsequent instructions without waiting for the Write Back stage to complete.
## Example Instruction Flow
Consider the instruction `add x5, x0, x1`, which adds the values from registers `x0` and `x1`, and stores the result in `x5`:
1. **IF Stage**: The instruction `add x5, x0, x1` is fetched from instruction memory.

3. **ID Stage**: The instruction is decoded, and the values of `x0` and `x1` are read.

5. **EX Stage**: The ALU adds the values of `x0` and `x1`.

7. **MEM Stage**: Since this is an arithmetic operation, no memory access is required.

9. **WB Stage**: The result is written back to register `x5`.
