Assignment1: RISC-V Assembly and Instruction Pipeline
===
contributed by <[`YHKung`](https://github.com/YHK00103)>
## 1. Problem C
### 1.1 Abstract
In problem C, the goal is to find the absolute value of a 32-bit floating-point number and convert a 16-bit floating-point number in IEEE half-precision format (`fp16`) to a 32-bit floating-point number in IEEE single-precision format (`fp32`). The problem defines three functions, including `fabsf`, `my_clz`, and `fp16_to_fp32`.
The most significant function, `fp16_to_fp32`, implements the conversion of 16-bit half-precision floating-point numbers and 32-bit single-precision floating-point numbers. In order to ensure the accurate conversion, it also considers the special cases, such as `NaN`, `infinity`, and `zero`.
I will first introduce the floating-point format and then explain each function along with its corresponding RISC-V assembly.
### 1.2 Half-precision and Single-precision floating-point format
In IEEE 754 floating-point format, a floating-point number is separated into three main parts, including `sign`, `exponent`, and `Mantissa`. The following forms represent the structure of 16-bit half-precision floating-point number and 32-bit single-precision floating-point number.
16-bit half-precision floating-point number:
| Sign (1 bit) | Exponent (5 bits) | Mantissa (10 bits) |
|:------------:|:-----------------:|:------------------:|
| x | xxxxx | xxxx xxxx xx |
32-bit single-precision floating-point number:
| Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits) |
|:------------:|:-----------------:|:----------------------------:|
| x | xxxx xxxx | xxxx xxxx xxxx xxxx xxxx xxx |
**1. Sign**
Sign bit determines whether a floating-point number is positive or negative.
* If the sign bit is `0`, the floating-point number is positive.
* If the sign bit is `1`, the floating-point number is negative.
**2. Exponent**
The function of exponent is to control the range of floating-point number. It stores exponent bits in a biased format, which simpifies the comparison of floating-point numbers and allows exponent to be expressed as both positive and negative value.
* For a 16-bit half-precision foating-point number, the bias is `15`.
* For a 32-bit single-precision floating-point numberm the bias is `127`.
**3. Mantissa**
Mantissa decides the valid digits and the precision of a floating-point number. In IEEE 754 standard, the most significant bit (MSB) of mantissa is assumed to be 1, known as the implicit bit. Hence, the actual representation is `1.mantissa`.
### 1.3 fabsf
The `fabsf` function computes the absolute value of a 32-bit floating-point number by leveraging the characteristics of IEEE 754 floating-point format.
The function reads the bits of the floating-point `x` into an unsigned integer and uses the mask `0x7FFFFFFF` to clear the sign bit, thus obtaining the absolute value. Finally, it writes the modified bits back into the float and returns the absolute value.
```=C
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
```
The corresponding RISC-V assembly is showed below.
```=C
# fabsf function in RISC-V assembly
# Input: float x (passed via a0)
# Output: absolute value of x (returned in a0)
fabsf:
li t1, 0x7FFFFFFF # t1 = 0x7FFFFFFF (mask)
and a0, a0, t1 # x &= i
ret
```
### 1.4 my_clz
The `my_clz` function calculates the number of leading zeros in a 32-bit unsigned integer.
First, the function reads the 32-bit unsigned interger and initializes the variable `count` to 0. Then, a for loop starts from the most significant bit (MSB) and checks whether each bits is zero. If a non-zero bit is found, the for loop breaks, and the function returns the value of `count` , which represents the number of leading zeros.
```=C
static inline int my_clz(uint32_t x) {
int count = 0; //Initialize count to be 0
for (int i = 31; i >= 0; --i) {
if (x & (1U << i)) //Check whether each bit is non-zero
break;
count++;
}
return count;
}
```
The corresponding RISC-V assembly is showed below.
```=C
# my_clz function in RISC-V assembly
# Input: 32-bit unsigned integer x (passed via a0)
# Output: count of leading zeros (returned in a0)
my_clz:
li t0, 0 # Initialize count = 0 (t0)
li t1, 31 # Initialize i = 31 (t1)
li s0, 1 # S0 = 1U
my_clz_loop:
sll t2, s0, t1 # t2 = (IU << i)
and t2, a0, t2 # t2 = (x & (IU << i))
beqz t2, skip_end # if ((x & (IU << i)) != 0), go to skip_end
j my_clz_end
skip_end:
addi t0, t0, 1 # count++
addi t1, t1, -1 # t1 = i--
bgez t1, my_clz_loop # if (i >= 0), go to for loop
my_clz_end:
mv a0, t0 # move result to a0 and return
ret
```
### 1.5 fp16_to_fp32
The `fp16_to_fp32` function converts a 16-bit half-precision floating-point number to a 32-bit floating-point single-precision number. The details of this function are as follows:
**1. Extend half-precision to single-precision**
To extend a 16-bit half-precision floating-point number to 32-bit single-precision floating-point number, shift it left by 16 bits, and put the mantissa and exponent into the upper 16 bits of an integer.
**2. Extract sign, exponent, and mantissa**
By using masks `0x80000000` and `0x7FFFFFFF` respectively, the sign bit is extracted into the variable `sign`, while the exponent and mantissa bits are stored in the variable `nonsign`.
**3. Calcualte renormalized shift**
The purpose of this operation is to determine if a half-precision floating-point number needs to be normalized. First, The number of leading zeros in exponent and mantissa is calculated by `my_clz` function. Then, `renorm_shift` adjusts the result according to whether the leading zeros exceed 5 bits, which decides how many bits the mantissa needs to be shifted for normalization.
**4. Create inf_nan_mask and zero_mask**
To identify `NaN` and `infinity` cases, the operation creates a mask to examine whether the exponent bits are `11111`. Then, for `zero` case, it creates an all one mask (`0xFFFFFFFF`).
**5. Reformatting and Reorganization**
Eventually, the sign bit, normalized mantissa, and adjusted exponent are combined to form a 32-bit single-precision floating-point number, which is returned as the result.
The C code is as follows.
```=C
static inline uint32_t fp16_to_fp32(uint16_t h) {
// 1. Extend half-precision to dingle-precision
const uint32_t w = (uint32_t) h << 16;
// 2. Extract sign, exponent, and mantissa
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
// 3. Calcualte renormalized shift
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
// 4. Create inf_nan_mask and zero_mask
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
// 5. Reformatting and Reorganization
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
The corresponding RISC-V assembly is showed below.
```=C
# fp16_to_fp32 function in RISC-V assembly
# Input: uint16_t h (passed via a0)
# Output: uint32_t h_32 (returned in a0)
fp16_to_fp32:
# w
slli t0, a0, 16 # t0 = w = (uint32_t) h << 16
# sign
li t1, 0x80000000 # t1 = 0x80000000
and t2, t0, t1 # t2 = sign = w & UINT32_C(0x80000000)
# nonsign
li t1, 0x7FFFFFFF # t1 = 0x7FFFFFFF
and t3, t0, t1 # t3 = nonsign = w & UINT32_C(0x7FFFFFFF);
# renorm_shift
addi sp, sp, -4 # store ra to stack
sw ra, 4(sp)
mv a0, t3
jal ra, my_clz
lw ra, 4(sp) # reload ra from stack
addi sp, sp, 4
mv t4, a0 # t4 = renorm_shift = my_clz(nonsign)
addi t4, t4, -5 # t4 = renorm_shift - 5
bgt t4, zero, continue # if (renorm_shift > 5), goto continue
li t4 0 # if (renorm_shift <= 5), t4 = renorm_shift = 0
continue:
# inf_nan_mask
li t1, 0x04000000 # t1 = 0x04000000
add t5, t3, t1 # t5 = nonsign + 0x04000000
srai t5, t5, 8 # t5 = (nonsign + 0x04000000) >> 8
li t1, 0x7F800000 # t1 = 0x7F800000
and t5, t5, t1 # t5 = inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000)
# zero_mask
addi t6, t3, -1 # t6 = nonsign - 1
srai t6, t6, 31 # t6 = zero_mask = (int32_t)(nonsign - 1) >> 31
# return value
sll t3, t3, t4 # t3 = nonsign << renorm_shift
srai t3, t3, 3 # t3 = (nonsign << renorm_shift) >> 3
li t1, 0x70 # t1 = 0x70
sub t1, t1, t4 # t1 = 0x70 - renorm_shift
slli t1, t1, 23 # t1 = (0x70 - renorm_shift) << 23
add t3, t3, t1 # t3 = ((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23))
or t3, t3, t5 # t3 = (((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask)
li t1, -1
xor t6, t6, t1 # t6 = ~zero_mask
and t3, t3, t6 # t3 = ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask)
or t2, t2, t3 # t2 = sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
#return
mv a0, t2
ret
```
## 2. Use case from Leetcode
### 2.1 Motivation
In problem C, The `my_clz` function calculates how many leading zeros there are in the binary representation of a 32-bit unsigned integer. This function can be utilized in programs that count the maximum consecutive 1s.
Counting the maximum consecutive 1s in a sequence consisting only of 0s and 1s can be applied in various fields. For example, in image compression, the pixels of a black-and-white image can be representeed as 0s and 1s. Calculating the maximum consecutive 1s in a specific area can be used to compress the data, thereby reducing storage space. Hence, finding the number of maximum consecutive 1s is an important issue.
### 2.2 Max Consecutive Ones
Based on the motivation, [Leetcode 485. Max Consecutive Ones](https://leetcode.com/problems/max-consecutive-ones/description/) can be solved by using the `my_clz` function. The description of the Leetcode problem is as follows:
>Description:
>Given a binary array nums, return the maximum number of consecutive 1's in the array
>
>Example 1:
>Input: nums = [1,1,0,1,1,1]
Output: 3
Explanation: The first two digits or the last three digits are consecutive 1s. The maximum number of consecutive 1s is 3.
>
>Example 2:
Input: nums = [1,0,1,1,0,1]
Output: 2
>
>Constraints:
1 <= nums.length <= 10^5
nums[i] is either 0 or 1.
## 3. Solution
### 3.1 C code
The C code program can be divided into `my_clz` function and `findMaxConsecutiveOnes` function. The details of these functions are as follows:
**1. my_clz function**
This function counts the number of leading zeros in the binary representation of a 32-bit integer `x`. It iterates from the most significant bit (bit 31) down to 0, counting the number of zeros until it finds the first 1.
**2. findMaxConsecutiveOnes function**
This function calculates the maximum number of consecutive 1's in the array `nums[]`. It first groups every 32 elements in the array `nums[]` into a pack called `packed`. Then, if `packed` is `0xFFFFFFFF`, it adds `wordSize` to `currentCount`. Otherwise, it uses the `my_clz` function to skip over the leading zeros and count the consecutive 1s. Eventually, it returns the value of `maxCount`, which represents the maximum number of consecutive 1s.
```c=
#include <stdio.h>
#include <stdint.h>
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
int findMaxConsecutiveOnes(int* nums, int numsSize) {
int maxCount = 0;
int currentCount = 0;
int wordSize = 32;
int set = (numsSize >> 5) << 5;
if (numsSize == 0)
return 0;
for (int i = set; i >= 0; i-= wordSize) {
uint32_t packed = 0;
//Pack 32 bits into an unsigned interger
for (int j = 0; j < wordSize && i + j < numsSize; j++) {
packed |= ((uint32_t)nums[i + j] << j);
}
if (packed == 0xFFFFFFFF) {
currentCount += wordSize;
}
else {
int counter = 32;
while(counter > 0){
int leadingZeros = my_clz(packed);
packed <<= leadingZeros;
counter -= leadingZeros;
if(leadingZeros != 0){
if(currentCount > maxCount)
maxCount = currentCount;
currentCount = 0;
}
else{
currentCount++;
packed <<= 1;
counter--;
}
}
}
}
if (currentCount > maxCount)
maxCount = currentCount;
return maxCount;
}
int main() {
int nums1[] = {1, 1, 0, 1, 1, 1};
int numsSize1 = sizeof(nums1) / sizeof(nums1[0]);
int result1 = findMaxConsecutiveOnes(nums1, numsSize1);
printf("Maximum number of consecutive 1's: %d\n", result1);
int nums2[] = {1, 0, 1, 1, 0, 1};
int numsSize2 = sizeof(nums2) / sizeof(nums2[0]);
int result2 = findMaxConsecutiveOnes(nums2, numsSize2);
printf("Maximum number of consecutive 1's: %d\n", result2);
int nums3[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1};
int numsSize3 = sizeof(nums3) / sizeof(nums3[0]);
int result3 = findMaxConsecutiveOnes(nums3, numsSize3);
printf("Maximum number of consecutive 1's: %d\n", result3);
return 0;
}
```
### 3.2 RISC-V assembly code
The corresponding RISC-V assembly is as follows:
```=C
.data
num1: .word 1, 1, 0, 1, 1, 1
numSize1: .word 6
num2: .word 1, 0, 1, 1, 0, 1
numSize2: .word 6
num3: .word 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
numSize3: .word 35
str1: .string "Maximum number of consecutive 1's :"
str2: .string "\n"
.text
# main function in RISC-V assembly
main:
# num1 case
la a0, num1
lw a1, numSize1
jal ra, findMaxConsecutiveOnes
mv t0, a0
la a0, str1
li a7, 4
ecall
mv a0, t0
li a7, 1
ecall
la a0, str2
li a7, 4
ecall
# num2 case
la a0, num2
lw a1, numSize2
jal ra, findMaxConsecutiveOnes
mv t0, a0
la a0, str1
li a7, 4
ecall
mv a0, t0
li a7, 1
ecall
la a0, str2
li a7, 4
ecall
# num3 case
la a0, num3
lw a1, numSize3
jal ra, findMaxConsecutiveOnes
mv t6, a0
la a0, str1
li a7, 4
ecall
mv a0, t6
li a7, 1
ecall
la a0, str2
li a7, 4
ecall
# Exit the program
li a7, 10 # System call code for exiting
ecall
# my_clz function in RISC-V assembly
# Input: 32-bit unsigned integer x (passed via a0)
# Output: count of leading zeros (returned in a0)
my_clz:
li t2, 1 # t2 = 1U
li t1, 31 # t1 = i = 31
li t0, 0 # t0 = count
my_clz_loop:
sll t3, t2, t1 # t3 = 1U << i
addi t1, t1, -1 # t1 = i--
and t3, a0, t3 # t3 = x & (1U << i)
bnez t3, my_clz_end # if ((x & (1U << i)) == 1), goto my_clz_end
addi t0, t0, 1 # t0 = count++
bgez t1, my_clz_loop # if (i >= 0), goto my_clz_loop
my_clz_end:
mv a0, t0
ret
# findMaxConsecutiveOnes function in RISC-V assembly
# Input: pointer to array *num, interger numSize (passed via a0 and a1)
# Output: max consecutive ones (returned in a0)
findMaxConsecutiveOnes:
li s0, 0 # s0 = maxCount = 0
srli s3, a1, 5 # s3 = set = (numSize >> 5)
beqz a1, for_loop_end # if (numsSize == 0), goto for_loop_end
li s1, 0 # s1 = currentCount = 0
slli s3, s3, 5 # s3 = set = (numSize >> 5) << 5
li s2, 32 # s2 = wordSize = 32
li s4, 0 # s4 = packed = 0
addi s5, s3, 0 # s5 = i = set
li s6, 0 # s6 = j = 0
j for_loop
for_loop:
bgez s5, pack_bits # if (i >= 0), goto pack_bits
sub t5, s1, s0 # t5 = currentcount - maxCount
bgez t5, update_maxCount2 # if (currentCount >= maxCount), goto update_maxCount2
j for_loop_end
# if-else structure 2
sub t6, s1, s0 # t6 = currentCount - maxCount
bgez t6, update_maxCount2 # if (currentCount - maxCount >= 0), goto update_maxCount2
j for_loop_end
for_loop_update:
sub s5, s5, s2 # s1 = i -= wordSize
j for_loop
pack_bits:
add t0, s5, s6 # t0 = i + j
slli, t4, t0, 2 # t4 = (i+j) * 4
sub t1, s6, s2 # t1 = j - wordSize
add t3, a0, t4 # t4 = nums[i+j] address
sub t2, t0, a1 # t2 = (i+j) - numsSize
bgez t1, check_packed # if (j - wordSize >= 0), goto check_packed
lw t3, 0(t3) # t3 = num[i+j] value
bgez t2, check_packed # if (((i+j) - numsSize) >= 0), goto check_packed
sll t3, t3, s6 # t3 = num[i+j] << j
addi s6, s6, 1 # s6 = j++
li s8, 0xFFFFFFFF # s8 = 0xFFFFFFFF
or s4, s4, t3 # s4 = packed |= (nums[i + j] << j);
j pack_bits
check_packed:
sub t4, s4, s8 # t4 = packed - 0xFFFFFFFF
li s7, 32 # s7 = counter = 32
li s6, 0 # s6 = j = 0
beqz t4, update_currentCount # if (packed == 0xFFFFFFFF), goto update_currentCount
bgez s7, while_loop # if (counter >= 0), goto, while_loop
update_currentCount:
add s1, s1, s2 # s1 = currentCount += wordSize
sub s5, s5, s2 # s1 = i -= wordSize
j for_loop_update
while_loop:
# check while_loop condition
beqz s7, for_loop_update # if (counter == 0), goto for_loop_update
# call my_clz
addi sp, sp, -24 #store ra and a0 to stack before calling my_clz
sw ra, 24(sp)
sw a0, 20(sp)
sw t0, 16(sp)
sw t1, 12(sp)
sw t2, 8(sp)
sw t3, 4(sp)
mv a0, s4 # a0 = packed
jal ra, my_clz
mv s9, a0 # s9 = leadingZeros = my_clz(packed)
lw ra, 24(sp)
lw a0, 20(sp)
lw t0, 16(sp)
lw t1, 12(sp)
lw t2, 8(sp)
lw t3, 4(sp)
addi sp, sp, 24
sll s4, s4, s9 # s4 = packed <<= leadingZeros
sub s7, s7, s9 # s7 = counter -= leadingZeros
# if-else structure 1
beqz s9, update_currentCount2 # if (leadingZeros == 0), goto update_currentCount2
sub t5, s1, s0 # t5 = currentcount - maxCount
bgez t5, update_maxCount1 # if (currentCount - maxCount >= 0), goto update_maxCount1
update_currentCount2:
addi s7, s7, -1 # s7 = counter -= 1
addi s1, s1, 1 # s1 = currentCount++
slli s4, s4, 1 # s4 = packed <<= 1
j while_loop
update_maxCount1:
mv s0, s1 # s0 = maxCount = currentCount
li s1, 0 # s1 = currentCount = 0
j while_loop
update_maxCount2:
mv s0, s1 # s0 = maxCount = currentCount
j for_loop_end
for_loop_end:
mv a0, s0 # return maxCount
ret
```
## 4. Result
### 4.1 Test cases
The test cases are shown below:
| | numsSize | nums[] |
|:-----------:|:--------:|:---------------------------------------------------------------------------------------------------------:|
| test case 1 | 6 | {1, 1, 0, 1, 1, 1} |
| test case 2 | 6 | {1, 0, 1, 1, 0, 1} |
| test case 3 | 35 | {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1} |
### 4.2 C progran output
The output of C program is as follows:
```=C
Maximum number of consecutive 1's :3
Maximum number of consecutive 1's :2
Maximum number of consecutive 1's :35
--------------------------------
Process exited after 0.0407 seconds with return value 0
```
### 4.3 RISC-V assembly output
The outputs of RISC-V assembly output are below, including the results of maximum number of consecutive 1's and the execution information.
**Results:**
```=C
Maximum number of consecutive 1's :3
Maximum number of consecutive 1's :2
Maximum number of consecutive 1's :35
Program exited with code: 0
```
**Execution information:**
```=C
Cycles: 2434
Instrs. retired: 1902
CPI: 1.28
IPC: 0.781
Clock rate: 0 Hz
```
 
## 5. Analysis
I test the code using [Ripes](https://github.com/mortbopet/Ripes) simulator.
### 5.1 Pseodo instruction
:::spoiler Pseodo instruction
```
00000000 <main>:
0: 10000517 auipc x10 0x10000
4: 00050513 addi x10 x10 0
8: 10000597 auipc x11 0x10000
c: 0105a583 lw x11 16 x11
10: 0f0000ef jal x1 240 <findMaxConsecutiveOnes>
14: 00050293 addi x5 x10 0
18: 10000517 auipc x10 0x10000
1c: 0b050513 addi x10 x10 176
20: 00400893 addi x17 x0 4
24: 00000073 ecall
28: 00028513 addi x10 x5 0
2c: 00100893 addi x17 x0 1
30: 00000073 ecall
34: 10000517 auipc x10 0x10000
38: 0b850513 addi x10 x10 184
3c: 00400893 addi x17 x0 4
40: 00000073 ecall
44: 10000517 auipc x10 0x10000
48: fd850513 addi x10 x10 -40
4c: 10000597 auipc x11 0x10000
50: fe85a583 lw x11 -24 x11
54: 0ac000ef jal x1 172 <findMaxConsecutiveOnes>
58: 00050293 addi x5 x10 0
5c: 10000517 auipc x10 0x10000
60: 06c50513 addi x10 x10 108
64: 00400893 addi x17 x0 4
68: 00000073 ecall
6c: 00028513 addi x10 x5 0
70: 00100893 addi x17 x0 1
74: 00000073 ecall
78: 10000517 auipc x10 0x10000
7c: 07450513 addi x10 x10 116
80: 00400893 addi x17 x0 4
84: 00000073 ecall
88: 10000517 auipc x10 0x10000
8c: fb050513 addi x10 x10 -80
90: 10000597 auipc x11 0x10000
94: 0345a583 lw x11 52 x11
98: 068000ef jal x1 104 <findMaxConsecutiveOnes>
9c: 00050f93 addi x31 x10 0
a0: 10000517 auipc x10 0x10000
a4: 02850513 addi x10 x10 40
a8: 00400893 addi x17 x0 4
ac: 00000073 ecall
b0: 000f8513 addi x10 x31 0
b4: 00100893 addi x17 x0 1
b8: 00000073 ecall
bc: 10000517 auipc x10 0x10000
c0: 03050513 addi x10 x10 48
c4: 00400893 addi x17 x0 4
c8: 00000073 ecall
cc: 00a00893 addi x17 x0 10
d0: 00000073 ecall
000000d4 <my_clz>:
d4: 00100393 addi x7 x0 1
d8: 01f00313 addi x6 x0 31
dc: 00000293 addi x5 x0 0
000000e0 <my_clz_loop>:
e0: 00639e33 sll x28 x7 x6
e4: fff30313 addi x6 x6 -1
e8: 01c57e33 and x28 x10 x28
ec: 000e1663 bne x28 x0 12 <my_clz_end>
f0: 00128293 addi x5 x5 1
f4: fe0356e3 bge x6 x0 -20 <my_clz_loop>
000000f8 <my_clz_end>:
f8: 00028513 addi x10 x5 0
fc: 00008067 jalr x0 x1 0
00000100 <findMaxConsecutiveOnes>:
100: 00000413 addi x8 x0 0
104: 0055d993 srli x19 x11 5
108: 10058c63 beq x11 x0 280 <for_loop_end>
10c: 00000493 addi x9 x0 0
110: 00599993 slli x19 x19 5
114: 02000913 addi x18 x0 32
118: 00000a13 addi x20 x0 0
11c: 00098a93 addi x21 x19 0
120: 00000b13 addi x22 x0 0
124: 0040006f jal x0 4 <for_loop>
00000128 <for_loop>:
128: 020ad263 bge x21 x0 36 <pack_bits>
12c: 40848f33 sub x30 x9 x8
130: 0e0f5463 bge x30 x0 232 <update_maxCount2>
134: 0ec0006f jal x0 236 <for_loop_end>
138: 40848fb3 sub x31 x9 x8
13c: 0c0fde63 bge x31 x0 220 <update_maxCount2>
140: 0e00006f jal x0 224 <for_loop_end>
00000144 <for_loop_update>:
144: 412a8ab3 sub x21 x21 x18
148: fe1ff06f jal x0 -32 <for_loop>
0000014c <pack_bits>:
14c: 016a82b3 add x5 x21 x22
150: 00229e93 slli x29 x5 2
154: 412b0333 sub x6 x22 x18
158: 01d50e33 add x28 x10 x29
15c: 40b283b3 sub x7 x5 x11
160: 02035063 bge x6 x0 32 <check_packed>
164: 000e2e03 lw x28 0 x28
168: 0003dc63 bge x7 x0 24 <check_packed>
16c: 016e1e33 sll x28 x28 x22
170: 001b0b13 addi x22 x22 1
174: fff00c13 addi x24 x0 -1
178: 01ca6a33 or x20 x20 x28
17c: fd1ff06f jal x0 -48 <pack_bits>
00000180 <check_packed>:
180: 418a0eb3 sub x29 x20 x24
184: 02000b93 addi x23 x0 32
188: 00000b13 addi x22 x0 0
18c: 000e8463 beq x29 x0 8 <update_currentCount>
190: 000bd863 bge x23 x0 16 <while_loop>
00000194 <update_currentCount>:
194: 012484b3 add x9 x9 x18
198: 412a8ab3 sub x21 x21 x18
19c: fa9ff06f jal x0 -88 <for_loop_update>
000001a0 <while_loop>:
1a0: fa0b82e3 beq x23 x0 -92 <for_loop_update>
1a4: fe810113 addi x2 x2 -24
1a8: 00112c23 sw x1 24 x2
1ac: 00a12a23 sw x10 20 x2
1b0: 00512823 sw x5 16 x2
1b4: 00612623 sw x6 12 x2
1b8: 00712423 sw x7 8 x2
1bc: 01c12223 sw x28 4 x2
1c0: 000a0513 addi x10 x20 0
1c4: f11ff0ef jal x1 -240 <my_clz>
1c8: 00050c93 addi x25 x10 0
1cc: 01812083 lw x1 24 x2
1d0: 01412503 lw x10 20 x2
1d4: 01012283 lw x5 16 x2
1d8: 00c12303 lw x6 12 x2
1dc: 00812383 lw x7 8 x2
1e0: 00412e03 lw x28 4 x2
1e4: 01810113 addi x2 x2 24
1e8: 019a1a33 sll x20 x20 x25
1ec: 419b8bb3 sub x23 x23 x25
1f0: 000c8663 beq x25 x0 12 <update_currentCount2>
1f4: 40848f33 sub x30 x9 x8
1f8: 000f5a63 bge x30 x0 20 <update_maxCount1>
000001fc <update_currentCount2>:
1fc: fffb8b93 addi x23 x23 -1
200: 00148493 addi x9 x9 1
204: 001a1a13 slli x20 x20 1
208: f99ff06f jal x0 -104 <while_loop>
0000020c <update_maxCount1>:
20c: 00048413 addi x8 x9 0
210: 00000493 addi x9 x0 0
214: f8dff06f jal x0 -116 <while_loop>
00000218 <update_maxCount2>:
218: 00048413 addi x8 x9 0
21c: 0040006f jal x0 4 <for_loop_end>
00000220 <for_loop_end>:
220: 00040513 addi x10 x8 0
224: 00008067 jalr x0 x1 0
```
:::
### 5.2 5-stage pipelined processor
[Ripes](https://github.com/mortbopet/Ripes) provides several processers simulator, and I use 5-stage pipelined processor to test the code.

Compared to a single-cycle processor, 5-stage processor is divided into five stages:
Instruction Fetch (`IF`), Instruction Decode (`ID`), Execute (`EX`), Memory Access (`MEM`), and Write Back (`WB`). The details of each stage are as follows:
**1. Instruction Fetch (IF)**
* The instruction is fetched from the memory address pointed by the Program Counter (PC).
* The Program Counter (PC) is updated by being added an immediate value of 4 using an adder, allowing the PC to point to the memory address of the next instruction.
**2. Instruction Decode (ID)**
* The instruction is decoded using a decoder and read the value from register sourse.
* For I-format instructions, the immediate value undergoes sign extension before being inputted into the pipelined register.
**3. Execute (EX)**
* The ALU carries out one of the following instructions:
* **R-format** : The ALU performs the operation based on the opcode and the values in registers.
* **I-format** : The ALU performs the operation based on the opcode, the values in registers, and the sign-extensioned immediate value.
* **S-format** : The ALU performs an addition operation to add the base address and offset, with the resulting value representing the memory address for storing or loading data.
* **B-format** : The ALU checks conditions and determines whether the branch occurs.
**4. Memory Access (MEM)**
* For store instruction, the data will be stored in memory at the address computed in Execute (EX) stage.
* For load instruction, the data will be loaded from memory according to the address computed in Execute (EX) stage and then input into pipelined register.
**5. Write Back (WB)**
* The data value from memory or the result of ALU operation is written back into register.
### 5.3 Example
To clarify how the processor operates instructions, I will use the pseudoinstruction `addi x22, x22, 1` to illustrate the operation stage by stage. The memory address of pseudoinstruction `addi x22, x22, 1` is `0x170`.
**1. Instruction Fetch (IF)**
* First, the memory address pointed by program counter (PC) is `0x170`, which represents the next insturction the processor will execute in next cycle.
* According to RISC-V instruction format, the I-format instruction in binay representation can be separated into `immediate`, `rs1`, `funct3`, `rd`, and `opcode`.
| immediate [11:0] | rs1 [12:16] | funct3 [17:19] | rd [20:24] | opcode [25:31] |
|:----------------:|:-----------:|:--------------:|:----------:|:--------------:|
| 0000 0000 0001 | 10110 | 000 | 10110 | 0010011 |
* Meanwhile, the program counter (PC) is added by immediate value 4 and obtain the memory address of the next instruction `0x174`.

**2. Instruction Decode (ID)**
According to the instruction fetched in the previous stage, the decoder decodes the instruction into `rs1`, `rd`, and `opcode`. The `immediate` is directly input into the sign-extension device.
* `rs1` is `0x16` (x22). The register file reads the value `0x0` from the corresponding register and inputs it into pipelined register.
* `opcode` is `0010011`. (ADDI)
* `immediate` is `0x1`, which is extended to a 32-bit number for calculation in next stage.

**3. Execute (EX)**
* In the EX stage, the input signals `op1` and `op2` are controlled by two-level multiplexers.
* op1 : the value of `rs1` passes through two-level multiplexers as an input of the ALU.
* op2 : the value of `immediate` passes through two-level multiplexers as an input of the ALU.
* Hence, the inputs and output of ALU are as follows:
* Input : `op1` = `0x0` , `op2` = `0x1`
* Output : `ALU_result` = `0x1`
* The result of the ALU operation, the destination register `0x16` (x22) , and the program counter are sent to the pipelined register for the next stage.

**4. Memory Access (MEM)**
* Because I-format instructions do not need to load or store value from memory, the control signal `wr_en` is set to 0.
* Program counter, the result of ALU operation, and destination register `0x16` are sent to pipelined register directly.

**5. Write Back (WB)**
* In the WB stage, the multiplexer controls which signal will pass through using the control signal. In this case, the result of ALU operation will be sent directly from pipelined register and written back into the destination register `0x16` (x22) through the multiplexer.

**6. Memory Update**
The first image below shows the memory status when the pseudoinstruction `addi x22, x22, 1` is in MEM stage, while second image shows the same pseudoinstruction in the WB stage. According to two images, it is evident that the value of register `x22` is updated in the WB stage.

