# [Assignment1: RISC-V Assembly and Instruction Pipeline](?)
:::danger
Check the sample pages, making the title consistent.
:::
contributed by < `hsin` >(CHIEN,TZU-HSIN)
## Subject
:::danger
Make it meaningful.
:::
### Abstract
Problem C in the quiz1 contains about count leading zero (CLZ).
Counting leading zeros plays a key role in converting floating-point numbers, especially when it comes to normalization and maintaining precision.
The following function is the method of counting leading zero in Problem C of the quiz1.
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
The problem involves using an application of my_clz. The application is to calculate how many '1's are in the binary representation of the value stored in a certain array position. For example, if there are two '1's, it confirms that there are two sets with common elements.
### Motivation
[Two Out of Three]
It is a problem of finding common elements across all sets, and this approach can actually be applied to string comparison as well—treating strings as sets for comparison. Of course, it can also be used in everyday life, wherever there is a need to find matching elements.
The most straightforward way to compare sets is by using two arrays and running a nested loop to compare them. However, this approach has a time complexity of O(𝑛²), which is inefficient. In this problem, since the size of the elements is limited, I use a single array to record the elements present in each set. Because I don’t need to compare pairs of sets, I process each set independently, reducing the time complexity to at most O(𝑛).
The method to determine the elements in each set involves an array where each position contains 3 bits. For example, if the first set contains a certain element, its index in the array corresponds to that element, and the value stored at that position would be 001. This bitwise approach is used to track which elements appear in each set. I then use a function similar to my_clz to count how many '1's are in the binary representation. When two or more '1's appear, it indicates that there are common elements between sets.
## Problem
[2032. Two Out of Three]
(https://leetcode.com/problems/two-out-of-three/description/)
> Description : Given three integer arrays nums1, nums2, and nums3, return a distinct array containing all the values that are present in at least two out of the three arrays. You may return the values in any order.
>
```
Example 1:
Input: nums1 = [1,1,3,2], nums2 = [2,3], nums3 = [3]
Output: [3,2]
Explanation: The values that are present in at least two arrays are:
- 3, in all three arrays.
- 2, in nums1 and nums2.
Example 2:
Input: nums1 = [3,1], nums2 = [2,3], nums3 = [1,2]
Output: [2,3,1]
Explanation: The values that are present in at least two arrays are:
- 2, in nums2 and nums3.
- 3, in nums1 and nums2.
- 1, in nums1 and nums3.
Example 3:
Input: nums1 = [1,2,2], nums2 = [4,3,3], nums3 = [5]
Output: []
Explanation: No value is present in at least two arrays.
1 <= nums1.length, nums2.length, nums3.length <= 100
1 <= nums1[i], nums2[j], nums3[k] <= 100
```
## Solution
The Two Out of Three
In this problem, the elements of the sets are restricted to a specific range, only between 0 and 100. Therefore, I set up an ans[101] array to record how many times each number from 0 to 100 appears across the different sets. The input sets may contain duplicate elements, but according to the definition of a set, duplicates should only be counted once.
To handle this, I have a function called testset. For example, when I load the first set (nums1) into this function, if the set contains the number 1, the function will set ans[1] as 000 -> 001, where the leftmost bit is set to 1 because it belongs to the first set. If the second set also contains 1, ans[1] will become 000 -> 010, and so on.
As mentioned earlier, the input may contain duplicate elements. Let’s assume ans[1] = 001 and the number 1 appears more than once in the first set. I use a variable called bitPosition. Since it's in the first set, bitPosition = 001, and I will set ans[1] = ans[1] OR bitPosition, ensuring that duplicates are not counted more than once.
After all sets are processed using testset, I use the countBits function to calculate how many bits are set to 1 in each element of ans[101]. Once the counting is complete, the output function will check if any element in ans[101] contains two or more 1s, and if so, it will print the index of that element. The flag variable is used to determine whether a comma should be printed; if it’s not the first number, a comma is printed, but if it is the first number, no comma is printed.
### C program
```c
#include <stdio.h>
void testset(int input[], int nums[], int size, int bitPosition)
{
for (int i = 0; i < size; i++)
{
input[nums[i]] |= (1 << bitPosition); // Set the bit corresponding to the bit position
}
}
int countbits(int value) // Count how many 1's in each integer
{
int count = 0;
while (value > 0) {
if (value & 1) // Check if the rightmost bit is 1
count++;
value = value >> 1; // Shift right by one
}
return count;
}
void output(int set[])
{
int flag = 1; // Check if it's the first number to print
printf("[");
for (int i = 0; i < 101; i++) {
if (countbits(set[i]) >= 2) { // Count the number of bits
if (!flag) {
printf(","); // If it's not the first number, print a comma first
}
printf("%d", i); // Print the value
flag = 0; // After printing the first number, set the flag to 0
}
}
printf("]\n");
}
int main() {
int nums1[] = { 1, 1, 3, 2 }; // The first set
int nums2[] = { 2, 3 }; // The second set
int nums3[] = { 3 }; // The third set
int ans[101] = { 0 };
testset(ans, nums1, sizeof(nums1) / sizeof(nums1[0]), 0); // nums1 corresponds to bit position 0
testset(ans, nums2, sizeof(nums2) / sizeof(nums2[0]), 1); // nums2 corresponds to bit position 1
testset(ans, nums3, sizeof(nums3) / sizeof(nums3[0]), 2); // nums3 corresponds to bit position 2
output(ans); // Find numbers that appear in at least two sets
return 0;
}
```
### Assembly code
```clike
#Input: nums1 = [1,1,3,2], nums2 = [2,3], nums3 = [3]
#Input: nums1 = [3,1], nums2 = [2,3], nums3 = [1,2]
#Input: nums1 = [1,2,2], nums2 = [4,3,3], nums3 = [5]
.data
nums1: .word 1,1,3,2 # First set
nums2: .word 2,3 # Second set
nums3: .word 3 # Third set
ans: .zero 404 # Initialize array, size is 101 of 4 bytes
strleft: .asciz "[" # Left bracket
strright: .asciz "]" # Right bracket
comma: .asciz "," # Comma
.text
.globl main
main:
# Initialize nums1, nums2, and nums3
la a0, ans # Load the address of ans into a0
la a1, nums1 # Load the address of nums1 into a1
li a2, 4 # The length of nums1 is 4 (manual adjustment)
li a3, 0 # bitPosition = 0
jal testset # Call testset
la a1, nums2 # Load the address of nums2 into a1
li a2, 2 # The length of nums2 is 2 (manual adjustment)
li a3, 1 # bitPosition = 1
jal testset # Jump to testset
la a1, nums3 # Load the address of nums3 into a1
li a2, 1 # The length of nums3 is 1 (manual adjustment)
li a3, 2 # bitPosition = 2
jal testset # Jump to testset
la a0, ans # Load the address of ans into a0
jal output # Jump to output
li a7, 10 # ecall 10: End program
ecall
testset:
li t0, 0 # Initialize i = 0
li s8, 1
sll t1, s8, a3 # t1 = 1 << bitPosition (corrected as shift left operation)
addi s6, a0, 0 # s6 points to the base of ans
loop_testset:
beq t0, a2, end_testset # If i >= size, exit the loop
slli t2, t0, 2 # Calculate the offset of nums[i] (multiplied by 4)
add t3, a1, t2 # Calculate the address of nums[i]
lw t4, 0(t3) # Load nums[i], which is from nums1 or nums2
slli t5, t4, 2 # Multiply nums[i] by 4 to calculate the offset of ans
add t6, s6, t5 # s6 + offset, pointing to ans[nums[i]]
lw s7, 0(t6) # Load the value of ans[nums[i]]
or s7, s7, t1 # ans[nums[i]] |= (1 << bitPosition)
sw s7, 0(t6) # Store the updated value back to ans[nums[i]]
addi t0, t0, 1 # i++
j loop_testset # Continue the loop
end_testset:
addi a0, s6, 0
ret # Return
output:
li t5, 0 # Initialize i = 0
li t2, 1 # Used to track whether it's the first number
li t6, 6 # Loop end, theoretically should be 101, but for testing this value is reduced (manual adjustment)
add t0, a0, zero # Store the initial address of a0 in t0
jal printf_left # Print left bracket "["
loop_output: # s4: count
bge t5, t6, end_output # If t5 >= t6, exit the loop
slli t3, t5, 2 # Calculate the offset of set[i]
add t4, t0, t3 # Add set address
lw a0, 0(t4) # Load set[i]
li s1, 1
addi a3, a0, 0
jal countbits # Count the number of 1's in set[i], result stored in s4
bge s1, s4, skip # If there are not more than 2 ones, do not print the value
addi a0, t5, 0 # Store the value to print in a0
beqz t2, comma_print # If it's not the first number, print a comma
li t2, 0 # Update the status of the first number
jal print_num # Print the value
addi t5, t5, 1 # i++
j loop_output # Continue the loop
comma_print:
jal printf_comma # Print the comma
addi a0, t5, 0 # Update the value to print
jal print_num # Print the value
skip:
addi t5, t5, 1 # i++
j loop_output # Continue the loop
end_output:
jal printf_right # Print right bracket "]"
ret # Return
countbits: # a3: value (input number)
addi s4, zero, 0 # s4: count = 0
addi s5, zero, 0 # s5: used to store the result of & 1
countbits_loop:
beq a3, zero, exit_countbits # If value == 0, exit the loop
andi s5, a3, 1 # Check if the least significant bit is 1
beq s5, zero, skip_count # If s5 == 0, skip incrementing count
srai a3, a3, 1 # Right shift value by one bit
addi s4, s4, 1 # If s5 == 1, count++
j countbits_loop
skip_count:
srai a3, a3, 1 # Right shift value by one bit
j countbits_loop # Continue the loop
exit_countbits:
ret # Return, the result of the count is stored in s4
print_num:
li a7, 1 # syscall for print_int
ecall
ret # Return
# Print left bracket
printf_left:
la a1, strleft # Load the address of the string into a1
li a7, 64 # System call number 64, used for write
li a0, 1 # File descriptor 1, standard output (stdout)
li a2, 1 # String length
ecall # Execute the system call
ret # Return
# Print right bracket
printf_right:
la a1, strright # Load the address of the string into a1
li a7, 64 # System call number 64, used for write
li a0, 1 # File descriptor 1, standard output (stdout)
li a2, 1 # String length
ecall # Execute the system call
ret # Return
# Print comma
printf_comma:
la a1, comma # Load the address of the string into a1
li a7, 64 # System call number 64, used for write
li a0, 1 # File descriptor 1, standard output (stdout)
li a2, 1 # String length
ecall # Execute the system call
ret # Return
```
## Result
### test data 1
* Output: [3,2]
### test data 2
* Output: [1,2,3]
### test data 3
* Output: []
### C program time complexity
The main function iterates over the sets and calls the testset function for each set. The testset function iterates over the input array, which has a maximum size of 101. Therefore, the overall time complexity is O(n) where n is the total number of elements in all sets.
(https://www.timecomplexity.ai/?id=8d9b6322-8ec9-4578-838a-552e75f432cc)
:::danger
Do not use screenshots for plain text content, as this is inaccessible to visually impaired users.
:::
### RISC-V assembly output
<div align="center">
<img src=https://hackmd.io/_uploads/Skck8Et1yl.png>
<img src=https://hackmd.io/_uploads/Sk1f8NtJke.png>
</div>
## Analysis
I test the code using [Ripes](https://github.com/mortbopet/Ripes) simulator.
### Pseudo instruction
```pseudocode!
00000000 <main>:
0: 10000517 auipc x10 0x10000
4: 01850513 addi x10 x10 24
8: 10000597 auipc x11 0x10000
c: ff858593 addi x11 x11 -8
10: 00200613 addi x12 x0 2
14: 00000693 addi x13 x0 0
18: 040000ef jal x1 64 <testset>
1c: 10000597 auipc x11 0x10000
20: fec58593 addi x11 x11 -20
24: 00200613 addi x12 x0 2
28: 00100693 addi x13 x0 1
2c: 02c000ef jal x1 44 <testset>
30: 10000597 auipc x11 0x10000
34: fe058593 addi x11 x11 -32
38: 00200613 addi x12 x0 2
3c: 00200693 addi x13 x0 2
40: 018000ef jal x1 24 <testset>
44: 10000517 auipc x10 0x10000
48: fd450513 addi x10 x10 -44
4c: 050000ef jal x1 80 <output>
50: 00a00893 addi x17 x0 10
54: 00000073 ecall
00000058 <testset>:
58: 00000293 addi x5 x0 0
5c: 00100c13 addi x24 x0 1
60: 00dc1333 sll x6 x24 x13
64: 00050b13 addi x22 x10 0
00000068 <loop_testset>:
68: 02c28663 beq x5 x12 44 <end_testset>
6c: 00229393 slli x7 x5 2
70: 00758e33 add x28 x11 x7
74: 000e2e83 lw x29 0 x28
78: 002e9f13 slli x30 x29 2
7c: 01eb0fb3 add x31 x22 x30
80: 000fab83 lw x23 0 x31
84: 006bebb3 or x23 x23 x6
88: 017fa023 sw x23 0 x31
8c: 00128293 addi x5 x5 1
90: fd9ff06f jal x0 -40 <loop_testset>
00000094 <end_testset>:
94: 000b0513 addi x10 x22 0
98: 00008067 jalr x0 x1 0
0000009c <output>:
9c: 00000f13 addi x30 x0 0
a0: 00100393 addi x7 x0 1
a4: 00600f93 addi x31 x0 6
a8: 000502b3 add x5 x10 x0
ac: 090000ef jal x1 144 <printf_left>
000000b0 <loop_output>:
b0: 05ff5663 bge x30 x31 76 <end_output>
b4: 002f1e13 slli x28 x30 2
b8: 01c28eb3 add x29 x5 x28
bc: 000ea503 lw x10 0 x29
c0: 00100493 addi x9 x0 1
c4: 00050693 addi x13 x10 0
c8: 03c000ef jal x1 60 <countbits>
cc: 0344d463 bge x9 x20 40 <skip>
d0: 000f0513 addi x10 x30 0
d4: 00038a63 beq x7 x0 20 <comma_print>
d8: 00000393 addi x7 x0 0
dc: 054000ef jal x1 84 <print_num>
e0: 001f0f13 addi x30 x30 1
e4: fcdff06f jal x0 -52 <loop_output>
000000e8 <comma_print>:
e8: 08c000ef jal x1 140 <printf_comma>
ec: 000f0513 addi x10 x30 0
f0: 040000ef jal x1 64 <print_num>
000000f4 <skip>:
f4: 001f0f13 addi x30 x30 1
f8: fb9ff06f jal x0 -72 <loop_output>
000000fc <end_output>:
fc: 05c000ef jal x1 92 <printf_right>
100: 00008067 jalr x0 x1 0
00000104 <countbits>:
104: 00000a13 addi x20 x0 0
108: 00000a93 addi x21 x0 0
0000010c <countbits_loop>:
10c: 02068063 beq x13 x0 32 <exit_countbits>
110: 0016fa93 andi x21 x13 1
114: 000a8863 beq x21 x0 16 <skip_count>
118: 4016d693 srai x13 x13 1
11c: 001a0a13 addi x20 x20 1
120: fedff06f jal x0 -20 <countbits_loop>
00000124 <skip_count>:
124: 4016d693 srai x13 x13 1
128: fe5ff06f jal x0 -28 <countbits_loop>
0000012c <exit_countbits>:
12c: 00008067 jalr x0 x1 0
00000130 <print_num>:
130: 00100893 addi x17 x0 1
134: 00000073 ecall
138: 00008067 jalr x0 x1 0
0000013c <printf_left>:
13c: 10000597 auipc x11 0x10000
140: 07058593 addi x11 x11 112
144: 04000893 addi x17 x0 64
148: 00100513 addi x10 x0 1
14c: 00100613 addi x12 x0 1
150: 00000073 ecall
154: 00008067 jalr x0 x1 0
00000158 <printf_right>:
158: 10000597 auipc x11 0x10000
15c: 05658593 addi x11 x11 86
160: 04000893 addi x17 x0 64
164: 00100513 addi x10 x0 1
168: 00100613 addi x12 x0 1
16c: 00000073 ecall
170: 00008067 jalr x0 x1 0
00000174 <printf_comma>:
174: 10000597 auipc x11 0x10000
178: 03c58593 addi x11 x11 60
17c: 04000893 addi x17 x0 64
180: 00100513 addi x10 x0 1
184: 00100613 addi x12 x0 1
188: 00000073 ecall
18c: 00008067 jalr x0 x1 0
```
### 5-stage pipelined processor
Ripes provide different processors to execution the code. Then I choose 5-stage processor to run my program.

First, introduce the functions of the five stages of the pipeline in sequence.
1. **IF (Instruction Fetch)**
* The instruction memory fetches instructions based on the PC (Program Counter).
* The PC (Program Counter) simultaneously increments by 4 to update its position..
2. **ID (Instruction Decode and register read)**
* Decode the instruction and read the registers corresponding to register source specifiers from the register file
3. **EX (Execute)**
* The ALU is responsible for performing operations such as addition or performing logical operations like AND.
4. **MEM (Memory Access)**
* If a load word (lw) or store word (sw) instruction is executed, data will need to be fetched from data memory.
5. **WB (Write Back to registers)**
* Write the result into the register file
---
The testset function provided will go through each of these five stages as the RISC-V processor works through the instruction set.
I will use the first instruction **li t0, 0** from the test set to demonstrate how it works in the pipeline stages.
**IF (Instruction Fetch)**

* The PC is advanced by 4 bytes to retrieve the instruction. For instance, the first instruction in your test set code, addi x5,x0 0, is retrieved.
* The PC retrieves the machine code version of the instruction (x5,x0 0).
* The PC is updated to the next instruction address (PC + 4).
**ID (Instruction Decode and register read)**

* After decoding the operand, it is determined that the operation is addi.
* Set rd to x5, rs to x0, and immediate to 0.
3. **EX (Execute)**

* The ALU performs the ADDI operation, calculating x0+0.
4. **MEM (Memory Access)**

* Since it does not need to access data memory to retrieve data, no actions are taken in this stage.
5. **WB (Write Back to registers)**

* Store the result of the ALU execution back into x5.