# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <[吳睿秉](https://github.com/XTX7900XTX/Computer-Architecture-2024.git)>
:::danger
Be aware of Markdown headings.
:::
## Quiz1 [Problem C](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C)
:::danger
Use color sparingly, only when necessary to differentiate from plain text. Prioritize content over formatting.
:::
### <font color="white">fabsf</font>
The function fabsf is used to obtain the absolute value of a floating-point number. This is achieved by applying the AND operation with the mask `0x7FFFFFFF`, which sets the sign bit to 0.
#### C
In this C program, the input floating point number's address is first cast to a `uint32_t` type to enable bitwise operations. After this, the `uint32_t` value is masked with `0x7FFFFFFF` to clear the sign bit. Finally, the result is cast back to the original floating point type, which gives the absolute value of the floating point number.
```c
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x;
i &= 0x7FFFFFFF;
x = *(float *)&i;
return x;
}
```
:::danger
Don't paste code snip without comprehensive discussions.
:::
#### RISC-V
There are three test cases, so the len is set to `3` in advance. and the process will print the input to the console before calling the function `fabsf`. After the function `fabsf` returns, the output will be stored in `a0` and then moved to `a1`. Next, the value in `a1` will be compared to the expected answer in `a2`. The console will print `1` if the output equals the expected answer.
| register name | content |
|:------------- |:------------------------------------------------- |
| t0 | data start address (input) |
| t1 | number of data |
| t2 | data start address (expect output) |
| t3 | hold the value in a0 during a syscall temporarily |
| t4 | hold 0x7FFFFFFF temporarily |
| a1 | fabsf output |
| a2 | expect output |
```c
.data
data: .word 0x3f000000, 0xc1420000, 0xf0700000
len: .word 0x3
expect: .word 0x3f000000, 0x41420000, 0x70700000
endl: .string "\n"
.text
la t0, data
lw t1, len
la t2, expect
main:
lw a0, 0(t0) # Show input value
li a7, 2 #
ecall #
jal ra, newline #
jal ra, fabsf # Call fabsf
add a1, zero, a0 # Show output value
li a7, 2 #
ecall #
jal ra, newline #
lw a2, 0(t2) # Check if the output matches the expected result
li a7, 1 #
bne a1, a2, wrong #
right: # If output == expected, print 1
li a0, 1 #
j goback
wrong: # If output != expected, print 0
li a0, 0 #
goback:
ecall
jal ra newline
addi t0, t0, 4 # Iterate data array
addi t1, t1,-1
addi t2, t2, 4 # Iterate expect array
bne t1, zero, main
li a7, 10
ecall
fabsf: # fabsf function
li t4, 0x7fffffff
and a0, a0, t4 # x &= 0x7FFFFFFF;
ret
newline: # print "\n"
mv t3, a0
la a0, endl
li a7, 4
ecall
mv a0, t3
ret
```
#### Testcase
* ##### Case1:
> Input: 0x3f000000 (0.5 in decimal)
> Output: 0x3f000000 (0.5 in decimal)
* ##### Case2:
> Input: 0xc1420000 (-12.125 in decimal)
> Output: 0x41420000 (12.125 in decimal)
* ##### Case3:
> Input: 0xf0700000 (-2.97106e+29 in decimal)
> Output: 0x70700000 (2.97106e+29 in decimal)
#### Console
```
0.5
0.5
1
-12.125
12.125
1
-2.97106e+29
2.97106e+29
1
Program exited with code: 0
```
:::danger
Use fewer instructions.
:::
### <font color="white">my_clz</font>
The `my_clz` function counts the leading zeros in a 32-bit digit. Starting from the leftmost digit and moving toward the right, it counts the zeros until a '1' appears. The function then returns the number of zeros in `t4` register. Note that if the 32-bit digit is all zeros (i.e., 0), the leading zeros are considered undefined.
#### C
This C program simply counts the leading zeros in the input `x`. It starts by setting the `count` to `0`. Then, it iterates from left to right through all 32 bits. Using a left shift operation (`1U << i`), for `i=31` to `0`, it performs an AND operation with `x` to check if the bit is 1. As soon as the first 1 appears from the left side, the loop breaks and returns `count`, which stores the number of leading zeros.
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
#### RISC-V
The RISC-V program for counting leading zeros works as follows. There are four inputs. The program displays the 32-bit input before starting `my_clz`. It uses a loop to accumulate the number of leading zeros until the `t6` register holds a 1, meaning the condition `(x & (1 << i))` is true. At this point, the loop stops, and the count is stored in `t4`. The program automatically checks the answer, and if it's correct, it will print "correct."
| register name | content |
| ------------- |:----------------------------------------------------------------- |
| t0 | data start address (input) |
| t1 | number of data |
| t2 | data start address (expect output) |
| t3 | hold the value in a0 during a syscall temporarily |
| t4 | count |
| t5 | i |
| t6 | (x & (1 << i)) |
| a1 | my_clz output |
| a2 | expect output |
| s0 | store the immediate value 32 to check if the result is undefined. |
```c
.data
data: .word 0x0000002, 0x000000FF, 0xFFFFFFFF, 0x00000000
len: .word 0x4
expect: .word 30, 24, 0, 32
endl: .string "\n"
undef: .string "undefined"
correct: .string " correct"
incorrect: .string " incorrect"
.text
la t0, data
lw t1, len
la t2, expect
main:
lw a0, 0(t0) # Show input value
li a7, 1 #
ecall #
jal ra, newline #
jal ra, my_clz # Call my_clz
mv a1, a0 # Show output value
li s0, 32 # Check if the result is undefined
beq a1, s0, isundef #
li a7, 1 # Show output value
ecall #
lw a2, 0(t2) # Check if the output matches the expected result
li a7, 4 #
bne a1, a2, wrong #
right: # If output == expected, print correct
la a0, correct #
j goback #
wrong: # If output != expected, print incorrect
la a0, incorrect #
j goback #
isundef: # undefined
la a0, undef #
goback:
ecall
jal ra, newline
addi t0, t0, 4
addi t1, t1,-1
addi t2, t2, 4
bne t1, zero, main
li a7, 10
ecall
my_clz: # my_clz function
li t4 0
li t5 31
loop:
li t6, 1 # t6 = 1
sll t6, t6, t5 # t6 = (1 << i)
and t6, a0, t6 # t6 = (x & (1 << i))
bnez t6, quit # if (x & (1U << i)) break;
addi t4, t4, 1
addi t5, t5, -1
bge t5, zero, loop
quit:
mv a0, t4
ret
newline: # to print "\n"
mv t3, a0
la a0, endl
li a7, 4
ecall
mv a0, t3
ret
```
#### Testcase
* ##### Case1:
> Input: 0x0000002 (2 in decimal, `00000000 00000000 00000000 00000010` in binary)
> Output: 30
* ##### Case2:
> Input: 0x000000FF (255 in decimal, `00000000 00000000 00000000 11111111` in binary)
> Output: 24
* ##### Case3:
> Input: 0xFFFFFFFF (-1 in decimal, `11111111 11111111 11111111 11111111` in binary)
> Output: 0
* ##### Case4:
> Input: 0x00000000 (0)
> Output: undefined
#### Console
```
2
30 correct
255
24 correct
-1
0 correct
0
undefined
Program exited with code: 0
```
### <font color="white">fp16_to_fp32</font>
The purpose of fp16_to_fp32 is to convert a 16-bit floating-point number to a 32-bit floating-point number. Due to the format differences between FP16 and FP32, certain transformations are required during the conversion process.
There are several steps involved. First, the input `h` is shifted left by 16 bits into a 32-bit `uint32_t`, where the upper bits hold the FP16 data and the lower bits remain zero.
Second, the `sign` bit is extracted, and the `nonsign` bits (i.e., exponent and mantissa) are separated.
The third step involves calculating the renormalization shift. The function `my_clz`, as mentioned earlier, is used to count the leading zeros in the nonsign part (exponent and mantissa). The result is adjusted by subtracting 5 if it's greater than 5, otherwise it's set to 0. This shift value is used to normalize the mantissa by shifting it left and then right.
The fourth step handles the special cases for the exponent being all ones (infinity or NaN). In FP16, an exponent of 0x1F (31 in decimal) indicates either infinity or NaN. The result is right-shifted by 8 bits to move the exponent to the correct position in the 32-bit representation. The mask `0x7F800000` ensures that only the exponent bits are retained in the final result.
In the fifth step, the zero_mask is used to checks if `nonsign` is zero. If so, equals 0, subtracting 1 causes overflow, setting bit 31 to 1. If nonsign is not zero, bit 31 remains 0.
Finally, the function combines the sign, the normalized nonsign, and applies the `inf_nan_mask` and `zero_mask` to handle cases of `infinity`, `NaN`, and `zero`, ultimately returning the 32-bit floating-point value.
#### C
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
#### RISC-V
This RISC-V program executes the process described in the C program, utilizing the previously mentioned `my_clz` function to convert from FP16 to FP32.
| register name | content |
|:------------- |:---------------------------------------------------- |
| s1 | h, w, renorm_shift |
| s2 | sign |
| s3 | nonsign |
| s4 | inf_nan_mask |
| s5 | zero_mask |
| s6 | (nonsign << renorm_shift >> 3) |
| s7 | ((0x70 - renorm_shift) << 23) |
| s8 | result |
| s11 | Temporarily store an immediate value for operations. |
```c
.data
data: .word 0x3C00, 0xCC10, 0x962C
len: .word 0x3
expect: .word 0x3F800000, 0xC1820000, 0xBAC58000
endl: .string "\n"
correct: .string " correct"
incorrect: .string " incorrect"
.text
la t0, data
lw t1, len
la t2, expect
main:
lw a0, 0(t0) # Show input value
li a7, 1 #
ecall #
jal ra, newline #
jal ra, fp16_to_fp32 # Call my_clz
mv a1, a0 #
li a7, 1 # Show output value
ecall #
lw a2, 0(t2) # Check if the output matches the expected result
li a7, 4 #
bne a1, a2, wrong #
right: # If output == expected, print correct
la a0, correct #
j goback #
wrong: # If output != expected, print incorrect
la a0, incorrect #
goback:
ecall
jal ra, newline
addi t0, t0, 4
addi t1, t1,-1
addi t2, t2, 4
bne t1, zero, main
li a7, 10
ecall
fp16_to_fp32:
slli a0 a0 16
li s11 0x80000000
and s2 a0 s11
li s11 0x7FFFFFFF
and s3 a0 s11
mv a0 s3
my_clz: # my_clz function
li t4 0
li t5 31
loop:
li t6, 1 # t6 = 1
sll t6, t6, t5 # t6 = (1 << i)
and t6, a0, t6 # t6 = (x & (1 << i))
bnez t6, quit # if (x & (1U << i)) break;
addi t4, t4, 1
addi t5, t5, -1
bge t5, zero, loop
quit:
mv s1 t4
li s11 5
blt s11 s1 Sub_Five # if (5 < renorm_shift)
li s1 0
j continue
Sub_Five:
addi s1 s1 -5 # renorm_shift -= 5
continue:
li s11 0x04000000
add s4 s3 s11
srli s4 s4 8
li s11 0x7F800000
and s4 s4 s11
addi s5 s3 -1
srli s5 s5 31
sll s6 s3 s1
srli s6 s6 3
li s11 0x70
sub s7 s11 s1
slli s7 s7 23
add s8 s6 s7
or s8 s8 s4
xori s5 s5 -1
and s8 s8 s5
or s8 s8 s2
mv a0 s8
ret
newline: # to print "\n"
mv t3, a0
la a0, endl
li a7, 4
ecall
mv a0, t3
ret
```
#### Testcase
* ##### Case1:
> Input: 0x3C00 (1 in fp16, 15360 in decimal on the console)
> Output: 0x3F800000 (1 in fp32, 1065353216 in decimal on the console)
* ##### Case2:
> Input: 0xCC10 (-16.25 in fp16, 52240 in decimal on the console)
> Output: 0xC1820000 (-16.25 in fp32, -1048444928 in decimal on the console)
* ##### Case3:
> Input: 0x962C (-0.001507 in fp16, 38444 in decimal on the console)
> Output: 0xBAC58000 (-0.0015068054 in fp32, -1161461760 in decimal on the console)
#### Console
```
15360
1065353216 correct
52240
-1048444928 correct
38444
-1161461760 correct
Program exited with code: 0
```
## [3158. Find the XOR of Numbers Which Appear Twice](https://leetcode.com/problems/find-the-xor-of-numbers-which-appear-twice/description/)
### <font color="white">Description</font>
You are given an array `nums`, where each number in the array appears either once or twice.
Return the bitwise `XOR` of all the numbers that appear twice in the array, or 0 if no number appears twice.
#### Example:
> Input: nums = [1,2,2,1]
> Output: 3
> Explanation: Numbers 1 and 2 appeared twice. `1 XOR 2 == 3`.
#### Constraints:
* `1 <= nums.length <= 50`
* `1 <= nums[i] <= 50`
* Each number in `nums` appears either once or twice.
---
### <font color="white">Solution - Naive Approach</font>
This problem can easily be solved with a nested loop that checks the values in the array. If a value repeats, perform the XOR operation to get the final answer. This approach is straightforward and easy to understand.
#### C
```c
int duplicateNumbersXOR(int* nums, int numsSize) {
int xor = 0;
for(int i=0;i<numsSize;++i)
for(int j=i+1;j<numsSize;++j)
if(nums[i]==nums[j])
xor ^= nums[i];
return xor;
}
```
#### RISC-V
| register name | content |
|:------------- |:--------------- |
| s3 | nums[i] address |
| s9 | nums[i] value |
| s11 | i |
| s4 | nums[j] address |
| t2 | nums[j] value |
| t1 | j |
```c
.text
lw s1, len
la s2, size
la s3, dataset # for nums[i]
main:
lw a1, 0(s2) # a1 = numsSize
li s5, 0
jal ra, dupNumXOR # call dupNumXOR
li a7, 1 # show output
ecall #
addi s2, s2, 4 # the next dataset
addi s1, s1, -1 # < len keep doing
bne s1, zero, main
li a7, 10
ecall
dupNumXOR:
li s6, 0
li s11, 0
loop_i:
bge s11, a1, loop_i_exit
addi t1, s11, 1
addi s4, s3, 4 # for nums[j]
loop_j:
bge t1, a1, loop_j_exit
lw s9, 0(s3)
lw t2, 0(s4)
bne s9, t2, skipxor
xor s6, s6, s9
skipxor:
addi t1, t1, 1 # j++
addi s4, s4, 4
j loop_j
loop_j_exit:
addi s11, s11, 1 # i++
addi s3, s3, 4
j loop_i
loop_i_exit:
mv a0, s6
ret
```
| Execution info | Value |
|:--------------- |:----- |
| Cycles | 3047 |
| Instrs. retired | 1816 |
| CPI | 1.68 |
| IPC | 0.596 |
This problem can easily be solved with **Naive Approach**. While it’s easy to understand, it is far from being the most time-efficient solution. After studying Problem C in Quiz 1, I applied a similar concept by using bitwise shift and AND operations to efficiently maintain information.
### <font color="white">Solution - Optimized Bitwise Approach</font>
This problem has the constraint that `nums[i] < 50`. Instead of using a naive `O(n²)` solution, a 64-bit variable (uint64_t) can be used to record whether each number has appeared. Since 64 bits exceed the range of 50, using this method is efficient and sufficient for this problem.
If a number appears for the first time, set the corresponding bit in the uint64_t using an OR operation (i.e., `set |= (1ULL << nums[i])` at line 8). To check if the number has appeared a second time, use an AND operation(i.e., `set & (1ULL << nums[i])` at line 6). If it has, then perform the XOR operation.
#### C
```c=
#include <stdint.h>
int duplicateNumbersXOR(int *nums, int numsSize){
uint64_t set = 0ULL;
int xor = 0;
for (int i = 0; i < numsSize; ++i){
if (set & (1ULL << nums[i]))
xor ^= nums[i];
set |= (1ULL << nums[i]);
}
return xor;
}
```
```c
#include <stdio.h>
#define test1_Size 4
#define test1_ans 1
int test1_data[test1_Size] = {1, 2, 1, 3};
#define test2_Size 14
#define test2_ans 12
int test2_data[test2_Size] = {1, 2, 3, 4, 6, 7, 8, 2, 9, 9, 3, 5, 1, 5};
#define test3_Size 16
#define test3_ans 43
int test3_data[test3_Size] = { 3, 49, 49, 3, 8, 12, 19, 19,
28, 27, 28, 50, 31, 36, 50, 36};
int main()
{
int ans1 = duplicateNumbersXOR(test1_data, test1_Size);
printf("%2d, %2d, %2d\n", ans1, test1_ans, ans1 == test1_ans);
int ans2 = duplicateNumbersXOR(test2_data, test2_Size);
printf("%2d, %2d, %2d\n", ans2, test2_ans, ans2 == test2_ans);
int ans3 = duplicateNumbersXOR(test3_data, test3_Size);
printf("%2d, %2d, %2d\n", ans3, test3_ans, ans3 == test3_ans);
}
```
#### Console
```
1, 1, 1
12, 12, 1
43, 43, 1
```
In the C program, the XOR result is stored in a uint64_t variable. However, in RISC-V, the registers are only 32 bits wide. If we use only one 32-bit register (`s7` register in **Wrong version of dupNumXOR function**)to store the result, it could lead to incorrect answers.
#### Wrong version of dupNumXOR function
```c
dupNumXOR:
li s6 0 # xor result
li s7 0 # set
li s11 0 # i
li t5 1 # const 1
loop:
lw s9, 0(s3) # s9 = nums[i]
sll s10, t5, s9 # set &= (1 << nums[i])
and s10, s7, s10 #
beqz s10, skipxor
xor s6, s6 ,s9 # xor ^= nums[i];
skipxor:
sll s10, t5, s9 # set |= (1ULL << nums[i])
or s7, s7, s10 #
addi s3, s3, 4 # iterate from nums[i] to nums[i+1]
addi s11, s11 1 # i<numsSize; i++;
bge s11, a1, quit #
j loop
quit:
mv a0 s6 # s6 is the xor result
ret
```
This scenario is similar to the **C** code, where changing` uint64_t set = 0ULL;` to `uint32_t set = 0ULL;` in line 3 would also lead to an incorrect result.
In this case, if `nums[i] > 31`, the information from `(1ULL << nums[i])` cannot be stored in a single 32-bit register.
To correctly store a 64-bit unsigned integer in RISC-V, we need to utilize two 32-bit registers(in **RISC-V Corrected version**, register `s7` and `s8`).
#### RISC-V Corrected version
```c
.data
len: .word 3 # number of dataset
size: .word 4, 14, 16 # length of every dataset
dataset:
.word 1, 2, 1, 3
.word 1, 2, 3, 4, 6, 7, 8, 2, 9, 9, 3, 5, 1, 5
.word 3, 49, 49, 3, 8, 12, 19, 19, 28, 27, 28, 50, 31, 36, 50, 36
expect: .word 1, 12, 43
endl: .string "\n"
.text
lw s1, len
la s2, size
la s3, dataset
main:
lw a1, 0(s2) # numsSize
li s5, 0
jal ra dupNumXOR # (s5 < numsSize) print(nums[i])
li a7, 1
ecall
goback:
addi s2, s2,4
jal ra newline
addi s1, s1,-1 # < len keep doing
bne s1, zero, main
li a7, 10
ecall
dupNumXOR: # duplicateNumbersXOR function
li s6 0 # xor result
li s7 0 # set_A (the left half bits)
li s8 0 # set_B (the right half bits)
li s11 0
li t5 1
loop:
lw s9, 0(s3) # s9 = nums[i]
li t4 32
bge s9 t4 left_half_bits
right_half_bits: # s8: set_B (the right half bits)
sll s10, t5, s9
and s10, s8, s10
beqz s10, right_skipxor # if not(set & (1ULL << nums[i]))
xor s6, s6 ,s9
right_skipxor:
sll s10, t5, s9
or s8, s8, s10
j loop_goback
left_half_bits: # s7: set_A (the left half bits)
addi t4 s9 -31 # (nums[i] - 31)
sll s10, t5, t4
and s10, s7, s10
beqz s10, left_skipxor # if not(set & (1ULL << nums[i]))
xor s6, s6 ,s9
left_skipxor:
sll s10, t5, t4
or s7, s7, s10
loop_goback:
addi s3, s3, 4 # iter nums[i] -> nums[i+1]
addi s11 s11 1 # i<numsSize; i++;
bge s11, a1, quit #
j loop
quit:
mv a0 s6
ret
newline: # to print "\n"
mv t6, a0
la a0, endl
li a7, 4
ecall
mv a0, t6
ret
```
| Execution info | Value |
|:--------------- |:----- |
| Cycles | 754 |
| Instrs. retired | 528 |
| CPI | 1.43 |
| IPC | 0.7 |
#### RISC-V Full Revised Version
Eventually, after slightly rearranging the registers in the checks for upper and lower bits and improving the branch conditions, I revised the code, reducing the number of instructions and cycles.
| register name | content |
|:------------- |:------------------------------------------------- |
| t4 | immediate 32 for comparision, (nums[i] - 31) |
| t5 | const 1 |
| t6 | hold the value in a0 during a syscall temporarily |
| a1 | array length (numsSize) |
| s1 | number of dataset |
| s2 | length of every dataset base address |
| s3 | dataset base address |
| s4 | expect base address |
| s5 | currently doing x-th in a dataset |
| s6 | xor result |
| s7 | set_A (the left half bits) |
| s8 | set_B (the right half bits) |
| s9 | nums[i] |
| s10 | (set & (1ULL << nums[i])) |
| s11 | i |
```c
.data
len: .word 3 # number of dataset
size: .word 4, 14, 16 # length of every dataset
dataset:
.word 1, 2, 1, 3
.word 1, 2, 3, 4, 6, 7, 8, 2, 9, 9, 3, 5, 1, 5
.word 3, 49, 49, 3, 8, 12, 19, 19, 28, 27, 28, 50, 31, 36, 50, 36
expect: .word 1, 12, 43
endl: .string "\n"
.text
lw s1, len
la s2, size
la s3, dataset
main:
lw a1, 0(s2) # numsSize
li s5, 0
jal ra dupNumXOR # (s5 < numsSize) print(nums[i])
li a7, 1
ecall
jal ra newline # print "\n"
addi s2, s2,4 # the next dataset
addi s1, s1,-1 # < len keep doing
bne s1, zero, main # goback
li a7, 10 # quit
ecall
dupNumXOR: # duplicateNumbersXOR function
li s6 0 # xor result
li s7 0 # set_A (the left half bits)
li s8 0 # set_B (the right half bits)
li s11 0
li t5 1
loop:
lw s9, 0(s3) # s9 = nums[i]
li t4 32
bge s9 t4 left_half_bits # If nums[i] > 31, it belongs to the upper half of the bits
right_half_bits: # s8: set_B (the right half bits)
sll t4, t5, s9
and s10, s8, t4
beqz s10, right_skipxor # if not(set & (1ULL << nums[i]))
xor s6, s6 ,s9
right_skipxor:
or s8, s8, t4
j loop_goback
left_half_bits: # s7: set_A (the left half bits)
addi t4, s9, -31 # (nums[i] - 31)
sll t4, t5, t4
and s10, s7, t4
beqz s10, left_skipxor # if not(set & (1ULL << nums[i]))
xor s6, s6, s9
left_skipxor:
or s7, s7, t4
loop_goback:
addi s3, s3, 4 # iter nums[i] -> nums[i+1]
addi s11, s11, 1 # i<numsSize; i++;
blt s11, a1, loop
mv a0 s6 # quit
ret
newline: # to print "\n"
mv t6, a0
la a0, endl
li a7, 4
ecall
mv a0, t6
ret
```
| Execution info | Value |
|:--------------- |:----- |
| Cycles | 683 |
| Instrs. retired | 463 |
| CPI | 1.48 |
| IPC | 0.678 |
#### Testcase
* ##### Case1:
> Input: nums = [1, 2, 1, 3]
> Output: 1
* ##### Case2:
> Input: nums = [1, 2, 3, 4, 6, 7, 8, 2, 9, 9, 3, 5, 1, 5]
> Output: 12
* ##### Case3:
> Input: nums = [3, 49, 49, 3, 8, 12, 19, 19, 28, 27, 28, 50, 31, 36, 50, 36]
> Output: 43
#### Console
```
1
12
43
Program exited with code: 0
```
---
### <font color="white">Analysis</font>
In the [Ripes](https://ripes.me/) simulator, the CPU is organized as a classic five-stage pipeline, with pipeline registers between each stage. An instruction will sequentially go through these stages, allowing for different instructions to overlap in execution. For instance, consider the instruction `lw a1, 0(s2)` , which retrieves the starting address of the `size` array and stores it in the a1 register. This example illustrates how the pipeline operates in the following five stages:

#### **1. Instruction fetch (IF):**
In RISC-V, the register `a1` is represented as `x11`, and the register `s2` is represented as `x18`. Therefore, the instruction becomes `lw x11, 0 x18`, as shown below, which means loading a word from the address specified by `s2` (in `x18`) into `a1` (in `x11`). The `lw` instruction format is as follows:
`lw rd, offset(rs1)`
| 31 ~ 20 | 19 ~ 15 | 14 ~ 12 | 11 ~ 7 | 6 ~ 0 |
|:------------:|:-------:|:-------:|:------:|:-------:|
| imm [11:0] | rs1 | funct3 | rd | opcode |
| 0 | x18 | | x11 | |
| 000000000000 | 10010 | 010 | 01011 | 0000011 |
According to the Program Counter (PC), the instruction address in instruction memory is `0x00000018`, which retrieves the instruction `00092583`. The PC will then load the address of the next instruction, which is the current instruction address plus 4 bytes (i.e., 0x0000001C).

#### 2. Instruction decode (ID)
The instruction is decoded from binary according to the above table, and control signals are generated to determine what the instruction is supposed to do.
Additionally, this stage reads the values from the source registers in the register file. In this case, `rs1`, which corresponds to `s2`, contains the starting address of the `size` array in the data section. The starting address is exactly `0x10000004`, and the offset immediate is `0`.

In memory, the orange area indicates the address and the words of the `size` array that are stored. The starting address is `0x10000004`, with space for 3 words. The corresponding data is `4, 14, 16`, as previously defined.

#### 3. Execute (EX):
Arithmetic or logical operations are performed in this stage. The address calculation uses the base address from the source register `s2` and the offset (which is `0`). The multiplexer selects the register value from the previous step (i.e., the ID stage). The ALU adds these two values together, resulting in an effective address that is the same as the value stored in `s2`.

#### 4. Memory access (MEM)
For memory-related instructions, this stage handles reading from or writing to memory. In this stage, it retrieves the word stored at the address specified by the offset in data memory. As a result, the first word in the `size` array, which is `4`, will be retrieved.

#### 5. Writeback (WB)
The word from data memory is written back to the register file. The result `4`, is stored in the destination register `a1` (`0b`).

## Reference
* Assignment1
https://hackmd.io/@sysprog/2024-arch-homework1
* Ripes simulator
https://ripes.me/
<s>
* RISC-V 指令集架構介紹 - Integer Calling convention
https://tclin914.github.io/77838749/
* Classic RISC pipeline
https://en.wikipedia.org/wiki/Classic_RISC_pipeline
</s>
:::danger
Always refer to primary sources, such as official RISC-V documentation.
:::