# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < `404allen404` >
Successfully executed the factorial function in Ripes and successfully generated output through ecall in the console

### [Quiz1: Problem C](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C)
:::danger
Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data.
:::
### [LeetCode: 2917. Find the K-or of an Array](https://leetcode.com/problems/find-the-k-or-of-an-array/description/?envType=problem-list-v2&envId=bit-manipulation)
**Description**
You are given an integer array nums, and an integer k. Let's introduce K-or operation by extending the standard bitwise OR. In K-or, a bit position in the result is set to 1 if at least k numbers in nums have a 1 in that position.
Return the K-or of nums.
**C Code Ver.1**
```c
#include <stdio.h>
int findKOr(int* nums, int numsSize, int k) {
/* Assume bitwidth of int is 32-bit */
int res = 0;
int bit_cnt[32] = {0};
for (int n = 0; n < numsSize; ++n) {
int num = nums[n];
for (int bit_idx = 0; bit_idx < 32; ++bit_idx) {
if (num & ((unsigned)1 << bit_idx)) {
if (++bit_cnt[bit_idx] == k) {
res |= 1 << bit_idx;
}
}
}
}
return res;
}
int main (void) {
int nums[6] = {7, 12, 9, 8, 9, 15};
int res = findKOr(nums, 6, 4);
printf("res: %d\n", res);
return 0;
}
```
**C Code Ver.2**
```c
#include <stdio.h>
int findKOr(int* nums, int numsSize, int k) {
/* Assume bitwidth of int is 32-bit */
int res = 0;
int bit_cnt[32] = {0};
for (int n = 0; n < numsSize; ++n) {
int num = nums[n];
for (int bit_idx = 0; bit_idx < 32; ++bit_idx) {
if (num & ((unsigned)1 << bit_idx)) {
if (++bit_cnt[bit_idx] == k) {
res |= 1 << bit_idx;
}
}
}
}
return res;
}
int main (void) {
int k[3] = {4, 2, 5};
int nums[3][6] = {
{7, 12, 9, 8, 9, 15}, // answer is 9
{13432, 343, 4, 143, 3411, 23}, // answer is 1375
{533, 5552, 24, 55, 8792, 9889} // answer is 16
};
int answer[3] = {9, 1375, 16};
for (int i = 0; i < 3; ++i) {
int res = findKOr(nums[i], 6, k[i]);
if (res == answer[i]) {
printf("Test case %d passed\n", i + 1);
} else {
printf("Test case %d failed. Expected: %d, Got: %d\n", i + 1, answer[i], res);
}
}
return 0;
}
```
**Assembly Code Ver.1**
:::danger
Use fewer instructions.
:::
```s
.data
nums: .word 7, 12, 9, 8, 9, 15
numsSize: .word 6
k: .word 4
str_1: .string "res: "
str_2: .string "\n"
.text
main:
la a0, nums
lw a1, numsSize
lw a2, k
jal ra, findKOr
jal ra, printResult
li a7, 10
ecall
findKOr:
addi sp, sp, -144
clear_bit_cnt:
sw x0, 0(sp)
sw x0, 4(sp)
sw x0, 8(sp)
sw x0, 12(sp)
sw x0, 16(sp)
sw x0, 20(sp)
sw x0, 24(sp)
sw x0, 28(sp)
sw x0, 32(sp)
sw x0, 36(sp)
sw x0, 40(sp)
sw x0, 44(sp)
sw x0, 48(sp)
sw x0, 52(sp)
sw x0, 56(sp)
sw x0, 60(sp)
sw x0, 64(sp)
sw x0, 68(sp)
sw x0, 72(sp)
sw x0, 76(sp)
sw x0, 80(sp)
sw x0, 84(sp)
sw x0, 88(sp)
sw x0, 92(sp)
sw x0, 96(sp)
sw x0, 100(sp)
sw x0, 104(sp)
sw x0, 108(sp)
sw x0, 112(sp)
sw x0, 116(sp)
sw x0, 120(sp)
sw x0, 124(sp)
store_arg:
sw ra, 128(sp)
sw a2, 132(sp) # k
sw a1, 136(sp) # numsSize
sw a0, 140(sp) # address of nums
add t0, x0, x0 # t0 = res = 0
add t1, x0, x0 # t1 = n = 0
addi t4, x0, 32 # t4 = 32
findKOr_loop_0:
bge t1, a1, findKOr_end # n >= numsSize
slli t3, t1, 2 # t3 = 4 * n
add t3, t3, a0 # t3 = a0 + 4 * n
lw t2, 0(t3) # t2 = num
add t3, x0, x0 # t3 = bit_idx = 0
findKOr_loop_1:
bge t3, t4, findKOr_loop_0_pre # bit_idx >= 32
addi t5, x0, 1 # t5 = 1
sll t5, t5, t3 # t5 = 1 << bit_idx
and t5, t5, t2 # t5 = num & (1 << bit_idx)
beq t5, x0, findKOr_loop_1_pre
slli t5, t3, 2 # t5 = bit_idx * 4
add t5, t5, sp
lw t6, 0(t5) # t6 = bit_cnt[bit_idx]
addi t6, t6, 1 # t6 = t6 + 1
sw t6, 0(t5)
bne t6, a2, findKOr_loop_1_pre
addi t5, x0, 1 # t5 = 1
sll t5, t5, t3 # t5 = 1 << bit_idx
or t0, t0, t5 # res |= 1 << bit_idx
findKOr_loop_1_pre:
addi t3, t3, 1 # ++bit_idx
j findKOr_loop_1
findKOr_loop_0_pre:
addi t1, t1, 1 # ++n
j findKOr_loop_0
findKOr_end:
lw ra, 128(sp)
lw a2, 132(sp)
lw a1, 136(sp)
lw a0, 140(sp)
addi sp, sp, 144
add a0, t0, x0
ret
printResult:
add t0, a0, x0
la a0, str_1
li a7, 4
ecall
add a0, t0, x0
li a7, 1
ecall
la a0, str_2
li a7, 4
ecall
ret
```
:::danger
You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.
:::
**Assembly Code Ver.2**
When allocating the stack, initially 32 sw instructions were used to zero out the stack. In version 2, it was changed to use a loop to zero it out, reducing the instructions from 32 to 8.
In Ver.2, an automated testing mechanism was added, which can automatically compare the results and easily add new test data.
```s
.data
# Number arrays
nums1: .word 7, 12, 9, 8, 9, 15
nums2: .word 13432, 343, 4, 143, 3411, 23
nums3: .word 533, 5552, 24, 55, 8792, 9889
# Test case structure: address of array, size of array, k, expected result
test_cases:
.word nums1, 6, 4, 9
.word nums2, 6, 2, 1375
.word nums3, 6, 5, 16
.word 0 # End marker
str_test: .string "Test case "
str_pass: .string " passed\n"
str_fail: .string " failed. Expected: "
str_got: .string ", Got: "
str_nl: .string "\n"
.text
main:
la s0, test_cases # s0 points to the current test case
addi s1, x0, 1 # s1 is the test case counter
run_tests:
lw t0, 0(s0) # Load address of nums
beq t0, x0, end_tests # If address is 0, end tests
lw a0, 0(s0) # Load address of nums
lw a1, 4(s0) # Load size
lw a2, 8(s0) # Load k
lw a3, 12(s0) # Load expected result
jal ra, findKOr
# Check result
beq a0, a3, test_passed
# Test failed
la a0, str_test
li a7, 4
ecall
add a0, x0, s1
li a7, 1
ecall
la a0, str_fail
li a7, 4
ecall
add a0, x0, a3 # Expected result
li a7, 1
ecall
la a0, str_got
li a7, 4
ecall
add a0, x0, t0 # Actual result
li a7, 1
ecall
la a0, str_nl
li a7, 4
ecall
jal x0, next_test
test_passed:
la a0, str_test
li a7, 4
ecall
mv a0, s1
li a7, 1
ecall
la a0, str_pass
li a7, 4
ecall
next_test:
addi s0, s0, 16 # Move to next test case
addi s1, s1, 1 # Increment test case counter
jal x0, run_tests
end_tests:
li a7, 10
ecall
findKOr:
addi sp, sp, -144
sw ra, 128(sp)
sw a2, 132(sp) # k
sw a1, 136(sp) # numsSize
sw a0, 140(sp) # address of nums
add t0, x0, x0 # t0 = res = 0
add t1, x0, x0 # t1 = n = 0
addi t4, x0, 32 # t4 = 32
clear_bit_cnt_init:
add t2, x0, x0
addi t3, x0, 32
add t5, sp, x0 # t5 = sp
clear_bit_cnt:
bge t2, t3, findKOr_loop_0
sw x0, 0(t5)
addi t5, t5, 4
addi t2, t2, 1
jal x0, clear_bit_cnt
findKOr_loop_0:
bge t1, a1, findKOr_end # n >= numsSize
slli t3, t1, 2 # t3 = 4 * n
add t3, t3, a0 # t3 = a0 + 4 * n
lw t2, 0(t3) # t2 = num
add t3, x0, x0 # t3 = bit_idx = 0
findKOr_loop_1:
bge t3, t4, findKOr_loop_0_pre # bit_idx >= 32
addi t5, x0, 1 # t5 = 1
sll t5, t5, t3 # t5 = 1 << bit_idx
and t5, t5, t2 # t5 = num & (1 << bit_idx)
beq t5, x0, findKOr_loop_1_pre
slli t5, t3, 2 # t5 = bit_idx * 4
add t5, t5, sp
lw t6, 0(t5) # t6 = bit_cnt[bit_idx]
addi t6, t6, 1 # t6 = t6 + 1
sw t6, 0(t5)
bne t6, a2, findKOr_loop_1_pre
addi t5, x0, 1 # t5 = 1
sll t5, t5, t3 # t5 = 1 << bit_idx
or t0, t0, t5 # res |= 1 << bit_idx
findKOr_loop_1_pre:
addi t3, t3, 1 # ++bit_idx
j findKOr_loop_1
findKOr_loop_0_pre:
addi t1, t1, 1 # ++n
j findKOr_loop_0
findKOr_end:
lw ra, 128(sp)
lw a2, 132(sp)
lw a1, 136(sp)
lw a0, 140(sp)
addi sp, sp, 144
add a0, t0, x0
ret
```
**Result (Ver.2)**
- C Code
| Metrics | Value |
| --------------- | ----- |
| Cycle | 18112 |
| Instrs. retired | 12630 |
| IPC | 0.697 |
| CPI | 1.43 |
- Assembly Code
| Metrics | Value |
| --------------- | ----- |
| Cycle | 7941 |
| Instrs. retired | 5254 |
| IPC | 0.662 |
| CPI | 1.51 |
### CPU Pipeline Analysis
- **PC, Instruction memory and Compressed Decoder**
The **Program Counter (PC)** serves as the address input for fetching instructions from the instruction memory.
If the compressed instruction extension (RVC) is supported, the PC increments by either **2** for a 16-bit compressed instruction or by **4** for a 32-bit instruction.
The **multiplexer** before the PC selects between:
1. `pc + 2` or `pc + 4` (for sequential instruction fetch)
2. The result of the ALU (for jump or branch instructions: `jal`, `jalr`, or B-type instructions)
When the branch unit signals a branch or jump, the PC multiplexer selects the ALU result.
The **compressed decoder** converts 16-bit compressed instruction into a 32-bit format, enabling the next-level decoder to decode all instructions uniformly.

- **IF/ID Stage (Instruction Fetch to Instruction Decode)**
The **IF/ID pipeline register** temporarily stores the fetched instruction and the corresponding PC value, ensuring the proper handoff from the fetch stage to the decode stage.
This ensures that even when a stall occurs, the next pipeline stage can process the instruction correctly.

- **Decode Stage: Instruction Decoding, Register File and Immediate Generation**
The decode module extracts relevant fields such as `rd`, `rs1`, `rs2`, and the opcode. These fields drive the control signals for later stages.
The **register file** consists 32 general-prupose registers (x0-x31), with two read ports (`rs1`, `rs2`) for the source registers and one write port for writing back the result (`rd`).
The **immediate generation** logic extracts the appropriate immediate value from the instruction (I-type, S-type, B-type, etc.) based on the instruction format.

- **ID/EX Stage (Instruction Decode to Execution)**
The **ID/EX pipeline register** stores the register operands, the immediate value, and control signals generated by the decode stage, preparing them for the execution stage.

- **ALU and Branch Unit (Execution Stage)**
The **ALU** performs operations such as addition, subtraction AND, OR and shifts, as defined by the instruction.
The **branch unit** compares register values (`x[rs1]` and `x[rs2]`) and determines whether to take a branch based on the branch condition (`beq`, `bne`, etc.).

- **Mux 1** selects either the PC or the `x[rs1]` register as the operand for the ALU, depending on whether the instruction is a jump or a regular arithmetic operation.

- **Mux 2** selects between `x[rs2]` and the immediate value, used for operation involving an immediate (e.g., `addi`) or register (e.g. `add`).

- **Mux 3 and Mux 4** handle **forwarding** to avoid stalls due to data hazards. They select the correct operands for the ALU by choosing between values from the ID/EX stage, the memory stage, or the writeback stage.


- **EX/MEM Stage (Execution to Memory)**
The **EX/MEM pipeline register** stores the ALU result and the data to be written to memory, as well as control signals for memory access. The ALU result can either be an address for a memory operation (load/store) or the result of an arithmetic/logic operation.

- **Data Memory (Memory Stage)**
The **data memory** is accessed for load (`lw`, `lb`, `lh`) and store (`sw`, `sh`, `sb`) instruction.
The address is calculated in the ALU during the execution stage, and the result is either fetched from or written to memory based on the instruction.

- **MEM/WB Stage (Memory to Writeback)**
The **MEM/WB pipeline register** stores either the data read from memory or the ALU result from the previous stage. This data is prepared for writing back into the register file in the final stage of the pipelien.

- **WB Mux (Writeback Stage)**
The **WB Mux** selects between:
1. `PC + 4` (for `jal`/`jalr` instructions)
2. The ALU result (for arithmetic/logic instructions)
3. The data fetched from memory (for load instructions)
