# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < `404allen404` > Successfully executed the factorial function in Ripes and successfully generated output through ecall in the console ![image](https://hackmd.io/_uploads/BJ8EjxBRR.png) ### [Quiz1: Problem C](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C) :::danger Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data. ::: ### [LeetCode: 2917. Find the K-or of an Array](https://leetcode.com/problems/find-the-k-or-of-an-array/description/?envType=problem-list-v2&envId=bit-manipulation) **Description** You are given an integer array nums, and an integer k. Let's introduce K-or operation by extending the standard bitwise OR. In K-or, a bit position in the result is set to 1 if at least k numbers in nums have a 1 in that position. Return the K-or of nums. **C Code Ver.1** ```c #include <stdio.h> int findKOr(int* nums, int numsSize, int k) { /* Assume bitwidth of int is 32-bit */ int res = 0; int bit_cnt[32] = {0}; for (int n = 0; n < numsSize; ++n) { int num = nums[n]; for (int bit_idx = 0; bit_idx < 32; ++bit_idx) { if (num & ((unsigned)1 << bit_idx)) { if (++bit_cnt[bit_idx] == k) { res |= 1 << bit_idx; } } } } return res; } int main (void) { int nums[6] = {7, 12, 9, 8, 9, 15}; int res = findKOr(nums, 6, 4); printf("res: %d\n", res); return 0; } ``` **C Code Ver.2** ```c #include <stdio.h> int findKOr(int* nums, int numsSize, int k) { /* Assume bitwidth of int is 32-bit */ int res = 0; int bit_cnt[32] = {0}; for (int n = 0; n < numsSize; ++n) { int num = nums[n]; for (int bit_idx = 0; bit_idx < 32; ++bit_idx) { if (num & ((unsigned)1 << bit_idx)) { if (++bit_cnt[bit_idx] == k) { res |= 1 << bit_idx; } } } } return res; } int main (void) { int k[3] = {4, 2, 5}; int nums[3][6] = { {7, 12, 9, 8, 9, 15}, // answer is 9 {13432, 343, 4, 143, 3411, 23}, // answer is 1375 {533, 5552, 24, 55, 8792, 9889} // answer is 16 }; int answer[3] = {9, 1375, 16}; for (int i = 0; i < 3; ++i) { int res = findKOr(nums[i], 6, k[i]); if (res == answer[i]) { printf("Test case %d passed\n", i + 1); } else { printf("Test case %d failed. Expected: %d, Got: %d\n", i + 1, answer[i], res); } } return 0; } ``` **Assembly Code Ver.1** :::danger Use fewer instructions. ::: ```s .data nums: .word 7, 12, 9, 8, 9, 15 numsSize: .word 6 k: .word 4 str_1: .string "res: " str_2: .string "\n" .text main: la a0, nums lw a1, numsSize lw a2, k jal ra, findKOr jal ra, printResult li a7, 10 ecall findKOr: addi sp, sp, -144 clear_bit_cnt: sw x0, 0(sp) sw x0, 4(sp) sw x0, 8(sp) sw x0, 12(sp) sw x0, 16(sp) sw x0, 20(sp) sw x0, 24(sp) sw x0, 28(sp) sw x0, 32(sp) sw x0, 36(sp) sw x0, 40(sp) sw x0, 44(sp) sw x0, 48(sp) sw x0, 52(sp) sw x0, 56(sp) sw x0, 60(sp) sw x0, 64(sp) sw x0, 68(sp) sw x0, 72(sp) sw x0, 76(sp) sw x0, 80(sp) sw x0, 84(sp) sw x0, 88(sp) sw x0, 92(sp) sw x0, 96(sp) sw x0, 100(sp) sw x0, 104(sp) sw x0, 108(sp) sw x0, 112(sp) sw x0, 116(sp) sw x0, 120(sp) sw x0, 124(sp) store_arg: sw ra, 128(sp) sw a2, 132(sp) # k sw a1, 136(sp) # numsSize sw a0, 140(sp) # address of nums add t0, x0, x0 # t0 = res = 0 add t1, x0, x0 # t1 = n = 0 addi t4, x0, 32 # t4 = 32 findKOr_loop_0: bge t1, a1, findKOr_end # n >= numsSize slli t3, t1, 2 # t3 = 4 * n add t3, t3, a0 # t3 = a0 + 4 * n lw t2, 0(t3) # t2 = num add t3, x0, x0 # t3 = bit_idx = 0 findKOr_loop_1: bge t3, t4, findKOr_loop_0_pre # bit_idx >= 32 addi t5, x0, 1 # t5 = 1 sll t5, t5, t3 # t5 = 1 << bit_idx and t5, t5, t2 # t5 = num & (1 << bit_idx) beq t5, x0, findKOr_loop_1_pre slli t5, t3, 2 # t5 = bit_idx * 4 add t5, t5, sp lw t6, 0(t5) # t6 = bit_cnt[bit_idx] addi t6, t6, 1 # t6 = t6 + 1 sw t6, 0(t5) bne t6, a2, findKOr_loop_1_pre addi t5, x0, 1 # t5 = 1 sll t5, t5, t3 # t5 = 1 << bit_idx or t0, t0, t5 # res |= 1 << bit_idx findKOr_loop_1_pre: addi t3, t3, 1 # ++bit_idx j findKOr_loop_1 findKOr_loop_0_pre: addi t1, t1, 1 # ++n j findKOr_loop_0 findKOr_end: lw ra, 128(sp) lw a2, 132(sp) lw a1, 136(sp) lw a0, 140(sp) addi sp, sp, 144 add a0, t0, x0 ret printResult: add t0, a0, x0 la a0, str_1 li a7, 4 ecall add a0, t0, x0 li a7, 1 ecall la a0, str_2 li a7, 4 ecall ret ``` :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: **Assembly Code Ver.2** When allocating the stack, initially 32 sw instructions were used to zero out the stack. In version 2, it was changed to use a loop to zero it out, reducing the instructions from 32 to 8. In Ver.2, an automated testing mechanism was added, which can automatically compare the results and easily add new test data. ```s .data # Number arrays nums1: .word 7, 12, 9, 8, 9, 15 nums2: .word 13432, 343, 4, 143, 3411, 23 nums3: .word 533, 5552, 24, 55, 8792, 9889 # Test case structure: address of array, size of array, k, expected result test_cases: .word nums1, 6, 4, 9 .word nums2, 6, 2, 1375 .word nums3, 6, 5, 16 .word 0 # End marker str_test: .string "Test case " str_pass: .string " passed\n" str_fail: .string " failed. Expected: " str_got: .string ", Got: " str_nl: .string "\n" .text main: la s0, test_cases # s0 points to the current test case addi s1, x0, 1 # s1 is the test case counter run_tests: lw t0, 0(s0) # Load address of nums beq t0, x0, end_tests # If address is 0, end tests lw a0, 0(s0) # Load address of nums lw a1, 4(s0) # Load size lw a2, 8(s0) # Load k lw a3, 12(s0) # Load expected result jal ra, findKOr # Check result beq a0, a3, test_passed # Test failed la a0, str_test li a7, 4 ecall add a0, x0, s1 li a7, 1 ecall la a0, str_fail li a7, 4 ecall add a0, x0, a3 # Expected result li a7, 1 ecall la a0, str_got li a7, 4 ecall add a0, x0, t0 # Actual result li a7, 1 ecall la a0, str_nl li a7, 4 ecall jal x0, next_test test_passed: la a0, str_test li a7, 4 ecall mv a0, s1 li a7, 1 ecall la a0, str_pass li a7, 4 ecall next_test: addi s0, s0, 16 # Move to next test case addi s1, s1, 1 # Increment test case counter jal x0, run_tests end_tests: li a7, 10 ecall findKOr: addi sp, sp, -144 sw ra, 128(sp) sw a2, 132(sp) # k sw a1, 136(sp) # numsSize sw a0, 140(sp) # address of nums add t0, x0, x0 # t0 = res = 0 add t1, x0, x0 # t1 = n = 0 addi t4, x0, 32 # t4 = 32 clear_bit_cnt_init: add t2, x0, x0 addi t3, x0, 32 add t5, sp, x0 # t5 = sp clear_bit_cnt: bge t2, t3, findKOr_loop_0 sw x0, 0(t5) addi t5, t5, 4 addi t2, t2, 1 jal x0, clear_bit_cnt findKOr_loop_0: bge t1, a1, findKOr_end # n >= numsSize slli t3, t1, 2 # t3 = 4 * n add t3, t3, a0 # t3 = a0 + 4 * n lw t2, 0(t3) # t2 = num add t3, x0, x0 # t3 = bit_idx = 0 findKOr_loop_1: bge t3, t4, findKOr_loop_0_pre # bit_idx >= 32 addi t5, x0, 1 # t5 = 1 sll t5, t5, t3 # t5 = 1 << bit_idx and t5, t5, t2 # t5 = num & (1 << bit_idx) beq t5, x0, findKOr_loop_1_pre slli t5, t3, 2 # t5 = bit_idx * 4 add t5, t5, sp lw t6, 0(t5) # t6 = bit_cnt[bit_idx] addi t6, t6, 1 # t6 = t6 + 1 sw t6, 0(t5) bne t6, a2, findKOr_loop_1_pre addi t5, x0, 1 # t5 = 1 sll t5, t5, t3 # t5 = 1 << bit_idx or t0, t0, t5 # res |= 1 << bit_idx findKOr_loop_1_pre: addi t3, t3, 1 # ++bit_idx j findKOr_loop_1 findKOr_loop_0_pre: addi t1, t1, 1 # ++n j findKOr_loop_0 findKOr_end: lw ra, 128(sp) lw a2, 132(sp) lw a1, 136(sp) lw a0, 140(sp) addi sp, sp, 144 add a0, t0, x0 ret ``` **Result (Ver.2)** - C Code | Metrics | Value | | --------------- | ----- | | Cycle | 18112 | | Instrs. retired | 12630 | | IPC | 0.697 | | CPI | 1.43 | - Assembly Code | Metrics | Value | | --------------- | ----- | | Cycle | 7941 | | Instrs. retired | 5254 | | IPC | 0.662 | | CPI | 1.51 | ### CPU Pipeline Analysis - **PC, Instruction memory and Compressed Decoder** The **Program Counter (PC)** serves as the address input for fetching instructions from the instruction memory. If the compressed instruction extension (RVC) is supported, the PC increments by either **2** for a 16-bit compressed instruction or by **4** for a 32-bit instruction. The **multiplexer** before the PC selects between: 1. `pc + 2` or `pc + 4` (for sequential instruction fetch) 2. The result of the ALU (for jump or branch instructions: `jal`, `jalr`, or B-type instructions) When the branch unit signals a branch or jump, the PC multiplexer selects the ALU result. The **compressed decoder** converts 16-bit compressed instruction into a 32-bit format, enabling the next-level decoder to decode all instructions uniformly. ![image](https://hackmd.io/_uploads/ry7UpTIJJg.png) - **IF/ID Stage (Instruction Fetch to Instruction Decode)** The **IF/ID pipeline register** temporarily stores the fetched instruction and the corresponding PC value, ensuring the proper handoff from the fetch stage to the decode stage. This ensures that even when a stall occurs, the next pipeline stage can process the instruction correctly. ![image](https://hackmd.io/_uploads/Hkr7m0Lkkx.png) - **Decode Stage: Instruction Decoding, Register File and Immediate Generation** The decode module extracts relevant fields such as `rd`, `rs1`, `rs2`, and the opcode. These fields drive the control signals for later stages. The **register file** consists 32 general-prupose registers (x0-x31), with two read ports (`rs1`, `rs2`) for the source registers and one write port for writing back the result (`rd`). The **immediate generation** logic extracts the appropriate immediate value from the instruction (I-type, S-type, B-type, etc.) based on the instruction format. ![image](https://hackmd.io/_uploads/SyZH70U1Jx.png) - **ID/EX Stage (Instruction Decode to Execution)** The **ID/EX pipeline register** stores the register operands, the immediate value, and control signals generated by the decode stage, preparing them for the execution stage. ![image](https://hackmd.io/_uploads/r1kP7RIkyg.png) - **ALU and Branch Unit (Execution Stage)** The **ALU** performs operations such as addition, subtraction AND, OR and shifts, as defined by the instruction. The **branch unit** compares register values (`x[rs1]` and `x[rs2]`) and determines whether to take a branch based on the branch condition (`beq`, `bne`, etc.). ![image](https://hackmd.io/_uploads/SkaDQCUyke.png) - **Mux 1** selects either the PC or the `x[rs1]` register as the operand for the ALU, depending on whether the instruction is a jump or a regular arithmetic operation. ![image](https://hackmd.io/_uploads/r1OUGPwyyg.png) - **Mux 2** selects between `x[rs2]` and the immediate value, used for operation involving an immediate (e.g., `addi`) or register (e.g. `add`). ![image](https://hackmd.io/_uploads/rJRvMwwJJx.png) - **Mux 3 and Mux 4** handle **forwarding** to avoid stalls due to data hazards. They select the correct operands for the ALU by choosing between values from the ID/EX stage, the memory stage, or the writeback stage. ![image](https://hackmd.io/_uploads/SkKhGvDJke.png) ![image](https://hackmd.io/_uploads/HkXqGDwkke.png) - **EX/MEM Stage (Execution to Memory)** The **EX/MEM pipeline register** stores the ALU result and the data to be written to memory, as well as control signals for memory access. The ALU result can either be an address for a memory operation (load/store) or the result of an arithmetic/logic operation. ![image](https://hackmd.io/_uploads/B1JKXRL1Jg.png) - **Data Memory (Memory Stage)** The **data memory** is accessed for load (`lw`, `lb`, `lh`) and store (`sw`, `sh`, `sb`) instruction. The address is calculated in the ALU during the execution stage, and the result is either fetched from or written to memory based on the instruction. ![image](https://hackmd.io/_uploads/SknYQC8yJe.png) - **MEM/WB Stage (Memory to Writeback)** The **MEM/WB pipeline register** stores either the data read from memory or the ALU result from the previous stage. This data is prepared for writing back into the register file in the final stage of the pipelien. ![image](https://hackmd.io/_uploads/B1JhXA81Jg.png) - **WB Mux (Writeback Stage)** The **WB Mux** selects between: 1. `PC + 4` (for `jal`/`jalr` instructions) 2. The ALU result (for arithmetic/logic instructions) 3. The data fetched from memory (for load instructions) ![image](https://hackmd.io/_uploads/Syw6XAUJke.png)