# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <`liballouo`>
## Problem `C` in [Quiz 1](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
We test our code using [Ripes](https://ripes.me/) simulator.
### Version 1
#### C code
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
#### RISC-V assembly code
You can see the entire code [here](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Counting_Leading_Zero_v1.s).
```c
# Input: uint32_t a0
# Output: uint32_t a0
my_clz:
li t0, 0 # count = 0
li t1, 31 # t1 = 31
clz_loop:
li t2, 1
sll t2, t2, t1 # (1U << i)
and t3, a0, t2 # x & (1U << i)
bnez t3, clz_done # if (x & (1U << i)) break loop
addi t0, t0, 1 # count = count + 1
addi t1, t1, -1 # i = i - 1
bgez t1, clz_loop # if (i>=0) continue loop
clz_done:
addi a0, t0, 0 # set result to a0
ret
```
#### execution info
| Info | Value |
| --------------- | ----- |
| Cycle | 443 |
| Instrs. retired | 322 |
| CPI | 1.38 |
| IPC | 0.727 |
| Clock rate | 0 Hz |
### Version 2
Counting leading zero utilizes population count.
#### C code
```c
static inline int my_clz(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555);
x = (x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x3f));
}
```
#### RISC-V assembly code
You can see the entire code [here](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Counting_Leading_Zero_v2.s).
```c
# Input: uint32_t a0
# Output: uint32_t a0
my_clz:
srli t0, a0, 1 # x >> 1
or a0, a0, t0 # x |= (x >> 1)
srli t0, a0, 2 # x >> 2
or a0, a0, t0 # x |= (x >> 2)
srli t0, a0, 4 # x >> 4
or a0, a0, t0 # x |= (x >> 4)
srli t0, a0, 8 # x >> 8
or a0, a0, t0 # x |= (x >> 8)
srli t0, a0, 16 # x >> 16
or a0, a0, t0 # x |= (x >> 16)
popcount:
lui t1, 0x55555
ori t1, t1, 0x555 # 0x55555555 (t1)
srli t0, a0, 1 # x >> 1
and t2, t0, t1 # (x >> 1) & 0x55555555
sub a0, a0, t2 # x -= ((x >> 1) & 0x55555555)
lui t1, 0x33333
ori t1, t1, 0x333 # 0x33333333 (t1)
srli t0, a0, 2 # x >> 2
and t2, t0, t1 # (x >> 2) & 0x33333333
and t3, a0, t1 # x & 0x33333333
add a0, t2, t3 # ((x >> 2) & 0x33333333) + (x & 0x33333333)
lui t1, 0x0f0f0
ori t1, t1, 0x70f
addi t1, t1, 0x800 # 0x0f0f0f0f (t1)
srli t0, a0, 4 # x >> 4
add t2, a0, t0 # (x >> 4) + x
and a0, t2, t1 # ((x >> 4) + x) & 0x0f0f0f0f
srli t0, a0, 8 # x >> 8
add a0, a0, t0 # x += (x >> 8)
srli t0, a0, 16 # x >> 16
add a0, a0, t0 # x += (x >> 16)
andi t0, a0, 0x3f # x & 0x3f
li t1, 32
sub a0, t1, t0 # 32 - (x & 0x3f)
ret
```
#### execution info
| Info | Value |
| --------------- | ----- |
| Cycle | 194 |
| Instrs. retired | 151 |
| CPI | 1.28 |
| IPC | 0.778 |
| Clock rate | 0 Hz |
### test cases
#### test case 1
> Input: 0x3
> Output: 30
#### test case 2
> Input: 0xFFFFFFFF
> Output: 0
#### test case 3
> Input: 0x3321675
> Output: 6
## Biweekly Contest 141: Q1. Construct the Minimum Bitwise Array II
[Q1. Construct the Minimum Bitwise Array II](https://leetcode.com/contest/biweekly-contest-141/problems/construct-the-minimum-bitwise-array-ii/description/)
### Description
You are given an array `nums` consisting of `n` prime integers.
You need to construct an array `ans` of length `n`, such that, for each index `i`, the bitwise `OR` of `ans[i]` and `ans[i] + 1` is equal to `nums[i]`, i.e. `ans[i] OR (ans[i] + 1) == nums[i]`.
Additionally, you must minimize each value of `ans[i]` in the resulting array.
If it is not possible to find such a value for `ans[i]` that satisfies the condition, then set `ans[i] = -1`.
A prime number is a natural number greater than 1 with only two factors, 1 and itself.
### Constraints
- `1 <= nums.length <= 100`
- `2 <= nums[i] <= 10^9`
- `nums[i]` is a prime number.
### Solution
There are three cases to consider:
1. **Case1**: If `nums[i] = 2`, then `ans[i] = -1`.
2. **Case 2**: If `nums[i]` consists of all 1s in binary representation, then `ans[i] = nums[i] >> 1`.
3. **Case 3**: For other numbers, find the rightmost 0 bit and change the 1 bit immediately to its right to 0.
Here's an example for **Case 3**:
```
input num = 19 (in decimal, 10011 in binary)
expect output = 17
1. find the rightmost 0 : 10011
^
index i = 2
2. set i-1 to 0 : 10011 -> 10001
^ ^
answer = 10001 (binary) = 17 (decimal)
3. check the answer :
10001 ^ 10010 = 10011 = 19 (decimal) -> checked!
```
We implement it in C.
You can see the entire code [here](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Construct_the_Minimum_Bitewise_Array_I.c).
To handle different cases, the solution involves multiple functions:
#### `minBitwiseNum` Function
This function determines the appropriate value for `ans` based on the input `num`.
```c
uint32_t minBitwiseNum(uint32_t num) {
// Input: num
// Output: ans
uint32_t ans;
// case: 2
if(num == 2){
ans = -1;
}
// case: all 1s (2^n-1)
else if(ilog2(num) + 1 == popcount(num)){
ans = num >> 1;
}
// case: others
else{
ans = helper(num);
}
return ans;
}
```
- **Case 1**: This handles the special case where `num` equals 2, for which there is no valid solution for `ans`, so the function returns -1.
- **Case 2**: This checks if `num` has all 1s in its binary representation (e.g., 3, 7, 15). The condition uses the `ilog2 `function to get the number of bits required and `popcount` to check if all bits are set.
- **Case 3**: When the number doesn't fall under the previous two cases, the `helper` function finds the correct value.
#### `ilog2` Function
This function calculates the integer logarithm base 2 of a number using the `my_clz` function to count leading zeros.
```c
int ilog2(uint32_t num){
return 31 - my_clz(num);
}
```
#### `my_clz` Function
The `my_clz` function counts the number of leading zeros in a 32-bit unsigned integer. It uses bitwise operations to set all bits to 1 to the right of the highest set bit and then uses `popcount` to calculate the number of 1s. Last, it returns the value of leading zero.
```c
int my_clz(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x = popcount(x);
return (32 - (x & 0x3f));
}
```
#### `popcount` Function
The `popcount` function counts the number of 1s in the binary representation of a number. It uses bit manipulation techniques to sum the bits in groups, eventually reducing to a count of all 1s.
```c
int popcount(uint32_t x){
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return x;
}
```
#### `helper` Function
The `helper` function is used to find the minimum possible value for `ans` when the number does not fit into the special cases. It looks for the rightmost 0 bit and clears the bit immediately to its left.
```c
uint32_t helper(uint32_t num) {
for(int i=1; i<32; i++){
// Find the rightmost 0
if(!((1 << i) & num)){
// Change the 1 on the right of the rightmost 0 to 0
num ^= (1 << (i - 1));
break;
}
}
return num;
}
```
After finishing the C code, we translate it from C code to a RISC-V assembly program.
You can see the entire code [here](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Construct_the_Minimum_Bitwise_Array_I.s).
#### RISC-V assembly code
```c
# Input: uint32_t a0
# Output: uint32_t a0
# t6: flag, 0 for ilog ; 1 for popcount
min_bitwise_num:
addi sp, sp, -4
sw ra, 0(sp)
li t0, 2
beq a0, t0, case_2 # case: num == 2
addi a1, a0, 0
li t6, 0 # flag for ilog / popcount
jal ra, ilog2 # do ilog2(num) and save the result in a2
addi a0, a1, 0
addi a2, a2, 1 # ilog2(num) + 1
li t6, 1 # flag for ilog / popcount
jal ra, popcount # do popcount(num) and save the result in a3
addi a0, a1, 0
beq a2, a3, case_all_1s # case: all 1s
jal ra, helper # case: others
case_2:
li a0, -1 # no possible value
lw ra, 0(sp)
addi sp, sp, 4
ret
my_clz:
srli t0, a0, 1 # x >> 1
or a0, a0, t0 # x |= (x >> 1)
srli t0, a0, 2 # x >> 2
or a0, a0, t0 # x |= (x >> 2)
srli t0, a0, 4 # x >> 4
or a0, a0, t0 # x |= (x >> 4)
srli t0, a0, 8 # x >> 8
or a0, a0, t0 # x |= (x >> 8)
srli t0, a0, 16 # x >> 16
or a0, a0, t0 # x |= (x >> 16)
popcount:
lui t1, 0x55555
ori t1, t1, 0x555 # 0x55555555 (t1)
srli t0, a0, 1 # x >> 1
and t2, t0, t1 # (x >> 1) & 0x55555555
sub a0, a0, t2 # x -= ((x >> 1) & 0x55555555)
lui t1, 0x33333
ori t1, t1, 0x333 # 0x33333333 (t1)
srli t0, a0, 2 # x >> 2
and t2, t0, t1 # (x >> 2) & 0x33333333
and t3, a0, t1 # x & 0x33333333
add a0, t2, t3 # ((x >> 2) & 0x33333333) + (x & 0x33333333)
lui t1, 0x0f0f0
ori t1, t1, 0x70f
addi t1, t1, 0x800 # 0x0f0f0f0f (t1)
srli t0, a0, 4 # x >> 4
add t2, a0, t0 # (x >> 4) + x
and a0, t2, t1 # ((x >> 4) + x) & 0x0f0f0f0f
srli t0, a0, 8 # x >> 8
add a0, a0, t0 # x += (x >> 8)
srli t0, a0, 16 # x >> 16
add a0, a0, t0 # x += (x >> 16)
beqz t6, clz_done
addi a3, a0, 0 # save the result in a3
ret
clz_done:
andi t0, a0, 0x3f # x & 0x3f
li t1, 32
sub a0, t1, t0 # 32 - (x & 0x3f)
ret
ilog2:
addi sp, sp, -4
sw ra, 0(sp)
jal ra my_clz # my_clz(num) and save the result in a0
li t2, 31
sub a2, t2, a0 # 31 - my_clz(num)
lw ra, 0(sp)
addi sp, sp, 4
ret
case_all_1s:
srli a0, a0, 1 # ans = num >> 1
lw ra, 0(sp)
addi sp, sp, 4
ret
helper:
li t0, 32 # upper bound t0 = 32
li t1, 1 # loop start from 1, i(t1)
helper_loop:
li t2, 1 # t2 = 1
sll t2, t2, t1 # 1 << i
and t3, a0, t2 # (1 << i) & num
beqz t3, helper_done # correctly find the rightmost 0
addi t1, t1, 1 # i += 1
bne t0, t1, helper_loop # do not find the rightmost 0 yet, continue the loop
helper_done:
srli t2, t2, 1 # 1 << (i - 1)
xor a0, a0, t2 # num ^= (1 << (i - 1))
lw ra, 0(sp)
addi sp, sp, 4
ret
```
### Result
#### test cases
##### test case 1
> Input: 2
> Output: -1
##### test case 2
> Input: 3
> Output: 1
##### test case 3
> Input: 5
> Output: 4
##### test case 4
> Input: 7
> Output: 3
##### test case 5
> Input: 31
> Output: 15
##### test case 6
> Input: 307
> Output: 305
##### test case 7
> Input: 383
> Output: 319
##### test case 8
> Input: 5039
> Output: 5031
#### execution info
##### C code
| Info | Value |
| --------------- | ----- |
| Cycle | 6522 |
| Instrs. retired | 5066 |
| CPI | 1.26 |
| IPC | 0.777 |
| Clock rate | 0 Hz |
##### RISC-V assembly code
| Info | Value |
| --------------- | ----- |
| Cycle | 1056 |
| Instrs. retired | 806 |
| CPI | 1.31 |
| IPC | 0.763 |
| Clock rate | 0 Hz |
### Analysis
#### pseudo instruction
The pseudo instruction below is `popcount` part of the [code](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Construct_the_Minimum_Bitwise_Array_I.s) run on [Ripes](https://ripes.me/).
```c
000000cc <popcount>:
cc: 55555337 lui x6 0x55555
d0: 55536313 ori x6 x6 1365
d4: 00155293 srli x5 x10 1
d8: 0062f3b3 and x7 x5 x6
dc: 40750533 sub x10 x10 x7
e0: 33333337 lui x6 0x33333
e4: 33336313 ori x6 x6 819
e8: 00255293 srli x5 x10 2
ec: 0062f3b3 and x7 x5 x6
f0: 00657e33 and x28 x10 x6
f4: 01c38533 add x10 x7 x28
f8: 0f0f0337 lui x6 0xf0f0
fc: 70f36313 ori x6 x6 1807
100: 80030313 addi x6 x6 -2048
104: 00455293 srli x5 x10 4
108: 005503b3 add x7 x10 x5
10c: 0063f533 and x10 x7 x6
110: 00855293 srli x5 x10 8
114: 00550533 add x10 x10 x5
118: 01055293 srli x5 x10 16
11c: 00550533 add x10 x10 x5
120: 000f8663 beq x31 x0 12 <clz_done>
124: 00050693 addi x13 x10 0
128: 00008067 jalr x0 x1 0
```
#### 5-stage pipelined processor
We choose **5-stage pipelined processor** as the processor we use in [Ripes](https://ripes.me/) to run the [code](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Construct_the_Minimum_Bitwise_Array_I.s). Its block diagram look like this:

The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are:
1. `IF`: Instruction fetch
2. `ID`: Instruction decode and register fetch
3. `EX`: Execute
4. `MEM`: Memory access
5. `WB`: Register write back
To illustrate, we take the instruction `lui t1, 0x55555` as an example. According to [The RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf)
> `LUI` (load upper immediate) is used to build 32-bit constants and uses the **U-type format**. `LUI` places the U-immediate value in the top 20 bits of the destination register `rd`, filling in the lowest 12 bits with zeros.
#### U-type format instruction
The layout of the U-format instructions is:
| imm [31:12] | rd [11:7] | opcode [6:0] |
| --------------------- | --------- | ------------ |
The pseudo instruction is `lui x6 0x55555` which decoded into `55555337`.
| imm [31:12] | rd [11:7] | opcode [6:0] | representation |
| --------------------- | --------- | ------------ | -------------- |
| 01010101010101010101 | 00110 | 0110111 | binary |
| 0x55555 | 0x6 | 0x37 | heximal |
#### 1. `IF`: Instruction fetch

- The `PC` starts at `0x000000cc`. `PC+4` will pass to the following stages for further usages.
- We fetch the instruction at `0x000000cc` from instruction memory, resulting in the instruction `0x55555337`.
#### 2. `ID`: Instruction decode and register fetch

- The instruction `0x55555337` is decoded, revealing `lui x6, 0x55555`.
- `opcode = LUI`
- `Imm. = 0x55555000`
- `Wr idx = 0x06`
- Although U-type instructions do not require source registers, the processor still fetches the source register indices from the instruction, resulting in `rs1 = 0x0a` and `rs2 = 0x15`. These fields are irrelevant for this instruction.
```
31 25 24 20 19 15 14 12 ...
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 ...
| rs2 = 0x15| | rs1 = 0x0a|
------------- -------------
```
- The values of registers `rs1` and `rs2` are fetched, resulting in `Reg 1 = 0x00000003` and `Reg 2 = 0x00000000`, but they are not used in this instruction.
- The `PC` and `PC+4` are passed unchanged to the next stage.
#### `EX`: Execute

- We do not use the `PC+4`, so we just let it pass.
- We do not use `Reg 1` and `Reg 2` in U-type format instructions, so after their value passes the multiplexers, they are filtered. The result of ALU is just `0x55555000` from `Imm.`.
- There is no branch taken in this stage.
- `rd` is not used in this stage, so it just passes to the next stage.
#### `MEM`: Memory access

- The `PC+4` is still passed to the next stage.
- Memory access is not required for U-type instructions, so there is no memory read or write. Consequently, the "read out" of data memory will be `0x00000000`, indicating that no memory access occurred.
#### `WB`: Register write back

- We pass the `rd = 0x06` and `imm = 0x55555000` to `Wr idx` and `Wr data` respectively.


After completing these five stages, the register `x6(t1)` is set to `0x55555000`.

## Reference
[The RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf)
[Quiz1 of Computer Architecture (2023 Fall)](https://hackmd.io/@sysprog/arch2023-quiz1-sol)
[Quiz1 of Computer Architecture (2024 Fall)
](https://hackmd.io/@sysprog/arch2024-quiz1-sol)