# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < `hhchen` >
## Problem C in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
### C Code
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
### Assembly Code for function fp16_to_fp32
```c
.data
argument: .word 0x0000
endl: .string "\n"
.text
la t0, argument
lw t0, 0(t0) # store x in t0
slli t0, t0, 16
li s1, 0x80000000
and s1, s1, t0 # s1 = x and with 0x80000000(sign)
li s2, 0x7FFFFFFF
and s2, s2, t0 # s2 = x and with 0x7FFFFFFF(unsign)
addi t1, t1, 31 # t1 = 31 (t1 is i)
addi t2, t2, 1 # t2 = 1
clz_loop:
sll t4, t2, t1 # t4 record 1U shift i
addi t1, t1, -1 # i = i - 1
and t4, t4, s1 # x & (1U << i)
bnez t4, out_clz # break
addi t5, t5, 1 # did not enter the if statement and incremented count
bge t1, zero, clz_loop # return to the clz_loop
out_clz:
addi s3 ,s3 , 5
addi t5, t5, -5
bge t5, s3, out
add t5, x0, x0 # t5 = renorm_shift
out:
li t3, 0x4000000
add t3, s2, t3
srli t3, t3, 8
li s3, 0x7F800000
and t3, t3, s3 # s3 = inf_nan_mask
addi s4, s2, -1
srli s4, s4, 31 # s4 = zero mask
sll s2, s2, t5
srli s2, s2, 3
sub t5, zero, t5
addi t5, t5, 0x70
slli t5, t5, 23 # 0x70 - renorm << 23
or t5, t5, s3
xori s4, s4, -1
and t5, t5, s4
add s2, t5, s2
or s1, s1, s2
mv a0, s1
li a7,34
ecall
```
## Binary Watch ([LeetCode 401](https://leetcode.com/problems/binary-watch/description/))
### Motivation
I chose the Binary Watch problem (LeetCode 401) because I saw someone solve the [Number of 1 Bits problem (LeetCode 191)](https://leetcode.com/problems/number-of-1-bits/description/) before. These problems are similar because they both deal with counting bits. For the Binary Watch, I used a simple method that checks all possibilities.
### Problem
A binary watch has 4 LEDs on the top to represent the hours (0-11), and 6 LEDs on the bottom to represent the minutes (0-59). Each LED represents a zero or one, with the least significant bit on the right.
For example, the below binary watch reads "4:51".

Given an integer turnedOn which represents the number of LEDs that are currently on (ignoring the PM), return all possible times the watch could represent. You may return the answer in any order.
The hour must not contain a leading zero.
For example, "01:00" is not valid. It should be "1:00".
The minute must consist of two digits and may contain a leading zero.
For example, "10:2" is not valid. It should be "10:02".
## Solution
### Idea for problem solving
- Enumerate all possible hours (0 to 11) and for each hour, enumerate all possible minutes (0 to 59).
- Convert both the hour and minute to binary and concatenate them.
- Then count the number of 1s in the resulting binary string. This count is known as the [Hamming weight](https://en.wikipedia.org/wiki/Hamming_weight) (you can use the concept of Hamming weight, as in [LeetCode problem 191](https://leetcode.com/problems/number-of-1-bits/description/)).
- If the count is exactly n, add the corresponding time to the result set.
- After completing the enumeration, you will have the desired set of times.
- This approach effectively finds all times on a 12-hour clock whose binary representation has a Hamming weight of n.
### C Code
- At first, I implemented the Hamming weight function and printed the corresponding value to test if the functionality was working correctly.
#### Hamming Weight (Version 1)
- To compute the 1’s of a number, I apply the idea that the result of `n & (n - 1)` will always be the value of `n` without the least significant 1.
```c
int hammingWeight(unsigned short n) {
int count = 0;
while (n != 0){
n = n & (n - 1);
count++;
}
return count;
}
```
#### Hamming Weight (Version 2)
- This function `hammingWeight` calculates the number of bits set to `1` (Hamming Weight) in an `unsigned short` (16-bit) value `n`. It uses an optimized method to count bits through successive bitwise operations:
1. **First Step**: `(n & 0x5555) + ((n >> 1) & 0x5555)` — pairs of bits are combined.
2. **Second Step**: `(n & 0x3333) + ((n >> 2) & 0x3333)` — groups of 4 bits are added.
3. **Third Step**: `(n & 0x0F0F) + ((n >> 4) & 0x0F0F)` — groups of 8 bits are added.
4. **Fourth Step**: `(n & 0x00FF) + ((n >> 8) & 0x00FF)` — combines all 16 bits.
The final result is the count of `1` bits in `n`. This approach is efficient for counting bits in constant time.
```c
int hammingWeight(unsigned short n) {
n = (n & 0x5555) + ((n >> 1) & 0x5555);
n = (n & 0x3333) + ((n >> 2) & 0x3333);
n = (n & 0x0F0F) + ((n >> 4) & 0x0F0F);
n = (n & 0x00FF) + ((n >> 8) & 0x00FF);
return n;
}
```
#### Reading BinaryWatch
```c
char** readBinaryWatch(int turnedOn, int* returnSize) {
char** valid = (char**)malloc(256 * sizeof(char*));
*returnSize = 0;
if (turnedOn < 0 || turnedOn > 10) // If the number of LEDs is out of range, return empty result
return valid;
for (unsigned short i = 0; i < 1024; ++i) { // Iterate through all possible 10-bit combinations
if (hammingWeight(i) == turnedOn && (i >> 6) < 12 && (i & 0x3F) < 60) {
// Check if the number of bits set to '1' matches and the time is valid
int hour = i >> 6; // Extract the hour (first 4 bits)
int minute = i & 0x3F; // Extract the minutes (last 6 bits)
char* timeString = (char*)malloc(6 * sizeof(char));
if (minute < 10)
sprintf(timeString, "%d:0%d", hour, minute);
else
sprintf(timeString, "%d:%d", hour, minute);
valid[*returnSize] = timeString;
(*returnSize)++;
}
}
return valid;
}
```
#### Main Program
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main() {
int turnedOn = 1; // Number of LEDs turned on (fixed to 1 for testing)
int returnSize; // Number of valid time results
// Get the valid times for the given number of LEDs
char** times = readBinaryWatch(turnedOn, &returnSize);
// Print the valid times
printf("\nValid times for %d LEDs turned on:\n", turnedOn);
for (int i = 0; i < returnSize; i++) {
printf("%s\n", times[i]);
free(times[i]);
}
free(times);
return 0;
}
```
- Console Output
```c
Valid times for 1 LEDs turned on:
0:01
0:02
0:04
0:08
0:16
0:32
1:00
2:00
4:00
8:00
```
### Assembly Code
#### Binary Watch (Version 1)
```c
.data
str_valid: .string "LEDs turned on:"
str_colon: .string ":"
str_leading_zero: .string "0"
str_newline: .string "\n"
input_turned_on: .word 1
.text
.globl main
main:
addi sp, sp, -4
sw ra, 0(sp)
lw a0, input_turned_on # Load LED turned on count into a0
jal ra, readBinaryWatch # Call readBinaryWatch function
# Print result message
la a0, str_valid # Load result message string
li a7, 4 # Syscall: print string
ecall
lw a0, input_turned_on # Load LED turned on count
li a7, 1 # Syscall: print integer
ecall
la a0, str_newline # Print newline
li a7, 4
ecall
# Exit program
lw ra, 0(sp)
addi sp, sp, 4
li a7, 10 # Syscall: exit
ecall
# hammingWeight function
hammingWeight:
li t0, 0 # Initialize count to 0
hammingWeight_loop:
beqz a0, hammingWeight_end # If n == 0, exit loop
addi t1, a0, -1 # t1 = n - 1
and a0, a0, t1 # n = n & (n - 1)
addi t0, t0, 1 # Increment count
j hammingWeight_loop
hammingWeight_end:
mv a0, t0 # Move count to return register
ret
# readBinaryWatch function
readBinaryWatch:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 20(sp)
sw s1, 16(sp)
sw s2, 12(sp)
sw s3, 8(sp)
sw s4, 4(sp)
sw s5, 0(sp)
mv s5, a0 # s5 = turnedOn (input parameter)
li s0, 0 # s0 = counter
li s1, 1024 # s1 = upper bound
li s3, 0 # s3 = number of valid times found
loop_watch:
beq s0, s1, end_readBinaryWatch
mv a0, s0
jal ra, hammingWeight
bne a0, s5, next_iteration
# Extract hour and minute
srli t0, s0, 6
andi t0, t0, 0xF # t0 = hour
andi t1, s0, 0x3F # t1 = minute
# Check hour < 12 and minute < 60
li t2, 12
bge t0, t2, next_iteration
li t2, 60
bge t1, t2, next_iteration
# Print hour
mv a0, t0
li a7, 1 # Syscall: print integer
ecall
# Print colon
la a0, str_colon
li a7, 4 # Syscall: print string
ecall
# Print minute with leading zero if needed
li t2, 10
bge t1, t2, print_minute
la a0, str_leading_zero
li a7, 4 # Syscall: print string
ecall
print_minute:
mv a0, t1
li a7, 1 # Syscall: print integer
ecall
# Print newline
la a0, str_newline
li a7, 4 # Syscall: print string
ecall
addi s3, s3, 1 # Increment valid times counter
next_iteration:
addi s0, s0, 1
j loop_watch
end_readBinaryWatch:
mv a0, s3 # Return number of valid times found
# Restore saved registers and return
lw ra, 24(sp)
lw s0, 20(sp)
lw s1, 16(sp)
lw s2, 12(sp)
lw s3, 8(sp)
lw s4, 4(sp)
lw s5, 0(sp)
addi sp, sp, 28
ret
```
#### Binary Watch (Version 2)
```c
.data
str_valid: .string "LEDs turned on:"
str_colon: .string ":"
str_leading_zero: .string "0"
str_newline: .string "\n"
input_turned_on: .word 1
.text
.globl main
main:
addi sp, sp, -4
sw ra, 0(sp)
lw a0, input_turned_on # Load LED turned on count into a0
jal ra, readBinaryWatch # Call readBinaryWatch function
# Print result message
la a0, str_valid # Load result message string
li a7, 4 # Syscall: print string
ecall
lw a0, input_turned_on # Load LED turned on count
li a7, 1 # Syscall: print integer
ecall
la a0, str_newline # Print newline
li a7, 4
ecall
# Exit program
lw ra, 0(sp)
addi sp, sp, 4
li a7, 10 # Syscall: exit
ecall
# hammingWeight function
hammingWeight:
mv t0, a0
srli t1, t0, 1
li t2, 0x55555555
and t1, t1, t2
sub t0, t0, t1
li t1, 0x33333333
and t2, t0, t1
srli t0, t0, 2
and t0, t0, t1
add t0, t0, t2
srli t1, t0, 4
add t0, t0, t1
li t1, 0x0f0f0f0f
and t0, t0, t1
srli t1, t0, 8
add t0, t0, t1
srli t1, t0, 16
add a0, t0, t1
andi a0, a0, 0x3f
ret
# readBinaryWatch function
readBinaryWatch:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 20(sp)
sw s1, 16(sp)
sw s2, 12(sp)
sw s3, 8(sp)
sw s4, 4(sp)
sw s5, 0(sp)
mv s5, a0 # s5 = turnedOn (input parameter)
li s0, 0 # s0 = counter
li s1, 1024 # s1 = upper bound
li s3, 0 # s3 = number of valid times found
loop_watch:
beq s0, s1, end_readBinaryWatch
mv a0, s0
jal ra, hammingWeight
bne a0, s5, next_iteration
# Extract hour and minute
srli t0, s0, 6
andi t0, t0, 0xF # t0 = hour
andi t1, s0, 0x3F # t1 = minute
# Check hour < 12 and minute < 60
li t2, 12
bge t0, t2, next_iteration
li t2, 60
bge t1, t2, next_iteration
# Print hour
mv a0, t0
li a7, 1 # Syscall: print integer
ecall
# Print colon
la a0, str_colon
li a7, 4 # Syscall: print string
ecall
# Print minute with leading zero if needed
li t2, 10
bge t1, t2, print_minute
la a0, str_leading_zero
li a7, 4 # Syscall: print string
ecall
print_minute:
mv a0, t1
li a7, 1 # Syscall: print integer
ecall
# Print newline
la a0, str_newline
li a7, 4 # Syscall: print string
ecall
addi s3, s3, 1 # Increment valid times counter
next_iteration:
addi s0, s0, 1
j loop_watch
end_readBinaryWatch:
mv a0, s3 # Return number of valid times found
# Restore saved registers and return
lw ra, 24(sp)
lw s0, 20(sp)
lw s1, 16(sp)
lw s2, 12(sp)
lw s3, 8(sp)
lw s4, 4(sp)
lw s5, 0(sp)
addi sp, sp, 28
ret
```
:::danger
Check list:
1. Did your test suite include the corner cases?
2. Can you test suite be operated via given [test driver](https://en.wikipedia.org/wiki/Test_driver)?
3. Can you check the report and break down the details?
:::
:::success
I added three new chapters: Results, Comparison, and Analysis.
:::
## Result
### Test case
1. Test case 1 (0 LEDs on):
Expected output: Only one valid time - 0:00
Analysis: This is a corner case where no LEDs are lit. The only valid time in this scenario is 0:00 (midnight). It tests the program's ability to handle the minimum possible input.
2. Test case 2 (1 LED on):
Expected output: 12 valid times
Analysis: This case represents the lower bound for valid inputs. The possible times are:
- Hours: 1:00, 2:00, 4:00, 8:00
- Minutes: 0:01, 0:02, 0:04, 0:08, 0:16, 0:32
- Total: 10 times
This case tests the program's ability to handle a minimal valid input and correctly generate all possible times with only one bit set.
3. Test case 3 (9 LEDs on):
Expected output: No valid times
Analysis: This is an invalid input corner case. It's impossible to have 9 LEDs on in a valid time configuration because:
- Hours use 4 bits (0-11 in binary)
- Minutes use 6 bits (0-59 in binary)
- Total bits available: 4 + 6 = 10 bits
## Comparing
This analysis primarily focuses on comparing two versions of the Hamming weight function implementation. The Hamming weight function counts the number of '1' bits in a binary number, which is crucial for the Binary Watch problem.
| Feature | Version 1 | Version 2 |
|----------------------------|-------------------------------------------|-------------------------------------|
| **Algorithm Type** | Bitwise Clear (Lowest Bit) | Mask and Divide-and-Conquer |
| **Code Complexity** | Simple and intuitive | More complex (uses masks and shifts)|
| **Parallelism Potential** | No parallelism | Can be parallelized with masking |
| **Use Case** | Simple implementation, fewer 1-bits | High efficiency for large data |
I tested two versions of the Hamming weight function.The cycle count, CPI, and IPC of this assembly code(as shown in the image below).
| Execution info | Version 1 | Version 2 |
|----------------------------|--------------|--------------|
| **Cycles** | 56734 | 38302 |
| **Instrs. retired** | 36156 | 30012 |
| **CPI** | 1.57 | 1.28 |
| **IPC** | 0.637 | 0.784 |
| **Clock rate** | 0 Hz | 0 Hz |
## Analysis
I test the code using [Ripes](https://ripes.me/) simulator.
### Pseudo instruction
The translated code looks like:
```c
00000000 <main>:
0: ffc10113 addi x2 x2 -4
4: 00112023 sw x1 0 x2
8: 10000517 auipc x10 0x10000
c: 00e52503 lw x10 14 x10
10: 0a0000ef jal x1 160 <readBinaryWatch>
14: 10000517 auipc x10 0x10000
18: fec50513 addi x10 x10 -20
1c: 00400893 addi x17 x0 4
20: 00000073 ecall
24: 10000517 auipc x10 0x10000
28: ff252503 lw x10 -14 x10
2c: 00100893 addi x17 x0 1
30: 00000073 ecall
34: 10000517 auipc x10 0x10000
38: fe050513 addi x10 x10 -32
3c: 00400893 addi x17 x0 4
40: 00000073 ecall
44: 00012083 lw x1 0 x2
48: 00410113 addi x2 x2 4
4c: 00a00893 addi x17 x0 10
50: 00000073 ecall
00000054 <hammingWeight>:
54: 00050293 addi x5 x10 0
58: 0012d313 srli x6 x5 1
5c: 555553b7 lui x7 0x55555
60: 55538393 addi x7 x7 1365
64: 00737333 and x6 x6 x7
68: 406282b3 sub x5 x5 x6
6c: 33333337 lui x6 0x33333
70: 33330313 addi x6 x6 819
74: 0062f3b3 and x7 x5 x6
78: 0022d293 srli x5 x5 2
7c: 0062f2b3 and x5 x5 x6
80: 007282b3 add x5 x5 x7
84: 0042d313 srli x6 x5 4
88: 006282b3 add x5 x5 x6
8c: 0f0f1337 lui x6 0xf0f1
90: f0f30313 addi x6 x6 -241
94: 0062f2b3 and x5 x5 x6
98: 0082d313 srli x6 x5 8
9c: 006282b3 add x5 x5 x6
a0: 0102d313 srli x6 x5 16
a4: 00628533 add x10 x5 x6
a8: 03f57513 andi x10 x10 63
ac: 00008067 jalr x0 x1 0
000000b0 <readBinaryWatch>:
b0: fe410113 addi x2 x2 -28
b4: 00112c23 sw x1 24 x2
b8: 00812a23 sw x8 20 x2
bc: 00912823 sw x9 16 x2
c0: 01212623 sw x18 12 x2
c4: 01312423 sw x19 8 x2
c8: 01412223 sw x20 4 x2
cc: 01512023 sw x21 0 x2
d0: 00050a93 addi x21 x10 0
d4: 00000413 addi x8 x0 0
d8: 40000493 addi x9 x0 1024
dc: 00000993 addi x19 x0 0
000000e0 <loop_watch>:
e0: 08940463 beq x8 x9 136 <end_readBinaryWatch>
e4: 00040513 addi x10 x8 0
e8: f6dff0ef jal x1 -148 <hammingWeight>
ec: 07551a63 bne x10 x21 116 <next_iteration>
f0: 00645293 srli x5 x8 6
f4: 00f2f293 andi x5 x5 15
f8: 03f47313 andi x6 x8 63
fc: 00c00393 addi x7 x0 12
100: 0672d063 bge x5 x7 96 <next_iteration>
104: 03c00393 addi x7 x0 60
108: 04735c63 bge x6 x7 88 <next_iteration>
10c: 00028513 addi x10 x5 0
110: 00100893 addi x17 x0 1
114: 00000073 ecall
118: 10000517 auipc x10 0x10000
11c: ef850513 addi x10 x10 -264
120: 00400893 addi x17 x0 4
124: 00000073 ecall
128: 00a00393 addi x7 x0 10
12c: 00735a63 bge x6 x7 20 <print_minute>
130: 10000517 auipc x10 0x10000
134: ee250513 addi x10 x10 -286
138: 00400893 addi x17 x0 4
13c: 00000073 ecall
00000140 <print_minute>:
140: 00030513 addi x10 x6 0
144: 00100893 addi x17 x0 1
148: 00000073 ecall
14c: 10000517 auipc x10 0x10000
150: ec850513 addi x10 x10 -312
154: 00400893 addi x17 x0 4
158: 00000073 ecall
15c: 00198993 addi x19 x19 1
00000160 <next_iteration>:
160: 00140413 addi x8 x8 1
164: f7dff06f jal x0 -132 <loop_watch>
00000168 <end_readBinaryWatch>:
168: 00098513 addi x10 x19 0
16c: 01812083 lw x1 24 x2
170: 01412403 lw x8 20 x2
174: 01012483 lw x9 16 x2
178: 00c12903 lw x18 12 x2
17c: 00812983 lw x19 8 x2
180: 00412a03 lw x20 4 x2
184: 00012a83 lw x21 0 x2
188: 01c10113 addi x2 x2 28
18c: 00008067 jalr x0 x1 0
```
In each row it denotes address in instruction memory, instruction's machine code (in hex) and instruction itself respectively.
### 5-stage pipelined processor
The 5 stages is designed to deal with the instructions in pipeline approach to enhance the execution efficiency. The 5 stages is shown as below:

- Instruction fetch (IF)
- Instruction decode and register fetch (ID)
- Execute (EX)
- Memory access (MEM)
- Register write back (WB)
The five-stage pipeline in a processor is used to parallelize the execution of instructions, allowing multiple instructions to be processed simultaneously at different stages. Here’s an overview of the pipeline stages and an explanation of how an I-type instruction progresses through these stages:
### I-Type Instruction Example
Let’s use the **I-Type instruction** `addi x2, x2, -16` as an example and demonstrate how it passes through each of the five stages in the RISC-V pipeline.
1. **Instruction Fetch (IF)**:
In the **IF** stage, the processor fetches the instruction `addi x2, x2, -16` from memory. The **Program Counter (PC)** determines the address of the instruction. After fetching, the PC is incremented by `4` to point to the next instruction, since RISC-V instructions are 4 bytes long.

2. **Instruction Decode (ID)**:
In the **ID** stage, the fetched instruction is decoded by the control unit to determine that it is an `addi` instruction, an I-Type arithmetic operation. The control signals are generated to guide the ALU to perform addition. The value of the source register (`x2`) is read from the register file, and the immediate value (`-16`) is extracted and sign-extended to 32 bits.

3. **Execute (EX)**:
In the **EX** stage, the ALU performs the addition operation between the value from `x2` and the immediate value `-16`.
- For example, let’s assume `x2` initially contains the value `64`.
- The ALU then calculates `64 + (-16)` which results in `48`.

4. **Memory Access (MEM)**:
In the **MEM** stage, there is no memory operation needed for this instruction since it is purely an arithmetic operation. Therefore, the ALU result (`48`) is simply forwarded to the next stage.

5. **Register Write Back (WB)**: No action needed for `sw`.
In the **WB** stage, the result from the ALU (`48`) is written back to the destination register (`x2`). After this stage is completed, the value in register `x2` is updated to `48`.

## Reference
- [Binary Watch C++ Solution](https://hackmd.io/@Inversionpeter/BkccX5Ovw)
- [191. Number of 1 Bits](https://hackmd.io/jj61kCP_R-etsz4c7KTOSw?view)