# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by : <[`tyt017`](https://github.com/tyt017/2024arch/tree/main)> (Ya-Tong Tsai)
## Quiz1 Problem C
### test cases
| number | input | output |
|:------:|:-----:|:--------:|
| NaN | 0x7FFF | 0x7FFFFFFF |
| -Inifinity | 0xFC00 | 0xFF800000 |
| +Inifinity | 0x7C00 | 0x7F800000 |
| -0 | 0x8000 | 0x80000000 |
| +0 | 0x0000 | 0x00000000 |
| Denormalized | 0x0001 | 0x33800000 |
| Normalized | 0x3C00 | 0x3F800000 |
### The origin version
#### C code
```c=
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & 0x80000000;
const uint32_t nonsign = w & 0x7FFFFFFF;
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & 0x7F800000;
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
#### RISC-V Assembly code
```riscv=
.data
input_data: .word 0x7FFF
answer: .word 0x7FFFFFFF
str1: .string "TRUE"
str2: .string "FALSE"
.text
input:
la s0, input_data # address of input
lw t0, 0(s0)
slli t0, t0 16 #extend 16-bit to 32-bit
li t1, 0x80000000 #sign-bit mask
and t1, t0, t1 #sign bit
li t2, 0x7FFFFFFF #nonsign-bit mask
and t2, t0, t2 #nonsign part
addi t3, x0, 31 #loop counter i
my_clz:
addi t4, x0, 1 #load 1
sll t4, t4, t3 # 1 << i
and t4, t2, t4 # nonsign & (1 << i)
blt x0, t4, renorm # finish counting
addi a0, a0, 1 # count++
addi t3, t3, -1 #--i
blt t3, x0, renorm
beqz t4, my_clz # keep counting
renorm:
add t5, a0, x0 # load #of leading zero
addi t4, x0, 5
sub t5, t5, t4
bgt t5, x0, result # if (renorm - 5) > 0
add t5, x0, x0 # t5 = renorm_shift
result:
li t0, 0x04000000
add s1, t2, t0
srai s1, s1, 8
li t0, 0x7F800000
and s1, s1, t0 # s1 = inf_nan_mask
addi s2, t2, -1
srai s2, s2, 31 # s2 = zero_mask
sll t2, t2, t5
srli t2, t2, 3
li t6, 0x70
sub t6, t6, t5
slli t6, t6, 23
add t2, t2, t6
or t2, t2, s1
xori s2, s2, -1
and t2, t2, s2
or a0, t1, t2 # the result is stored in a0
la s3, answer
lw t1, 0(s3)
beq a0, t1, true
la a0, str2 # the result is wrong
li a7, 4
ecall
true: # the result is the same with the answer
la a0, str1
li a7, 4
ecall
```
:::success
After testing, the code above can output the correct result.
:::
## Branchless CLZ
To make it more friendly for pipeline system, we can use bitwise AND, OR and SHIFT operators to implement `branchless clz`.
> I was going to use the function clz5(x) on [wiki page](https://en.wikipedia.org/wiki/Find_first_set#CLZ), but it still have to use conditionals to judge the next step. In order to achieve "branchless", I change to use the thought of population count which is referenced from [stackoverflow](https://stackoverflow.com/questions/10439242/count-leading-zeroes-in-an-int32).
### C code
```c=
int my_clz2(uint32_t x)
{
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
//count the ones
x -= x >> 1 & 0x55555555;
x = (x >> 2 & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0F0F0F0F;
x += x >> 8;
x += x >> 16;
return 32 - (x & 0x3F);
}
```
### RISC-V Assembly code
```riscv=
.data
input_data: .word 0x0000
answer: .word 0x00000000
str1: .string "TRUE"
str2: .string "FALSE"
.text
input:
la s0, input_data # address of input
lw t0, 0(s0)
slli t0, t0 16 #extend 16-bit to 32-bit
li t1, 0x80000000 #sign-bit mask
and t1, t0, t1 #sign bit
li t2, 0x7FFFFFFF #nonsign-bit mask
and t2, t0, t2 #nonsign part
add t4, t2, x0
my_clz:
srli t3, t4, 1
or t4, t4, t3 # x |= x >> 1
srli t3, t4, 2
or t4, t4, t3 # x |= x >> 2
srli t3, t4, 4
or t4, t4, t3 # x |= x >> 4
srli t3, t4, 8
or t4, t4, t3 # x |= x >> 8
srli t3, t4, 16
or t4, t4, t3 # x |= x >> 16
li t5, 0x55555555
srli t3, t4, 1
and t3, t3, t5
sub t4, t4, t3 # x -= x >> 1 & 0x55555555
li t5, 0x33333333
srli t3, t4, 2
and t3, t3, t5
and t5, t4, t5
add t4, t3, t5 # x = (x >> 2 & 0x33333333) + (x & 0x33333333)
li t5, 0x0F0F0F0F
srli t3, t4, 4
add t4, t3, t4
and t4, t4, t5 # x = ((x >> 4) + x) & 0x0F0F0F0F
srli t3, t4, 8
add t4, t3, t4 # x += x >> 8
srli t3, t4, 16
add t4, t3, t4 # x += x >> 16
li t5, 0x3F
and t4, t4, t5
addi t3, x0, 32
sub t3, t3, t4 # 32 - (x & 0x3F)
add a0, x0, t3 # put the result into a0
renorm:
add t5, a0, x0 # load #of leading zero
addi t4, x0, 5
sub t5, t5, t4
bgt t5, x0, result # if (renorm - 5) > 0
add t5, x0, x0 # t5 = renorm_shift
result:
li t0, 0x04000000
add s1, t2, t0
srai s1, s1, 8
li t0, 0x7F800000
and s1, s1, t0 # s1 = inf_nan_mask
addi s2, t2, -1
srai s2, s2, 31 # s2 = zero_mask
sll t2, t2, t5
srli t2, t2, 3
li t6, 0x70
sub t6, t6, t5
slli t6, t6, 23
add t2, t2, t6
or t2, t2, s1
xori s2, s2, -1
and t2, t2, s2
or s1, t1, t2 # the result is stored in s1
la s3, answer
lw t1, 0(s3)
beq s1, t1, true
la a0, str2
li a7, 4
ecall
j exit
true:
la a0, str1
li a7, 4
ecall
exit:
add x0, x0, x0 # nop
```
## Elias Gamma Encoding
Elias Gamma encoding is a variable-length, lossless entropy encoding method used to encode non-negative integers efficiently. It is particularly useful when smaller integers are more frequent because it produces shorter encodings for smaller numbers. This makes it a practical choice for compressing sequences of small integers, such as in certain data compression algorithms.
The encoding flow is showed in the following graph.
First, calculate L = $\lfloor \log_2 x \rfloor$, L is the number of leading zero for encoded result, and then the encoded result contains L zeros and the binary form of x.
For example, `x = 278`, then `L = 8` and the binary form is `100010110`, so the encoded result is `00000000100010110`.
``` mermaid
graph TD;
A["Input int number x"]-->B["compute L = lg(x)"];
B["compute L = lg(x)"]-->C["Output L zeros + the binary form of x"];
```
### Test cases
|#| input | output |
|:-:|:-----:|:------:|
| 1 | 5 | 00101 |
| 2 | 16 | 000010000 |
| 3 | 278 | 00000000100010110 |
### Motivation
Since "Counting leading Zero (CLZ)" can help us not only to immediately calculate $\lfloor \log_2 x \rfloor$, but also to get the MSB in the binary form of the input, I think it can be implemented into this problem to speedup the performance.
In the following part, I will implement the solution with non-branchless CLZ and the other one with branchless CLZ.
### C code (No CLZ)
```c=
#include <stdio.h>
#include <stdint.h>
int binary_length(uint32_t n) {
int length = 0;
while (n > 0) {
length++;
n >>= 1;
}
return length;
}
// Elias Gamma encoding
void elias_gamma_encode(uint32_t n) {
if (n == 0) {
printf("0\n");
return;
}
int L = binary_length(n);
for (int i = 0; i < L - 1; i++) {
printf("0");
}
for (int i = L - 1; i >= 0; i--) {
printf("%d", (n >> i) & 1);
}
printf("\n");
}
int main() {
printf("Elias Gamma encoding:\n");
elias_gamma_encode(5); // output: 00101
elias_gamma_encode(7); // output: 00111
elias_gamma_encode(278); // output: 00000000100010110
return 0;
}
```
### Solution (C code with CLZ)
```c=
#include <stdio.h>
#include <stdint.h>
static inline int log_with_clz(uint32_t x) {
int count = 0;
if (x == 0) return 32;
while ((x & (1U << 31)) == 0) {
x <<= 1;
count++;
}
return 32 - count;
}
// Elias Gamma encoding
void elias_gamma_encode(uint32_t n) {
if (n == 0) {
printf("0\n");
return;
}
int L = log_with_clz(n);
for (int i = 0; i < L - 1; i++) {
printf("0");
}
for (int i = L - 1; i >= 0; i--) {
printf("%d", (n >> i) & 1);
}
printf("\n");
}
int main() {
printf("Elias Gamma encoding:\n");
elias_gamma_encode(5); // output: 00101
elias_gamma_encode(16); // output: 000010000
elias_gamma_encode(278); // output: 00000000100010110
return 0;
}
```
### Solution (RISC-V Assembly code with CLZ)
The code will directly output the result into console.
```riscv=
.data
input_data: .word 278
str0: .string "0"
str1: .string "1"
.text
intput:
la s0, input_data
lw t0, 0(s0)
addi t3, x0, 31
my_clz:
addi t4, x0, 1 #load 1
sll t4, t4, t3 # 1 << i
and t4, t0, t4 # x & (1 << i)
blt x0, t4, out_my_clz # finish counting
addi a0, a0, 1 # count++
addi t3, t3, -1 #--i
blt t3, x0, out_my_clz
beqz t4, my_clz # keep counting
out_my_clz:
li t1, 32
sub t1, t1, a0
add t2, x0, t1 # loop counter for L-1 zeros
addi t3, t1, -1
add t5, t1, x0 # loop counter for output_binary
output:
addi t2, t2, -1
bgt t2, x0, output_zero
j output_binary
output_zero: # print 0
la a0, str0
li a7, 4
ecall
j output
output_binary:
srl t4, t0, t3 # shift to the MSB
andi t1, t4, 1
addi t3, t3, -1
addi t5, t5, -1
blt t5, x0, exit
beqz t1, output_zero
output_one: # print 1
la a0, str1
li a7, 4
ecall
bgt t5, x0, output_binary
exit:
add x0, x0, x0 # nop
```
:::success
After testing, the code above can output the correct result.
:::
### Solution (C code with branchless CLZ)
Since the last line of the original branchless CLZ is `32 - (x & 0x3F)`,
and $\log_{2}{x}$ is `32 - clz(x)`, I reduce the last line of log_with_clz2(x) to `(x & 0x3F)`.
```c=
#include <stdio.h>
#include <stdint.h>
int log_with_clz2(uint32_t x)
{
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
//count the ones
x -= x >> 1 & 0x55555555;
x = (x >> 2 & 0x33333333) + (x & 0x33333333);
x = (x >> 4) + x & 0x0F0F0F0F;
x += x >> 8;
x += x >> 16;
return (x & 0x3F);
}
// Elias Gamma encoding
void elias_gamma_encode(uint32_t n) {
if (n == 0) {
printf("0\n");
return;
}
int L = log_with_clz2(n);
for (int i = 0; i < L - 1; i++) {
printf("0");
}
printf("1");
for (int i = L - 2; i >= 0; i--) {
printf("%d", (n >> i) & 1);
}
printf("\n");
}
int main() {
printf("Elias Gamma encoding:\n");
elias_gamma_encode(5); // output: 00101
elias_gamma_encode(16); // output: 000010000
elias_gamma_encode(278); // output: 00000000100010110
return 0;
}
```
### Solution (RISC-V Assembly code with branchless CLZ)
The code will directly output the result into console.
```riscv=
.data
input_data: .word 278
str0: .string "0"
str1: .string "1"
.text
intput:
la s0, input_data
lw t0, 0(s0)
add t4, t0, x0
my_clz:
srli t3, t4, 1
or t4, t4, t3 # x |= x >> 1
srli t3, t4, 2
or t4, t4, t3 # x |= x >> 2
srli t3, t4, 4
or t4, t4, t3 # x |= x >> 4
srli t3, t4, 8
or t4, t4, t3 # x |= x >> 8
srli t3, t4, 16
or t4, t4, t3 # x |= x >> 16
li t5, 0x55555555
srli t3, t4, 1
and t3, t3, t5
sub t4, t4, t3 # x -= x >> 1 & 0x55555555
li t5, 0x33333333
srli t3, t4, 2
and t3, t3, t5
and t5, t4, t5
add t4, t3, t5 # x = (x >> 2 & 0x33333333) + (x & 0x33333333)
li t5, 0x0F0F0F0F
srli t3, t4, 4
add t4, t3, t4
and t4, t4, t5 # x = ((x >> 4) + x) & 0x0f0f0f0f
srli t3, t4, 8
add t4, t3, t4 # x += x >> 8
srli t3, t4, 16
add t4, t3, t4 # x += x >> 16
li t5, 0x3F
and t4, t4, t5
add a0, x0, t4 # put the result into a0
out_my_clz:
add t2, x0, a0 # loop counter for L-1 zeros
addi t3, a0, -1
addi t5, a0, 0 # loop counter for output_binary
output:
addi t2, t2, -1
bgt t2, x0, output_zero
j output_binary
output_zero:
la a0, str0
li a7, 4
ecall
j output
output_binary:
srl t4, t0, t3 # shift to the MSB
andi t1, t4, 1
addi t3, t3, -1
addi t5, t5, -1
blt t5, x0, exit
beqz t1, output_zero
output_one:
la a0, str1
li a7, 4
ecall
bgt t5, x0, output_binary
exit:
add x0, x0, x0 # nop
```
:::success
After testing, the code above can output the correct result.
:::
## Result
As the result showed in the following figures, the program using branchless CLZ computes faster than the original one.
The reason is that if there is a loop or branch in C code, pipeline processor may have to deal with control hazard caused by mispredicted branch. In this situation, to make the program work correctly, pipeline processor needs to flush the instructions that have been load in to previous stages (before the stage deciding branch or not).
### test case 1 (5)
Performance of the original code (with CLZ)

Performance of the optimal code (with branchless CLZ)

### test case 2 (16)
Performance of the original code (with CLZ)

Performance of the optimal code (with branchless CLZ)

### test case 3 (278)
Performance of the original code (with CLZ)

Performance of the optimal code (with branchless CLZ)

## Analysis
### Pseudo Instruction for Elias Gamma Encoding with Branchless CLZ
```
00000000 <intput>:
0: 10000417 auipc x8 0x10000
4: 00040413 addi x8 x8 0
8: 00042283 lw x5 0 x8
c: 00028eb3 add x29 x5 x0
00000010 <my_clz>:
10: 001ede13 srli x28 x29 1
14: 01ceeeb3 or x29 x29 x28
18: 002ede13 srli x28 x29 2
1c: 01ceeeb3 or x29 x29 x28
20: 004ede13 srli x28 x29 4
24: 01ceeeb3 or x29 x29 x28
28: 008ede13 srli x28 x29 8
2c: 01ceeeb3 or x29 x29 x28
30: 010ede13 srli x28 x29 16
34: 01ceeeb3 or x29 x29 x28
38: 55555f37 lui x30 0x55555
3c: 555f0f13 addi x30 x30 1365
40: 001ede13 srli x28 x29 1
44: 01ee7e33 and x28 x28 x30
48: 41ce8eb3 sub x29 x29 x28
4c: 33333f37 lui x30 0x33333
50: 333f0f13 addi x30 x30 819
54: 002ede13 srli x28 x29 2
58: 01ee7e33 and x28 x28 x30
5c: 01eeff33 and x30 x29 x30
60: 01ee0eb3 add x29 x28 x30
64: 0f0f1f37 lui x30 0xf0f1
68: f0ff0f13 addi x30 x30 -241
6c: 004ede13 srli x28 x29 4
70: 01de0eb3 add x29 x28 x29
74: 01eefeb3 and x29 x29 x30
78: 008ede13 srli x28 x29 8
7c: 01de0eb3 add x29 x28 x29
80: 010ede13 srli x28 x29 16
84: 01de0eb3 add x29 x28 x29
88: 03f00f13 addi x30 x0 63
8c: 01eefeb3 and x29 x29 x30
90: 01d00533 add x10 x0 x29
00000094 <out_my_clz>:
94: 00a003b3 add x7 x0 x10
98: fff50e13 addi x28 x10 -1
9c: 00050f13 addi x30 x10 0
000000a0 <output>:
a0: fff38393 addi x7 x7 -1
a4: 00704463 blt x0 x7 8 <output_zero>
a8: 0180006f jal x0 24 <output_binary>
000000ac <output_zero>:
ac: 10000517 auipc x10 0x10000
b0: f5850513 addi x10 x10 -168
b4: 00400893 addi x17 x0 4
b8: 00000073 ecall
bc: fe5ff06f jal x0 -28 <output>
000000c0 <output_binary>:
c0: 01c2deb3 srl x29 x5 x28
c4: 001ef313 andi x6 x29 1
c8: fffe0e13 addi x28 x28 -1
cc: ffff0f13 addi x30 x30 -1
d0: 000f4e63 blt x30 x0 28 <exit>
d4: fc030ce3 beq x6 x0 -40 <output_zero>
000000d8 <output_one>:
d8: 10000517 auipc x10 0x10000
dc: f2e50513 addi x10 x10 -210
e0: 00400893 addi x17 x0 4
e4: 00000073 ecall
e8: fde04ce3 blt x0 x30 -40 <output_binary>
000000ec <exit>:
ec: 00000033 add x0 x0 x0
```
## 5-stages Pipeline Processor

The five stages for the 5-stages pipeline processor are:
| Stage Name | Function |
| :--------: | :--------: |
| IF | Instruction Fetch |
| ID | Instruction Decode & Register Read |
| EX | Execution or Address Calculation |
| MEM | Data Memory Access |
| WB | Write Back |
Each stage is separated by the pipeline register.
Instruction in different type of format will go through 5 stages with different signal turned on. Take R-type as example.
#### R-type format
Take `add x29 x5 x0` as example
>`add rd, rs1, rs2`
>add instruction is used to calculate the result of the two registers (rs1 + rs2) and put the result into destination register (rd)
Let's see how it go through each stage.
**1. Instruction Fetch (IF)**

- the machine code of this instruction is `00028eb3`, so `instr` is equal to `00028eb3`.
- Its PC address is `0x0000000c`, so the next PC address is `0x00000010` .
- PC will increment by 4 automatically using the above adder.
- Because there is no branch occur, next instruction will be at PC + 4, so the multiplexer before PC choose input come from adder.
**2. Instruction Decode and register fetch (ID)**

- instruction is decoded to four part:
- `opcode` = `add`
- `Wr idx` = `0x1d`
- `R1 idx` = `0x05`
- `R2 idx` = `0x00`
- Because the previous instruction of this one is `lw`, and we reuse the register `t0 (x5)`, there is a **Read After Write hazard**. Therefore, the processor will insert a `nop` between `lw` and `add`.
- 
- This way, `R1` register will not get the correct value at this stage.
- Only `R2` register loads the correct value from register memory which is `0x00000000`.
- The current PC value `0x0000000c` and next PC value `0x00000010` are just send through this stage, we don't use them.
**3. Execute (EX)**

- First level multiplexers choose the value come from `Forwarding path` and `R2`.
- After inserting a `nop`, at this stage `R1` register can get the correct value `0x00000116` which is forwarded from WB stage.
- Second level multiplexers still choose the value from rigisters and the value is sent to ALU, instead of the next PC address and the immediate value.
- ALU gets `op1` = `0x00000116` and `op2` = `0x00000000`, and add them together.
- The result `0x00000116` is output to `Res`.
- The result `0x00000116` and `Wr idx (0x1d)` are just send through this stage, we don't use them.
**4. Memory access (MEM)**

- Res from ALU is send to 3 ways:
- Pass through this stage and go to WB stage
- Send back to EXE stage for next instruction to use
- Use as data memory address.
- `R2` is sent to Data in, but memory is not able to writing (`Wr en` is unset).
- The ALU result `0x00000116` and `Wr idx (0x01d)` are just sent through this stage, we don't use them.
**5. Register write back (WB)**

- The multiplexer choose Res from ALU as final output. So the output value is `0x00000116`.
- The output value and `Wr idx (0x1d)` are sent back to registers block. Finally, the value `0x00000116` will be write into `x5` register, whose ABI name is `t0` and register address is `0x1d`.
After finishing 5-stage, the value is written into the destination register, as the following figure.

## Reference
[Quiz1 of Computer Architecture(2024)](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
[Find first set - CLZ](https://en.wikipedia.org/wiki/Find_first_set#CLZ)
[Count leading zeroes in an Int32](https://stackoverflow.com/questions/10439242/count-leading-zeroes-in-an-int32)
[The Aggregate Magic Algorithms - Population Count](http://aggregate.org/MAGIC/#Population%20Count%20(Ones%20Count))
[Quiz1 of Computer Architecture(2023)](https://hackmd.io/@sysprog/arch2023-quiz1-sol)
[Elias Gamma Encoding](https://en.wikipedia.org/wiki/Elias_gamma_coding)