Assignment1: RISC-V Assembly and Instruction Pipeline
===
contributed by <[`5k6m4`](https://github.com/5k6m4)>
# Experimental Environment
The experimental environment is a 5-stage in-order pipeline processor with hazard detection and data forwarding. The pipeline consists of five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory (MEM), and write back (WB).
The following provides a brief introduction to each stage using an example assembly code and shows the functionality using the [Ripes](https://github.com/mortbopet/Ripes) simulator.
### Example Assembly Code
```c
.data
test_data: .word 0x0000ffff
.text
main:
# demonstrate memory operation and data hazard
la t0, test_data # load the address of test_data
lw t1, 0(t0) # t1 = 0x0000ffff
addi t1, x0, 0 # t1 = 0
addi t2, t1, 2 # t2 = 2
sw t1, 0(t0) # mem[t0] = 0
lw t2, 0(t0) # t2 = 0
if:
# demonstrate control hazard
beq t1, x0, exit
addi t2, x0, 2
addi t3, x0, 3
addi t4, x0, 4
exit:
li a7, 10 # system call code for exiting the program
ecall # make the exit system call
```
### Example Pipeline State

**Instruction Fetch (IF)**
- The processor loads an instruction using the value of the program counter (PC) as the address.
- In the example pipeline state, the processor load the instruction `sw x6, 0, x5`.
**Instruction Decode (ID)**
- The instruction is decoded to obtain the source register index and the immediate value. Next, the register file is accessed to retrieve the value of the source register.
- The instruction in the ID stage is `addi x7, x6, 2`. Therefore, one operand is set to the value of register `x6`, and the other is the immediate value `2`.
**Execute (EX)**
- The ALU performs the operation specified by the instruction in this stage.
- In the example where the instruction is `addi x6, x0, 0`, the ALU adds the value in register `x0` to the immediate value `0`.
**Memory (MEM)**
- If the instruction in this stage is a load/store instruction, the processor accesses the memory using the address stored in the source register.
- When the load instruction `lw x6, 0(x5)` is in this stage, the CPU requests to read the value stored at the address contained in register `x5`.
**Write Back (WB)**
- The value computed from the instruction of access from the memory will be written back to the destination register in this stage.
- In the example pipeline stage, the instruction in the WB stage is `addi x5, x5, 0`. Therefore, the processor writes the value computed from the EX stage to register `x5`.
# [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) ProblemB
:::danger
Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data.
:::
## C Program
```c
#include <stdint.h>
typedef struct {
uint16_t bits;
} bf16_t;
static inline float bf16_to_fp32(bf16_t h) {
union {
float f;
uint32_t i;
} u = {.i = (uint32_t)h.bits << 16};
return u.f;
}
static inline bf16_t fp32_to_bf16(float s) {
bf16_t h;
union {
float f;
uint32_t i;
} u = {.f = s};
if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */
h.bits = (u.i >> 16) | 64; /* force to quiet */
return h;
}
h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
return h;
}
```
## bf16_to_fp32()
### Test Case
| | Input in bf16 | Expected Output in fp32 | Description |
| ----- | ------------: | ----------------------: | :---------- |
| Case0 | 0x3f80 | 0x3f800000 | normal positive of bf16 |
| Case1 | 0xbf80 | 0xbf800000 | normal negative of bf16 |
| Case2 | 0x0000 | 0x00000000 | positive zero of bf16 |
| Case3 | 0x7f80 | 0x7f800000 | positive infinity of bf16 |
| Case4 | 0x7fc0 | 0x7fc00000 | quiet NaN of bf16 |
| Case5 | 0x0001 | 0x00010000 | subnormal positive of bf16 |
| Case6 | 0x8001 | 0x80010000 | subnormal negative of bf16 |
| Case7 | 0x7f7f | 0x7f7f0000 | normal maximum of bf16 |
| Case8 | 0x0080 | 0x00800000 | normal minimum of bf16 |
### Assembly Code
```c
.data
# test cases pair = {input bf16 value, golden fp32 value}
test_case:
.word 0x00003f80, 0x3f800000 # 0: normal positive of bf16
.word 0x0000bf80, 0xbf800000 # 1: normal negative of bf16
.word 0x00000000, 0x00000000 # 2: positive zero of bf16
.word 0x00007f80, 0x7f800000 # 3: positive infinity of bf16
.word 0x00007fc0, 0x7fc00000 # 4: quiet NaN of bf16
.word 0x00000001, 0x00010000 # 5: subnormal positive of bf16
.word 0x00008001, 0x80010000 # 6: subnormal negative of bf16
.word 0x00007f7f, 0x7f7f0000 # 7: normal maximum of bf16
.word 0x00000080, 0x00800000 # 8: normal minimum of bf16
err_str0: .string "TestCase"
err_str1: .string " of bf16_to_fp32() failed!!\n"
pass_str: .string "Pass all test cases of bf16_to_fp32()!!\n"
.text
main:
la s0, test_case # load the address of test_case to s0
# Initialize the test error to zero
addi s1, x0, 0 # s1 = test_err = 0
# Initialize local variables for the mainLoop
addi s2, x0, 0 # s2 = i = 0
addi s3, x0, 9 # s3 = const. = 9
mainLoop:
bgeu s2, s3, endMainLoop # when i >= 9, end mainLoop
# Load the argument that bf16_to_fp32() needs
lw a0, 0(s0)
# Function call
jal ra, bf16_to_fp32 # jump to functioin bf16_to_fp32
# Compare bf16_to_fp32() result with the golden
lw s4, 4(s0) # load golden, s4 = golden value
beq a0, s4, passTest # if result == golden, pass test case
# Update test error
addi s1, s1, 1 # test_err++
# Print failed message
la a0, err_str0 # load the address of err_str0
li a7, 4 # system call code for printing a string
ecall # print err_str0
addi a0, s2, 0 # load the number of the test case
li a7, 1 # system call code for printing a integer
ecall # print the number of the test case
la a0, err_str1 # load the address of err_str1
li a7, 4 # system call code for printing a string
ecall # print err_str1
passTest:
# Update test case and loop variable
addi s0, s0, 8 # move s0 to the address of the next test case
addi s2, s2, 1 # i++
j mainLoop # jal x0, mainLoop
endMainLoop:
bne s1, x0, exit # if test_err != 0, do not print pass message
# Print pass message
la a0, pass_str # load the address of pass_str
li a7, 4 # system call code for printing a string
ecall # print pass_str16
exit:
li a7, 10 # system call code for exiting the program
ecall # make the exit system call
bf16_to_fp32:
slli a0, a0, 16 # a0 = h.bits << 16
ret # jalr x0, ra, 0
```
## fp32_to_bf16()
### Test Case
| | Input in fp32 | Expected Output in bf16 | Description |
| ----- | ------------: | ----------------------: | :---------- |
| Case0 | 0x41200000 | 0x4120 | normal positive of fp32 |
| Case1 | 0xc1200000 | 0xc120 | normal negative of fp32 |
| Case2 | 0x00000000 | 0x0000 | positive zero of fp32 |
| Case3 | 0x7f800000 | 0x7f80 | positive infinity of fp32 |
| Case4 | 0x7fc00000 | 0x7fc0 | quiet NaN of fp32 |
| Case5 | 0x00000001 | 0x0000 | subnormal positive of fp32 |
| Case6 | 0x80000001 | 0x8000 | subnormal negative of fp32 |
| Case7 | 0x7f7fffff | 0x7f80 | normal maximum of fp32 |
| Case8 | 0x00800000 | 0x0080 | normal minimum of fp32 |
### Assembly Code
```c
.data
# test cases pair = {input fp32 value, golden bf16 value}
test_case:
.word 0x41200000, 0x00004120 # 0: normal positive of fp32
.word 0xc1200000, 0x0000c120 # 1: normal negative of fp32
.word 0x00000000, 0x00000000 # 2: positive zero of fp32
.word 0x7f800000, 0x00007f80 # 3: positive infinity of fp32
.word 0x7fc00000, 0x00007fc0 # 4: quiet NaN of fp32
.word 0x00000001, 0x00000000 # 5: subnormal positive of fp32
.word 0x80000001, 0x00008000 # 6: subnormal negative of fp32
.word 0x7f7fffff, 0x00007f80 # 7: normal maximum of fp32
.word 0x00800000, 0x00000080 # 8: normal minimum of fp32
err_str0: .string "TestCase"
err_str1: .string " of fp32_to_bf16() failed!!\n"
pass_str: .string "Pass all test cases of fp32_to_bf16()!!\n"
.text
main:
la s0, test_case # load the address of test_case to s0
# Initialize the test error to zero
addi s1, x0, 0 # s1 = test_err = 0
# Initialize local variables for the mainLoop
addi s2, x0, 0 # s2 = i = 0
addi s3, x0, 9 # s3 = const. = 9
mainLoop:
bgeu s2, s3, endMainLoop # when i >= 9, end mainLoop
# Load the argument that fp32_to_bf16() needs
lw a0, 0(s0)
# Function call
jal ra, fp32_to_bf16 # jump to functioin fp32_to_bf16
# Compare fp32_to_bf16() result with the golden
lw s4, 4(s0) # load golden, s4 = golden value
beq a0, s4, passTest # if result == golden, pass test case
# Update test error
addi s1, s1, 1 # test_err++
# Print failed message
la a0, err_str0 # load the address of err_str0
li a7, 4 # system call code for printing a string
ecall # print err_str0
addi a0, s2, 0 # load the number of the test case
li a7, 1 # system call code for printing a integer
ecall # print the number of the test case
la a0, err_str1 # load the address of err_str1
li a7, 4 # system call code for printing a string
ecall # print err_str1
passTest:
# Update test case and loop variable
addi s0, s0, 8 # move s0 to the address of the next test case
addi s2, s2, 1 # i++
j mainLoop # jal x0, mainLoop
endMainLoop:
bne s1, x0, exit # if test_err != 0, do not print pass message
# Print pass message
la a0, pass_str # load the address of pass_str
li a7, 4 # system call code for printing a string
ecall # print pass_str16
exit:
li a7, 10 # system call code for exiting the program
ecall # make the exit system call
fp32_to_bf16:
addi t0, a0, 0 # pass argument s, t0 = s
li t1, 0x7fffffff # t1 = 0x7fffffff
and t1, t0, t1 # t1 = u.i & 0x7fffffff
li t2, 0x7f800000 # t2 = 0x7f800000
if:
bge t2, t1, endIf # if 0x7f800000 >= (u.i & 0x7fffffff), skip NaN handling
# Handle NaN value
srli t1, t0, 16 # t1 = u.i >> 16
ori a0, t1, 64 # t1 = h = (u.i >> 16) | 64
ret # jalr x0, ra, 0
endIf:
# Round to nearest even value
srli t1, t0, 16 # t1 = u.i >> 0x10
andi t1, t1, 1 # t1 = ((u.i >> 0x10) & 1)
li t2, 0x7fff # t2 = 0x7fff
add t1, t2, t1 # t1 = 0x7fff + ((u.i >> 0x10) & 1)
add t1, t0, t1 # t1 = u.i + (0x7fff + ((u.i >> 0x10) & 1))
srli a0, t1, 16 # t1 = h = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
ret # jalr x0, ra, 0
```
# Hamming Distance ([Leetcode 461](https://leetcode.com/problems/hamming-distance/description/))
> Description: The [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between two integers is the number of positions at which the corresponding bits are different.
>
> Given two integers `x` and `y`, return *the **Hamming distance** between them*.
## C Program
```c
#include <stdint.h>
int hammingDistance(int x, int y) {
int n = x ^ y;
int count = 0;
while (n > 0) {
count += n & 1;
n = n >> 1;
}
return count;
}
```
:::danger
You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.
:::
## Test Case
| | Input1 | Input2 | Golden | Description |
| ----- | :----: | :----: | :----: | :---------: |
| Case0 | 0x0000000f | 0x0000000f | 0 | no bit is different |
| Case1 | 0x0000000f | 0x0000001f | 1 | odd different bit |
| Case2 | 0x0000000f | 0x0000003f | 2 | even different bits |
| Case3 | 0x0f0f0f0f | 0xf0f0f0f0 | 32 | all bits are different |
## Assembly Code
```c
.data
test_case0: .word 0x0000000f, 0x0000000f
test_case1: .word 0x0000000f, 0x0000001f
test_case2: .word 0x0000000f, 0x0000003f
test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0
golden: .word 0, 1, 2, 32
err_str0: .string "TestCase"
err_str1: .string " failed!!\n"
pass_str: .string "Pass all test cases!!\n"
.text
main:
la s0, test_case0 # load the address of test_case0 to s0
la s1, golden # load the address of golden to s1
# Initialize the test error to zero
addi t0, x0, 0 # t0 = test_err = 0
# Initialize local variable for the mainLoop
addi t1, x0, 0 # t1 = i = 0
addi t2, x0, 4 # t2 = const. = 4
mainLoop:
bgeu t1, t2, endMainLoop # when i >= 4, end mainLoop
# Prepare the arguments that hammingDistance() need
lw a0, 0(s0) # load argument x, a0 = x
lw a1, 4(s0) # load argument y, a1 = y
# RISC-V calling convention and function call
addi sp, sp, -12 # push t0 ~ t2 to the stack
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
jal ra, hammingDistance # jump to function hammingDistance
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
addi sp, sp, 12 # pop t0~t2 from the stack
# Compare hammingDistance() result with the golden
lw t3, 0(s1) # load golden, t3 = golden[i]
beq a0, t3, passTest # if the result == golden, pass test
# Update test error
addi t0, t0, 1 # test_err++
# Print faild message
la a0, err_str0 # load the address of err_str0
li a7, 4 # system call code for printing a string
ecall # print err_str0
addi a0, t1, 0 # load the number of the test case
li a7, 1 # system call code for printing a integer
ecall # print the number of the test case
la a0, err_str1 # load the address of err_str1
li a7, 4 # system call code for printing a string
ecall # print err_str1
passTest:
# Update test case, golden and loop variable
addi s0, s0, 8 # move s0 to next test_case
addi s1, s1, 4 # move s1 to nest golden
addi t1, t1, 1 # i++
j mainLoop # jal x0, mainLoop
endMainLoop:
bne t0, x0, exit # if test_err != 0, do not print pass message
# Print pass message
la a0, pass_str # load the address of pass_str
li a7, 4 # system call code for printing a string
ecall # print pass_str
exit:
li a7, 10 # system call code for exiting the program
ecall # make the exit system call
hammingDistance:
# RISC-V calling convention
addi sp, sp, -8 # push s0 and s1 to stack
sw s0, 0(sp)
sw s1, 4(sp)
# Pass argument used in hammingDistance()
addi s0, a0, 0 # pass argument x, s0 = x
addi s1, a1, 0 # pass argument y, s1 = y
# Initialize local variables of hammingDistance()
xor t0, s0, s1 # n = x ^ y
addi t1, x0, 0 # count = 0
loop:
bgeu x0, t0, endLoop # when 0 >= n, end loop
andi t2, t0, 1 # t2 = n & 1
add t1, t1, t2 # count += n & 1
srli t0, t0, 1 # n = n >> 1
j loop # jal x0, loop
endLoop:
# RISC-V calling convention
lw s0, 0(sp)
lw s1, 4(sp)
addi sp, sp, 8 # pop s0 and s1 from stack
# Return count
addi a0, t1, 0 # move t1 to a0, a0 = count
ret # jalr x0, ra, 0
```
| Metrics | Value |
| :------ | ----: |
| Cycles | 494 |
| Instrs. retired | 354 |
| CPI | 1.4 |
| IPC | 0.717 |
| Code size | 82 |
:::danger
Use fewer instructions.
:::
## Assembly Optimization
### Seperate the use of callee-saved and caller-saved registers
```c
.data
test_case0: .word 0x0000000f, 0x0000000f
test_case1: .word 0x0000000f, 0x0000001f
test_case2: .word 0x0000000f, 0x0000003f
test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0
golden: .word 0, 1, 2, 32
err_str0: .string "TestCase"
err_str1: .string " failed!!\n"
pass_str: .string "Pass all test cases!!\n"
.text
main:
la s0, test_case0 # load the address of test_case0 to s0
la s1, golden # load the address of golden to s1
# Initialize the test error to zero
addi s2, x0, 0 # s2 = test_err = 0
# Initialize local variable for the mainLoop
addi s3, x0, 0 # s3 = i = 0
addi s4, x0, 4 # s4 = const. = 4
mainLoop:
bgeu s3, s4, endMainLoop # when i >= 4, end mainLoop
# Prepare the arguments that hammingDistance() need
lw a0, 0(s0) # load argument x, a0 = x
lw a1, 4(s0) # load argument y, a1 = y
# Function call
jal ra, hammingDistance # jump to function hammingDistance
# Compare hammingDistance() result with the golden
lw s5, 0(s1) # load golden, s5 = golden[i]
beq a0, s5, passTest # if the result == golden, pass test
# Update test error
addi s2, s2, 1 # test_err++
# Print faild message
la a0, err_str0 # load the address of err_str0
li a7, 4 # system call code for printing a string
ecall # print err_str0
addi a0, s3, 0 # load the number of the test case
li a7, 1 # system call code for printing a integer
ecall # print the number of the test case
la a0, err_str1 # load the address of err_str1
li a7, 4 # system call code for printing a string
ecall # print err_str1
passTest:
# Update test case, golden and loop variable
addi s0, s0, 8 # move s0 to next test_case
addi s1, s1, 4 # move s1 to nest golden
addi s3, s3, 1 # i++
j mainLoop # jal x0, mainLoop
endMainLoop:
bne s2, x0, exit # if test_err != 0, do not print pass message
# Print pass message
la a0, pass_str # load the address of pass_str
li a7, 4 # system call code for printing a string
ecall # print pass_str
exit:
li a7, 10 # system call code for exiting the program
ecall # make the exit system call
hammingDistance:
# Pass argument used in hammingDistance()
addi t0, a0, 0 # pass argument x, t0 = x
addi t1, a1, 0 # pass argument y, t1 = y
# Initialize local variables of hammingDistance()
xor t2, t0, t1 # n = x ^ y
addi t3, x0, 0 # count = 0
loop:
bgeu x0, t2, endLoop # when 0 >= n, end loop
andi t4, t2, 1 # t4 = n & 1
add t3, t3, t4 # count += n & 1
srli t2, t2, 1 # n = n >> 1
j loop # jal x0, loop
endLoop:
# Return count
addi a0, t3, 0 # move t3 to a0, a0 = count
ret # jalr x0, ra, 0
```
| Metrics | Value |
| :------ | ----: |
| Cycles | 438 |
| Instrs. retired | 298 |
| CPI | 1.47 |
| IPC | 0.68 |
| Code size | 68 |
### Loop Unrolling
#### Unrolling 2 Times
```c
loop:
bgeu x0, t2, endLoop # if n > 0 (0 >= n), end loop
andi t4, t2, 1 # t4 = n & 1
add t3, t3, t4 # count += n & 1
srli t2, t2, 1 # n = n >> 1
andi t4, t2, 1 # t4 = (n >> 1) & 1
add t3, t3, t4 # count += (n >> 1) & 1
srli t2, t2, 1 # n = n >> 2
j loop # jal x0, loop
```
| Metrics | Value |
| :------ | ----: |
| Cycles | 357 |
| Instrs. retired | 259 |
| CPI | 1.38 |
| IPC | 0.725 |
| Code size | 71 |
#### Unrolling 4 Times
```c
loop:
bgeu x0, t2, endLoop # if n > 0 (0 >= n), end loop
andi t4, t2, 1 # t4 = n & 1
add t3, t3, t4 # count += n & 1
srli t2, t2, 1 # n = n >> 1
andi t4, t2, 1 # t4 = (n >> 1) & 1
add t3, t3, t4 # count += (n >> 1) & 1
srli t2, t2, 1 # n = n >> 2
andi t4, t2, 1 # t4 = (n >> 2) & 1
add t3, t3, t4 # count += (n >> 2) & 1
srli t2, t2, 1 # n = n >> 3
andi t4, t2, 1 # t4 = (n >> 3) & 1
add t3, t3, t4 # count += (n >> 3) & 1
srli t2, t2, 1 # n = n >> 4
j loop # jal x0, loop
```
| Metrics | Value |
| :------ | ----: |
| Cycles | 329 |
| Instrs. retired | 251 |
| CPI | 1.31 |
| IPC | 0.763 |
| Code size | 77 |
#### Unrolling Times Analysis
| Metrics | w/o Unrolling | 2 Times | 4 Times | 8 Times | 16 Times | 32 Times |
| :------ | ------------: | ------: | ------: | ------: | -------: | -------: |
| Cycles | 438 | 357 (-18%) | 329 (-25%) | 305 (-30%) | 345 (-21%) | 505 (+15%) |
| Instrs. retired | 298 | 259 (-13%) | 251 (-16%) | 239 (-20%) | 283 (-5%) | 459 (+54%) |
| CPI | 1.47 | 1.38 (-6%) | 1.31 (-11%) | 1.28 (-13%) | 1.22 (-17%) | 1.1 (-25%) |
| IPC | 0.68 | 0.725 (+7%) | 0.763 (+12%) | 0.784 (+15%) | 0.82 (+21%) | 0.909 (+34%) |
| Code size | 68 | 71 (+4% )| 77 (+13%) | 89 (+31%) | 118 (+74%) | 158 (+132%) |
# Reference
- [bfloat16 floating-point format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
- [The RISC-V Instruction Set Manual Volume I: User-Level ISA](https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf)