---
tags: RISC-V, jserv
---
# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`MicahDoo`](https://github.com/MicahDoo) >
[Notes on Using Ripes](https://hackmd.io/wr2NJwD5TJiENanuV_hiSQ)
## Invert Binary Tree
I base this assignment upon [LeetCode Problem 226. Invert Binary Tree](https://leetcode.com/problems/invert-binary-tree/), an <font color=lime>Easy</font> Tree problem:
> Given the `root` of a binary tree, invert the tree, and return its `root`.
>
>
:point_right: Straight to solution: Click [here](#Final-Assembly-Code) to skip my investigation and thought process.
## Problem Evaluation
### Data Organization
- **Nodes**: contiguous class objects, each consisting of three elements: value (`val`), left branch (`left`) and right branch (`right`) in order.
- **Links**: pointer values, each either a null pointer value or points to a child (a node).
```graphviz
digraph g {
fontname="Helvetica,Arial,sans-serif"
node [fontname="Helvetica,Arial,sans-serif"]
edge [fontname="Helvetica,Arial,sans-serif"]
graph [
rankdir = "LR"
];
node [
fontsize = "16"
shape = "ellipse"
];
edge [
];
"node0" [
label = " <f0> val | <f1> left | <f2> right"
shape = "record"
];
"node1" [
label = " <f0> val | left | right"
shape = "record"
];
"node2" [
label = " <f0> val | left | right"
shape = "record"
];
"node0":f1 -> "node1":f0 [
id = 0
];
"node0":f2 -> "node2":f0 [
id = 1
];
}
```
### Strategy
To concoct a strategy to attack this problem, it helps to first construct a solution using a higher level of abstraction.
Below is a function written in C++ to tackle this problem:
```cpp
void invert(TreeNode* node) {
if (node == nullptr) return;
TreeNode * temp = node -> left;
node -> left = node -> right;
node -> right = temp;
invert(node -> left);
invert(node -> right);
}
```
A rough flowchart of the above function:
```flow
st=>start: Invert(node)
e=>end: Return
cond=>condition: node == nullptr ?
op=>operation: Swap node->left & node->right
call=>subroutine: Invert(node->left)
cond2=>condition: node->right == null ?
call2=>subroutine: Invert(node->right)
st->cond
cond(no@y)->e
cond(yes@n)->op
op->call->call2->e
```
### Evaluation of the solution in the context of assembly
In assembly language, extra care has to be taken apropos a number of things:
- There's no naming of a variable, only data in memory.
- A class object is organized as a contiguous memory block where each word is an element in order.
- Instead of refering to the elements by their names, we use **offsets**. In the case of Invert Binary Tree, assuming a 4-byte word length, `->val` is `+0`, `->left` is `+4` and `->right` is `+8`.
- Function calls are handled on a stack.
- Instead of the easy parenthesizing and the handy `return` keyword, we need to manually handle function calls by pushing arguments, local variables(or pointers to local variables) and return addresses on the stack and update the stack pointer.
:::info
- [ ] Is a frame pointer necessary or benefitial at all?
:::
[RISCV Cheatsheet](https://itnext.io/risc-v-instruction-set-cheatsheet-70961b4bbe8)
## Thought Process
### Dataset A
I will use the following data for desmonstration throughout the development of the code, unless a modification is needed:
```c
.data
n: # 0x10000000
.word 1, 0x1000000c, 0x10000018
nl: # 0x1000000c
.word 2, 0x10000024, 0x10000030
nr: # 0x10000018
.word 3, 0x1000003c, 0x10000048
nll: # 0x10000024
.word 4, 0, 0
nlr: # 0x10000030
.word 5, 0, 0
nrl: # 0x1000003c
.word 6, 0, 0
nrr: # 0x10000048
.word 7, 0, 0
```
:::spoiler Original Tree
```
Tree DFS: 1 2 4 5 3 6 7
```
```graphviz
digraph g {
fontname="Helvetica,Arial,sans-serif"
node [fontname="Helvetica,Arial,sans-serif"]
edge [fontname="Helvetica,Arial,sans-serif"]
graph [
rankdir = "LR"
];
node [
fontsize = "16"
shape = "ellipse"
];
edge [
];
"n" [
label = " <f0> 1 | <f1> 0x1000000c | <f2> 0x10000018"
shape = "record"
];
"nl" [
label = " <f0> 2 | <f1> 0x10000024 | <f2> 0x10000030"
shape = "record"
];
"nr" [
label = " <f0> 3 | <f1> 0x1000003c | <f2> 0x10000048"
shape = "record"
];
"nll" [
label = " <f0> 4 | <f1> 0 | <f2> 0"
shape = "record"
];
"nlr" [
label = " <f0> 5 | <f1> 0 | <f2> 0"
shape = "record"
];
"nrl" [
label = " <f0> 6 | <f1> 0 | <f2> 0"
shape = "record"
];
"nrr" [
label = " <f0> 7 | <f1> 0 | <f2> 0"
shape = "record"
];
"n":f1 -> "nl":f0
"n":f2 -> "nr":f0
"nl":f1 -> "nll":f0
"nl":f2 -> "nlr":f0
"nr":f1 -> "nrl":f0
"nr":f2 -> "nrr":f0
}
```
:::
:::spoiler Desired Result
```
Tree: 1 3 7 6 2 5 4
```
```graphviz
digraph g {
fontname="Helvetica,Arial,sans-serif"
node [fontname="Helvetica,Arial,sans-serif"]
edge [fontname="Helvetica,Arial,sans-serif"]
graph [
rankdir = "LR"
];
node [
fontsize = "16"
shape = "ellipse"
];
edge [
];
"n" [
label = " <f0> 1 | <f1> 0x10000018 | <f2> 0x1000000c"
shape = "record"
];
"nr" [
label = " <f0> 3 | <f1> 0x10000048 | <f2> 0x1000003c"
shape = "record"
];
"nl" [
label = " <f0> 2 | <f1> 0x10000030 | <f2> 0x10000024"
shape = "record"
];
"nll" [
label = " <f0> 4 | <f1> 0 | <f2> 0"
shape = "record"
];
"nlr" [
label = " <f0> 5 | <f1> 0 | <f2> 0"
shape = "record"
];
"nrl" [
label = " <f0> 6 | <f1> 0 | <f2> 0"
shape = "record"
];
"nrr" [
label = " <f0> 7 | <f1> 0 | <f2> 0"
shape = "record"
];
"n":f1 -> "nr":f0
"n":f2 -> "nl":f0
"nl":f1 -> "nlr":f0
"nl":f2 -> "nll":f0
"nr":f1 -> "nrr":f0
"nr":f2 -> "nrl":f0
}
```
:::
### Traversal
Working from the bottom, I first try to implement a traversal of the tree:
```c
.text
main:
la a0, n
jal traverse
# Exit program
li a7, 10
ecall
traverse:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
### DO THINGS
lw s0, 4, a0 # s0 <- M[n+4] = node->left
lw s1, 8, a0 # s1 <- M[n+8] = node->right
# BRANCH
beq s0, x0, left_traversed # node->left == 0?
# no
mv a0, s0
jal traverse
# yes
left_traversed:
beq s1, x0, right_traversed # node->right == 0
# no
mv a0, s1
jal traverse
# yes
right_traversed:
# POP from stack
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret
```
The above could serve as a template. To use it:
1. Re-label `traverse`, `left_traversed`, and `right_traversed` to avoid conflicts.
2. Replace `### DO THINGS` with operations.
:::spoiler For example: Print DFS
```c
.data # 0x10000000
n: # 0x10000000
.word 1, 0x1000000c, 0x10000018
nl: # 0x1000000c
.word 2, 0x10000024, 0x10000030
nr: # 0x10000018
.word 3, 0x1000003c, 0x10000048
nll: # 0x10000024
.word 4, 0, 0
nlr: # 0x10000030
.word 5, 0, 0
nrl: # 0x1000003c
.word 6, 0, 0
nrr: # 0x10000048
.word 7, 0, 0
str1:
.string "\nTree DFS: "
.text
main:
la a0, n
jal printResult
exit:
# Exit program
li a7, 10
ecall
printResult:
# PUSH
addi sp, sp, -4
sw ra, 0, sp
# print
mv t0, a0
la a0, str1 # print string
li a7, 4
ecall
mv a0, t0
# mv a0, a0 # arg is carried over from the caller
jal traverse
# POP
lw ra, 0, sp
addi sp, sp, 4
ret
traverse:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
mv t0, a0
lw a0, 0, a0
li a7, 1 # print integer
ecall
mv a0, t0
lw s0, 4, a0 # s0 <- M[n+4] = node->left
lw s1, 8, a0 # s1 <- M[n+8] = node->right
# BRANCH
beq s0, x0, left_traversed # node->left == 0?
# no
mv a0, s0
jal traverse
# yes
left_traversed:
beq s1, x0, right_traversed # node->right == 0
# no
mv a0, s1
jal traverse
# yes
right_traversed:
# POP from stack
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret
```
Output:
```shell
Tree DFS: 1245367
Program exited with code: 0
```
:::
## Final Assembly Code
After adding the print feature, I use the `traversal` template to swap `node->left` and `node->right` for all traversed `node` (I include only the function in this synopsis for simplicity):
```c
invert:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
# ASSUME: address to node is already in a0
lw s0, 4, a0 # s0 <- M[a0+4] = node->left
lw s1, 8, a0 # s1 <- M[a0+8] = node->right
# SWAP:
sw s0, 8, a0 # s0 -> M[a0+8]
sw s1, 4, a0 # s1 -> M[a0+4]
# BRANCH
beq s0, x0, left_done # node->left == 0?
# no
mv a0, s0
jal invert
# yes
left_done:
beq s1, x0, right_done # node->right == 0?
# no
mv a0, s1
jal invert
# yes
right_done:
# POP
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret
```
:::spoiler Full Code
```c
.data # 0x10000000
n: # 0x10000000
.word 1, 0x1000000c, 0x10000018
nl: # 0x1000000c
.word 2, 0x10000024, 0x10000030
nr: # 0x10000018
.word 3, 0x1000003c, 0x10000048
nll: # 0x10000024
.word 4, 0, 0
nlr: # 0x10000030
.word 5, 0, 0
nrl: # 0x1000003c
.word 6, 0, 0
nrr: # 0x10000048
.word 7, 0, 0
str1:
.string "\nTree DFS: "
space:
.string " "
.text
main:
# Load argument(s) from static data (pseudo instruction)
# PRE-VALIDATION
la a0, n # Preload node address
# Equivalence:
# auipc x10 0x10000
# lw x10 24 x10 # why 24? [12:0] 0x18 = 24
jal printResult
# EXECUTION
la a0, n
jal invert
# VALIDATION
la a0, n
jal printResult
exit:
# Exit program
li a7, 10
ecall
invert:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
# ASSUME: address to node is already in a0
lw s0, 4, a0 # s0 <- M[a0+4] = node->left
lw s1, 8, a0 # s1 <- M[a0+8] = node->right
# SWAP:
sw s0, 8, a0 # s0 -> M[a0+8]
sw s1, 4, a0 # s1 -> M[a0+4]
# TRAVERSE:
beq s0, x0, left_done # node->left == 0?
# no
mv a0, s0
jal invert
# yes
left_done:
beq s1, x0, right_done # node->right == 0?
# no
mv a0, s1
jal invert
# yes
right_done:
# POP
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret # invert
printResult:
# PUSH
addi sp, sp, -4
sw ra, 0, sp
mv t0, a0
la a0, str1
li a7, 4 # print string
ecall
mv a0, t0
# mv a0, a0
jal traverse
# POP
lw ra, 0, sp
addi sp, sp, 4
ret
traverse:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
mv t0, a0 # supply
lw a0, 0, a0
li a7, 1 # print integer
ecall
mv a0, t0 # restore
lw s0, 4, a0 # s0 <- M[a0+4] = node->left
lw s1, 8, a0 # s1 <- M[a0+8] = node->right
# BRANCH
# CHECK node->left == 0
beq s0, x0, left_traversed
# no
mv a0, s0
jal traverse
# yes
left_traversed:
# CHECK node->right == 0
beq s1, x0, right_traversed
# no
mv a0, s1
jal traverse
# yes
right_traversed:
# POP
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret # traverse
```
:::
:::spoiler Older Version
```c
.data # 0x10000000
n: # 0x10000000
.word 1, 0x1000000c, 0x10000018
nl: # 0x1000000c
.word 2, 0x10000024, 0x10000030
nr: # 0x10000018
.word 3, 0x1000003c, 0x10000048
nll: # 0x10000024
.word 4, 0, 0
nlr: # 0x10000030
.word 5, 0, 0
nrl: # 0x1000003c
.word 6, 0, 0
nrr: # 0x10000048
.word 7, 0, 0
str1:
.string "\nTree DFS: "
space:
.string " "
.text
main:
# Load argument from static data (pseudo instruction)
# Load node address
la a0, n
jal printResult
# Equivalence:
# auipc x10 0x10000
# lw x10 24 x10 # why 24? [12:0] 0x18 = 24
# j invert
# =
# jal x0, invert # assign program return address to x0 but x0 will be automatically reset to 0 anyway -> no effect
# save return address
# current return addresss in ra, if callee calls another function, push it
la a0, n
jal invert
la a0, n
jal printResult
exit:
# Exit program
li a7, 10
ecall
invert:
# ASSUME: address to node is already in a0
# Calculate address to node->left
addi a0, a0, 4 # n + 4
# Get the address stored in that address
# use saved register s0 for local variables
lw s0, 0, a0 # s0 <- [a0] = [n+4]
# Calculate address to node->right
addi a0, a0, 4 # n + 4 + 4
# Get the address stored in that address
# use saved register s1 for local variables
lw s1, 0, a0 # s1 <- [a0] = [n+8]
# SWAP: store node->left in node->right
sw s0, 0, a0 # s0 -> [n+8]
# SWAP: store node->right in node->left
addi a0, a0, -4
sw s1, 0, a0 # s1 -> [n+4]
# TRAVERSE:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
# PASS arguments
lw a0, 4, sp
# =
# addi a0, s0, 0
# CHECK node->left == 0
beq a0, x0, left_done
# no
jal invert
# yes
left_done:
# CHECK node->right == 0
lw a0, 8, sp
beq a0, x0, right_done
# no
jal invert
# yes
right_done:
lw ra, 12, sp
addi sp, sp, 12
ret # invert
printResult:
mv t0, a0
la a0, str1
li a7, 4 # print string
ecall
mv a0, t0
# PUSH onto stack
sw ra, 0, sp
addi sp, sp, -4
jal traverse
# POP from stack
lw ra, 0, sp
addi sp, sp, 4
ret
traverse:
### DO THINGS
mv t0, a0
lw a0, 0, a0
li a7, 1 # print integer
ecall
mv a0, t0
# Calculate address to node->left
addi a0, a0, 4 # n + 4
lw s0, 0, a0 # s0 <- [a0] = [n+4]
# Calculate address to node->right
addi a0, a0, 4 # n + 4 + 4
lw s1, 0, a0 # s1 <- [a0] = [n+8]
# PUSH onto stack
sw ra, 0, sp
sw s0, -4, sp
sw s1, -8, sp
addi sp, sp, -12
# BRANCH
# PASS left argument
lw a0, 8, sp
# CHECK node->left == 0
beq a0, x0, left_traversed
# no
jal traverse
# yes
left_traversed:
# PASS right argument
lw a0, 4, sp
# CHECK node->right == 0
beq a0, x0, right_traversed
# no
jal traverse
# yes
right_traversed:
# POP
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret # traverse
```
:::
:::warning
:warning: Can you use fewer instructions?
:notes: jserv
:::
:::warning
My attempt at shortening the code:
* Access struct members (`node->left` & `node->right`) by using the offset. (Don't use addi + lw).
* Loop through left and right?
:::
:::info
Takeaway from Code Review with [`teimeikichengmingchi`](https://hackmd.io/X4DFhZwETxSo5uwU0scsLA?view):
- Printing and Checking is an inefficient validation method.
- Rewrite my code so as to make my validation method more efficient.
:::
Output:
```shell
Tree DFS: 1245367
Tree DFS: 1376254
Program exited with code: 0
```
As shown in the above output, the tree is thus successfully inverted.
## Ripes Simulator Pipeline Features
Here I will use the following processor and instruction to demonstrate Ripes' pipelining stages:
1. 5-Stage Processor: This is the most comprehensive 5-stage processor (has both forwarding and hazard detection)
2. `lw`: since load instructions are the only type that is active in all five stages.
### Initial State
Instruction: `lw a1 10 a0`
### IF (Instruction Fetch)

PC increases by 4 to advance to the next instruction in Instruction Memory.
### ID (Instruction Decode)

Instruction Memory passes the instruction on to the Decoder and the Immediate Generator. The Decoder then parse the instruction to get the following values:
* `a0`: number of the register which contains the address which, after added the immediate value, gives us the memory of the data to be loaded. It is fed into the Registers Module to get `R[a0]`.
* `M[R[a0]+imm]` is the data to be loaded.
* `LW`: the instruction type. Will be used by control unit to control the flow. This also tells Immediate Generator how to parse the immediate value. The Immediate Generator gives us `10`.
* `a1`: the write-back register, will be passed on to the next few stages until the write-back happens.
### EX (Execution)

* `R[a0]` from the Registers and the immediate value `10` from the Immediate Generator are fed into the ALU for addition. The result is `R[a0] + 10`.
* Note that `a1` is being passed through this stage without engaging in any operations.
### MEM (Memory)

* The address `R[a0]+10` is passed into Data Memory and it spits out the word `M[R[a0]+10]` stored in that address.
* Notice that WrEn (Write Enable) is off, which means nothing is being written into Data Memory. The Data In port is thus not important.
### WB (Write Back)

* The data `M[R[a0]+10]` is passed into the Registers.
* The `a1`, having looped all the way also back into Registers, tells it where to store `M[R[a0]+10]`.
* Another destination is through the Forwarding Unit (the two multiplexers to the right of IDEX) into either the Branch Comparator or the ALU. If an instruction happens to need `R[a1]`, then it doesn't need to stay in the ID stage to wait for the write back; it can progress to the EX stage and get the value from the WB stage directly.
## Issues with Other Processors
While the full code above works without error and as intended with a 5-stage processor with forwarding and hazard detection, other processors present a different story:
### 5-Stage Processor wo/ Forwarding (Still has Hazard Detection)
No error occurs, but since the time saved by forwarding is replaced by`nop`s injected by Control Unit's Hazard Detection, time efficiency is compromised.
* In general, for any instrution that writes a value, the value is not going to be saved back in the registers until the WB stage of that instruction. If the next instruction needs the WB value, it needs to **stay in the ID stage until the previous instruction reaches WB**. That means the two stages in between, EX and MEM stages are empty. That is **two stalls/nops**.
* If the circuit is designed so that the value can be forwarded from a stage before WB, then stalls can be avoided.
* For R-Type instructions, the WB value first appears as an output of ALU in the EX stage. Forwarding can be done from the end of EX directly to the start of the EX, which means when the next instruction advances into the Ex stage, the WB value of the previous instruction will have already been made available.
:::info
### Example: PREV: `addi x10 x0 0` followed by NEXT: `addi x10 x10 3` (a RAW data hazard)
#### :-1: Without Forwarding
PREV: EX
NEXT: ID
The addition result from PREV is fresh out of ALU. It hasn't been written back yet.

**Two cycles later...**
PREV: EX
NEXT: WB

PREV's output successfully written back to `x10`. Now NEXT can safely proceed to the stage EX with the right value for addition.
**Conclusion:**
Two cycles were wasted. That's two stalls.
#### :+1: With Forwarding
PREV: EX
NEXT: ID
The addition result from PREV is fresh out of ALU. It hasn't been written back yet, but there is a path out of ALU through the EX/MEM registers that can feed the value directly back into the ALU next stage.

**Next cycle...**
PREV: MEM
NEXT: EX

As illustrated above, both instructions advance to the next stage and the WB value from PREV can be directly pushed into the ALU unit currently used by NEXT.
**Conclusion**
No delay. Better than without forwarding.
:::
* For Load instructions, the WB value first appears as an output of DMEM in the MEM stage. The next instruction, if it needs the WB value, will have to stay in ID until the Load instruction reaches MEM. That translates to one empty stage in between, which means still one `nop` is needed for this to work. That is still a step up from 2 `nop`s without forwarding, though. Control's hazard detection will automatically add that for you.
:::info
### Example: PREV: `lw x10 0 x12` followed by NEXT: `addi x10 x10 3` (a Load Before Use data hazard)
#### :-1: Without Forwarding
PREV: EX
NEXT: ID
PREV has calculated the address of the data it needs to read from, but has yet to read it. Hence NEXT can't get the value at this stage.

**Next cycle...**
PREV: EX
NEXT: MEM
PREV just fetched the data from the address, but still has not written it back.

**Next cycle...**
PREV: EX
NEXT: WB

PREV's output successfully written back to `x10`. Now NEXT can safely proceed to the next stage with the right value for addition.
**Conclusion:**
Two cycles were wasted. That's two stalls.
#### :+1: With Forwarding
PREV: MEM
NEXT: ID

PREV has already fetched the data at `R[x12]` out of Data Memory. It hasn’t been written back yet, but there is a path out of DMEM through the MEM/WB registers that can feed the value directly back into the ALU next stage.
**Next cycle...**
PREV: WB
NEXT: EX

Because now at the WB stage, PREV can feed `M[R[x12]]` directly back into ALU, NEXT doesn't have to stay in ID to fetch from Registers.
**Conclusion:**
Only one cycle is wasted. Better than without forwarding.
:::
### 5-Stage Processor wo/ Hazard Detection (Still has Forwarding)
* Errors can occur.
* The appropriate number of with-forwarding nops need to be manually injected for the code to work correctly:
* For R-Type instructions, no `nop`s need to be inserted. The forwarding already takes care of everything.
* For Load instructions, one `nop` still need to be hand coded.
* There seems to be a potential design flaw with branch instructions: refer to [here](#Issue:-Branching-before-Forwarding).
#### Nops for Processor with Forwarding but without HD
* As discussed above, in order for the program to work correctly, we need to ensure the following:
1. If a load instruction's `rd` is used in the next instruction, put one `nop` in between.
2. :small_red_triangle: **(Might not be typical of a regular RISC-V machine due to the design flaw discussed in the [next section](#Issue:-Branching-before-Forwarding))** If a load/R-Type instruction's `rd` is used in the next or the one after the next instruction, which happens to be a branch instruction, make sure there are two non-conflicting instructions in between. Pad with `nop`s if no.
3. :small_red_triangle: **(Yet to figure out why, refer to the [ecall section](#Issue-with-Ecalls))** Three `nop`s between an `ecall` and instructions that set `a7` and `a0`.
:::spoiler Code
```c
.text
main:
# PRE-VALIDATION
la a0, n # Preload node address
# Equivalence:
# auipc x10 0x10000
# lw x10 24 x10 # why 24? [12:0] 0x18 = 24
jal printResult
# EXECUTION
la a0, n
jal invert
# VALIDATION
la a0, n
jal printResult
exit:
# Exit program
li a7, 10
nop
nop
nop
ecall
invert:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
lw s0, 4, a0 # s0 <- M[a0+4] = node->left
lw s1, 8, a0 # s1 <- M[a0+8] = node->right
nop
# SWAP:
sw s0, 8, a0 # s0 -> M[a0+8]
sw s1, 4, a0 # s1 -> M[a0+4]
# TRAVERSE:
beq s0, x0, left_done # node->left == 0?
# no
mv a0, s0
jal invert
# yes
left_done:
beq s1, x0, right_done # node->right == 0?
# no
mv a0, s1
jal invert
# yes
right_done:
# POP
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret # invert
printResult:
# PUSH
addi sp, sp, -4
sw ra, 0, sp
mv t0, a0
la a0, str1
li a7, 4 # print string
nop
nop
nop
ecall
mv a0, t0
# mv a0, a0
jal traverse
# POP
lw ra, 0, sp
addi sp, sp, 4
ret
traverse:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
mv t0, a0 # supply
lw a0, 0, a0
li a7, 1 # print integer
nop
nop
nop
ecall
mv a0, t0 # restore
lw s0, 4, a0 # s0 <- M[a0+4] = node->left
lw s1, 8, a0 # s1 <- M[a0+8] = node->right
nop
# BRANCH
# CHECK node->left == 0
beq s0, x0, left_traversed
# no
mv a0, s0
jal traverse
# yes
left_traversed:
# CHECK node->right == 0
beq s1, x0, right_traversed
# no
mv a0, s1
jal traverse
# yes
right_traversed:
# POP
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret # traverse
```
:::
## Issue: Branching before Forwarding (Detail on [Github](https://github.com/mortbopet/Ripes/issues/233))
I discovered an oddity in the design layout for 5-Stage Processor without Hazard Detection:
#### 5-Stage

#### 5-Stage without Hazard Detection Only
Here is the layout for one with Hazard Detection for comparison:

### Why are the with-forwarding layouts different for w/ HD and wo/ HD?
Hazard Detection (HD) should be implemented in the Control component, which means it doesn't affect the arrangement of datapath elements, however, a difference is clearly seen in the two layouts:
1. For the one with HD, the Branch Comparator takes its inputs through the two forwarding multiplexers. This way the BC is allowed to take inputs from the Registers, or from WB values forwarded from ALU or DMEM. As it should be.

2. For the one without HD, the Branch Comparator is instead located before the forwarding multiplexers, taking it's input only from the Registers. This means WB values can never reach a branch instruction by forwarding.

For comparison, in the other two cases where there are no forwarding, whether HD is in play doesn't make a difference:
#### 5-Stage without Forwarding Only

#### 5-Stage without Either

As can be seen, although the wirings near the Branch Comparator are drawn different, they are, in fact, the same picture.

:::warning
### :warning: Design Flaw
The problem mentioned above seems to be a design flaw. I published an [issue](https://github.com/mortbopet/Ripes/issues/233) on Ripes github repository to bring this problem to attention.
:::
:::spoiler Load before Use Test Code
```
.data
branch_str:
.string "Branched.\n"
not_branch_str:
.string "Not branched.\n"
.text
main:
# a0 set to 10
addi a0, zero, 10
nop
nop
# a0 set to 0
addi a1, zero, 0
addi a2, zero, 64
sw a1, 0, a2
lw a0, 0, a2
# nop
# zero stalls
# no hazard: a0 = 0 + 3 = 3 / hazard: a0 = 10 + 3 = 13
addi a0, a0, 3
li a7, 1
nop
nop
nop
ecall
exit:
# Exit
li a7, 10
nop
nop
nop
ecall
```
:::
## Issue with Ecalls

Without enough `nop`s, the value in `a7` might not be appropriate and the above message might pop up.
:::danger
- There must be three stalls before ecall even if there is a forwarding unit.
- One of the reasons might be because ecal uses the Branch Comparator. However, that still doesn't explain the following:
- With a normal forwarding component as in **5-Stage Processor**, a `beq` instruction experiences no delay. There is no reason `ecalls` should incur a delay for using the BC.
- Delays are typically two `nop`s. Where is the 3rd one coming from?
:::
:::danger
Can't print anything with `ecall`s when neither the forwarding unit or hazard detection is present.
- Why?
- How does `ecall` work?
:::
## Optimization: Code Scheduling to Avoid Stalls and Hazards
* Assume the worst (no forwarding and no hazard detection).
* Two benefits:
1. With Hazard Detection, performance is better.
2. Without Hazard Detection, performance AND correctness are both better.
### Example: Push to Stack
#### Original
```c=
addi sp, sp, 0
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
```
Line 1 and 2 is a typical RAW hazard. To avoid that, I move the `addi` part to the end and add a -12 bias to all the offsets.
#### Modified
```c=
sw ra, -12, sp
sw s0, -8, sp
sw s1, -4, sp
addi sp, sp, -12
```
### Example: SWAP Two Values
#### Original
```c=
sw s0, -8, sp
sw s1, -4, sp
sw ra, -12, sp
addi sp, sp, -12
lw s0, 4, a0
lw s1, 8, a0
sw s0, 8, a0
sw s1, 4, a0
```
Line 6 and 7 are only one line away from the next line using the register that they load values into. I would be a good idea to move one instruction from the top to below these two instructions, given the change or order doesn't change the results. Since line 4 doesn't use the same registers, it is safe to move it down.
#### Modified
```c=
sw s0, -8, sp
sw s1, -4, sp
sw ra, -12, sp
lw s0, 4, a0
lw s1, 8, a0
addi sp, sp, -12
sw s0, 8, a0
sw s1, 4, a0
```
### Example Recursion
```c=
traverse:
# PUSH
sw ra, -12, sp
sw s0, -8, sp
sw s1, -4, sp
addi sp, sp, -12
mv t0, a0 # supply
lw a0, 0, a0
li a7, 1 # print integer
nop
nop
nop
ecall
mv a0, t0 # restore
nop
nop
lw s0, 4, a0 # s0 <- M[a0+4] = node->left
lw s1, 8, a0 # s1 <- M[a0+8] = node->right
nop
# BRANCH
beq s0, x0, left_traversed
mv a0, s0
jal traverse
```
```c=
traverse:
li a7, 1
# don't use a0 on the first line, because a0 was just stored 2 instructions ago before jal traverse was called
mv t0, a0 # supply
lw a0, 0, a0
sw ra, -12, sp
sw s0, -8, sp
sw s1, -4, sp
lw s0, 4, t0 # s0 <- M[a0+4] = node->left
lw s1, 8, t0 # s1 <- M[a0+8] = node->right
addi sp, sp, -12
ecall
mv a0, t0 # restore
# BRANCH
# CHECK node->left == 0
beq s0, x0, left_traversed
# no
mv a0, s0
jal traverse
```
:::spoiler Full Revamped Code
```c
.data
n: # 0x10000000
.word 1, 0x1000000c, 0x10000018
nl: # 0x1000000c
.word 2, 0x10000024, 0x10000030
nr: # 0x10000018
.word 3, 0x1000003c, 0x10000048
nll: # 0x10000024
.word 4, 0, 0
nlr: # 0x10000030
.word 5, 0, 0
nrl: # 0x1000003c
.word 6, 0, 0
nrr: # 0x10000048
.word 7, 0, 0
str1:
.string "\nTree DFS: "
.text
main:
# Load argument(s) from static data (pseudo instruction)
# PRE-VALIDATION
la a0, n # Preload node address
# Equivalence:
# auipc x10 0x10000
# lw x10 24 x10 # why 24? [12:0] 0x18 = 24
jal printResult
# EXECUTION
la a0, n
jal invert
# VALIDATION
la a0, n
nop
jal printResult
exit:
# Exit program
li a7, 10
nop
nop
nop
ecall
invert:
# PUSH
sw s0, -8, sp
sw s1, -4, sp
sw ra, -12, sp
lw s0, 4, a0 # s0 <- M[a0+4] = node->left
lw s1, 8, a0 # s1 <- M[a0+8] = node->right
addi sp, sp, -12
# SWAP:
sw s0, 8, a0 # s0 -> M[a0+8]
sw s1, 4, a0 # s1 -> M[a0+4]
# TRAVERSE:
beq s0, x0, left_done # node->left == 0?
# no
mv a0, s0
jal invert
# yes
left_done:
beq s1, x0, right_done # node->right == 0?
# no
mv a0, s1
jal invert
# yes
right_done:
# POP
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret # invert
printResult:
# PUSH
li a7, 4 # print string
mv t0, a0
la a0, str1
sw ra, -4, sp
addi sp, sp, -4
nop
ecall
mv a0, t0
# mv a0, a0
jal traverse
# POP
lw ra, 0, sp
addi sp, sp, 4
ret
traverse:
li a7, 1
# don't use a0 on the first line, because a0 was just stored 2 instructions ago before jal traverse was called
mv t0, a0 # supply
lw a0, 0, a0
sw ra, -12, sp
sw s0, -8, sp
sw s1, -4, sp
lw s0, 4, t0 # s0 <- M[a0+4] = node->left
lw s1, 8, t0 # s1 <- M[a0+8] = node->right
addi sp, sp, -12
ecall
mv a0, t0 # restore
# BRANCH
# CHECK node->left == 0
beq s0, x0, left_traversed
# no
mv a0, s0
jal traverse
# yes
left_traversed:
# CHECK node->right == 0
beq s1, x0, right_traversed
# no
mv a0, s1
nop
jal traverse
# yes
right_traversed:
# POP
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret # traverse
```
:::
### Tests
| Cycles | Full | wo/ F w/ HD | w/F wo/ HD | wo/ F wo/ HD|
| ---------------- | ---- | ----------- | ------------ | ------------ |
| Original | 648 | 741 | Ecall Error | Ecall Error |
| Code Rescheduled | 660 | 672 | 584 | 191 (Crash) |
## Optimization: Branch Reordering to Minimize Flushes/Kills
- As shown in this very recent (only 10 days before this homework is due) [discussion](https://github.com/mortbopet/Ripes/discussions/230) on Ripes' github repository, no branch prediction is implemented in this simulator. In such case, it will do good for the programmer to write the code so that the less taken route is in the branch and the more frequently occuring follows immediately after the branch instruction. This way, we can ensure that the number of times we have to flush immediately-following instructions is minimized.
- In my case, the branch instruction involves whether the branch is a real branch (attached to another node) or a null branch:
- Suppose there are $n$ nodes.
1. Since this is a binary treem each node except the root has a from-branch:
- There are $n-1$ from branches -> there are $n-1$ real branches.
2. Since each node has two branches (left and right), there are $2n$ branches in total:
- The rest of the $2n - (n-1) = n+1$ branches are null branches.
3. Thus, we have proven that there are more null branches than real branches.
4. Conclusion, we should put the case where `left` or `right` equals `0` immediately after the branch instruction.
- I try the following code:
```c
.text
main:
la a0, n
jal traverse
# Exit program
li a7, 10
ecall
traverse:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
### DO THINGS
lw s0, 4, a0 # s0 <- M[n+4] = node->left
lw s1, 8, a0 # s1 <- M[n+8] = node->right
# BRANCH
bne s0, x0, left_branch # node->left != 0?
next_branch:
bne s1, x0, right_branch # node->right != 0
traverse_done:
# POP from stack
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret
# no
left_branch:
mv a0, s0
jal traverse
j next_branch
right_branch:
mv a0, s1
jal traverse
j traverse_done
```
As seen from the above revised code, the addition of two `j` commands doesn't seem to help with time saving that much:
* In the original version, if there are $n$ nodes, $(n+1)\times{2} = 2n + 2$ kills result.
* in the modified version, if there are $n$ nodes, $(n-1)\times{2}$ results from `bne` taking the branch and $(n-1)\times{2}$ more result from the `j` that is executed after the branch is taken. That is $4n-4$ in total. That's definitely more than the original if for all $n>=2$.
The following version might solve the extra jump problem, but will require chunks of duplicate code and may make it less readable. More memory will be needed to save it. Plus, only $4$ cycles can be saved from this optimization, which becomes increasingly less significant as the number of nodes go up:
```c
.text
main:
la a0, n
jal traverse
# Exit program
li a7, 10
ecall
traverse:
# PUSH
addi sp, sp, -12
sw ra, 0, sp
sw s0, 4, sp
sw s1, 8, sp
### DO THINGS
lw s0, 4, a0 # s0 <- M[n+4] = node->left
lw s1, 8, a0 # s1 <- M[n+8] = node->right
# BRANCH
bne s0, x0, left_branch # node->left != 0?
bne s1, x0, right_branch # node->right != 0
# POP from stack
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret
left_branch:
mv a0, s0
jal traverse
bne s1, x0, right_branch # node->right != 0
# POP from stack
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret
right_branch:
mv a0, s1
jal traverse
# POP from stack
lw ra, 0, sp
lw s0, 4, sp
lw s1, 8, sp
addi sp, sp, 12
ret
```
### Testing
Using [Dataset A](#Dataset-A), the number of cycles required for each versions are a follows:
1. Original `beq` version: 221 cycles.
2. `bne` version with two `j`s: 235 cycles.
3. `bne` version without `j`, with duplicate code: 217 cycles.