# [Assignment3: SoftCPU](https://hackmd.io/@sysprog/2021-arch-homework3)
contributed by <[chinghongfang](https://github.com/chinghongfang)>
###### tags: `RISC-V` `jserv`
## Before Everything
### Installlations
1. Get [xPack GNU RISC-V Embedded GCC](https://github.com/xpack-dev-tools/riscv-none-embed-gcc-xpack/).
```bash=
$ cd /tmp
$ wget https://github.com/xpack-dev-tools/riscv-none-embed-gcc-xpack/releases/download/v10.2.0-1.1/xpack-riscv-none-embed-gcc-10.2.0-1.1-linux-x64.tar.gz
$ tar zxvf xpack-riscv-none-embed-gcc-10.2.0-1.1-linux-x64.tar.gz
$ cp -af xpack-riscv-none-embed-gcc-10.2.0-1.1 $HOME/riscv-none-embed-gcc
# Set path (example on fish)
set -a PATH /home/hong/Documents/course/Computer_Architecture/riscv-none-gcc/bin
```
2. Get [srv32](https://github.com/sysprog21/srv32).
```bash=
git clone https://github.com/sysprog21/srv32
```
3. Get [gtkwave](http://gtkwave.sourceforge.net).
(Find your version on the website.)
4. Modify the makefile
```bash=
# In srv32/sw/common directory
vim Makefile.common
# Search for CFLAGS, and change if needed
```
5. Put your source in srv32/sw
```bash=
# In srv32/
mkdir sw/hw3
# Copy existing make file to your directory
cp sw/hello/Makefile sw/hw3/
# Write down your code
vim sw/hw3/hw3.c
# Run it
make hw3
```
### Workflow

---
## Start
### Analysis srv32 CPU
srv32 CPU architecture is a **S**imple **r**isc-**v** **32**-bit CPU. By the author of this CPU, this CPU has only 3 pipeline stages.
#### Datapath and Pipeline
This picture is drawn according to the file `srv32/rtl/riscv.v`.
There are two parts that is out of the file. They are **instruction memory** and **data memory**. (By observing the wave, I think that they are positive edge trigger.)

1. Start from the `fetch_pc` register. It is the register that store the program counter and fetch instruction from memory.
2. `Instruction Memory`:
As the clk positive edge triggerred, it output the instruction the last `fetch_pc` want.
4. `ex_reg`:
Get register data on positive edge triggerred.
5. `wb_reg`:
Get executed result. Also the memory data.
The executed result may forward to ALU.
6. `fetch_pc`, `Register Files`:
Change the pc, flush the ex_reg, flush the wb_reg **if branch taken**.
Write date to register files.
The memory data may forward to ALU.
#### Hazzard
No data hazzard. Eliminate by full fowarding.
By the pipeline, the **memory data** read by last instruction can forward to EX stage.
#### Stall
Branches cause stalls.
(To know when the stall happened, observe the `ex_flush` wire and `wb_flush` wire in the wave.fst file. They are the **signals** to clear the pipeline register)
By the pipeline, the branching result is determined at **EX stage**. So we may fetch **wrong inst**ructions. We can flush the wrong instructions and make it looks like a stall.
By the pipeline, branch taken will cause **2 stalls**.
---
### [Min Patches](https://github.com/chinghongfang/ComputerArchitecture/blob/main/hw1/patchingArray.c)
By reading disassembled file, the optimizer inline the function.
Original function:
```c=
int minPatches(const int *nums, int numsSize, int n){
unsigned int m = 1, res = 0, i = 0;
while (m <= n)
m += (i < numsSize && nums[i] <= m) ? nums[i++] : (res++, m);
return res;
}
```
```
./rvsim --memsize 128 -l trace.log ../sw/hw/hw.elf
29
Excuting 1649 instructions, 2135 cycles, 1.295 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 2135
Simulation speed : 4.886 MHz
```
To lower down the simulation cycles, we start from branch. Because only when the branch not taken takes stalls.
#### Find out branches

(Around 20 flushes looping)
We found that there are **many branch prediction failed** in assembly code 124.
And 11C~124 is `m+=(res++, m);`
```asm=
11c: 00179793 slli a5,a5,0x1
120: 00158593 addi a1,a1,1
124: fef77ce3 bgeu a4,a5,11c <main+0x70>
```
Many branch prediction **failed** here.
#### Optimize
So I expand this section as [following](https://github.com/chinghongfang/ComputerArchitecture/blob/main/hw1/reduceBranch.c). (I know that this loop execute **at most 31 times**. So I can expand 31 times by hand.)
```c=
int minPatches(const int *nums, int numsSize, int n){
unsigned int m = 1, res = 0, i = 0;
//while (m <= n)
//m += (i < numsSize && nums[i] > m) ? nums[i++] : (res++, m);
if (m > n) return 0;
select_add:
if (i >= numsSize) goto only_plus_m;
if (nums[i] > m) goto plus_m;
m += nums[i++];
if (m > n) goto ret;
goto select_add;
plus_m:
m <<= 1;
++res;
if (m > n) goto ret;
goto select_add;
only_plus_m:
m <<= 1;
++res;
if (m > n) goto ret;
m <<= 1;
++res;
if (m > n) goto ret;
m <<= 1;
++res;
if (m > n) goto ret;
// ...
ret:
return res;
}
```
Execution wave:

(10 flushes in total)
Result
```
./rvsim --memsize 128 -l trace.log ../sw/hw/hw.elf
29
Excuting 1625 instructions, 2065 cycles, 1.271 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 2065
Simulation speed : 4.791 MHz
```
We reduced 2135-2065 = 70 cycles.