# Assignment3: SoftCPU
## setup
1.install [xpack-riscv-none-embed-gcc](https://github.com/xpack-dev-tools/riscv-none-embed-gcc-xpack/)
```shell=
$ wget https://github.com/xpack-dev-tools/riscv-none-embed-gcc-xpack/releases/download/v10.2.0-1.1/xpack-riscv-none-embed-gcc-10.2.0-1.2-linux-x64.tar.gz
$ tar zxvf xpack-riscv-none-embed-gcc-10.2.0-1.2-linux-x64.tar.gz
$ export -a PATH /home/hong/Documents/course/Computer_Architecture/riscv-none-gcc/bin
```
2.install [srv32](https://github.com/sysprog21/srv32)
```shell=
$ git clone https://github.com/sysprog21/srv32
```
3.install [GTKWave](http://gtkwave.sourceforge.net/)
```shell=
$ sudo apt install gtkwave
```
4.put .c file and make file in floder and modify Makefile
```shell=
mkdir maxSubArray
cp ../hello/Makefile ./
vi maxSubArray
vi Makefile
```
5. run
```shell=
cd /home/sean/Documents/srv32
make -s maxSubArray
```
## start
homework1 : [maxSubArray](https://leetcode.com/problems/maximum-subarray/submissions/)
origin c code:
```c=
#include<stdio.h>
int maxSubArray(int* nums, int numsSize);
int main(){
volatile int a[9]={-2,1,-3,4,-1,2,1,-5,4};
int len1=sizeof(a)/sizeof(int);
int best,i;
best=maxSubArray(a,len1);
printf("best= %d\n",best);
}
int maxSubArray(int* nums, int numsSize)
{
int i;
int best = 0x80000000;
int currsum = 0;
for (i = 0; i < numsSize; i++) {
if(currsum<0){
currsum =nums[i];
}else{
currsum = currsum + nums[i];
}
if(best<currsum)
best=currsum;
}
return best;
}
```
### make up maxSubArray
```
sean@sean:~/Documents/srv32$ make -s maxSubArray
best= 6
Excuting 1487 instructions, 1929 cycles, 1.297 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.011 s
Simulation cycles: 1940
Simulation speed : 0.176364 MHz
best= 6
Excuting 1487 instructions, 1929 cycles, 1.297 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 1929
Simulation speed : 5.575 MHz
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
## hazard
### Data Hazard
```srv32``` support full forwarding, so that RAW data hazard will not need any stall.
### Load-use data hazard
In 5 stage pipeline,there need one stall for Load-use data hazard,because the second instruction's EX stage need to wait the first instruction's MEM to forward,but in ```srv32``` there is not need any stall ,beacuse D-MEM is in WB stage and just need to fordward from WB to EX stage.
### Control Hazard
there are totla 15 times branch from 00000028 to 0000000020,you can see that will flush two instruction after 00000028 WB stage,so it will look like 2 stalls.


## optimized code
At first, I didn't notice that the compiler would automatically figure out the answer and assign it to the register in main, so when I was doing optimization no matter how the program changed, the number of instructions and cycles will not change.
Later, I found this situation in "maxSubArray.dis", so I tried to change the input array type from "int" to "volatile int", so that it can actually run subroutines instead of being calculated by the compiler.I use loop unrolling by hand,to optimal the code,because the branch predict is "not taken",so the control hazard will decrease.
before i change the array type from "int" to "volatile int",the compilier will count the answer and assign to register(9C):

after change the array type,the main function will be able to call the subroutine normally(178).

```c=
#include<stdio.h>
int maxSubArray(int* nums, int numsSize);
int main(){
volatile int a[9]={-2,1,-3,4,-1,2,1,-5,4};
int len1=sizeof(a)/sizeof(int);
int best;
best=maxSubArray(a,len1);
printf("best=%d\n",best);
}
int maxSubArray(int* nums, int numsSize)
{
int i=0;
int best = 0x80000000;
int currsum = 0;
if(i>=numsSize)
goto ret;
if(currsum<0){
currsum =nums[i];
}else{
currsum = currsum + nums[i];
}
if(best<currsum)
best=currsum;
i++;
if(i>=numsSize)
goto ret;
if(currsum<0){
currsum =nums[i];
}else{
currsum = currsum + nums[i];
}
if(best<currsum)
best=currsum;
i++;
if(i>=numsSize)
goto ret;
if(currsum<0){
currsum =nums[i];
}else{
currsum = currsum + nums[i];
}
if(best<currsum)
best=currsum;
i++;
if(i>=numsSize)
goto ret;
if(currsum<0){
currsum =nums[i];
}else{
currsum = currsum + nums[i];
}
if(best<currsum)
best=currsum;
i++;
if(i>=numsSize)
goto ret;
if(currsum<0){
currsum =nums[i];
}else{
currsum = currsum + nums[i];
}
if(best<currsum)
best=currsum;
i++;
if(i>=numsSize)
goto ret;
if(currsum<0){
currsum =nums[i];
}else{
currsum = currsum + nums[i];
}
if(best<currsum)
best=currsum;
i++;
if(i>=numsSize)
goto ret;
if(currsum<0){
currsum =nums[i];
}else{
currsum = currsum + nums[i];
}
if(best<currsum)
best=currsum;
i++;
ret:
return best;
}
```
```
sean@sean:~/Documents/srv32$ make -s maxSubArray ~~~~~^~~~
best=6
Excuting 1479 instructions, 1915 cycles, 1.294 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.013 s
Simulation cycles: 1926
Simulation speed : 0.148154 MHz
best=6
Excuting 1479 instructions, 1915 cycles, 1.295 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 1915
Simulation speed : 5.148 MHz
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
* Total instruction saved: 1487-1479 = 8.
* Total cycle saved: 1929 - 1926 = 3.