# Lab1: RV32I Simulator
## Defuse the Bomb
Given an array `code`, the length of the array `n`, and a key number `k`, adjust the array with the following rules:
1. If k > 0, replace the $i^{th}$ number with the sum of the next k numbers.
2. If k < 0, replace the $i^{th}$ number with the sum of the previous |k| numbers.
3. If k == 0, replace the $i^{th}$ number with 0.
Since `code` is circular, the next element of `code[n-1]` is `code[0]`, and the previous element of `code[0]` is `code[n-1]`.
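Before looking at the assembly, here is a minimal C sketch of these rules; the function and variable names are only illustrative and are not part of the lab skeleton.
```c
#include <stdio.h>

/* Hypothetical reference implementation of the rules above,
 * just to make the expected behaviour concrete. */
void decrypt(const int *code, int n, int k, int *out)
{
    for (int i = 0; i < n; ++i) {
        int sum = 0;
        if (k > 0) {                          /* sum of the next k numbers        */
            for (int j = 1; j <= k; ++j)
                sum += code[(i + j) % n];
        } else if (k < 0) {                   /* sum of the previous |k| numbers  */
            for (int j = 1; j <= -k; ++j)
                sum += code[(i - j + n) % n];
        }
        out[i] = sum;                         /* k == 0 leaves every entry at 0   */
    }
}

int main(void)
{
    int code[] = {5, 7, 1, 4};                /* same data as the .data section below */
    int out[4];
    decrypt(code, 4, 3, out);
    for (int i = 0; i < 4; ++i)
        printf("%d ", out[i]);                /* prints: 12 10 16 13 */
    printf("\n");
    return 0;
}
```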
## Assembly code
```asm
.data
arr1: .word 5,7,1,4 # a[4] = {5,7,1,4}
len: .word 16 # 4*4
target: .word 3
comma: .string " "
str: .string "The answer is "
str2: .string "No answer."
.text
main:
la a0 arr1 # load address of the array
lw a1 target # load the value of target
lw a2 len # load the value of len
add a2,a2,a0 # turn a2 into the address of the end of the array
addi a3,a2,-4 # a3 is the address of the last element in the array
jal ra, func
# Exit Program
li a7, 10
ecall
func:
mv s9 a0
beq a1,x0 Zero # if k == 0, goto Zero
blt a1,x0 small # if k < 0, goto small
large:
sub t0, a2,a0 # t0 is the number of times that the loop must run
addi t1,a0 0 # set t1 the address of a[0]
loop_large:
addi t0,t0 -4
mv s0,a1 # s0 = target, used to count the inner-loop iterations
mv s1,t1 # s1 = the address of the number needs to be added
addi s2,x0 0
loop_in_large:
addi s0,s0 -1
addi s1,s1 4 # s1 = address of the next element
blt s1 a2 less_than_end # if s1 is still before the end of the array, skip the wrap-around
add s1,x0,s9 # otherwise wrap around to the first element of the array
less_than_end:
lw s3,0(s1) # load the value at address s1
add s2,s2,s3 # add value in s1
bne s0,x0 loop_in_large
#print result
mv a0,s2
li a7, 1
ecall
la a0, comma
li a7, 4
ecall
addi t1,t1 4
bne t0,x0 loop_large
ret
small:
sub t0, a2, a0 # t0 is the number of times that the loop must run
addi t1,a0 0 # set t1 the address of a[0]
loopsmall:
addi t0,t0 -4
mv s0,a1 # s0 = target, used to count the inner-loop iterations
mv s1,t1 # s1 = the address of the number needs to be added
addi s2,x0 0
loop_in_small:
addi s0,s0,1 # s0 counts up toward zero, since the key is negative
addi s1,s1,-4 # s1 = address of the previous element
bge s1,s9 gre_than_arr # if s1 is still at or after the start of the array, skip the wrap-around
add s1,x0,a3 # if s1 is smaller than address of a,
# move it to the last element of the array.
gre_than_arr:
lw s3,0(s1) # load the value at address s1
add s2,s2,s3 # add value in s1
bne s0,x0 loop_in_small
#print result
mv a0,s2
li a7, 1
ecall
la a0, comma
li a7, 4
ecall
addi t1,t1 4
bne t0,x0 loopsmall
ret
Zero:
sub t0, a2,a0
# Print 0s
loop0:
mv a0,x0
li a7, 1
ecall
la a0, comma
li a7, 4
ecall
addi t0,t0, -4
bne t0,x0 loop0
ret
```
:::warning
:warning: Can you use fewer instructions?
:notes: jserv
:::
### Main idea
We use nested loops to get the answer. First, we compare the key `k` with zero and choose the proper rule. If `k` is zero, we just print `n` zeros. Otherwise, we use loops to compute the answer: the outer loop goes through every element, the inner loop adds up all the elements needed, and the outer loop prints the result.
```asm=
small:
sub t0, a2, a0 # t0 is the number of times that the loop must run
addi t1,a0 0 # set t1 the address of a[0]
loopsmall:
addi t0,t0 -4
mv s0,a1 # s0 = target, used to count the inner-loop iterations
mv s1,t1
addi s2,x0 0
loop_in_small:
addi s0,s0,1 # s0 counts up toward zero, since the key is negative
addi s1,s1,-4 # address of a[j]
bge s1,s9 gre_than_arr # if the address of a[j] is still at or after the start of the array
add s1,x0,a3 # if the address of a[j] is smaller than the address of a,
# move it to the last element of the array.
gre_than_arr:
lw s3,0(s1) # load a[j]
add s2,s2,s3 # add a[j]
bne s0,x0 loop_in_small
#print result
mv a0,s2
li a7, 1
ecall
la a0, comma
li a7, 4
ecall
addi t1,t1 4
bne t0,x0 loopsmall
ret
```
In C, we usually use an integer variable, such as `i`, to go through an array.
For example,
```cpp=
#include <stdio.h>

int main(void) {
    int a[4] = {1, 2, 3, 4};
    int i;
    for (i = 0; i < 4; ++i) {
        printf("%d ", a[i]);
    }
    return 0;
}
```
However, this index-based approach does not map directly to assembly. Instead, we store addresses of the array into two registers and, while running the loop, use these two registers the way `i` is used in the C code above.
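A pointer-based version of the same loop in C is closer to what the assembly actually does; this is only an illustrative sketch, not part of the lab code.
```c
#include <stdio.h>

int main(void)
{
    int a[4] = {1, 2, 3, 4};
    const int *p   = a;       /* plays the role of register t1 (current address)  */
    const int *end = a + 4;   /* plays the role of register a2 (one past the end) */

    while (p < end) {         /* compare addresses instead of an index i          */
        printf("%d ", *p);    /* like lw: load the value at the current address   */
        ++p;                  /* like addi t1,t1,4: advance to the next element   */
    }
    printf("\n");
    return 0;
}
```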
## Analysis
### RISC-V Pipeline
We choose a 5-stage pipeline processor to run the code.
"5-stage" means we divide the execution process into five parts. They are:
1. Instruction fetch (IF)
2. Instruction decode and register fetch (ID)
3. Execute (EX)
4. Memory access (MEM)
5. Register write back (WB)
The result of each stage is stored in a pipeline register between stages.
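As a rough mental model, the pipeline registers between the stages can be pictured as C structs like the following; the field names are illustrative assumptions, not Ripes' internal signal names.
```c
#include <stdint.h>
#include <stdio.h>

/* Rough mental model only: illustrative field names, not Ripes' signals. */
typedef struct { uint32_t pc, instr; }                              IF_ID;
typedef struct { uint32_t pc, rs1_val, rs2_val, imm; uint32_t rd; } ID_EX;
typedef struct { uint32_t alu_result, store_val; uint32_t rd; }     EX_MEM;
typedef struct { uint32_t load_or_alu_val; uint32_t rd; }           MEM_WB;

int main(void)
{
    /* On every clock edge, each stage's result is latched into the
     * pipeline register that feeds the next stage. */
    IF_ID if_id = { 0x00000000u, 0x00040413u };  /* a freshly fetched instruction */
    printf("IF/ID: pc=0x%08x instr=0x%08x\n",
           (unsigned)if_id.pc, (unsigned)if_id.instr);
    return 0;
}
```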
The figure below shows the whole pipeline of RISC-V.

#### Why do we use a pipeline?

When we divide the execution process into separate stages, each instruction can occupy one stage and the stages can work independently. For example, the figure above shows how a 5-stage pipeline works: at cycle 4, five instructions are being executed in the processor at the same time. In contrast, a single-cycle processor would be executing only one instruction at that time.
#### Stages of the pipeline
##### Instruction fetch (IF)
In this stage, the processor fetches the instruction from memory at the address stored in the PC. The PC is also increased by 4 to point to the next instruction if there is no branch or jump.

When a `nop` is inserted into the pipeline, the enable signal of the PC is deasserted, which stops the fetch of the next instruction. Take the figures below as an example: the enable signal turns red because the instructions `lw x12 0 12` and `add x12 x12 x10` cause a data hazard, which leads to a `nop` being inserted into the EX stage. As a result, the IF stage does not read the next instruction.


##### Instruction decode and register fetch (ID)
In this stage, the processor decodes the instruction and decides which registers will be used. In addition, a register may be written here with a value that was calculated, or loaded from memory, by a previous instruction.

The instruction 0x00040413 can be decoded into several fields:
* opcode : `0010011`, an I-type instruction
* Register : `0x10`
* Imm : `0x000` (this value will be sign-extended to 32 bits)
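For reference, here is a small C sketch of how the I-type fields can be pulled out of a 32-bit instruction word; it only illustrates the RV32I encoding layout and is not Ripes code.
```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t instr  = 0x00040413u;            /* the instruction discussed above */
    uint32_t opcode =  instr        & 0x7fu;  /* bits  6..0                      */
    uint32_t rd     = (instr >> 7)  & 0x1fu;  /* bits 11..7                      */
    uint32_t funct3 = (instr >> 12) & 0x07u;  /* bits 14..12                     */
    uint32_t rs1    = (instr >> 15) & 0x1fu;  /* bits 19..15                     */
    int32_t  imm    = (int32_t)instr >> 20;   /* bits 31..20, sign-extended      */

    printf("opcode=0x%02x rd=x%u funct3=%u rs1=x%u imm=%d\n",
           (unsigned)opcode, (unsigned)rd, (unsigned)funct3, (unsigned)rs1, (int)imm);
    return 0;
}
```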
The current PC and the next PC simply pass through this stage; they may be used in the following stages.
##### Execute (EX)
In this stage, the ALU of the processor performs the calculation. Take the figure below as an example.

This is also the stage where the address of the next instruction is calculated when a jump or branch instruction is executed.

The figure above shows how `jal` is executed. The ALU calculates the target address and sends it back to the PC in the IF stage.

The figure above shows how `beq` is executed. The **Branch taken** signal is asserted if the condition is true, and the ALU calculates the target address and sends it to the PC in the IF stage.
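A simplified C sketch of what the EX stage computes for `jal` and `beq` follows; the structure and names are assumptions for illustration only.
```c
#include <stdint.h>
#include <stdio.h>

/* Simplified view of the EX stage for jal/beq: the immediates are
 * assumed to be already decoded and sign-extended. */
typedef struct {
    int is_jal, is_beq;
    uint32_t pc, rs1_val, rs2_val;
    int32_t  imm;                     /* branch/jump offset */
} ExInput;

static uint32_t next_pc(ExInput in, uint32_t *link)
{
    if (in.is_jal) {
        *link = in.pc + 4;            /* return address written to rd (ra)  */
        return in.pc + in.imm;        /* jump target = PC + offset          */
    }
    if (in.is_beq && in.rs1_val == in.rs2_val)
        return in.pc + in.imm;        /* branch taken: target = PC + offset */
    return in.pc + 4;                 /* fall through                       */
}

int main(void)
{
    uint32_t link = 0;
    ExInput jal = { 1, 0, 0x00000010u, 0, 0, 0x20 };
    uint32_t target = next_pc(jal, &link);
    printf("jal: next pc = 0x%08x, ra = 0x%08x\n", (unsigned)target, (unsigned)link);
    return 0;
}
```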
##### Memory access (MEM)
In this stage, the processor stores data into memory or reads data from memory at the address calculated by the ALU. Take the figure below as an example: the processor accesses address `0x10000014`, which is calculated by the ALU, in data memory and gets the data `0x00000003`.

##### Register write back (WB)
In this stage, the data is written back to a register if the instruction requires it. For example, in the figure below, the value `0x00000009` is read from data memory and is ready to be sent back to the ID stage in order to be written into the register whose address is `0x12`.

The figure below shows how the data is written back to the register.

### Data Hazards
We are likely to face a data hazard when nearby instructions access the same register or data. For example,
```asm=
lw a2 len # load the value of len
add a2,a2,a0 # turn a2 into the end of array
```
The first instruction loads the data from memory into the register in the WB stage, but the second instruction needs the data right away. As a result, we add a `nop` between these two instructions.

In addition, we use forwarding to send the data to the EX stage. As a result, the `add` instruction can be executed immediately instead of adding another `nop`.
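The classic load-use check behind this stall can be sketched in C as follows; the signal names are illustrative assumptions, not Ripes' internals.
```c
#include <stdio.h>

/* Stall (insert a nop) when the instruction in EX is a load whose
 * destination register is a source of the instruction currently in ID. */
static int must_stall(int ex_is_load, unsigned ex_rd,
                      unsigned id_rs1, unsigned id_rs2)
{
    return ex_is_load && ex_rd != 0 && (ex_rd == id_rs1 || ex_rd == id_rs2);
}

int main(void)
{
    /* lw a2, len  followed by  add a2, a2, a0:  a2 = x12, a0 = x10 */
    printf("stall needed: %d\n", must_stall(1, 12, 12, 10)); /* prints 1 */
    return 0;
}
```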

### Control Hazards
Control hazards are likely to occur when **Control Transfer Instructions** are executed. For example,
```asm=
jal ra, func
# Exit Program
li a7, 10
ecall
```
The `jal` instruction jumps to `func`. As a result, the following instructions `li` and `ecall` should not be executed right away. However, these instructions are already in the pipeline. Therefore, we have to replace them with two `nop`s.

In the Ripes simulator, we can see that the two instructions are in the pipeline. The `jal` instruction is in the EX stage; the address of `func` has been calculated and sent to the IF stage.

Then the program jumps to the address of the label `func`. The processor recognizes that a *control hazard* has occurred and flushes the following two instructions, which must not be executed. Finally, the first instruction of `func` enters the pipeline.
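The flush can be sketched in C as replacing the two younger instructions with `nop`s; this is a simplified illustration, not Ripes' actual implementation.
```c
#include <stdio.h>

#define NOP 0x00000013u   /* addi x0, x0, 0 */

int main(void)
{
    /* The two instructions that entered the pipeline after jal
     * (li a7, 10 and ecall from the example above). */
    unsigned younger[2] = { 0x00a00893u, 0x00000073u };
    int branch_taken = 1;               /* jal resolved as taken in EX   */

    if (branch_taken) {                 /* flush: turn them into bubbles */
        younger[0] = NOP;
        younger[1] = NOP;
    }
    printf("after flush: 0x%08x 0x%08x\n", younger[0], younger[1]);
    return 0;
}
```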

### Forwarding
Because of the hazards we have just discussed, `nop` instructions must be added to the pipeline. Take the code below as an example: we need to add two additional `nop`s so that the `add` instruction can be executed correctly. The `add` instruction has to wait in the ID stage until the value of register `a2` is written back.
```asm=
lw a2 len # load the value of len
add a2,a2,a0 # turn a2 into the end of array
```
|Instruction|1|2|3|4|5|6|7|8|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|`lw a2 len`|IF|ID|EX|MEM|WB||||
|`add a2,a2,a0`||IF|ID|-|-|EX|MEM|WB|
However, if we forward the value from the WB stage to the EX stage, only one `nop` needs to be added to the pipeline.

The code below shows another example of forwarding. The `auipc` instruction completes a calculation and stores the result in register `x10`, but the next instruction `addi` needs this value right away. If we do not use forwarding, we have to add two `nop`s to the pipeline so that the `addi` instruction stays in the ID stage until the value is written back.
```asm=
auipc x10 0x10000
addi x10 x10 0
```
|Instruction|1|2|3|4|5|6|7|8|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|`auipc x10 0x10000`|IF|ID|EX|MEM|WB||||
|`addi x10 x10 0`||IF|ID|-|-|EX|MEM|WB|
If we add a forwarding path from the MEM stage back to the EX stage, the calculated value can be sent to the EX stage right away, and the next instruction is able to use it. As a result, we do not have to put any `nop` into the pipeline. The figure below shows how it works.
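The forwarding decision for one ALU operand can be sketched in C like this; it is a simplified illustration, and the signal names are assumptions.
```c
#include <stdio.h>

/* Prefer the newest value (EX/MEM), then MEM/WB, otherwise the
 * register file. Names are illustrative, not Ripes' internal signals. */
enum fwd { FROM_REG, FROM_EX_MEM, FROM_MEM_WB };

static enum fwd forward_a(unsigned rs1,
                          int exmem_writes, unsigned exmem_rd,
                          int memwb_writes, unsigned memwb_rd)
{
    if (exmem_writes && exmem_rd != 0 && exmem_rd == rs1) return FROM_EX_MEM;
    if (memwb_writes && memwb_rd != 0 && memwb_rd == rs1) return FROM_MEM_WB;
    return FROM_REG;
}

int main(void)
{
    /* auipc x10 ... followed by addi x10, x10, 0: rs1 = x10 = 10 */
    printf("%d\n", forward_a(10, 1, 10, 0, 0)); /* 1 -> forward from EX/MEM */
    return 0;
}
```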

### Cache Miss
When running a program, the processor has to fetch data from memory. However, the processor is much faster than main memory, so the memory cannot provide data in time. Therefore, we use a cache, which can be read and written much faster than main memory, to hold the data the processor needs. When the processor requests data that is not stored in the cache, a cache miss occurs: the processor has to fetch the data from memory and place it in the cache, which takes much longer than a cache access. Therefore, the cache miss rate plays an important role in the performance of a processor.
In the Ripes simulator, the address of a piece of data or an instruction is divided into four parts: tag, index, block, and offset. The processor uses these four parts to locate the data it needs in the cache.

Take `0x10000010` for example; it can be divided into four parts:
1. Tag : `00010000000000000000000`
2. Index : `00001`
3. Block : `00`
4. Offset : `00`
The offset is always `00` here because the data are 4-byte words, so every access is word-aligned.
With the information above, the processor can quickly locate the data it needs in the cache.
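A small C sketch of this address split is shown below; the field widths are taken from the example above and are assumptions about this particular cache configuration.
```c
#include <stdint.h>
#include <stdio.h>

/* Split an address into tag / index / block / offset, assuming a
 * 2-bit byte offset, 2-bit block (word) index, and 5-bit line index,
 * as in the 0x10000010 example above. */
int main(void)
{
    uint32_t addr   = 0x10000010u;
    uint32_t offset =  addr        & 0x3u;   /* byte offset within a word  */
    uint32_t block  = (addr >> 2)  & 0x3u;   /* word within the cache line */
    uint32_t index  = (addr >> 4)  & 0x1fu;  /* which cache line (set)     */
    uint32_t tag    =  addr >> 9;            /* remaining upper bits       */

    printf("tag=0x%x index=0x%x block=0x%x offset=0x%x\n",
           (unsigned)tag, (unsigned)index, (unsigned)block, (unsigned)offset);
    return 0;
}
```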

The figure above shows what a **cache hit** looks like: the data is already in the cache, so the processor can get it immediately.
However, some data the processor needs is not in the cache. Therefore, we have to access memory, get the data, and put it in the cache. Usually we bring in a whole block of data instead of a single value, since we are likely to use the data stored near the value we need now.

The figure above shows that the processor needs the value stored at `0x10000014`, but there is nothing stored in the cache, which leads to a cache miss. Therefore, we have to access main memory, fetch all the data from `0x10000010` to `0x1000001c` (shown in the blue frame in the figure below), including the value we need (shown in the red frame in the figure below), and put it in the cache.

#### Data cache miss
In this program, data cache misses are mainly caused by loading elements of the array. If the array becomes larger, the cache hit rate increases, since three more values are brought in from memory whenever a cache miss occurs. The figure below shows this phenomenon.

#### Instruction cache miss
When running a program, the processor needs to load the instructions from main memory, which can cause instruction cache misses. Since the number of instructions in a single program is fixed, the more times a certain part of the code is executed, the higher the hit rate. For example, in my program the loop is executed more times when the array becomes larger. Therefore, the hit rate grows, as the figure below shows.
