# Lab2: 9*9 multiplication table with RISC-V RV32I[MA] emulator
### -Os and -O3
There is no difference between the `-O3` and `-Os`, because this program is quite solid which means compiler must to make a three 4 bytes space for the temporary variables.
```shell
00010054 <_start>:
10054: ff010113 addi sp,sp,-16
10058: 00100793 li a5,1
1005c: 00f12223 sw a5,4(sp)
10060: 00412703 lw a4,4(sp)
10064: 00900793 li a5,9
10068: 04e7c863 blt a5,a4,100b8 <_start+0x64>
1006c: 00100613 li a2,1
10070: 00900693 li a3,9
10074: 00c12423 sw a2,8(sp)
10078: 00812783 lw a5,8(sp)
1007c: 02f6c463 blt a3,a5,100a4 <_start+0x50>
10080: 00412783 lw a5,4(sp)
10084: 00812703 lw a4,8(sp)
10088: 02e787b3 mul a5,a5,a4
1008c: 00f12623 sw a5,12(sp)
10090: 00812783 lw a5,8(sp)
10094: 00178793 addi a5,a5,1
10098: 00f12423 sw a5,8(sp)
1009c: 00812783 lw a5,8(sp)
100a0: fef6d0e3 bge a3,a5,10080 <_start+0x2c>
100a4: 00412783 lw a5,4(sp)
100a8: 00178793 addi a5,a5,1
100ac: 00f12223 sw a5,4(sp)
100b0: 00412783 lw a5,4(sp)
100b4: fcf6d0e3 bge a3,a5,10074 <_start+0x20>
100b8: 01010113 addi sp,sp,16
100bc: 00008067 ret
```
However, I found that the optimizer move the sp with 16 bytes which means it want to store the four variables. But this program only have three variables, it is better to allocate the space for variables more concisely. That is, changing the
```
addi sp, sp, -12
addi sp, sp, 12
```
It would not encounter the data hazard, we don't need to use the instruction schedualing manually.
### Branch Prediction
However, this program is written with nested for loop, it will need to check the condition whether is true or false.
To [Ripes](https://github.com/mortbopet/Ripes), the strategy of predict the branch is always will not taken. And the result of statics indicate the `True_counter` is 80% and `False_counter` is 20%.
Let explain how would this be 80% and 20%. Because of always-not-taken strategy, it will always fetch the next instruction next to the branch instruction
- N == Not Taken
- T == Taken
|i|1|2|3|4|5|6|7|8|9|10|
|---|---|---|---|---|---|---|---|---|---|---|
|Branch|i<10|i<10|i<10|i<10|i<10|i<10|i<10|i<10|i<10|i<10|
|Execution Pattern|N|N|N|N|N|N|N|N|N|T|
|Predicted branch|N|N|N|N|N|N|N|N|N|N|
|Correct or incorrect|C|C|C|C|C|C|C|C|C|I|
That is, the acuracy is 8/10 = 80%, and the sum of false positive and false negative is 20%.
### Loop unrolling
Because the nested seconed for-loop will always execute 9 times, it can be optimized by unrolling the for loop to 9 times.
```shell
addi sp,sp,-12
li a5,1
sw a5,4(sp)
lw a4,4(sp)
li a5,9
blt a5,a4,end
li a5,1 # nested loop for j = 1
lw a4,8(sp)
mul a6,a5,a4
sw a6,12(sp)
addi a5,a5,1 # nested loop for j = 2
mul a6,a5,a4
sw a6,12(sp)
addi a5,a5,1 # nested loop for j = 3
mul a6,a5,a4
sw a6,12(sp)
addi a5,a5,1 # nested loop for j = 4
mul a6,a5,a4
sw a6,12(sp)
addi a5,a5,1 # nested loop for j = 5
mul a6,a5,a4
sw a6,12(sp)
addi a5,a5,1 # nested loop for j = 6
mul a6,a5,a4
sw a6,12(sp)
addi a5,a5,1 # nested loop for j = 7
mul a6,a5,a4
sw a6,12(sp)
addi a5,a5,1 # nested loop for j = 8
mul a6,a5,a4
sw a6,12(sp)
addi a5,a5,1 # nested loop for j = 9
mul a6,a5,a4
sw a6,12(sp)
bge a3,a5,L1
lw a5,4(sp)
addi a5,a5,1
sw a5,4(sp)
lw a5,4(sp)
bge a3,a5,10074
addi sp,sp,12
```