# Assignment3: SoftCPU
###### tags: `2021_CS Computer Architecture`
## Modify a customized Makefile for building C and assembly source files
### Building C source files
After building the environment of [srv32 - RISCV RV32IM Soft CPU](https://hackmd.io/@sysprog/S1Udn1Xtt), teacher demonstrate compile & execute `qsort.c` that is in `/srv32/sw/qsort` directory.
#### **Step1.**
In the directory of `~/srv32/sw`, make a directory - `RomanToInteger`. Then, build a file - `RomanToInteger.c` which is comes from [assignment1](https://hackmd.io/@9s5oiDBGQVmyEWDpvMhHtw/BkpBqJNHt).
```
$ cd ~/srv32/sw
$ mkdir RomanToInteger
```
RomanToInteger.c :
```c=
#include <stdio.h>
int transINT(char c)
{
switch(c)
{
case 'I':
return 1;
break;
case 'V':
return 5;
break;
case 'X':
return 10;
break;
case 'L':
return 50;
break;
case 'C':
return 100;
break;
case 'D':
return 500;
break;
default: //'M'
return 1000;
break;
}
}
int romanToInt(char * s){
int i;
int n_size = 0;
int sum;
int cur, pre;
//strSize
i = 0;
while(s[i] != '\0')
i++;
n_size = i;
cur = transINT(*(s + n_size - 1));
sum = cur;
//the rightest element n[n_size-1] must be added in sum
//if sum = 0, it would be out of bound in forloop
for(i = (n_size-1); i > 0 ; i--)
{
pre = transINT(*(s+i-1));
if(cur <= pre)
sum += pre;
else
sum -= pre;
cur = pre;
}
return sum;
}
int main()
{
char s[] = "MCMXCIV";
int number;
printf("%d\n", number);
//system("PAUSE");
return 0;
}
```
---
*Some note for modify C :*
- Remember to add new line `\n`, otherwise it can't print the result.
---
#### **Step2.**
Clone `Makefile` from `qsort` to `RomanToInteger`.
```
$ cp qsort/Makefile RomanToInteger/
```
#### **Step3.**
Modify **SRC** & **TARGET** which is from `Makefile` in `RomanToInteger` directory.
*SRC = source file
TARGET = the directory include the source file*
Makefile (for building C source files) :
```cmake=
include ../common/Makefile.common
EXE = .elf
SRC = RomanToInteger.c
CFLAGS += -L../common
LDFLAGS += -T ../common/default.ld
TARGET = RomanToInteger
OUTPUT = $(TARGET)$(EXE)
.PHONY: all clean
all: $(TARGET)
$(TARGET): $(SRC)
$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS)
$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
$(OBJCOPY) -O binary $(OUTPUT) memory.bin
$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
$(READELF) -a $(OUTPUT) > $(TARGET).symbol
clean:
$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin
```

At this step, it would has these two files in `~/srv32/sw/RomanToInteger` directory:
```
Makefile RomanToInteger.c
```
### Result - C
#### **Step4.**
Make `RomanToInteger` in `~/srv32` directory.
```
$ cd ~/srv32
$ make RomanToInteger
```
There is the result and information below.
RTL simulation:
```
1994
Excuting 1918 instructions, 2450 cycles, 1.277 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.093 s
Simulation cycles: 2461
Simulation speed : 0.0264624 MHz
```
ISS simulation:
```
1994
Excuting 1918 instructions, 2450 cycles, 1.277 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.004 s
Simulation cycles: 2450
Simulation speed : 0.607 MHz
=== Simulation passed ===
```
After making RomanToInteger, it would has these files in `~/srv32/sw/RomanToInteger` directory:
```
dmem.bin Makefile RomanToInteger.c RomanToInteger.elf
imem.bin memory.bin RomanToInteger.dis RomanToInteger.symbol
```
#### Remove generated files
According to the Makefile
```c=
clean:
$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin
```
It is easy to remove the generated files in the `TARGET` directory (`RomanToInteger` is `TARGET` here).
```
$ cd ~/srv32/sw/RomanToInteger
$ make clean
```
### Building assembly source files
Following the steps above, modify the Makefile and assembly source file.
Makefile (for building assembly source files) :
```cmake=
include ../common/Makefile.common
EXE = .elf
SRC = RomanToInteger.s
CFLAGS += -L../common
LDFLAGS += -T ../common/default.ld
TARGET = RomanToInteger
OUTPUT = $(TARGET)$(EXE)
.PHONY: all clean
all: $(TARGET)
$(TARGET): $(SRC)
$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) -g
$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
$(OBJCOPY) -O binary $(OUTPUT) memory.bin
$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
$(READELF) -a $(OUTPUT) > $(TARGET).symbol
clean:
$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin
```
RomanToInteger.s :
```asm=
.data
iformat: .string "%d\n"
str: .byte 77, 67, 77, 88, 67, 73, 86 #MCMXCIV
#s0 = str addr
#s1 = str[s0], str data bit from s0
#a3 = sum = return value from [romantoInt], it is result to whole program
#a2 = return value from [transInt]
#s3 = n_size, how many elements in str
#t0 is alaways i
.text
.global main
main:
addi sp, sp, -4
sw ra, 0(sp)
la s0, str #load 1st addr of str
#romantoInt
call romantoIntnano
#print result
la a0, iformat
add a1, a3, x0
call printf
# restore ra
lw ra, 0(sp)
addi sp, sp, 4
#end of code
ret
romantoInt:
addi sp, sp, -4
sw ra, 0(sp)
#get str size
call strSize
addi s3, t0, 0 #store n_size t0 to s3
#compare the element in str
call Compare
lw ra, 0(sp)
addi sp, sp, 4
ret
strSize:
addi sp, sp, -4
sw ra, 0(sp)
li t0, 0 #i(t0) = 0
strSize_loop:
lb s1, 0(s0)
beq s1, x0, zero
addi s0, s0, 1 #(t1 = n_size), n_size++
addi t0, t0, 1 #n_size++
j strSize_loop
zero:
lw ra, 0(sp)
addi sp, sp, 4
ret
Compare:
#t2 = cur / t3 = pre / t0 = i / sum = a3
#do sum = transINT(s[n_size-1]);
addi sp, sp, -4
sw ra, 0(sp)
addi s0, s0, -1 #through process [strSize], s0 = (str addr + n_size), s0 has to go back to (str addr + n_size - 1)
lb s1, 0(s0) #s1 = str[s0] = str[str addr + n_size - 1]
call transInt
addi a3, a2, 0 #a3 = sum, a2 = return value from [transINT]
addi t2, a2, 0 #t2 = cur
addi t0, s3, -1 #initialize i, i = n_size - 1
Loop:
addi t0, t0, -1 #i--
addi s0, s0, -1
lb s1, 0(s0) #s1 = str[n_size - i]
call transInt
addi t3, a2, 0 #t3 = pre
blt t3, t2, Pre_Lthan_Cur #if pre < cur
Pre_BEthan_Cur:
add a3, a3, t3 #else, sum += pre;
addi t2, t3, 0 #cur = pre
bne t0, x0, Loop #if i != 0, go to [Loop]
lw ra, 0(sp)
addi sp, sp, 4
ret
Pre_Lthan_Cur:
sub a3, a3, t3 #if pre < cur,sum -= pre;
addi t2, t3, 0 #cur = pre
bne t0, x0, Loop #if i != 0, go to [Loop]
lw ra, 0(sp)
addi sp, sp, 4
ret
transInt: #return = a2 / t1 = temp
addi sp, sp, -4
sw ra, 0(sp)
addi t1, x0, 73 #I = 73
beq s1, t1, I
addi t1, x0, 86 #V = 86
beq s1, t1, V
addi t1, x0, 88 #X = 88
beq s1, t1, X
addi t1, x0, 76 #L = 76
beq s1, t1, L
addi t1, x0, 67 #C = 67
beq s1, t1, C
addi t1, x0, 68 #D = 68
beq s1, t1, D
M: addi a2, x0, 1000 #else M
lw ra, 0(sp)
addi sp, sp, 4
ret
I: addi a2, x0, 1
lw ra, 0(sp)
addi sp, sp, 4
ret
V: addi a2, x0, 5
lw ra, 0(sp)
addi sp, sp, 4
ret
X: addi a2, x0, 10
lw ra, 0(sp)
addi sp, sp, 4
ret
L: addi a2, x0, 50
lw ra, 0(sp)
addi sp, sp, 4
ret
C: addi a2, x0, 100
lw ra, 0(sp)
addi sp, sp, 4
ret
D: addi a2, x0, 500
lw ra, 0(sp)
addi sp, sp, 4
ret
```
---
*Some note for modify assembly :*
- `printf`
Refer to [this](https://hackmd.io/@oscarshiang/arch_hw3#Port-count_bits-into-srv32), I use newlib `printf` which provide `%d` as format string to output decimal numbers to print the result. Remember to save the value that you want to preserve before call printf, otherwise the value may be flushed.
```
.data
iformat: .string "%d\n"
```
```
#print result
la a0, iformat
add a1, a3, x0 #a3 is the result that I want to print
call printf
```
- `.global main`
`.global main` makes `main` visible to the linker because other object files will use it. Without it, the symbol `main` is considered local to the object file it's assembled to, and will not appear after the assembly file is assembled.
The simplified error message without `.global main` :
```
(.reset+0x34): undefined reference to `main'
```
*After modify the content above, there were some bugs I found. According to the disassembly code `RomanToInteger.dis` and `wave.fst` which shows the waveform, I check these signals - PC and some critical registers to debug.*
*Problem 1. The code stuck in infinity loop*
In Ripes, the value of every registers would be initialize as 0. Thus, it is a logical error here, but the result is correct in Ripes. However, this made the code stuck in infinity loop in srv32.
(strSize line 51)
```
addi t0, t0, 0 #i = 0
```
*Solution:*
- It is necessary to initialize the register
The corrected code as below:
```
li t0, 0 #i = 0
```
*problem 2.*
In the end of iteration the string, t0 should be 7, and s1 should be 86, but I found that t0 is 9, and there are two character be read. These two characters is '%' and 'd'. Ummm...It is such a silly mistake.
```
.data
str: .byte 77, 67, 77, 88, 67, 73, 86 #MCMXCIV
iformat: .string "%d"
```

*Solution:*
- Exchange the sequence of `iformat` and `str`.
- Remember to add `\n` !!!!!!
The corrected code as below:
```
.data
iformat: .string "%d\n"
str: .byte 77, 67, 77, 88, 67, 73, 86 #MCMXCIV
```
---
### Result - assembly
RTL simulation:
```
1994
Excuting 1980 instructions, 2544 cycles, 1.284 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.097 s
Simulation cycles: 2555
Simulation speed : 0.0263402 MHz
```
ISS simulation:
```
1994
Excuting 1980 instructions, 2544 cycles, 1.285 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.004 s
Simulation cycles: 2544
Simulation speed : 0.686 MHz
```
## Check the waveform by GTKwave.
In **GTKWave**, open `wave.fst` in `~/srv32/sim` directory.

Then correspond to the disassembly code `RomanToInteger.dis` that is in `~/srv32/sw/RomanToInteger` directory.
RomanToInteger.dis
```RISCV=
...
0000003c <main>:
3c: ffc10113 addi sp,sp,-4
40: 00112023 sw ra,0(sp)
44: 00021417 auipc s0,0x21
48: e6440413 addi s0,s0,-412 # 20ea8 <str>
4c: 020000ef jal ra,6c <romantoInt>
50: 00021517 auipc a0,0x21
54: e5450513 addi a0,a0,-428 # 20ea4 <iformat>
58: 000685b3 add a1,a3,zero
5c: 1ac000ef jal ra,208 <printf>
60: 00012083 lw ra,0(sp)
64: 00410113 addi sp,sp,4
68: 00008067 ret
0000006c <romantoInt>:
6c: ffc10113 addi sp,sp,-4
70: 00112023 sw ra,0(sp)
74: 018000ef jal ra,8c <strSize>
78: 00028993 mv s3,t0
7c: 03c000ef jal ra,b8 <Compare>
80: 00012083 lw ra,0(sp)
84: 00410113 addi sp,sp,4
88: 00008067 ret
0000008c <strSize>:
8c: ffc10113 addi sp,sp,-4
90: 00112023 sw ra,0(sp)
94: 00000293 li t0,0
00000098 <strSize_loop>:
98: 00040483 lb s1,0(s0)
9c: 00048863 beqz s1,ac <zero>
a0: 00140413 addi s0,s0,1
a4: 00128293 addi t0,t0,1
a8: ff1ff06f j 98 <strSize_loop>
000000ac <zero>:
ac: 00012083 lw ra,0(sp)
b0: 00410113 addi sp,sp,4
b4: 00008067 ret
000000b8 <Compare>:
b8: ffc10113 addi sp,sp,-4
bc: 00112023 sw ra,0(sp)
c0: fff40413 addi s0,s0,-1
c4: 00040483 lb s1,0(s0)
c8: 058000ef jal ra,120 <transInt>
cc: 00060693 mv a3,a2
d0: 00060393 mv t2,a2
d4: fff98293 addi t0,s3,-1
000000d8 <Loop>:
d8: fff28293 addi t0,t0,-1
dc: fff40413 addi s0,s0,-1
e0: 00040483 lb s1,0(s0)
e4: 03c000ef jal ra,120 <transInt>
e8: 00060e13 mv t3,a2
ec: 007e4e63 blt t3,t2,108 <Pre_Lthan_Cur>
000000f0 <Pre_BEthan_Cur>:
f0: 01c686b3 add a3,a3,t3
f4: 000e0393 mv t2,t3
f8: fe0290e3 bnez t0,d8 <Loop>
fc: 00012083 lw ra,0(sp)
100: 00410113 addi sp,sp,4
104: 00008067 ret
00000108 <Pre_Lthan_Cur>:
108: 41c686b3 sub a3,a3,t3
10c: 000e0393 mv t2,t3
110: fc0294e3 bnez t0,d8 <Loop>
114: 00012083 lw ra,0(sp)
118: 00410113 addi sp,sp,4
11c: 00008067 ret
00000120 <transInt>:
120: ffc10113 addi sp,sp,-4
124: 00112023 sw ra,0(sp)
128: 04900313 li t1,73
12c: 02648e63 beq s1,t1,168 <I>
130: 05600313 li t1,86
134: 04648263 beq s1,t1,178 <V>
138: 05800313 li t1,88
13c: 04648663 beq s1,t1,188 <X>
140: 04c00313 li t1,76
144: 04648a63 beq s1,t1,198 <L>
148: 04300313 li t1,67
14c: 04648e63 beq s1,t1,1a8 <C>
150: 04400313 li t1,68
154: 06648263 beq s1,t1,1b8 <D>
00000158 <M>:
158: 3e800613 li a2,1000
15c: 00012083 lw ra,0(sp)
160: 00410113 addi sp,sp,4
164: 00008067 ret
00000168 <I>:
168: 00100613 li a2,1
16c: 00012083 lw ra,0(sp)
170: 00410113 addi sp,sp,4
174: 00008067 ret
00000178 <V>:
178: 00500613 li a2,5
17c: 00012083 lw ra,0(sp)
180: 00410113 addi sp,sp,4
184: 00008067 ret
00000188 <X>:
188: 00a00613 li a2,10
18c: 00012083 lw ra,0(sp)
190: 00410113 addi sp,sp,4
194: 00008067 ret
00000198 <L>:
198: 03200613 li a2,50
19c: 00012083 lw ra,0(sp)
1a0: 00410113 addi sp,sp,4
1a4: 00008067 ret
000001a8 <C>:
1a8: 06400613 li a2,100
1ac: 00012083 lw ra,0(sp)
1b0: 00410113 addi sp,sp,4
1b4: 00008067 ret
000001b8 <D>:
1b8: 1f400613 li a2,500
1bc: 00012083 lw ra,0(sp)
1c0: 00410113 addi sp,sp,4
1c4: 00008067 ret
...
```
### [srv32](https://hackmd.io/@sysprog/S1Udn1Xtt) pipeline architecture
`srv32` is a 3-stage pipeline architecture with IF/ID, EX, WB stages.

### Data hazard - It can be solved
**Load-use hazard**
In srv32 architecture, register file and D-MEM are in WB stage & full forwording, which can avoid load-use hazard.
Example - `beqz` instr. :
```
00000098 <strSize_loop>:
98: 00040483 lb s1,0(s0)
9c: 00048863 beqz s1,ac <zero>
```

- At clock 0: `lb s1,0(s0)` is in EX stage.
- At clock 1: `lb s1,0(s0)` is in WB stage, s1 will get the value at next clock. At the same time, forwording the input to `beqz s1,ac <zero>` which is in EX stage .
- At clock 2: s1 get the value.
### Control hazard - Branch penalty
When branch tacken, `ex_flush` would enable at next two clock.
According to `riscv.v` in `~/srv32/rtl/` directory, I found the signal `wb_memwr` is critical to determine whether write data to data-memory or not:
```verilog=
always @(posedge clk or negedge resetb) begin
if (!resetb)
wb_memwr <= 1'b0;
else if (ex_memwr && !ex_flush && !ex_st_align_excp)
wb_memwr <= 1'b1;
else if (wb_memwr && dmem_wvalid)
wb_memwr <= 1'b0;
end
```
Example - `j` instr. :
```
a8: ff1ff06f j 98 <strSize_loop>
000000ac <zero>:
ac: 00012083 lw ra,0(sp)
b0: 00410113 addi sp,sp,4
```

- At clock 1: `j 98 <strSize_loop>` is in EX stage, and `branch_tacken` set as 1.
- At clock 2 & 3: in these two clock, `ex_flush set` as 1, `wb_memwr` still as 0.
- At clock 4: `ex_pc` jump to 0x98.
The timing diagram of the above instruction sequence is as follow:
|Instruction|ex_pc|c0|c1|c2|c3|c4|c5|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|`j 98 <strSize_loop>`|0xA8|IF/ID|EX|WB||||
|`lw ra,0(sp)`|0xAC||NOP|NOP|NOP|||
|`addi sp,sp,4`|0xB0|||NOP|NOP|NOP||
|`exec if branch taken`|0x98||||IF/ID|EX|WB|
## Optimizations based on the pipeline design of srv32.
Based on the pipeline design of srv32, I can ignore any data hazard and just focus on **control hazard**. So, I just unroll the `strSize_loop` to decrease `j` and `beq` instructions.
Before:
```RISCV=
li t0, 0 #i = t0 = 0
strSize_loop:
lb s1, 0(s0)
beq s1, x0, zero
addi s0, s0, 1 #(t1 = n_size), n_size++
addi t0, t0, 1 #n_size++
j strSize_loop
zero:
lw ra, 0(sp)
addi sp, sp, 4
ret
```
After:
```RISCV=
li t0, 0 #i = t0 = 0
strSize_loop:
lb s1, 0(s0)
addi s0, s0, 1 #(t1 = n_size), n_size++
addi t0, t0, 1 #t0 = 1
lb s1, 0(s0)
addi s0, s0, 1
addi t0, t0, 1 #t0 = 2
lb s1, 0(s0)
addi s0, s0, 1
addi t0, t0, 1 #t0 = 3
lb s1, 0(s0)
addi s0, s0, 1
addi t0, t0, 1 #t0 = 4
lb s1, 0(s0)
addi s0, s0, 1
addi t0, t0, 1 #t0 = 5
lb s1, 0(s0)
addi s0, s0, 1
addi t0, t0, 1 #t0 = 6
lb s1, 0(s0)
addi s0, s0, 1
addi t0, t0, 1 #t0 = 7
lw ra, 0(sp)
addi sp, sp, 4
ret
```
### Result
Compare with the original assembly code, this method reduce 16 excuting instructions, and 32 cycles.
||instructions|cycles|CPI|
|:--|:--:|:--:|:--:|
|Before optimize - C|1918|2540|1.277|
|Before optimize - assembly|1980|2544|1.284|
|After optimize - assembly|1964|2512|1.279|
RTL simulation:
```
1994
Excuting 1964 instructions, 2512 cycles, 1.279 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.101 s
Simulation cycles: 2523
Simulation speed : 0.0249802 MHz
```
ISS simulation:
```
1994
Excuting 1964 instructions, 2512 cycles, 1.279 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.004 s
Simulation cycles: 2512
Simulation speed : 0.629 MHz
```
## Reference
- [srv32 - RISCV RV32IM Soft CPU](https://hackmd.io/@sysprog/S1Udn1Xtt)
- [Assignment3建置教學](https://hackmd.io/@eecheng/B1fEgnQwF)