---
tags : Computer_Architecture, csapp, cs61c, jserv, RISC-V
---
# Computer Architecture HW3
::: spoiler Assignment3
[Assignment3: SoftCPU](https://hackmd.io/@sysprog/2022-arch-homework3)
[Assignment3: SoftCPU_2021](https://hackmd.io/@sysprog/2021-arch-homework3)
[Lab3: srv32 - RISCV RV32IM Soft CPU](https://hackmd.io/@sysprog/S1Udn1Xtt)
[Long Pressed Name(assignment2)](https://hackmd.io/@qwe661234/BJ52GwGVi#Assignment2-RISC-V-Toolchain)
:::
## Requirement 1
I choose the [assignment2 of 陳彥甫](https://hackmd.io/@qwe661234/BJ52GwGVi#Assignment2-RISC-V-Toolchain), and the leetcode is [91. Decode Ways](https://leetcode.com/problems/decode-ways/). My motivation is that I want to practice more about the `string` or `decode` problem. I hope I can deal with this kinds of problem more simple and clear.
### How it work with verilator?
In the `~/srv32/sim` this folder, I find that in the makefile
```makefile!
...
$(TARGET):
$(TARGET_SIM) $(BFLAGS) -o $(TARGET) $(FILELIST)
mv sim_cc/sim .
...
```
And the verilator will translate the verilog into c++ language, then save it in sim_cc. Therefore we can test the CPU from verilog without actually make it in physics.
After it translate into C++, we can input our binary file and get the simulate result.
### assignment2 (Long Pressed Name)
Problem :
> Your friend is typing his name into a keyboard. Sometimes, when typing a character c, the key might get long pressed, and the character will be typed 1 or more times.
You examine the typed characters of the keyboard. Return True if it is possible that it was your friends name, with some characters (possibly none) being long pressed.
In this case, I will use the name `alex` as the example. As following is the origin assembly code.
:::spoiler Origin assembly code
```
.org 0
# Provide program starting address to linker
.global main
/* newlib system calls */
.set SYSEXIT, 93
.set SYSWRITE, 64
.data
addr_name: .string "alex"
addr_typed0: .string "aaleex"
addr_typed1: .string "aalleexx"
addr_typed2: .string "aalewx"
SuccessResult: .string "PASS\n"
.set success_result_size, .-SuccessResult
FailResult: .string "FAIL\n"
.set fail_result_size, .-FailResult
.text
main:
la t0, addr_name #Load "name" 1st address
la t1, addr_typed0 #Load "typed" 1st address
jal ra, isLongPressed #1st test-case
la t1, addr_typed1 #Load "typed" 1st address
jal ra, isLongPressed #2nd test-case
la t1, addr_typed2 #Load "typed" 1st address
jal ra, isLongPressed #3rd test-case
j End
isLongPressed:
addi sp, sp, -8 #Use 2 byte for saving register
sw ra, 0(sp) #1st byte is ra
sw t0, 4(sp) #2nd byte is t0
lb t2, 0(t0) #1st character of name
lb t3, 0(t1) #1st character of typed
bne t2, t3, Fail #1st character must the same
addi t0, t0, 1 #i++
addi t1, t1, 1 #j++
jal Loop #Call Loop
lw ra, 0(sp) #Recall ra
lw t0, 4(sp) #Recall t0
addi sp,sp, 8
ret
Loop: #1st while loop
lb t2, 0(t0) #Load name[i]
lb t3, 0(t1) #Load typed[j]
beq t2, x0, checkLastCharacter #If name[i]= NULL
If:
bne t2, t3, Else #If name[i]!=typed[j]
addi t0, t0, 1 #i++
addi t1, t1, 1 #j++
j Loop
Else:
lb t2, -1(t0) # Load name[i-1]
bne t2, t3,Fail # If typed[j] != name[i-1] then Fail
addi t1, t1, 1 # j++
j Loop
checkLastCharacter: #2nd while loop
lb t2, -1(t0) # Load last character of name
if2:
beq t3, x0, Pass # If typed[j] = NULL, then passed
bne t2, t3,Fail # If typed[j] != name[last] then Fail
addi t1, t1, 1 # j++
lb t3, 0(t1) # Load next character of typed
j if2
Pass:
li a7, SYSWRITE # "write" syscall
li a0, 1 # 1 = standard output (stdout)
la a1, SuccessResult # load address of hello string
li a2, success_result_size # length of hello string
ecall # invoke syscall to print the string
ret
Fail:
li a7, SYSWRITE # "write" syscall
li a0, 1 # 1 = standard output (stdout)
la a1, FailResult # load address of hello string
li a2, fail_result_size # length of hello string
ecall # invoke syscall to print the string
ret
End:
li a0, 0
li a7, SYSEXIT
ecall
```
:::
::: spoiler fixed makefile
```makefile
include ../common/Makefile.common
EXE = .elf
SRC = assignment2.s
CFLAGS += -L../common
LDFLAGS += -T ../common/default.ld
TARGET = assignment2
OUTPUT = $(TARGET)$(EXE)
.PHONY: all clean
all: $(TARGET)
$(TARGET): $(SRC)
$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) -g
$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
$(OBJCOPY) -O binary $(OUTPUT) memory.bin
$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
$(READELF) -a $(OUTPUT) > $(TARGET).symbol
clean:
$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin
```
:::
In order to run in srv32, I modify the system call and the exit part. The makefile also need to be modify, too. Here is my [learning note](https://hackmd.io/Jrr1J1YDR_CtGR46DYotiQ).
```
.data
addr_name: .string "alex"
addr_typed0: .string "aaleex"
addr_typed1: .string "aalleexx"
addr_typed2: .string "aalewx"
SuccessResult: .string "PASS\n"
FailResult: .string "FAIL\n"
test: .word 65
iformat: .string "%d"
.text
.global main
main:
# save ra
addi sp, sp, -4
sw ra, 0(sp)
la t0, addr_name #Load "name" 1st address
la t1, addr_typed0 #Load "typed" 1st address
jal ra, isLongPressed #1st test-case
la t1, addr_typed1 #Load "typed" 1st address
jal ra, isLongPressed #2nd test-case
la t1, addr_typed2 #Load "typed" 1st address
jal ra, isLongPressed #3rd test-case
j End
isLongPressed:
addi sp, sp, -8 #Use 2 byte for saving register
sw ra, 0(sp) #1st byte is ra
sw t0, 4(sp) #2nd byte is t0
lb t2, 0(t0) #1st character of name
lb t3, 0(t1) #1st character of typed
bne t2, t3, Fail #1st character must the same
addi t0, t0, 1 #i++
addi t1, t1, 1 #j++
jal Loop #Call Loop
finally:
lw ra, 0(sp) #Recall ra
lw t0, 4(sp) #Recall t0
addi sp,sp, 8
ret
Loop: #1st while loop
lb t2, 0(t0) #Load name[i]
lb t3, 0(t1) #Load typed[j]
beq t2, x0, checkLastCharacter #If name[i]= NULL
If:
bne t2, t3, Else #If name[i]!=typed[j]
addi t0, t0, 1 #i++
addi t1, t1, 1 #j++
j Loop
Else:
lb t2, -1(t0) # Load name[i-1]
bne t2, t3,Fail # If typed[j] != name[i-1] then Fail
addi t1, t1, 1 # j++
j Loop
checkLastCharacter: #2nd while loop
lb t2, -1(t0) # Load last character of name
if2:
beq t3, x0, Pass # If typed[j] = NULL, then passed
bne t2, t3,Fail # If typed[j] != name[last] then Fail
addi t1, t1, 1 # j++
lb t3, 0(t1) # Load next character of typed
j if2
Pass:
# modify for srv32
la a0, SuccessResult
call printf
j finally
#li a7, SYSWRITE # "write" syscall
#li a0, 1 # 1 = standard output (stdout)
#la a1, SuccessResult # load address of hello string
#li a2, success_result_size # length of hello string
#ecall # invoke syscall to print the string
#ret
Fail:
# modify
la a0, FailResult
call printf
j finally
#li a7, SYSWRITE # "write" syscall
#li a0, 1 # 1 = standard output (stdout)
#la a1, FailResult # load address of hello string
#li a2, fail_result_size # length of hello string
#ecall # invoke syscall to print the string
#ret
End:
# restore ra
lw ra, 0(sp)
addi sp, sp, 4
# exit
li a0, 0
ret
```
And the result in srv32 is
```makefile
hanchi@hanchi:~/srv32$ make debug=1 assignment2
make[1]: 進入目錄「/home/hanchi/srv32/sw」
make -C common
make[2]: 進入目錄「/home/hanchi/srv32/sw/common」
make[2]: 對「all」無需做任何事。
make[2]: 離開目錄「/home/hanchi/srv32/sw/common」
make[2]: 進入目錄「/home/hanchi/srv32/sw/assignment2」
riscv-none-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -misa-spec=2.2 -march=rv32im -nostartfiles -nostdlib -L../common -o assignment2.elf assignment2.s -lc -lm -lgcc -lsys -T ../common/default.ld -g
riscv-none-elf-objcopy -j .text -O binary assignment2.elf imem.bin
riscv-none-elf-objcopy -j .data -O binary assignment2.elf dmem.bin
riscv-none-elf-objcopy -O binary assignment2.elf memory.bin
riscv-none-elf-objdump -d assignment2.elf > assignment2.dis
riscv-none-elf-readelf -a assignment2.elf > assignment2.symbol
make[2]: 離開目錄「/home/hanchi/srv32/sw/assignment2」
make[1]: 離開目錄「/home/hanchi/srv32/sw」
make[1]: 進入目錄「/home/hanchi/srv32/sim」
PASS
PASS
FAIL
Excuting 3368 instructions, 4674 cycles, 1.387 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.029 s
Simulation cycles: 4685
Simulation speed : 0.161552 MHz
make[1]: 離開目錄「/home/hanchi/srv32/sim」
make[1]: 進入目錄「/home/hanchi/srv32/tools」
./rvsim --memsize 128 -l trace.log ../sw/assignment2/assignment2.elf
PASS
PASS
FAIL
Excuting 3368 instructions, 4674 cycles, 1.388 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.001 s
Simulation cycles: 4674
Simulation speed : 3.797 MHz
make[1]: 離開目錄「/home/hanchi/srv32/tools」
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
We can get the data table as following.
| Comapre | sim (verilog) | rvsim (c++) |
|:----------------- | ------------- | ----------- |
| Instructions | 3368 | 3368 |
| Cycles | 4674 | 4674 |
| CPI | 1.387 | 1.388 |
| Simulation Times | 0.029 s | 0.001 s |
| Simulation cycles | 4685 | 4674 |
| Simulation Speed | 0.161552 MHz | 3.797 MHz |
### Leetcode (91. Decode Ways)
Problem:
>A message containing letters from A-Z can be encoded into numbers using the following mapping:'A'->1, 'B'->2...'Z'->26. To decode an encoded message, all the digits must be grouped then mapped back into letters using the reverse of the mapping above (there may be multiple ways).In the end, we need to return the number of ways to decode.
As following is my C code, and I modify it into risc-v assembly code. In particular, **I have already confirmed that my program is executed normally, so I complicate my test data.**
:::spoiler C code
```c
#include <stdio.h>
int numDecodings(char *s)
{
volatile int res[2] = {1, 0}; // Avoid compiler skip this part
int previous = (*s) - '0';
if (previous == 0)
{
return 0;
}
int tmp = 0;
s++;
while (*s)
{
tmp = res[1];
if (previous * 10 + (*s) - '0' > 26)
{
res[1] = 0;
}
else
{
res[1] = res[0];
}
if (*s - '0')
{
res[0] += tmp;
}
else
{
res[0] = 0;
}
previous = *s - '0';
s++;
}
return res[0] + res[1];
}
int main(int argc, char *argv[])
{
char test0[23] = "1111111111111111111111";
char test1[23] = "2222222222222222222222";
char test2[23] = "1231231231231231231231";
char test3[23] = "1110612312111221232132";
printf("The result of test0 is %d\n", numDecodings(test0));
printf("The result of test1 is %d\n", numDecodings(test1));
printf("The result of test2 is %d\n", numDecodings(test2));
printf("The result of test3 is %d\n", numDecodings(test3));
return 0;
}
```
:::
Here is my assembly code. The test data is the same with C code which is more complex than the Leetcode given.
```
.data
test0: .string "1111111111111111111111"
test1: .string "2222222222222222222222"
test2: .string "1231231231231231231231"
test3: .string "1110612312111221232132"
result0: .string "The result of test0 is "
# it should be output 2
result1: .string "The result of test1 is "
# it should be output 3
result2: .string "The result of test2 is "
# it should be output 0
result3: .string "The result of test3 is "
# it should be output 2
nextline: .string " \n"
iformat: .string "%d"
.text
.global main
main:
addi sp, sp, -4 # save ra
sw ra, 0(sp)
la a0, result0 # printf result0
call printf
la s0, test0 # go test0
jal ra, numDecodings
la a0, result1 # printf result1
call printf
la s0, test1 # go test1
jal ra, numDecodings
la a0, result2 # printf result2
call printf
la s0, test2 # go test2
jal ra, numDecodings
la a0, result3 # printf result3
call printf
la s0, test3 # go test3
jal ra, numDecodings
finally:
lw ra, 0(sp) # exit program
addi sp, sp, 4
li a0, 0
ret
numDecodings:
addi sp, sp, -4
sw ra, 0(sp)
li t0, 1 # res[0] = 1;
li t1, 0 # res[1] = 0;
lb t2, 0(s0) # get (*s)
addi t3, t2, -48 # int previous = (*s) - '0';
beqz t3, return_0 # if(previous == 0) return 0;
addi t4, x0, 0 # int tmp = 0;
addi s0, s0, 1 # s++;
lb t2, 0(s0) # get new (*s)
while:
beqz t2, return_value # check enter while or not
mv t4, t1 # tmp = res[1];
slli s1, t3, 1 # s1 = previous * 2;
slli s2, t3, 3 # s2 = previous * 8;
add s1, s1, s2 # s1 = previous * 2 + previous * 8;
addi s2, t2, -48 # s2 = (*s) - '0';
addi t5, x0, 26 # t5 = 26;
add s1, s1, s2 # s1 = previous * 10 + (*s) - '0';
ble s1, t5, else1 # if(previous * 10 + (*s) - '0' <= 26) go to else1;
li t1, 0 # res[1] = 0;
middle:
beqz s2, else2 # if((*s) - '0') go to else2;
add t0, t0, t4 # res[0] += tmp;
reset:
mv t3, s2 # previous = (*s) - '0'
addi s0, s0, 1 # s++;
lb t2, 0(s0) # get new (*s)
j while # go back to while loop
else1:
mv t1, t0 # res[1] = res[0];
j middle # go back to middle
else2:
li t0, 0 # res[0] = 0
j reset # go back to reset
return_value:
la a0, iformat # set output format
add a1, t0, t1 # set a1 for output data
call printf # use printf ststem call
la a0, nextline # load '\n' to a0
call printf # use printf system call
lw ra, 0(sp)
addi sp, sp, 4
ret
return_0:
la a0, iformat # set output format
addi a1, x0, 0 # set a1 for output data
call printf # use printf system call
la a0, nextline # load ' \n' to a0
call printf # use printf system call
lw ra, 0(sp)
addi sp, sp, 4
ret
```
And the result in srv32 is
```makefile
hanchi@hanchi:~/srv32$ make debug=1 leetcode
make[1]: 進入目錄「/home/hanchi/srv32/sw」
make -C common
make[2]: 進入目錄「/home/hanchi/srv32/sw/common」
make[2]: 對「all」無需做任何事。
make[2]: 離開目錄「/home/hanchi/srv32/sw/common」
make[2]: 進入目錄「/home/hanchi/srv32/sw/leetcode」
riscv-none-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -misa-spec=2.2 -march=rv32im -nostartfiles -nostdlib -L../common -o leetcode.elf leetcode.s -lc -lm -lgcc -lsys -T ../common/default.ld
riscv-none-elf-objcopy -j .text -O binary leetcode.elf imem.bin
riscv-none-elf-objcopy -j .data -O binary leetcode.elf dmem.bin
riscv-none-elf-objcopy -O binary leetcode.elf memory.bin
riscv-none-elf-objdump -d leetcode.elf > leetcode.dis
riscv-none-elf-readelf -a leetcode.elf > leetcode.symbol
make[2]: 離開目錄「/home/hanchi/srv32/sw/leetcode」
make[1]: 離開目錄「/home/hanchi/srv32/sw」
make[1]: 進入目錄「/home/hanchi/srv32/sim」
The result of test0 is 28657
The result of test1 is 28657
The result of test2 is 2187
The result of test3 is 1602
Excuting 12542 instructions, 16758 cycles, 1.336 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.079 s
Simulation cycles: 16769
Simulation speed : 0.212266 MHz
make[1]: 離開目錄「/home/hanchi/srv32/sim」
make[1]: 進入目錄「/home/hanchi/srv32/tools」
./rvsim --memsize 128 -l trace.log ../sw/leetcode/leetcode.elf
The result of test0 is 28657
The result of test1 is 28657
The result of test2 is 2187
The result of test3 is 1602
Excuting 12542 instructions, 16758 cycles, 1.336 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.003 s
Simulation cycles: 16758
Simulation speed : 6.045 MHz
make[1]: 離開目錄「/home/hanchi/srv32/tools」
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
We can get the data table as following.
| Comapre | sim (verilog) | rvsim (c++) |
| ----------------- | ------------- | ----------- |
| Instructions | 12542 | 12542 |
| Cycles | 16758 | 16758 |
| CPI | 1.336 | 1.336 |
| Simulation Times | 0.079 s | 0.003 s |
| Simulation cycles | 16769 | 16758 |
| Simulation Speed | 0.212266 MHz | 6.045 MHz |
---
## Requirement 2
### How to use gtkwave & srv32
#### 3-stage

We can find that the instruction exection order is as following
```mermaid
graph TD;
fetch_pc-->if_pc;
if_pc-->ex_pc;
if_pc-->ex_flush;
ex_pc-->wb_pc;
ex_pc-->wb_flush;
```
#### NOP
I want to observer the NOP in my program, so I check some singals like `fetch_pc`, `if_pc`, `ex_pc`, `wb_pc`, `wb_nop`, `wb_nop_more` . I find that if there is branch instruction and the `ex_pc` executed the branch instruction, `fetch_pc` and `if_pc` will not delete their exist instruction. They will still execute the instruction and make `wb_nop` and `wb_nop_more` become **1**, that's why there is `wb_nop_more` this singal exist, and the **branch penalty is 2**! I will show th example as following.

#### Data Hazard
The disassembly code of hazard is used assignment2 (Long Pressed Name) for example.
Because of the srv32 is fully bypassing processor, there will not have data hazard. Therefore, all that's left is control hazard.
#### Control Hazard
Take the main function for example, this is the disassembly result of assignment2 (Long Pressed Name). We can see this line ` 54: 020000ef jal ra,74 <isLongPressed>`, and the GTKWave is the picture as following.
```
...
0000003c <main>:
3c: ffc10113 addi sp,sp,-4
40: 00112023 sw ra,0(sp)
44: 00021297 auipc t0,0x21
48: dec28293 addi t0,t0,-532 # 20e30 <addr_name>
4c: 00021317 auipc t1,0x21
50: de930313 addi t1,t1,-535 # 20e35 <addr_typed0>
54: 020000ef jal ra,74 <isLongPressed>
58: 00021317 auipc t1,0x21
5c: de430313 addi t1,t1,-540 # 20e3c <addr_typed1>
60: 014000ef jal ra,74 <isLongPressed>
64: 00021317 auipc t1,0x21
68: de130313 addi t1,t1,-543 # 20e45 <addr_typed2>
6c: 008000ef jal ra,74 <isLongPressed>
70: 09c0006f j 10c <End>
...
```

First, we can see the <font color=#ff00>RED Rectangle</font> , the `ex_pc` is excuting the `00000054 (jal ra,74 <isLongPressed>)`, and the next time, `fetch_pc` will fetch the insturction of `00000074` in <font color=#0020FF>Blue Rectangle</font>, but the `if_pc` and `ex_pc` will keep the same instruction, we can find that in <font color=#FFA700>Orange Rectangle</font>. This two error instruction will also be executed and **when it comes to write back, it will make `wb_nop` and `wb_nop_more` turn to 1, therefore it wouldn't be writen back with wrong data**.We also can find the wire `wb_flush` in GTKWave, it also means that the two instruction will be abandoned.Once again it proves that it's branch penalty is 2.
### Explain how to run
#### Use Leetcode (91. Decode Ways) for example

```
...
0000003c <main>:
3c: ffc10113 addi sp,sp,-4
40: 00112023 sw ra,0(sp)
44: 00021517 auipc a0,0x21
48: e4850513 addi a0,a0,-440 # 20e8c <result0>
4c: 170000ef jal ra,1bc <printf>
50: 00021417 auipc s0,0x21
54: de040413 addi s0,s0,-544 # 20e30 <test0>
58: 05c000ef jal ra,b4 <numDecodings>
5c: 00021517 auipc a0,0x21
60: e4850513 addi a0,a0,-440 # 20ea4 <result1>
64: 158000ef jal ra,1bc <printf>
68: 00021417 auipc s0,0x21
6c: ddf40413 addi s0,s0,-545 # 20e47 <test1>
70: 044000ef jal ra,b4 <numDecodings>
74: 00021517 auipc a0,0x21
78: e4850513 addi a0,a0,-440 # 20ebc <result2>
7c: 140000ef jal ra,1bc <printf>
80: 00021417 auipc s0,0x21
84: dde40413 addi s0,s0,-546 # 20e5e <test2>
88: 02c000ef jal ra,b4 <numDecodings>
8c: 00021517 auipc a0,0x21
90: e4850513 addi a0,a0,-440 # 20ed4 <result3>
94: 128000ef jal ra,1bc <printf>
98: 00021417 auipc s0,0x21
9c: ddd40413 addi s0,s0,-547 # 20e75 <test3>
a0: 014000ef jal ra,b4 <numDecodings>
...
```
##### main (0000003c)
We can cross comparison the gtkwave and leetcode.dis, we can find that the program enter the `main` function at `pc = 3c`, and we alse can figure out that `inst[ 31:0 ]` is `FC10113`, that is the machine code in hex.
##### sw (00000040)
We can find the `ra` = 38, and that when `wb_pc` execute `00000040` , we can find that `dmem_wdata` will also wirte the 38 to the memory. The `sp` is also form `00040000` to `0003FFFC` when `addi sp, sp ,-4` be executed. The `dmem_wready`will also be 1, `dmem_raddr` = `sp` = `0003FFFC`.
##### addi (00000048)
We can find that when the `addi` instruction be exectued, the `dmem_raddr` will be `00020e8c`, after the write back, So the `x10_a0` become `00020e8c`.
##### lw (000001cc)

```
...
000001bc <printf>:
1bc: fc010113 addi sp,sp,-64
1c0: 02c12423 sw a2,40(sp)
1c4: 02d12623 sw a3,44(sp)
1c8: 00020317 auipc t1,0x20
1cc: eb832303 lw t1,-328(t1) # 20080 <_impure_ptr>
1d0: 02b12223 sw a1,36(sp)
1d4: 02e12823 sw a4,48(sp)
1d8: 02f12a23 sw a5,52(sp)
1dc: 03012c23 sw a6,56(sp)
1e0: 03112e23 sw a7,60(sp)
1e4: 00832583 lw a1,8(t1)
1e8: 02410693 addi a3,sp,36
1ec: 00050613 mv a2,a0
1f0: 00030513 mv a0,t1
1f4: 00112e23 sw ra,28(sp)
1f8: 00d12623 sw a3,12(sp)
1fc: 010000ef jal ra,20c <_vfprintf_r>
200: 01c12083 lw ra,28(sp)
204: 04010113 addi sp,sp,64
208: 00008067 ret
...
```
In the gtkwave, the `ex_pc` stage is doing the instruction `000001cc`, therefore the `dmem_rready` is turn to 1. The `dmem_waddr` become `00020080`, which means the traget address.
---
## Requirement 3
### Improve the performance
Because of this is 3-stages CPU with fully bypassing, so there is control hazard and Load-use will cause NOP. Therefore, I will focus on reduce the branch penalty in my code, such as loop unrooling, and inline function(macro).
### assignment2 (Long Pressed Name)
In this assembly, I find that there are many loop and do the same work. Therefore, I use the macro first, then do the loop unrooling in the macro. Here is the code, and the performance is showed below.
```
.data
addr_name: .string "alex"
addr_typed0: .string "aaleex"
addr_typed1: .string "aalleexx"
addr_typed2: .string "aalewx"
SuccessResult: .string "PASS"
FailResult: .string "FAIL"
test: .word 65
iformat: .string "%d"
.text
.global main
.macro if2_inline
beq t3, x0, Pass # If typed[j] = NULL, then passed
bne t2, t3,Fail # If typed[j] != name[last] then Fail
addi t1, t1, 1 # j++
lb t3, 0(t1) # Load next character of typed
.endm
.macro if_inline
#Loop: #1st while loop
lb t2, 0(t0) #Load name[i]
lb t3, 0(t1) #Load typed[j]
beq t2, x0, checkLastCharacter #If name[i]= NULL
#If:
bne t2, t3, Else #If name[i]!=typed[j]
addi t0, t0, 1 #i++
addi t1, t1, 1 #j++
j If
.endm
.macro else_inline
#Else:
lb t2, -1(t0) # Load name[i-1]
bne t2, t3,Fail # If typed[j] != name[i-1] then Fail
addi t1, t1, 1 # j++
if_inline
.endm
main:
# save ra
addi sp, sp, -4
sw ra, 0(sp)
la t0, addr_name #Load "name" 1st address
la t1, addr_typed0 #Load "typed" 1st address
jal ra, isLongPressed #1st test-case
la t1, addr_typed1 #Load "typed" 1st address
jal ra, isLongPressed #2nd test-case
la t1, addr_typed2 #Load "typed" 1st address
jal ra, isLongPressed #3rd test-case
#j End
End:
# restore ra
lw ra, 0(sp)
addi sp, sp, 4
# exit
li a0, 0
ret
isLongPressed:
addi sp, sp, -8 #Use 2 byte for saving register
sw ra, 0(sp) #1st byte is ra
sw t0, 4(sp) #2nd byte is t0
lb t2, 0(t0) #1st character of name
lb t3, 0(t1) #1st character of typed
bne t2, t3, Fail #1st character must the same
addi t0, t0, 1 #i++
addi t1, t1, 1 #j++
# go to Loop
#Loop: #1st while loop
#lb t2, 0(t0) #Load name[i]
#lb t3, 0(t1) #Load typed[j]
#beq t2, x0, checkLastCharacter #If name[i]= NULL
If:
#bne t2, t3, Else #If name[i]!=typed[j]
#addi t0, t0, 1 #i++
#addi t1, t1, 1 #j++
#j Loop
if_inline
Else:
#lb t2, -1(t0) # Load name[i-1]
#bne t2, t3,Fail # If typed[j] != name[i-1] then Fail
#addi t1, t1, 1 # j++
#j Loop
else_inline
checkLastCharacter: #2nd while loop
lb t2, -1(t0) # Load last character of name
if2:
#beq t3, x0, Pass # If typed[j] = NULL, then passed
#bne t2, t3,Fail # If typed[j] != name[last] then Fail
#addi t1, t1, 1 # j++
#lb t3, 0(t1) # Load next character of typed
if2_inline
if2_inline
if2_inline
j if2
Pass:
# modify for srv32
la a0, SuccessResult
call puts
lw ra, 0(sp) #Recall ra
lw t0, 4(sp) #Recall t0
addi sp,sp, 8
ret
Fail:
# modify
la a0, FailResult
call puts
lw ra, 0(sp) #Recall ra
lw t0, 4(sp) #Recall t0
addi sp,sp, 8
ret
```
```makefile
hanchi@hanchi:~/srv32$ make debug=1 assignment2
make[1]: 進入目錄「/home/hanchi/srv32/sw」
make -C common
make[2]: 進入目錄「/home/hanchi/srv32/sw/common」
make[2]: 對「all」無需做任何事。
make[2]: 離開目錄「/home/hanchi/srv32/sw/common」
make[2]: 進入目錄「/home/hanchi/srv32/sw/assignment2」
riscv-none-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -misa-spec=2.2 -march=rv32im -nostartfiles -nostdlib -L../common -o assignment2.elf assignment2.s -lc -lm -lgcc -lsys -T ../common/default.ld -g
riscv-none-elf-objcopy -j .text -O binary assignment2.elf imem.bin
riscv-none-elf-objcopy -j .data -O binary assignment2.elf dmem.bin
riscv-none-elf-objcopy -O binary assignment2.elf memory.bin
riscv-none-elf-objdump -d assignment2.elf > assignment2.dis
riscv-none-elf-readelf -a assignment2.elf > assignment2.symbol
make[2]: 離開目錄「/home/hanchi/srv32/sw/assignment2」
make[1]: 離開目錄「/home/hanchi/srv32/sw」
make[1]: 進入目錄「/home/hanchi/srv32/sim」
PASS
PASS
FAIL
Excuting 2171 instructions, 2939 cycles, 1.353 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.02 s
Simulation cycles: 2950
Simulation speed : 0.1475 MHz
make[1]: 離開目錄「/home/hanchi/srv32/sim」
make[1]: 進入目錄「/home/hanchi/srv32/tools」
./rvsim --memsize 128 -l trace.log ../sw/assignment2/assignment2.elf
PASS
PASS
FAIL
Excuting 2171 instructions, 2939 cycles, 1.354 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.001 s
Simulation cycles: 2939
Simulation speed : 5.843 MHz
make[1]: 離開目錄「/home/hanchi/srv32/tools」
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
| Comapre | sim | rvsim | optimized sim | optimized rvsim |
|:----------------- | ------------ | --------- | ------------- | --------------- |
| Instructions | 3368 | 3368 | 2171 | 2171 |
| Cycles | 4674 | 4674 | 2939 | 2939 |
| CPI | 1.387 | 1.388 | 1.353 | 1.354 |
| Simulation Times | 0.029 s | 0.001 s | 0.02 s | 0.001 s |
| Simulation cycles | 4685 | 4674 | 2950 | 2939 |
| Simulation Speed | 0.161552 MHz | 3.797 MHz | 0.1475 MHz | 5.843 MHz |
In this table, we can find that there are at least reduce **30%** cycles in optimization.
### Leetcode (91. Decode Ways)
I also do the ssmae thing in leetcode.
```
.data
test0: .string "1111111111111111111111"
test1: .string "2222222222222222222222"
test2: .string "1231231231231231231231"
test3: .string "1110612312111221232132"
result0: .string "The result of test0 is %d\n"
# it should be output 2
result1: .string "The result of test1 is %d\n"
# it should be output 3
result2: .string "The result of test2 is %d\n"
# it should be output 0
result3: .string "The result of test3 is %d\n"
# it should be output 2
.text
.global main
.macro whileloop
#while:
beqz t2, return_value # check enter while or not
mv t4, t1 # tmp = res[1];
slli s1, t3, 1 # s1 = previous * 2;
slli s2, t3, 3 # s2 = previous * 8;
add s1, s1, s2 # s1 = previous * 2 + previous * 8;
addi s2, t2, -48 # s2 = (*s) - '0';
addi t5, x0, 26 # t5 = 26;
add s1, s1, s2 # s1 = previous * 10 + (*s) - '0';
ble s1, t5, else1 # if(previous * 10 + (*s) - '0' <= 26) go to else1;
li t1, 0 # res[1] = 0;
#middle:
beqz s2, else2 # if((*s) - '0') go to else2;
add t0, t0, t4 # res[0] += tmp;
#reset:
mv t3, s2 # previous = (*s) - '0'
addi s0, s0, 1 # s++;
lb t2, 0(s0) # get new (*s)
#j while # go back to while loop
# double while
beqz t2, return_value # check enter while or not
mv t4, t1 # tmp = res[1];
slli s1, t3, 1 # s1 = previous * 2;
slli s2, t3, 3 # s2 = previous * 8;
add s1, s1, s2 # s1 = previous * 2 + previous * 8;
addi s2, t2, -48 # s2 = (*s) - '0';
addi t5, x0, 26 # t5 = 26;
add s1, s1, s2 # s1 = previous * 10 + (*s) - '0';
ble s1, t5, else1 # if(previous * 10 + (*s) - '0' <= 26) go to else1;
li t1, 0 # res[1] = 0;
#middle:
beqz s2, else2 # if((*s) - '0') go to else2;
add t0, t0, t4 # res[0] += tmp;
#reset:
mv t3, s2 # previous = (*s) - '0'
addi s0, s0, 1 # s++;
lb t2, 0(s0) # get new (*s)
#j while # go back to while loop
.endm
main:
addi sp, sp, -4 # save ra
sw ra, 0(sp)
la s0, test0 # go test0
jal ra, numDecodings
la a0, result0
call printf
la s0, test1 # go test1
jal ra, numDecodings
la a0, result1
call printf
la s0, test2 # go test2
jal ra, numDecodings
la a0, result2
call printf
la s0, test3 # go test3
jal ra, numDecodings
la a0, result3
call printf
finally:
lw ra, 0(sp) # exit program
addi sp, sp, 4
li a0, 0
ret
numDecodings:
addi sp, sp, -4
sw ra, 0(sp)
li t0, 1 # res[0] = 1;
li t1, 0 # res[1] = 0;
lb t2, 0(s0) # get (*s)
addi t3, t2, -48 # int previous = (*s) - '0';
beqz t3, return_0 # if(previous == 0) return 0;
addi t4, x0, 0 # int tmp = 0;
addi s0, s0, 1 # s++;
lb t2, 0(s0) # get new (*s)
while:
whileloop
j while # go back to while loop
else1:
mv t1, t0 # res[1] = res[0];
#j middle # go back to middle
#middle:
beqz s2, else2 # if((*s) - '0') go to else2;
add t0, t0, t4 # res[0] += tmp;
#reset:
mv t3, s2 # previous = (*s) - '0'
addi s0, s0, 1 # s++;
lb t2, 0(s0) # get new (*s)
#j while # go back to while loop
whileloop
j while
else2:
li t0, 0 # res[0] = 0
#j reset
# reset:
mv t3, s2 # previous = (*s) - '0'
addi s0, s0, 1 # s++;
lb t2, 0(s0) # get new (*s)
#j while # go back to while loop
whileloop
j while
return_value:
add a1, t0, t1
lw ra, 0(sp)
addi sp, sp, 4
ret
return_0:
addi a1, x0, 0 # set a1 for output data
lw ra, 0(sp)
addi sp, sp, 4
ret
```
```makefile
hanchi@hanchi:~/srv32$ make debug=1 leetcode
make[1]: 進入目錄「/home/hanchi/srv32/sw」
make -C common
make[2]: 進入目錄「/home/hanchi/srv32/sw/common」
make[2]: 對「all」無需做任何事。
make[2]: 離開目錄「/home/hanchi/srv32/sw/common」
make[2]: 進入目錄「/home/hanchi/srv32/sw/leetcode」
riscv-none-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -misa-spec=2.2 -march=rv32im -nostartfiles -nostdlib -L../common -o leetcode.elf leetcode.s -lc -lm -lgcc -lsys -T ../common/default.ld
riscv-none-elf-objcopy -j .text -O binary leetcode.elf imem.bin
riscv-none-elf-objcopy -j .data -O binary leetcode.elf dmem.bin
riscv-none-elf-objcopy -O binary leetcode.elf memory.bin
riscv-none-elf-objdump -d leetcode.elf > leetcode.dis
riscv-none-elf-readelf -a leetcode.elf > leetcode.symbol
make[2]: 離開目錄「/home/hanchi/srv32/sw/leetcode」
make[1]: 離開目錄「/home/hanchi/srv32/sw」
make[1]: 進入目錄「/home/hanchi/srv32/sim」
The result of test0 is 28657
The result of test1 is 28657
The result of test2 is 2187
The result of test3 is 1602
Excuting 10492 instructions, 13902 cycles, 1.325 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.066 s
Simulation cycles: 13913
Simulation speed : 0.210803 MHz
make[1]: 離開目錄「/home/hanchi/srv32/sim」
make[1]: 進入目錄「/home/hanchi/srv32/tools」
./rvsim --memsize 128 -l trace.log ../sw/leetcode/leetcode.elf
The result of test0 is 28657
The result of test1 is 28657
The result of test2 is 2187
The result of test3 is 1602
Excuting 10492 instructions, 13902 cycles, 1.325 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.003 s
Simulation cycles: 13902
Simulation speed : 4.748 MHz
make[1]: 離開目錄「/home/hanchi/srv32/tools」
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
| Comapre | sim | rvsim | optimized sim | optimized rvsim |
|:----------------- | ------------ | --------- | ------------- | --------------- |
| Instructions | 12542 | 12542 | 10492 | 10492 |
| Cycles | 16758 | 16758 | 13902 | 13902 |
| CPI | 1.336 | 1.336 | 1.325 | 1.325 |
| Simulation Times | 0.079 s | 0.003 s | 0.066 s | 0.003 s |
| Simulation cycles | 16769 | 16758 | 13913 | 13902 |
| Simulation Speed | 0.212266 MHz | 6.045 MHz | 0.210803 MHz | 4.748 MHz |
In this table, we can find that there are at least reduce **16%** cycles in optimization.
---
## Conclusion
In this project, I have learned a lot, such as `verilator`, `analyze verilog`, optimize the assembly on three-stage pipeline CPU, and so on. By using the srv32, we can get the comparision with the verilog simulator and cpp emulator. In the end, I made the 30% and 16% improvement in assembly
---
## Reference links
[srv32 學習紀錄](https://hackmd.io/Jrr1J1YDR_CtGR46DYotiQ)
[Assignment3: SoftCPU (Oscar shiang)](https://hackmd.io/@oscarshiang/arch_hw3#Improve-the-performance)
[Verilator](https://www.veripool.org/verilator/)