# [Assignment3: SoftCPU](https://hackmd.io/@sysprog/2022-arch-homework3)
## Trobule shooting
**problem**: when I follow lab3 to build riscv-gnu-toolchain, my ubuntu crash
**how to slove**: I thing the reason of ubuntu crash is using to many thread so I change make -j paramater from nproc to 1, and it successfully build without crash
```diff
sudo apt install autoconf automake autotools-dev curl gawk git \
build-essential bison flex texinfo gperf libtool patchutils bc git \
libmpc-dev libmpfr-dev libgmp-dev gawk zlib1g-dev libexpat1-dev
git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain
mkdir -p build && cd build
../configure --prefix=/opt/riscv --with-isa-spec=20191213 \
--with-multilib-generator="rv32i-ilp32--;rv32im-ilp32--;rv32imac-ilp32--;rv32im_zicsr-ilp32--;rv32imac_zicsr-ilp32--;rv64imac-lp64--;rv64imac_zicsr-lp64--"
- make -j$(nproc)
+ make -j 1
```
## C Code and assemble code fom comoiler
### Problem from leetcode
leetcode Problem: [3. Longest Substring Without Repeating Characters](https://leetcode.com/problems/longest-substring-without-repeating-characters/)
**Problem description:**
```
Given a string s, find the length of the longest substring without repeating characters.
```
**Example:**
```
Input: s = "abcabcbb"
Output: 3
Explanation: The answer is "abc", with the length of 3.
```
**Example 2:**
```
Input: s = "bbbbb"
Output: 1
Explanation: The answer is "b", with the length of 1.
```
### C code
code
```
#include<stdio.h>
#include<stdlib.h>
// max function
int max(int x, int y) {
return x ^ ((x ^ y) & -(x < y));
}
// cacluate Longest Substring
int lengthOfLongestSubstring(char* s, int n) {
// hsah map for ascii
int freqMap[256] = {0};
int L = 0;
int R = 0;
int ans = 0;
while(R < n) {
char right = s[R++];
freqMap[right]++;
while(freqMap[right] > 1) {
char left = s[L++];
freqMap[left]--;
}
ans = max(ans, R - L);
}
return ans;
}
//main function for test data
int main()
{
int ans = 0;
char test[] = "abcd123";
int test_size = 7;
ans = lengthOfLongestSubstring(test, test_size);
printf("%d\n", ans);
return 0;
}
```
### assemble code from compiler
file path: ~/srv32/sw/leetcode_3/leetcode.c
compile c code to executable file
```
riscv64-unknown-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -nostartfiles -nostdlib -L../common -o leetcode_3.elf leetcode_3.c -lc -lm -lgcc -lsys -T ../common/default.ld
```
get assemble code
```
$riscv64-unknown-elf-objdump -d leetcode_3.elf > leetcode_3.dis
0000003c <max>:
3c: 00b55463 bge a0,a1,44 <max+0x8>
40: 00058513 mv a0,a1
44: 00008067 ret
00000048 <lengthOfLongestSubstring>:
48: bf010113 addi sp,sp,-1040
4c: 40812423 sw s0,1032(sp)
50: 40912223 sw s1,1028(sp)
54: 00050413 mv s0,a0
58: 00058493 mv s1,a1
5c: 40000613 li a2,1024
60: 00000593 li a1,0
64: 00010513 mv a0,sp
68: 40112623 sw ra,1036(sp)
6c: 0f4000ef jal ra,160 <memset>
70: 00000513 li a0,0
74: 06905a63 blez s1,e8 <lengthOfLongestSubstring+0xa0>
78: 00000893 li a7,0
7c: 00000693 li a3,0
80: 00100813 li a6,1
84: 00188893 addi a7,a7,1
88: 011407b3 add a5,s0,a7
8c: fff7c603 lbu a2,-1(a5)
90: 00261613 slli a2,a2,0x2
94: 40060793 addi a5,a2,1024
98: 00278633 add a2,a5,sp
9c: c0062783 lw a5,-1024(a2)
a0: 00178793 addi a5,a5,1
a4: c0f62023 sw a5,-1024(a2)
a8: 02f85863 bge a6,a5,d8 <lengthOfLongestSubstring+0x90>
ac: 00168693 addi a3,a3,1
b0: 00d407b3 add a5,s0,a3
b4: fff7c783 lbu a5,-1(a5)
b8: 00279793 slli a5,a5,0x2
bc: 40078793 addi a5,a5,1024
c0: 002787b3 add a5,a5,sp
c4: c007a703 lw a4,-1024(a5)
c8: fff70713 addi a4,a4,-1
cc: c0e7a023 sw a4,-1024(a5)
d0: c0062783 lw a5,-1024(a2)
d4: fcf84ce3 blt a6,a5,ac <lengthOfLongestSubstring+0x64>
d8: 40d887b3 sub a5,a7,a3
dc: 00f55463 bge a0,a5,e4 <lengthOfLongestSubstring+0x9c>
e0: 00078513 mv a0,a5
e4: fb1490e3 bne s1,a7,84 <lengthOfLongestSubstring+0x3c>
e8: 40c12083 lw ra,1036(sp)
ec: 40812403 lw s0,1032(sp)
f0: 40412483 lw s1,1028(sp)
f4: 41010113 addi sp,sp,1040
f8: 00008067 ret
000000fc <main>:
fc: 646367b7 lui a5,0x64636
100: fe010113 addi sp,sp,-32
104: 26178793 addi a5,a5,609 # 64636261 <_stack+0x645f6261>
108: 00f12423 sw a5,8(sp)
10c: 003337b7 lui a5,0x333
110: 23178793 addi a5,a5,561 # 333231 <_stack+0x2f3231>
114: 00700593 li a1,7
118: 00810513 addi a0,sp,8
11c: 00112e23 sw ra,28(sp)
120: 00812c23 sw s0,24(sp)
124: 00f12623 sw a5,12(sp)
128: f21ff0ef jal ra,48 <lengthOfLongestSubstring>
12c: 00050413 mv s0,a0
130: 00020537 lui a0,0x20
134: 03c50513 addi a0,a0,60 # 2003c <__malloc_trim_threshold+0x4>
138: 144000ef jal ra,27c <printf>
13c: 00020537 lui a0,0x20
140: 00040593 mv a1,s0
144: 04450513 addi a0,a0,68 # 20044 <__malloc_trim_threshold+0xc>
148: 134000ef jal ra,27c <printf>
14c: 01c12083 lw ra,28(sp)
150: 01812403 lw s0,24(sp)
154: 00000513 li a0,0
158: 02010113 addi sp,sp,32
15c: 00008067 ret
```
### Result
**RTL Simulation result**
```
7
Excuting 1885 instructions, 2421 cycles, 1.284 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.01 s
Simulation cycles: 2432
Simulation speed : 0.2432 MHz
```
**ISS (Instruction Set Simulator)**
```
./rvsim --memsize 128 -l trace.log ../sw/leetcode_3/leetcode_3.elf
7
Excuting 1885 instructions, 2421 cycles, 1.284 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 2421
Simulation speed : 6.801 MHz
```
## [srv32](https://github.com/sysprog21/srv32) with code
below picture show srv32 architecture, It is a 3-stage pipeline architecture with IF/ID, EX, WB stages.
* support fully bypass so we don't need consider data hazards
* when take branch the pc update at WB stages, so control hazard will take 2 nop.

### control hazard
In srv32, nop will produce at wb stage.
generate wave.fst at ~/srv32/sim/wave.fst
```
$ cd ~/srv32 && make leetcode_3
```
#### using GTKWave observe wave.fst
below code is from leetcode_3.elf and trace it for observe control hazard
```
code
00000020 <_bss_clear>:
20: 0002a023 sw zero,0(t0)
24: 00428293 addi t0,t0,4
28: fe62ece3 bltu t0,t1,20 <_bss_clear>
2c: 00040117 auipc sp,0x40
30: fd410113 addi sp,sp,-44 # 40000 <_stack>
```
<h4>before take branch:ex_stage addi t0,t04</h4>
| next_pc | feteh_pc | if_pc | ex_pc | wb_pc |
| -------- | -------- | -------- | -------- | -------- |
| 00000030 | 0000002c | 00000028 | 00000024 | 00000020 |
show in GTKWave

<h4>take branch:ex_stage bltu t0,t1,20 <_bss_clear></h4>
| next_pc | feteh_pc | if_pc | ex_pc | wb_pc |
| -------- | -------- | -------- | -------- | -------- |
| 00000020 | 00000030(nop) | 0000002c(nop) | 00000028 | 00000024 |
show in GTKWave, we can notice branch taken = 1, we should abort feteh_pc and if_pc and next_pc is 0x20

<h4>ex_stage auipc sp,0x40 -> shold be abort(nop)</h4>
| next_pc | feteh_pc | if_pc | ex_pc | wb_pc |
| -------- | -------- | -------- | -------- | -------- |
| 00000024 | 00000020 | 00000030(nop) | 0000002c(nop) | 00000028 |
show in GTKWave, we can notice wb stage is 0x28(bltu t0,t1,20), so wb stage should still work

<h4>ex_stage addi sp, sp, -44 #40000 <_stack> ->shold be abort(nop)</h4>
| next_pc | feteh_pc | if_pc | ex_pc | wb_pc |
| -------- | -------- | -------- | -------- | -------- |
| 00000028 | 00000024 | 00000020 | 00000030(nop) | 0000002c(nop) |
show in GTKWave, we can notice wb_flush and wb_nop is 1 because in wb_stage wb_pc is 0x2c, and it should't be execute

<h4>ex_stage bltu t0,t1,20 <_bss_clear></h4>
| next_pc | feteh_pc | if_pc | ex_pc | wb_pc |
| -------- | -------- | -------- | -------- | -------- |
| 0000002c | 00000028 | 00000024 | 00000020 | 00000030(nop) |
show in GTKWave, we can notice wb_flush and wb_nore_nop is 1 because in wb_stage wb_pc is 0x30, and it should't be execute

## Improve
### Improve c code
the idea is from [leetcode discussion](https://leetcode.com/problems/longest-substring-without-repeating-characters/discuss/2739188/PythonC%2B%2BJavaRust-0-ms-position-difference-(with-detailed-comments)), and I change some initial value for reduce two add instruction, it using jump to deal with duplicate character, replace while loop in original code
```
#include<stdio.h>
#include<stdlib.h>
int max(int x, int y) {
return x ^ ((x ^ y) & -(x < y));
}
int lengthOfLongestSubstring(char* s, int n) {
int pos[256];
memset(pos, -1, 1024);
int L = -1;
int R = 0;
int ans = 0;
while (R <n) {
char ch = s[R];
L = max(L, pos[ch]);
ans = max(ans, R - L);
pos[ch] = R;
R++;
}
return ans;
}
int main()
{
int ans = 0;
char test[] = "abcd123";
int test_size = 7;
ans = lengthOfLongestSubstring(test, test_size);
printf("%d\n", ans);
return 0;
}
```
### result form improve
**RTL Simulation result**
```
7
Excuting 1897 instructions, 2437 cycles, 1.284 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.01 s
Simulation cycles: 2448
Simulation speed : 0.2448 MHz
make[1]: Leaving directory '/home/sunny/srv32/sim'
make[1]: Entering directory '/home/sunny/srv32/tools'
```
**ISS (Instruction Set Simulator)**
```
./rvsim --memsize 128 -l trace.log ../sw/leetcode_improve_3/leetcode_improve_3.elf
7
Excuting 1897 instructions, 2437 cycles, 1.285 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 2437
Simulation speed : 6.313 MHz
```
### observe
i find the result don't get improve(Excuting cycles 2421 -> 2437), and I think the reason is my test case("abce123") is some case that has no duplicate so it dosen't get benfit from eliminate while loop from original code.
#### difference from lengthOfLongestSubstring function, no condiser memset
First, we compare assemble code from improve and original
In original code, lengthOfLongestSubstring function has five instruction may cause control hazard
| Instruction | wether branch taken|
| -------- | -------- |
| blez s1,e8 | Not Taken |
| bge a6,a5,d8 | always Taken |
| blt a6,a5,ac | not performance, because while(freqMap[right] > 1) always false |
| bge a0,a5,e4 | always Not Taken, because (R - L) always > ans |
| bne s1,a7,84 | Taken when R < n |
In improve code, lengthOfLongestSubstring function has four instruction may cause control hazard
| Instruction | wether branch taken|
| -------- | -------- |
| blez s0,c0 | Not Taken |
| bge a3,a2,a8 | always Taken |
| bge a0,a2,b4 | always Not Taken, because (R - L) always > ans |
| bne s0,a4,80 | Taken when R < n |
the assemble code show both are almost same, the biggest difference is the way to determine L position (while loop v.s L = max(L, os[ch])), Original has 45 line of code ,Improve has 35 line of code, But In my test case some branch always taken so some of code will never execute, by calculate both execute 34 line of code, and they have same branch taken number so the difference is seem not from lengthOfLongestSubstring
```
// original
00000048 <lengthOfLongestSubstring>:
48: bf010113 addi sp,sp,-1040
4c: 40812423 sw s0,1032(sp)
50: 40912223 sw s1,1028(sp)
54: 00050413 mv s0,a0
58: 00058493 mv s1,a1
5c: 40000613 li a2,1024
60: 00000593 li a1,0
64: 00010513 mv a0,sp
68: 40112623 sw ra,1036(sp)
6c: 0dc000ef jal ra,148 <memset>
70: 00000513 li a0,0
74: 06905a63 blez s1,e8 <lengthOfLongestSubstring+0xa0>
78: 00000893 li a7,0
7c: 00000693 li a3,0
80: 00100813 li a6,1
84: 00188893 addi a7,a7,1
88: 011407b3 add a5,s0,a7
8c: fff7c603 lbu a2,-1(a5)
90: 00261613 slli a2,a2,0x2
94: 40060793 addi a5,a2,1024
98: 00278633 add a2,a5,sp
9c: c0062783 lw a5,-1024(a2)
a0: 00178793 addi a5,a5,1
a4: c0f62023 sw a5,-1024(a2)
a8: 02f85863 bge a6,a5,d8 <lengthOfLongestSubstring+0x90>
ac: 00168693 addi a3,a3,1
b0: 00d407b3 add a5,s0,a3
b4: fff7c783 lbu a5,-1(a5)
b8: 00279793 slli a5,a5,0x2
bc: 40078793 addi a5,a5,1024
c0: 002787b3 add a5,a5,sp
c4: c007a703 lw a4,-1024(a5)
c8: fff70713 addi a4,a4,-1
cc: c0e7a023 sw a4,-1024(a5)
d0: c0062783 lw a5,-1024(a2)
d4: fcf84ce3 blt a6,a5,ac <lengthOfLongestSubstring+0x64>
d8: 40d887b3 sub a5,a7,a3
dc: 00f55463 bge a0,a5,e4 <lengthOfLongestSubstring+0x9c>
e0: 00078513 mv a0,a5
e4: fb1490e3 bne s1,a7,84 <lengthOfLongestSubstring+0x3c>
e8: 40c12083 lw ra,1036(sp)
ec: 40812403 lw s0,1032(sp)
f0: 40412483 lw s1,1028(sp)
f4: 41010113 addi sp,sp,1040
f8: 00008067 ret
// Improve
48: bf010113 addi sp,sp,-1040
4c: 40812423 sw s0,1032(sp)
50: 40912223 sw s1,1028(sp)
54: 00058413 mv s0,a1
58: 00050493 mv s1,a0
5c: 40000613 li a2,1024
60: fff00593 li a1,-1
64: 00010513 mv a0,sp
68: 40112623 sw ra,1036(sp)
6c: 0b4000ef jal ra,120 <memset>
70: 00000513 li a0,0
74: 04805663 blez s0,c0 <lengthOfLongestSubstring+0x78>
78: 00000713 li a4,0
7c: fff00693 li a3,-1
80: 00e487b3 add a5,s1,a4
84: 0007c783 lbu a5,0(a5)
88: 00279793 slli a5,a5,0x2
8c: 40078613 addi a2,a5,1024
90: 00260633 add a2,a2,sp
94: c0062603 lw a2,-1024(a2)
98: 40078793 addi a5,a5,1024
9c: 002787b3 add a5,a5,sp
a0: 00c6d463 bge a3,a2,a8 <lengthOfLongestSubstring+0x60>
a4: 00060693 mv a3,a2
a8: 40d70633 sub a2,a4,a3
ac: 00c55463 bge a0,a2,b4 <lengthOfLongestSubstring+0x6c>
b0: 00060513 mv a0,a2
b4: c0e7a023 sw a4,-1024(a5)
b8: 00170713 addi a4,a4,1
bc: fce412e3 bne s0,a4,80 <lengthOfLongestSubstring+0x38>
c0: 40c12083 lw ra,1036(sp)
c4: 40812403 lw s0,1032(sp)
c8: 40412483 lw s1,1028(sp)
cc: 41010113 addi sp,sp,1040
d0: 00008067 ret
```
#### using GTK_Wave to Check the difference come from
I decide to obversive some time slice in GTK_Wave, like call lengthOfLongestSubstring, call memset, back to main function, the result show in below table, and we can notice when back to main function both difference are 64 pps is 16 cycle, it is meets the execute result (Original: Excuting 1885 instructions, 2421 cycles, Improve: Excuting 1897 instructions, 2437 cycles, both diffenence are 16 cycles)
| Time Slice | original | Improve |
| -------- | -------- | -------- |
| call lengthOfLongestSubstring | 445ps | 445ps |
| call memset | 493ps | 493ps |
| back to lengthOfLongestSubstring | 2585ps | 2625ps |
| back to main function | 3129ps | 3193ps |
further more I trace <memset> and find why initial value -1 take more time, and the reason is when initial value is not zero is will fullfill 4 byte in initial value so it can use instruction "sw" to initial 4 byte once, and it will take two jump compare initial value is 0
```
<memset>:
148: 00f00313 li t1,15
14c: 00050713 mv a4,a0
150: 02c37e63 bgeu t1,a2,18c
154: 00f77793 andi a5,a4,15
158: 0a079063 bnez a5,1f8
15c: 08059263 bnez a1,1e0 //if initial value not zero jump
.
.
.
1b8: 0ff5f593 zext.b a1,a1 // Zero extend byte = andi a1, a1, 255
1bc: 00859693 slli a3,a1,0x8
1c0: 00d5e5b3 or a1,a1,a3
1c4: 01059693 slli a3,a1,0x10
1c8: 00d5e5b3 or a1,a1,a3
1cc: f6dff06f j 138
```
bleow picture show how I check the timeslice, when exe_pc is call function, I recoed pps


```
128: jal ra,48 <lengthOfLongestSubstring>
```
:::info
not compelete: find difference at lengthOfLongestSubstring
:::
### other test_case show improve
test case = "abcd123abd123456789", this case has duplicate characters, Improvement code can get advantage Form locate L by jump.
Excuting cycles improvement: 34 (2898 -> 2854)
original code result:
```
13
Excuting 2280 instructions, 2898 cycles, 1.271 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.011 s
Simulation cycles: 2909
Simulation speed : 0.264455 MHz
make[1]: Leaving directory '/home/sunny/srv32/sim'
make[1]: Entering directory '/home/sunny/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/leetcode_3/leetcode_3.elf
13
Excuting 2280 instructions, 2898 cycles, 1.271 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 2898
Simulation speed : 6.662 MHz
```
improve code result
```
13
Excuting 2234 instructions, 2854 cycles, 1.277 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.012 s
Simulation cycles: 2865
Simulation speed : 0.23875 MHz
make[1]: Leaving directory '/home/sunny/srv32/sim'
make[1]: Entering directory '/home/sunny/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/leetcode_improve_3/leetcode_improve_3.elf
13
Excuting 2234 instructions, 2854 cycles, 1.278 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 2854
Simulation speed : 6.811 MHz
make[1]: Leaving directory '/home/sunny/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
### Improve assemble code
Try to improve (Improve_code) in assemble code.
when branch taken below code will get two nop and execute sub a2, a4, a3 it will take 3 instructions, on the contrary it will take 2 instructions,if a3 == a2 branch taken and not taken will peform same result at srv32 so I change eaual situation from branch taken to branch not taken
```diff
- a0: bge a3, a2, a8 // if a3 >= a2 jump a8
+ blt a2, a3, a8 // if a3 > a2 jump a8
a4: mv a3, a2 // a3 = a2
a8: sub a2, a4, a3
```
get assemblecode code
```
riscv64-unknown-elf-gcc -O3 -Wall -march=rv32im_zicsr -mabi=ilp32 -S -o leetcode_improve_3.s leetcode_improve_3.c
```
before improvement result
```
1
Excuting 1883 instructions, 2419 cycles, 1.284 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.01 s
Simulation cycles: 2430
Simulation speed : 0.243 MHz
make[1]: Leaving directory '/home/sunny/srv32/sim'
make[1]: Entering directory '/home/sunny/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/leetcode_improve_3/leetcode_improve_3.elf
1
Excuting 1883 instructions, 2419 cycles, 1.285 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 2419
Simulation speed : 6.048 MHz
make[1]: Leaving directory '/home/sunny/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
after improvement result, add 1 instruction but reduce 1 cycles
```
1
Excuting 1884 instructions, 2418 cycles, 1.283 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.01 s
Simulation cycles: 2429
Simulation speed : 0.2429 MHz
make[1]: Leaving directory '/home/sunny/srv32/sim'
make[1]: Entering directory '/home/sunny/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/leetcode_improve_3/leetcode_improve_3.elf
1
Excuting 1884 instructions, 2418 cycles, 1.283 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.000 s
Simulation cycles: 2418
Simulation speed : 6.792 MHz
make[1]: Leaving directory '/home/sunny/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```