owned this note
owned this note
Published
Linked with GitHub
# Assignment3: SoftCPU
###### tags: `Computer Architecture`
## Getting Started
The code from my [assignment 1](https://hackmd.io/MJ5M3NnUTj2ZN9aGLOpznQ) is chosen, which is the solution of the problem "longest palindrome" from [LeetCode 409](https://leetcode.com/problems/longest-palindrome/). Based on Jserv's comment in my first assignment, I modified my code as follow:
```c
#include <stdio.h>
int longestPalindrome(char *s)
{
int count['z' - 'A' + 1] = {0}, res = 0, i = 0, N;
while (s[i] != '\0') {
if (++count[s[i] - 'A'] == 2) {
count[s[i] - 'A'] = 0;
res += 2;
}
++i;
}
N = i;
if (!(N & 1))
res += (res == N ? 0 : 1);
return res + (N & 1);
}
int main() {
char *str = "abccccdd";
printf("%d\n", longestPalindrome(str));
return 0;
}
```
The generated assembly code:
```
...
0000003c <longestPalindrome>:
3c: f0010113 addi sp,sp,-256
40: 0e812c23 sw s0,248(sp)
44: 0e800613 li a2,232
48: 00050413 mv s0,a0
4c: 00000593 li a1,0
50: 00810513 addi a0,sp,8
54: 0e112e23 sw ra,252(sp)
58: 0dc000ef jal ra,134 <memset>
5c: 00044783 lbu a5,0(s0)
60: 08078663 beqz a5,ec <longestPalindrome+0xb0>
64: 00000693 li a3,0
68: 00000513 li a0,0
6c: 00200593 li a1,2
70: fbf78793 addi a5,a5,-65
74: 00279793 slli a5,a5,0x2
78: 0f010713 addi a4,sp,240
7c: 00f70633 add a2,a4,a5
80: f1862703 lw a4,-232(a2)
84: 00168693 addi a3,a3,1
88: 00d407b3 add a5,s0,a3
8c: 00170713 addi a4,a4,1
90: 0007c783 lbu a5,0(a5)
94: 02b70463 beq a4,a1,bc <longestPalindrome+0x80>
98: f0e62c23 sw a4,-232(a2)
9c: fc079ae3 bnez a5,70 <longestPalindrome+0x34>
a0: 0016f793 andi a5,a3,1
a4: 02078663 beqz a5,d0 <longestPalindrome+0x94>
a8: 0fc12083 lw ra,252(sp)
ac: 0f812403 lw s0,248(sp)
b0: 00150513 addi a0,a0,1
b4: 10010113 addi sp,sp,256
b8: 00008067 ret
bc: f0062c23 sw zero,-232(a2)
c0: 00250513 addi a0,a0,2
c4: fa0796e3 bnez a5,70 <longestPalindrome+0x34>
c8: 0016f793 andi a5,a3,1
cc: fc079ee3 bnez a5,a8 <longestPalindrome+0x6c>
d0: 0fc12083 lw ra,252(sp)
d4: 0f812403 lw s0,248(sp)
d8: 40d506b3 sub a3,a0,a3
dc: 00d036b3 snez a3,a3
e0: 00a68533 add a0,a3,a0
e4: 10010113 addi sp,sp,256
e8: 00008067 ret
ec: 0fc12083 lw ra,252(sp)
f0: 0f812403 lw s0,248(sp)
f4: 00000513 li a0,0
f8: 10010113 addi sp,sp,256
fc: 00008067 ret
00000100 <main>:
100: 00020537 lui a0,0x20
104: ff010113 addi sp,sp,-16
108: 03c50513 addi a0,a0,60 # 2003c <__malloc_trim_threshold+0x4>
10c: 00112623 sw ra,12(sp)
110: f2dff0ef jal ra,3c <longestPalindrome>
114: 00050593 mv a1,a0
118: 00020537 lui a0,0x20
11c: 04850513 addi a0,a0,72 # 20048 <__malloc_trim_threshold+0x10>
120: 130000ef jal ra,250 <printf>
124: 00c12083 lw ra,12(sp)
128: 00000513 li a0,0
12c: 01010113 addi sp,sp,16
130: 00008067 ret
...
```
Also, I created a directory called "longest_palindrome" under *sw/* and I modified the Makefile from *hello*. The directory structure looks like:

Makefile:
```
include ../common/Makefile.common
EXE = .elf
SRC = longest_palindrome.c
CFLAGS += -L../common
LDFLAGS += -T ../common/default.ld
TARGET = longest_palindrome
OUTPUT = $(TARGET)$(EXE)
.PHONY: all clean
all: $(TARGET)
$(TARGET): $(SRC)
$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS)
$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
$(OBJCOPY) -O binary $(OUTPUT) memory.bin
$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
$(READELF) -a $(OUTPUT) > $(TARGET).symbol
clean:
$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin
```
To run the test, execute `make longest_palindrome` under the root directory of *srv32*:
```
$ make longest_palindrome
make[1]: Entering directory '/home/axwl03/Desktop/srv32/sw'
make -C common
make[2]: Entering directory '/home/axwl03/Desktop/srv32/sw/common'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/axwl03/Desktop/srv32/sw/common'
make[2]: Entering directory '/home/axwl03/Desktop/srv32/sw/longest_palindrome'
riscv-none-embed-gcc -O3 -Wall -march=rv32im -mabi=ilp32 -nostartfiles -nostdlib -L../common -o longest_palindrome.elf longest_palindrome.c -lc -lm -lgcc -lsys -T ../common/default.ld
riscv-none-embed-objcopy -j .text -O binary longest_palindrome.elf imem.bin
riscv-none-embed-objcopy -j .data -O binary longest_palindrome.elf dmem.bin
riscv-none-embed-objcopy -O binary longest_palindrome.elf memory.bin
riscv-none-embed-objdump -d longest_palindrome.elf > longest_palindrome.dis
riscv-none-embed-readelf -a longest_palindrome.elf > longest_palindrome.symbol
make[2]: Leaving directory '/home/axwl03/Desktop/srv32/sw/longest_palindrome'
make[1]: Leaving directory '/home/axwl03/Desktop/srv32/sw'
make[1]: Entering directory '/home/axwl03/Desktop/srv32/sim'
7
Excuting 1484 instructions, 1900 cycles, 1.280 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.022 s
Simulation cycles: 1911
Simulation speed : 0.0868636 MHz
make[1]: Leaving directory '/home/axwl03/Desktop/srv32/sim'
make[1]: Entering directory '/home/axwl03/Desktop/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/longest_palindrome/longest_palindrome.elf
7
Excuting 1484 instructions, 1900 cycles, 1.280 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.001 s
Simulation cycles: 1900
Simulation speed : 1.870 MHz
make[1]: Leaving directory '/home/axwl03/Desktop/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
We can see that the result is correct (it outputs `7` for the string `abccccdd`).
## Waveform
To generate wave.fst file, run `make longest_palindrom.run` under *sim/* directory. Next, I built [GTKWave](https://github.com/gtkwave/gtkwave/tree/master/gtkwave3-gtk3) from source. To import the file, drag it into GTKWave.

### Pipeline
#### The first instruction in `main`
1. IF/ID stage: This is where main function started. PC contains `0x00000100`, which is the instruction `lui a0, 0x20` (`imem_rdata`: `0x00020537`).

2. EX stage: After the clock ticks, `lui a0, 0x20` goes to EX stage. `ex_insn` becomes `0x00020537` and `ex_imm` becomes `0x00020000` (`lui` loads immediate value to the top 20 bits). The result is calculated in this stage.

3. WB stage: Finally, `wb_result` gets the execution result `0x00020000` in WB stage and register `a0` will be updated in the next clock.

Let's see another example.
#### longestPalindrome
I choose the instruction at `0x0000005c`, which is `lbu a5, 0(s0)` (`0x00044783`). Because this instruction uses `a5` and `s0` register, I add both register to the graph.
1. IF/ID stage: We can observe that `imm_rdata` becomes `0x00044783` in this stage.

2. EX stage: `ex_src1_sel` indicates which register that should be read. In this case, it points to the 8th register (`s0`). The value stored in register `s0` is read into `reg_rdata1`. `reg_rdata1` then becomes `0x0002003c`.

3. WB stage: The result is read from DMEM, which is `0x61`. Therefore, `wb_rdata` becomes `0x61`. This value is written back into register `a5` in the next clock.

#### Stall
I added two signals to the graph, which are `wb_flush` and `ex_flush`. Both indicate that the prediction failed for branch/jump instructions.

The address in the screenshot above is where longestPalindrome function is called. Below table demonstrates the pipeline process. Both instructions at `0x00000114` and `0x00000118` are flushed.
| address | instruction | 0 | 1 | 2 | 3 | 4 | 5 |
| -------- | -------- | -------- | ------- | ------- | ------- | ------- | ------- |
| 0x00000110 | jal ra,3c \<longestPalindrome> | IF/ID | EX | WB | | | | |
| <font color="#f00">0x00000114</font> | <font color="#f00">mv a1,a0</font> | | <font color="#f00">NOP</font> | <font color="#f00">NOP</font> | <font color="#f00">NOP</font> | | |
| <font color="#f00">0x00000118</font> | <font color="#f00">lui a0,0x20</font> | | | <font color="#f00">NOP</font> | <font color="#f00">NOP</font> | <font color="#f00">NOP</font> | |
| 0x0000003c | addi sp,sp,-256 | | | | IF/ID | EX | WB |
## Software Optimization
In this algorithm, latter values depend on former value to calculate (e.x. 8, 9 depends on 4), so we cannot perform loop unrolling in this case. To enhance the performance, I tried eliminating if clause to prevent branching. I also change the input (`abccccdd` -> `aaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbccccccccccccccccccccde`).
The original code becomes:
```c
#include <stdio.h>
int longestPalindrome(char *s)
{
int count['z' - 'A' + 1] = {0}, res = 0, i = 0, N;
while (s[i] != '\0') {
if (++count[s[i] - 'A'] == 2) {
count[s[i] - 'A'] = 0;
res += 2;
}
++i;
}
N = i;
if (!(N & 1))
res += (res == N ? 0 : 1);
return res + (N & 1);
}
int main() {
char *str = "aaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbccccccccccccccccccccde";
printf("%d\n", longestPalindrome(str));
return 0;
}
```
Execute `make longest_palindrome`:
```
...
make[1]: Leaving directory '/home/axwl03/Desktop/srv32/sim'
make[1]: Entering directory '/home/axwl03/Desktop/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/longest_palindrome/longest_palindrome.elf
61
Excuting 2424 instructions, 3070 cycles, 1.267 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 3070
Simulation speed : 1.837 MHz
make[1]: Leaving directory '/home/axwl03/Desktop/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
The first thing I tried is to eliminate the if clause here:
```c
if (!(N & 1))
res += (res == N ? 0 : 1);
```
It becomes:
```c
res += !(N & 1) & (res != N);
```
Execute `make longest_palindrome`, the result is:
```
...
make[1]: Leaving directory '/home/axwl03/Desktop/srv32/sim'
make[1]: Entering directory '/home/axwl03/Desktop/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/longest_palindrome/longest_palindrome.elf
61
Excuting 2426 instructions, 3070 cycles, 1.265 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 3070
Simulation speed : 1.861 MHz
make[1]: Leaving directory '/home/axwl03/Desktop/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
We can see that CPI becomes 1.265, slightly better than original version.
Another place I modified is:
```c
while (s[i] != '\0') {
if (++count[s[i] - 'A'] == 2) {
count[s[i] - 'A'] = 0;
res += 2;
}
++i;
}
```
It becomes:
```c
while (s[i] != '\0') {
res += ++count[s[i] - 'A'] & 2;
count[s[i++] - 'A'] &= 1;
}
```
It results in:
```
...
make[1]: Leaving directory '/home/axwl03/Desktop/srv32/sim'
make[1]: Entering directory '/home/axwl03/Desktop/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/longest_palindrome/longest_palindrome.elf
61
Excuting 2519 instructions, 3103 cycles, 1.232 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 3103
Simulation speed : 1.795 MHz
make[1]: Leaving directory '/home/axwl03/Desktop/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===
```
The CPI decreases (1.232). However, the total number of cycles and instructions increases.
My final code:
```c
#include <stdio.h>
int longestPalindrome(char *s)
{
int count['z' - 'A' + 1] = {0}, res = 0, i = 0, N;
while (s[i] != '\0') {
res += ++count[s[i] - 'A'] & 2;
count[s[i++] - 'A'] &= 1;
}
N = i;
res += !(N & 1) & (res != N);
return res + (N & 1);
}
int main() {
char *str = "aaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbccccccccccccccccccccde";
printf("%d\n", longestPalindrome(str));
return 0;
}
```
## How Compliance Tests Works
* The goal of the compliance tests is to test whether an implementation of the CPU meets the standard. In this case, we are testing RISC-V implementation.