---
tags: Computer Architecture 2022
---
# Assignment3: SoftCPU
## Prerequisites
Several things you need on ubuntu
1. riscv GNU toolchain [xPack GNU RISC-V Embedded GCC](https://xpack.github.io/riscv-none-elf-gcc/releases/)
After unzipping it you have to define the environment variable.
```
export CROSS_COMPILE=~/riscv-none-elf-gcc/bin/riscv-none-elf-
```
2. dependent packages
```
sudo apt install build-essential lcov ccache libsystemc-dev
```
3. verilator
```
cd $HOME
git clone https://github.com/verilator/verilator
cd verilator
git checkout stable
export VERILATOR_ROOT=`pwd`
./configure
make
```
Then, you can set environment variables in advance.
```
export VERILATOR_ROOT=$HOME/verilator
export PATH=$VERILATOR_ROOT/bin:$PATH
```
4. srv32
```
git clone https://github.com/sysprog21/srv32
```
Then, you also need to set environment variables.
```
export EXTRA_CFLAGS="-misa-spec=2.2 -march=rv32im"
```
## Requirements 1
### C code
The C code following is from my [assignment2](https://hackmd.io/@Fo7UsdePRsKPVV4CPYGbpA/Sy7ZmxN7i). The leetcode question is [Convert Binary Number in a Linked List to Integer](https://leetcode.com/problems/convert-binary-number-in-a-linked-list-to-integer/).
```clike=
#include <stdlib.h>
#include <stdio.h>
const int array[3] = {1, 0, 1};
struct ListNode
{
int val;
struct ListNode *next;
};
void create_list(struct ListNode **cur){
for(int i = 0 ; i < 3 ; i++){
struct ListNode *new_node = (struct ListNode *)malloc(sizeof(struct ListNode));
new_node->val = array[i];
new_node->next = NULL;
*cur = new_node;
cur = &((*cur)->next);
}
}
int getDecimalValue(struct ListNode* head){
int ans = 0;
while(head){
ans <<= 1;
ans |= head->val;
head = head->next;
}
return ans;
}
int main(){
struct ListNode *head = NULL;
int ans;
create_list(&head);
ans = getDecimalValue(head);
if(ans == 5){
printf("correct!\n");
}else{
printf("wrong\n");
}
return 0;
```
#### Run ISS sim
```
./rvsim --memsize 128 -l trace.log ../sw/hw3/hw3.elf
correct!
Excuting 1404 instructions, 1894 cycles, 1.349 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.001 s
Simulation cycles: 1894
Simulation speed : 1.436 MHz
```
#### Run RTL sim
```
correct!
Excuting 1404 instructions, 1894 cycles, 1.349 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.042 s
Simulation cycles: 1905
Simulation speed : 0.0453571 MHz
```
#### compare
| | RTL | ISS |
|:----------------- |:-------------:|:---------:|
| instructions | 1404 | 1404 |
| cycles | 1894 | 1894 |
| CPI | 1.349 | 1.349 |
| Simulation time | 0.042 s | 0.001 s |
| Simulation cycles | 1905 | 1894 |
| Simulation speed | 0.0453571 MHz | 1.436 MHz |
### My assembly code
Finally, I figure out how to modify my code to run simulation. The Makefile and assembly code is by follow.
:::spoiler Makefile
```include ../common/Makefile.common
EXE = .elf
SRC = hw3.S
CFLAGS += -L../common
LDFLAGS += -T ../common/default.ld
TARGET = hw3
OUTPUT = $(TARGET)$(EXE)
.PHONY: all clean
all: $(TARGET)
$(TARGET): $(SRC)
$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS)
$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
$(OBJCOPY) -O binary $(OUTPUT) memory.bin
$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
$(READELF) -a $(OUTPUT) > $(TARGET).symbol
clean:
$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin
```
:::
:::spoiler My assembly code
```clike=
.data
array:
.word 1
.word 0
.word 1
answer:
.word 5
correct:
.string "correct\n"
wrong:
.string "wrong answer\n"
iformat:
.string "%d"
end:
.string "\n"
.text
.global main
main:
addi sp, sp, -8 # head pointer
sw x0, 0(sp) # head = NULL
sw ra, 4(sp) # store ra
la s0, array # s0 = array[1,0,1]
lw s1, answer # s1 = 5
add a0, sp, x0 # head pointer's address
jal ra, createList
lw a0, 0(sp)
jal ra, getDecimalValue
bne a0, s1, wrongAnswer
la a0, correct
call printf
exit:
lw ra, 4(sp)
addi sp, sp, 8
li a0, 0
ret
wrongAnswer:
la a0, wrong
call printf
j exit
createList:
addi sp, sp, -12 # store s0, s1 (callee save)
# store ra (caller save)
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
mv s0, a0 # s0 = head pointer address
li a0, 8 # allocate 8 byte for structure ListNode
call sbrk # system call malloc
sw a0, 0(s0) # head pointer points to first node
li t0, 1 # value = 1
sw t0, 0(a0)
mv t1, a0 # t1 = first node address
li a0, 8
call sbrk # malloc second node
sw a0, 4(t1) # first node points to second node
li t0, 0 # value = 0
sw t0, 0(a0)
mv t1, a0 # t1 = second node address
li a0, 8
call sbrk
sw a0, 4(t1) # second node points to third node
li t0, 1 # value = 1
sw t0, 0(a0)
li t0, 1 # value = 1
sw t0, 0(a0)
sw x0, 4(a0) # third node's next pointer point to NULL
lw ra, 0(sp) # restore register value in stack
lw s0, 4(sp)
lw s1, 8(sp)
addi sp, sp, 12
jr ra
getDecimalValue:
addi sp, sp, -8 # store ra, s0(for a new pointer)
sw ra, 0(sp)
sw s0, 4(sp)
mv s0, a0
li a0, 0 # a0 = answer
# loop unroll
# first node
lw a0, 0(s0) # a0 = first node value
lw s0, 4(s0) # head = head->next
# second node
lw t0, 0(s0) # t0 = second node value
slli a0, a0, 1
or a0, a0, t0
lw s0, 4(s0)
# third node
lw t0, 0(s0) # t0 = third node value
slli a0, a0, 1
or a0, a0, t0 # a0 is the final answer
lw ra, 0(sp)
lw s0, 4(sp)
addi sp, sp, 8
```
:::
#### Run ISS sim
```
make[1]: Entering directory '/home/cheng/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/hw3/hw3.elf
correct
Excuting 2064 instructions, 2920 cycles, 1.415 CPI
Program terminate
Simulation statistics
=====================
Simulation time : 0.002 s
Simulation cycles: 2920
Simulation speed : 1.413 MHz
make[1]: Leaving directory '/home/cheng/srv32/tools'
```
#### Run RTL sim
```
make[1]: Entering directory '/home/cheng/srv32/sim'
correct
Excuting 2064 instructions, 2920 cycles, 1.414 CPI
Program terminate
- ../rtl/../testbench/testbench.v:434: Verilog $finish
Simulation statistics
=====================
Simulation time : 0.048 s
Simulation cycles: 2931
Simulation speed : 0.0610625 MHz
make[1]: Leaving directory '/home/cheng/srv32/sim'
```
#### compare
| | RTL | ISS |
|:----------------- |:-------------:|:---------:|
| instructions | 2064 | 2064 |
| cycles | 2920 | 2920 |
| CPI | 1.414 | 1.415 |
| Simulation time | 0.048 s | 0.002 s |
| Simulation cycles | 2931 | 2920 |
| Simulation speed | 0.0610625 MHz | 1.413 MHz |
### Analysis
We can see a big diffence between my assembly code and my c code through GNU gcc.
| | C code | assembly code |
|:------------ |:------:|:-------------:|
| instructions | 1404 | 2064 |
| cycles | 1894 | 2920 |
## Requirements 2
Let's take a look at srv32 architecture. srv32 is a 3-stage pipeline architecture with IF/ID, EX, WB stages.

Now, check the waveform of my assembly code by using GTKwave.
### Data hazard
Since srv32 support full bypassing, there is no both data hazard and load-use data hazard in srv32.
### Control hazard
Here is part of disassemble code.
```
0000003c <main>:
3c: ff810113 addi sp,sp,-8
40: 00012023 sw zero,0(sp)
44: 00112223 sw ra,4(sp)
48: 00021417 auipc s0,0x21
4c: de840413 addi s0,s0,-536 # 20e30 <array>
50: 00021497 auipc s1,0x21
54: dec4a483 lw s1,-532(s1) # 20e3c <answer>
58: 00010533 add a0,sp,zero
5c: 03c000ef jal ra,98 <createList>
60: 00012503 lw a0,0(sp)
64: 0ac000ef jal ra,110 <getDecimalValue>
68: 02951063 bne a0,s1,88 <wrongAnswer>
6c: 00021517 auipc a0,0x21
70: dd450513 addi a0,a0,-556 # 20e40 <correct>
74: 124000ef jal ra,198 <printf>
```
Here is the waveform.

Address 5C is a jal instruction. It should jump to the label "createList". It loads two wrong instructions, **`wb_nop` and `wb_nop_more` turn to 1** to make sure it won't write back to register. The **branch penalty is 2 cycle**.
### Explain how srv32 works

#### IF/ID
:::spoiler IF/ID image

:::
Now address 5c instruction which is `jal ra,98 <createList>` is in IF/ID stage. `imem_rdata` is 03C000EF and decode to `instr` 03C000EF and `imm` 3C. 03C000EF is machine code of `jal ra,98 <createList>`.
#### EX
:::spoiler EX image

:::
Address 58 instruction which is `add a0,sp,zero` . `add` instruction will add two source register `sp` and `zero`(zero = x0). As we can see `reg_rdata1` is 0003FFF8 and `reg_rdata2` is 0, so the `ex_result` is 0003FFF8.
#### WB
:::spoiler WB image

:::
Address 54 instruction which is `lw s1,-532(s1) # 20e3c <answer>`. `lw` instruction read data from an address. Read address is `wb_raddress` 00020E3C. The data we get is `wb_rdata` 5.
## Requirements 3
I'll list some strategies that I use in below.
1. loop unrolling, basically already done in my origin code.
2. use fewer `lw` or `sw` instructions, e.g., instead of reading from data section that already know, I use `li` to give value to register. I do not store some callee save register because I won't change it, like `s0`, `s1` registers.
3. out of order. Due to the srv32 static branch prediction, I make sure my code won't do any branch.
:::spoiler assembly code
```clike=
.data
array:
.word 1, 0, 1
answer:
.word 5
correct:
.string "correct\n"
wrong:
.string "wrong answer\n"
.text
.global main
main:
addi sp, sp, -8 # head pointer
sw x0, 0(sp) # head = NULL
sw ra, 4(sp) # store ra
li s0, 5 # answer s1 = 5
add a0, sp, x0 # head pointer's address
createList:
addi sp, sp, -4 # store ra (caller save)
sw ra, 0(sp)
mv t2, a0 # t2 = head pointer address
li a0, 8 # allocate 8 byte for structure ListNode
call sbrk # system call malloc
sw a0, 0(t2) # head pointer points to first node
#mv t2, a0 # we can store node address
li t0, 1 # value = 1
sw t0, 0(a0)
mv t1, a0 # t1 = first node address
li a0, 8
call sbrk # malloc second node
sw a0, 4(t1) # first node points to second node
li t0, 0 # value = 0
sw t0, 0(a0)
mv t1, a0 # t1 = second node address
li a0, 8
call sbrk
sw a0, 4(t1) # second node points to third node
li t0, 1 # value = 1
sw t0, 0(a0)
sw x0, 4(a0) # third node's next pointer point to NULL
lw ra, 0(sp) # restore register value in stack
addi sp, sp, 4
getDecimalValue:
lw t1, 0(sp)
li a0, 0 # a0 = answer
# loop unroll
# first node
lw a0, 0(t1) # a0 = first node value
lw t1, 4(t1) # head = head->next
# second node
lw t0, 0(t1) # t0 = second node value
slli a0, a0, 1
or a0, a0, t0
lw t1, 4(t1)
# third node
lw t0, 0(t1) # t0 = third node value
slli a0, a0, 1
or a0, a0, t0 # a0 is the final answer
validate:
bne a0, s0, wrongAnswer
la a0, correct
call printf
exit:
lw ra, 4(sp)
addi sp, sp, 8
li a0, 0
jr ra
wrongAnswer: # print "wrong answer" and exit
la a0, wrong
call printf
j exit
```
:::
| | Origin assembly | Optimized assembly |
|:------------ |:---------------:|:------------------:|
| Instructions | 2064 | 2044 |
| Cycles | 2920 | 2892 |
Although I reduced only close to 30 cycles, but **my code only take branches while calling syscall and at the end of the code**(return 0).