--- tags: Computer Architecture 2022 --- # Assignment3: SoftCPU ## Prerequisites Several things you need on ubuntu 1. riscv GNU toolchain [xPack GNU RISC-V Embedded GCC](https://xpack.github.io/riscv-none-elf-gcc/releases/) After unzipping it you have to define the environment variable. ``` export CROSS_COMPILE=~/riscv-none-elf-gcc/bin/riscv-none-elf- ``` 2. dependent packages ``` sudo apt install build-essential lcov ccache libsystemc-dev ``` 3. verilator ``` cd $HOME git clone https://github.com/verilator/verilator cd verilator git checkout stable export VERILATOR_ROOT=`pwd` ./configure make ``` Then, you can set environment variables in advance. ``` export VERILATOR_ROOT=$HOME/verilator export PATH=$VERILATOR_ROOT/bin:$PATH ``` 4. srv32 ``` git clone https://github.com/sysprog21/srv32 ``` Then, you also need to set environment variables. ``` export EXTRA_CFLAGS="-misa-spec=2.2 -march=rv32im" ``` ## Requirements 1 ### C code The C code following is from my [assignment2](https://hackmd.io/@Fo7UsdePRsKPVV4CPYGbpA/Sy7ZmxN7i). The leetcode question is [Convert Binary Number in a Linked List to Integer](https://leetcode.com/problems/convert-binary-number-in-a-linked-list-to-integer/). ```clike= #include <stdlib.h> #include <stdio.h> const int array[3] = {1, 0, 1}; struct ListNode { int val; struct ListNode *next; }; void create_list(struct ListNode **cur){ for(int i = 0 ; i < 3 ; i++){ struct ListNode *new_node = (struct ListNode *)malloc(sizeof(struct ListNode)); new_node->val = array[i]; new_node->next = NULL; *cur = new_node; cur = &((*cur)->next); } } int getDecimalValue(struct ListNode* head){ int ans = 0; while(head){ ans <<= 1; ans |= head->val; head = head->next; } return ans; } int main(){ struct ListNode *head = NULL; int ans; create_list(&head); ans = getDecimalValue(head); if(ans == 5){ printf("correct!\n"); }else{ printf("wrong\n"); } return 0; ``` #### Run ISS sim ``` ./rvsim --memsize 128 -l trace.log ../sw/hw3/hw3.elf correct! Excuting 1404 instructions, 1894 cycles, 1.349 CPI Program terminate Simulation statistics ===================== Simulation time : 0.001 s Simulation cycles: 1894 Simulation speed : 1.436 MHz ``` #### Run RTL sim ``` correct! Excuting 1404 instructions, 1894 cycles, 1.349 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.042 s Simulation cycles: 1905 Simulation speed : 0.0453571 MHz ``` #### compare | | RTL | ISS | |:----------------- |:-------------:|:---------:| | instructions | 1404 | 1404 | | cycles | 1894 | 1894 | | CPI | 1.349 | 1.349 | | Simulation time | 0.042 s | 0.001 s | | Simulation cycles | 1905 | 1894 | | Simulation speed | 0.0453571 MHz | 1.436 MHz | ### My assembly code Finally, I figure out how to modify my code to run simulation. The Makefile and assembly code is by follow. :::spoiler Makefile ```include ../common/Makefile.common EXE = .elf SRC = hw3.S CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = hw3 OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` ::: :::spoiler My assembly code ```clike= .data array: .word 1 .word 0 .word 1 answer: .word 5 correct: .string "correct\n" wrong: .string "wrong answer\n" iformat: .string "%d" end: .string "\n" .text .global main main: addi sp, sp, -8 # head pointer sw x0, 0(sp) # head = NULL sw ra, 4(sp) # store ra la s0, array # s0 = array[1,0,1] lw s1, answer # s1 = 5 add a0, sp, x0 # head pointer's address jal ra, createList lw a0, 0(sp) jal ra, getDecimalValue bne a0, s1, wrongAnswer la a0, correct call printf exit: lw ra, 4(sp) addi sp, sp, 8 li a0, 0 ret wrongAnswer: la a0, wrong call printf j exit createList: addi sp, sp, -12 # store s0, s1 (callee save) # store ra (caller save) sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) mv s0, a0 # s0 = head pointer address li a0, 8 # allocate 8 byte for structure ListNode call sbrk # system call malloc sw a0, 0(s0) # head pointer points to first node li t0, 1 # value = 1 sw t0, 0(a0) mv t1, a0 # t1 = first node address li a0, 8 call sbrk # malloc second node sw a0, 4(t1) # first node points to second node li t0, 0 # value = 0 sw t0, 0(a0) mv t1, a0 # t1 = second node address li a0, 8 call sbrk sw a0, 4(t1) # second node points to third node li t0, 1 # value = 1 sw t0, 0(a0) li t0, 1 # value = 1 sw t0, 0(a0) sw x0, 4(a0) # third node's next pointer point to NULL lw ra, 0(sp) # restore register value in stack lw s0, 4(sp) lw s1, 8(sp) addi sp, sp, 12 jr ra getDecimalValue: addi sp, sp, -8 # store ra, s0(for a new pointer) sw ra, 0(sp) sw s0, 4(sp) mv s0, a0 li a0, 0 # a0 = answer # loop unroll # first node lw a0, 0(s0) # a0 = first node value lw s0, 4(s0) # head = head->next # second node lw t0, 0(s0) # t0 = second node value slli a0, a0, 1 or a0, a0, t0 lw s0, 4(s0) # third node lw t0, 0(s0) # t0 = third node value slli a0, a0, 1 or a0, a0, t0 # a0 is the final answer lw ra, 0(sp) lw s0, 4(sp) addi sp, sp, 8 ``` ::: #### Run ISS sim ``` make[1]: Entering directory '/home/cheng/srv32/tools' ./rvsim --memsize 128 -l trace.log ../sw/hw3/hw3.elf correct Excuting 2064 instructions, 2920 cycles, 1.415 CPI Program terminate Simulation statistics ===================== Simulation time : 0.002 s Simulation cycles: 2920 Simulation speed : 1.413 MHz make[1]: Leaving directory '/home/cheng/srv32/tools' ``` #### Run RTL sim ``` make[1]: Entering directory '/home/cheng/srv32/sim' correct Excuting 2064 instructions, 2920 cycles, 1.414 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.048 s Simulation cycles: 2931 Simulation speed : 0.0610625 MHz make[1]: Leaving directory '/home/cheng/srv32/sim' ``` #### compare | | RTL | ISS | |:----------------- |:-------------:|:---------:| | instructions | 2064 | 2064 | | cycles | 2920 | 2920 | | CPI | 1.414 | 1.415 | | Simulation time | 0.048 s | 0.002 s | | Simulation cycles | 2931 | 2920 | | Simulation speed | 0.0610625 MHz | 1.413 MHz | ### Analysis We can see a big diffence between my assembly code and my c code through GNU gcc. | | C code | assembly code | |:------------ |:------:|:-------------:| | instructions | 1404 | 2064 | | cycles | 1894 | 2920 | ## Requirements 2 Let's take a look at srv32 architecture. srv32 is a 3-stage pipeline architecture with IF/ID, EX, WB stages. ![](https://i.imgur.com/UiB23CQ.png) Now, check the waveform of my assembly code by using GTKwave. ### Data hazard Since srv32 support full bypassing, there is no both data hazard and load-use data hazard in srv32. ### Control hazard Here is part of disassemble code. ``` 0000003c <main>: 3c: ff810113 addi sp,sp,-8 40: 00012023 sw zero,0(sp) 44: 00112223 sw ra,4(sp) 48: 00021417 auipc s0,0x21 4c: de840413 addi s0,s0,-536 # 20e30 <array> 50: 00021497 auipc s1,0x21 54: dec4a483 lw s1,-532(s1) # 20e3c <answer> 58: 00010533 add a0,sp,zero 5c: 03c000ef jal ra,98 <createList> 60: 00012503 lw a0,0(sp) 64: 0ac000ef jal ra,110 <getDecimalValue> 68: 02951063 bne a0,s1,88 <wrongAnswer> 6c: 00021517 auipc a0,0x21 70: dd450513 addi a0,a0,-556 # 20e40 <correct> 74: 124000ef jal ra,198 <printf> ``` Here is the waveform. ![](https://i.imgur.com/VAtnuqV.png) Address 5C is a jal instruction. It should jump to the label "createList". It loads two wrong instructions, **`wb_nop` and `wb_nop_more` turn to 1** to make sure it won't write back to register. The **branch penalty is 2 cycle**. ### Explain how srv32 works ![](https://i.imgur.com/BBkpMWO.png) #### IF/ID :::spoiler IF/ID image ![](https://i.imgur.com/TTFGopl.png) ::: Now address 5c instruction which is `jal ra,98 <createList>` is in IF/ID stage. `imem_rdata` is 03C000EF and decode to `instr` 03C000EF and `imm` 3C. 03C000EF is machine code of `jal ra,98 <createList>`. #### EX :::spoiler EX image ![](https://i.imgur.com/PQI3Bu4.png) ::: Address 58 instruction which is `add a0,sp,zero` . `add` instruction will add two source register `sp` and `zero`(zero = x0). As we can see `reg_rdata1` is 0003FFF8 and `reg_rdata2` is 0, so the `ex_result` is 0003FFF8. #### WB :::spoiler WB image ![](https://i.imgur.com/gcm78PR.png) ::: Address 54 instruction which is `lw s1,-532(s1) # 20e3c <answer>`. `lw` instruction read data from an address. Read address is `wb_raddress` 00020E3C. The data we get is `wb_rdata` 5. ## Requirements 3 I'll list some strategies that I use in below. 1. loop unrolling, basically already done in my origin code. 2. use fewer `lw` or `sw` instructions, e.g., instead of reading from data section that already know, I use `li` to give value to register. I do not store some callee save register because I won't change it, like `s0`, `s1` registers. 3. out of order. Due to the srv32 static branch prediction, I make sure my code won't do any branch. :::spoiler assembly code ```clike= .data array: .word 1, 0, 1 answer: .word 5 correct: .string "correct\n" wrong: .string "wrong answer\n" .text .global main main: addi sp, sp, -8 # head pointer sw x0, 0(sp) # head = NULL sw ra, 4(sp) # store ra li s0, 5 # answer s1 = 5 add a0, sp, x0 # head pointer's address createList: addi sp, sp, -4 # store ra (caller save) sw ra, 0(sp) mv t2, a0 # t2 = head pointer address li a0, 8 # allocate 8 byte for structure ListNode call sbrk # system call malloc sw a0, 0(t2) # head pointer points to first node #mv t2, a0 # we can store node address li t0, 1 # value = 1 sw t0, 0(a0) mv t1, a0 # t1 = first node address li a0, 8 call sbrk # malloc second node sw a0, 4(t1) # first node points to second node li t0, 0 # value = 0 sw t0, 0(a0) mv t1, a0 # t1 = second node address li a0, 8 call sbrk sw a0, 4(t1) # second node points to third node li t0, 1 # value = 1 sw t0, 0(a0) sw x0, 4(a0) # third node's next pointer point to NULL lw ra, 0(sp) # restore register value in stack addi sp, sp, 4 getDecimalValue: lw t1, 0(sp) li a0, 0 # a0 = answer # loop unroll # first node lw a0, 0(t1) # a0 = first node value lw t1, 4(t1) # head = head->next # second node lw t0, 0(t1) # t0 = second node value slli a0, a0, 1 or a0, a0, t0 lw t1, 4(t1) # third node lw t0, 0(t1) # t0 = third node value slli a0, a0, 1 or a0, a0, t0 # a0 is the final answer validate: bne a0, s0, wrongAnswer la a0, correct call printf exit: lw ra, 4(sp) addi sp, sp, 8 li a0, 0 jr ra wrongAnswer: # print "wrong answer" and exit la a0, wrong call printf j exit ``` ::: | | Origin assembly | Optimized assembly | |:------------ |:---------------:|:------------------:| | Instructions | 2064 | 2044 | | Cycles | 2920 | 2892 | Although I reduced only close to 30 cycles, but **my code only take branches while calling syscall and at the end of the code**(return 0).