# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < `dck9661` >
## Install riscv rv32ima toolchain
First, install the RISC-V toolchain for this homework. By default, you will get a 64-bit RISC-V toolchain.
```bash
$ git clone https://github.com/riscv/riscv-gnu-toolchain
$ cd riscv-gnu-toolchain
$ git submodule update --init --recursive
```
Because we need RV32I, run the following script to build a 32-bit toolchain. The toolchain contains many projects, including the RISC-V compiler and the simulator `spike`.
```bash
#! /bin/bash
#
# Script to build RISC-V ISA simulator, proxy kernel, and GNU toolchain.
# Tools will be installed to $RISCV.
. build.common
echo "Starting RISC-V Toolchain build process"
build_project riscv-fesvr --prefix=$RISCV
build_project riscv-isa-sim --prefix=$RISCV --with-fesvr=$RISCV --with-isa=rv32ima
build_project riscv-gnu-toolchain --prefix=$RISCV --with-arch=rv32ima --with-abi=ilp32
CC= CXX= build_project riscv-pk --prefix=$RISCV --host=riscv32-unknown-elf
build_project riscv-openocd --prefix=$RISCV --enable-remote-bitbang --disable-werror
echo -e "\\nRISC-V Toolchain installation completed!"
```
## Start a simple test
I wrote a simple multiplier in C, compiled it with the RISC-V compiler, and ran it with spike to check whether the answer is correct.
```bash
$ riscv32-unknown-elf-gcc -o mul mul.c
$ spike pk mul
```
```cpp
#include <stdio.h>

int main()
{
    int mul1 = -1;
    int mul2 = -2;
    int result = 0;
    for (int i = 0; i < 32; i++)
    {
        if ((mul2 >> i) & 0x1)
            result = result + (mul1 << i);
    }
    printf("result:%d\n", result);
}
```
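To check the shift-and-add loop against more than one pair of inputs, it can be pulled out into a function. The following is a minimal sketch, not part of the original homework: the name `shift_add_mul` is hypothetical, and it uses unsigned arithmetic because left-shifting a negative `int` is undefined behaviour in C.

```c
#include <stdint.h>

/* Shift-and-add multiply on 32-bit values, mirroring the loop in mul.c.
 * Hypothetical helper: unsigned types keep every shift well defined. */
int32_t shift_add_mul(int32_t multiplicand, int32_t multiplier)
{
    uint32_t a = (uint32_t)multiplicand;
    uint32_t b = (uint32_t)multiplier;
    uint32_t result = 0;

    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1u)      /* bit i of the multiplier is set */
            result += a << i;   /* add multiplicand * 2^i, wrapping mod 2^32 */
    }
    /* Converting back assumes a two's complement target such as RV32. */
    return (int32_t)result;
}
```

Calling `shift_add_mul(-12, -7)` and comparing the result with `-12 * -7` is one way to confirm the loop before trusting the generated assembly.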
After checking the correctness of the C code, I used the compiler to generate assembly code to serve as my golden reference.
```bash
riscv32-unknown-elf-gcc -S mul.c -o mul_asm
```
```cpp
.file "mul2.c"
.option nopic
.text
.align 2
.globl main
.type main, @function
main:
addi sp,sp,-32
sw s0,28(sp)
addi s0,sp,32
li a5,-22
sw a5,-28(s0)
li a5,-5
sw a5,-32(s0)
sw zero,-20(s0)
sw zero,-24(s0)
j .L2
.L4:
lw a4,-32(s0)
lw a5,-24(s0)
sra a5,a4,a5
andi a5,a5,1
beqz a5,.L3
lw a4,-28(s0)
lw a5,-24(s0)
sll a5,a4,a5
lw a4,-20(s0)
add a5,a4,a5
sw a5,-20(s0)
.L3:
lw a5,-24(s0)
addi a5,a5,1
sw a5,-24(s0)
.L2:
lw a4,-24(s0)
li a5,31
ble a4,a5,.L4
li a5,0
mv a0,a5
lw s0,28(sp)
addi sp,sp,32
jr ra
.size main, .-main
.ident "GCC: (GNU) 7.2.0"
```
## RISC-V assembly program (RV32I ISA)
This is my assembly program. It has five parts.
1. main
Initialize values, store some data to memory, and jump to L2.
2. L2
The for-loop test: if i < 32, jump to L3; otherwise print the result and exit.
3. L3
Shift the multiplier right by i bits. If the least significant bit is one, jump to L4; otherwise increment i and jump back to L2.
4. L4
Shift the multiplicand left by i bits and add it to the result, then increment i and jump back to L2. After 32 iterations, the result is stored in a3.
5. printResult
[Environment calls](https://github.com/mortbopet/Ripes/wiki/Environment-calls) are defined by the Ripes simulator. To print something, we need to follow its rules: the a0 and a1 registers are used to invoke system calls.
```cpp
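.data
argument:  .word -12            # multiplicand
argument2: .word -7             # multiplier
str1:      .string "result is "
.text
main:
    addi sp,sp,-32              # allocate a 32-byte stack frame
    sw   s0,28(sp)              # save the old frame pointer
    addi s0,sp,32               # s0 = frame pointer
    lw   a5,argument
    sw   a5,-28(s0)             # multiplicand at -28(s0)
    lw   a5,argument2
    sw   a5,-32(s0)             # multiplier at -32(s0)
    li   a3,0                   # a3 = result
    sw   zero,-20(s0)
    sw   zero,-24(s0)
    li   a2,1                   # constant 1 for the bit test
    li   a4,0                   # a4 = i
    li   a5,32                  # loop bound: blt below runs the loop while i < 32
    j    .L2
.L2:
    blt  a4,a5,.L3              # if i < 32, process bit i
    li   a0,10
    jal  ra,printResult         # loop finished: print the result
    li   a0,10                  # environment call 10: exit
    ecall
.L3:
    lw   a6,-28(s0)             # a6 = multiplicand
    lw   a7,-32(s0)             # a7 = multiplier
    srl  a7,a7,a4               # move bit i of the multiplier to bit 0
    andi a7,a7,1
    beq  a2,a7,.L4              # bit i is set: add the shifted multiplicand
    addi a4,a4,1                # i++
    j    .L2
.L4:
    sll  a6,a6,a4               # a6 = multiplicand << i
    add  a3,a3,a6               # result += a6
    addi a4,a4,1                # i++
    j    .L2
printResult:
    mv   t0, a0
    mv   t1, a3                 # keep the result while a0/a1 are reused
    la   a1, str1
    li   a0, 4                  # environment call 4: print the string at a1
    ecall
    mv   a1, t1
    li   a0, 1                  # environment call 1: print the integer in a1
    ecall
    ret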
```
## Ripes simulator
### How an instruction works in a pipelined CPU
First, I need to introduce how an instruction runs in the Ripes simulator. Pick an instruction at the fetch stage, and we will follow its progress to the end.
#### Fetch
In the beginning, the PC is 0, which means the instruction address is 0, so the CPU goes to instruction memory and finds instruction 0xfe010113.

Looking at the memory, we can confirm that address 0 holds 0xfe010113, so the correct instruction is fetched.

#### Decode
After the decoder decodes the instruction, we know that it needs to read register x2. Because it is an addi instruction, the decode stage gets one register value and an immediate value of 0xffffffe0 (-32).
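
The decode step can be reproduced by hand with a few shifts and masks. The sketch below follows the RV32I I-type field layout; the type `itype_fields` and the function `decode_itype` are hypothetical names used only for illustration, not part of Ripes.

```c
#include <stdint.h>

/* Fields of an RV32I I-type instruction (hypothetical helper type). */
typedef struct {
    uint32_t opcode;  /* bits  6:0  */
    uint32_t rd;      /* bits 11:7  */
    uint32_t funct3;  /* bits 14:12 */
    uint32_t rs1;     /* bits 19:15 */
    int32_t  imm;     /* bits 31:20, sign-extended */
} itype_fields;

itype_fields decode_itype(uint32_t insn)
{
    itype_fields f;
    f.opcode = insn & 0x7f;
    f.rd     = (insn >> 7) & 0x1f;
    f.funct3 = (insn >> 12) & 0x7;
    f.rs1    = (insn >> 15) & 0x1f;

    int32_t imm = (int32_t)(insn >> 20);  /* raw 12-bit value, 0..4095 */
    if (imm & 0x800)                      /* bit 11 is the sign bit */
        imm -= 0x1000;
    f.imm = imm;
    return f;
}
```

For 0xfe010113 this gives opcode 0x13, rd = 2, rs1 = 2 and imm = -32, which matches addi x2, x2, -32.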

#### Execute
Register x2 has the initial value 0x7ffffff0, so the ALU calculates 0x7ffffff0 + 0xffffffe0 (-32) = 0x7fffffd0. In the meantime, sw x8 28(x2) is in the decode stage. It reads two registers and gets one immediate value for the sw instruction.
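
The ALU arithmetic wraps modulo 2^32, which is why adding 0xffffffe0 behaves like subtracting 32. A minimal sketch of the operations this program relies on is shown below; `alu_op` and `alu` are hypothetical names and this is not how Ripes itself is written.

```c
#include <stdint.h>

/* The ALU operations used by the multiplier program (hypothetical model). */
typedef enum { ALU_ADD, ALU_SLL, ALU_SRL, ALU_SRA, ALU_AND } alu_op;

uint32_t alu(alu_op op, uint32_t a, uint32_t b)
{
    uint32_t sh = b & 0x1f;  /* RV32I shifts use only the low 5 bits */
    switch (op) {
    case ALU_ADD: return a + b;      /* wraps modulo 2^32 */
    case ALU_SLL: return a << sh;
    case ALU_SRL: return a >> sh;    /* logical: zero fill */
    case ALU_SRA:                    /* arithmetic: sign fill */
        return (a & 0x80000000u) ? ~(~a >> sh) : a >> sh;
    case ALU_AND: return a & b;
    }
    return 0;
}
```

With this model, `alu(ALU_ADD, 0x7ffffff0, 0xffffffe0)` yields 0x7fffffd0, the value that addi later writes back to x2.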

#### Mem
The addi instruction does not need to read or write memory. It just passes through the MEM stage.

The sw instruction, however, writes data to memory. In this case it writes the value of register x8 to address x2 + 28.

#### WB
In the write-back stage we can see that addi writes 0x7fffffd0 to register x2. While addi is in the WB stage, x2 still holds 0x7ffffff0. In the next cycle x2 is updated to 0x7fffffd0, as we can observe in the table below.




### Issues in a pipelined CPU
In a pipelined CPU, we need to focus on dependency issues, control instructions, register updates, and memory updates.
1. dependency
A dependency issue arises when the pipelined CPU encounters a RAW (read-after-write) hazard. The CPU needs forwarding, or performance will be very low. Because this CPU issues and executes in order, we don't need to care about other dependencies such as WAR and WAW.
- example 1 (R-type,R-type)
In this case, srl updates register x17, while the following andi needs x17 as its source register. Because the CPU has a forwarding mechanism, andi can read the latest x17 value without inserting a bubble.

- example 2 (lw,R-type)
In this case a srl instruction follows a lw instruction. srl needs the latest value of register x17.

In this cycle, we can see a stall in front of the srl instruction. Why is this stall needed? Because lw has not read memory yet in this cycle, so the new x17 value does not exist and cannot be forwarded.

In this cycle srl moves to the EXE stage. lw is in the WB stage and already holds the latest x17 data, so the forwarding mechanism works.

- example 3 (control instruction)
First, we can see the beq instruction beq x12,x17,12 in the decode stage. Meanwhile, andi x17,x17,1 is in the EXE stage. Because beq needs the latest value of register x17, it has to wait for two cycles. When andi reaches the WB stage, beq can get the latest data and evaluate the comparison to decide whether to branch. If the branch is taken, the fetch PC changes to the new target.




2. control instruction
Control instructions such as j, jal, jr, and jalr are checked in the decode stage. When a branch is taken, the decode stage calculates the branch target PC for the fetch stage.
- example
In this cycle beq can start its calculation. If the branch is taken, the PC becomes PC+12; otherwise it becomes PC+4.

In the next cycle the decode stage is changed to a nop because the branch was taken. The PC is now PC + 12, so the addi instruction is flushed and the sll instruction is fetched.

3. register updates
The example above already shows how a register is updated in the CPU.
4. memory updates
- example
sw is in the MEM stage. In this cycle we can see that register x8 holds 0x7ffffff0 and register x15 holds 0xfffffb50, so we can predict that in the next cycle address 0x7fffffd4 (0x7ffffff0 - 0x1c) will hold 0xfffffb50.


In the next cycle sw is in the WB stage. We can see that address 0x7fffffd4 holds 0xfffffb50. The sw instruction has finished its job.
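
The hazard rules in this list can be summarized in a few lines of C. This is only a sketch of the textbook five-stage scheme, assuming forwarding from the MEM and WB stages into EXE. It is not taken from the Ripes source, and every name in it (`fwd_src`, `forward_select`, `load_use_stall`, `next_pc`) is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Where an EXE-stage operand comes from (hypothetical names). */
typedef enum { FWD_NONE, FWD_FROM_MEM, FWD_FROM_WB } fwd_src;

/* RAW handling: pick the newest in-flight value for source register rs. */
fwd_src forward_select(uint32_t rs,
                       bool mem_regwrite, uint32_t mem_rd,
                       bool wb_regwrite,  uint32_t wb_rd)
{
    if (rs == 0)
        return FWD_NONE;                 /* x0 is hard-wired to zero */
    if (mem_regwrite && mem_rd == rs)
        return FWD_FROM_MEM;             /* younger producer wins */
    if (wb_regwrite && wb_rd == rs)
        return FWD_FROM_WB;
    return FWD_NONE;                     /* read the register file */
}

/* Load-use hazard: a load in EXE feeds the instruction in decode,
 * so the consumer must wait until the loaded value exists. */
bool load_use_stall(bool ex_is_load, uint32_t ex_rd,
                    uint32_t id_rs1, uint32_t id_rs2)
{
    return ex_is_load && ex_rd != 0 &&
           (ex_rd == id_rs1 || ex_rd == id_rs2);
}

/* Branch resolution: target if taken, otherwise the next instruction. */
uint32_t next_pc(uint32_t branch_pc, int32_t offset, bool taken)
{
    return taken ? branch_pc + (uint32_t)offset : branch_pc + 4;
}
```

In terms of the examples above, srl followed by andi on x17 corresponds to `FWD_FROM_MEM`, lw followed by srl corresponds to one `load_use_stall` cycle and then `FWD_FROM_WB`, and beq x12,x17,12 corresponds to `next_pc` with an offset of 12.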



