owned this note
owned this note
Published
Linked with GitHub
# Assignment1 RV32I Simulator
###### tags: `Computer Architecture`
**Table of Contents**
[TOC]
About convolution
---
In mathematics, convolution is a mathematical operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other.
It is also an important concept in the field of discrete-time signal processing. An example of LTI system is that an output signal y[n] can be calculated as a input signal x[n] convolve with an impulse response of the system h[n] :

And it can also be abbreviated as follows :

In this assignment, I'm going to implement a function of convolution sum which can convolve two finite-length integer sequences. To simplify the function, I restrict two integer sequences to be equal length.

Where
A[], B[] : input integer array
C[] : output integer array
N+1 : length of A[], B[], C[]
C code implementation
---
There are two loops and one conditional branch(if-else) in this implementation.
```clike=
#include<stdio.h>
int main()
{
int A[5] = {10,12,31,17,25};
int B[5] = {3,8,78,91,71};
int N = 4;
int C[5];
for(int i=0; i<N+1; i++)
{
for(int j=0; j<i+1; j++){
if(!j){
C[i] = A[j]*B[i-j];
}else{
C[i] = C[i]+(A[j]*B[i-j]);
}
}
printf("C[%d] = %d\n", i, C[i]);
}
return 0;
}
```
Assembly code implementation
---
I convert C code to assembly code manually. The assembly code contains four functional parts.
**1. Store arrays to memory**
Initialize arrays and variable.
**2. Loops**
Indices i and j decide which values in arrays are going to participate the calculation.
**3. else and then**
j is 0 means that the multiply operation only has one term, so execute "C[i] = A[0]*B[i]". Otherwise, execute "C[i] = C[i]+A[0]*B[i]" to accumlate multiply operation to the result.
**4. printResult**
Environment calls are defined by Ripes simulator. Put the correspond value into a0 and a7 registers to activate particular system calls you want.
```clike=
.data
str1: .string "C["
str2: .string "]="
str3: .string "\n"
# s0:address of A[]
# s1:address of B[]
# s2:address of C[]
# t0:i
# t1:j
# t2:terminate value of loopi
# t3:terminate value of loopj
.text
main:
add s0, zero, zero # initialize the base address of A[]
addi s1, s0, 20 # set the base address of B[]
addi s2, s1, 20 # set the base address of C[]
li t0, 10 # A[] = {10,12,31,17,25}
sw t0, 0(s0)
li t0, 12
sw t0, 4(s0)
li t0, 31
sw t0, 8(s0)
li t0, 17
sw t0, 12(s0)
li t0, 25
sw t0, 16(s0)
li t0, 3 # B[] = {3,8,78,91,71}
sw t0, 0(s1)
li t0, 8
sw t0, 4(s1)
li t0, 78
sw t0, 8(s1)
li t0, 91
sw t0, 12(s1)
li t0, 71
sw t0, 16(s1)
addi t0, zero, 0 # i=0
li t2, 4
addi t2, t2, 1 # t2=N+1
loopi:
bge t0, t2, exit # if i>=N+1, then exit loop i
addi t1, zero, 0 # j=0
loopj:
addi t3, t0, 1 # t3=i+1
bge t1, t3, endi # if j>=i+1, then exit loop j
beq t1, zero, then # if j==0,then C[i] = A[j]*B[i-j]
else:
slli t4, t1, 2 # t4=j*4
slli t5, t0, 2 # t5=i*4
sub t6, t5, t4 # t6=(i-j)*4
add s3, s2, t5 # s3=address of C[i]
add s4, s1, t6 # s4=address of B[i-j]
add s5, s0, t4 # s5=address of A[j]
lw s6, 0(s5) # s6=A[i]
lw s7, 0(s4) # s7=B[i-j]
lw s9, 0(s3) # s9=C[i]
mul t4, s6, s7 # t4=A[i]*B[i-j]
add s9, s9, t4 # t5=C[i]+A[i]*B[i-j]
sw s9, 0(s3) # C[i]=C[i]+A[i]*B[i-j]
j endj # end of j loop
then:
slli t4, t1, 2 # t4=j*4
slli t5, t0, 2 # t5=i*4
sub t6, t5, t4 # t6=(i-j)*4
add s3, s2, t5 # s3=address of C[i]
add s4, s1, t6 # s4=address of B[i-j]
add s5, s0, t4 # s5=address of A[j]
lw s6, 0(s5) # s6=A[i]
lw s7, 0(s4) # s7=B[i-j]
mul t4, s6, s7 # t4=A[i]*B[i-j]
sw t4, 0(s3) # C[i]=A[i]*B[i-j]
endj:
addi t1, t1, 1 # j = j+1
j loopj
endi:
la a0, str1 # print string 'C['
li a7, 4
ecall
mv a0, t0 # print integer i
li a7, 1
ecall
la a0, str2 # print string ']='
li a7, 4
ecall
lw s8, 0(s3) # print integer C[i]
mv a0, s8
li a7, 1
ecall
la a0, str3 # print string '\n'
li a7, 4
ecall
addi t0, t0, 1 # i = i+1
j loopi
exit:
li a7, 10 # Halts the simulator
ecall
```
Results
---
Using Visual studio to execute the C code, the convolution sum of arrays A[] = {10,12,31,17,25} and B[] = {3,8,78,91,71} is array C[]. Each element in C[] is as follows:

Result of executing assembly code on Ripes is same as C code!:smile:

How RSIC-V assembly code works on Ripes
---
- **Store elements of array to memory**
For example, the first integer to store to memory is 10. So the instruction "sw x5 0(s0)" store value 10 to the A[0] (register s0 stores the base address of A[]).

***
When instruction "sw x5 0(s0)" is at memory access stage, memory write will be enable by setting signal line "Wr en" to 1 (as the green dot in data memory below).

***
And the value of the first block of memory is 0x0000000a, which is 10 in decimal representation.

***
After setup the A[] and B[] in memory, the memory looks like picture below, which A[] = {10,12,31,17,25} and B[] = {3,8,78,91,71}

***
- **Loops**
In the first i loop, when instruction "bge x5, x7, exit" is at the execution(EX) stage, the decision of whether to branch or not is made by ALU.

***
And the Ripes simulator automatically flushs the previous two stages to avoid branch harzard.

***
In next cycle, since value of register x5 is smaller than x7, branch will not be taken in this case, so the next instruction "bge x6, x28, endi" will come into pipeline.


***
- **else and then**
When the instruction "beq x6, x0, then" is at the execution stage(EX), the if-else branch decision in the loop j is made.

***
In this case, register x6 is equal to zero. Therefore, The program counter(PC) will be set to the address of the label "then" at the same time(as the green dot in the PC block).

***
In the next cycle, the next insruction will be fetched is "slli x29, x6, 2". That means when "if" condition is true, "then" condition is going to execute "slli x29, x6, 2" and its following instructions.

***
In the previous cycle, notice that there were two instructions already at the instrcution fetch(IF) and instruction decode(ID) stage respectively. They are "slli x29, x6, 2" and "slli x30, x5, 2". Now they must be flushed since the branch is taken.

***
- **printResult**
The instruction "ecall" will evoke OS to do the I/O stuff. In Ripes, you put the corresponding value into register a0 and a7 before "ecall" can help. In my code, I tried to print out integers and strings to the console.

***
x86-64 vs RISC-V calling convention
---
Here is an example reference to CS:App 3th edition:
```clike=
//This is an example from CS:App 3th edition Fig 3.29
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
call_proc();
return 0;
}
long call_proc(){
long x1 = 1; int x2 = 2;
short x3 = 3; char x4 = 4;
proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
return (x1+x2)*(x3-x4);
}
void proc(long a1, long *a1p,
int a2, int *a2p,
short a3, short *a3p,
char a4, char *a4p)
{
*a1p += a1;
*a2p += a2;
*a3p += a3;
*a4p += a4;
}
```
### x86-64 calling convention
We could generate an executable program as follows:
```clike=
$ gcc -Og -o prog proc_call.c
```
We can disassemble the file prog:
```clike=
$ objdump -d prog
```
The disassembler will extract various code sequences, including the following:
```clike=
Disassembly of section .text:
0000000000001169 <proc>:
1169: f3 0f 1e fa endbr64
116d: 48 8b 44 24 10 mov 0x10(%rsp),%rax ## Load &x4 from stack to register %rax
1172: 48 01 3e add %rdi,(%rsi) ## Perform *x1p += x1;
1175: 01 11 add %edx,(%rcx) ## Perform *x2p += x2;
1177: 66 45 01 01 add %r8w,(%r9) ## Perform *x3p += x3;
117b: 8b 54 24 08 mov 0x8(%rsp),%edx ## Load x4 from stack to register %edx
117f: 00 10 add %dl,(%rax) ## Perform *x4p += x4;
1181: c3 retq
0000000000001182 <call_proc>:
1182: f3 0f 1e fa endbr64
1186: 53 push %rbx
1187: 48 83 ec 20 sub $0x20,%rsp ## Allocate 20 bytes stack frame
118b: bb 28 00 00 00 mov $0x28,%ebx
1190: 64 48 8b 03 mov %fs:(%rbx),%rax ## Rerieve canary
1194: 48 89 44 24 18 mov %rax,0x18(%rsp) ## Store on stack
1199: 31 c0 xor %eax,%eax ## Zero out register
119b: 48 c7 44 24 10 01 00 movq $0x1,0x10(%rsp) ## Save long x1 to stack for getting &x1
11a2: 00 00
11a4: c7 44 24 0c 02 00 00 movl $0x2,0xc(%rsp) ## Save int x2 to stack for getting &x2
11ab: 00
11ac: 66 c7 44 24 0a 03 00 movw $0x3,0xa(%rsp) ## Save short x3 to stack for getting &x3
11b3: c6 44 24 09 04 movb $0x4,0x9(%rsp) ## Save char x4 to stack for getting &x4
11b8: 48 8d 4c 24 0c lea 0xc(%rsp),%rcx ## Let register %rcx hold address &x2
11bd: 48 8d 74 24 10 lea 0x10(%rsp),%rsi ## Let register %rsi hold address &x1
11c2: 48 8d 44 24 09 lea 0x9(%rsp),%rax ## Let register %rax hold address &x4
11c7: 50 push %rax ## Spilling register argument &x4 need to be store on stack
11c8: 6a 04 pushq $0x4 ## Spilling register argument x4 need to be store on stack
11ca: 4c 8d 4c 24 1a lea 0x1a(%rsp),%r9 ## Let register %r9 hold address &x4
11cf: 41 b8 03 00 00 00 mov $0x3,%r8d ## Pass 3 as argument
11d5: ba 02 00 00 00 mov $0x2,%edx ## Pass 2 as argument
11da: bf 01 00 00 00 mov $0x1,%edi ## Pass 1 as argument
11df: e8 85 ff ff ff callq 1169 <proc> ## Call procedure proc()
11e4: 48 63 4c 24 1c movslq 0x1c(%rsp),%rcx
11e9: 48 03 4c 24 20 add 0x20(%rsp),%rcx
11ee: 0f bf 54 24 1a movswl 0x1a(%rsp),%edx
11f3: 0f be 44 24 19 movsbl 0x19(%rsp),%eax
11f8: 29 c2 sub %eax,%edx
11fa: 48 63 c2 movslq %edx,%rax
11fd: 48 0f af c1 imul %rcx,%rax
1201: 48 83 c4 10 add $0x10,%rsp
1205: 48 8b 7c 24 18 mov 0x18(%rsp),%rdi
120a: 64 48 33 3b xor %fs:(%rbx),%rdi
120e: 75 06 jne 1216 <call_proc+0x94>
1210: 48 83 c4 20 add $0x20,%rsp
1214: 5b pop %rbx
1215: c3 retq
1216: e8 45 fe ff ff callq 1060 <__stack_chk_fail@plt>
000000000000121b <main>:
121b: f3 0f 1e fa endbr64
121f: 48 83 ec 08 sub $0x8,%rsp
1223: b8 00 00 00 00 mov $0x0,%eax
1228: e8 55 ff ff ff callq 1182 <call_proc>
122d: 48 89 c2 mov %rax,%rdx
1230: 48 8d 35 cd 0d 00 00 lea 0xdcd(%rip),%rsi # 2004 <_IO_stdin_used+0x4>
1237: bf 01 00 00 00 mov $0x1,%edi
123c: b8 00 00 00 00 mov $0x0,%eax
1241: e8 2a fe ff ff callq 1070 <__printf_chk@plt>
1246: b8 00 00 00 00 mov $0x0,%eax
124b: 48 83 c4 08 add $0x8,%rsp
124f: c3 retq
```
### RISC-V calling convention
Since we are not using standard library for this case, the program should be modified as follow:
```clike=
// Modified to fit RISC-V compiler
// Some operators are changed for the purpose of convenience
long call_proc();
void proc( long, long *,
int, int*,
short, short*,
char, char *);
void _start()
{
volatile char* tx = (volatile char*) 0x40002000;
const char *str1 = "The result is:";
long *p, result = call_proc();
//printf("The result is: %ld\n", result);
p = &result;
*tx = *p + '0';
}
long call_proc(){
long x1 = 1; int x2 = 2;
short x3 = 3; char x4 = 4;
proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
return (x1+x2)+(x3-x4);
}
void proc(long a1, long *a1p,
int a2, int *a2p,
short a3, short *a3p,
char a4, char *a4p)
{
*a1p += a1;
*a2p += a2;
*a3p += a3;
*a4p += a4;
}
```
We could generate an executable program as follows:
```
$ riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -Og -nostdlib proc_call.c -o proc_call
```
We can disassemble the file prog:
```
$ riscv-none-embed-objdump -d proc_call
```
The disassembler will extract various code sequences, including the following:
```clike=
00010074 <proc>:
10074: 0005a303 lw t1,0(a1)
10078: 00a30333 add t1,t1,a0
1007c: 0065a023 sw t1,0(a1) ## *a1p += a1;
10080: 0006a583 lw a1,0(a3)
10084: 00c585b3 add a1,a1,a2
10088: 00b6a023 sw a1,0(a3) ## *a2p += a2;
1008c: 0007d683 lhu a3,0(a5)
10090: 00d70733 add a4,a4,a3
10094: 00e79023 sh a4,0(a5) ## *a3p += a3;
10098: 0008c783 lbu a5,0(a7)
1009c: 00f80833 add a6,a6,a5
100a0: 01088023 sb a6,0(a7) ## *a4p += a4;
100a4: 00008067 ret
000100a8 <call_proc>:
100a8: fe010113 addi sp,sp,-32 ## Allocate 32 byte stack frame
100ac: 00112e23 sw ra,28(sp) ## Store return address
100b0: 00100793 li a5,1
100b4: 00f12623 sw a5,12(sp) ## Immediates must be put on stack first
100b8: 00200793 li a5,2
100bc: 00f12423 sw a5,8(sp)
100c0: 00300793 li a5,3
100c4: 00f11323 sh a5,6(sp)
100c8: 00400793 li a5,4
100cc: 00f102a3 sb a5,5(sp)
100d0: 00510893 addi a7,sp,5 ## Load the address of immediates &x1, &x2, &x3, &x4 to registers a7, a5, a3, a1 respectively
100d4: 00400813 li a6,4 ## Load the immediate to registers a0, a2, a4, a6
100d8: 00610793 addi a5,sp,6
100dc: 00300713 li a4,3
100e0: 00810693 addi a3,sp,8
100e4: 00200613 li a2,2
100e8: 00c10593 addi a1,sp,12
100ec: 00100513 li a0,1
100f0: f85ff0ef jal ra,10074 <proc> ## Call proc()
100f4: 00c12503 lw a0,12(sp)
100f8: 00812783 lw a5,8(sp)
100fc: 00f50533 add a0,a0,a5
10100: 00611783 lh a5,6(sp)
10104: 00514703 lbu a4,5(sp)
10108: 40e787b3 sub a5,a5,a4
1010c: 00f50533 add a0,a0,a5
10110: 01c12083 lw ra,28(sp)
10114: 02010113 addi sp,sp,32
10118: 00008067 ret
0001011c <_start>:
1011c: ff010113 addi sp,sp,-16
10120: 00112623 sw ra,12(sp)
10124: f85ff0ef jal ra,100a8 <call_proc>
10128: 03050513 addi a0,a0,48
1012c: 0ff57513 zext.b a0,a0
10130: 400027b7 lui a5,0x40002
10134: 00a78023 sb a0,0(a5) # 40002000 <__global_pointer$+0x3fff06b0>
10138: 00c12083 lw ra,12(sp)
1013c: 01010113 addi sp,sp,16
10140: 00008067 ret
```
In terms of RISC-V assembly, we can see that there aren't any spilling register in this example. The arguments can be put in a0-a7 and then call procedure `proc()`.
But the number of lines in RISC-V assembly is more than that of x86-64. One of the reason is that the arithemtic and logical opertations in RISC-V only use register operands but not memory operands.