Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < kaeteyaruyo >

Dot product

In Algebraic definition, The dot product of two vectors

a = [a_{1}, a_{2}, \dots, a_{n}]

and

b = [b_{1}, b_{2}, \dots, b_{n}]

is defined as:

a \cdot b = \sum_{i = 1}^{n} a_{i} b_{i} = a_{1} b_{1} + a_{2} b_{2} + \dots + a_{n} b_{n}

As in implementation, we store two vectors in two arrays that have same length, travel through them and multiply the same position elements, and add the result into summation, finally we get the dot product.

Implementation

You can find the source code here. Feel free to fork and modify it.

C code

I use arrays of length = 3 to illustrate the dot product process. You can change the length or element inside them to see if the result is still correct.














#include <stdio.h>

int main()
{
    int a[3] = {1, 2, 3};
    int b[3] = {4, 5, 6};
    int sum = 0;
    for (int i = 0; i < 3; ++i)
    {
        sum += a[i] * b[i];
    }
    printf("The inner product of two vectors is %d", sum);
    return 0;
}

Assembly code

In this example, I don't handle addition and multiplication overflow. So please ensure that the result of mul at line 33 and add at line 34 won't exceed the range of valid value (

- 2^{31}

2^{31} - 1














































.data
arr1: .word 1, 2, 3 # a[3] = {2, 4, 6}
arr2: .word 4, 5, 6 # b[3] = {8, 10, 12}
len:  .word 3       # array length = 3
str:  .string "The inner product of two vectors is "

.text
# s1 = arr1 base address
# s2 = arr2 base address
# s3 = array length
# s4 = sum
# t0 = i
# t1 = a[i]
# t2 = b[i]
# t3 = a[i] * b[i] (assume no overflow, lower 32 bits)

main:
    la  s1, arr1      # s1 = a
    la  s2, arr2      # s2 = b
    lw  s3, len       # s3 = 3
    add s4, x0, x0    # sum = 0
    add t0, x0, x0    # i = 0
    jal ra, loop      # jump to for loop
    jal ra, print     # jump to for loop
    li  a7, 10        # end program
    ecall

loop:
    lw   t1, 0(s1)    # t1 = a[i]
    addi s1, s1, 4    # ++a (address move forward)
    lw   t2, 0(s2)    # t2 = b[i]
    addi s2, s2, 4    # ++b
    mul  t3, t1, t2   # t3 = a[i] * b[i]
    add  s4, s4, t3   # sum += a[i] * b[i]
    addi t0, t0, 1    # ++i
    blt  t0, s3, loop # if i < 3, go to loop
    ret               # else, return to main

print:
    la  a0, str       # load string
    li  a7, 4         # print string
    ecall
    add a0, s4, x0    # load result
    li  a7, 1         # print integer
    ecall
    ret               # go back to main

Analysis

We test our code using Ripes simulator.

Pseudo instruction

Put code above into editor and we will see that Ripe doesn't execute it literally. Instead, it replace pseudo instruction (p.110) into equivalent one, and change register name from ABI name to sequencial one.
The translated code looks like:

00000000 <main>:
    0:  10000497  auipc x9 0x10000
    4:  00048493  addi  x9 x9 0
    8:  10000917  auipc x18 0x10000
    c:  00490913  addi  x18 x18 4
    10: 10000997  auipc x19 0x10000
    14: 0089a983  lw    x19 8(x19)
    18: 00000a33  add   x20 x0 x0
    1c: 000002b3  add   x5 x0 x0
    20: 010000ef  jal   x1 0x30 <loop>
    24: 030000ef  jal   x1 0x54 <print>
    28: 00a00893  addi  x17 x0 10
    2c: 00000073  ecall

00000030 <loop>:
    30: 0004a303  lw   x6 0(x9)
    34: 00448493  addi x9 x9 4
    38: 00092383  lw   x7 0(x18)
    3c: 00490913  addi x18 x18 4
    40: 02730e33  mul  x28 x6 x7
    44: 01ca0a33  add  x20 x20 x28
    48: 00128293  addi x5 x5 1
    4c: ff32c2e3  blt  x5 x19 -28 <loop>
    50: 00008067  jalr x0 x1 0

00000054 <print>:
    54: 10000517  auipc x10 0x10000
    58: fc850513  addi  x10 x10 -56
    5c: 00400893  addi  x17 x0 4
    60: 00000073  ecall
    64: 000a0533  add   x10 x20 x0
    68: 00100893  addi  x17 x0 1
    6c: 00000073  ecall
    70: 00008067  jalr  x0 x1 0

In each row it denotes address in instruction memory, instruction's machine code (in hex) and instruction itself respectively.

5-stage pipelined processor

Now we can choose a processor to run this code. Ripes provide four kinds of processor for us to choose, including single cycle processor, 5-stage processor, 5-stage with hazard detection and 5-stage with forward and hazard detection. Here we choose the 5 stage processor. Its block diagram look like this:

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are:

Instruction fetch (IF)
Instruction decode and register fetch (ID)
Execute (EX)
Memory access (MEM)
Register write back (WB)

You can see that each stage is separated by pipeline registers (the rectangle block with stage names on its each side) in the block diagram.

Instruction in different type of format will go through 5 stages with different signal turned on. Let's discuss each format in detail with example.

U-type format

The first instruction in this program is auipc x9 0x10000. According to RISC-V Manual (p.14):

AUIPC (add upper immediate to pc) is used to build pc-relative addresses and uses the U-type format. AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the pc, then places the result in register rd.

Let's see how it go through each stage.

Instruction fetch (IF)

We start from instruction put at 0x00000000, so addr is equal to 0x00000000
The machine code of first instruction is 0x10000497 (you can look it up in the code snippet above), so instr is equal to 0x10000497.
PC will increment by 4 automatically using the above adder.
Because there is no branch occur, next instruction will be at PC + 4, so the multiplexer before PC choose input come from adder.

Instruction decode and register fetch (ID)

Instruction 0x10000497 is decoded to three part:
- opcode = auipc
- Wr idx = 0x09
- imm. = 0x10000000 (0x10000 in upper 20 bits, filling in the lowest 12 bits with zeros)
U-type format doesn't read register, so R1 idx and R2 idx are both 0x00.
Reg 1 and Reg 2 read value from 0x00 register, so their value are both 0x00000000 too.
Current PC value (0x00000000) and next PC value (0x00000004) are just send through this stage, we don't use them.

Execute (EX)

First level multiplexers choose value come from Reg 1 and Reg 2, but this is an U-type format instruction, we don't use them. So they are filtered by second level multiplexer.
Reg 1 and Reg 2 are also send to branch block, but no branch is taken.
Second level multiplexer choose value come from current PC value (upper one) and immediate (lower one) as Op1 and Op2 of ALU.
ALU add two operand togeher, so the Res is equal to 0x00000000.
Next PC value (0x00000004) and Wr idx (0x09) are just send through this stage, we don't use them.

Memory access (MEM)

Res from ALU is send to 3 ways:
- Pass through this stage and go to WB stage
- Send back to EXE stage for next instruction to use
- Use as data memory address. Memory read data at address 0x10000000, so Read out is equal to 0x00000001. The table below denotes the data section of memory.
Reg 2 is send to Data in, but memory doesn't enable writing.
Next PC value (0x00000004) and Wr idx (0x09) are just send through this stage, we don't use them.

The multiplexer choose Res from ALU as final output. So the output value is 0x10000000.
The output value and Wr idx are send back to registers block. Finally, the value 0x10000000 will be write into x9 register, whose ABI name is s1.

After all these stage are done, the register is updated like this:

R-type format

Explain your program with the visualization for multiplexer input selection, register write/enable signals and more. You have to illustrate each stage such as IF, ID, IE, MEM, and WB. In addition, you should discuss the steps of memory updates accordingly.