Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by <5k6m4>

Experimental Environment

The experimental environment is a 5-stage in-order pipeline processor with hazard detection and data forwarding. The pipeline consists of five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory (MEM), and write back (WB).

The following provides a brief introduction to each stage using an example assembly code and shows the functionality using the Ripes simulator.

Example Assembly Code

.data
    test_data: .word 0x0000ffff

.text
main:
    # demonstrate memory operation and data hazard
    la t0, test_data        # load the address of test_data
    lw t1, 0(t0)            # t1 = 0x0000ffff
    addi t1, x0, 0          # t1 = 0
    addi t2, t1, 2          # t2 = 2
    sw t1, 0(t0)            # mem[t0] = 0
    lw t2, 0(t0)            # t2 = 0

if:
    # demonstrate control hazard
    beq t1, x0, exit
    addi t2, x0, 2
    addi t3, x0, 3
    addi t4, x0, 4

exit:
    li a7, 10               # system call code for exiting the program
    ecall                   # make the exit system call

Example Pipeline State

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Instruction Fetch (IF)

The processor loads an instruction using the value of the program counter (PC) as the address.
In the example pipeline state, the processor load the instruction sw x6, 0, x5.

Instruction Decode (ID)

The instruction is decoded to obtain the source register index and the immediate value. Next, the register file is accessed to retrieve the value of the source register.
The instruction in the ID stage is addi x7, x6, 2. Therefore, one operand is set to the value of register x6, and the other is the immediate value 2.

Execute (EX)

The ALU performs the operation specified by the instruction in this stage.
In the example where the instruction is addi x6, x0, 0, the ALU adds the value in register x0 to the immediate value 0.

Memory (MEM)

If the instruction in this stage is a load/store instruction, the processor accesses the memory using the address stored in the source register.
When the load instruction lw x6, 0(x5) is in this stage, the CPU requests to read the value stored at the address contained in register x5.

Write Back (WB)

The value computed from the instruction of access from the memory will be written back to the destination register in this stage.
In the example pipeline stage, the instruction in the WB stage is addi x5, x5, 0. Therefore, the processor writes the value computed from the EX stage to register x5.

Quiz1 ProblemB

Choose one problem (A, B, or C) from Quiz1, translate it from C code to a complete RISC-V assembly program, and include the relevant test data.

C Program

#include <stdint.h>

typedef struct {
    uint16_t bits;
} bf16_t;

static inline float bf16_to_fp32(bf16_t h) {
    union {
        float f;
        uint32_t i;
    } u = {.i = (uint32_t)h.bits << 16};
    return u.f;
}

static inline bf16_t fp32_to_bf16(float s) {
    bf16_t h;
    union {
        float f;
        uint32_t i;
    } u = {.f = s};
    if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */
        h.bits = (u.i >> 16) | 64; /* force to quiet */
        return h;
    }
    h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
    return h;
}

bf16_to_fp32()

Test Case

	Input in bf16	Expected Output in fp32	Description
Case0	0x3f80	0x3f800000	normal positive of bf16
Case1	0xbf80	0xbf800000	normal negative of bf16
Case2	0x0000	0x00000000	positive zero of bf16
Case3	0x7f80	0x7f800000	positive infinity of bf16
Case4	0x7fc0	0x7fc00000	quiet NaN of bf16
Case5	0x0001	0x00010000	subnormal positive of bf16
Case6	0x8001	0x80010000	subnormal negative of bf16
Case7	0x7f7f	0x7f7f0000	normal maximum of bf16
Case8	0x0080	0x00800000	normal minimum of bf16

Assembly Code

.data
    # test cases pair = {input bf16 value, golden fp32 value}
    test_case:
        .word 0x00003f80, 0x3f800000 # 0: normal positive of bf16
        .word 0x0000bf80, 0xbf800000 # 1: normal negative of bf16
        .word 0x00000000, 0x00000000 # 2: positive zero of bf16
        .word 0x00007f80, 0x7f800000 # 3: positive infinity of bf16
        .word 0x00007fc0, 0x7fc00000 # 4: quiet NaN of bf16
        .word 0x00000001, 0x00010000 # 5: subnormal positive of bf16
        .word 0x00008001, 0x80010000 # 6: subnormal negative of bf16
        .word 0x00007f7f, 0x7f7f0000 # 7: normal maximum of bf16
        .word 0x00000080, 0x00800000 # 8: normal minimum of bf16
    err_str0: .string "TestCase"
    err_str1: .string " of bf16_to_fp32() failed!!\n"
    pass_str: .string "Pass all test cases of bf16_to_fp32()!!\n"

.text
main:
    la s0, test_case            # load the address of test_case to s0

    # Initialize the test error to zero
    addi s1, x0, 0              # s1 = test_err = 0

    # Initialize local variables for the mainLoop
    addi s2, x0, 0              # s2 = i = 0
    addi s3, x0, 9              # s3 = const. = 9

mainLoop:
    bgeu s2, s3, endMainLoop    # when i >= 9, end mainLoop

    # Load the argument that bf16_to_fp32() needs
    lw a0, 0(s0)

    # Function call
    jal ra, bf16_to_fp32        # jump to functioin bf16_to_fp32

    # Compare bf16_to_fp32() result with the golden
    lw s4, 4(s0)                # load golden, s4 = golden value
    beq a0, s4, passTest        # if result == golden, pass test case

    # Update test error
    addi s1, s1, 1              # test_err++

    # Print failed message
    la a0, err_str0             # load the address of err_str0
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str0

    addi a0, s2, 0              # load the number of the test case
    li a7, 1                    # system call code for printing a integer
    ecall                       # print the number of the test case

    la a0, err_str1             # load the address of err_str1
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str1

passTest:
    # Update test case and loop variable
    addi s0, s0, 8              # move s0 to the address of the next test case
    addi s2, s2, 1              # i++

    j mainLoop                  # jal x0, mainLoop

endMainLoop:
    bne s1, x0, exit            # if test_err != 0, do not print pass message

    # Print pass message
    la a0, pass_str             # load the address of pass_str
    li a7, 4                    # system call code for printing a string
    ecall                       # print pass_str16

exit:
    li a7, 10                   # system call code for exiting the program
    ecall                       # make the exit system call

bf16_to_fp32:
    slli a0, a0, 16             # a0 = h.bits << 16
    ret                         # jalr x0, ra, 0

fp32_to_bf16()

Test Case

	Input in fp32	Expected Output in bf16	Description
Case0	0x41200000	0x4120	normal positive of fp32
Case1	0xc1200000	0xc120	normal negative of fp32
Case2	0x00000000	0x0000	positive zero of fp32
Case3	0x7f800000	0x7f80	positive infinity of fp32
Case4	0x7fc00000	0x7fc0	quiet NaN of fp32
Case5	0x00000001	0x0000	subnormal positive of fp32
Case6	0x80000001	0x8000	subnormal negative of fp32
Case7	0x7f7fffff	0x7f80	normal maximum of fp32
Case8	0x00800000	0x0080	normal minimum of fp32

Assembly Code

.data
    # test cases pair = {input fp32 value, golden bf16 value}
    test_case:
        .word 0x41200000, 0x00004120 # 0: normal positive of fp32
        .word 0xc1200000, 0x0000c120 # 1: normal negative of fp32
        .word 0x00000000, 0x00000000 # 2: positive zero of fp32
        .word 0x7f800000, 0x00007f80 # 3: positive infinity of fp32
        .word 0x7fc00000, 0x00007fc0 # 4: quiet NaN of fp32
        .word 0x00000001, 0x00000000 # 5: subnormal positive of fp32
        .word 0x80000001, 0x00008000 # 6: subnormal negative of fp32
        .word 0x7f7fffff, 0x00007f80 # 7: normal maximum of fp32
        .word 0x00800000, 0x00000080 # 8: normal minimum of fp32
    err_str0: .string "TestCase"
    err_str1: .string " of fp32_to_bf16() failed!!\n"
    pass_str: .string "Pass all test cases of fp32_to_bf16()!!\n"

.text
main:
    la s0, test_case            # load the address of test_case to s0

    # Initialize the test error to zero
    addi s1, x0, 0              # s1 = test_err = 0

    # Initialize local variables for the mainLoop
    addi s2, x0, 0              # s2 = i = 0
    addi s3, x0, 9              # s3 = const. = 9

mainLoop:
    bgeu s2, s3, endMainLoop    # when i >= 9, end mainLoop

    # Load the argument that fp32_to_bf16() needs
    lw a0, 0(s0)

    # Function call
    jal ra, fp32_to_bf16        # jump to functioin fp32_to_bf16

    # Compare fp32_to_bf16() result with the golden
    lw s4, 4(s0)                # load golden, s4 = golden value
    beq a0, s4, passTest        # if result == golden, pass test case

    # Update test error
    addi s1, s1, 1              # test_err++

    # Print failed message
    la a0, err_str0             # load the address of err_str0
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str0

    addi a0, s2, 0              # load the number of the test case
    li a7, 1                    # system call code for printing a integer
    ecall                       # print the number of the test case

    la a0, err_str1             # load the address of err_str1
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str1

passTest:
    # Update test case and loop variable
    addi s0, s0, 8              # move s0 to the address of the next test case
    addi s2, s2, 1              # i++

    j mainLoop                  # jal x0, mainLoop

endMainLoop:
    bne s1, x0, exit            # if test_err != 0, do not print pass message

    # Print pass message
    la a0, pass_str             # load the address of pass_str
    li a7, 4                    # system call code for printing a string
    ecall                       # print pass_str16

exit:
    li a7, 10                   # system call code for exiting the program
    ecall                       # make the exit system call

fp32_to_bf16:
    addi t0, a0, 0              # pass argument s, t0 = s

    li t1, 0x7fffffff           # t1 = 0x7fffffff
    and t1, t0, t1              # t1 = u.i & 0x7fffffff
    li t2, 0x7f800000           # t2 = 0x7f800000

if:
    bge t2, t1, endIf           # if 0x7f800000 >= (u.i & 0x7fffffff), skip NaN handling

    # Handle NaN value
    srli t1, t0, 16             # t1 = u.i >> 16
    ori a0, t1, 64              # t1 = h = (u.i >> 16) | 64
    ret                         # jalr x0, ra, 0

endIf:
    # Round to nearest even value
    srli t1, t0, 16             # t1 = u.i >> 0x10
    andi t1, t1, 1              # t1 = ((u.i >> 0x10) & 1)
    li t2, 0x7fff               # t2 = 0x7fff
    add t1, t2, t1              # t1 = 0x7fff + ((u.i >> 0x10) & 1)
    add t1, t0, t1              # t1 = u.i + (0x7fff + ((u.i >> 0x10) & 1))
    srli a0, t1, 16             # t1 = h = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
    ret                         # jalr x0, ra, 0

Hamming Distance (Leetcode 461)

Description: The Hamming distance between two integers is the number of positions at which the corresponding bits are different.

Given two integers x and y, return the Hamming distance between them.

C Program

#include <stdint.h>

int hammingDistance(int x, int y) {
    int n = x ^ y;
    int count = 0;
    while (n > 0) {
        count += n & 1;
        n = n >> 1;
    }
    return count;
}

You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.

Test Case

	Input1	Input2	Golden	Description
Case0	0x0000000f	0x0000000f	0	no bit is different
Case1	0x0000000f	0x0000001f	1	odd different bit
Case2	0x0000000f	0x0000003f	2	even different bits
Case3	0x0f0f0f0f	0xf0f0f0f0	32	all bits are different

Assembly Code

.data
    test_case0: .word 0x0000000f, 0x0000000f
    test_case1: .word 0x0000000f, 0x0000001f
    test_case2: .word 0x0000000f, 0x0000003f
    test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0
    golden: .word 0, 1, 2, 32
    err_str0: .string "TestCase"
    err_str1: .string " failed!!\n"
    pass_str: .string "Pass all test cases!!\n"

.text
main:
    la s0, test_case0           # load the address of test_case0 to s0
    la s1, golden               # load the address of golden to s1
    
    # Initialize the test error to zero
    addi t0, x0, 0              # t0 = test_err = 0

    # Initialize local variable for the mainLoop
    addi t1, x0, 0              # t1 = i = 0
    addi t2, x0, 4              # t2 = const. = 4

mainLoop:
    bgeu t1, t2, endMainLoop    # when i >= 4, end mainLoop

    # Prepare the arguments that hammingDistance() need
    lw a0, 0(s0)                # load argument x, a0 = x
    lw a1, 4(s0)                # load argument y, a1 = y

    # RISC-V calling convention and function call
    addi sp, sp, -12            # push t0 ~ t2 to the stack
    sw t0, 0(sp)
    sw t1, 4(sp)
    sw t2, 8(sp)
    jal ra, hammingDistance     # jump to function hammingDistance
    lw t0, 0(sp)
    lw t1, 4(sp)
    lw t2, 8(sp)
    addi sp, sp, 12             # pop t0~t2 from the stack

    # Compare hammingDistance() result with the golden
    lw t3, 0(s1)                # load golden, t3 = golden[i]
    beq a0, t3, passTest        # if the result == golden, pass test

    # Update test error
    addi t0, t0, 1              # test_err++

    # Print faild message
    la a0, err_str0             # load the address of err_str0
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str0

    addi a0, t1, 0              # load the number of the test case
    li a7, 1                    # system call code for printing a integer
    ecall                       # print the number of the test case

    la a0, err_str1             # load the address of err_str1
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str1
    
passTest:
    # Update test case, golden and loop variable
    addi s0, s0, 8              # move s0 to next test_case
    addi s1, s1, 4              # move s1 to nest golden
    addi t1, t1, 1              # i++
    
    j mainLoop                  # jal x0, mainLoop              

endMainLoop:
    bne t0, x0, exit            # if test_err != 0, do not print pass message

    # Print pass message
    la a0, pass_str             # load the address of pass_str
    li a7, 4                    # system call code for printing a string
    ecall                       # print pass_str

exit:
    li a7, 10                   # system call code for exiting the program
    ecall                       # make the exit system call

hammingDistance:
    # RISC-V calling convention
    addi sp, sp, -8             # push s0 and s1 to stack
    sw s0, 0(sp)
    sw s1, 4(sp)

    # Pass argument used in hammingDistance()
    addi s0, a0, 0              # pass argument x, s0 = x
    addi s1, a1, 0              # pass argument y, s1 = y

    # Initialize local variables of hammingDistance()
    xor t0, s0, s1              # n = x ^ y
    addi t1, x0, 0              # count = 0

loop:
    bgeu x0, t0, endLoop        # when 0 >= n, end loop
    andi t2, t0, 1              # t2 = n & 1
    add t1, t1, t2              # count += n & 1
    srli t0, t0, 1              # n = n >> 1
    j loop                      # jal x0, loop

endLoop:
    # RISC-V calling convention
    lw s0, 0(sp)
    lw s1, 4(sp)
    addi sp, sp, 8              # pop s0 and s1 from stack

    # Return count
    addi a0, t1, 0              # move t1 to a0, a0 = count
    ret                         # jalr x0, ra, 0

Metrics	Value
Cycles	494
Instrs. retired	354
CPI	1.4
IPC	0.717
Code size	82

Use fewer instructions.

Assembly Optimization

Seperate the use of callee-saved and caller-saved registers

.data
    test_case0: .word 0x0000000f, 0x0000000f
    test_case1: .word 0x0000000f, 0x0000001f
    test_case2: .word 0x0000000f, 0x0000003f
    test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0
    golden: .word 0, 1, 2, 32
    err_str0: .string "TestCase"
    err_str1: .string " failed!!\n"
    pass_str: .string "Pass all test cases!!\n"

.text
main:
    la s0, test_case0           # load the address of test_case0 to s0
    la s1, golden               # load the address of golden to s1
    
    # Initialize the test error to zero
    addi s2, x0, 0              # s2 = test_err = 0

    # Initialize local variable for the mainLoop
    addi s3, x0, 0              # s3 = i = 0
    addi s4, x0, 4              # s4 = const. = 4

mainLoop:
    bgeu s3, s4, endMainLoop    # when i >= 4, end mainLoop

    # Prepare the arguments that hammingDistance() need
    lw a0, 0(s0)                # load argument x, a0 = x
    lw a1, 4(s0)                # load argument y, a1 = y

    # Function call
    jal ra, hammingDistance     # jump to function hammingDistance

    # Compare hammingDistance() result with the golden
    lw s5, 0(s1)                # load golden, s5 = golden[i]
    beq a0, s5, passTest        # if the result == golden, pass test

    # Update test error
    addi s2, s2, 1              # test_err++

    # Print faild message
    la a0, err_str0             # load the address of err_str0
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str0

    addi a0, s3, 0              # load the number of the test case
    li a7, 1                    # system call code for printing a integer
    ecall                       # print the number of the test case

    la a0, err_str1             # load the address of err_str1
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str1
    
passTest:
    # Update test case, golden and loop variable
    addi s0, s0, 8              # move s0 to next test_case
    addi s1, s1, 4              # move s1 to nest golden
    addi s3, s3, 1              # i++
    
    j mainLoop                  # jal x0, mainLoop              

endMainLoop:
    bne s2, x0, exit            # if test_err != 0, do not print pass message

    # Print pass message
    la a0, pass_str             # load the address of pass_str
    li a7, 4                    # system call code for printing a string
    ecall                       # print pass_str

exit:
    li a7, 10                   # system call code for exiting the program
    ecall                       # make the exit system call

hammingDistance:
    # Pass argument used in hammingDistance()
    addi t0, a0, 0              # pass argument x, t0 = x
    addi t1, a1, 0              # pass argument y, t1 = y

    # Initialize local variables of hammingDistance()
    xor t2, t0, t1              # n = x ^ y
    addi t3, x0, 0              # count = 0

loop:
    bgeu x0, t2, endLoop        # when 0 >= n, end loop
    andi t4, t2, 1              # t4 = n & 1
    add t3, t3, t4              # count += n & 1
    srli t2, t2, 1              # n = n >> 1
    j loop                      # jal x0, loop

endLoop:
    # Return count
    addi a0, t3, 0              # move t3 to a0, a0 = count
    ret                         # jalr x0, ra, 0

Metrics	Value
Cycles	438
Instrs. retired	298
CPI	1.47
IPC	0.68
Code size	68

Loop Unrolling

Unrolling 2 Times

loop:
    bgeu x0, t2, endLoop        # if n > 0 (0 >= n), end loop
    
    andi t4, t2, 1              # t4 = n & 1
    add t3, t3, t4              # count += n & 1
    srli t2, t2, 1              # n = n >> 1
        
    andi t4, t2, 1              # t4 = (n >> 1) & 1
    add t3, t3, t4              # count += (n >> 1) & 1
    srli t2, t2, 1              # n = n >> 2
        
    j loop                      # jal x0, loop

Metrics	Value
Cycles	357
Instrs. retired	259
CPI	1.38
IPC	0.725
Code size	71

Unrolling 4 Times

loop:
    bgeu x0, t2, endLoop        # if n > 0 (0 >= n), end loop

    andi t4, t2, 1              # t4 = n & 1
    add t3, t3, t4              # count += n & 1
    srli t2, t2, 1              # n = n >> 1

    andi t4, t2, 1              # t4 = (n >> 1) & 1
    add t3, t3, t4              # count += (n >> 1) & 1
    srli t2, t2, 1              # n = n >> 2

    andi t4, t2, 1              # t4 = (n >> 2) & 1
    add t3, t3, t4              # count += (n >> 2) & 1
    srli t2, t2, 1              # n = n >> 3

    andi t4, t2, 1              # t4 = (n >> 3) & 1
    add t3, t3, t4              # count += (n >> 3) & 1
    srli t2, t2, 1              # n = n >> 4

    j loop                      # jal x0, loop

Metrics	Value
Cycles	329
Instrs. retired	251
CPI	1.31
IPC	0.763
Code size	77

Unrolling Times Analysis

Metrics	w/o Unrolling	2 Times	4 Times	8 Times	16 Times	32 Times
Cycles	438	357 (-18%)	329 (-25%)	305 (-30%)	345 (-21%)	505 (+15%)
Instrs. retired	298	259 (-13%)	251 (-16%)	239 (-20%)	283 (-5%)	459 (+54%)
CPI	1.47	1.38 (-6%)	1.31 (-11%)	1.28 (-13%)	1.22 (-17%)	1.1 (-25%)
IPC	0.68	0.725 (+7%)	0.763 (+12%)	0.784 (+15%)	0.82 (+21%)	0.909 (+34%)
Code size	68	71 (+4% )	77 (+13%)	89 (+31%)	118 (+74%)	158 (+132%)

Assignment1: RISC-V Assembly and Instruction Pipeline

Experimental Environment

Example Assembly Code

Example Pipeline State

Quiz1 ProblemB

C Program

bf16_to_fp32()

Test Case

Assembly Code

fp32_to_bf16()

Test Case

Assembly Code

Hamming Distance (Leetcode 461)

C Program

Test Case

Assembly Code

Assembly Optimization

Seperate the use of callee-saved and caller-saved registers

Loop Unrolling

Unrolling 2 Times

Unrolling 4 Times

Unrolling Times Analysis

Reference