Try   HackMD

Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by <5k6m4>

Experimental Environment

The experimental environment is a 5-stage in-order pipeline processor with hazard detection and data forwarding. The pipeline consists of five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory (MEM), and write back (WB).

The following provides a brief introduction to each stage using an example assembly code and shows the functionality using the Ripes simulator.

Example Assembly Code

.data
    test_data: .word 0x0000ffff

.text
main:
    # demonstrate memory operation and data hazard
    la t0, test_data        # load the address of test_data
    lw t1, 0(t0)            # t1 = 0x0000ffff
    addi t1, x0, 0          # t1 = 0
    addi t2, t1, 2          # t2 = 2
    sw t1, 0(t0)            # mem[t0] = 0
    lw t2, 0(t0)            # t2 = 0

if:
    # demonstrate control hazard
    beq t1, x0, exit
    addi t2, x0, 2
    addi t3, x0, 3
    addi t4, x0, 4

exit:
    li a7, 10               # system call code for exiting the program
    ecall                   # make the exit system call

Example Pipeline State

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Instruction Fetch (IF)

  • The processor loads an instruction using the value of the program counter (PC) as the address.
  • In the example pipeline state, the processor load the instruction sw x6, 0, x5.

Instruction Decode (ID)

  • The instruction is decoded to obtain the source register index and the immediate value. Next, the register file is accessed to retrieve the value of the source register.
  • The instruction in the ID stage is addi x7, x6, 2. Therefore, one operand is set to the value of register x6, and the other is the immediate value 2.

Execute (EX)

  • The ALU performs the operation specified by the instruction in this stage.
  • In the example where the instruction is addi x6, x0, 0, the ALU adds the value in register x0 to the immediate value 0.

Memory (MEM)

  • If the instruction in this stage is a load/store instruction, the processor accesses the memory using the address stored in the source register.
  • When the load instruction lw x6, 0(x5) is in this stage, the CPU requests to read the value stored at the address contained in register x5.

Write Back (WB)

  • The value computed from the instruction of access from the memory will be written back to the destination register in this stage.
  • In the example pipeline stage, the instruction in the WB stage is addi x5, x5, 0. Therefore, the processor writes the value computed from the EX stage to register x5.

Quiz1 ProblemB

Choose one problem (A, B, or C) from Quiz1, translate it from C code to a complete RISC-V assembly program, and include the relevant test data.

C Program

#include <stdint.h>

typedef struct {
    uint16_t bits;
} bf16_t;

static inline float bf16_to_fp32(bf16_t h) {
    union {
        float f;
        uint32_t i;
    } u = {.i = (uint32_t)h.bits << 16};
    return u.f;
}

static inline bf16_t fp32_to_bf16(float s) {
    bf16_t h;
    union {
        float f;
        uint32_t i;
    } u = {.f = s};
    if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */
        h.bits = (u.i >> 16) | 64; /* force to quiet */
        return h;
    }
    h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
    return h;
}

bf16_to_fp32()

Test Case

Input in bf16 Expected Output in fp32 Description
Case0 0x3f80 0x3f800000 normal positive of bf16
Case1 0xbf80 0xbf800000 normal negative of bf16
Case2 0x0000 0x00000000 positive zero of bf16
Case3 0x7f80 0x7f800000 positive infinity of bf16
Case4 0x7fc0 0x7fc00000 quiet NaN of bf16
Case5 0x0001 0x00010000 subnormal positive of bf16
Case6 0x8001 0x80010000 subnormal negative of bf16
Case7 0x7f7f 0x7f7f0000 normal maximum of bf16
Case8 0x0080 0x00800000 normal minimum of bf16

Assembly Code

.data
    # test cases pair = {input bf16 value, golden fp32 value}
    test_case:
        .word 0x00003f80, 0x3f800000 # 0: normal positive of bf16
        .word 0x0000bf80, 0xbf800000 # 1: normal negative of bf16
        .word 0x00000000, 0x00000000 # 2: positive zero of bf16
        .word 0x00007f80, 0x7f800000 # 3: positive infinity of bf16
        .word 0x00007fc0, 0x7fc00000 # 4: quiet NaN of bf16
        .word 0x00000001, 0x00010000 # 5: subnormal positive of bf16
        .word 0x00008001, 0x80010000 # 6: subnormal negative of bf16
        .word 0x00007f7f, 0x7f7f0000 # 7: normal maximum of bf16
        .word 0x00000080, 0x00800000 # 8: normal minimum of bf16
    err_str0: .string "TestCase"
    err_str1: .string " of bf16_to_fp32() failed!!\n"
    pass_str: .string "Pass all test cases of bf16_to_fp32()!!\n"

.text
main:
    la s0, test_case            # load the address of test_case to s0

    # Initialize the test error to zero
    addi s1, x0, 0              # s1 = test_err = 0

    # Initialize local variables for the mainLoop
    addi s2, x0, 0              # s2 = i = 0
    addi s3, x0, 9              # s3 = const. = 9

mainLoop:
    bgeu s2, s3, endMainLoop    # when i >= 9, end mainLoop

    # Load the argument that bf16_to_fp32() needs
    lw a0, 0(s0)

    # Function call
    jal ra, bf16_to_fp32        # jump to functioin bf16_to_fp32

    # Compare bf16_to_fp32() result with the golden
    lw s4, 4(s0)                # load golden, s4 = golden value
    beq a0, s4, passTest        # if result == golden, pass test case

    # Update test error
    addi s1, s1, 1              # test_err++

    # Print failed message
    la a0, err_str0             # load the address of err_str0
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str0

    addi a0, s2, 0              # load the number of the test case
    li a7, 1                    # system call code for printing a integer
    ecall                       # print the number of the test case

    la a0, err_str1             # load the address of err_str1
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str1

passTest:
    # Update test case and loop variable
    addi s0, s0, 8              # move s0 to the address of the next test case
    addi s2, s2, 1              # i++

    j mainLoop                  # jal x0, mainLoop

endMainLoop:
    bne s1, x0, exit            # if test_err != 0, do not print pass message

    # Print pass message
    la a0, pass_str             # load the address of pass_str
    li a7, 4                    # system call code for printing a string
    ecall                       # print pass_str16

exit:
    li a7, 10                   # system call code for exiting the program
    ecall                       # make the exit system call

bf16_to_fp32:
    slli a0, a0, 16             # a0 = h.bits << 16
    ret                         # jalr x0, ra, 0

fp32_to_bf16()

Test Case

Input in fp32 Expected Output in bf16 Description
Case0 0x41200000 0x4120 normal positive of fp32
Case1 0xc1200000 0xc120 normal negative of fp32
Case2 0x00000000 0x0000 positive zero of fp32
Case3 0x7f800000 0x7f80 positive infinity of fp32
Case4 0x7fc00000 0x7fc0 quiet NaN of fp32
Case5 0x00000001 0x0000 subnormal positive of fp32
Case6 0x80000001 0x8000 subnormal negative of fp32
Case7 0x7f7fffff 0x7f80 normal maximum of fp32
Case8 0x00800000 0x0080 normal minimum of fp32

Assembly Code

.data
    # test cases pair = {input fp32 value, golden bf16 value}
    test_case:
        .word 0x41200000, 0x00004120 # 0: normal positive of fp32
        .word 0xc1200000, 0x0000c120 # 1: normal negative of fp32
        .word 0x00000000, 0x00000000 # 2: positive zero of fp32
        .word 0x7f800000, 0x00007f80 # 3: positive infinity of fp32
        .word 0x7fc00000, 0x00007fc0 # 4: quiet NaN of fp32
        .word 0x00000001, 0x00000000 # 5: subnormal positive of fp32
        .word 0x80000001, 0x00008000 # 6: subnormal negative of fp32
        .word 0x7f7fffff, 0x00007f80 # 7: normal maximum of fp32
        .word 0x00800000, 0x00000080 # 8: normal minimum of fp32
    err_str0: .string "TestCase"
    err_str1: .string " of fp32_to_bf16() failed!!\n"
    pass_str: .string "Pass all test cases of fp32_to_bf16()!!\n"

.text
main:
    la s0, test_case            # load the address of test_case to s0

    # Initialize the test error to zero
    addi s1, x0, 0              # s1 = test_err = 0

    # Initialize local variables for the mainLoop
    addi s2, x0, 0              # s2 = i = 0
    addi s3, x0, 9              # s3 = const. = 9

mainLoop:
    bgeu s2, s3, endMainLoop    # when i >= 9, end mainLoop

    # Load the argument that fp32_to_bf16() needs
    lw a0, 0(s0)

    # Function call
    jal ra, fp32_to_bf16        # jump to functioin fp32_to_bf16

    # Compare fp32_to_bf16() result with the golden
    lw s4, 4(s0)                # load golden, s4 = golden value
    beq a0, s4, passTest        # if result == golden, pass test case

    # Update test error
    addi s1, s1, 1              # test_err++

    # Print failed message
    la a0, err_str0             # load the address of err_str0
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str0

    addi a0, s2, 0              # load the number of the test case
    li a7, 1                    # system call code for printing a integer
    ecall                       # print the number of the test case

    la a0, err_str1             # load the address of err_str1
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str1

passTest:
    # Update test case and loop variable
    addi s0, s0, 8              # move s0 to the address of the next test case
    addi s2, s2, 1              # i++

    j mainLoop                  # jal x0, mainLoop

endMainLoop:
    bne s1, x0, exit            # if test_err != 0, do not print pass message

    # Print pass message
    la a0, pass_str             # load the address of pass_str
    li a7, 4                    # system call code for printing a string
    ecall                       # print pass_str16

exit:
    li a7, 10                   # system call code for exiting the program
    ecall                       # make the exit system call

fp32_to_bf16:
    addi t0, a0, 0              # pass argument s, t0 = s

    li t1, 0x7fffffff           # t1 = 0x7fffffff
    and t1, t0, t1              # t1 = u.i & 0x7fffffff
    li t2, 0x7f800000           # t2 = 0x7f800000

if:
    bge t2, t1, endIf           # if 0x7f800000 >= (u.i & 0x7fffffff), skip NaN handling

    # Handle NaN value
    srli t1, t0, 16             # t1 = u.i >> 16
    ori a0, t1, 64              # t1 = h = (u.i >> 16) | 64
    ret                         # jalr x0, ra, 0

endIf:
    # Round to nearest even value
    srli t1, t0, 16             # t1 = u.i >> 0x10
    andi t1, t1, 1              # t1 = ((u.i >> 0x10) & 1)
    li t2, 0x7fff               # t2 = 0x7fff
    add t1, t2, t1              # t1 = 0x7fff + ((u.i >> 0x10) & 1)
    add t1, t0, t1              # t1 = u.i + (0x7fff + ((u.i >> 0x10) & 1))
    srli a0, t1, 16             # t1 = h = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
    ret                         # jalr x0, ra, 0

Hamming Distance (Leetcode 461)

Description: The Hamming distance between two integers is the number of positions at which the corresponding bits are different.

Given two integers x and y, return the Hamming distance between them.

C Program

#include <stdint.h>

int hammingDistance(int x, int y) {
    int n = x ^ y;
    int count = 0;
    while (n > 0) {
        count += n & 1;
        n = n >> 1;
    }
    return count;
}

You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.

Test Case

Input1 Input2 Golden Description
Case0 0x0000000f 0x0000000f 0 no bit is different
Case1 0x0000000f 0x0000001f 1 odd different bit
Case2 0x0000000f 0x0000003f 2 even different bits
Case3 0x0f0f0f0f 0xf0f0f0f0 32 all bits are different

Assembly Code

.data
    test_case0: .word 0x0000000f, 0x0000000f
    test_case1: .word 0x0000000f, 0x0000001f
    test_case2: .word 0x0000000f, 0x0000003f
    test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0
    golden: .word 0, 1, 2, 32
    err_str0: .string "TestCase"
    err_str1: .string " failed!!\n"
    pass_str: .string "Pass all test cases!!\n"

.text
main:
    la s0, test_case0           # load the address of test_case0 to s0
    la s1, golden               # load the address of golden to s1
    
    # Initialize the test error to zero
    addi t0, x0, 0              # t0 = test_err = 0

    # Initialize local variable for the mainLoop
    addi t1, x0, 0              # t1 = i = 0
    addi t2, x0, 4              # t2 = const. = 4

mainLoop:
    bgeu t1, t2, endMainLoop    # when i >= 4, end mainLoop

    # Prepare the arguments that hammingDistance() need
    lw a0, 0(s0)                # load argument x, a0 = x
    lw a1, 4(s0)                # load argument y, a1 = y

    # RISC-V calling convention and function call
    addi sp, sp, -12            # push t0 ~ t2 to the stack
    sw t0, 0(sp)
    sw t1, 4(sp)
    sw t2, 8(sp)
    jal ra, hammingDistance     # jump to function hammingDistance
    lw t0, 0(sp)
    lw t1, 4(sp)
    lw t2, 8(sp)
    addi sp, sp, 12             # pop t0~t2 from the stack

    # Compare hammingDistance() result with the golden
    lw t3, 0(s1)                # load golden, t3 = golden[i]
    beq a0, t3, passTest        # if the result == golden, pass test

    # Update test error
    addi t0, t0, 1              # test_err++

    # Print faild message
    la a0, err_str0             # load the address of err_str0
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str0

    addi a0, t1, 0              # load the number of the test case
    li a7, 1                    # system call code for printing a integer
    ecall                       # print the number of the test case

    la a0, err_str1             # load the address of err_str1
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str1
    
passTest:
    # Update test case, golden and loop variable
    addi s0, s0, 8              # move s0 to next test_case
    addi s1, s1, 4              # move s1 to nest golden
    addi t1, t1, 1              # i++
    
    j mainLoop                  # jal x0, mainLoop              

endMainLoop:
    bne t0, x0, exit            # if test_err != 0, do not print pass message

    # Print pass message
    la a0, pass_str             # load the address of pass_str
    li a7, 4                    # system call code for printing a string
    ecall                       # print pass_str

exit:
    li a7, 10                   # system call code for exiting the program
    ecall                       # make the exit system call

hammingDistance:
    # RISC-V calling convention
    addi sp, sp, -8             # push s0 and s1 to stack
    sw s0, 0(sp)
    sw s1, 4(sp)

    # Pass argument used in hammingDistance()
    addi s0, a0, 0              # pass argument x, s0 = x
    addi s1, a1, 0              # pass argument y, s1 = y

    # Initialize local variables of hammingDistance()
    xor t0, s0, s1              # n = x ^ y
    addi t1, x0, 0              # count = 0

loop:
    bgeu x0, t0, endLoop        # when 0 >= n, end loop
    andi t2, t0, 1              # t2 = n & 1
    add t1, t1, t2              # count += n & 1
    srli t0, t0, 1              # n = n >> 1
    j loop                      # jal x0, loop

endLoop:
    # RISC-V calling convention
    lw s0, 0(sp)
    lw s1, 4(sp)
    addi sp, sp, 8              # pop s0 and s1 from stack

    # Return count
    addi a0, t1, 0              # move t1 to a0, a0 = count
    ret                         # jalr x0, ra, 0
Metrics Value
Cycles 494
Instrs. retired 354
CPI 1.4
IPC 0.717
Code size 82

Use fewer instructions.

Assembly Optimization

Seperate the use of callee-saved and caller-saved registers

.data
    test_case0: .word 0x0000000f, 0x0000000f
    test_case1: .word 0x0000000f, 0x0000001f
    test_case2: .word 0x0000000f, 0x0000003f
    test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0
    golden: .word 0, 1, 2, 32
    err_str0: .string "TestCase"
    err_str1: .string " failed!!\n"
    pass_str: .string "Pass all test cases!!\n"

.text
main:
    la s0, test_case0           # load the address of test_case0 to s0
    la s1, golden               # load the address of golden to s1
    
    # Initialize the test error to zero
    addi s2, x0, 0              # s2 = test_err = 0

    # Initialize local variable for the mainLoop
    addi s3, x0, 0              # s3 = i = 0
    addi s4, x0, 4              # s4 = const. = 4

mainLoop:
    bgeu s3, s4, endMainLoop    # when i >= 4, end mainLoop

    # Prepare the arguments that hammingDistance() need
    lw a0, 0(s0)                # load argument x, a0 = x
    lw a1, 4(s0)                # load argument y, a1 = y

    # Function call
    jal ra, hammingDistance     # jump to function hammingDistance

    # Compare hammingDistance() result with the golden
    lw s5, 0(s1)                # load golden, s5 = golden[i]
    beq a0, s5, passTest        # if the result == golden, pass test

    # Update test error
    addi s2, s2, 1              # test_err++

    # Print faild message
    la a0, err_str0             # load the address of err_str0
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str0

    addi a0, s3, 0              # load the number of the test case
    li a7, 1                    # system call code for printing a integer
    ecall                       # print the number of the test case

    la a0, err_str1             # load the address of err_str1
    li a7, 4                    # system call code for printing a string
    ecall                       # print err_str1
    
passTest:
    # Update test case, golden and loop variable
    addi s0, s0, 8              # move s0 to next test_case
    addi s1, s1, 4              # move s1 to nest golden
    addi s3, s3, 1              # i++
    
    j mainLoop                  # jal x0, mainLoop              

endMainLoop:
    bne s2, x0, exit            # if test_err != 0, do not print pass message

    # Print pass message
    la a0, pass_str             # load the address of pass_str
    li a7, 4                    # system call code for printing a string
    ecall                       # print pass_str

exit:
    li a7, 10                   # system call code for exiting the program
    ecall                       # make the exit system call

hammingDistance:
    # Pass argument used in hammingDistance()
    addi t0, a0, 0              # pass argument x, t0 = x
    addi t1, a1, 0              # pass argument y, t1 = y

    # Initialize local variables of hammingDistance()
    xor t2, t0, t1              # n = x ^ y
    addi t3, x0, 0              # count = 0

loop:
    bgeu x0, t2, endLoop        # when 0 >= n, end loop
    andi t4, t2, 1              # t4 = n & 1
    add t3, t3, t4              # count += n & 1
    srli t2, t2, 1              # n = n >> 1
    j loop                      # jal x0, loop

endLoop:
    # Return count
    addi a0, t3, 0              # move t3 to a0, a0 = count
    ret                         # jalr x0, ra, 0
Metrics Value
Cycles 438
Instrs. retired 298
CPI 1.47
IPC 0.68
Code size 68

Loop Unrolling

Unrolling 2 Times

loop:
    bgeu x0, t2, endLoop        # if n > 0 (0 >= n), end loop
    
    andi t4, t2, 1              # t4 = n & 1
    add t3, t3, t4              # count += n & 1
    srli t2, t2, 1              # n = n >> 1
        
    andi t4, t2, 1              # t4 = (n >> 1) & 1
    add t3, t3, t4              # count += (n >> 1) & 1
    srli t2, t2, 1              # n = n >> 2
        
    j loop                      # jal x0, loop
Metrics Value
Cycles 357
Instrs. retired 259
CPI 1.38
IPC 0.725
Code size 71

Unrolling 4 Times

loop:
    bgeu x0, t2, endLoop        # if n > 0 (0 >= n), end loop

    andi t4, t2, 1              # t4 = n & 1
    add t3, t3, t4              # count += n & 1
    srli t2, t2, 1              # n = n >> 1

    andi t4, t2, 1              # t4 = (n >> 1) & 1
    add t3, t3, t4              # count += (n >> 1) & 1
    srli t2, t2, 1              # n = n >> 2

    andi t4, t2, 1              # t4 = (n >> 2) & 1
    add t3, t3, t4              # count += (n >> 2) & 1
    srli t2, t2, 1              # n = n >> 3

    andi t4, t2, 1              # t4 = (n >> 3) & 1
    add t3, t3, t4              # count += (n >> 3) & 1
    srli t2, t2, 1              # n = n >> 4

    j loop                      # jal x0, loop
Metrics Value
Cycles 329
Instrs. retired 251
CPI 1.31
IPC 0.763
Code size 77

Unrolling Times Analysis

Metrics w/o Unrolling 2 Times 4 Times 8 Times 16 Times 32 Times
Cycles 438 357 (-18%) 329 (-25%) 305 (-30%) 345 (-21%) 505 (+15%)
Instrs. retired 298 259 (-13%) 251 (-16%) 239 (-20%) 283 (-5%) 459 (+54%)
CPI 1.47 1.38 (-6%) 1.31 (-11%) 1.28 (-13%) 1.22 (-17%) 1.1 (-25%)
IPC 0.68 0.725 (+7%) 0.763 (+12%) 0.784 (+15%) 0.82 (+21%) 0.909 (+34%)
Code size 68 71 (+4% ) 77 (+13%) 89 (+31%) 118 (+74%) 158 (+132%)

Reference