Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <5k6m4
>
Experimental Environment
The experimental environment is a 5-stage in-order pipeline processor with hazard detection and data forwarding. The pipeline consists of five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory (MEM), and write back (WB).
The following provides a brief introduction to each stage using an example assembly code and shows the functionality using the Ripes simulator.
Example Assembly Code
Example Pipeline State
Image Not Showing
Possible Reasons
- The image was uploaded to a note which you don't have access to
- The note which the image was originally uploaded to has been deleted
Learn More →
Instruction Fetch (IF)
- The processor loads an instruction using the value of the program counter (PC) as the address.
- In the example pipeline state, the processor load the instruction
sw x6, 0, x5
.
Instruction Decode (ID)
- The instruction is decoded to obtain the source register index and the immediate value. Next, the register file is accessed to retrieve the value of the source register.
- The instruction in the ID stage is
addi x7, x6, 2
. Therefore, one operand is set to the value of register x6
, and the other is the immediate value 2
.
Execute (EX)
- The ALU performs the operation specified by the instruction in this stage.
- In the example where the instruction is
addi x6, x0, 0
, the ALU adds the value in register x0
to the immediate value 0
.
Memory (MEM)
- If the instruction in this stage is a load/store instruction, the processor accesses the memory using the address stored in the source register.
- When the load instruction
lw x6, 0(x5)
is in this stage, the CPU requests to read the value stored at the address contained in register x5
.
Write Back (WB)
- The value computed from the instruction of access from the memory will be written back to the destination register in this stage.
- In the example pipeline stage, the instruction in the WB stage is
addi x5, x5, 0
. Therefore, the processor writes the value computed from the EX stage to register x5
.
Choose one problem (A, B, or C) from Quiz1, translate it from C code to a complete RISC-V assembly program, and include the relevant test data.
C Program
bf16_to_fp32()
Test Case
|
Input in bf16 |
Expected Output in fp32 |
Description |
Case0 |
0x3f80 |
0x3f800000 |
normal positive of bf16 |
Case1 |
0xbf80 |
0xbf800000 |
normal negative of bf16 |
Case2 |
0x0000 |
0x00000000 |
positive zero of bf16 |
Case3 |
0x7f80 |
0x7f800000 |
positive infinity of bf16 |
Case4 |
0x7fc0 |
0x7fc00000 |
quiet NaN of bf16 |
Case5 |
0x0001 |
0x00010000 |
subnormal positive of bf16 |
Case6 |
0x8001 |
0x80010000 |
subnormal negative of bf16 |
Case7 |
0x7f7f |
0x7f7f0000 |
normal maximum of bf16 |
Case8 |
0x0080 |
0x00800000 |
normal minimum of bf16 |
Assembly Code
.data
# test cases pair = {input bf16 value, golden fp32 value}
test_case:
.word 0x00003f80, 0x3f800000 # 0: normal positive of bf16
.word 0x0000bf80, 0xbf800000 # 1: normal negative of bf16
.word 0x00000000, 0x00000000 # 2: positive zero of bf16
.word 0x00007f80, 0x7f800000 # 3: positive infinity of bf16
.word 0x00007fc0, 0x7fc00000 # 4: quiet NaN of bf16
.word 0x00000001, 0x00010000 # 5: subnormal positive of bf16
.word 0x00008001, 0x80010000 # 6: subnormal negative of bf16
.word 0x00007f7f, 0x7f7f0000 # 7: normal maximum of bf16
.word 0x00000080, 0x00800000 # 8: normal minimum of bf16
err_str0: .string "TestCase"
err_str1: .string " of bf16_to_fp32() failed!!\n"
pass_str: .string "Pass all test cases of bf16_to_fp32()!!\n"
.text
main:
la s0, test_case # load the address of test_case to s0
# Initialize the test error to zero
addi s1, x0, 0 # s1 = test_err = 0
# Initialize local variables for the mainLoop
addi s2, x0, 0 # s2 = i = 0
addi s3, x0, 9 # s3 = const. = 9
mainLoop:
bgeu s2, s3, endMainLoop # when i >= 9, end mainLoop
# Load the argument that bf16_to_fp32() needs
lw a0, 0(s0)
# Function call
jal ra, bf16_to_fp32 # jump to functioin bf16_to_fp32
# Compare bf16_to_fp32() result with the golden
lw s4, 4(s0) # load golden, s4 = golden value
beq a0, s4, passTest # if result == golden, pass test case
# Update test error
addi s1, s1, 1 # test_err++
# Print failed message
la a0, err_str0 # load the address of err_str0
li a7, 4 # system call code for printing a string
ecall # print err_str0
addi a0, s2, 0 # load the number of the test case
li a7, 1 # system call code for printing a integer
ecall # print the number of the test case
la a0, err_str1 # load the address of err_str1
li a7, 4 # system call code for printing a string
ecall # print err_str1
passTest:
# Update test case and loop variable
addi s0, s0, 8 # move s0 to the address of the next test case
addi s2, s2, 1 # i++
j mainLoop # jal x0, mainLoop
endMainLoop:
bne s1, x0, exit # if test_err != 0, do not print pass message
# Print pass message
la a0, pass_str # load the address of pass_str
li a7, 4 # system call code for printing a string
ecall # print pass_str16
exit:
li a7, 10 # system call code for exiting the program
ecall # make the exit system call
bf16_to_fp32:
slli a0, a0, 16 # a0 = h.bits << 16
ret # jalr x0, ra, 0
fp32_to_bf16()
Test Case
|
Input in fp32 |
Expected Output in bf16 |
Description |
Case0 |
0x41200000 |
0x4120 |
normal positive of fp32 |
Case1 |
0xc1200000 |
0xc120 |
normal negative of fp32 |
Case2 |
0x00000000 |
0x0000 |
positive zero of fp32 |
Case3 |
0x7f800000 |
0x7f80 |
positive infinity of fp32 |
Case4 |
0x7fc00000 |
0x7fc0 |
quiet NaN of fp32 |
Case5 |
0x00000001 |
0x0000 |
subnormal positive of fp32 |
Case6 |
0x80000001 |
0x8000 |
subnormal negative of fp32 |
Case7 |
0x7f7fffff |
0x7f80 |
normal maximum of fp32 |
Case8 |
0x00800000 |
0x0080 |
normal minimum of fp32 |
Assembly Code
.data
# test cases pair = {input fp32 value, golden bf16 value}
test_case:
.word 0x41200000, 0x00004120 # 0: normal positive of fp32
.word 0xc1200000, 0x0000c120 # 1: normal negative of fp32
.word 0x00000000, 0x00000000 # 2: positive zero of fp32
.word 0x7f800000, 0x00007f80 # 3: positive infinity of fp32
.word 0x7fc00000, 0x00007fc0 # 4: quiet NaN of fp32
.word 0x00000001, 0x00000000 # 5: subnormal positive of fp32
.word 0x80000001, 0x00008000 # 6: subnormal negative of fp32
.word 0x7f7fffff, 0x00007f80 # 7: normal maximum of fp32
.word 0x00800000, 0x00000080 # 8: normal minimum of fp32
err_str0: .string "TestCase"
err_str1: .string " of fp32_to_bf16() failed!!\n"
pass_str: .string "Pass all test cases of fp32_to_bf16()!!\n"
.text
main:
la s0, test_case # load the address of test_case to s0
# Initialize the test error to zero
addi s1, x0, 0 # s1 = test_err = 0
# Initialize local variables for the mainLoop
addi s2, x0, 0 # s2 = i = 0
addi s3, x0, 9 # s3 = const. = 9
mainLoop:
bgeu s2, s3, endMainLoop # when i >= 9, end mainLoop
# Load the argument that fp32_to_bf16() needs
lw a0, 0(s0)
# Function call
jal ra, fp32_to_bf16 # jump to functioin fp32_to_bf16
# Compare fp32_to_bf16() result with the golden
lw s4, 4(s0) # load golden, s4 = golden value
beq a0, s4, passTest # if result == golden, pass test case
# Update test error
addi s1, s1, 1 # test_err++
# Print failed message
la a0, err_str0 # load the address of err_str0
li a7, 4 # system call code for printing a string
ecall # print err_str0
addi a0, s2, 0 # load the number of the test case
li a7, 1 # system call code for printing a integer
ecall # print the number of the test case
la a0, err_str1 # load the address of err_str1
li a7, 4 # system call code for printing a string
ecall # print err_str1
passTest:
# Update test case and loop variable
addi s0, s0, 8 # move s0 to the address of the next test case
addi s2, s2, 1 # i++
j mainLoop # jal x0, mainLoop
endMainLoop:
bne s1, x0, exit # if test_err != 0, do not print pass message
# Print pass message
la a0, pass_str # load the address of pass_str
li a7, 4 # system call code for printing a string
ecall # print pass_str16
exit:
li a7, 10 # system call code for exiting the program
ecall # make the exit system call
fp32_to_bf16:
addi t0, a0, 0 # pass argument s, t0 = s
li t1, 0x7fffffff # t1 = 0x7fffffff
and t1, t0, t1 # t1 = u.i & 0x7fffffff
li t2, 0x7f800000 # t2 = 0x7f800000
if:
bge t2, t1, endIf # if 0x7f800000 >= (u.i & 0x7fffffff), skip NaN handling
# Handle NaN value
srli t1, t0, 16 # t1 = u.i >> 16
ori a0, t1, 64 # t1 = h = (u.i >> 16) | 64
ret # jalr x0, ra, 0
endIf:
# Round to nearest even value
srli t1, t0, 16 # t1 = u.i >> 0x10
andi t1, t1, 1 # t1 = ((u.i >> 0x10) & 1)
li t2, 0x7fff # t2 = 0x7fff
add t1, t2, t1 # t1 = 0x7fff + ((u.i >> 0x10) & 1)
add t1, t0, t1 # t1 = u.i + (0x7fff + ((u.i >> 0x10) & 1))
srli a0, t1, 16 # t1 = h = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
ret # jalr x0, ra, 0
Description: The Hamming distance between two integers is the number of positions at which the corresponding bits are different.
Given two integers x
and y
, return the Hamming distance between them.
C Program
You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.
Test Case
|
Input1 |
Input2 |
Golden |
Description |
Case0 |
0x0000000f |
0x0000000f |
0 |
no bit is different |
Case1 |
0x0000000f |
0x0000001f |
1 |
odd different bit |
Case2 |
0x0000000f |
0x0000003f |
2 |
even different bits |
Case3 |
0x0f0f0f0f |
0xf0f0f0f0 |
32 |
all bits are different |
Assembly Code
.data
test_case0: .word 0x0000000f, 0x0000000f
test_case1: .word 0x0000000f, 0x0000001f
test_case2: .word 0x0000000f, 0x0000003f
test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0
golden: .word 0, 1, 2, 32
err_str0: .string "TestCase"
err_str1: .string " failed!!\n"
pass_str: .string "Pass all test cases!!\n"
.text
main:
la s0, test_case0 # load the address of test_case0 to s0
la s1, golden # load the address of golden to s1
# Initialize the test error to zero
addi t0, x0, 0 # t0 = test_err = 0
# Initialize local variable for the mainLoop
addi t1, x0, 0 # t1 = i = 0
addi t2, x0, 4 # t2 = const. = 4
mainLoop:
bgeu t1, t2, endMainLoop # when i >= 4, end mainLoop
# Prepare the arguments that hammingDistance() need
lw a0, 0(s0) # load argument x, a0 = x
lw a1, 4(s0) # load argument y, a1 = y
# RISC-V calling convention and function call
addi sp, sp, -12 # push t0 ~ t2 to the stack
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
jal ra, hammingDistance # jump to function hammingDistance
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
addi sp, sp, 12 # pop t0~t2 from the stack
# Compare hammingDistance() result with the golden
lw t3, 0(s1) # load golden, t3 = golden[i]
beq a0, t3, passTest # if the result == golden, pass test
# Update test error
addi t0, t0, 1 # test_err++
# Print faild message
la a0, err_str0 # load the address of err_str0
li a7, 4 # system call code for printing a string
ecall # print err_str0
addi a0, t1, 0 # load the number of the test case
li a7, 1 # system call code for printing a integer
ecall # print the number of the test case
la a0, err_str1 # load the address of err_str1
li a7, 4 # system call code for printing a string
ecall # print err_str1
passTest:
# Update test case, golden and loop variable
addi s0, s0, 8 # move s0 to next test_case
addi s1, s1, 4 # move s1 to nest golden
addi t1, t1, 1 # i++
j mainLoop # jal x0, mainLoop
endMainLoop:
bne t0, x0, exit # if test_err != 0, do not print pass message
# Print pass message
la a0, pass_str # load the address of pass_str
li a7, 4 # system call code for printing a string
ecall # print pass_str
exit:
li a7, 10 # system call code for exiting the program
ecall # make the exit system call
hammingDistance:
# RISC-V calling convention
addi sp, sp, -8 # push s0 and s1 to stack
sw s0, 0(sp)
sw s1, 4(sp)
# Pass argument used in hammingDistance()
addi s0, a0, 0 # pass argument x, s0 = x
addi s1, a1, 0 # pass argument y, s1 = y
# Initialize local variables of hammingDistance()
xor t0, s0, s1 # n = x ^ y
addi t1, x0, 0 # count = 0
loop:
bgeu x0, t0, endLoop # when 0 >= n, end loop
andi t2, t0, 1 # t2 = n & 1
add t1, t1, t2 # count += n & 1
srli t0, t0, 1 # n = n >> 1
j loop # jal x0, loop
endLoop:
# RISC-V calling convention
lw s0, 0(sp)
lw s1, 4(sp)
addi sp, sp, 8 # pop s0 and s1 from stack
# Return count
addi a0, t1, 0 # move t1 to a0, a0 = count
ret # jalr x0, ra, 0
Metrics |
Value |
Cycles |
494 |
Instrs. retired |
354 |
CPI |
1.4 |
IPC |
0.717 |
Code size |
82 |
Assembly Optimization
Seperate the use of callee-saved and caller-saved registers
.data
test_case0: .word 0x0000000f, 0x0000000f
test_case1: .word 0x0000000f, 0x0000001f
test_case2: .word 0x0000000f, 0x0000003f
test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0
golden: .word 0, 1, 2, 32
err_str0: .string "TestCase"
err_str1: .string " failed!!\n"
pass_str: .string "Pass all test cases!!\n"
.text
main:
la s0, test_case0 # load the address of test_case0 to s0
la s1, golden # load the address of golden to s1
# Initialize the test error to zero
addi s2, x0, 0 # s2 = test_err = 0
# Initialize local variable for the mainLoop
addi s3, x0, 0 # s3 = i = 0
addi s4, x0, 4 # s4 = const. = 4
mainLoop:
bgeu s3, s4, endMainLoop # when i >= 4, end mainLoop
# Prepare the arguments that hammingDistance() need
lw a0, 0(s0) # load argument x, a0 = x
lw a1, 4(s0) # load argument y, a1 = y
# Function call
jal ra, hammingDistance # jump to function hammingDistance
# Compare hammingDistance() result with the golden
lw s5, 0(s1) # load golden, s5 = golden[i]
beq a0, s5, passTest # if the result == golden, pass test
# Update test error
addi s2, s2, 1 # test_err++
# Print faild message
la a0, err_str0 # load the address of err_str0
li a7, 4 # system call code for printing a string
ecall # print err_str0
addi a0, s3, 0 # load the number of the test case
li a7, 1 # system call code for printing a integer
ecall # print the number of the test case
la a0, err_str1 # load the address of err_str1
li a7, 4 # system call code for printing a string
ecall # print err_str1
passTest:
# Update test case, golden and loop variable
addi s0, s0, 8 # move s0 to next test_case
addi s1, s1, 4 # move s1 to nest golden
addi s3, s3, 1 # i++
j mainLoop # jal x0, mainLoop
endMainLoop:
bne s2, x0, exit # if test_err != 0, do not print pass message
# Print pass message
la a0, pass_str # load the address of pass_str
li a7, 4 # system call code for printing a string
ecall # print pass_str
exit:
li a7, 10 # system call code for exiting the program
ecall # make the exit system call
hammingDistance:
# Pass argument used in hammingDistance()
addi t0, a0, 0 # pass argument x, t0 = x
addi t1, a1, 0 # pass argument y, t1 = y
# Initialize local variables of hammingDistance()
xor t2, t0, t1 # n = x ^ y
addi t3, x0, 0 # count = 0
loop:
bgeu x0, t2, endLoop # when 0 >= n, end loop
andi t4, t2, 1 # t4 = n & 1
add t3, t3, t4 # count += n & 1
srli t2, t2, 1 # n = n >> 1
j loop # jal x0, loop
endLoop:
# Return count
addi a0, t3, 0 # move t3 to a0, a0 = count
ret # jalr x0, ra, 0
Metrics |
Value |
Cycles |
438 |
Instrs. retired |
298 |
CPI |
1.47 |
IPC |
0.68 |
Code size |
68 |
Loop Unrolling
Unrolling 2 Times
Metrics |
Value |
Cycles |
357 |
Instrs. retired |
259 |
CPI |
1.38 |
IPC |
0.725 |
Code size |
71 |
Unrolling 4 Times
Metrics |
Value |
Cycles |
329 |
Instrs. retired |
251 |
CPI |
1.31 |
IPC |
0.763 |
Code size |
77 |
Unrolling Times Analysis
Metrics |
w/o Unrolling |
2 Times |
4 Times |
8 Times |
16 Times |
32 Times |
Cycles |
438 |
357 (-18%) |
329 (-25%) |
305 (-30%) |
345 (-21%) |
505 (+15%) |
Instrs. retired |
298 |
259 (-13%) |
251 (-16%) |
239 (-20%) |
283 (-5%) |
459 (+54%) |
CPI |
1.47 |
1.38 (-6%) |
1.31 (-11%) |
1.28 (-13%) |
1.22 (-17%) |
1.1 (-25%) |
IPC |
0.68 |
0.725 (+7%) |
0.763 (+12%) |
0.784 (+15%) |
0.82 (+21%) |
0.909 (+34%) |
Code size |
68 |
71 (+4% ) |
77 (+13%) |
89 (+31%) |
118 (+74%) |
158 (+132%) |
Reference