Assignment1: RISC-V Assembly and Instruction Pipeline === contributed by <[`5k6m4`](https://github.com/5k6m4)> # Experimental Environment The experimental environment is a 5-stage in-order pipeline processor with hazard detection and data forwarding. The pipeline consists of five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory (MEM), and write back (WB). The following provides a brief introduction to each stage using an example assembly code and shows the functionality using the [Ripes](https://github.com/mortbopet/Ripes) simulator. ### Example Assembly Code ```c .data test_data: .word 0x0000ffff .text main: # demonstrate memory operation and data hazard la t0, test_data # load the address of test_data lw t1, 0(t0) # t1 = 0x0000ffff addi t1, x0, 0 # t1 = 0 addi t2, t1, 2 # t2 = 2 sw t1, 0(t0) # mem[t0] = 0 lw t2, 0(t0) # t2 = 0 if: # demonstrate control hazard beq t1, x0, exit addi t2, x0, 2 addi t3, x0, 3 addi t4, x0, 4 exit: li a7, 10 # system call code for exiting the program ecall # make the exit system call ``` ### Example Pipeline State ![pipeline_example](https://hackmd.io/_uploads/HyCbBscJke.png) **Instruction Fetch (IF)** - The processor loads an instruction using the value of the program counter (PC) as the address. - In the example pipeline state, the processor load the instruction `sw x6, 0, x5`. **Instruction Decode (ID)** - The instruction is decoded to obtain the source register index and the immediate value. Next, the register file is accessed to retrieve the value of the source register. - The instruction in the ID stage is `addi x7, x6, 2`. Therefore, one operand is set to the value of register `x6`, and the other is the immediate value `2`. **Execute (EX)** - The ALU performs the operation specified by the instruction in this stage. - In the example where the instruction is `addi x6, x0, 0`, the ALU adds the value in register `x0` to the immediate value `0`. **Memory (MEM)** - If the instruction in this stage is a load/store instruction, the processor accesses the memory using the address stored in the source register. - When the load instruction `lw x6, 0(x5)` is in this stage, the CPU requests to read the value stored at the address contained in register `x5`. **Write Back (WB)** - The value computed from the instruction of access from the memory will be written back to the destination register in this stage. - In the example pipeline stage, the instruction in the WB stage is `addi x5, x5, 0`. Therefore, the processor writes the value computed from the EX stage to register `x5`. # [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) ProblemB :::danger Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data. ::: ## C Program ```c #include <stdint.h> typedef struct { uint16_t bits; } bf16_t; static inline float bf16_to_fp32(bf16_t h) { union { float f; uint32_t i; } u = {.i = (uint32_t)h.bits << 16}; return u.f; } static inline bf16_t fp32_to_bf16(float s) { bf16_t h; union { float f; uint32_t i; } u = {.f = s}; if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */ h.bits = (u.i >> 16) | 64; /* force to quiet */ return h; } h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10; return h; } ``` ## bf16_to_fp32() ### Test Case | | Input in bf16 | Expected Output in fp32 | Description | | ----- | ------------: | ----------------------: | :---------- | | Case0 | 0x3f80 | 0x3f800000 | normal positive of bf16 | | Case1 | 0xbf80 | 0xbf800000 | normal negative of bf16 | | Case2 | 0x0000 | 0x00000000 | positive zero of bf16 | | Case3 | 0x7f80 | 0x7f800000 | positive infinity of bf16 | | Case4 | 0x7fc0 | 0x7fc00000 | quiet NaN of bf16 | | Case5 | 0x0001 | 0x00010000 | subnormal positive of bf16 | | Case6 | 0x8001 | 0x80010000 | subnormal negative of bf16 | | Case7 | 0x7f7f | 0x7f7f0000 | normal maximum of bf16 | | Case8 | 0x0080 | 0x00800000 | normal minimum of bf16 | ### Assembly Code ```c .data # test cases pair = {input bf16 value, golden fp32 value} test_case: .word 0x00003f80, 0x3f800000 # 0: normal positive of bf16 .word 0x0000bf80, 0xbf800000 # 1: normal negative of bf16 .word 0x00000000, 0x00000000 # 2: positive zero of bf16 .word 0x00007f80, 0x7f800000 # 3: positive infinity of bf16 .word 0x00007fc0, 0x7fc00000 # 4: quiet NaN of bf16 .word 0x00000001, 0x00010000 # 5: subnormal positive of bf16 .word 0x00008001, 0x80010000 # 6: subnormal negative of bf16 .word 0x00007f7f, 0x7f7f0000 # 7: normal maximum of bf16 .word 0x00000080, 0x00800000 # 8: normal minimum of bf16 err_str0: .string "TestCase" err_str1: .string " of bf16_to_fp32() failed!!\n" pass_str: .string "Pass all test cases of bf16_to_fp32()!!\n" .text main: la s0, test_case # load the address of test_case to s0 # Initialize the test error to zero addi s1, x0, 0 # s1 = test_err = 0 # Initialize local variables for the mainLoop addi s2, x0, 0 # s2 = i = 0 addi s3, x0, 9 # s3 = const. = 9 mainLoop: bgeu s2, s3, endMainLoop # when i >= 9, end mainLoop # Load the argument that bf16_to_fp32() needs lw a0, 0(s0) # Function call jal ra, bf16_to_fp32 # jump to functioin bf16_to_fp32 # Compare bf16_to_fp32() result with the golden lw s4, 4(s0) # load golden, s4 = golden value beq a0, s4, passTest # if result == golden, pass test case # Update test error addi s1, s1, 1 # test_err++ # Print failed message la a0, err_str0 # load the address of err_str0 li a7, 4 # system call code for printing a string ecall # print err_str0 addi a0, s2, 0 # load the number of the test case li a7, 1 # system call code for printing a integer ecall # print the number of the test case la a0, err_str1 # load the address of err_str1 li a7, 4 # system call code for printing a string ecall # print err_str1 passTest: # Update test case and loop variable addi s0, s0, 8 # move s0 to the address of the next test case addi s2, s2, 1 # i++ j mainLoop # jal x0, mainLoop endMainLoop: bne s1, x0, exit # if test_err != 0, do not print pass message # Print pass message la a0, pass_str # load the address of pass_str li a7, 4 # system call code for printing a string ecall # print pass_str16 exit: li a7, 10 # system call code for exiting the program ecall # make the exit system call bf16_to_fp32: slli a0, a0, 16 # a0 = h.bits << 16 ret # jalr x0, ra, 0 ``` ## fp32_to_bf16() ### Test Case | | Input in fp32 | Expected Output in bf16 | Description | | ----- | ------------: | ----------------------: | :---------- | | Case0 | 0x41200000 | 0x4120 | normal positive of fp32 | | Case1 | 0xc1200000 | 0xc120 | normal negative of fp32 | | Case2 | 0x00000000 | 0x0000 | positive zero of fp32 | | Case3 | 0x7f800000 | 0x7f80 | positive infinity of fp32 | | Case4 | 0x7fc00000 | 0x7fc0 | quiet NaN of fp32 | | Case5 | 0x00000001 | 0x0000 | subnormal positive of fp32 | | Case6 | 0x80000001 | 0x8000 | subnormal negative of fp32 | | Case7 | 0x7f7fffff | 0x7f80 | normal maximum of fp32 | | Case8 | 0x00800000 | 0x0080 | normal minimum of fp32 | ### Assembly Code ```c .data # test cases pair = {input fp32 value, golden bf16 value} test_case: .word 0x41200000, 0x00004120 # 0: normal positive of fp32 .word 0xc1200000, 0x0000c120 # 1: normal negative of fp32 .word 0x00000000, 0x00000000 # 2: positive zero of fp32 .word 0x7f800000, 0x00007f80 # 3: positive infinity of fp32 .word 0x7fc00000, 0x00007fc0 # 4: quiet NaN of fp32 .word 0x00000001, 0x00000000 # 5: subnormal positive of fp32 .word 0x80000001, 0x00008000 # 6: subnormal negative of fp32 .word 0x7f7fffff, 0x00007f80 # 7: normal maximum of fp32 .word 0x00800000, 0x00000080 # 8: normal minimum of fp32 err_str0: .string "TestCase" err_str1: .string " of fp32_to_bf16() failed!!\n" pass_str: .string "Pass all test cases of fp32_to_bf16()!!\n" .text main: la s0, test_case # load the address of test_case to s0 # Initialize the test error to zero addi s1, x0, 0 # s1 = test_err = 0 # Initialize local variables for the mainLoop addi s2, x0, 0 # s2 = i = 0 addi s3, x0, 9 # s3 = const. = 9 mainLoop: bgeu s2, s3, endMainLoop # when i >= 9, end mainLoop # Load the argument that fp32_to_bf16() needs lw a0, 0(s0) # Function call jal ra, fp32_to_bf16 # jump to functioin fp32_to_bf16 # Compare fp32_to_bf16() result with the golden lw s4, 4(s0) # load golden, s4 = golden value beq a0, s4, passTest # if result == golden, pass test case # Update test error addi s1, s1, 1 # test_err++ # Print failed message la a0, err_str0 # load the address of err_str0 li a7, 4 # system call code for printing a string ecall # print err_str0 addi a0, s2, 0 # load the number of the test case li a7, 1 # system call code for printing a integer ecall # print the number of the test case la a0, err_str1 # load the address of err_str1 li a7, 4 # system call code for printing a string ecall # print err_str1 passTest: # Update test case and loop variable addi s0, s0, 8 # move s0 to the address of the next test case addi s2, s2, 1 # i++ j mainLoop # jal x0, mainLoop endMainLoop: bne s1, x0, exit # if test_err != 0, do not print pass message # Print pass message la a0, pass_str # load the address of pass_str li a7, 4 # system call code for printing a string ecall # print pass_str16 exit: li a7, 10 # system call code for exiting the program ecall # make the exit system call fp32_to_bf16: addi t0, a0, 0 # pass argument s, t0 = s li t1, 0x7fffffff # t1 = 0x7fffffff and t1, t0, t1 # t1 = u.i & 0x7fffffff li t2, 0x7f800000 # t2 = 0x7f800000 if: bge t2, t1, endIf # if 0x7f800000 >= (u.i & 0x7fffffff), skip NaN handling # Handle NaN value srli t1, t0, 16 # t1 = u.i >> 16 ori a0, t1, 64 # t1 = h = (u.i >> 16) | 64 ret # jalr x0, ra, 0 endIf: # Round to nearest even value srli t1, t0, 16 # t1 = u.i >> 0x10 andi t1, t1, 1 # t1 = ((u.i >> 0x10) & 1) li t2, 0x7fff # t2 = 0x7fff add t1, t2, t1 # t1 = 0x7fff + ((u.i >> 0x10) & 1) add t1, t0, t1 # t1 = u.i + (0x7fff + ((u.i >> 0x10) & 1)) srli a0, t1, 16 # t1 = h = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10; ret # jalr x0, ra, 0 ``` # Hamming Distance ([Leetcode 461](https://leetcode.com/problems/hamming-distance/description/)) > Description: The [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between two integers is the number of positions at which the corresponding bits are different. > > Given two integers `x` and `y`, return *the **Hamming distance** between them*. ## C Program ```c #include <stdint.h> int hammingDistance(int x, int y) { int n = x ^ y; int count = 0; while (n > 0) { count += n & 1; n = n >> 1; } return count; } ``` :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: ## Test Case | | Input1 | Input2 | Golden | Description | | ----- | :----: | :----: | :----: | :---------: | | Case0 | 0x0000000f | 0x0000000f | 0 | no bit is different | | Case1 | 0x0000000f | 0x0000001f | 1 | odd different bit | | Case2 | 0x0000000f | 0x0000003f | 2 | even different bits | | Case3 | 0x0f0f0f0f | 0xf0f0f0f0 | 32 | all bits are different | ## Assembly Code ```c .data test_case0: .word 0x0000000f, 0x0000000f test_case1: .word 0x0000000f, 0x0000001f test_case2: .word 0x0000000f, 0x0000003f test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0 golden: .word 0, 1, 2, 32 err_str0: .string "TestCase" err_str1: .string " failed!!\n" pass_str: .string "Pass all test cases!!\n" .text main: la s0, test_case0 # load the address of test_case0 to s0 la s1, golden # load the address of golden to s1 # Initialize the test error to zero addi t0, x0, 0 # t0 = test_err = 0 # Initialize local variable for the mainLoop addi t1, x0, 0 # t1 = i = 0 addi t2, x0, 4 # t2 = const. = 4 mainLoop: bgeu t1, t2, endMainLoop # when i >= 4, end mainLoop # Prepare the arguments that hammingDistance() need lw a0, 0(s0) # load argument x, a0 = x lw a1, 4(s0) # load argument y, a1 = y # RISC-V calling convention and function call addi sp, sp, -12 # push t0 ~ t2 to the stack sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) jal ra, hammingDistance # jump to function hammingDistance lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) addi sp, sp, 12 # pop t0~t2 from the stack # Compare hammingDistance() result with the golden lw t3, 0(s1) # load golden, t3 = golden[i] beq a0, t3, passTest # if the result == golden, pass test # Update test error addi t0, t0, 1 # test_err++ # Print faild message la a0, err_str0 # load the address of err_str0 li a7, 4 # system call code for printing a string ecall # print err_str0 addi a0, t1, 0 # load the number of the test case li a7, 1 # system call code for printing a integer ecall # print the number of the test case la a0, err_str1 # load the address of err_str1 li a7, 4 # system call code for printing a string ecall # print err_str1 passTest: # Update test case, golden and loop variable addi s0, s0, 8 # move s0 to next test_case addi s1, s1, 4 # move s1 to nest golden addi t1, t1, 1 # i++ j mainLoop # jal x0, mainLoop endMainLoop: bne t0, x0, exit # if test_err != 0, do not print pass message # Print pass message la a0, pass_str # load the address of pass_str li a7, 4 # system call code for printing a string ecall # print pass_str exit: li a7, 10 # system call code for exiting the program ecall # make the exit system call hammingDistance: # RISC-V calling convention addi sp, sp, -8 # push s0 and s1 to stack sw s0, 0(sp) sw s1, 4(sp) # Pass argument used in hammingDistance() addi s0, a0, 0 # pass argument x, s0 = x addi s1, a1, 0 # pass argument y, s1 = y # Initialize local variables of hammingDistance() xor t0, s0, s1 # n = x ^ y addi t1, x0, 0 # count = 0 loop: bgeu x0, t0, endLoop # when 0 >= n, end loop andi t2, t0, 1 # t2 = n & 1 add t1, t1, t2 # count += n & 1 srli t0, t0, 1 # n = n >> 1 j loop # jal x0, loop endLoop: # RISC-V calling convention lw s0, 0(sp) lw s1, 4(sp) addi sp, sp, 8 # pop s0 and s1 from stack # Return count addi a0, t1, 0 # move t1 to a0, a0 = count ret # jalr x0, ra, 0 ``` | Metrics | Value | | :------ | ----: | | Cycles | 494 | | Instrs. retired | 354 | | CPI | 1.4 | | IPC | 0.717 | | Code size | 82 | :::danger Use fewer instructions. ::: ## Assembly Optimization ### Seperate the use of callee-saved and caller-saved registers ```c .data test_case0: .word 0x0000000f, 0x0000000f test_case1: .word 0x0000000f, 0x0000001f test_case2: .word 0x0000000f, 0x0000003f test_case3: .word 0x0f0f0f0f, 0xf0f0f0f0 golden: .word 0, 1, 2, 32 err_str0: .string "TestCase" err_str1: .string " failed!!\n" pass_str: .string "Pass all test cases!!\n" .text main: la s0, test_case0 # load the address of test_case0 to s0 la s1, golden # load the address of golden to s1 # Initialize the test error to zero addi s2, x0, 0 # s2 = test_err = 0 # Initialize local variable for the mainLoop addi s3, x0, 0 # s3 = i = 0 addi s4, x0, 4 # s4 = const. = 4 mainLoop: bgeu s3, s4, endMainLoop # when i >= 4, end mainLoop # Prepare the arguments that hammingDistance() need lw a0, 0(s0) # load argument x, a0 = x lw a1, 4(s0) # load argument y, a1 = y # Function call jal ra, hammingDistance # jump to function hammingDistance # Compare hammingDistance() result with the golden lw s5, 0(s1) # load golden, s5 = golden[i] beq a0, s5, passTest # if the result == golden, pass test # Update test error addi s2, s2, 1 # test_err++ # Print faild message la a0, err_str0 # load the address of err_str0 li a7, 4 # system call code for printing a string ecall # print err_str0 addi a0, s3, 0 # load the number of the test case li a7, 1 # system call code for printing a integer ecall # print the number of the test case la a0, err_str1 # load the address of err_str1 li a7, 4 # system call code for printing a string ecall # print err_str1 passTest: # Update test case, golden and loop variable addi s0, s0, 8 # move s0 to next test_case addi s1, s1, 4 # move s1 to nest golden addi s3, s3, 1 # i++ j mainLoop # jal x0, mainLoop endMainLoop: bne s2, x0, exit # if test_err != 0, do not print pass message # Print pass message la a0, pass_str # load the address of pass_str li a7, 4 # system call code for printing a string ecall # print pass_str exit: li a7, 10 # system call code for exiting the program ecall # make the exit system call hammingDistance: # Pass argument used in hammingDistance() addi t0, a0, 0 # pass argument x, t0 = x addi t1, a1, 0 # pass argument y, t1 = y # Initialize local variables of hammingDistance() xor t2, t0, t1 # n = x ^ y addi t3, x0, 0 # count = 0 loop: bgeu x0, t2, endLoop # when 0 >= n, end loop andi t4, t2, 1 # t4 = n & 1 add t3, t3, t4 # count += n & 1 srli t2, t2, 1 # n = n >> 1 j loop # jal x0, loop endLoop: # Return count addi a0, t3, 0 # move t3 to a0, a0 = count ret # jalr x0, ra, 0 ``` | Metrics | Value | | :------ | ----: | | Cycles | 438 | | Instrs. retired | 298 | | CPI | 1.47 | | IPC | 0.68 | | Code size | 68 | ### Loop Unrolling #### Unrolling 2 Times ```c loop: bgeu x0, t2, endLoop # if n > 0 (0 >= n), end loop andi t4, t2, 1 # t4 = n & 1 add t3, t3, t4 # count += n & 1 srli t2, t2, 1 # n = n >> 1 andi t4, t2, 1 # t4 = (n >> 1) & 1 add t3, t3, t4 # count += (n >> 1) & 1 srli t2, t2, 1 # n = n >> 2 j loop # jal x0, loop ``` | Metrics | Value | | :------ | ----: | | Cycles | 357 | | Instrs. retired | 259 | | CPI | 1.38 | | IPC | 0.725 | | Code size | 71 | #### Unrolling 4 Times ```c loop: bgeu x0, t2, endLoop # if n > 0 (0 >= n), end loop andi t4, t2, 1 # t4 = n & 1 add t3, t3, t4 # count += n & 1 srli t2, t2, 1 # n = n >> 1 andi t4, t2, 1 # t4 = (n >> 1) & 1 add t3, t3, t4 # count += (n >> 1) & 1 srli t2, t2, 1 # n = n >> 2 andi t4, t2, 1 # t4 = (n >> 2) & 1 add t3, t3, t4 # count += (n >> 2) & 1 srli t2, t2, 1 # n = n >> 3 andi t4, t2, 1 # t4 = (n >> 3) & 1 add t3, t3, t4 # count += (n >> 3) & 1 srli t2, t2, 1 # n = n >> 4 j loop # jal x0, loop ``` | Metrics | Value | | :------ | ----: | | Cycles | 329 | | Instrs. retired | 251 | | CPI | 1.31 | | IPC | 0.763 | | Code size | 77 | #### Unrolling Times Analysis | Metrics | w/o Unrolling | 2 Times | 4 Times | 8 Times | 16 Times | 32 Times | | :------ | ------------: | ------: | ------: | ------: | -------: | -------: | | Cycles | 438 | 357 (-18%) | 329 (-25%) | 305 (-30%) | 345 (-21%) | 505 (+15%) | | Instrs. retired | 298 | 259 (-13%) | 251 (-16%) | 239 (-20%) | 283 (-5%) | 459 (+54%) | | CPI | 1.47 | 1.38 (-6%) | 1.31 (-11%) | 1.28 (-13%) | 1.22 (-17%) | 1.1 (-25%) | | IPC | 0.68 | 0.725 (+7%) | 0.763 (+12%) | 0.784 (+15%) | 0.82 (+21%) | 0.909 (+34%) | | Code size | 68 | 71 (+4% )| 77 (+13%) | 89 (+31%) | 118 (+74%) | 158 (+132%) | # Reference - [bfloat16 floating-point format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) - [The RISC-V Instruction Set Manual Volume I: User-Level ISA](https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf)