bfloat16, also known as brain float 16, originated from the Google Brain team, which introduced this new floating-point standard to accelerate machine learning computations.
We know that floating-point numbers consist of three parts: Sign, Exponent, and Fraction (or Mantissa). The Exponent determines the range of representable values (the dynamic range), and the Fraction determines their precision. Within a fixed hardware budget, a trade-off between these two properties is inevitable. bfloat16 is only half the length of float32, so on the same CPU the ideal throughput of bfloat16 computation is double that of float32; equivalently, at the same throughput, processing speed ideally increases by 2x. In terms of storage, bfloat16 also saves roughly half the space compared to float32.
The existing IEEE 754 standard already defines a half-precision floating-point format, float16, which offers similar advantages over float32 in computation speed and storage space. Why, then, do we still need bfloat16?
The first reason: float16’s dynamic range is much narrower than bfloat16’s. With only 5 exponent bits, float16 tops out at about 65504, whereas bfloat16 keeps float32’s 8 exponent bits and therefore its range of roughly ±3.4×10^38. Google explains that “neural networks are more sensitive to the size of the exponent than the size of the mantissa.”
The second reason: a float16 multiplier occupies roughly twice the area of a bfloat16 multiplier. The physical size of a hardware multiplier grows with the square of the mantissa width. float16 has a 10-bit mantissa and bfloat16 a 7-bit mantissa; squaring these gives 100 and 49, and 100/49 ≈ 2, so the float16 multiplier is about twice as large.
The third reason: conversion between float16 and float32 is more difficult than between bfloat16 and float32. This is because the formats of float16 and float32 differ entirely in both exponent size and mantissa size, so during conversion, both the exponent and mantissa must be adjusted. In contrast, the only difference between bfloat16 and float32 is in the mantissa size, making bfloat16 much easier to convert.
bfloat16 format:
float16 format:
float32 format:
The process for converting between float32 and bfloat16 is as follows:
The first step in converting float32 to bfloat16 is to check whether the value is NaN. If it is NaN, the bfloat16 result must also be NaN. NaN can be further classified into QNaN (Quiet NaN) and SNaN (Signaling NaN): a QNaN represents an undefined result and propagates quietly through computations, while an SNaN raises an exception when operated on. Since we want to avoid exceptions during machine learning model computations, any NaN is forcibly converted to a QNaN. By definition, if the Most Significant Bit (MSB) of the mantissa is 1, the value is a QNaN; if the MSB is 0, it is an SNaN.
If the float32 value is not NaN, the lower 16 bits of the mantissa must be rounded away using the “Round to Nearest, ties to Even” mode: add 0x7FFF plus bit 16 of the float32 bit pattern (bit 16 is the future LSB of the bfloat16) to the 32-bit value. When the discarded bits are less than halfway, this leaves the kept bits unchanged; when they are more than halfway, it carries a 1 into the kept bits; and when they are exactly halfway, the extra LSB term makes the result round toward an even LSB.
Finally, shift the entire value to the right by 16 bits to convert it into bfloat16 format.
To convert bfloat16 back to float32, simply shift the value to the left by 16 bits.
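The three steps above can be sketched in C. This is a minimal sketch of the described algorithm, not the exact code used in this project; memcpy is used for the bit reinterpretation, which a union would accomplish equally well.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bytes as uint32_t and back. memcpy keeps the
 * conversion well-defined; a union achieves the same effect. */
static uint32_t f32_bits(float f)  { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    bits_f32(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* float32 -> bfloat16, following the three steps described above. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t u = f32_bits(f);
    if ((u & 0x7fffffff) > 0x7f800000)          /* NaN: exponent all ones, mantissa nonzero */
        return (uint16_t)((u >> 16) | 0x0040);  /* set mantissa MSB: force a quiet NaN */
    /* round to nearest, ties to even: add 0x7fff plus bit 16
     * (the future LSB of the bfloat16), then truncate */
    u += 0x7fff + ((u >> 16) & 1);
    return (uint16_t)(u >> 16);
}

/* bfloat16 -> float32: simply shift left by 16 bits. */
static float bf16_to_f32(uint16_t h) { return bits_f32((uint32_t)h << 16); }
```

For example, f32_to_bf16(1.0f) gives 0x3f80, which shifted left by 16 bits is 0x3f800000, the float32 pattern of 1.0.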
Why was a union used rather than pointer casts?
A union allows different data types to share the same memory space, so a float and a uint32_t can view the same four bytes, giving direct access to the floating-point number’s binary representation. In contrast, achieving the same effect by casting a float pointer to a uint32_t pointer and dereferencing it violates C’s strict-aliasing rule and is undefined behavior, whereas reading a different union member than the one last written is well-defined in C.
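As a small illustration (the helper name float_to_bits is mine, not from the original code):

```c
#include <stdint.h>

/* A union lets the same 4 bytes be read either as a float
 * or as its raw IEEE 754 bit pattern. */
typedef union {
    float    f;
    uint32_t u;
} f32_bits;

/* Return the bit pattern of a float. Writing .f and then reading .u
 * is well-defined type punning in C (C99 and later); a cast such as
 * *(uint32_t *)&x would break the strict-aliasing rule instead. */
static uint32_t float_to_bits(float f)
{
    f32_bits v;
    v.f = f;
    return v.u;   /* e.g. float_to_bits(1.0f) == 0x3f800000 */
}
```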
Given an integer array nums of size n, return the number with the value closest to 0 in nums. If there are multiple answers, return the number with the largest value.
Constraints:
Example 1:
Input: nums = [-4,-2,1,4,8]
Output: 1
Explanation:
The distance from -4 to 0 is |-4| = 4.
The distance from -2 to 0 is |-2| = 2.
The distance from 1 to 0 is |1| = 1.
The distance from 4 to 0 is |4| = 4.
The distance from 8 to 0 is |8| = 8.
Thus, the closest number to 0 in the array is 1.
Example 2:
Input: nums = [2,-1,1]
Output: 1
Explanation: 1 and -1 are both the closest numbers to 0,
so 1, being larger, is returned.
My intuitive idea is to keep a variable called closest_num for the number closest to 0, initialized to the first element of the array. A for loop then scans the array: if the absolute value of the current element is smaller than that of closest_num, the current element is closer to 0, so it replaces closest_num. If the absolute values are equal, I check whether the current element is greater than closest_num; if so, it is the larger of the two, so it replaces closest_num.
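This scan can be written directly in C as a sketch of the approach (the signature follows the usual LeetCode template; abs comes from stdlib.h):

```c
#include <stdlib.h>

/* Scan once, tracking the value closest to 0; on a tie in
 * absolute value, prefer the larger (positive) number. */
int findClosestNumber(const int *nums, int n)
{
    int closest_num = nums[0];
    for (int i = 1; i < n; i++) {
        int cur = abs(nums[i]), best = abs(closest_num);
        if (cur < best || (cur == best && nums[i] > closest_num))
            closest_num = nums[i];
    }
    return closest_num;
}
```

On the two examples above, findClosestNumber returns 1 in both cases.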
In the second version, I modified the first version’s design, where computing the absolute value jumped to a separate set_abs function, so that the absolute value is computed directly inside the findNonMinOrMax function. After this modification, the number of execution cycles decreased from 675 to 367, a very effective optimization.
After placing my second version of the assembly program into Ripes and successfully compiling it, I noticed that there were additional instructions that I hadn’t originally written. The reason for this is that my assembly program used some pseudo-instructions, which are created to make writing programs more convenient for programmers. In reality, the machine cannot execute pseudo-instructions. After being translated by the assembler, these pseudo-instructions are converted into equivalent assembly instructions. Additionally, I also noticed that the register names were converted from ABI names to the actual RISC-V register numbers.
I chose to use Ripes’ 5-stage processor, which includes a hazard detection unit and a forwarding unit. The architecture diagram is shown below:
The so-called 5-stage pipeline refers to the following stages:
1. IF (Instruction Fetch)
2. ID (Instruction Decode)
3. EX (Execution)
4. MEM (Memory Access)
5. WB (Write Back)
The key to allowing instructions to execute concurrently in a pipeline lies in the rectangular bars that separate each stage, known as pipeline registers. These registers hold each stage’s data and control signals, ensuring that the correct values are available to the next stage.
In the RISC-V ISA, machine code formats are divided into R, I, B, U, J, and S types. Below, I will introduce each instruction format in order and explain how they are executed in the pipeline.
In RISC-V assembly language, R-type instructions have two source registers and one destination register; examples include add, sub, and so on. The machine code format is as follows:
Taking add x29, x0, x7 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and know that for the add instruction, funct7 = 0000000, funct3 = 000, and opcode = 0110011. The source register rs1 is x0, which is converted into binary as 00000, and rs2 is x7, which is converted into binary as 00111. The destination register rd is x29, which is converted into binary as 11101. Finally, combining this information results in the 32-bit machine code:
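The field packing can be checked with a small C helper (my own sketch, not from the project code):

```c
#include <stdint.h>

/* Pack the R-type fields: funct7 | rs2 | rs1 | funct3 | rd | opcode. */
static uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* add x29, x0, x7: encode_rtype(0, 7, 0, 0, 29, 0x33) == 0x00700eb3 */
```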
Finally, let’s use Ripes to see how the add instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
The Decoder receives the instruction 0x00700eb3 and performs the decoding.
The Immediate Generate Unit, based on the input type being R-type, finds that there is no available immediate field, so it outputs the default value 0xdeadbeef. For more on why the default is 0xdeadbeef, you can find discussions on the topic in the Reddit forum.
The Register Files, based on the input R1 idx, find the corresponding x0 register and output the content of the x0 register (0x00000000).
The Register Files, based on the input R2 idx, find the corresponding x7 register and output the content of the x7 register (0xfffffffc).
Finally, the relevant data is stored in the ID/EX register.
Although the value of rs2 is input to Data in and Res is input to Addr., under an R-type instruction, the control signal for Memory write is 0, indicating that no data is written to Memory.
Based on the address pointed to by Res, we can observe the state of the Memory. The address used is only up to 0x00000098, and addresses after that have not been used, meaning that 0xfffffffc has not yet been allocated any data. Therefore, the output at that address (Read Out) is 0x00000000.
Finally, the relevant data is stored in the MEM/WB register.
After completing all stages, the registers are updated as follows:
We can see that 0xfffffffc was successfully written to the t4 register.
In RISC-V assembly language, I-type instructions mainly consist of load instructions and immediate instructions (except for lui and auipc), such as lw, addi, andi, and so on. The machine code format is as follows:
Taking addi x9, x0, 3 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that for the addi instruction, funct3 = 000 and opcode = 0010011. The source register rs1 is x0, which converts to binary as 00000, and the destination register rd is x9, which converts to binary as 01001. The immediate value 3 converts to binary as 000000000011. Finally, combining this information results in the 32-bit machine code:
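As with the R-type, the I-type packing can be sketched in C (my own helper, not from the project code; the immediate is masked to its 12-bit field):

```c
#include <stdint.h>

/* Pack the I-type fields: imm[11:0] | rs1 | funct3 | rd | opcode. */
static uint32_t encode_itype(int32_t imm, uint32_t rs1, uint32_t funct3,
                             uint32_t rd, uint32_t opcode)
{
    return (((uint32_t)imm & 0xfff) << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* addi x9, x0, 3: encode_itype(3, 0, 0, 9, 0x13) == 0x00300493 */
```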
Finally, let’s use Ripes to see how the addi instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
Although the value of rs2 is input to Data in and Res is input to Addr., under an I-type instruction, the control signal for memory write is 0, indicating that no data is written to Memory.
Based on the address pointed to by Res, we can observe the memory’s storage status. The address 0x00000003 corresponds to the address of Byte 3 at 0x00000000, while the other 3 bytes are taken from the address 0x00000004, meaning Byte 0, Byte 1, and Byte 2. Since RISC-V uses little-endian format, Byte 3 at 0x00000000 holds the least significant byte (LSB), and Byte 2 at 0x00000004 holds the most significant byte (MSB). Finally, the address outputs Read Out as 0x00091700. (Typically, memory accesses must follow alignment rules, where word accesses must be at addresses that are multiples of 4. However, this does not affect the pipeline as it doesn’t rely on the value read from memory.)
Finally, the relevant data is stored in the MEM/WB register.
After completing all stages, the register updates are as follows:
We can see that 0x00000003 was successfully written to the s1 register.
In RISC-V assembly language, S-type instructions are primarily related to Store Memory operations, such as sw, sb, and so on. The machine code format is as follows:
Taking sw x5, 32(x7) as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that for the sw instruction, funct3 = 010 and opcode = 0100011. The source register rs1 is x7, which converts to binary as 00111, and rs2 is x5, which converts to binary as 00101. The immediate value 32 converts to binary as 000000100000. Finally, combining this information results in the 32-bit machine code:
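The S-type splits its immediate into two fields, which the following C sketch (my own helper, not from the project code) makes explicit:

```c
#include <stdint.h>

/* Pack the S-type fields: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode. */
static uint32_t encode_stype(int32_t imm, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;
    return (((u >> 5) & 0x7f) << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | ((u & 0x1f) << 7) | opcode;
}

/* sw x5, 32(x7): encode_stype(32, 5, 7, 2, 0x23) == 0x0253a023 */
```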
Finally, let’s use Ripes to see how the sw instruction is executed in the pipeline. Since the previous code did not include the sw instruction, a small example is created as follows:
The value of rs2 (0x00000064) is input to Data in, and Res (0x00000020) is input to Addr.. In the case of an S-type instruction, the Memory write control signal is 1, indicating that data will be written to Memory.
Based on the address pointed to by Res, we can check the Memory’s current status. The address 0x00000020 has not yet been allocated data, and the write operation has not yet occurred at this point, so the address outputs Read Out as 0x00000000.
Finally, the relevant data is stored in the MEM/WB register.
After the instruction completes all stages, the Memory is updated as follows:
We can see that 0x00000064 was successfully written to the memory address 0x00000020.
In RISC-V assembly language, B-type instructions primarily involve branch operations, such as beq, bne, bge, and so on. The machine code format is as follows:
We can see that the machine code format has no position for immediate[0]. Branch targets are restricted to even addresses (2-byte alignment), so immediate[0] is always 0 and need not be stored: the actual immediate is 13 bits, and after reassembling the 12 stored bits, a 0 is appended on the right.
Therefore, the jump range for branch instructions is:
Taking bne x5, x0, -60 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that for the bne instruction, funct3 = 001 and opcode = 1100011. The source register rs1 is x5, which converts to binary as 00101, and rs2 is x0, which converts to binary as 00000. The immediate value -60 converts to binary as 1111111000100. Finally, combining this information results in the 32-bit machine code:
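The B-type scatters its immediate bits across the instruction word; the following C sketch (my own helper, not from the project code) shows exactly where each bit lands:

```c
#include <stdint.h>

/* Pack the B-type fields. The 13-bit immediate (imm[0] always 0) is
 * scattered as imm[12|10:5] rs2 rs1 funct3 imm[4:1|11] opcode. */
static uint32_t encode_btype(int32_t imm, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;
    return (((u >> 12) & 1)    << 31) |  /* imm[12]   */
           (((u >> 5)  & 0x3f) << 25) |  /* imm[10:5] */
           (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
           (((u >> 1)  & 0xf)  << 8)  |  /* imm[4:1]  */
           (((u >> 11) & 1)    << 7)  |  /* imm[11]   */
           opcode;
}

/* bne x5, x0, -60: encode_btype(-60, 0, 5, 1, 0x63) */
```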
Finally, let’s use Ripes to see how the bne instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
The value of rs1 and the other two forwarding paths form the first-level 3x1 multiplexer. Since the previous instruction (addi x5, x5, -1) modified the x5 register, the forwarding function will be used here. Therefore, the first-level 3x1 multiplexer passes through the value from the previous instruction’s result (Res), generated in the EX stage, which is 0x00000004. In the second-level 2x1 multiplexer, composed of the output from the first-level 3x1 multiplexer and the program counter value, since the Branch Unit determines that a jump will occur, the second-level 2x1 multiplexer passes through the program counter value (0x00000074) into Op1.
The value of rs2 and the other two forwarding paths form the first-level 3x1 multiplexer, and since the forwarding function is not used here, the first-level 3x1 multiplexer passes through the value of rs2 (0x00000000). In the second-level 2x1 multiplexer, composed of the output from the first-level 3x1 multiplexer and the program counter value, since the Branch Unit determines that a jump will occur, the second-level 2x1 multiplexer passes through the immediate value (0xffffffc4) into Op2.
The Branch Unit determines that the branch instruction will jump.
The ALU performs addition based on the control signals, adding 0x00000074 and 0xffffffc4, resulting in Res being 0x00000038.
After calculating the address, in the next cycle, the value of Res will be written to the Program Counter.
Finally, the relevant data is stored in the EX/MEM register.
We can see that when the branch instruction decides to jump, the two instructions immediately following the branch are cleared and replaced with nop (no operation).
The value of rs2 (0x00000000) is input into Data in, and Res (0x00000038) is input into Addr. However, under a B-type instruction, the Memory write control signal is 0, indicating that data cannot be written to Memory.
Based on the address pointed to by Res, we can check the memory status. The address 0x00000038 stores 0x0009a383, so the output from that address (Read Out) is 0x0009a383.
Finally, the relevant data is stored in the MEM/WB register.
After completing all stages, the register updates are as follows:
Since the bne instruction is preceded by the addi x5, x5, -1 instruction, the x5 register will be modified from 0x00000005 to 0x00000004. The bne instruction does not change the value of the register, so the x5 register remains 0x00000004.
In RISC-V assembly language, there are only two U-type instructions: lui and auipc. The machine code format is as follows:
We can see that the machine code format has no position for immediate[11:0]. This is because lui and auipc operate on the upper 20 bits: the stored immediate is effectively left-shifted by 12 bits, so immediate[11:0] of the result is always 12 zeros.
Taking auipc x18, 0x10000 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that the auipc opcode is 0010111. The destination register rd is x18, which converts to binary as 10010, and the immediate value 0x10000 converts to binary as 00010000000000000000. Finally, combining this information results in the 32-bit machine code:
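The U-type is the simplest packing, as this C sketch (my own helper, not from the project code) shows:

```c
#include <stdint.h>

/* Pack the U-type fields: imm[31:12] | rd | opcode. */
static uint32_t encode_utype(uint32_t imm20, uint32_t rd, uint32_t opcode)
{
    return (imm20 << 12) | (rd << 7) | opcode;
}

/* auipc x18, 0x10000: encode_utype(0x10000, 18, 0x17) == 0x10000917 */
```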
Finally, let’s use Ripes to see how the auipc instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
Finally, the relevant data is stored in the MEM/WB register.
After completing all stages, the register updates are as follows:
We can see that 0x10000004 was successfully written to the s2 register.
In RISC-V assembly language, there is only one J-type instruction: jal. The machine code format is as follows:
Taking jal x1, 28 as an example for conversion into machine code, we can refer to the 2024 RISC-V spec. (page 554) and find that the jal opcode is 1101111. The destination register rd is x1, which converts to binary as 00001, and the immediate value 28 converts to binary as 00000000000000011100. Finally, combining this information results in the 32-bit machine code:
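Like the B-type, the J-type scatters its immediate; this C sketch (my own helper, not from the project code) shows the bit placement:

```c
#include <stdint.h>

/* Pack the J-type fields. The 21-bit immediate (imm[0] always 0) is
 * stored as imm[20|10:1|11|19:12] rd opcode. */
static uint32_t encode_jtype(int32_t imm, uint32_t rd, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;
    return (((u >> 20) & 1)     << 31) |  /* imm[20]    */
           (((u >> 1)  & 0x3ff) << 21) |  /* imm[10:1]  */
           (((u >> 11) & 1)     << 20) |  /* imm[11]    */
           (((u >> 12) & 0xff)  << 12) |  /* imm[19:12] */
           (rd << 7) | opcode;
}

/* jal x1, 28: encode_jtype(28, 1, 0x6f) */
```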
Finally, let’s use Ripes to see how the jal instruction is executed in the pipeline. Using the code above as an example, here is an excerpt of the relevant portion of the code:
The value of rs1 and the other two forwarding paths form the first-level 3x1 multiplexer. Since the forwarding function is not used here, the first-level 3x1 multiplexer passes through the value of rs1 (0x00000000). In the second-level 2x1 multiplexer, which consists of the output from the first-level 3x1 multiplexer and the Program Counter value, since the jal instruction adds the Program Counter value to the immediate constant, the second-level 2x1 multiplexer passes through the Program Counter value (0x00000014) into Op1.
The value of rs2 and the other two forwarding paths form the first-level 3x1 multiplexer. Since the forwarding function is not used here, the first-level 3x1 multiplexer passes through the value of rs2 (0x00000000). In the second-level 2x1 multiplexer, which consists of the output from the first-level 3x1 multiplexer and the Program Counter value, since the jal instruction adds the Program Counter value to the immediate constant, the second-level 2x1 multiplexer passes through the immediate value (0x0000001c) into Op2.
The ALU performs addition based on the control signals, adding 0x00000014 and 0x0000001c, resulting in Res being 0x00000030.
After calculating the address, in the next cycle, the value of Res will be written to the Program Counter.
Finally, the relevant data is stored in the EX/MEM register.
We can see that when the jal instruction jumps, the two instructions immediately following the jal are cleared and replaced with nop (no operation).
Finally, the relevant data is stored in the MEM/WB register.
5. WB (Write Back)
We can see that 0x00000018 was successfully written to the ra register.
In this architecture, executing instructions can lead to some issues, such as data hazards. A data hazard occurs when a later instruction requires data from an earlier instruction, but the data from the earlier instruction has not yet been written back to the register. As a result, the later instruction may access stale data, leading to incorrect results. If the data is incorrect, the final results of the executed instructions will not match the intended outcomes of the original instructions.
The forwarding unit is one solution to address data hazards. The core idea is to send the freshly produced data directly to the instruction that needs it. Using the code from above as an example:
After executing from the beginning to the 3rd cycle, the progress of each instruction in the pipeline and the contents stored in the registers are shown in the diagram below:
We can see that the instruction in the EX stage (auipc x18, 0x10000) needs to write back to the x18 register, while the instruction in the ID stage (addi x18, x18, 52) also needs to read from the x18 register; in the ID stage it has already read stale data (x18 = 0x00000000). Next, we observe the progress of each instruction in the pipeline and the contents stored in the registers after the 4th cycle, as shown in the diagram below:
Thanks to the forwarding unit, the result produced by auipc x18, 0x10000, now one stage ahead, is routed back to the EX stage (marked in yellow). This forwarded value, together with Reg1 (marked in red) and another forwarding path (marked in blue), feeds a 3x1 multiplexer. Having detected the data hazard between auipc x18, 0x10000 and addi x18, x18, 52, the control logic selects the yellow forwarding path, so addi x18, x18, 52 computes with the correct value of x18 (x18 = 0x10000004).
Finally, we manually calculate addi x18, x18, 52 (0x10000004 + 52 = 0x10000004 + 0x34), and the execution result sets x18 to 0x10000038. We check the result after the addi x18, x18, 52 instruction passes through the WB stage and confirm that this is indeed the case:
Although forwarding provides a way to resolve data hazards, there is a specific type of data hazard called Load-use data hazard that cannot be resolved simply through forwarding. A Load-use data hazard occurs when the destination register of a Load instruction is the same as the source register of the subsequent instruction. This is because the Load instruction needs to read data through the MEM stage before it can forward the required data. However, by the time the Load instruction reads the data from the MEM stage, the subsequent instruction has already computed its result in the EX stage, making it impossible to use the correct data. If the subsequent instruction could be stalled for one cycle, it would have time to use the forwarded data correctly.
The solution is to add a hazard detection unit that, when the destination register of the Load instruction matches the source register of the following instruction, stalls the subsequent instruction for 1 cycle, thus resolving the Load-use data hazard.
Here is a code example:
After executing from the beginning to the 3rd cycle, the progress of each instruction in the pipeline and the contents stored in the registers are shown in the diagram below:
We can see that the instruction in the EX stage (lw x6, 0(x5)) needs to write back to the x6 register, while the instruction in the ID stage (add x6, x7, x6) also needs to read from the x6 register. This will be detected as a Load-use data hazard by the hazard detection unit. Next, we observe the progress of each instruction in the pipeline and the contents stored in the registers after the 4th cycle, as shown in the diagram below:
The instruction originally in the ID stage (add x6, x7, x6) should have moved to the EX stage in the next cycle. However, because the hazard detection unit detected a Load-use data hazard, it stalled both the ID stage and the IF stage instructions for one cycle and cleared the EX stage to a NOP instruction. Next, we observe the progress of each instruction in the pipeline and the contents stored in the registers after the 5th cycle, as shown in the diagram below:
We can see that after stalling for 1 cycle, the instruction (add x6, x7, x6) can use the memory forwarding path (marked in yellow). The forwarded data is the word loaded from memory at the address held in x5 (0x00000064), destined for the x6 register. From the diagram below, the value stored at address 0x00000064 is 0x00000000, so the forwarding path delivers 0x00000000. Finally, manually computing add x6, x7, x6 (0x00000000 + 0x00000000 = 0x00000000) gives x6 = 0x00000000.
We check the result after the instruction add x6, x7, x6 passes through the WB stage, and we confirm that this is indeed the case:
The RISC-V Instruction Set Manual Volume I (Version 2024/04/11)