Assignment1: RISC-V Assembly and Instruction Pipeline

contributed < vestata >

Problem B in Quiz1

C code

fp32_to_bf16

static inline bf16_t fp32_to_bf16(float s)
{
    bf16_t h;
    union {
        float f;
        uint32_t i;
    } u = {.f = s};
    if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */
        h.bits = (u.i >> 16) | 64;         /* force to quiet */
        return h;                                                                                                                                             
    }
    h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
    return h;
}

This code is a function written in C that converts a 32-bit floating-point number (FP32) into a 16-bit floating-point format called bfloat16 (BF16).
Recall that:

        ┌ sign 
        │
        │   ┌ exponent
        │   │
        │   │      ┌ mantissa 
        │   │      │
        │┌──┴───┐┌─┴───┐
      0b0000000000000000 bfloat16

Handling special case: In the bfloat16 format, a value is considered NaN (Not a Number) when the exponent is all 1s (11111111 in binary), and the mantissa (fraction) is not entirely zero. This is similar to how NaNs are represented in the IEEE 754 floating-point formats.

If the exponent is all 1s but the mantissa is completely zero, the value represents infinity (positive or negative, depending on the sign bit). To distinguish NaN from infinity, at least one bit in the mantissa must be set to 1.

In particular, setting the first bit of the mantissa to 1 by | 64 ensures that the value is treated as NaN instead of infinity. Single-precision_floating-point wiki

In normal case: The expression (u.i >> 0x10) & 1 checks whether the bfloat16 result would be even or odd. It does this by examining the least significant bit of the lower 16 bits that will be discarded during the conversion.If the value is odd (i.e., the result of the check is 1), the code adds an extra 1 to u.i to help with rounding.After this adjustment, the value is right-shifted by 16 bits (>> 0x10), effectively converting the 32-bit float to bfloat16 by keeping the upper 16 bits.

This ensures that the conversion properly handles rounding by following the "round-to-nearest-even" rule.

The bfloat16 is stored in the lower 16 bits of a 32-bit word.

RISC-V Assembly Implementation

handling union in risc-v

fp32_to_bf16:
    # Input: a0 = 32-bit float value
    # Output: a0 = 16-bit bfloat16 value in the lower section of uint32_t
    add t0, x0, a0
    
    # special case
    li t1, 0x7fffffff       # set t1 to 0x7fffffff
    and t2, t0, t1          # set t2 to u.i & 0x7fffffff
    li t3, 0x7f800000       # set t3 to 0x7f800000
    bltu t3, t2, handle_nan
    # normal case
    srli t5, a0, 0x10       # (u.i >> 0x10)
    andi t5, t5, 1          #  & 1
    li t6, 0x7fff           # set t6 to 0x7fff
    add t5, t5, t6          # (0x7fff + ((u.i >> 0x10) & 1))
    add a0, a0, t5          # apply round-to-nearest-even
    srli a0, a0, 16         # move bf16 to less significant 16 bit
    jr ra
    
handle_nan:
    add t4, x0, a0          # set t4 to a0
    srli t4, t4, 16         # u.i >> 16
    ori t4, t4, 64          # | 64, /* force to quiet */
    add a0, x0, t4          # set a0 to return
    jr ra

Problem C in Quiz1

version 1

C code

static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}

Assembly code

main:
    li a0, 0x13
    jal ra, my_clz
    
    li a7, 10
    ecall

my_clz:
    # Input: a0 = uint32_t
    # Output: a0 = clz

    li t0, 0            # count = 0
    li t1, 31           # i = 31
clz_loop:
    li t2, 1
    sll t2, t2, t1      # (1U << i)
    and t2, t2, a0      # x & (1U << i)
    bnez t2, clz_end    # if (x & (1U << i)) break
    addi t0, t0, 1      # count++ 
    addi t1, t1, -1     # i--
    bgez t1, clz_loop   # for loop
clz_end:
    mv a0, t0           # set output
    ret

Info	Value
Cycles	267
Instrs. retired	201
CPI	1.33
IPC	0.753
Clock rate	0 Hz

version 2

Use bitmask to implement clz is an easy way to boost the speed of the code.

C code

int my_clz(uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    if (x <= 0x0000FFFF) { n += 16; x <<= 16; }
    if (x <= 0x00FFFFFF) { n += 8; x <<= 8; }
    if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; }
    if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; }
    if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; }
    return n;
}

Assembly

my_clz:
    # Input: a0 = uint32_t
    # Output: a0 = clz
    beqz a0, clz_zero
    li t0, 0                    # n = 0
    li t1, 0x0000FFFF            
    blt t1, a0, shift16_skip
    addi t0, t0, 16             # n += 16;
    slli a0, a0, 16             # x <<= 16;
    
shift16_skip:
    li t1, 0x00FFFFFF
    blt t1, a0, shift8_skip
    addi t0, t0, 8              # n += 8;
    slli a0, a0, 8              # x <<= 8;   

shift8_skip:
    li t1, 0x0FFFFFFF
    blt t1, a0, shift4_skip
    addi t0, t0, 4              # n += 4;
    slli a0, a0, 4              # x <<= 4; 

shift4_skip:
    li t1, 0x3FFFFFFF
    blt t1, a0, shift2_skip
    addi t0, t0, 2              # n += 2;
    slli a0, a0, 2              # x <<= 2; 

shift2_skip:
    li t1, 0x7FFFFFFF
    blt t1, a0, clz_done
    addi t0, t0, 1              # n += 1;
    slli a0, a0, 1              # x <<= 1; 
    j clz_done

clz_zero:
    li a0, 32
    ret
    
clz_done:
    li a0, t0
    ret

Info	Value
Cycles	46
Instrs. retired	32
CPI	1.44
IPC	0.696
Clock rate	0 Hz

3307. Find the K-th Character in String Game II

Description:
Alice and Bob are playing a game. Initially, Alice has a string word = "a".
You are given a positive integer k. You are also given an integer array operations, where operations[i] represents the type of the ith operation.
Now Bob will ask Alice to perform all operations in sequence:
If operations[i] == 0, append a copy of word to itself.
If operations[i] == 1, generate a new string by changing each character in word to its next character in the English alphabet, and append it to the original word. For example, performing the operation on "c" generates "cd" and performing the operation on "zb" generates "zbac".
Return the value of the kth character in word after performing all the operations.
Note that the character 'z' can be changed to 'a' in the second type of operation.

Solution:
So we need to find how many times a changes to reach the input position k, the first step is to calculate the word length of the final operation. Since the word initially starts with a, in the first iteration, the word length is two. In the second iteration, the word length becomes four. The word length grows exponentially. Therefore, to determine the word size at position k, it can be represented as

2^{\log_{2} (k) + 1}

, which corresponds to

\log_{2} (k) + 1

operations.

The next step is to determine in which half of the string position k is located at each operation. If k is in the right half of the word, we increase the change count by one if operation[i] == 1. Then, we subtract the position of k to find its parent character. Use a for loop to go through all the operations.

In the final step, apply the accumulated changes to the character a.

char kthCharacter(long long k, int* operations, int operationsSize) {
    int mutations = 0;
    for (int op = log2(k); op >= 0; --op)
        if (k > 1LL << op) {
            k -= 1LL << op;
            mutations += operations[op];
        }
    return 'a' + mutations % 26;
}

The relationship between ilog2 (integer log base 2) and clz (count leading zeros) stems from the fact that finding the logarithm base 2 of an integer can be directly related to identifying the position of the highest set bit in the binary representation of the number.
The log2() function can be defined by custom function, my_clz().

int ilog2(int n){
    return 31 - my_clz(n);
}

Notice that the input of this question is in long long, so we hath to modify my_clz() and ilog2() to support 64-bit format.

int my_clz64(uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    if (x <= 0x00000000FFFFFFFF) { n += 32; x <<= 32; }
    if (x <= 0x0000FFFFFFFFFFFF) { n += 16; x <<= 16; }
    if (x <= 0x00FFFFFFFFFFFFFF) { n += 8; x <<= 8; }
    if (x <= 0x0FFFFFFFFFFFFFFF) { n += 4; x <<= 4; }
    if (x <= 0x3FFFFFFFFFFFFFFF) { n += 2; x <<= 2; }
    if (x <= 0x7FFFFFFFFFFFFFFF) { n += 1; x <<= 1; }
    return n;
}

int ilog2(long long n){
    return 63 - my_clz64(n);
}

The full code is right here.

Since we have to handle long long, which is 64-bit data, we need to modify the rv32i code to use two registers to handle long long. Following the C code, we have defined three long long operations: shift_left_64, blt_64, and sub_64.

my_clz64:
    # Input: 
    # a0 (lower 32 bits of x)
    # a1 (upper 32 bits of x)
    # Output: 
    # a0 (clz)      

    or t0, a0, a1
    beqz t0, zero_case       # if x == 0
    
    li t1, 0                 # set n = 0
    
    bnez a1, shift32_skip
    addi t1, t1, 32          # n += 32
    mv a1, a0                # x <<= 32
    li a0, 0                 # x_low = 0
        
shift32_skip:
    # if (x_hi <= 0x0000ffff)
    li t2, 0xffff
    sltu t3, t2, a1          # t3(flag) = (0xfff < x_hi) 
    bnez t3, shift16_skip    # if (x_hi <= 0000ffff)
    addi t1, t1, 16          # n += 16
    slli a1, a1, 16          # x_hi <<= 16
    mv t3, a0
    srli t3, t3, 16
    or a1, a1, t3            # x_hi = (x_hi << 16) | (x_lo >> 16)
    slli a0, a0, 16          # 
    
shift16_skip:
    # if (x_hi <= 0x00ffffff)
    li t2, 0xffffff
    sltu t3, t2, a1
    bnez t3, shift8_skip
    addi t1, t1, 8           # n += 8
    slli a1, a1, 8           # x_hi <<= 8
    mv t3, a0                # set a extra register to preserve the bits 
                             # being shifted
    srli t3, t3, 8
    or a1, a1, t3            # x_hi = (x_hi << 8) | (x_lo >> 8)
    slli a0, a0, 8           # 

shift8_skip:
    # if (x_hi <= 0x0fffffff)
    li t2, 0xfffffff
    sltu t3, t2, a1
    bnez t3, shift4_skip
    addi t1, t1, 4           # n += 4
    slli a1, a1, 4           # x_hi <<= 4
    mv t3, a0
    srli t3, t3, 4
    or a1, a1, t3            # x_hi = (x_hi << 4) | (x_lo >> 4)
    slli a0, a0, 4           # 

shift4_skip:
    # if (x_hi <= 0x3fffffff)
    li t2, 0x3fffffff
    sltu t3, t2, a1
    bnez t3, shift2_skip
    addi t1, t1, 2           # n += 2
    slli a1, a1, 2           # x_hi <<= 2
    mv t3, a0
    srli t3, t3, 2
    or a1, a1, t3            # x_hi = (x_hi << 2) | (x_lo >> 2)
    slli a0, a0, 2           # 

shift2_skip:
    # if (x_hi <= 0x7fffffff)
    li t2, 0x7fffffff
    sltu t3, t2, a1
    bnez t3, clz_end
    addi t1, t1, 1           # n += 1
    slli a1, a1, 1           # x_hi <<= 1
    mv t3, a0
    srli t3, t3, 1
    or a1, a1, t3            # x_hi = (x_hi << 1) | (x_lo >> 1)
    slli a0, a0, 1           # 
    
clz_end:
    mv a0, t1                # return n   
    ret

zero_case:
    li a0 64                 # retrun 64      
    ret
    
ilog2:
    addi sp, sp, -4
    sw ra, 0(sp)
    jal ra, my_clz64
    lw ra, 0(sp) 
    addi sp, sp, 4
    li t0, 63
    sub a0, t0, a0
    
    ret

kthCharacter:
    # Inputs:
    # a0: k_low (lower 32 bits of k)
    # a1: k_high (upper 32 bits of k)
    # a2: operations (pointer to array)
    # a3: operationsSize
    # Output:
    # a0: resulting character
    addi sp, sp, -28
    sw ra, 24(sp)             # save return address
    sw s0, 20(sp)             # save mutations
    sw s1, 16(sp)             # save op
    sw s2, 12(sp)             # save k_low
    sw s3, 8(sp)              # save k_high
    sw s4, 4(sp)              # save operations address
    sw s5, 0(sp)              # save operationsSize
    
    mv s4, a2
    mv s5, s3
    
    li s0, 0                  # set s0 to mutation = 0
    mv s2, a0
    mv s3, a1
    
    jal ra, ilog2             # call ilog2, result in a0
    mv s1, a0                 # set s1 = op

loop:
    bltz s1, loop_end         # branch if op < 0
        
    mv a0, s1                 # a0 = op
    jal ra, shift_left_64     # return a0, a1 for 64 bit data
    mv t0, a0                 # set t0 = shift_reult_low
    mv t1, a1                 # set t1 = shift_reult_high
    
    mv a0, t0                 # set a0 = shift_result_low
    mv a1, t1                 # set a1 = shift_result_high
    mv a2, s2                 # set a2 = k_low
    mv a3, s3                 # set a3 = k_high
    jal ra, blt_64            # blt 1LL << op, k, return boolean result 
                              # in a0
    beqz a0, skip_if
    
    mv a0, s2                 # a0 = k_low
    mv a1, s3                 # a1 = k_high
    mv a2, t0                 # a2 = shift_result_low
    mv a3, t1                 # a3 = shift_result_high
    jal ra, sub_64            # return result in a0, a1
    
    # update k
    mv s2, a0                 # s2 = k_low
    mv s3, a1                 # s3 = k_high
    
    # operation[op]
    slli t0, s1, 2
    add t1, s4, t0
    lw t2, 0(t1)
        
    add s0, s0, t2            # s0（mutations） += operations[op]
    
skip_if:
    addi s1, s1, -1           # op--
    j loop

loop_end:
    li t0, 26
    rem t1, s0, t0            # set t1 = mutations % 26
    li t0, 0x61               # set t0 = ascii of 'a'
    add a0, t0, t1            # a0 = 'a' + (mutations % 26)
        
    lw s5, 0(sp)              
    lw s4, 4(sp)              
    lw s3, 8(sp)              
    lw s2, 12(sp)             
    lw s1, 16(sp)             
    lw s0, 20(sp)             
    lw ra, 24(sp)             
    addi sp, sp, 28
    ret
    
# To handle left shift for 64-bit long long
shift_left_64:
    # Inputs(long long):
    # a0: shift amount(op)
    # Outputs(long long):
    # a0: lower 32 bits
    # a1: upper 32 bits

    li t4, 32
    blt a0, t4, shift_less_32       # check if shift over the
                                    # representation of 32-bit
    # left shift more than 31bits
    addi a0, a0, -32                # 
    li a1, 1                        #
    sll a1, a1, a0                  # set x_hi = 1 << (op - 32)
    li a0, 0                        # set x_low = 0
    j end

shift_less_32:
    # left shift more than 32 bits
    li a1, 1                        # temp set 1
    sll a0, a1, a0                  # set x_low = 1LL << op 
    li a1, 0                        # set x_hi = 0
    j end

end:
    ret
    

# To handle comaprison of two long long data
blt_64:
    # Inputs(two long long):
    # a0: x1_low
    # a1: x1_high
    # a2: x2_low
    # a3: x2_high
    # Outputs(long long):
    # a0: boolean

    bltu a1, a3, true
    beq a1, a3, equal
false:
    li a0, 0                       # x1_high > x2_high, return false
    ret

true:
    # x1 < x2, return true
    li a0, 1                  
    ret

equal:
    # if x1_high = x2_high, compare lower
    bltu a0, a2, true               # x1_low < x2_low, return true
    j false


# To handle substraction of two long long data
sub_64:
    # Inputs(two long long):
    # a0: x1_low
    # a1: x1_high
    # a2: x2_low
    # a3: x2_high
    # Outputs(long long):
    # a0: r_low
    # a1: r_high
    
    sltu t0, a0, a2             # t0 = (a0 < a2) ? 1 : 0, check if borrow
    sub a0, a0, a2              # a0 = a0 - a2
    sub a1, a1, a3              # a1 = a1 - a3
    sub a1, a1, t0              # a1 = a1 - t0
    ret

Test cases

Test case 1:

Input: k = 12145134613, operations = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
Output: "i" (0x69)

Test case 2:

Input: k = 10, operations = [0, 1, 0, 1]
Output: "b" (0x62)

Test case 3:

Input: k = 25, operations = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]
Output: "b" (0x62)

The test result as below:

All tests pass!

Program exited with code: 0

The execution info as below:

Info	Value
Cycles	2687
Instrs. retired	1709
CPI	1.57
IPC	0.636
Clock rate	0 Hz

Analysis

Pseudo instruction

00000000 <main>:
    0:        10000297        auipc x5 0x10000
    4:        08c28293        addi x5 x5 140
    8:        0002a503        lw x10 0 x5
    c:        10000317        auipc x6 0x10000
    10:        08430313        addi x6 x6 132
    14:        00032583        lw x11 0 x6
    18:        10000617        auipc x12 0x10000
    1c:        fe860613        addi x12 x12 -24
    20:        10000397        auipc x7 0x10000
    24:        07438393        addi x7 x7 116
    28:        0003a683        lw x13 0 x7
    2c:        1e0000ef        jal x1 480 <kthCharacter>
    30:        10000e17        auipc x28 0x10000
    34:        068e0e13        addi x28 x28 104
    38:        000e2e83        lw x29 0 x28
    3c:        09c000ef        jal x1 156 <check_result>
    40:        10000297        auipc x5 0x10000
    44:        06c28293        addi x5 x5 108
    48:        0002a503        lw x10 0 x5
    4c:        10000317        auipc x6 0x10000
    50:        06430313        addi x6 x6 100
    54:        00032583        lw x11 0 x6
    58:        10000617        auipc x12 0x10000
    5c:        04460613        addi x12 x12 68
    60:        10000397        auipc x7 0x10000
    64:        05438393        addi x7 x7 84
    68:        0003a683        lw x13 0 x7
    6c:        1a0000ef        jal x1 416 <kthCharacter>
    70:        10000e17        auipc x28 0x10000
    74:        048e0e13        addi x28 x28 72
    78:        000e2e83        lw x29 0 x28
    7c:        05c000ef        jal x1 92 <check_result>
    80:        10000297        auipc x5 0x10000
    84:        06428293        addi x5 x5 100
    88:        0002a503        lw x10 0 x5
    8c:        10000317        auipc x6 0x10000
    90:        05c30313        addi x6 x6 92
    94:        00032583        lw x11 0 x6
    98:        10000617        auipc x12 0x10000
    9c:        02460613        addi x12 x12 36
    a0:        10000397        auipc x7 0x10000
    a4:        04c38393        addi x7 x7 76
    a8:        0003a683        lw x13 0 x7
    ac:        160000ef        jal x1 352 <kthCharacter>
    b0:        10000e17        auipc x28 0x10000
    b4:        040e0e13        addi x28 x28 64
    b8:        000e2e83        lw x29 0 x28
    bc:        01c000ef        jal x1 28 <check_result>
    c0:        00400893        addi x17 x0 4
    c4:        10000517        auipc x10 0x10000
    c8:        03050513        addi x10 x10 48
    cc:        00000073        ecall
    d0:        00a00893        addi x17 x0 10
    d4:        00000073        ecall

000000d8 <check_result>:
    d8:        01d51463        bne x10 x29 8 <fail>
    dc:        00008067        jalr x0 x1 0

000000e0 <fail>:
    e0:        00400893        addi x17 x0 4
    e4:        10000517        auipc x10 0x10000
    e8:        02150513        addi x10 x10 33
    ec:        00000073        ecall
    f0:        00a00893        addi x17 x0 10
    f4:        00000073        ecall

000000f8 <my_clz64>:
    f8:        00b562b3        or x5 x10 x11
    fc:        0e028463        beq x5 x0 232 <zero_case>
    100:        00000313        addi x6 x0 0
    104:        00059863        bne x11 x0 16 <shift32_skip>
    108:        02030313        addi x6 x6 32
    10c:        00050593        addi x11 x10 0
    110:        00000513        addi x10 x0 0

00000114 <shift32_skip>:
    114:        000103b7        lui x7 0x10
    118:        fff38393        addi x7 x7 -1
    11c:        00b3be33        sltu x28 x7 x11
    120:        000e1e63        bne x28 x0 28 <shift16_skip>
    124:        01030313        addi x6 x6 16
    128:        01059593        slli x11 x11 16
    12c:        00050e13        addi x28 x10 0
    130:        010e5e13        srli x28 x28 16
    134:        01c5e5b3        or x11 x11 x28
    138:        01051513        slli x10 x10 16

0000013c <shift16_skip>:
    13c:        010003b7        lui x7 0x1000
    140:        fff38393        addi x7 x7 -1
    144:        00b3be33        sltu x28 x7 x11
    148:        000e1e63        bne x28 x0 28 <shift8_skip>
    14c:        00830313        addi x6 x6 8
    150:        00859593        slli x11 x11 8
    154:        00050e13        addi x28 x10 0
    158:        008e5e13        srli x28 x28 8
    15c:        01c5e5b3        or x11 x11 x28
    160:        00851513        slli x10 x10 8

00000164 <shift8_skip>:
    164:        100003b7        lui x7 0x10000
    168:        fff38393        addi x7 x7 -1
    16c:        00b3be33        sltu x28 x7 x11
    170:        000e1e63        bne x28 x0 28 <shift4_skip>
    174:        00430313        addi x6 x6 4
    178:        00459593        slli x11 x11 4
    17c:        00050e13        addi x28 x10 0
    180:        004e5e13        srli x28 x28 4
    184:        01c5e5b3        or x11 x11 x28
    188:        00451513        slli x10 x10 4

0000018c <shift4_skip>:
    18c:        400003b7        lui x7 0x40000
    190:        fff38393        addi x7 x7 -1
    194:        00b3be33        sltu x28 x7 x11
    198:        000e1e63        bne x28 x0 28 <shift2_skip>
    19c:        00230313        addi x6 x6 2
    1a0:        00259593        slli x11 x11 2
    1a4:        00050e13        addi x28 x10 0
    1a8:        002e5e13        srli x28 x28 2
    1ac:        01c5e5b3        or x11 x11 x28
    1b0:        00251513        slli x10 x10 2

000001b4 <shift2_skip>:
    1b4:        800003b7        lui x7 0x80000
    1b8:        fff38393        addi x7 x7 -1
    1bc:        00b3be33        sltu x28 x7 x11
    1c0:        000e1e63        bne x28 x0 28 <clz_end>
    1c4:        00130313        addi x6 x6 1
    1c8:        00159593        slli x11 x11 1
    1cc:        00050e13        addi x28 x10 0
    1d0:        001e5e13        srli x28 x28 1
    1d4:        01c5e5b3        or x11 x11 x28
    1d8:        00151513        slli x10 x10 1

000001dc <clz_end>:
    1dc:        00030513        addi x10 x6 0
    1e0:        00008067        jalr x0 x1 0

000001e4 <zero_case>:
    1e4:        04000513        addi x10 x0 64
    1e8:        00008067        jalr x0 x1 0

000001ec <ilog2>:
    1ec:        ffc10113        addi x2 x2 -4
    1f0:        00112023        sw x1 0 x2
    1f4:        f05ff0ef        jal x1 -252 <my_clz64>
    1f8:        00012083        lw x1 0 x2
    1fc:        00410113        addi x2 x2 4
    200:        03f00293        addi x5 x0 63
    204:        40a28533        sub x10 x5 x10
    208:        00008067        jalr x0 x1 0

0000020c <kthCharacter>:
    20c:        fe410113        addi x2 x2 -28
    210:        00112c23        sw x1 24 x2
    214:        00812a23        sw x8 20 x2
    218:        00912823        sw x9 16 x2
    21c:        01212623        sw x18 12 x2
    220:        01312423        sw x19 8 x2
    224:        01412223        sw x20 4 x2
    228:        01512023        sw x21 0 x2
    22c:        00060a13        addi x20 x12 0
    230:        00098a93        addi x21 x19 0
    234:        00000413        addi x8 x0 0
    238:        00050913        addi x18 x10 0
    23c:        00058993        addi x19 x11 0
    240:        fadff0ef        jal x1 -84 <ilog2>
    244:        00050493        addi x9 x10 0

00000248 <loop>:
    248:        0604c063        blt x9 x0 96 <loop_end>
    24c:        00048513        addi x10 x9 0
    250:        08c000ef        jal x1 140 <shift_left_64>
    254:        00050293        addi x5 x10 0
    258:        00058313        addi x6 x11 0
    25c:        00028513        addi x10 x5 0
    260:        00030593        addi x11 x6 0
    264:        00090613        addi x12 x18 0
    268:        00098693        addi x13 x19 0
    26c:        0a0000ef        jal x1 160 <blt_64>
    270:        02050863        beq x10 x0 48 <skip_if>
    274:        00090513        addi x10 x18 0
    278:        00098593        addi x11 x19 0
    27c:        00028613        addi x12 x5 0
    280:        00030693        addi x13 x6 0
    284:        0a8000ef        jal x1 168 <sub_64>
    288:        00050913        addi x18 x10 0
    28c:        00058993        addi x19 x11 0
    290:        00249293        slli x5 x9 2
    294:        005a0333        add x6 x20 x5
    298:        00032383        lw x7 0 x6
    29c:        00740433        add x8 x8 x7

000002a0 <skip_if>:
    2a0:        fff48493        addi x9 x9 -1
    2a4:        fa5ff06f        jal x0 -92 <loop>

000002a8 <loop_end>:
    2a8:        01a00293        addi x5 x0 26
    2ac:        02546333        rem x6 x8 x5
    2b0:        06100293        addi x5 x0 97
    2b4:        00628533        add x10 x5 x6
    2b8:        00012a83        lw x21 0 x2
    2bc:        00412a03        lw x20 4 x2
    2c0:        00812983        lw x19 8 x2
    2c4:        00c12903        lw x18 12 x2
    2c8:        01012483        lw x9 16 x2
    2cc:        01412403        lw x8 20 x2
    2d0:        01812083        lw x1 24 x2
    2d4:        01c10113        addi x2 x2 28
    2d8:        00008067        jalr x0 x1 0

000002dc <shift_left_64>:
    2dc:        02000e93        addi x29 x0 32
    2e0:        01d54c63        blt x10 x29 24 <shift_less_32>
    2e4:        fe050513        addi x10 x10 -32
    2e8:        00100593        addi x11 x0 1
    2ec:        00a595b3        sll x11 x11 x10
    2f0:        00000513        addi x10 x0 0
    2f4:        0140006f        jal x0 20 <end>

000002f8 <shift_less_32>:
    2f8:        00100593        addi x11 x0 1
    2fc:        00a59533        sll x10 x11 x10
    300:        00000593        addi x11 x0 0
    304:        0040006f        jal x0 4 <end>

00000308 <end>:
    308:        00008067        jalr x0 x1 0

0000030c <blt_64>:
    30c:        00d5e863        bltu x11 x13 16 <true>
    310:        00d58a63        beq x11 x13 20 <equal>

00000314 <false>:
    314:        00000513        addi x10 x0 0
    318:        00008067        jalr x0 x1 0

0000031c <true>:
    31c:        00100513        addi x10 x0 1
    320:        00008067        jalr x0 x1 0

00000324 <equal>:
    324:        fec56ce3        bltu x10 x12 -8 <true>
    328:        fedff06f        jal x0 -20 <false>

0000032c <sub_64>:
    32c:        00c532b3        sltu x5 x10 x12
    330:        40c50533        sub x10 x10 x12
    334:        40d585b3        sub x11 x11 x13
    338:        405585b3        sub x11 x11 x5
    33c:        00008067        jalr x0 x1 0

5-stage pipelined processor

Now we can choose a processor to run this code. Ripes provide four kinds of processor for us to choose, including single cycle processor, 5-stage processor, 5-stage with hazard detection and 5-stage with forward and hazard detection. Here we choose the 5 stage processor. Its block diagram look like this:

The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are:

Instruction fetch (IF)
Instruction decode and register fetch (ID)
Execute (EX)
Memory access (MEM)
Register write back (WB)

You can see that each stage is separated by pipeline registers (the rectangle block with stage names on its each side) in the block diagram.

I-Format load

Detailed Analysis of lw t2, 0(t1) in RISC-V Pipeline:

instruction format : lw t2, 0(t1)
The register t1 stores the address of the input operation array plus the offset of the variable op. We can then check whether the data of operation array is 0 or 1, and store the result in t2.

1. Instruction fetch (IF)

Follow online tool for RISC-V Instruction Encoder/Decoder we can find the machine code to lw t2, 0(t1) is 0x00032383.
PC will increment by 4 automatically using the above adder.
Because there is no branch occur, next instruction will be at PC + 4, so the multiplexer before PC choose input come from adder.

Instruction decode and register fetch (ID)

imm[11:0]	rs1[5]	funct3[3]	rd[5]	opcode[7]	Instruction
imm[11:0]	rs1	010	rd	0000011	lw
000000000000	00110	010	00111	0000011	lw(in binary)
0x00000000	0x06	0x02	0x07	0x03	lw(in hex)

Recall I-format load, we can split lw t2, 0(t1) into five parts. Which is shown above.
R2 idx is not used in lw so it is 0x00
Current PC value (0x00000298) and next PC value (0x0000029c) are just send through this stage, we don't use them.

3. Execute (EX)

Since this is an I-type instruction, the first-level multiplexers (which normally select values from register 1 and register 2) are not used. These register values are filtered out by the second-level multiplexers, as the instruction does not rely on register-to-register operations.
Reg 1 and Reg 2 are send to branch block, but no branch is taken.
The second-level multiplexers select the current value of the program counter (PC) and the immediate value (0) to be used as the operands for the ALU.
The ALU performs an addition operation between the base address in 0x06 and the immediate value (0). The result is the effective memory address, which is then passed to the memory access stage (EX/MEM).

4. Memory access (MEM)

Use as data memory address. Memory read data at address 0x10000084, so Read out is equal to 0x00000001. The table below denotes the data section of memory.
Next PC value (0x0000029c) and Wr idx (0x07) are just send through this stage, we don't use them.
Reg 2 is send to Data in, but memory doesn't enable writing.

5. Register write back (WB)

The multiplexer choose Read out from data memory as final output. So the output value is 0x00000001.
The output value and Wr idx are send back to registers block. Finally, the value 0x00000001 will be write into x7 register, whose ABI name is t2.

After all these stage are done, the register is updated like this:

TODO

Check if popcount version clz can go faster
Improve blt_64, shift_left_64, sub_64

Reference

Quiz1
Leetcode 3307. Find the K-th Character in String Game II
online tool for RISC-V Instruction Encoder/Decoder

Assignment1: RISC-V Assembly and Instruction Pipeline

Problem B in Quiz1

C code

RISC-V Assembly Implementation

Problem C in Quiz1

version 1

C code

Assembly code

version 2

C code

Assembly

3307. Find the K-th Character in String Game II

Test cases

Analysis

Pseudo instruction

5-stage pipelined processor

I-Format load

1. Instruction fetch (IF)

Instruction decode and register fetch (ID)

3. Execute (EX)

4. Memory access (MEM)

5. Register write back (WB)

TODO

Reference

Read more

2025q1 Homework1 (lab0)

2025q1 Homework1 (ideas)

Kernel_matmul

633. Sum of Square Numbers