Try   HackMD

Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by <吳睿秉>

Be aware of Markdown headings.

Quiz1 Problem C

Use color sparingly, only when necessary to differentiate from plain text. Prioritize content over formatting.

fabsf

The function fabsf is used to obtain the absolute value of a floating-point number. This is achieved by applying the AND operation with the mask 0x7FFFFFFF, which sets the sign bit to 0.

C

In this C program, the input floating point number's address is first cast to a uint32_t type to enable bitwise operations. After this, the uint32_t value is masked with 0x7FFFFFFF to clear the sign bit. Finally, the result is cast back to the original floating point type, which gives the absolute value of the floating point number.

static inline float fabsf(float x) {
    uint32_t i = *(uint32_t *)&x;
    i &= 0x7FFFFFFF;
    x = *(float *)&i;
    return x;
}

Don't paste code snip without comprehensive discussions.

RISC-V

There are three test cases, so the len is set to 3 in advance. and the process will print the input to the console before calling the function fabsf. After the function fabsf returns, the output will be stored in a0 and then moved to a1. Next, the value in a1 will be compared to the expected answer in a2. The console will print 1 if the output equals the expected answer.

register name content
t0 data start address (input)
t1 number of data
t2 data start address (expect output)
t3 hold the value in a0 during a syscall temporarily
t4 hold 0x7FFFFFFF temporarily
a1 fabsf output
a2 expect output
.data
data:    .word 0x3f000000, 0xc1420000, 0xf0700000
len:     .word 0x3
expect:  .word 0x3f000000, 0x41420000, 0x70700000
endl: .string "\n"

.text
    la  t0, data
    lw  t1, len
    la  t2, expect

main:
    lw  a0, 0(t0)              # Show input value       
    li  a7, 2                  #
    ecall                      #
    jal ra, newline            #
    jal ra, fabsf              # Call fabsf       
    add a1, zero, a0           # Show output value    
    li  a7, 2                  #
    ecall                      #
    jal ra, newline            #
    lw  a2, 0(t2)              # Check if the output matches the expected result    
    li  a7, 1                  #
    bne a1, a2, wrong          #
right:                         # If output == expected, print 1
    li  a0, 1                  # 
    j   goback
wrong:                         # If output != expected, print 0
    li  a0, 0                  # 
goback:
    ecall
    jal  ra newline        
    addi t0, t0, 4             # Iterate data array
    addi t1, t1,-1             
    addi t2, t2, 4             # Iterate expect array
    bne  t1, zero, main        
    li   a7, 10             
    ecall

fabsf:                        # fabsf function
    li  t4, 0x7fffffff
    and a0, a0, t4            # x &= 0x7FFFFFFF;
    ret

newline:                      # print "\n"
    mv  t3, a0
    la  a0, endl
    li  a7, 4
    ecall
    mv  a0, t3
    ret

Testcase

  • Case1:

Input: 0x3f000000 (0.5 in decimal)
Output: 0x3f000000 (0.5 in decimal)

  • Case2:

Input: 0xc1420000 (-12.125 in decimal)
Output: 0x41420000 (12.125 in decimal)

  • Case3:

Input: 0xf0700000 (-2.97106e+29 in decimal)
Output: 0x70700000 (2.97106e+29 in decimal)

Console

0.5
0.5
1
-12.125
12.125
1
-2.97106e+29
2.97106e+29
1

Program exited with code: 0

Use fewer instructions.

my_clz

The my_clz function counts the leading zeros in a 32-bit digit. Starting from the leftmost digit and moving toward the right, it counts the zeros until a '1' appears. The function then returns the number of zeros in t4 register. Note that if the 32-bit digit is all zeros (i.e., 0), the leading zeros are considered undefined.

C

This C program simply counts the leading zeros in the input x. It starts by setting the count to 0. Then, it iterates from left to right through all 32 bits. Using a left shift operation (1U << i), for i=31 to 0, it performs an AND operation with x to check if the bit is 1. As soon as the first 1 appears from the left side, the loop breaks and returns count, which stores the number of leading zeros.

static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}

RISC-V

The RISC-V program for counting leading zeros works as follows. There are four inputs. The program displays the 32-bit input before starting my_clz. It uses a loop to accumulate the number of leading zeros until the t6 register holds a 1, meaning the condition (x & (1 << i)) is true. At this point, the loop stops, and the count is stored in t4. The program automatically checks the answer, and if it's correct, it will print "correct."

register name content
t0 data start address (input)
t1 number of data
t2 data start address (expect output)
t3 hold the value in a0 during a syscall temporarily
t4 count
t5 i
t6 (x & (1 << i))
a1 my_clz output
a2 expect output
s0 store the immediate value 32 to check if the result is undefined.
.data
data:       .word 0x0000002, 0x000000FF, 0xFFFFFFFF, 0x00000000
len:        .word 0x4
expect:     .word 30, 24, 0, 32
endl:       .string "\n"
undef:      .string "undefined"
correct:    .string " correct"
incorrect:  .string " incorrect"

.text
    la  t0, data
    lw  t1, len 
    la  t2, expect

main:
    lw  a0, 0(t0)           # Show input value
    li  a7, 1               #
    ecall                   #
    jal ra, newline         #
    jal ra, my_clz          # Call my_clz
    mv  a1, a0              # Show output value  
    li  s0, 32              # Check if the result is undefined
    beq a1, s0, isundef     # 
    li  a7, 1               # Show output value
    ecall                   # 
    lw  a2, 0(t2)           # Check if the output matches the expected result
    li  a7, 4               # 
    bne a1, a2, wrong       # 
right:                      # If output == expected, print correct 
    la  a0, correct         # 
    j   goback              # 
wrong:                      # If output != expected, print incorrect 
    la  a0, incorrect       # 
    j   goback              # 
isundef:                    # undefined
    la  a0, undef           # 
goback:
    ecall
    jal  ra, newline
    addi t0, t0, 4
    addi t1, t1,-1
    addi t2, t2, 4
    bne  t1, zero, main
    li   a7, 10
    ecall

my_clz:                     # my_clz function
    li  t4 0
    li  t5 31
loop:
    li   t6, 1              # t6 =       1
    sll  t6, t6, t5         # t6 =      (1 << i)
    and  t6, a0, t6         # t6 = (x & (1 << i))
    bnez t6, quit           # if (x & (1U << i)) break;
    addi t4, t4, 1
    addi t5, t5, -1
    bge  t5, zero, loop
quit:
    mv  a0, t4
    ret

newline:                    # to print "\n"
    mv  t3, a0
    la  a0, endl
    li  a7, 4
    ecall
    mv  a0, t3
    ret

Testcase

  • Case1:

Input: 0x0000002 (2 in decimal, 00000000 00000000 00000000 00000010 in binary)
Output: 30

  • Case2:

Input: 0x000000FF (255 in decimal, 00000000 00000000 00000000 11111111 in binary)
Output: 24

  • Case3:

Input: 0xFFFFFFFF (-1 in decimal, 11111111 11111111 11111111 11111111 in binary)
Output: 0

  • Case4:

Input: 0x00000000 (0)
Output: undefined

Console

2
30 correct
255
24 correct
-1
0 correct
0
undefined

Program exited with code: 0

fp16_to_fp32

The purpose of fp16_to_fp32 is to convert a 16-bit floating-point number to a 32-bit floating-point number. Due to the format differences between FP16 and FP32, certain transformations are required during the conversion process.

There are several steps involved. First, the input h is shifted left by 16 bits into a 32-bit uint32_t, where the upper bits hold the FP16 data and the lower bits remain zero.

Second, the sign bit is extracted, and the nonsign bits (i.e., exponent and mantissa) are separated.

The third step involves calculating the renormalization shift. The function my_clz, as mentioned earlier, is used to count the leading zeros in the nonsign part (exponent and mantissa). The result is adjusted by subtracting 5 if it's greater than 5, otherwise it's set to 0. This shift value is used to normalize the mantissa by shifting it left and then right.

The fourth step handles the special cases for the exponent being all ones (infinity or NaN). In FP16, an exponent of 0x1F (31 in decimal) indicates either infinity or NaN. The result is right-shifted by 8 bits to move the exponent to the correct position in the 32-bit representation. The mask 0x7F800000 ensures that only the exponent bits are retained in the final result.

In the fifth step, the zero_mask is used to checks if nonsign is zero. If so, equals 0, subtracting 1 causes overflow, setting bit 31 to 1. If nonsign is not zero, bit 31 remains 0.

Finally, the function combines the sign, the normalized nonsign, and applies the inf_nan_mask and zero_mask to handle cases of infinity, NaN, and zero, ultimately returning the 32-bit floating-point value.

C

static inline uint32_t fp16_to_fp32(uint16_t h) {

    const uint32_t w = (uint32_t) h << 16;
    
    const uint32_t sign = w & UINT32_C(0x80000000);
    const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
    
    uint32_t renorm_shift = my_clz(nonsign);
    renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
    
    const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
    
    const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
    
    return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}

RISC-V

This RISC-V program executes the process described in the C program, utilizing the previously mentioned my_clz function to convert from FP16 to FP32.

register name content
s1 h, w, renorm_shift
s2 sign
s3 nonsign
s4 inf_nan_mask
s5 zero_mask
s6 (nonsign << renorm_shift >> 3)
s7 ((0x70 - renorm_shift) << 23)
s8 result
s11 Temporarily store an immediate value for operations.
.data
data:       .word 0x3C00, 0xCC10, 0x962C
len:        .word 0x3
expect:     .word 0x3F800000, 0xC1820000, 0xBAC58000
endl:       .string "\n"
correct:    .string " correct"
incorrect:  .string " incorrect"

.text
    la  t0, data
    lw  t1, len 
    la  t2, expect
main:
    lw  a0, 0(t0)           # Show input value
    li  a7, 1               #
    ecall                   #
    jal ra, newline         #
    jal ra, fp16_to_fp32          # Call my_clz
    mv  a1, a0              # 
    li  a7, 1               # Show output value
    ecall                   # 
    lw  a2, 0(t2)           # Check if the output matches the expected result
    li  a7, 4               # 
    bne a1, a2, wrong       # 
right:                      # If output == expected, print correct 
    la  a0, correct         # 
    j   goback              # 
wrong:                      # If output != expected, print incorrect 
    la  a0, incorrect       # 
goback:
    ecall
    jal  ra, newline
    addi t0, t0, 4
    addi t1, t1,-1
    addi t2, t2, 4
    bne  t1, zero, main
    li   a7, 10
    ecall

fp16_to_fp32:
    slli a0 a0 16
    li s11 0x80000000
    and s2 a0 s11
    li s11 0x7FFFFFFF
    and s3 a0 s11
    mv a0 s3

my_clz:                     # my_clz function
    li  t4 0
    li  t5 31
loop:
    li   t6, 1              # t6 =       1
    sll  t6, t6, t5         # t6 =      (1 << i)
    and  t6, a0, t6         # t6 = (x & (1 << i))
    bnez t6, quit           # if (x & (1U << i)) break;
    addi t4, t4, 1
    addi t5, t5, -1
    bge  t5, zero, loop 
quit:
    mv s1 t4

    li s11 5
    blt s11 s1 Sub_Five      # if (5 < renorm_shift)
    li s1 0
    j continue
Sub_Five:
    addi s1 s1 -5   # renorm_shift -= 5
continue:
    li s11 0x04000000
    add s4 s3 s11
    srli s4 s4 8
    li s11 0x7F800000
    and s4 s4 s11
    addi s5 s3 -1
    srli s5 s5 31
    sll s6 s3 s1
    srli s6 s6 3
    li s11 0x70
    sub s7 s11 s1
    slli s7 s7 23
    add s8 s6 s7
    or s8 s8 s4
    xori s5 s5 -1
    and s8 s8 s5
    or s8 s8 s2
    mv a0 s8
    ret

newline:                    # to print "\n"
    mv  t3, a0
    la  a0, endl
    li  a7, 4
    ecall
    mv  a0, t3
    ret

Testcase

  • Case1:

Input: 0x3C00 (1 in fp16, 15360 in decimal on the console)
Output: 0x3F800000 (1 in fp32, 1065353216 in decimal on the console)

  • Case2:

Input: 0xCC10 (-16.25 in fp16, 52240 in decimal on the console)
Output: 0xC1820000 (-16.25 in fp32, -1048444928 in decimal on the console)

  • Case3:

Input: 0x962C (-0.001507 in fp16, 38444 in decimal on the console)
Output: 0xBAC58000 (-0.0015068054 in fp32, -1161461760 in decimal on the console)

Console

15360
1065353216 correct
52240
-1048444928 correct
38444
-1161461760 correct

Program exited with code: 0

3158. Find the XOR of Numbers Which Appear Twice

Description

You are given an array nums, where each number in the array appears either once or twice.
Return the bitwise XOR of all the numbers that appear twice in the array, or 0 if no number appears twice.

Example:

Input: nums = [1,2,2,1]
Output: 3
Explanation: Numbers 1 and 2 appeared twice. 1 XOR 2 == 3.

Constraints:

  • 1 <= nums.length <= 50
  • 1 <= nums[i] <= 50
  • Each number in nums appears either once or twice.

Solution - Naive Approach

This problem can easily be solved with a nested loop that checks the values in the array. If a value repeats, perform the XOR operation to get the final answer. This approach is straightforward and easy to understand.

C

    int duplicateNumbersXOR(int* nums, int numsSize) {
        int xor = 0;
        for(int i=0;i<numsSize;++i)
            for(int j=i+1;j<numsSize;++j)
                if(nums[i]==nums[j])
                    xor ^= nums[i];
        return xor;
    }

RISC-V

register name content
s3 nums[i] address
s9 nums[i] value
s11 i
s4 nums[j] address
t2 nums[j] value
t1 j
.text
    lw   s1, len    
    la   s2, size    
    la   s3, dataset             # for nums[i]
main: 
    lw   a1, 0(s2)               # a1 = numsSize
    li   s5, 0
    jal  ra, dupNumXOR           # call dupNumXOR
    li   a7, 1                   # show output
    ecall                       #
    addi s2, s2, 4              # the next dataset
    addi s1, s1, -1             # < len keep doing
    bne  s1, zero, main
    li   a7, 10
    ecall

dupNumXOR:
    li   s6,  0
    li   s11, 0
loop_i:
    bge  s11, a1, loop_i_exit
    addi t1, s11, 1
    addi s4,  s3, 4             # for nums[j]
loop_j:
    bge  t1, a1, loop_j_exit
    lw   s9, 0(s3)
    lw   t2, 0(s4)
    bne  s9, t2, skipxor
    xor  s6, s6, s9
skipxor:
    addi t1, t1, 1              # j++
    addi s4, s4, 4
    j    loop_j
loop_j_exit:
    addi s11, s11, 1            # i++
    addi s3,   s3, 4
    j    loop_i
loop_i_exit:
    mv   a0,  s6
    ret
Execution info Value
Cycles 3047
Instrs. retired 1816
CPI 1.68
IPC 0.596

This problem can easily be solved with Naive Approach. While it’s easy to understand, it is far from being the most time-efficient solution. After studying Problem C in Quiz 1, I applied a similar concept by using bitwise shift and AND operations to efficiently maintain information.

Solution - Optimized Bitwise Approach

This problem has the constraint that nums[i] < 50. Instead of using a naive O(n²) solution, a 64-bit variable (uint64_t) can be used to record whether each number has appeared. Since 64 bits exceed the range of 50, using this method is efficient and sufficient for this problem.

If a number appears for the first time, set the corresponding bit in the uint64_t using an OR operation (i.e., set |= (1ULL << nums[i]) at line 8). To check if the number has appeared a second time, use an AND operation(i.e., set & (1ULL << nums[i]) at line 6). If it has, then perform the XOR operation.

C

#include <stdint.h> int duplicateNumbersXOR(int *nums, int numsSize){ uint64_t set = 0ULL; int xor = 0; for (int i = 0; i < numsSize; ++i){ if (set & (1ULL << nums[i])) xor ^= nums[i]; set |= (1ULL << nums[i]); } return xor; }
#include <stdio.h>

#define test1_Size 4
#define test1_ans 1
int test1_data[test1_Size] = {1, 2, 1, 3};

#define test2_Size 14
#define test2_ans 12
int test2_data[test2_Size] = {1, 2, 3, 4, 6, 7, 8, 2, 9, 9, 3, 5, 1, 5};

#define test3_Size 16
#define test3_ans 43
int test3_data[test3_Size] = { 3, 49, 49,  3,  8, 12, 19, 19, 
                              28, 27, 28, 50, 31, 36, 50, 36};

int main()
{
    int ans1 = duplicateNumbersXOR(test1_data, test1_Size);
    printf("%2d, %2d, %2d\n", ans1, test1_ans, ans1 == test1_ans);

    int ans2 = duplicateNumbersXOR(test2_data, test2_Size);
    printf("%2d, %2d, %2d\n", ans2, test2_ans, ans2 == test2_ans);

    int ans3 = duplicateNumbersXOR(test3_data, test3_Size);
    printf("%2d, %2d, %2d\n", ans3, test3_ans, ans3 == test3_ans);
}

Console

 1,  1,  1
12, 12,  1
43, 43,  1

In the C program, the XOR result is stored in a uint64_t variable. However, in RISC-V, the registers are only 32 bits wide. If we use only one 32-bit register (s7 register in Wrong version of dupNumXOR function)to store the result, it could lead to incorrect answers.

Wrong version of dupNumXOR function

dupNumXOR:                  
    li s6 0                 # xor result
    li s7 0                 # set
    li s11 0                # i
    li t5 1                 # const 1
loop:
    lw s9, 0(s3)            # s9 = nums[i]
    sll  s10, t5, s9        # set &= (1 << nums[i])
    and  s10, s7, s10       # 
    beqz s10, skipxor       
    xor   s6, s6 ,s9        # xor ^= nums[i];
skipxor:
    sll  s10, t5, s9        # set |= (1ULL << nums[i])
    or   s7,  s7, s10       # 
    addi s3, s3, 4          # iterate from nums[i] to nums[i+1]
    addi s11, s11 1         # i<numsSize; i++;                             
    bge  s11, a1, quit      # 
    j    loop
quit:
    mv a0 s6                # s6 is the xor result
    ret

This scenario is similar to the C code, where changing uint64_t set = 0ULL; to uint32_t set = 0ULL; in line 3 would also lead to an incorrect result.

In this case, if nums[i] > 31, the information from (1ULL << nums[i]) cannot be stored in a single 32-bit register.

To correctly store a 64-bit unsigned integer in RISC-V, we need to utilize two 32-bit registers(in RISC-V Corrected version, register s7 and s8).

RISC-V Corrected version

.data
len:    .word 3             # number of dataset
size:   .word 4, 14, 16     # length of every dataset
dataset:  
.word 1, 2, 1, 3 
.word 1, 2, 3, 4, 6, 7, 8, 2, 9, 9, 3, 5, 1, 5
.word 3, 49, 49, 3, 8, 12, 19, 19, 28, 27, 28, 50, 31, 36, 50, 36
expect: .word 1, 12, 43
endl:       .string "\n"

.text
    lw  s1, len    
    la  s2, size    
    la  s3, dataset
main: 
    lw a1, 0(s2)            # numsSize
    li s5, 0
    jal ra dupNumXOR        # (s5 < numsSize) print(nums[i])
    li  a7, 1
    ecall
goback:
    addi s2, s2,4
    jal ra newline
    addi s1, s1,-1          # < len keep doing
    bne  s1, zero, main
    li   a7, 10
    ecall

dupNumXOR:                  # duplicateNumbersXOR function
    li s6 0                 # xor result
    li s7 0                 # set_A (the  left half bits)
    li s8 0                 # set_B (the right half bits)
    li s11 0
    li t5 1
loop:
    lw s9, 0(s3)            # s9 = nums[i]
    li  t4 32
    bge s9 t4 left_half_bits
right_half_bits:        # s8: set_B (the right half bits)
    sll  s10, t5, s9
    and  s10, s8, s10
    beqz s10, right_skipxor       # if not(set & (1ULL << nums[i]))
    xor   s6, s6 ,s9
right_skipxor:
    sll  s10, t5, s9
    or   s8,  s8, s10
    j loop_goback
left_half_bits:         # s7: set_A (the  left half bits)
    addi t4 s9 -31      # (nums[i] - 31)
    sll  s10, t5, t4
    and  s10, s7, s10
    beqz s10, left_skipxor       # if not(set & (1ULL << nums[i]))
    xor   s6, s6 ,s9
left_skipxor:
    sll  s10, t5, t4
    or   s7,  s7, s10
loop_goback:
    addi s3, s3, 4          # iter nums[i] -> nums[i+1]
    addi s11 s11 1          # i<numsSize; i++;                        
    bge  s11, a1, quit  #
    j    loop
quit:
    mv a0 s6
    ret

newline:                    # to print "\n"
    mv  t6, a0
    la  a0, endl
    li  a7, 4
    ecall                           
    mv  a0, t6
    ret
Execution info Value
Cycles 754
Instrs. retired 528
CPI 1.43
IPC 0.7

RISC-V Full Revised Version

Eventually, after slightly rearranging the registers in the checks for upper and lower bits and improving the branch conditions, I revised the code, reducing the number of instructions and cycles.

register name content
t4 immediate 32 for comparision, (nums[i] - 31)
t5 const 1
t6 hold the value in a0 during a syscall temporarily
a1 array length (numsSize)
s1 number of dataset
s2 length of every dataset base address
s3 dataset base address
s4 expect base address
s5 currently doing x-th in a dataset
s6 xor result
s7 set_A (the left half bits)
s8 set_B (the right half bits)
s9 nums[i]
s10 (set & (1ULL << nums[i]))
s11 i
.data
len:    .word 3             # number of dataset
size:   .word 4, 14, 16     # length of every dataset
dataset:  
.word 1, 2, 1, 3 
.word 1, 2, 3, 4, 6, 7, 8, 2, 9, 9, 3, 5, 1, 5
.word 3, 49, 49, 3, 8, 12, 19, 19, 28, 27, 28, 50, 31, 36, 50, 36
expect: .word 1, 12, 43
endl:       .string "\n"

.text
    lw  s1, len    
    la  s2, size    
    la  s3, dataset
main: 
    lw a1, 0(s2)            # numsSize
    li s5, 0
    jal ra dupNumXOR        # (s5 < numsSize) print(nums[i])
    li  a7, 1
    ecall
    jal ra newline          # print "\n"
    addi s2, s2,4           # the next dataset
    addi s1, s1,-1          # < len keep doing
    bne  s1, zero, main     # goback
    li   a7, 10             # quit
    ecall


dupNumXOR:                      # duplicateNumbersXOR function
    li s6 0                     # xor result
    li s7 0                     # set_A (the  left half bits)
    li s8 0                     # set_B (the right half bits)
    li s11 0
    li t5 1
loop:
    lw s9, 0(s3)                # s9 = nums[i]
    li  t4 32
    bge s9 t4 left_half_bits    # If nums[i] > 31, it belongs to the upper half of the bits
right_half_bits:                # s8: set_B (the right half bits)
    sll  t4,  t5, s9
    and  s10, s8, t4
    beqz s10, right_skipxor     # if not(set & (1ULL << nums[i]))
    xor   s6, s6 ,s9    
right_skipxor:
    or   s8,  s8, t4
    j loop_goback
left_half_bits:                 # s7: set_A (the  left half bits)
    addi t4,  s9, -31           # (nums[i] - 31)
    sll  t4,  t5, t4
    and  s10, s7, t4
    beqz s10, left_skipxor      # if not(set & (1ULL << nums[i]))
    xor   s6, s6, s9    
left_skipxor:
    or   s7,  s7, t4
loop_goback:
    addi s3,   s3, 4            # iter nums[i] -> nums[i+1]
    addi s11, s11, 1            # i<numsSize; i++;                        
    blt  s11,  a1, loop      
    mv a0 s6                    # quit
    ret

newline:                        # to print "\n"
    mv  t6, a0
    la  a0, endl
    li  a7, 4
    ecall                           
    mv  a0, t6
    ret
Execution info Value
Cycles 683
Instrs. retired 463
CPI 1.48
IPC 0.678

Testcase

  • Case1:

Input: nums = [1, 2, 1, 3]
Output: 1

  • Case2:

Input: nums = [1, 2, 3, 4, 6, 7, 8, 2, 9, 9, 3, 5, 1, 5]
Output: 12

  • Case3:

Input: nums = [3, 49, 49, 3, 8, 12, 19, 19, 28, 27, 28, 50, 31, 36, 50, 36]
Output: 43

Console

1
12
43

Program exited with code: 0

Analysis

In the Ripes simulator, the CPU is organized as a classic five-stage pipeline, with pipeline registers between each stage. An instruction will sequentially go through these stages, allowing for different instructions to overlap in execution. For instance, consider the instruction lw a1, 0(s2) , which retrieves the starting address of the size array and stores it in the a1 register. This example illustrates how the pipeline operates in the following five stages:

2024_10_13_205708_FiveStagePipeline

1. Instruction fetch (IF):

In RISC-V, the register a1 is represented as x11, and the register s2 is represented as x18. Therefore, the instruction becomes lw x11, 0 x18, as shown below, which means loading a word from the address specified by s2 (in x18) into a1 (in x11). The lw instruction format is as follows:

lw rd, offset(rs1)

31 ~ 20 19 ~ 15 14 ~ 12 11 ~ 7 6 ~ 0
imm [11:0] rs1 funct3 rd opcode
0 x18 x11
000000000000 10010 010 01011 0000011

According to the Program Counter (PC), the instruction address in instruction memory is 0x00000018, which retrieves the instruction 00092583. The PC will then load the address of the next instruction, which is the current instruction address plus 4 bytes (i.e., 0x0000001C).

2024_10_13_233817_

2. Instruction decode (ID)

The instruction is decoded from binary according to the above table, and control signals are generated to determine what the instruction is supposed to do.

Additionally, this stage reads the values from the source registers in the register file. In this case, rs1, which corresponds to s2, contains the starting address of the size array in the data section. The starting address is exactly 0x10000004, and the offset immediate is 0.

2024_10_14_000400_

In memory, the orange area indicates the address and the words of the size array that are stored. The starting address is 0x10000004, with space for 3 words. The corresponding data is 4, 14, 16, as previously defined.

2024_10_14_001830_

3. Execute (EX):

Arithmetic or logical operations are performed in this stage. The address calculation uses the base address from the source register s2 and the offset (which is 0). The multiplexer selects the register value from the previous step (i.e., the ID stage). The ALU adds these two values together, resulting in an effective address that is the same as the value stored in s2.

2024_10_14_002541_

4. Memory access (MEM)

For memory-related instructions, this stage handles reading from or writing to memory. In this stage, it retrieves the word stored at the address specified by the offset in data memory. As a result, the first word in the size array, which is 4, will be retrieved.

2024_10_14_003100_

5. Writeback (WB)

The word from data memory is written back to the register file. The result 4, is stored in the destination register a1 (0b).

2024_10_14_004903_

Reference

* RISC-V 指令集架構介紹 - Integer Calling convention https://tclin914.github.io/77838749/ * Classic RISC pipeline https://en.wikipedia.org/wiki/Classic_RISC_pipeline

Always refer to primary sources, such as official RISC-V documentation.