Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`Kurumi2650`](https://github.com/Kurumi2650) > :::danger Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data. ::: ## Problem C * The assembly code in this section includes a self-verification feature. After performing the population count or other computations, the assembly code compares the result against a predefined correct answer. This comparison happens at the end of the operation. If the computed result does not match the expected value, the assembly code reports the error. This built-in self-check ensures that the calculations are accurate and allows for immediate detection of any issues during execution. ### C code of fabsf ```c static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` The main purpose of this code is to first use a uint32_t to store the bits of a float. The reason for doing this is to directly apply bitwise operations to manipulate the floating-point number. In the IEEE single precision format, the most significant bit is the sign bit, where 0 represents a positive number and 1 represents a negative number. Therefore, to obtain the absolute value, we simply need to perform an AND operation with 0x7FFFFFFF (0b01111111111111111111111111111111) to get the absolute value of the number. Finally, the result is returned as a float. ### Assembly code of fabsf ```c .data floats: .word 0xBF800000 # -1.0 .word 0x40000000 # 2.0 .word 0xC2480000 # -50.0 expected_values: .word 0x3F800000 # 1.0 .word 0x40000000 # 2.0 .word 0x42480000 # 50.0 correct_msg: .asciz "correct\n" wrong_msg_prefix: .asciz "wrong at index " newline: .asciz "\n" .text .globl main main: # Initialize registers la t0, floats # t0 points to the starting address of the floats array la t1, expected_values # t1 points to the starting address of the expected_values array li t2, 3 # t2 is the loop counter (number of floats) li t3, 0x7FFFFFFF # t3 is the mask to clear the sign bit li t4, 0 # t4 is the error flag, initially set to 0 (no error) li s0, 0 # s0 is the index counter, initially set to 0 loop: beqz t2, check_flag # If t2 is 0, all floats processed, jump to check_flag lw t5, 0(t0) # Load the current float from the floats array into t5 and t5, t5, t3 # Clear the sign bit to calculate the absolute value (fabsf) sw t5, 0(t0) # Store the result back to the floats array lw t6, 0(t1) # Load the expected value into t6 from the expected_values array bne t5, t6, report_error # If the result does not match the expected value, jump to report_error # Update pointers and counters addi t0, t0, 4 # Move to the next element in the floats array addi t1, t1, 4 # Move to the next element in the expected_values array addi t2, t2, -1 # Decrement the counter addi s0, s0, 1 # Increment the index counter j loop # Jump back to the start of the loop report_error: # Set the error flag li t4, 1 # Set t4 to 1, indicating an error # Print "wrong at index " la a0, wrong_msg_prefix li a7, 4 # Syscall number 4: print string ecall # Print the index (s0) mv a0, s0 # Move the current index to a0 li a7, 1 # Syscall number 1: print integer ecall # Print newline la a0, newline li a7, 4 # Syscall number 4: print string ecall # Update pointers and counters addi t0, t0, 4 # Move to the next element in the floats array addi t1, t1, 4 # Move to the next element in the expected_values array addi t2, t2, -1 # Decrement the counter addi s0, s0, 1 # Increment the index counter j loop # Jump back to the start of the loop check_flag: beqz t4, print_correct # If t4 is 0, indicating no error, print "correct" # If there is an error, exit the program j exit_program print_correct: la a0, correct_msg # Load the "correct" message li a7, 4 # Syscall number 4: print string ecall exit_program: li a7, 10 # Syscall number 10: exit program ecall ``` ### C code of my_clz ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` This code defines a function my_clz that counts the number of leading zeros in a 32-bit unsigned integer (uint32_t x). The function works by iterating through each bit of x starting from the most significant bit (bit 31) down to the least significant bit (bit 0). For each bit, it checks if the bit is set (i.e., if it equals 1). If it finds a set bit, the loop breaks. Otherwise, it increments a counter (count) to track how many leading zeros have been encountered. ### Assembly code of my_clz ```assembly= .data test_data: .word 0x00000000 # Test data 0: x = 0 .word 0x80000000 # Test data 1: x = -2147483648 (most significant bit is 1, indicating a negative number) .word 0x00F00000 # Test data 2: x = 15728640 expected_results: .word 32 # Expected result 0: 32 (because x == 0) .word 0 # Expected result 1: 0 (because x < 0) .word 8 # Expected result 2: 8 (15728640 has 8 leading zeros in binary) correct_msg: .asciz "correct\n" wrong_msg_prefix: .asciz "wrong at index " newline: .asciz "\n" .text .globl main main: # Initialize pointers and flags la s2, test_data # s2 points to the start of the test_data array la s3, expected_results # s3 points to the start of the expected_results array li s1, 3 # s1 is the loop counter (number of test data items) li t4, 0 # t4 is the error flag, initially 0 (no error) li s0, 0 # s0 is the index counter, initially 0 loop: beqz s1, check_flag # If s1 is 0, all test data is processed, jump to the check phase lw a0, 0(s2) # Load current test data into a0 jal ra, my_clz # Call my_clz function lw t5, 0(s3) # Load expected result into t5 bne a0, t5, report_error # If the result doesn't match the expected value, jump to report_error # Update pointers and counters addi s2, s2, 4 # Move to the next test data addi s3, s3, 4 # Move to the next expected result addi s1, s1, -1 # Decrease the loop counter addi s0, s0, 1 # Increase the index counter j loop # Return to the beginning of the loop report_error: # Set the error flag li t4, 1 # t4 = 1, indicating an error # Print "wrong at index " la a0, wrong_msg_prefix li a7, 4 # System call number 4: print string ecall # Print the index (s0) mv a0, s0 # Move the current index to a0 li a7, 1 # System call number 1: print integer ecall # Print newline la a0, newline li a7, 4 # System call number 4: print string ecall # Update pointers and counters addi s2, s2, 4 # Move to the next test data addi s3, s3, 4 # Move to the next expected result addi s1, s1, -1 # Decrease the loop counter addi s0, s0, 1 # Increase the index counter j loop # Return to the beginning of the loop check_flag: beqz t4, print_correct # If t4 is 0, indicating no errors, jump to print "correct" # If there is an error, exit the program j exit_program print_correct: la a0, correct_msg # Load "correct" message jal ra, print_string # Call print_string exit_program: li a7, 10 # System call number 10: exit program ecall # Subroutine to print string print_string: # a0 contains the address of the string li a7, 4 # System call number 4: print string ecall ret # my_clz function my_clz: addi sp, sp, -8 # Allocate stack space sw ra, 4(sp) # Save return address sw s0, 0(sp) # Save s0 (used within the function) mv s0, a0 # s0 = x # If x == 0, return 32 beqz s0, my_clz_return_32 # If x < 0 (i.e., the most significant bit of x is 1), return 0 bltz s0, my_clz_return_0 # count = 0 li t0, 0 # t0 = count li t1, 31 # t1 = i my_clz_loop: blt t1, zero, my_clz_end_loop # If i < 0, end loop li t2, 1 sll t2, t2, t1 # t2 = 1U << i and t3, s0, t2 # t3 = x & (1U << i) bnez t3, my_clz_end_loop addi t0, t0, 1 # count++ addi t1, t1, -1 # i-- j my_clz_loop my_clz_end_loop: mv a0, t0 # Return count j my_clz_exit my_clz_return_32: li a0, 32 j my_clz_exit my_clz_return_0: li a0, 0 my_clz_exit: # Restore registers and return lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret ``` :::danger Use fewer instructions. ::: ### C code of clz (population count) ```c #include <stdint.h> uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); // A01 for 32-bit: alternating bits mask (binary: 010101...) x = ((x >> 2) & 0x33333333) + (x & 0x33333333); // A02 for 32-bit: mask for 2-bit groups (binary: 00110011...) x = ((x >> 4) + x) & 0x0f0f0f0f; // mask for 4-bit groups x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); // Adjust for 32 bits (0x3f = 6 bits mask) } ``` This 32-bit population count program works by efficiently counting the number of 1 bits in a 32-bit integer using a series of bitwise operations. It begins with a process called bit expansion, where the number is shifted and OR-ed with itself. This ensures that all bits below the highest set 1-bit are also set to 1, effectively preparing the number for counting. Next, the program performs the population count by grouping bits together. It first pairs adjacent bits and uses a subtraction operation to count how many 1s exist in each 2-bit group. Then, it aggregates these results by progressively summing larger groups of bits, such as 4-bit and 8-bit blocks. The program ultimately produces the total count of set 1 bits in the integer. ### Assembly code of clz (population count) ```assembly= .data test_data: .word 0x00000000 # Test data 0: x = 0 .word 0x80000000 # Test data 1: x = -2147483648 (MSB is 1) .word 0x00F00000 # Test data 2: x = 15728640 expected_results: .word 32 # Expected result 0: 32 (x == 0) .word 0 # Expected result 1: 0 (x < 0) .word 8 # Expected result 2: 8 (15728640 has 8 leading zeros) correct_msg: .asciz "correct\n" wrong_msg_prefix: .asciz "wrong at index " newline: .asciz "\n" # Define masks in the data section mask_55555555: .word 0x55555555 mask_33333333: .word 0x33333333 mask_0f0f0f0f: .word 0x0f0f0f0f .text .globl main main: # Initialize pointers and flags la s2, test_data # s2 points to test_data array la s3, expected_results # s3 points to expected_results array li s1, 3 # s1 is the loop counter (number of test cases) li t4, 0 # t4 is the error flag, initially 0 (no error) li s0, 0 # s0 is the index counter, initially 0 loop: beqz s1, check_flag # If s1 is 0, all tests are done lw a0, 0(s2) # Load current test data into a0 jal ra, my_clz # Call my_clz function lw t5, 0(s3) # Load expected result into t5 bne a0, t5, report_error # If result != expected, report error # Update pointers and counters addi s2, s2, 4 # Move to next test data addi s3, s3, 4 # Move to next expected result addi s1, s1, -1 # Decrease loop counter addi s0, s0, 1 # Increase index counter j loop # Repeat loop report_error: # Set error flag li t4, 1 # t4 = 1, indicating an error # Print "wrong at index " la a0, wrong_msg_prefix li a7, 4 # syscall number for print string ecall # Print the index (s0) mv a0, s0 # Move index to a0 li a7, 1 # syscall number for print integer ecall # Print newline la a0, newline li a7, 4 # syscall number for print string ecall # Update pointers and counters addi s2, s2, 4 # Move to next test data addi s3, s3, 4 # Move to next expected result addi s1, s1, -1 # Decrease loop counter addi s0, s0, 1 # Increase index counter j loop # Repeat loop check_flag: beqz t4, print_correct # If no errors, print "correct" j exit_program # If errors, exit print_correct: la a0, correct_msg # Load "correct" message jal ra, print_string # Call print_string j exit_program exit_program: li a7, 10 # syscall number for exit ecall # Subroutine to print a string print_string: li a7, 4 # syscall number for print string ecall ret # Function: my_clz my_clz: addi sp, sp, -8 # Allocate stack space sw ra, 4(sp) # Save return address sw s0, 0(sp) # Save s0 (used in function) mv s0, a0 # s0 = x # If x == 0, return 32 beqz s0, return_32 # If x < 0 (MSB is 1), return 0 bltz s0, return_0 # Begin computation as per the C code # x |= (x >> 1) srli t0, s0, 1 # t0 = x >> 1 or s0, s0, t0 # x |= t0 # x |= (x >> 2) srli t0, s0, 2 # t0 = x >> 2 or s0, s0, t0 # x |= t0 # x |= (x >> 4) srli t0, s0, 4 # t0 = x >> 4 or s0, s0, t0 # x |= t0 # x |= (x >> 8) srli t0, s0, 8 # t0 = x >> 8 or s0, s0, t0 # x |= t0 # x |= (x >> 16) srli t0, s0, 16 # t0 = x >> 16 or s0, s0, t0 # x |= t0 # x -= ((x >> 1) & 0x55555555) srli t0, s0, 1 # t0 = x >> 1 la t1, mask_55555555 # Load address of mask_55555555 lw t1, 0(t1) # t1 = 0x55555555 and t0, t0, t1 # t0 = (x >> 1) & 0x55555555 sub s0, s0, t0 # x -= t0 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) srli t0, s0, 2 # t0 = x >> 2 la t1, mask_33333333 # Load address of mask_33333333 lw t1, 0(t1) # t1 = 0x33333333 and t0, t0, t1 # t0 = (x >> 2) & 0x33333333 and t2, s0, t1 # t2 = x & 0x33333333 add s0, t0, t2 # x = t0 + t2 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t0, s0, 4 # t0 = x >> 4 add s0, s0, t0 # x += t0 la t1, mask_0f0f0f0f # Load address of mask_0f0f0f0f lw t1, 0(t1) # t1 = 0x0f0f0f0f and s0, s0, t1 # x &= 0x0f0f0f0f # x += (x >> 8) srli t0, s0, 8 # t0 = x >> 8 add s0, s0, t0 # x += t0 # x += (x >> 16) srli t0, s0, 16 # t0 = x >> 16 add s0, s0, t0 # x += t0 # Return (32 - (x & 0x3f)) andi t0, s0, 0x3f # t0 = x & 0x3f li t1, 32 # t1 = 32 sub a0, t1, t0 # a0 = 32 - t0 # Restore and return lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret return_32: li a0, 32 # Return 32 lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret return_0: li a0, 0 # Return 0 lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret ``` ### C code of fp16_to_fp32 ```c static inline uint32_t fp16_to_fp32(uint16_t h) { /* * Extends the 16-bit half-precision floating-point number to 32 bits * by shifting it to the upper half of a 32-bit word: * +---+-----+------------+-------------------+ * | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 31 26-30 16-25 0-15 * * S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits. */ const uint32_t w = (uint32_t) h << 16; /* * Isolates the sign bit from the input number, placing it in the most * significant bit of a 32-bit word: * * +---+----------------------------------+ * | S |0000000 00000000 00000000 00000000| * +---+----------------------------------+ * Bits 31 0-31 */ const uint32_t sign = w & UINT32_C(0x80000000); /* * Extracts the mantissa and exponent from the input number, placing * them in bits 0-30 of the 32-bit word: * * +---+-----+------------+-------------------+ * | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 30 27-31 17-26 0-16 */ const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); /* * The renorm_shift variable indicates how many bits the mantissa * needs to be shifted to normalize the half-precision number. * For normalized numbers, renorm_shift will be 0. For denormalized * numbers, renorm_shift will be greater than 0. Shifting a * denormalized number will move the mantissa into the exponent, * normalizing it. */ uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; /* * If the half-precision number has an exponent of 15, adding a * specific value will cause overflow into bit 31, which converts * the upper 9 bits into ones. Thus: * inf_nan_mask == * 0x7F800000 if the half-precision number is * NaN or infinity (exponent of 15) * 0x00000000 otherwise */ const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); /* * If nonsign equals 0, subtracting 1 will cause overflow, setting * bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result * propagates bit 31 across all bits in zero_mask. Thus: * zero_mask == * 0xFFFFFFFF if the half-precision number is * zero (+0.0h or -0.0h) * 0x00000000 otherwise */ const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; /* * 1. Shifts nonsign left by renorm_shift to normalize it (for denormal * inputs). * 2. Shifts nonsign right by 3, adjusting the exponent to fit in the * 8-bit exponent field and moving the mantissa into the correct * position within the 23-bit mantissa field of the single-precision * format. * 3. Adds 0x70 to the exponent to account for the difference in bias * between half-precision and single-precision. * 4. Subtracts renorm_shift from the exponent to account for any * renormalization that occurred. * 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input * was NaN or infinity. * 6. ANDs with the inverted zero_mask to set the mantissa and exponent * to zero if the input was zero. * 7. Combines everything with the sign bit of the input number. */ return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` This function, fp16_to_fp32, converts a 16-bit half-precision floating-point number into a 32-bit single-precision floating-point number. ### Assembly code of fp16_to_fp32 ```assembly= .data # Test cases: half-precision floats (16-bit values) floats: .word 0x00003C00 # +1.0 in half-precision (0x3C00) .word 0x0000C000 # -2.0 in half-precision (0xC000) .word 0x00007BFF # Largest positive normal number in half-precision expected_values: .word 0x3F800000 # +1.0 in single-precision (0x3F800000) .word 0xC0000000 # -2.0 in single-precision (0xC0000000) .word 0x477FE000 # Equivalent single-precision value correct_msg: .asciz "correct\n" wrong_msg_prefix: .asciz "wrong at index " newline: .asciz "\n" .text .globl main main: # Initialize registers la s0, floats # s0 points to the starting address of the floats array la s1, expected_values # s1 points to the starting address of the expected_values array li s2, 3 # s2 is the loop counter (number of floats) li s3, 0 # s3 is the error flag, initially set to 0 (no error) li s4, 0 # s4 is the index counter, initially set to 0 loop: beqz s2, check_flag # If s3 is 0, all floats processed, jump to check_flag lw t4, 0(s0) # Load the current word from the floats array into t4 # Call fp16_to_fp32(t5) mv a0, t4 # Move h into a0 addi sp, sp ,-4 sw t4, 0(sp) # save t4 jal fp16_to_fp32 # Call fp16_to_fp32 mv t6, a0 # t6 = result lw t4, 0(sp) addi sp, sp ,4 # restore t4 lw t5, 0(s1) # Load the expected value into t5 from the expected_values array bne t6, t5, report_error # If the result does not match the expected value, jump to report_error # Update pointers and counters addi s0, s0, 4 # Move to the next element in the floats array addi s1, s1, 4 # Move to the next element in the expected_values array addi s2, s2, -1 # Decrement the counter addi s4, s4, 1 # Increment the index counter j loop # Jump back to the start of the loop report_error: # Set the error flag li s3, 1 # Set t3 to 1, indicating an error # Print "wrong at index " la a0, wrong_msg_prefix li a7, 4 # Syscall number 4: print string ecall # Print the index (s0) mv a0, s4 # Move the current index to a0 li a7, 1 # Syscall number 1: print integer ecall # Print newline la a0, newline li a7, 4 # Syscall number 4: print string ecall # Update pointers and counters addi s0, s0, 4 # Move to the next element in the floats array addi s1, s1, 4 # Move to the next element in the expected_values array addi s2, s2, -1 # Decrement the counter addi s4, s4, 1 # Increment the index counter j loop # Jump back to the start of the loop check_flag: bne s3, zero, exit_program # If s3 is not zero (error found), exit # If t3 is zero, indicating no error, print "correct" la a0, correct_msg # Load the "correct" message li a7, 4 # Syscall number 4: print string ecall j exit_program # Ensure the program exits after printing "correct" exit_program: li a7, 10 # Syscall number 10: exit program ecall j exit_program # Infinite loop to prevent execution beyond this point # fp16_to_fp32 function fp16_to_fp32: # Input: a0 = h (uint16_t) # Output: a0 = fp32 value addi sp, sp, -4 sw ra, 0(sp) # Shift h left by 16 bits to get w slli t0, a0, 16 # t0 = w = h << 16 # sign = w & 0x80000000 li t1, 0x80000000 and t1, t0, t1 # t1 = sign # nonsign = w & 0x7FFFFFFF li t2, 0x7FFFFFFF and t2, t0, t2 # t2 = nonsign addi sp, sp, -8 sw t1, 0(sp) # save t1 in stack sw t2, 4(sp) # save t2 in stack # renorm_shift = my_clz(nonsign) mv a0, t2 # Prepare argument for my_clz jal my_clz # Call my_clz(nonsign) mv t3, a0 # t3 = renorm_shift lw t1, 0(sp) # restore t1 lw t2, 4(sp) # restore t2 addi sp, sp, 8 # renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0 li t4, 5 blt t3, t4, renorm_shift_zero sub t3, t3, t4 # t3 = renorm_shift - 5 j renorm_shift_continue renorm_shift_zero: li t3, 0 renorm_shift_continue: # Compute inf_nan_mask = ((nonsign + 0x04000000) >> 8) & 0x7F800000 li t4, 0x04000000 add t5, t2, t4 # t5 = nonsign + 0x04000000 srai t5, t5, 8 # t5 = (nonsign + 0x04000000) >> 8 li t6, 0x7F800000 and t5, t5, t6 # t5 = inf_nan_mask # Compute zero_mask = (int32_t)(nonsign - 1) >> 31 addi t6, t2, -1 # t6 = nonsign - 1 srai t6, t6, 31 # t6 = zero_mask # Compute (nonsign << renorm_shift) >> 3 sll t4, t2, t3 # t4 = nonsign << renorm_shift srai t4, t4, 3 # t4 = t4 >> 3 # Compute ((0x70 - renorm_shift) << 23) li t2, 0x70 sub t2, t2, t3 # t2 = 0x70 - renorm_shift slli t2, t2, 23 # t2 = (0x70 - renorm_shift) << 23 # Sum add t4, t4, t2 # t4 = t4 + t2 # Combine with inf_nan_mask or t4, t4, t5 # s1 = s1 | inf_nan_mask # Apply zero_mask not t6, t6 # t6 = ~zero_mask and t4, t4, t6 # t4 = t4 & ~zero_mask # Combine with sign or a0, t1, t4 # a0 = sign | s1 lw ra, 0(sp) addi sp, sp, 4 ret # my_clz function my_clz: addi sp, sp, -8 # Allocate stack space sw ra, 4(sp) # Save return address sw s0, 0(sp) # Save s0 (used within the function) mv s0, a0 # s0 = x # If x == 0, return 32 beqz s0, my_clz_return_32 # count = 0 li t0, 0 # t0 = count li t1, 31 # t1 = i my_clz_loop: blt t1, zero, my_clz_end_loop # If i < 0, end loop li t2, 1 sll t2, t2, t1 # t2 = 1U << i and t3, s0, t2 # t3 = x & (1U << i) bnez t3, my_clz_end_loop addi t0, t0, 1 # count++ addi t1, t1, -1 # i-- j my_clz_loop my_clz_end_loop: mv a0, t0 # Return count j my_clz_exit my_clz_return_32: li a0, 32 my_clz_exit: # Restore registers and return lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret ``` This program will similarly define the test data and expected results in advance. After running the program, if any test case is incorrect, it will print out which one is wrong. ## Single number II > [LeetCode 137](https://leetcode.com/problems/single-number-ii/description/) Description: Given an integer array nums where every element appears three times except for one, which appears exactly once. Find the single element and return it. You must implement a solution with a linear runtime complexity and use only constant extra space. Constraints: * 1 <= nums.length <= 3 * 104 * -2^31 <= nums[i] <= 2^31 - 1 * Each element in nums appears exactly three times except for one element which appears once. ## Solution ### Idea for problem solving By counting the occurrences of 1s in each bit of every number and checking whether the number of 1s is divisible by 3, if it is not divisible by 3, it means that the bit was contributed by the number that appears only once. Therefore, this bit is stored in the result. ### C program #### First version (baseline) ```c int singleNumber(int* nums, int numsSize) { int result = 0; for (int i = 0; i < 32; ++i) { int count = 0; for (int j = 0; j < numsSize; ++j) { if (nums[j] & (1U << i)) { count++; } } if (count % 3 != 0) { result |= (1U << i); } } return result; } ``` By shifting an unsigned integer 1 and performing an AND operation with each number, if the result of the AND operation is non-zero, it means that the i-th bit of the number is set to 1. If the count of 1s in the i-th bit is not divisible by 3, it indicates that the Single Number contributed a 1 in that bit. We store this in a result variable. #### Second version ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int singleNumber(int* nums, int numsSize) { int result = 0; int min_leading_zeros = 32; for (int i = 0; i < numsSize; i++) { if (nums[i] != 0) { int leading_zeros = my_clz((uint32_t)nums[i]); if (leading_zeros < min_leading_zeros) { min_leading_zeros = leading_zeros; } } } for (int i = 0; i < 32 - min_leading_zeros; i++) { int count = 0; for (int j = 0; j < numsSize; j++) { if (nums[j] & (1U << i)) { count++; } } if (count % 3 != 0) { result |= (1U << i); } } return result; } ``` Using the my_clz function, we can calculate the number of leading zeros for each number, record the minimum number of leading zeros, and use this value to determine the highest bit with a 1 across all numbers. This allows us to reduce the number of comparisons from the default 32 times for a 32-bit integer to fewer comparisons. #### Third version ```c static inline int my_clz(uint32_t x) { if (x == 0) return 32; int count = 0; if ((int32_t)x < 0) return 0; // x sign bit is 1， number of leading zero is 0. for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int singleNumber(int* nums, int numsSize) { int result = 0; int min_leading_zeros = 32; for (int i = 0; i < numsSize; i++) { if (nums[i] != 0) { int leading_zeros = my_clz((uint32_t)nums[i]); if (leading_zeros < min_leading_zeros) { min_leading_zeros = leading_zeros; } } } for (int i = 0; i < 32 - min_leading_zeros; i++) { int count = 0; for (int j = 0; j < numsSize; j++) { if (nums[j] & (1U << i)) { count++; } } if (count % 3 != 0) { result |= (1U << i); } } return result; } ``` This version first checks if x is less than or equal to 0 and handles it accordingly, without entering the loop computation. ### Assembly #### First Version (Corresponding to the first version of the C code) ```assembly= .data # Test Case 1: nums = [2, 2, 3, 2] nums1: .word 2, 2, 3, 2 nums1_size: .word 4 # Test Case 2: nums = [0, 1, 0, 1, 0, 1, 99] nums2: .word 0, 1, 0, 1, 0, 1, 99 nums2_size: .word 7 # Test Case 3: nums = [-2, -2, -2, -5] nums3: .word -2, -2, -2, -5 nums3_size: .word 4 result_str: .asciz "Result: " newline_str: .asciz "\n" int_buffer: .zero 12 # Allocate and zero-initialize 12 bytes .text .globl main main: # Test Case 1 la a0, nums1 # Load address of nums1 array lw a1, nums1_size # Load size of nums1 array jal ra, singleNumber # Call singleNumber(nums1, size) mv a2, a0 # Move result to a2 for printing la a0, result_str # Load address of result string jal ra, print_result # Print result # Test Case 2 la a0, nums2 lw a1, nums2_size jal ra, singleNumber mv a2, a0 la a0, result_str jal ra, print_result # Test Case 3 la a0, nums3 lw a1, nums3_size jal ra, singleNumber mv a2, a0 la a0, result_str jal ra, print_result # Exit program li a0, 0 # Exit code 0 li a7, 93 # ECALL for exit ecall # int singleNumber(int* nums, int numsSize) singleNumber: addi sp, sp, -24 # Allocate stack space sw ra, 20(sp) # Save return address sw s0, 16(sp) # Save s0 sw s1, 12(sp) # Save s1 sw s2, 8(sp) # Save s2 sw s3, 4(sp) # Save s3 sw s4, 0(sp) # Save s4 mv s0, a0 # s0 = nums (int* nums) mv s1, a1 # s1 = numsSize (int numsSize) li a0, 0 # a0 = result = 0 li s2, 0 # s2 = i = 0 (bit index) singleNumber_outer_loop: li t0, 32 # t0 = 32 (number of bits) bge s2, t0, singleNumber_end_outer_loop # if i >= 32, exit loop # Initialize count = 0 li s3, 0 # s3 = count = 0 li s4, 0 # s4 = j = 0 singleNumber_inner_loop: bge s4, s1, singleNumber_end_inner_loop # if j >= numsSize, exit inner loop # Load nums[j] slli t1, s4, 2 # t1 = j * 4 (word size) add t1, s0, t1 # t1 = &nums[j] lw t2, 0(t1) # t2 = nums[j] # Check if nums[j] & (1U << i) li t3, 1 sll t3, t3, s2 # t3 = 1U << i and t4, t2, t3 # t4 = nums[j] & (1U << i) beqz t4, singleNumber_skip_increment # if zero, skip increment # count = (count + 1) % 3 addi s3, s3, 1 # count++ li t5, 3 blt s3, t5, singleNumber_skip_reset # if count < 3, skip reset li s3, 0 # count = 0 singleNumber_skip_reset: # Continue to next iteration singleNumber_skip_increment: addi s4, s4, 1 # j++ jal zero, singleNumber_inner_loop singleNumber_end_inner_loop: # if count != 0 beqz s3, singleNumber_skip_bit_set # if count == 0, skip setting bit # result |= (1U << i) li t6, 1 sll t6, t6, s2 # t6 = 1U << i or a0, a0, t6 # result |= t6 singleNumber_skip_bit_set: addi s2, s2, 1 # i++ jal zero, singleNumber_outer_loop singleNumber_end_outer_loop: # Result is in a0 # Restore registers and stack lw s4, 0(sp) lw s3, 4(sp) lw s2, 8(sp) lw s1, 12(sp) lw s0, 16(sp) lw ra, 20(sp) addi sp, sp, 24 ret # Simplified print_result function for Ripes simulator print_result: addi sp, sp, -8 # Allocate stack space sw ra, 4(sp) # Save return address sw a0, 0(sp) # Save a0 (address of result_str) # Print "Result: " mv a0, a0 # a0 already contains address of result_str li a7, 4 # ECALL for print string ecall # Print integer in a2 mv a0, a2 # Move result to a0 jal ra, print_int # Call print_int # Print newline la a0, newline_str li a7, 4 ecall # Restore registers and stack lw a0, 0(sp) # Restore a0 lw ra, 4(sp) # Restore ra addi sp, sp, 8 ret # Corrected print_int function print_int: # Convert integer to string and print # Handle negative numbers addi sp, sp, -36 # Adjust stack size for additional registers sw ra, 32(sp) sw t0, 28(sp) sw t1, 24(sp) sw t2, 20(sp) sw t3, 16(sp) sw t4, 12(sp) sw t5, 8(sp) sw a4, 4(sp) # Save a4 sw a5, 0(sp) # Save a5 mv t0, a0 # t0 = number la t1, int_buffer # t1 = address of int_buffer # Clear the buffer li t2, 12 # t2 = buffer size li t3, 0 # t3 = zero mv a5, t1 # a5 = buffer pointer clear_buffer_loop: beqz t2, clear_buffer_done sb t3, 0(a5) addi a5, a5, 1 addi t2, t2, -1 jal zero, clear_buffer_loop clear_buffer_done: # Reset t1 to end of buffer la t1, int_buffer # t1 = address of int_buffer addi t1, t1, 11 # t1 points to int_buffer + 11 (one before end) li t2, 48 # t2 = ASCII code for '0' li t3, 0 # t3 = sign flag li a4, 10 # a4 = 10 # Check if number is zero beq t0, zero, print_int_zero # Check if number is negative blt t0, zero, print_int_negative print_int_loop: # t4 = t0 / 10 (quotient) # t5 = t0 % 10 (remainder) # Initialize t4 = 0 (quotient), t5 = t0 (remainder) mv t5, t0 # t5 = t0 (remainder) li t4, 0 # t4 = 0 (quotient) print_int_divide_loop: blt t5, a4, print_int_divide_end sub t5, t5, a4 # t5 -= 10 addi t4, t4, 1 # t4 += 1 jal zero, print_int_divide_loop print_int_divide_end: # Now t4 is quotient, t5 is remainder # Convert remainder to character add t6, t5, t2 # t6 = t5 + '0' sb t6, 0(t1) # Store character at t1 addi t1, t1, -1 # t1 -= 1 # Update t0 to quotient mv t0, t4 # t0 = t4 # If t0 != 0, continue loop bnez t0, print_int_loop # After loop, check for sign beqz t3, print_int_print # If not negative, proceed to print # Add '-' sign li t6, 45 # '-' sb t6, 0(t1) addi t1, t1, -1 print_int_print: addi t1, t1, 1 # Adjust t1 to point to the start of string # Print the number string mv a0, t1 # a0 = pointer to start of string li a7, 4 ecall # Restore registers and stack lw a5, 0(sp) # Restore a5 lw a4, 4(sp) # Restore a4 lw t5, 8(sp) lw t4, 12(sp) lw t3, 16(sp) lw t2, 20(sp) lw t1, 24(sp) lw t0, 28(sp) lw ra, 32(sp) addi sp, sp, 36 ret print_int_negative: li t3, 1 # Set sign flag neg t0, t0 # t0 = -t0 jal zero, print_int_loop print_int_zero: # Handle zero li t6, 48 # '0' sb t6, 0(t1) addi t1, t1, -1 jal zero, print_int_print ``` #### Second Version (Corresponding to the third version of the C code) ```assembly= .data # Test Case 1: nums = [2, 2, 3, 2] nums1: .word 2, 2, 3, 2 nums1_size: .word 4 # Test Case 2: nums = [0, 1, 0, 1, 0, 1, 99] nums2: .word 0, 1, 0, 1, 0, 1, 99 nums2_size: .word 7 # Test Case 3: nums = [-2, -2, -2, -5] nums3: .word -2, -2, -2, -5 nums3_size: .word 4 result_str: .asciz "Result: " newline_str: .asciz "\n" int_buffer: .zero 12 # Allocate and zero-initialize 12 bytes .text .globl main main: # Test Case 1 la a0, nums1 # Load address of nums1 array lw a1, nums1_size # Load size of nums1 array jal ra, singleNumber # Call singleNumber(nums1, size) mv a2, a0 # Move result to a2 for printing la a0, result_str # Load address of result string jal ra, print_result # Print result # Test Case 2 la a0, nums2 lw a1, nums2_size jal ra, singleNumber mv a2, a0 la a0, result_str jal ra, print_result # Test Case 3 la a0, nums3 lw a1, nums3_size jal ra, singleNumber mv a2, a0 la a0, result_str jal ra, print_result # Exit program li a0, 0 # Exit code 0 li a7, 93 # ECALL for exit ecall # int singleNumber(int* nums, int numsSize) singleNumber: addi sp, sp, -36 # Allocate stack space sw ra, 32(sp) # Save return address sw s0, 28(sp) # Save s0 sw s1, 24(sp) # Save s1 sw s2, 20(sp) # Save s2 sw s3, 16(sp) # Save s3 sw s4, 12(sp) # Save s4 sw s5, 8(sp) # Save s5 sw s6, 4(sp) # Save s6 sw s7, 0(sp) # Save s7 mv s0, a0 # s0 = nums (int* nums) mv s1, a1 # s1 = numsSize (int numsSize) li s7, 0 # s7 = result = 0 # Initialize min_leading_zeros = 32 li s5, 32 # s5 = min_leading_zeros = 32 li s2, 0 # s2 = i = 0 find_min_leading_zeros_loop: bge s2, s1, end_min_leading_zeros_loop # Load nums[i] slli t0, s2, 2 # t0 = i * 4 add t0, s0, t0 # t0 = &nums[i] lw t1, 0(t0) # t1 = nums[i] bnez t1, process_non_zero addi s2, s2, 1 # i++ jal zero, find_min_leading_zeros_loop process_non_zero: # Call my_clz((uint32_t)nums[i]) mv a0, t1 # a0 = nums[i] jal ra, my_clz # leading_zeros in a0 mv t2, a0 # t2 = leading_zeros blt t2, s5, update_min_leading_zeros jal zero, skip_update_min_leading_zeros update_min_leading_zeros: mv s5, t2 # min_leading_zeros = leading_zeros skip_update_min_leading_zeros: addi s2, s2, 1 # i++ jal zero, find_min_leading_zeros_loop end_min_leading_zeros_loop: # Compute max_bit_index = 32 - min_leading_zeros li t0, 32 sub s6, t0, s5 # s6 = 32 - min_leading_zeros # Initialize s2 = 0 (i) li s2, 0 # s2 = i = 0 singleNumber_outer_loop: bge s2, s6, singleNumber_end_outer_loop # if i >= (32 - min_leading_zeros), exit loop # Initialize count = 0 li s3, 0 # s3 = count = 0 li s4, 0 # s4 = j = 0 singleNumber_inner_loop: bge s4, s1, singleNumber_end_inner_loop # if j >= numsSize, exit inner loop # Load nums[j] slli t1, s4, 2 # t1 = j * 4 add t1, s0, t1 # t1 = &nums[j] lw t2, 0(t1) # t2 = nums[j] # Check if nums[j] & (1U << i) li t3, 1 sll t3, t3, s2 # t3 = 1U << i and t4, t2, t3 # t4 = nums[j] & (1U << i) beqz t4, singleNumber_skip_increment # if zero, skip increment # count = (count + 1) % 3 addi s3, s3, 1 # count++ li t5, 3 blt s3, t5, singleNumber_skip_reset # if count < 3, skip reset li s3, 0 # count = 0 singleNumber_skip_reset: # Continue to next iteration singleNumber_skip_increment: addi s4, s4, 1 # j++ jal zero, singleNumber_inner_loop singleNumber_end_inner_loop: # if count != 0 beqz s3, singleNumber_skip_bit_set # if count == 0, skip setting bit # result |= (1U << i) li t6, 1 sll t6, t6, s2 # t6 = 1U << i or s7, s7, t6 # s7 |= t6 singleNumber_skip_bit_set: addi s2, s2, 1 # i++ jal zero, singleNumber_outer_loop singleNumber_end_outer_loop: mv a0, s7 # Move result to a0 # Restore registers and stack lw s7, 0(sp) lw s6, 4(sp) lw s5, 8(sp) lw s4, 12(sp) lw s3, 16(sp) lw s2, 20(sp) lw s1, 24(sp) lw s0, 28(sp) lw ra, 32(sp) addi sp, sp, 36 ret # int my_clz(uint32_t x) my_clz: addi sp, sp, -8 # Allocate stack space sw ra, 4(sp) # Save return address sw s0, 0(sp) # Save s0 mv s0, a0 # s0 = x # if x == 0, return 32 beqz s0, my_clz_return_32 # if x < 0 (i.e., (int32_t)x < 0), return 0 bltz s0, my_clz_return_0 # count = 0 li t0, 0 # t0 = count li t1, 31 # t1 = i my_clz_loop: blt t1, zero, my_clz_end_loop # if i < 0, end loop li t2, 1 sll t2, t2, t1 # t2 = 1U << i and t3, s0, t2 # t3 = x & (1U << i) bnez t3, my_clz_end_loop addi t0, t0, 1 # count++ addi t1, t1, -1 # i-- jal zero, my_clz_loop my_clz_end_loop: mv a0, t0 # Return count jal zero, my_clz_exit my_clz_return_32: li a0, 32 jal zero, my_clz_exit my_clz_return_0: li a0, 0 my_clz_exit: # Restore registers and return lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret # Simplified print_result function for Ripes simulator print_result: addi sp, sp, -8 # Allocate stack space sw ra, 4(sp) # Save return address sw a0, 0(sp) # Save a0 (address of result_str) # Print "Result: " mv a0, a0 # a0 already contains address of result_str li a7, 4 # ECALL for print string ecall # Print integer in a2 mv a0, a2 # Move result to a0 jal ra, print_int # Call print_int # Print newline la a0, newline_str li a7, 4 ecall # Restore registers and stack lw a0, 0(sp) # Restore a0 lw ra, 4(sp) # Restore ra addi sp, sp, 8 ret # Corrected print_int function print_int: # Convert integer to string and print # Handle negative numbers addi sp, sp, -36 # Adjust stack size for additional registers sw ra, 32(sp) sw t0, 28(sp) sw t1, 24(sp) sw t2, 20(sp) sw t3, 16(sp) sw t4, 12(sp) sw t5, 8(sp) sw a4, 4(sp) # Save a4 sw a5, 0(sp) # Save a5 mv t0, a0 # t0 = number la t1, int_buffer # t1 = address of int_buffer # Clear the buffer li t2, 12 # t2 = buffer size li t3, 0 # t3 = zero mv a5, t1 # a5 = buffer pointer clear_buffer_loop: beqz t2, clear_buffer_done sb t3, 0(a5) addi a5, a5, 1 addi t2, t2, -1 jal zero, clear_buffer_loop clear_buffer_done: # Reset t1 to end of buffer la t1, int_buffer # t1 = address of int_buffer addi t1, t1, 11 # t1 points to int_buffer + 11 (one before end) li t2, 48 # t2 = ASCII code for '0' li t3, 0 # t3 = sign flag li a4, 10 # a4 = 10 # Check if number is zero beq t0, zero, print_int_zero # Check if number is negative blt t0, zero, print_int_negative print_int_loop: # t4 = t0 / 10 (quotient) # t5 = t0 % 10 (remainder) # Initialize t4 = 0 (quotient), t5 = t0 (remainder) mv t5, t0 # t5 = t0 (remainder) li t4, 0 # t4 = 0 (quotient) print_int_divide_loop: blt t5, a4, print_int_divide_end sub t5, t5, a4 # t5 -= 10 addi t4, t4, 1 # t4 += 1 jal zero, print_int_divide_loop print_int_divide_end: # Now t4 is quotient, t5 is remainder # Convert remainder to character add t6, t5, t2 # t6 = t5 + '0' sb t6, 0(t1) # Store character at t1 addi t1, t1, -1 # t1 -= 1 # Update t0 to quotient mv t0, t4 # t0 = t4 # If t0 != 0, continue loop bnez t0, print_int_loop # After loop, check for sign beqz t3, print_int_print # If not negative, proceed to print # Add '-' sign li t6, 45 # '-' sb t6, 0(t1) addi t1, t1, -1 print_int_print: addi t1, t1, 1 # Adjust t1 to point to the start of string # Print the number string mv a0, t1 # a0 = pointer to start of string li a7, 4 ecall # Restore registers and stack lw a5, 0(sp) # Restore a5 lw a4, 4(sp) # Restore a4 lw t5, 8(sp) lw t4, 12(sp) lw t3, 16(sp) lw t2, 20(sp) lw t1, 24(sp) lw t0, 28(sp) lw ra, 32(sp) addi sp, sp, 36 ret print_int_negative: li t3, 1 # Set sign flag neg t0, t0 # t0 = -t0 jal zero, print_int_loop print_int_zero: # Handle zero li t6, 48 # '0' sb t6, 0(t1) addi t1, t1, -1 jal zero, print_int_print ``` :::danger Use fewer instructions. ::: #### Third Version (Improved from the second version) ```assemby= .data # Test Case 1: nums = [2, 2, 3, 2], expected result: 3 nums1: .word 2, 2, 3, 2 nums1_size: .word 4 nums1_expected: .word 3 # Test Case 2: nums = [0, 1, 0, 1, 0, 1, 99], expected result: 99 nums2: .word 0, 1, 0, 1, 0, 1, 99 nums2_size: .word 7 nums2_expected: .word 99 # Test Case 3: nums = [-2, -2, -2, -5], expected result: -5 nums3: .word -2, -2, -2, -5 nums3_size: .word 4 nums3_expected: .word -5 # Arrays of Test Cases test_cases: .word nums1 .word nums2 .word nums3 test_cases_size: .word 4 .word 7 .word 4 expected_results: .word 3 .word 99 .word -5 num_test_cases: .word 3 correct_msg: .asciz "correct\n" wrong_msg_prefix: .asciz "wrong at test case " newline: .asciz "\n" int_buffer: .zero 12 # Allocate and zero-initialize 12 bytes .text .globl main main: # Initialize registers la s0, test_cases # s0 = base address of test_cases array la s1, test_cases_size # s1 = base address of test_cases_size array la s2, expected_results # s2 = base address of expected_results array lw s3, num_test_cases # s3 = number of test cases li s4, 0 # s4 = test index loop_test_cases: beq s4, s3, end_program # If test index equals number of test cases, end program slli t0, s4, 2 # t0 = s4(index) * 4 (word offset) add t1, s0, t0 # t1 = &test_cases[s4(index)] lw a0, 0(t1) # a0 = test_cases[s4] add t2, s1, t0 # t6 = &test_cases_size[s4] lw a1, 0(t2) # a1 = test_cases_size[s4] addi sp, sp , -4 sw t0, 0(sp) # Save t0 (word offset) jal ra, singleNumber # singleNumber(nums, size) mv t1, a0 # t1 = result lw t0, 0(sp) # Restore t0 addi sp, sp , 4 add t2, s2, t0 # t2 = &expected_results[s0] lw t2, 0(t2) # t2 = expected_results[s0] bne t1, t2, report_error # If result != expected result, report_error # result == expected result, print "correct" la a0, correct_msg li a7, 4 # ECALL 4: print string ecall j next_test_case report_error: # Print "wrong at test case " la a0, wrong_msg_prefix li a7, 4 # ECALL 4: print string ecall #(0-based) mv a0, s4 # a0 = s4 li a7, 1 # ECALL 1: print integer ecall # Print newline la a0, newline li a7, 4 # ECALL 4: newline ecall next_test_case: addi s4, s4, 1 # s4++ j loop_test_cases # Loop to next test case end_program: li a7, 10 # ECALL 10: exit program ecall # int singleNumber(int* nums, int numsSize) singleNumber: addi sp, sp, -40 # Allocate stack space sw ra, 36(sp) # Save return address sw s0, 32(sp) # Save s0 sw s1, 28(sp) # Save s1 sw s2, 24(sp) # Save s2 sw s3, 20(sp) # Save s3 sw s4, 16(sp) # Save s4 sw s5, 12(sp) sw s6, 8(sp) sw s7, 4(sp) sw s8, 0(sp) mv s0, a0 # s0 = nums (int* nums) mv s1, a1 # s1 = numsSize (int numsSize) li s8, 0 # s8 = result = 0 # Initialize min_leading_zeros = 32 li s3, 32 # s3 = min_leading_zeros = 32 li s4, 0 # s4 = i = 0 find_min_leading_zeros_loop: bge s4, s1, end_min_leading_zeros_loop # Load nums[i] slli t0, s4, 2 # t0 = i * 4 add t0, s0, t0 # t0 = &nums[i] lw t1, 0(t0) # t1 = nums[i] bnez t1, process_non_zero addi s4, s4, 1 # i++ j find_min_leading_zeros_loop process_non_zero: # Call my_clz((uint32_t)nums[i]) mv a0, t1 # a0 = nums[i] jal ra, my_clz # Call my_clz mv t2, a0 # t2 = leading_zeros blt t2, s3, update_min_leading_zeros j skip_update_min_leading_zeros update_min_leading_zeros: mv s3, t2 # min_leading_zeros = leading_zeros skip_update_min_leading_zeros: addi s4, s4, 1 # i++ j find_min_leading_zeros_loop end_min_leading_zeros_loop: # Compute max_bit_index = 32 - min_leading_zeros li t0, 32 sub s5, t0, s3 # s5 = 32 - min_leading_zeros # Initialize s4 = 0 (i) li s4, 0 # s4 = i = 0 singleNumber_outer_loop: bge s4, s5, singleNumber_end_outer_loop # If i >= (32 - min_leading_zeros), exit loop # Initialize count = 0 and j = 0 li s6, 0 # s6 = count = 0 li s7, 0 # s7 = j = 0 singleNumber_inner_loop: bge s7, s1, singleNumber_end_inner_loop # If j >= numsSize, exit inner loop # Load nums[j] slli t1, s7, 2 # t1 = j * 4 add t1, s0, t1 # t1 = &nums[j] lw t2, 0(t1) # t2 = nums[j] # Check nums[j] & (1U << i) li t3, 1 sll t3, t3, s4 # t3 = 1U << i and t4, t2, t3 # t4 = nums[j] & (1U << i) beqz t4, singleNumber_skip_increment # If zero, skip increment # count = (count + 1) % 3 addi s6, s6, 1 # count++ li t5, 3 blt s6, t5, singleNumber_skip_reset # If count < 3, skip reset li s6, 0 # count = 0 singleNumber_skip_reset: # Continue to next iteration singleNumber_skip_increment: addi s7, s7, 1 # j++ j singleNumber_inner_loop singleNumber_end_inner_loop: # If count != 0, set the corresponding bit beqz s6, singleNumber_skip_bit_set # If count == 0, skip setting bit # result |= (1U << i) li t6, 1 sll t6, t6, s4 # t6 = 1U << i or s8, s8, t6 # s8 |= t6 singleNumber_skip_bit_set: addi s4, s4, 1 # i++ j singleNumber_outer_loop singleNumber_end_outer_loop: mv a0, s8 # Move result to a0 # Restore registers and stack lw s8, 0(sp) lw s7, 4(sp) lw s6, 8(sp) lw s5, 12(sp) lw s4, 16(sp) lw s3, 20(sp) lw s2, 24(sp) lw s1, 28(sp) lw s0, 32(sp) lw ra, 36(sp) addi sp, sp, 40 ret # int my_clz(uint32_t x) my_clz: addi sp, sp, -8 # Allocate stack space sw ra, 4(sp) # Save return address sw s0, 0(sp) # Save s0 mv s0, a0 # s0 = x # If x == 0, return 32 beqz s0, my_clz_return_32 # If x < 0, return 0 bltz s0, my_clz_return_0 # Initialize count = 0 and i = 31 li t0, 0 # t0 = count li t1, 31 # t1 = i my_clz_loop: blt t1, zero, my_clz_end_loop # If i < 0, end loop li t2, 1 sll t2, t2, t1 # t2 = 1U << i and t3, s0, t2 # t3 = x & (1U << i) bnez t3, my_clz_end_loop addi t0, t0, 1 # count++ addi t1, t1, -1 # i-- j my_clz_loop my_clz_end_loop: mv a0, t0 # Return count j my_clz_exit my_clz_return_32: li a0, 32 j my_clz_exit my_clz_return_0: li a0, 0 my_clz_exit: # Restore registers and stack lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret ``` This version uses internal verification to replace the manual visual inspection process and reduces the code for printing the result. #### fourth version (Improved from the fourth version) ```assembly= .data # Test Case 1: nums = [2, 2, 3, 2], expected result: 3 nums1: .word 2, 2, 3, 2 nums1_size: .word 4 nums1_expected: .word 3 # Test Case 2: nums = [0, 1, 0, 1, 0, 1, 99], expected result: 99 nums2: .word 0, 1, 0, 1, 0, 1, 99 nums2_size: .word 7 nums2_expected: .word 99 # Test Case 3: nums = [-2, -2, -2, -5], expected result: -5 nums3: .word -2, -2, -2, -5 nums3_size: .word 4 nums3_expected: .word -5 # Arrays of Test Cases test_cases: .word nums1 .word nums2 .word nums3 test_cases_size: .word 4 .word 7 .word 4 expected_results: .word 3 .word 99 .word -5 num_test_cases: .word 3 correct_msg: .asciz "correct\n" wrong_msg_prefix: .asciz "wrong at test case " newline: .asciz "\n" int_buffer: .zero 12 # Allocate and zero-initialize 12 bytes .text .globl main main: # Initialize registers la s0, test_cases # s0 = base address of test_cases array la s1, test_cases_size # s1 = base address of test_cases_size array la s2, expected_results # s2 = base address of expected_results array lw s3, num_test_cases # s3 = number of test cases li s4, 0 # s4 = test index loop_test_cases: beq s4, s3, end_program # If test index equals number of test cases, end program slli t0, s4, 2 # t0 = s4(index) * 4 (word offset) add t1, s0, t0 # t1 = &test_cases[s4(index)] lw a0, 0(t1) # a0 = test_cases[s4] add t2, s1, t0 # t2 = &test_cases_size[s4] lw a1, 0(t2) # a1 = test_cases_size[s4] addi sp, sp , -4 sw t0, 0(sp) # Save t0 (word offset) jal ra, singleNumber # singleNumber(nums, size) mv t1, a0 # t1 = result lw t0, 0(sp) # Restore t0 addi sp, sp , 4 add t2, s2, t0 # t2 = &expected_results[s4] lw t2, 0(t2) # t2 = expected_results[s4] bne t1, t2, report_error # If result != expected result, report_error # result == expected result, print "correct" la a0, correct_msg li a7, 4 # ECALL 4: print string ecall j next_test_case report_error: # Print "wrong at test case " la a0, wrong_msg_prefix li a7, 4 # ECALL 4: print string ecall #(0-based) mv a0, s4 # a0 = s4 li a7, 1 # ECALL 1: print integer ecall # Print newline la a0, newline li a7, 4 # ECALL 4: newline ecall next_test_case: addi s4, s4, 1 # s4++ j loop_test_cases # Loop to next test case end_program: li a7, 10 # ECALL 10: exit program ecall # int singleNumber(int* nums, int numsSize) singleNumber: addi sp, sp, -40 # Allocate stack space sw ra, 36(sp) # Save return address sw s0, 32(sp) # Save s0 sw s1, 28(sp) # Save s1 sw s2, 24(sp) # Save s2 sw s3, 20(sp) # Save s3 sw s4, 16(sp) # Save s4 sw s5, 12(sp) # Save s5 sw s6, 8(sp) # Save s6 sw s7, 4(sp) # Save s7 sw s8, 0(sp) # Save s8 mv s0, a0 # s0 = nums (int* nums) mv s1, a1 # s1 = numsSize (int numsSize) li s8, 0 # s8 = result = 0 # Initialize min_leading_zeros = 32 li s3, 32 # s3 = min_leading_zeros = 32 li s4, 0 # s4 = i = 0 find_min_leading_zeros_loop: bge s4, s1, end_min_leading_zeros_loop # Load nums[i] slli t0, s4, 2 # t0 = i * 4 add t0, s0, t0 # t0 = &nums[i] lw t1, 0(t0) # t1 = nums[i] bnez t1, process_non_zero addi s4, s4, 1 # i++ j find_min_leading_zeros_loop process_non_zero: # Call my_clz((uint32_t)nums[i]) mv a0, t1 # a0 = nums[i] jal ra, my_clz # Call my_clz mv t2, a0 # t2 = leading_zeros blt t2, s3, update_min_leading_zeros j skip_update_min_leading_zeros update_min_leading_zeros: mv s3, t2 # min_leading_zeros = leading_zeros skip_update_min_leading_zeros: addi s4, s4, 1 # i++ j find_min_leading_zeros_loop end_min_leading_zeros_loop: # Compute max_bit_index = 32 - min_leading_zeros li t0, 32 sub s5, t0, s3 # s5 = 32 - min_leading_zeros # Initialize s4 = 0 (i) li s4, 0 # s4 = i = 0 singleNumber_outer_loop: bge s4, s5, singleNumber_end_outer_loop # If i >= (32 - min_leading_zeros), exit loop # Initialize count = 0 and j = 0 li s6, 0 # s6 = count = 0 li s7, 0 # s7 = j = 0 singleNumber_inner_loop: bge s7, s1, singleNumber_end_inner_loop # If j >= numsSize, exit inner loop # Load nums[j] slli t1, s7, 2 # t1 = j * 4 add t1, s0, t1 # t1 = &nums[j] lw t2, 0(t1) # t2 = nums[j] # Check nums[j] & (1U << i) li t3, 1 sll t3, t3, s4 # t3 = 1U << i and t4, t2, t3 # t4 = nums[j] & (1U << i) beqz t4, singleNumber_skip_increment # If zero, skip increment # count = (count + 1) % 3 addi s6, s6, 1 # count++ li t5, 3 blt s6, t5, singleNumber_skip_reset # If count < 3, skip reset li s6, 0 # count = 0 singleNumber_skip_reset: # Continue to next iteration singleNumber_skip_increment: addi s7, s7, 1 # j++ j singleNumber_inner_loop singleNumber_end_inner_loop: # If count != 0, set the corresponding bit beqz s6, singleNumber_skip_bit_set # If count == 0, skip setting bit # result |= (1U << i) li t6, 1 sll t6, t6, s4 # t6 = 1U << i or s8, s8, t6 # s8 |= t6 singleNumber_skip_bit_set: addi s4, s4, 1 # i++ j singleNumber_outer_loop singleNumber_end_outer_loop: mv a0, s8 # Move result to a0 # Restore registers and stack lw s8, 0(sp) lw s7, 4(sp) lw s6, 8(sp) lw s5, 12(sp) lw s4, 16(sp) lw s3, 20(sp) lw s2, 24(sp) lw s1, 28(sp) lw s0, 32(sp) lw ra, 36(sp) addi sp, sp, 40 ret # int my_clz(uint32_t x) my_clz: addi sp, sp, -8 # Allocate stack space sw ra, 4(sp) # Save return address sw s0, 0(sp) # Save s0 mv s0, a0 # s0 = x # If x == 0, return 32 beqz s0, return_32 # If x < 0, return 0 bltz s0, return_0 # Begin computation as per the C code # x |= (x >> 1) srli t0, s0, 1 # t0 = x >> 1 or s0, s0, t0 # x |= t0 # x |= (x >> 2) srli t0, s0, 2 # t0 = x >> 2 or s0, s0, t0 # x |= t0 # x |= (x >> 4) srli t0, s0, 4 # t0 = x >> 4 or s0, s0, t0 # x |= t0 # x |= (x >> 8) srli t0, s0, 8 # t0 = x >> 8 or s0, s0, t0 # x |= t0 # x |= (x >> 16) srli t0, s0, 16 # t0 = x >> 16 or s0, s0, t0 # x |= t0 # x -= ((x >> 1) & 0x55555555) srli t0, s0, 1 # t0 = x >> 1 li t1, 0x55555555 # Load mask 0x55555555 and t0, t0, t1 # t0 = (x >> 1) & 0x55555555 sub s0, s0, t0 # x -= t0 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) srli t0, s0, 2 # t0 = x >> 2 li t1, 0x33333333 # Load mask 0x33333333 and t0, t0, t1 # t0 = (x >> 2) & 0x33333333 and t2, s0, t1 # t2 = x & 0x33333333 add s0, t0, t2 # x = t0 + t2 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t0, s0, 4 # t0 = x >> 4 add s0, s0, t0 # x += t0 li t1, 0x0f0f0f0f # Load mask 0x0f0f0f0f and s0, s0, t1 # x &= 0x0f0f0f0f # x += (x >> 8) srli t0, s0, 8 # t0 = x >> 8 add s0, s0, t0 # x += t0 # x += (x >> 16) srli t0, s0, 16 # t0 = x >> 16 add s0, s0, t0 # x += t0 # Return (32 - (x & 0x3f)) andi t0, s0, 0x3f # t0 = x & 0x3f li t1, 32 # t1 = 32 sub a0, t1, t0 # a0 = 32 - t0 # Restore and return lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret return_32: li a0, 32 # Return 32 lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret return_0: li a0, 0 # Return 0 lw s0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 ret ``` This version uses population count to calculate leading zeros, reducing the loop cycles of the previous my_clz version. ## Analyze ### In five stage processor - | Discription | Cycles | | -------- | -------- | | baseline | 9174 | | with `count leading zero` | 7194 | | with `count leading zero and deal with special case` | 6693 | | with `population count` | 4473 | In the baseline, no specific method was used to determine the single number, meaning every bit (in a 32-bit integer) of each number had to be checked once, which resulted in a significant waste of cycles. After using count leading zero, the minimum leading zero value for each number is calculated first. Once this value is obtained, we can determine the position of the highest set bit across all numbers, reducing the number of bit comparisons required and significantly lowering the cycle count. A special case arises when handling negative numbers and zero. For negative numbers, the highest bit is always 1, and for zero, it means all bits are zero, so the leading zero count can directly be returned as 0 or 32. Finally, implementing leading zero counting with a loop is too cycle-expensive, so population count is used instead. This way, calculating the leading zeros for a number only requires. ### Program Functionality Overview My assembly program is designed to test the singleNumber function, which identifies the unique number in an array where every other number appears three times. The program sets up multiple test cases, executes the singleNumber function for each case, and verifies the result against expected outcomes. If the result matches, it prints "correct"; otherwise, it reports an error indicating which test case failed. ### Pipeline Stages and Instruction Operation #### Example 1 (Initialization in main) ```assembly= main: la s0, test_cases # Load address of test_cases array into s0 la s1, test_cases_size # Load address of test_cases_size array into s1 la s2, expected_results # Load address of expected_results array into s2 lw s3, num_test_cases # Load number of test cases into s3 li s4, 0 # Initialize test index s4 to 0 ``` Example instruction: la s0, test_cases(auipc x8, 0x10000; addi x8, x8, 84) * The instruction la s0, test_cases in Ripes will be split into an auipc to load the upper 20 bits and an addi to load the lower 12 bits. Therefore, we will focus only on the execution of the auipc instruction within the five-stage pipeline. - | Stage | Description | | -------- | -------- | | IF | The instruction is fetched from memory. | | ID | Decoded to understand it's a auipc instruction targeting register s0(x8). | | EX | The address of test_cases is calculated. | | MEM | Not applicable for auipc as it doesn't access memory. | | WB | The computed address is written to register s0. | ![image](https://hackmd.io/_uploads/HyoKkyS1Jg.png) ![image](https://hackmd.io/_uploads/Hy0j1ySk1e.png) ![image](https://hackmd.io/_uploads/rJitg1B1Je.png) ![image](https://hackmd.io/_uploads/Sy0oxJSyke.png) ![image](https://hackmd.io/_uploads/SJwTxyr1kg.png) #### Example 2 (Looping through Test Cases) ```assembly= loop_test_cases: beq s4, s3, end_program # If test index equals number of test cases, end program slli t0, s4, 2 # t0 = s4(index) * 4 (word offset) add t1, s0, t0 # t1 = &test_cases[s4(index)] lw a0, 0(t1) # a0 = test_cases[s4] ... ``` Example Instruction: slli t0, s4, 2 - | Stage | Description | | -------- | -------- | | IF | Fetches the slli (shift left logical immediate) instruction| | ID | Decodes the operation to shift s4 left by 2 bits | | EX | Performs the shift operation (s4 << 2) to calculate the word offset | | MEM | Not applicable as no memory access is needed | | WB | Writes the shifted value into register t0 | ![image](https://hackmd.io/_uploads/Hkaqgerykl.png) ![image](https://hackmd.io/_uploads/Byqoxxrkke.png) ![image](https://hackmd.io/_uploads/Sy52glrkye.png) ![image](https://hackmd.io/_uploads/H13yWeHyJg.png) ![image](https://hackmd.io/_uploads/H1CW-xBkJx.png) #### Example 3 (Calling the singleNumber Function) ```assembly= addi sp, sp , -4 sw t0, 0(sp) # Save t0 (word offset) jal ra, singleNumber # singleNumber(nums, size) mv t1, a0 # t1 = result lw t0, 0(sp) # Restore t0 addi sp, sp , 4 ``` Example Instruction: jal ra, singleNumber - | Stage | Description | | -------- | -------- | | IF | Fetches the jal (jump and link) instruction| | ID | Decodes the jump to the singleNumber label and saves the return address in ra | | EX | Calculates the jump target address | | MEM | Not applicable for jal | | WB | Writes the return address to ra | ![image](https://hackmd.io/_uploads/BkvQMeHk1x.png) ![image](https://hackmd.io/_uploads/SyNVGxB1Jx.png) ![image](https://hackmd.io/_uploads/ByJBGlHkyg.png) ![image](https://hackmd.io/_uploads/BkJUGerk1g.png) ![image](https://hackmd.io/_uploads/ByrvflrJkx.png) #### Example 4 (Memory Access) ```assembly= lw t2, 0(t2) # t2 = expected_results[s4] ``` Example Instruction: lw t2, 0(t2) - | Stage | Description | | -------- | -------- | | IF | Fetches the lw (load word) instruction| | ID | Decodes the load operation to retrieve a word from memory address in t2 | | EX | Calculates the effective memory address by adding the offset | | MEM |Accesses memory to load the word from the calculated address | | WB | Writes the loaded value into register t2 | ![image](https://hackmd.io/_uploads/B1_j4eSykl.png) ![image](https://hackmd.io/_uploads/BkHh4eH1ye.png) ![image](https://hackmd.io/_uploads/HJfTVerkJl.png) ![image](https://hackmd.io/_uploads/SyCpNxHJke.png) ![image](https://hackmd.io/_uploads/SkQJrgrkkx.png) ## Reference - [Quiz1 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz1-sol) - [Quiz1 of Computer Architecture (2023 Fall)](https://hackmd.io/@sysprog/arch2023-quiz1-sol) - [LeetCode 137. Single Number II](https://leetcode.com/problems/single-number-ii/description/)