# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`Kurumi2650`](https://github.com/Kurumi2650) >
:::danger
Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data.
:::
## Problem C
* The assembly code in this section includes a self-verification feature. After performing the population count or other computations, the assembly code compares the result against a predefined correct answer. This comparison happens at the end of the operation. If the computed result does not match the expected value, the assembly code reports the error. This built-in self-check ensures that the calculations are accurate and allows for immediate detection of any issues during execution.
### C code of fabsf
```c
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
```
The main purpose of this code is to first use a uint32_t to store the bits of a float. The reason for doing this is to directly apply bitwise operations to manipulate the floating-point number. In the IEEE single precision format, the most significant bit is the sign bit, where 0 represents a positive number and 1 represents a negative number. Therefore, to obtain the absolute value, we simply need to perform an AND operation with 0x7FFFFFFF (0b01111111111111111111111111111111) to get the absolute value of the number. Finally, the result is returned as a float.
### Assembly code of fabsf
```c
.data
floats:
.word 0xBF800000 # -1.0
.word 0x40000000 # 2.0
.word 0xC2480000 # -50.0
expected_values:
.word 0x3F800000 # 1.0
.word 0x40000000 # 2.0
.word 0x42480000 # 50.0
correct_msg:
.asciz "correct\n"
wrong_msg_prefix:
.asciz "wrong at index "
newline:
.asciz "\n"
.text
.globl main
main:
# Initialize registers
la t0, floats # t0 points to the starting address of the floats array
la t1, expected_values # t1 points to the starting address of the expected_values array
li t2, 3 # t2 is the loop counter (number of floats)
li t3, 0x7FFFFFFF # t3 is the mask to clear the sign bit
li t4, 0 # t4 is the error flag, initially set to 0 (no error)
li s0, 0 # s0 is the index counter, initially set to 0
loop:
beqz t2, check_flag # If t2 is 0, all floats processed, jump to check_flag
lw t5, 0(t0) # Load the current float from the floats array into t5
and t5, t5, t3 # Clear the sign bit to calculate the absolute value (fabsf)
sw t5, 0(t0) # Store the result back to the floats array
lw t6, 0(t1) # Load the expected value into t6 from the expected_values array
bne t5, t6, report_error # If the result does not match the expected value, jump to report_error
# Update pointers and counters
addi t0, t0, 4 # Move to the next element in the floats array
addi t1, t1, 4 # Move to the next element in the expected_values array
addi t2, t2, -1 # Decrement the counter
addi s0, s0, 1 # Increment the index counter
j loop # Jump back to the start of the loop
report_error:
# Set the error flag
li t4, 1 # Set t4 to 1, indicating an error
# Print "wrong at index "
la a0, wrong_msg_prefix
li a7, 4 # Syscall number 4: print string
ecall
# Print the index (s0)
mv a0, s0 # Move the current index to a0
li a7, 1 # Syscall number 1: print integer
ecall
# Print newline
la a0, newline
li a7, 4 # Syscall number 4: print string
ecall
# Update pointers and counters
addi t0, t0, 4 # Move to the next element in the floats array
addi t1, t1, 4 # Move to the next element in the expected_values array
addi t2, t2, -1 # Decrement the counter
addi s0, s0, 1 # Increment the index counter
j loop # Jump back to the start of the loop
check_flag:
beqz t4, print_correct # If t4 is 0, indicating no error, print "correct"
# If there is an error, exit the program
j exit_program
print_correct:
la a0, correct_msg # Load the "correct" message
li a7, 4 # Syscall number 4: print string
ecall
exit_program:
li a7, 10 # Syscall number 10: exit program
ecall
```
### C code of my_clz
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
This code defines a function my_clz that counts the number of leading zeros in a 32-bit unsigned integer (uint32_t x). The function works by iterating through each bit of x starting from the most significant bit (bit 31) down to the least significant bit (bit 0). For each bit, it checks if the bit is set (i.e., if it equals 1). If it finds a set bit, the loop breaks. Otherwise, it increments a counter (count) to track how many leading zeros have been encountered.
### Assembly code of my_clz
```assembly=
.data
test_data:
.word 0x00000000 # Test data 0: x = 0
.word 0x80000000 # Test data 1: x = -2147483648 (most significant bit is 1, indicating a negative number)
.word 0x00F00000 # Test data 2: x = 15728640
expected_results:
.word 32 # Expected result 0: 32 (because x == 0)
.word 0 # Expected result 1: 0 (because x < 0)
.word 8 # Expected result 2: 8 (15728640 has 8 leading zeros in binary)
correct_msg:
.asciz "correct\n"
wrong_msg_prefix:
.asciz "wrong at index "
newline:
.asciz "\n"
.text
.globl main
main:
# Initialize pointers and flags
la s2, test_data # s2 points to the start of the test_data array
la s3, expected_results # s3 points to the start of the expected_results array
li s1, 3 # s1 is the loop counter (number of test data items)
li t4, 0 # t4 is the error flag, initially 0 (no error)
li s0, 0 # s0 is the index counter, initially 0
loop:
beqz s1, check_flag # If s1 is 0, all test data is processed, jump to the check phase
lw a0, 0(s2) # Load current test data into a0
jal ra, my_clz # Call my_clz function
lw t5, 0(s3) # Load expected result into t5
bne a0, t5, report_error # If the result doesn't match the expected value, jump to report_error
# Update pointers and counters
addi s2, s2, 4 # Move to the next test data
addi s3, s3, 4 # Move to the next expected result
addi s1, s1, -1 # Decrease the loop counter
addi s0, s0, 1 # Increase the index counter
j loop # Return to the beginning of the loop
report_error:
# Set the error flag
li t4, 1 # t4 = 1, indicating an error
# Print "wrong at index "
la a0, wrong_msg_prefix
li a7, 4 # System call number 4: print string
ecall
# Print the index (s0)
mv a0, s0 # Move the current index to a0
li a7, 1 # System call number 1: print integer
ecall
# Print newline
la a0, newline
li a7, 4 # System call number 4: print string
ecall
# Update pointers and counters
addi s2, s2, 4 # Move to the next test data
addi s3, s3, 4 # Move to the next expected result
addi s1, s1, -1 # Decrease the loop counter
addi s0, s0, 1 # Increase the index counter
j loop # Return to the beginning of the loop
check_flag:
beqz t4, print_correct # If t4 is 0, indicating no errors, jump to print "correct"
# If there is an error, exit the program
j exit_program
print_correct:
la a0, correct_msg # Load "correct" message
jal ra, print_string # Call print_string
exit_program:
li a7, 10 # System call number 10: exit program
ecall
# Subroutine to print string
print_string:
# a0 contains the address of the string
li a7, 4 # System call number 4: print string
ecall
ret
# my_clz function
my_clz:
addi sp, sp, -8 # Allocate stack space
sw ra, 4(sp) # Save return address
sw s0, 0(sp) # Save s0 (used within the function)
mv s0, a0 # s0 = x
# If x == 0, return 32
beqz s0, my_clz_return_32
# If x < 0 (i.e., the most significant bit of x is 1), return 0
bltz s0, my_clz_return_0
# count = 0
li t0, 0 # t0 = count
li t1, 31 # t1 = i
my_clz_loop:
blt t1, zero, my_clz_end_loop # If i < 0, end loop
li t2, 1
sll t2, t2, t1 # t2 = 1U << i
and t3, s0, t2 # t3 = x & (1U << i)
bnez t3, my_clz_end_loop
addi t0, t0, 1 # count++
addi t1, t1, -1 # i--
j my_clz_loop
my_clz_end_loop:
mv a0, t0 # Return count
j my_clz_exit
my_clz_return_32:
li a0, 32
j my_clz_exit
my_clz_return_0:
li a0, 0
my_clz_exit:
# Restore registers and return
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
```
:::danger
Use fewer instructions.
:::
### C code of clz (population count)
```c
#include <stdint.h>
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555); // A01 for 32-bit: alternating bits mask (binary: 010101...)
x = ((x >> 2) & 0x33333333) + (x & 0x33333333); // A02 for 32-bit: mask for 2-bit groups (binary: 00110011...)
x = ((x >> 4) + x) & 0x0f0f0f0f; // mask for 4-bit groups
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x3f)); // Adjust for 32 bits (0x3f = 6 bits mask)
}
```
This 32-bit population count program works by efficiently counting the number of 1 bits in a 32-bit integer using a series of bitwise operations. It begins with a process called bit expansion, where the number is shifted and OR-ed with itself. This ensures that all bits below the highest set 1-bit are also set to 1, effectively preparing the number for counting.
Next, the program performs the population count by grouping bits together. It first pairs adjacent bits and uses a subtraction operation to count how many 1s exist in each 2-bit group. Then, it aggregates these results by progressively summing larger groups of bits, such as 4-bit and 8-bit blocks. The program ultimately produces the total count of set 1 bits in the integer.
### Assembly code of clz (population count)
```assembly=
.data
test_data:
.word 0x00000000 # Test data 0: x = 0
.word 0x80000000 # Test data 1: x = -2147483648 (MSB is 1)
.word 0x00F00000 # Test data 2: x = 15728640
expected_results:
.word 32 # Expected result 0: 32 (x == 0)
.word 0 # Expected result 1: 0 (x < 0)
.word 8 # Expected result 2: 8 (15728640 has 8 leading zeros)
correct_msg:
.asciz "correct\n"
wrong_msg_prefix:
.asciz "wrong at index "
newline:
.asciz "\n"
# Define masks in the data section
mask_55555555:
.word 0x55555555
mask_33333333:
.word 0x33333333
mask_0f0f0f0f:
.word 0x0f0f0f0f
.text
.globl main
main:
# Initialize pointers and flags
la s2, test_data # s2 points to test_data array
la s3, expected_results # s3 points to expected_results array
li s1, 3 # s1 is the loop counter (number of test cases)
li t4, 0 # t4 is the error flag, initially 0 (no error)
li s0, 0 # s0 is the index counter, initially 0
loop:
beqz s1, check_flag # If s1 is 0, all tests are done
lw a0, 0(s2) # Load current test data into a0
jal ra, my_clz # Call my_clz function
lw t5, 0(s3) # Load expected result into t5
bne a0, t5, report_error # If result != expected, report error
# Update pointers and counters
addi s2, s2, 4 # Move to next test data
addi s3, s3, 4 # Move to next expected result
addi s1, s1, -1 # Decrease loop counter
addi s0, s0, 1 # Increase index counter
j loop # Repeat loop
report_error:
# Set error flag
li t4, 1 # t4 = 1, indicating an error
# Print "wrong at index "
la a0, wrong_msg_prefix
li a7, 4 # syscall number for print string
ecall
# Print the index (s0)
mv a0, s0 # Move index to a0
li a7, 1 # syscall number for print integer
ecall
# Print newline
la a0, newline
li a7, 4 # syscall number for print string
ecall
# Update pointers and counters
addi s2, s2, 4 # Move to next test data
addi s3, s3, 4 # Move to next expected result
addi s1, s1, -1 # Decrease loop counter
addi s0, s0, 1 # Increase index counter
j loop # Repeat loop
check_flag:
beqz t4, print_correct # If no errors, print "correct"
j exit_program # If errors, exit
print_correct:
la a0, correct_msg # Load "correct" message
jal ra, print_string # Call print_string
j exit_program
exit_program:
li a7, 10 # syscall number for exit
ecall
# Subroutine to print a string
print_string:
li a7, 4 # syscall number for print string
ecall
ret
# Function: my_clz
my_clz:
addi sp, sp, -8 # Allocate stack space
sw ra, 4(sp) # Save return address
sw s0, 0(sp) # Save s0 (used in function)
mv s0, a0 # s0 = x
# If x == 0, return 32
beqz s0, return_32
# If x < 0 (MSB is 1), return 0
bltz s0, return_0
# Begin computation as per the C code
# x |= (x >> 1)
srli t0, s0, 1 # t0 = x >> 1
or s0, s0, t0 # x |= t0
# x |= (x >> 2)
srli t0, s0, 2 # t0 = x >> 2
or s0, s0, t0 # x |= t0
# x |= (x >> 4)
srli t0, s0, 4 # t0 = x >> 4
or s0, s0, t0 # x |= t0
# x |= (x >> 8)
srli t0, s0, 8 # t0 = x >> 8
or s0, s0, t0 # x |= t0
# x |= (x >> 16)
srli t0, s0, 16 # t0 = x >> 16
or s0, s0, t0 # x |= t0
# x -= ((x >> 1) & 0x55555555)
srli t0, s0, 1 # t0 = x >> 1
la t1, mask_55555555 # Load address of mask_55555555
lw t1, 0(t1) # t1 = 0x55555555
and t0, t0, t1 # t0 = (x >> 1) & 0x55555555
sub s0, s0, t0 # x -= t0
# x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
srli t0, s0, 2 # t0 = x >> 2
la t1, mask_33333333 # Load address of mask_33333333
lw t1, 0(t1) # t1 = 0x33333333
and t0, t0, t1 # t0 = (x >> 2) & 0x33333333
and t2, s0, t1 # t2 = x & 0x33333333
add s0, t0, t2 # x = t0 + t2
# x = ((x >> 4) + x) & 0x0f0f0f0f
srli t0, s0, 4 # t0 = x >> 4
add s0, s0, t0 # x += t0
la t1, mask_0f0f0f0f # Load address of mask_0f0f0f0f
lw t1, 0(t1) # t1 = 0x0f0f0f0f
and s0, s0, t1 # x &= 0x0f0f0f0f
# x += (x >> 8)
srli t0, s0, 8 # t0 = x >> 8
add s0, s0, t0 # x += t0
# x += (x >> 16)
srli t0, s0, 16 # t0 = x >> 16
add s0, s0, t0 # x += t0
# Return (32 - (x & 0x3f))
andi t0, s0, 0x3f # t0 = x & 0x3f
li t1, 32 # t1 = 32
sub a0, t1, t0 # a0 = 32 - t0
# Restore and return
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
return_32:
li a0, 32 # Return 32
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
return_0:
li a0, 0 # Return 0
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
```
### C code of fp16_to_fp32
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
/*
* Extends the 16-bit half-precision floating-point number to 32 bits
* by shifting it to the upper half of a 32-bit word:
* +---+-----+------------+-------------------+
* | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 31 26-30 16-25 0-15
*
* S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits.
*/
const uint32_t w = (uint32_t) h << 16;
/*
* Isolates the sign bit from the input number, placing it in the most
* significant bit of a 32-bit word:
*
* +---+----------------------------------+
* | S |0000000 00000000 00000000 00000000|
* +---+----------------------------------+
* Bits 31 0-31
*/
const uint32_t sign = w & UINT32_C(0x80000000);
/*
* Extracts the mantissa and exponent from the input number, placing
* them in bits 0-30 of the 32-bit word:
*
* +---+-----+------------+-------------------+
* | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 30 27-31 17-26 0-16
*/
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
/*
* The renorm_shift variable indicates how many bits the mantissa
* needs to be shifted to normalize the half-precision number.
* For normalized numbers, renorm_shift will be 0. For denormalized
* numbers, renorm_shift will be greater than 0. Shifting a
* denormalized number will move the mantissa into the exponent,
* normalizing it.
*/
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
/*
* If the half-precision number has an exponent of 15, adding a
* specific value will cause overflow into bit 31, which converts
* the upper 9 bits into ones. Thus:
* inf_nan_mask ==
* 0x7F800000 if the half-precision number is
* NaN or infinity (exponent of 15)
* 0x00000000 otherwise
*/
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
/*
* If nonsign equals 0, subtracting 1 will cause overflow, setting
* bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result
* propagates bit 31 across all bits in zero_mask. Thus:
* zero_mask ==
* 0xFFFFFFFF if the half-precision number is
* zero (+0.0h or -0.0h)
* 0x00000000 otherwise
*/
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
/*
* 1. Shifts nonsign left by renorm_shift to normalize it (for denormal
* inputs).
* 2. Shifts nonsign right by 3, adjusting the exponent to fit in the
* 8-bit exponent field and moving the mantissa into the correct
* position within the 23-bit mantissa field of the single-precision
* format.
* 3. Adds 0x70 to the exponent to account for the difference in bias
* between half-precision and single-precision.
* 4. Subtracts renorm_shift from the exponent to account for any
* renormalization that occurred.
* 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input
* was NaN or infinity.
* 6. ANDs with the inverted zero_mask to set the mantissa and exponent
* to zero if the input was zero.
* 7. Combines everything with the sign bit of the input number.
*/
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
This function, fp16_to_fp32, converts a 16-bit half-precision floating-point number into a 32-bit single-precision floating-point number.
### Assembly code of fp16_to_fp32
```assembly=
.data
# Test cases: half-precision floats (16-bit values)
floats:
.word 0x00003C00 # +1.0 in half-precision (0x3C00)
.word 0x0000C000 # -2.0 in half-precision (0xC000)
.word 0x00007BFF # Largest positive normal number in half-precision
expected_values:
.word 0x3F800000 # +1.0 in single-precision (0x3F800000)
.word 0xC0000000 # -2.0 in single-precision (0xC0000000)
.word 0x477FE000 # Equivalent single-precision value
correct_msg:
.asciz "correct\n"
wrong_msg_prefix:
.asciz "wrong at index "
newline:
.asciz "\n"
.text
.globl main
main:
# Initialize registers
la s0, floats # s0 points to the starting address of the floats array
la s1, expected_values # s1 points to the starting address of the expected_values array
li s2, 3 # s2 is the loop counter (number of floats)
li s3, 0 # s3 is the error flag, initially set to 0 (no error)
li s4, 0 # s4 is the index counter, initially set to 0
loop:
beqz s2, check_flag # If s3 is 0, all floats processed, jump to check_flag
lw t4, 0(s0) # Load the current word from the floats array into t4
# Call fp16_to_fp32(t5)
mv a0, t4 # Move h into a0
addi sp, sp ,-4
sw t4, 0(sp) # save t4
jal fp16_to_fp32 # Call fp16_to_fp32
mv t6, a0 # t6 = result
lw t4, 0(sp)
addi sp, sp ,4 # restore t4
lw t5, 0(s1) # Load the expected value into t5 from the expected_values array
bne t6, t5, report_error # If the result does not match the expected value, jump to report_error
# Update pointers and counters
addi s0, s0, 4 # Move to the next element in the floats array
addi s1, s1, 4 # Move to the next element in the expected_values array
addi s2, s2, -1 # Decrement the counter
addi s4, s4, 1 # Increment the index counter
j loop # Jump back to the start of the loop
report_error:
# Set the error flag
li s3, 1 # Set t3 to 1, indicating an error
# Print "wrong at index "
la a0, wrong_msg_prefix
li a7, 4 # Syscall number 4: print string
ecall
# Print the index (s0)
mv a0, s4 # Move the current index to a0
li a7, 1 # Syscall number 1: print integer
ecall
# Print newline
la a0, newline
li a7, 4 # Syscall number 4: print string
ecall
# Update pointers and counters
addi s0, s0, 4 # Move to the next element in the floats array
addi s1, s1, 4 # Move to the next element in the expected_values array
addi s2, s2, -1 # Decrement the counter
addi s4, s4, 1 # Increment the index counter
j loop # Jump back to the start of the loop
check_flag:
bne s3, zero, exit_program # If s3 is not zero (error found), exit
# If t3 is zero, indicating no error, print "correct"
la a0, correct_msg # Load the "correct" message
li a7, 4 # Syscall number 4: print string
ecall
j exit_program # Ensure the program exits after printing "correct"
exit_program:
li a7, 10 # Syscall number 10: exit program
ecall
j exit_program # Infinite loop to prevent execution beyond this point
# fp16_to_fp32 function
fp16_to_fp32:
# Input: a0 = h (uint16_t)
# Output: a0 = fp32 value
addi sp, sp, -4
sw ra, 0(sp)
# Shift h left by 16 bits to get w
slli t0, a0, 16 # t0 = w = h << 16
# sign = w & 0x80000000
li t1, 0x80000000
and t1, t0, t1 # t1 = sign
# nonsign = w & 0x7FFFFFFF
li t2, 0x7FFFFFFF
and t2, t0, t2 # t2 = nonsign
addi sp, sp, -8
sw t1, 0(sp) # save t1 in stack
sw t2, 4(sp) # save t2 in stack
# renorm_shift = my_clz(nonsign)
mv a0, t2 # Prepare argument for my_clz
jal my_clz # Call my_clz(nonsign)
mv t3, a0 # t3 = renorm_shift
lw t1, 0(sp) # restore t1
lw t2, 4(sp) # restore t2
addi sp, sp, 8
# renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0
li t4, 5
blt t3, t4, renorm_shift_zero
sub t3, t3, t4 # t3 = renorm_shift - 5
j renorm_shift_continue
renorm_shift_zero:
li t3, 0
renorm_shift_continue:
# Compute inf_nan_mask = ((nonsign + 0x04000000) >> 8) & 0x7F800000
li t4, 0x04000000
add t5, t2, t4 # t5 = nonsign + 0x04000000
srai t5, t5, 8 # t5 = (nonsign + 0x04000000) >> 8
li t6, 0x7F800000
and t5, t5, t6 # t5 = inf_nan_mask
# Compute zero_mask = (int32_t)(nonsign - 1) >> 31
addi t6, t2, -1 # t6 = nonsign - 1
srai t6, t6, 31 # t6 = zero_mask
# Compute (nonsign << renorm_shift) >> 3
sll t4, t2, t3 # t4 = nonsign << renorm_shift
srai t4, t4, 3 # t4 = t4 >> 3
# Compute ((0x70 - renorm_shift) << 23)
li t2, 0x70
sub t2, t2, t3 # t2 = 0x70 - renorm_shift
slli t2, t2, 23 # t2 = (0x70 - renorm_shift) << 23
# Sum
add t4, t4, t2 # t4 = t4 + t2
# Combine with inf_nan_mask
or t4, t4, t5 # s1 = s1 | inf_nan_mask
# Apply zero_mask
not t6, t6 # t6 = ~zero_mask
and t4, t4, t6 # t4 = t4 & ~zero_mask
# Combine with sign
or a0, t1, t4 # a0 = sign | s1
lw ra, 0(sp)
addi sp, sp, 4
ret
# my_clz function
my_clz:
addi sp, sp, -8 # Allocate stack space
sw ra, 4(sp) # Save return address
sw s0, 0(sp) # Save s0 (used within the function)
mv s0, a0 # s0 = x
# If x == 0, return 32
beqz s0, my_clz_return_32
# count = 0
li t0, 0 # t0 = count
li t1, 31 # t1 = i
my_clz_loop:
blt t1, zero, my_clz_end_loop # If i < 0, end loop
li t2, 1
sll t2, t2, t1 # t2 = 1U << i
and t3, s0, t2 # t3 = x & (1U << i)
bnez t3, my_clz_end_loop
addi t0, t0, 1 # count++
addi t1, t1, -1 # i--
j my_clz_loop
my_clz_end_loop:
mv a0, t0 # Return count
j my_clz_exit
my_clz_return_32:
li a0, 32
my_clz_exit:
# Restore registers and return
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
```
This program will similarly define the test data and expected results in advance. After running the program, if any test case is incorrect, it will print out which one is wrong.
## Single number II
> [LeetCode 137](https://leetcode.com/problems/single-number-ii/description/)
Description: Given an integer array nums where every element appears three times except for one, which appears exactly once. Find the single element and return it.
You must implement a solution with a linear runtime complexity and use only constant extra space.
Constraints:
* 1 <= nums.length <= 3 * 104
* -2^31 <= nums[i] <= 2^31 - 1
* Each element in nums appears exactly three times except for one element which appears once.
## Solution
### Idea for problem solving
By counting the occurrences of 1s in each bit of every number and checking whether the number of 1s is divisible by 3, if it is not divisible by 3, it means that the bit was contributed by the number that appears only once. Therefore, this bit is stored in the result.
### C program
#### First version (baseline)
```c
int singleNumber(int* nums, int numsSize) {
int result = 0;
for (int i = 0; i < 32; ++i) {
int count = 0;
for (int j = 0; j < numsSize; ++j) {
if (nums[j] & (1U << i)) {
count++;
}
}
if (count % 3 != 0) {
result |= (1U << i);
}
}
return result;
}
```
By shifting an unsigned integer 1 and performing an AND operation with each number, if the result of the AND operation is non-zero, it means that the i-th bit of the number is set to 1. If the count of 1s in the i-th bit is not divisible by 3, it indicates that the Single Number contributed a 1 in that bit. We store this in a result variable.
#### Second version
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
int singleNumber(int* nums, int numsSize) {
int result = 0;
int min_leading_zeros = 32;
for (int i = 0; i < numsSize; i++) {
if (nums[i] != 0) {
int leading_zeros = my_clz((uint32_t)nums[i]);
if (leading_zeros < min_leading_zeros) {
min_leading_zeros = leading_zeros;
}
}
}
for (int i = 0; i < 32 - min_leading_zeros; i++) {
int count = 0;
for (int j = 0; j < numsSize; j++) {
if (nums[j] & (1U << i)) {
count++;
}
}
if (count % 3 != 0) {
result |= (1U << i);
}
}
return result;
}
```
Using the my_clz function, we can calculate the number of leading zeros for each number, record the minimum number of leading zeros, and use this value to determine the highest bit with a 1 across all numbers. This allows us to reduce the number of comparisons from the default 32 times for a 32-bit integer to fewer comparisons.
#### Third version
```c
static inline int my_clz(uint32_t x) {
if (x == 0) return 32;
int count = 0;
if ((int32_t)x < 0) return 0; // x sign bit is 1, number of leading zero is 0.
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
int singleNumber(int* nums, int numsSize) {
int result = 0;
int min_leading_zeros = 32;
for (int i = 0; i < numsSize; i++) {
if (nums[i] != 0) {
int leading_zeros = my_clz((uint32_t)nums[i]);
if (leading_zeros < min_leading_zeros) {
min_leading_zeros = leading_zeros;
}
}
}
for (int i = 0; i < 32 - min_leading_zeros; i++) {
int count = 0;
for (int j = 0; j < numsSize; j++) {
if (nums[j] & (1U << i)) {
count++;
}
}
if (count % 3 != 0) {
result |= (1U << i);
}
}
return result;
}
```
This version first checks if x is less than or equal to 0 and handles it accordingly, without entering the loop computation.
### Assembly
#### First Version (Corresponding to the first version of the C code)
```assembly=
.data
# Test Case 1: nums = [2, 2, 3, 2]
nums1:
.word 2, 2, 3, 2
nums1_size:
.word 4
# Test Case 2: nums = [0, 1, 0, 1, 0, 1, 99]
nums2:
.word 0, 1, 0, 1, 0, 1, 99
nums2_size:
.word 7
# Test Case 3: nums = [-2, -2, -2, -5]
nums3:
.word -2, -2, -2, -5
nums3_size:
.word 4
result_str:
.asciz "Result: "
newline_str:
.asciz "\n"
int_buffer:
.zero 12 # Allocate and zero-initialize 12 bytes
.text
.globl main
main:
# Test Case 1
la a0, nums1 # Load address of nums1 array
lw a1, nums1_size # Load size of nums1 array
jal ra, singleNumber # Call singleNumber(nums1, size)
mv a2, a0 # Move result to a2 for printing
la a0, result_str # Load address of result string
jal ra, print_result # Print result
# Test Case 2
la a0, nums2
lw a1, nums2_size
jal ra, singleNumber
mv a2, a0
la a0, result_str
jal ra, print_result
# Test Case 3
la a0, nums3
lw a1, nums3_size
jal ra, singleNumber
mv a2, a0
la a0, result_str
jal ra, print_result
# Exit program
li a0, 0 # Exit code 0
li a7, 93 # ECALL for exit
ecall
# int singleNumber(int* nums, int numsSize)
singleNumber:
addi sp, sp, -24 # Allocate stack space
sw ra, 20(sp) # Save return address
sw s0, 16(sp) # Save s0
sw s1, 12(sp) # Save s1
sw s2, 8(sp) # Save s2
sw s3, 4(sp) # Save s3
sw s4, 0(sp) # Save s4
mv s0, a0 # s0 = nums (int* nums)
mv s1, a1 # s1 = numsSize (int numsSize)
li a0, 0 # a0 = result = 0
li s2, 0 # s2 = i = 0 (bit index)
singleNumber_outer_loop:
li t0, 32 # t0 = 32 (number of bits)
bge s2, t0, singleNumber_end_outer_loop # if i >= 32, exit loop
# Initialize count = 0
li s3, 0 # s3 = count = 0
li s4, 0 # s4 = j = 0
singleNumber_inner_loop:
bge s4, s1, singleNumber_end_inner_loop # if j >= numsSize, exit inner loop
# Load nums[j]
slli t1, s4, 2 # t1 = j * 4 (word size)
add t1, s0, t1 # t1 = &nums[j]
lw t2, 0(t1) # t2 = nums[j]
# Check if nums[j] & (1U << i)
li t3, 1
sll t3, t3, s2 # t3 = 1U << i
and t4, t2, t3 # t4 = nums[j] & (1U << i)
beqz t4, singleNumber_skip_increment # if zero, skip increment
# count = (count + 1) % 3
addi s3, s3, 1 # count++
li t5, 3
blt s3, t5, singleNumber_skip_reset # if count < 3, skip reset
li s3, 0 # count = 0
singleNumber_skip_reset:
# Continue to next iteration
singleNumber_skip_increment:
addi s4, s4, 1 # j++
jal zero, singleNumber_inner_loop
singleNumber_end_inner_loop:
# if count != 0
beqz s3, singleNumber_skip_bit_set # if count == 0, skip setting bit
# result |= (1U << i)
li t6, 1
sll t6, t6, s2 # t6 = 1U << i
or a0, a0, t6 # result |= t6
singleNumber_skip_bit_set:
addi s2, s2, 1 # i++
jal zero, singleNumber_outer_loop
singleNumber_end_outer_loop:
# Result is in a0
# Restore registers and stack
lw s4, 0(sp)
lw s3, 4(sp)
lw s2, 8(sp)
lw s1, 12(sp)
lw s0, 16(sp)
lw ra, 20(sp)
addi sp, sp, 24
ret
# Simplified print_result function for Ripes simulator
print_result:
addi sp, sp, -8 # Allocate stack space
sw ra, 4(sp) # Save return address
sw a0, 0(sp) # Save a0 (address of result_str)
# Print "Result: "
mv a0, a0 # a0 already contains address of result_str
li a7, 4 # ECALL for print string
ecall
# Print integer in a2
mv a0, a2 # Move result to a0
jal ra, print_int # Call print_int
# Print newline
la a0, newline_str
li a7, 4
ecall
# Restore registers and stack
lw a0, 0(sp) # Restore a0
lw ra, 4(sp) # Restore ra
addi sp, sp, 8
ret
# Corrected print_int function
print_int:
# Convert integer to string and print
# Handle negative numbers
addi sp, sp, -36 # Adjust stack size for additional registers
sw ra, 32(sp)
sw t0, 28(sp)
sw t1, 24(sp)
sw t2, 20(sp)
sw t3, 16(sp)
sw t4, 12(sp)
sw t5, 8(sp)
sw a4, 4(sp) # Save a4
sw a5, 0(sp) # Save a5
mv t0, a0 # t0 = number
la t1, int_buffer # t1 = address of int_buffer
# Clear the buffer
li t2, 12 # t2 = buffer size
li t3, 0 # t3 = zero
mv a5, t1 # a5 = buffer pointer
clear_buffer_loop:
beqz t2, clear_buffer_done
sb t3, 0(a5)
addi a5, a5, 1
addi t2, t2, -1
jal zero, clear_buffer_loop
clear_buffer_done:
# Reset t1 to end of buffer
la t1, int_buffer # t1 = address of int_buffer
addi t1, t1, 11 # t1 points to int_buffer + 11 (one before end)
li t2, 48 # t2 = ASCII code for '0'
li t3, 0 # t3 = sign flag
li a4, 10 # a4 = 10
# Check if number is zero
beq t0, zero, print_int_zero
# Check if number is negative
blt t0, zero, print_int_negative
print_int_loop:
# t4 = t0 / 10 (quotient)
# t5 = t0 % 10 (remainder)
# Initialize t4 = 0 (quotient), t5 = t0 (remainder)
mv t5, t0 # t5 = t0 (remainder)
li t4, 0 # t4 = 0 (quotient)
print_int_divide_loop:
blt t5, a4, print_int_divide_end
sub t5, t5, a4 # t5 -= 10
addi t4, t4, 1 # t4 += 1
jal zero, print_int_divide_loop
print_int_divide_end:
# Now t4 is quotient, t5 is remainder
# Convert remainder to character
add t6, t5, t2 # t6 = t5 + '0'
sb t6, 0(t1) # Store character at t1
addi t1, t1, -1 # t1 -= 1
# Update t0 to quotient
mv t0, t4 # t0 = t4
# If t0 != 0, continue loop
bnez t0, print_int_loop
# After loop, check for sign
beqz t3, print_int_print # If not negative, proceed to print
# Add '-' sign
li t6, 45 # '-'
sb t6, 0(t1)
addi t1, t1, -1
print_int_print:
addi t1, t1, 1 # Adjust t1 to point to the start of string
# Print the number string
mv a0, t1 # a0 = pointer to start of string
li a7, 4
ecall
# Restore registers and stack
lw a5, 0(sp) # Restore a5
lw a4, 4(sp) # Restore a4
lw t5, 8(sp)
lw t4, 12(sp)
lw t3, 16(sp)
lw t2, 20(sp)
lw t1, 24(sp)
lw t0, 28(sp)
lw ra, 32(sp)
addi sp, sp, 36
ret
print_int_negative:
li t3, 1 # Set sign flag
neg t0, t0 # t0 = -t0
jal zero, print_int_loop
print_int_zero:
# Handle zero
li t6, 48 # '0'
sb t6, 0(t1)
addi t1, t1, -1
jal zero, print_int_print
```
#### Second Version (Corresponding to the third version of the C code)
```assembly=
.data
# Test Case 1: nums = [2, 2, 3, 2]
nums1:
.word 2, 2, 3, 2
nums1_size:
.word 4
# Test Case 2: nums = [0, 1, 0, 1, 0, 1, 99]
nums2:
.word 0, 1, 0, 1, 0, 1, 99
nums2_size:
.word 7
# Test Case 3: nums = [-2, -2, -2, -5]
nums3:
.word -2, -2, -2, -5
nums3_size:
.word 4
result_str:
.asciz "Result: "
newline_str:
.asciz "\n"
int_buffer:
.zero 12 # Allocate and zero-initialize 12 bytes
.text
.globl main
main:
# Test Case 1
la a0, nums1 # Load address of nums1 array
lw a1, nums1_size # Load size of nums1 array
jal ra, singleNumber # Call singleNumber(nums1, size)
mv a2, a0 # Move result to a2 for printing
la a0, result_str # Load address of result string
jal ra, print_result # Print result
# Test Case 2
la a0, nums2
lw a1, nums2_size
jal ra, singleNumber
mv a2, a0
la a0, result_str
jal ra, print_result
# Test Case 3
la a0, nums3
lw a1, nums3_size
jal ra, singleNumber
mv a2, a0
la a0, result_str
jal ra, print_result
# Exit program
li a0, 0 # Exit code 0
li a7, 93 # ECALL for exit
ecall
# int singleNumber(int* nums, int numsSize)
singleNumber:
addi sp, sp, -36 # Allocate stack space
sw ra, 32(sp) # Save return address
sw s0, 28(sp) # Save s0
sw s1, 24(sp) # Save s1
sw s2, 20(sp) # Save s2
sw s3, 16(sp) # Save s3
sw s4, 12(sp) # Save s4
sw s5, 8(sp) # Save s5
sw s6, 4(sp) # Save s6
sw s7, 0(sp) # Save s7
mv s0, a0 # s0 = nums (int* nums)
mv s1, a1 # s1 = numsSize (int numsSize)
li s7, 0 # s7 = result = 0
# Initialize min_leading_zeros = 32
li s5, 32 # s5 = min_leading_zeros = 32
li s2, 0 # s2 = i = 0
find_min_leading_zeros_loop:
bge s2, s1, end_min_leading_zeros_loop
# Load nums[i]
slli t0, s2, 2 # t0 = i * 4
add t0, s0, t0 # t0 = &nums[i]
lw t1, 0(t0) # t1 = nums[i]
bnez t1, process_non_zero
addi s2, s2, 1 # i++
jal zero, find_min_leading_zeros_loop
process_non_zero:
# Call my_clz((uint32_t)nums[i])
mv a0, t1 # a0 = nums[i]
jal ra, my_clz # leading_zeros in a0
mv t2, a0 # t2 = leading_zeros
blt t2, s5, update_min_leading_zeros
jal zero, skip_update_min_leading_zeros
update_min_leading_zeros:
mv s5, t2 # min_leading_zeros = leading_zeros
skip_update_min_leading_zeros:
addi s2, s2, 1 # i++
jal zero, find_min_leading_zeros_loop
end_min_leading_zeros_loop:
# Compute max_bit_index = 32 - min_leading_zeros
li t0, 32
sub s6, t0, s5 # s6 = 32 - min_leading_zeros
# Initialize s2 = 0 (i)
li s2, 0 # s2 = i = 0
singleNumber_outer_loop:
bge s2, s6, singleNumber_end_outer_loop # if i >= (32 - min_leading_zeros), exit loop
# Initialize count = 0
li s3, 0 # s3 = count = 0
li s4, 0 # s4 = j = 0
singleNumber_inner_loop:
bge s4, s1, singleNumber_end_inner_loop # if j >= numsSize, exit inner loop
# Load nums[j]
slli t1, s4, 2 # t1 = j * 4
add t1, s0, t1 # t1 = &nums[j]
lw t2, 0(t1) # t2 = nums[j]
# Check if nums[j] & (1U << i)
li t3, 1
sll t3, t3, s2 # t3 = 1U << i
and t4, t2, t3 # t4 = nums[j] & (1U << i)
beqz t4, singleNumber_skip_increment # if zero, skip increment
# count = (count + 1) % 3
addi s3, s3, 1 # count++
li t5, 3
blt s3, t5, singleNumber_skip_reset # if count < 3, skip reset
li s3, 0 # count = 0
singleNumber_skip_reset:
# Continue to next iteration
singleNumber_skip_increment:
addi s4, s4, 1 # j++
jal zero, singleNumber_inner_loop
singleNumber_end_inner_loop:
# if count != 0
beqz s3, singleNumber_skip_bit_set # if count == 0, skip setting bit
# result |= (1U << i)
li t6, 1
sll t6, t6, s2 # t6 = 1U << i
or s7, s7, t6 # s7 |= t6
singleNumber_skip_bit_set:
addi s2, s2, 1 # i++
jal zero, singleNumber_outer_loop
singleNumber_end_outer_loop:
mv a0, s7 # Move result to a0
# Restore registers and stack
lw s7, 0(sp)
lw s6, 4(sp)
lw s5, 8(sp)
lw s4, 12(sp)
lw s3, 16(sp)
lw s2, 20(sp)
lw s1, 24(sp)
lw s0, 28(sp)
lw ra, 32(sp)
addi sp, sp, 36
ret
# int my_clz(uint32_t x)
my_clz:
addi sp, sp, -8 # Allocate stack space
sw ra, 4(sp) # Save return address
sw s0, 0(sp) # Save s0
mv s0, a0 # s0 = x
# if x == 0, return 32
beqz s0, my_clz_return_32
# if x < 0 (i.e., (int32_t)x < 0), return 0
bltz s0, my_clz_return_0
# count = 0
li t0, 0 # t0 = count
li t1, 31 # t1 = i
my_clz_loop:
blt t1, zero, my_clz_end_loop # if i < 0, end loop
li t2, 1
sll t2, t2, t1 # t2 = 1U << i
and t3, s0, t2 # t3 = x & (1U << i)
bnez t3, my_clz_end_loop
addi t0, t0, 1 # count++
addi t1, t1, -1 # i--
jal zero, my_clz_loop
my_clz_end_loop:
mv a0, t0 # Return count
jal zero, my_clz_exit
my_clz_return_32:
li a0, 32
jal zero, my_clz_exit
my_clz_return_0:
li a0, 0
my_clz_exit:
# Restore registers and return
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
# Simplified print_result function for Ripes simulator
print_result:
addi sp, sp, -8 # Allocate stack space
sw ra, 4(sp) # Save return address
sw a0, 0(sp) # Save a0 (address of result_str)
# Print "Result: "
mv a0, a0 # a0 already contains address of result_str
li a7, 4 # ECALL for print string
ecall
# Print integer in a2
mv a0, a2 # Move result to a0
jal ra, print_int # Call print_int
# Print newline
la a0, newline_str
li a7, 4
ecall
# Restore registers and stack
lw a0, 0(sp) # Restore a0
lw ra, 4(sp) # Restore ra
addi sp, sp, 8
ret
# Corrected print_int function
print_int:
# Convert integer to string and print
# Handle negative numbers
addi sp, sp, -36 # Adjust stack size for additional registers
sw ra, 32(sp)
sw t0, 28(sp)
sw t1, 24(sp)
sw t2, 20(sp)
sw t3, 16(sp)
sw t4, 12(sp)
sw t5, 8(sp)
sw a4, 4(sp) # Save a4
sw a5, 0(sp) # Save a5
mv t0, a0 # t0 = number
la t1, int_buffer # t1 = address of int_buffer
# Clear the buffer
li t2, 12 # t2 = buffer size
li t3, 0 # t3 = zero
mv a5, t1 # a5 = buffer pointer
clear_buffer_loop:
beqz t2, clear_buffer_done
sb t3, 0(a5)
addi a5, a5, 1
addi t2, t2, -1
jal zero, clear_buffer_loop
clear_buffer_done:
# Reset t1 to end of buffer
la t1, int_buffer # t1 = address of int_buffer
addi t1, t1, 11 # t1 points to int_buffer + 11 (one before end)
li t2, 48 # t2 = ASCII code for '0'
li t3, 0 # t3 = sign flag
li a4, 10 # a4 = 10
# Check if number is zero
beq t0, zero, print_int_zero
# Check if number is negative
blt t0, zero, print_int_negative
print_int_loop:
# t4 = t0 / 10 (quotient)
# t5 = t0 % 10 (remainder)
# Initialize t4 = 0 (quotient), t5 = t0 (remainder)
mv t5, t0 # t5 = t0 (remainder)
li t4, 0 # t4 = 0 (quotient)
print_int_divide_loop:
blt t5, a4, print_int_divide_end
sub t5, t5, a4 # t5 -= 10
addi t4, t4, 1 # t4 += 1
jal zero, print_int_divide_loop
print_int_divide_end:
# Now t4 is quotient, t5 is remainder
# Convert remainder to character
add t6, t5, t2 # t6 = t5 + '0'
sb t6, 0(t1) # Store character at t1
addi t1, t1, -1 # t1 -= 1
# Update t0 to quotient
mv t0, t4 # t0 = t4
# If t0 != 0, continue loop
bnez t0, print_int_loop
# After loop, check for sign
beqz t3, print_int_print # If not negative, proceed to print
# Add '-' sign
li t6, 45 # '-'
sb t6, 0(t1)
addi t1, t1, -1
print_int_print:
addi t1, t1, 1 # Adjust t1 to point to the start of string
# Print the number string
mv a0, t1 # a0 = pointer to start of string
li a7, 4
ecall
# Restore registers and stack
lw a5, 0(sp) # Restore a5
lw a4, 4(sp) # Restore a4
lw t5, 8(sp)
lw t4, 12(sp)
lw t3, 16(sp)
lw t2, 20(sp)
lw t1, 24(sp)
lw t0, 28(sp)
lw ra, 32(sp)
addi sp, sp, 36
ret
print_int_negative:
li t3, 1 # Set sign flag
neg t0, t0 # t0 = -t0
jal zero, print_int_loop
print_int_zero:
# Handle zero
li t6, 48 # '0'
sb t6, 0(t1)
addi t1, t1, -1
jal zero, print_int_print
```
:::danger
Use fewer instructions.
:::
#### Third Version (Improved from the second version)
```assemby=
.data
# Test Case 1: nums = [2, 2, 3, 2], expected result: 3
nums1:
.word 2, 2, 3, 2
nums1_size:
.word 4
nums1_expected:
.word 3
# Test Case 2: nums = [0, 1, 0, 1, 0, 1, 99], expected result: 99
nums2:
.word 0, 1, 0, 1, 0, 1, 99
nums2_size:
.word 7
nums2_expected:
.word 99
# Test Case 3: nums = [-2, -2, -2, -5], expected result: -5
nums3:
.word -2, -2, -2, -5
nums3_size:
.word 4
nums3_expected:
.word -5
# Arrays of Test Cases
test_cases:
.word nums1
.word nums2
.word nums3
test_cases_size:
.word 4
.word 7
.word 4
expected_results:
.word 3
.word 99
.word -5
num_test_cases:
.word 3
correct_msg:
.asciz "correct\n"
wrong_msg_prefix:
.asciz "wrong at test case "
newline:
.asciz "\n"
int_buffer:
.zero 12 # Allocate and zero-initialize 12 bytes
.text
.globl main
main:
# Initialize registers
la s0, test_cases # s0 = base address of test_cases array
la s1, test_cases_size # s1 = base address of test_cases_size array
la s2, expected_results # s2 = base address of expected_results array
lw s3, num_test_cases # s3 = number of test cases
li s4, 0 # s4 = test index
loop_test_cases:
beq s4, s3, end_program # If test index equals number of test cases, end program
slli t0, s4, 2 # t0 = s4(index) * 4 (word offset)
add t1, s0, t0 # t1 = &test_cases[s4(index)]
lw a0, 0(t1) # a0 = test_cases[s4]
add t2, s1, t0 # t6 = &test_cases_size[s4]
lw a1, 0(t2) # a1 = test_cases_size[s4]
addi sp, sp , -4
sw t0, 0(sp) # Save t0 (word offset)
jal ra, singleNumber # singleNumber(nums, size)
mv t1, a0 # t1 = result
lw t0, 0(sp) # Restore t0
addi sp, sp , 4
add t2, s2, t0 # t2 = &expected_results[s0]
lw t2, 0(t2) # t2 = expected_results[s0]
bne t1, t2, report_error # If result != expected result, report_error
# result == expected result, print "correct"
la a0, correct_msg
li a7, 4 # ECALL 4: print string
ecall
j next_test_case
report_error:
# Print "wrong at test case "
la a0, wrong_msg_prefix
li a7, 4 # ECALL 4: print string
ecall
#(0-based)
mv a0, s4 # a0 = s4
li a7, 1 # ECALL 1: print integer
ecall
# Print newline
la a0, newline
li a7, 4 # ECALL 4: newline
ecall
next_test_case:
addi s4, s4, 1 # s4++
j loop_test_cases # Loop to next test case
end_program:
li a7, 10 # ECALL 10: exit program
ecall
# int singleNumber(int* nums, int numsSize)
singleNumber:
addi sp, sp, -40 # Allocate stack space
sw ra, 36(sp) # Save return address
sw s0, 32(sp) # Save s0
sw s1, 28(sp) # Save s1
sw s2, 24(sp) # Save s2
sw s3, 20(sp) # Save s3
sw s4, 16(sp) # Save s4
sw s5, 12(sp)
sw s6, 8(sp)
sw s7, 4(sp)
sw s8, 0(sp)
mv s0, a0 # s0 = nums (int* nums)
mv s1, a1 # s1 = numsSize (int numsSize)
li s8, 0 # s8 = result = 0
# Initialize min_leading_zeros = 32
li s3, 32 # s3 = min_leading_zeros = 32
li s4, 0 # s4 = i = 0
find_min_leading_zeros_loop:
bge s4, s1, end_min_leading_zeros_loop
# Load nums[i]
slli t0, s4, 2 # t0 = i * 4
add t0, s0, t0 # t0 = &nums[i]
lw t1, 0(t0) # t1 = nums[i]
bnez t1, process_non_zero
addi s4, s4, 1 # i++
j find_min_leading_zeros_loop
process_non_zero:
# Call my_clz((uint32_t)nums[i])
mv a0, t1 # a0 = nums[i]
jal ra, my_clz # Call my_clz
mv t2, a0 # t2 = leading_zeros
blt t2, s3, update_min_leading_zeros
j skip_update_min_leading_zeros
update_min_leading_zeros:
mv s3, t2 # min_leading_zeros = leading_zeros
skip_update_min_leading_zeros:
addi s4, s4, 1 # i++
j find_min_leading_zeros_loop
end_min_leading_zeros_loop:
# Compute max_bit_index = 32 - min_leading_zeros
li t0, 32
sub s5, t0, s3 # s5 = 32 - min_leading_zeros
# Initialize s4 = 0 (i)
li s4, 0 # s4 = i = 0
singleNumber_outer_loop:
bge s4, s5, singleNumber_end_outer_loop # If i >= (32 - min_leading_zeros), exit loop
# Initialize count = 0 and j = 0
li s6, 0 # s6 = count = 0
li s7, 0 # s7 = j = 0
singleNumber_inner_loop:
bge s7, s1, singleNumber_end_inner_loop # If j >= numsSize, exit inner loop
# Load nums[j]
slli t1, s7, 2 # t1 = j * 4
add t1, s0, t1 # t1 = &nums[j]
lw t2, 0(t1) # t2 = nums[j]
# Check nums[j] & (1U << i)
li t3, 1
sll t3, t3, s4 # t3 = 1U << i
and t4, t2, t3 # t4 = nums[j] & (1U << i)
beqz t4, singleNumber_skip_increment # If zero, skip increment
# count = (count + 1) % 3
addi s6, s6, 1 # count++
li t5, 3
blt s6, t5, singleNumber_skip_reset # If count < 3, skip reset
li s6, 0 # count = 0
singleNumber_skip_reset:
# Continue to next iteration
singleNumber_skip_increment:
addi s7, s7, 1 # j++
j singleNumber_inner_loop
singleNumber_end_inner_loop:
# If count != 0, set the corresponding bit
beqz s6, singleNumber_skip_bit_set # If count == 0, skip setting bit
# result |= (1U << i)
li t6, 1
sll t6, t6, s4 # t6 = 1U << i
or s8, s8, t6 # s8 |= t6
singleNumber_skip_bit_set:
addi s4, s4, 1 # i++
j singleNumber_outer_loop
singleNumber_end_outer_loop:
mv a0, s8 # Move result to a0
# Restore registers and stack
lw s8, 0(sp)
lw s7, 4(sp)
lw s6, 8(sp)
lw s5, 12(sp)
lw s4, 16(sp)
lw s3, 20(sp)
lw s2, 24(sp)
lw s1, 28(sp)
lw s0, 32(sp)
lw ra, 36(sp)
addi sp, sp, 40
ret
# int my_clz(uint32_t x)
my_clz:
addi sp, sp, -8 # Allocate stack space
sw ra, 4(sp) # Save return address
sw s0, 0(sp) # Save s0
mv s0, a0 # s0 = x
# If x == 0, return 32
beqz s0, my_clz_return_32
# If x < 0, return 0
bltz s0, my_clz_return_0
# Initialize count = 0 and i = 31
li t0, 0 # t0 = count
li t1, 31 # t1 = i
my_clz_loop:
blt t1, zero, my_clz_end_loop # If i < 0, end loop
li t2, 1
sll t2, t2, t1 # t2 = 1U << i
and t3, s0, t2 # t3 = x & (1U << i)
bnez t3, my_clz_end_loop
addi t0, t0, 1 # count++
addi t1, t1, -1 # i--
j my_clz_loop
my_clz_end_loop:
mv a0, t0 # Return count
j my_clz_exit
my_clz_return_32:
li a0, 32
j my_clz_exit
my_clz_return_0:
li a0, 0
my_clz_exit:
# Restore registers and stack
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
```
This version uses internal verification to replace the manual visual inspection process and reduces the code for printing the result.
#### fourth version (Improved from the fourth version)
```assembly=
.data
# Test Case 1: nums = [2, 2, 3, 2], expected result: 3
nums1:
.word 2, 2, 3, 2
nums1_size:
.word 4
nums1_expected:
.word 3
# Test Case 2: nums = [0, 1, 0, 1, 0, 1, 99], expected result: 99
nums2:
.word 0, 1, 0, 1, 0, 1, 99
nums2_size:
.word 7
nums2_expected:
.word 99
# Test Case 3: nums = [-2, -2, -2, -5], expected result: -5
nums3:
.word -2, -2, -2, -5
nums3_size:
.word 4
nums3_expected:
.word -5
# Arrays of Test Cases
test_cases:
.word nums1
.word nums2
.word nums3
test_cases_size:
.word 4
.word 7
.word 4
expected_results:
.word 3
.word 99
.word -5
num_test_cases:
.word 3
correct_msg:
.asciz "correct\n"
wrong_msg_prefix:
.asciz "wrong at test case "
newline:
.asciz "\n"
int_buffer:
.zero 12 # Allocate and zero-initialize 12 bytes
.text
.globl main
main:
# Initialize registers
la s0, test_cases # s0 = base address of test_cases array
la s1, test_cases_size # s1 = base address of test_cases_size array
la s2, expected_results # s2 = base address of expected_results array
lw s3, num_test_cases # s3 = number of test cases
li s4, 0 # s4 = test index
loop_test_cases:
beq s4, s3, end_program # If test index equals number of test cases, end program
slli t0, s4, 2 # t0 = s4(index) * 4 (word offset)
add t1, s0, t0 # t1 = &test_cases[s4(index)]
lw a0, 0(t1) # a0 = test_cases[s4]
add t2, s1, t0 # t2 = &test_cases_size[s4]
lw a1, 0(t2) # a1 = test_cases_size[s4]
addi sp, sp , -4
sw t0, 0(sp) # Save t0 (word offset)
jal ra, singleNumber # singleNumber(nums, size)
mv t1, a0 # t1 = result
lw t0, 0(sp) # Restore t0
addi sp, sp , 4
add t2, s2, t0 # t2 = &expected_results[s4]
lw t2, 0(t2) # t2 = expected_results[s4]
bne t1, t2, report_error # If result != expected result, report_error
# result == expected result, print "correct"
la a0, correct_msg
li a7, 4 # ECALL 4: print string
ecall
j next_test_case
report_error:
# Print "wrong at test case "
la a0, wrong_msg_prefix
li a7, 4 # ECALL 4: print string
ecall
#(0-based)
mv a0, s4 # a0 = s4
li a7, 1 # ECALL 1: print integer
ecall
# Print newline
la a0, newline
li a7, 4 # ECALL 4: newline
ecall
next_test_case:
addi s4, s4, 1 # s4++
j loop_test_cases # Loop to next test case
end_program:
li a7, 10 # ECALL 10: exit program
ecall
# int singleNumber(int* nums, int numsSize)
singleNumber:
addi sp, sp, -40 # Allocate stack space
sw ra, 36(sp) # Save return address
sw s0, 32(sp) # Save s0
sw s1, 28(sp) # Save s1
sw s2, 24(sp) # Save s2
sw s3, 20(sp) # Save s3
sw s4, 16(sp) # Save s4
sw s5, 12(sp) # Save s5
sw s6, 8(sp) # Save s6
sw s7, 4(sp) # Save s7
sw s8, 0(sp) # Save s8
mv s0, a0 # s0 = nums (int* nums)
mv s1, a1 # s1 = numsSize (int numsSize)
li s8, 0 # s8 = result = 0
# Initialize min_leading_zeros = 32
li s3, 32 # s3 = min_leading_zeros = 32
li s4, 0 # s4 = i = 0
find_min_leading_zeros_loop:
bge s4, s1, end_min_leading_zeros_loop
# Load nums[i]
slli t0, s4, 2 # t0 = i * 4
add t0, s0, t0 # t0 = &nums[i]
lw t1, 0(t0) # t1 = nums[i]
bnez t1, process_non_zero
addi s4, s4, 1 # i++
j find_min_leading_zeros_loop
process_non_zero:
# Call my_clz((uint32_t)nums[i])
mv a0, t1 # a0 = nums[i]
jal ra, my_clz # Call my_clz
mv t2, a0 # t2 = leading_zeros
blt t2, s3, update_min_leading_zeros
j skip_update_min_leading_zeros
update_min_leading_zeros:
mv s3, t2 # min_leading_zeros = leading_zeros
skip_update_min_leading_zeros:
addi s4, s4, 1 # i++
j find_min_leading_zeros_loop
end_min_leading_zeros_loop:
# Compute max_bit_index = 32 - min_leading_zeros
li t0, 32
sub s5, t0, s3 # s5 = 32 - min_leading_zeros
# Initialize s4 = 0 (i)
li s4, 0 # s4 = i = 0
singleNumber_outer_loop:
bge s4, s5, singleNumber_end_outer_loop # If i >= (32 - min_leading_zeros), exit loop
# Initialize count = 0 and j = 0
li s6, 0 # s6 = count = 0
li s7, 0 # s7 = j = 0
singleNumber_inner_loop:
bge s7, s1, singleNumber_end_inner_loop # If j >= numsSize, exit inner loop
# Load nums[j]
slli t1, s7, 2 # t1 = j * 4
add t1, s0, t1 # t1 = &nums[j]
lw t2, 0(t1) # t2 = nums[j]
# Check nums[j] & (1U << i)
li t3, 1
sll t3, t3, s4 # t3 = 1U << i
and t4, t2, t3 # t4 = nums[j] & (1U << i)
beqz t4, singleNumber_skip_increment # If zero, skip increment
# count = (count + 1) % 3
addi s6, s6, 1 # count++
li t5, 3
blt s6, t5, singleNumber_skip_reset # If count < 3, skip reset
li s6, 0 # count = 0
singleNumber_skip_reset:
# Continue to next iteration
singleNumber_skip_increment:
addi s7, s7, 1 # j++
j singleNumber_inner_loop
singleNumber_end_inner_loop:
# If count != 0, set the corresponding bit
beqz s6, singleNumber_skip_bit_set # If count == 0, skip setting bit
# result |= (1U << i)
li t6, 1
sll t6, t6, s4 # t6 = 1U << i
or s8, s8, t6 # s8 |= t6
singleNumber_skip_bit_set:
addi s4, s4, 1 # i++
j singleNumber_outer_loop
singleNumber_end_outer_loop:
mv a0, s8 # Move result to a0
# Restore registers and stack
lw s8, 0(sp)
lw s7, 4(sp)
lw s6, 8(sp)
lw s5, 12(sp)
lw s4, 16(sp)
lw s3, 20(sp)
lw s2, 24(sp)
lw s1, 28(sp)
lw s0, 32(sp)
lw ra, 36(sp)
addi sp, sp, 40
ret
# int my_clz(uint32_t x)
my_clz:
addi sp, sp, -8 # Allocate stack space
sw ra, 4(sp) # Save return address
sw s0, 0(sp) # Save s0
mv s0, a0 # s0 = x
# If x == 0, return 32
beqz s0, return_32
# If x < 0, return 0
bltz s0, return_0
# Begin computation as per the C code
# x |= (x >> 1)
srli t0, s0, 1 # t0 = x >> 1
or s0, s0, t0 # x |= t0
# x |= (x >> 2)
srli t0, s0, 2 # t0 = x >> 2
or s0, s0, t0 # x |= t0
# x |= (x >> 4)
srli t0, s0, 4 # t0 = x >> 4
or s0, s0, t0 # x |= t0
# x |= (x >> 8)
srli t0, s0, 8 # t0 = x >> 8
or s0, s0, t0 # x |= t0
# x |= (x >> 16)
srli t0, s0, 16 # t0 = x >> 16
or s0, s0, t0 # x |= t0
# x -= ((x >> 1) & 0x55555555)
srli t0, s0, 1 # t0 = x >> 1
li t1, 0x55555555 # Load mask 0x55555555
and t0, t0, t1 # t0 = (x >> 1) & 0x55555555
sub s0, s0, t0 # x -= t0
# x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
srli t0, s0, 2 # t0 = x >> 2
li t1, 0x33333333 # Load mask 0x33333333
and t0, t0, t1 # t0 = (x >> 2) & 0x33333333
and t2, s0, t1 # t2 = x & 0x33333333
add s0, t0, t2 # x = t0 + t2
# x = ((x >> 4) + x) & 0x0f0f0f0f
srli t0, s0, 4 # t0 = x >> 4
add s0, s0, t0 # x += t0
li t1, 0x0f0f0f0f # Load mask 0x0f0f0f0f
and s0, s0, t1 # x &= 0x0f0f0f0f
# x += (x >> 8)
srli t0, s0, 8 # t0 = x >> 8
add s0, s0, t0 # x += t0
# x += (x >> 16)
srli t0, s0, 16 # t0 = x >> 16
add s0, s0, t0 # x += t0
# Return (32 - (x & 0x3f))
andi t0, s0, 0x3f # t0 = x & 0x3f
li t1, 32 # t1 = 32
sub a0, t1, t0 # a0 = 32 - t0
# Restore and return
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
return_32:
li a0, 32 # Return 32
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
return_0:
li a0, 0 # Return 0
lw s0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
```
This version uses population count to calculate leading zeros, reducing the loop cycles of the previous my_clz version.
## Analyze
### In five stage processor
-
| Discription | Cycles |
| -------- | -------- |
| baseline | 9174 |
| with `count leading zero` | 7194 |
| with `count leading zero and deal with special case` | 6693 |
| with `population count` | 4473 |
In the baseline, no specific method was used to determine the single number, meaning every bit (in a 32-bit integer) of each number had to be checked once, which resulted in a significant waste of cycles.
After using count leading zero, the minimum leading zero value for each number is calculated first. Once this value is obtained, we can determine the position of the highest set bit across all numbers, reducing the number of bit comparisons required and significantly lowering the cycle count.
A special case arises when handling negative numbers and zero. For negative numbers, the highest bit is always 1, and for zero, it means all bits are zero, so the leading zero count can directly be returned as 0 or 32.
Finally, implementing leading zero counting with a loop is too cycle-expensive, so population count is used instead. This way, calculating the leading zeros for a number only requires.
### Program Functionality Overview
My assembly program is designed to test the singleNumber function, which identifies the unique number in an array where every other number appears three times. The program sets up multiple test cases, executes the singleNumber function for each case, and verifies the result against expected outcomes. If the result matches, it prints "correct"; otherwise, it reports an error indicating which test case failed.
### Pipeline Stages and Instruction Operation
#### Example 1 (Initialization in main)
```assembly=
main:
la s0, test_cases # Load address of test_cases array into s0
la s1, test_cases_size # Load address of test_cases_size array into s1
la s2, expected_results # Load address of expected_results array into s2
lw s3, num_test_cases # Load number of test cases into s3
li s4, 0 # Initialize test index s4 to 0
```
Example instruction: la s0, test_cases(auipc x8, 0x10000; addi x8, x8, 84)
* The instruction la s0, test_cases in Ripes will be split into an auipc to load the upper 20 bits and an addi to load the lower 12 bits. Therefore, we will focus only on the execution of the auipc instruction within the five-stage pipeline.
- | Stage | Description |
| -------- | -------- |
| IF | The instruction is fetched from memory. |
| ID | Decoded to understand it's a auipc instruction targeting register s0(x8). |
| EX | The address of test_cases is calculated. |
| MEM | Not applicable for auipc as it doesn't access memory. |
| WB | The computed address is written to register s0. |





#### Example 2 (Looping through Test Cases)
```assembly=
loop_test_cases:
beq s4, s3, end_program # If test index equals number of test cases, end program
slli t0, s4, 2 # t0 = s4(index) * 4 (word offset)
add t1, s0, t0 # t1 = &test_cases[s4(index)]
lw a0, 0(t1) # a0 = test_cases[s4]
...
```
Example Instruction: slli t0, s4, 2
- | Stage | Description |
| -------- | -------- |
| IF | Fetches the slli (shift left logical immediate) instruction|
| ID | Decodes the operation to shift s4 left by 2 bits |
| EX | Performs the shift operation (s4 << 2) to calculate the word offset |
| MEM | Not applicable as no memory access is needed |
| WB | Writes the shifted value into register t0 |





#### Example 3 (Calling the singleNumber Function)
```assembly=
addi sp, sp , -4
sw t0, 0(sp) # Save t0 (word offset)
jal ra, singleNumber # singleNumber(nums, size)
mv t1, a0 # t1 = result
lw t0, 0(sp) # Restore t0
addi sp, sp , 4
```
Example Instruction: jal ra, singleNumber
- | Stage | Description |
| -------- | -------- |
| IF | Fetches the jal (jump and link) instruction|
| ID | Decodes the jump to the singleNumber label and saves the return address in ra |
| EX | Calculates the jump target address |
| MEM | Not applicable for jal |
| WB | Writes the return address to ra |





#### Example 4 (Memory Access)
```assembly=
lw t2, 0(t2) # t2 = expected_results[s4]
```
Example Instruction: lw t2, 0(t2)
- | Stage | Description |
| -------- | -------- |
| IF | Fetches the lw (load word) instruction|
| ID | Decodes the load operation to retrieve a word from memory address in t2 |
| EX | Calculates the effective memory address by adding the offset |
| MEM |Accesses memory to load the word from the calculated address |
| WB | Writes the loaded value into register t2 |





## Reference
- [Quiz1 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
- [Quiz1 of Computer Architecture (2023 Fall)](https://hackmd.io/@sysprog/arch2023-quiz1-sol)
- [LeetCode 137. Single Number II](https://leetcode.com/problems/single-number-ii/description/)