contributed by <徐崇智>
C
__builtin_clz
In the case where x = 0
, the result is undefined according to the GCC documentation, so we cannot allow 0 to be used as an input to the my_clz
function.
x = 1
1
is 0000 0000 0000 0000 0000 0000 0000 0001
31
my_clz(1)
returns 31
x = 255
255
is 0000 0000 0000 0000 0000 0000 1111 1111
24
my_clz(255)
returns 24
x = 536870911
536870911
is 0001 1111 1111 1111 1111 1111 1111 1111
my_clz(536870911)
returns 3
#include <stdint.h>
int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; i--) {
if (x & (1U << i)) {
break;
}
count++;
}
return count;
}
The function starts with a counter count
initialized to zero, which will keep track of the number of leading zeros.
A for loop iterates from 31 down to 0, checking each bit position in the integer x
. It uses a logical left shift operation (1U << i)
to create a mask where only position i is set to 1, while all other positions are 0. This effectively allows for the detection of whether the specific bit in x is 1, enabling the calculation of the number of leading zeros.
The condition if (x & (1U << i))
checks if the ith bit of x
is 1. If it finds a 1, the loop break, however if the ith bit is 0, it increments the count.
When the loop is complete ( because a 1 was found or all bits have been checked ), the function returns the count
, which represents the number of leading zeros in x
.
.data
argument1: .word 1
argument2: .word 255
argument3: .word 536870911
str1: .string "There are "
str2: .string " leading zeros in "
new_line: .string "\n"
.text
main:
lw a0, argument1 # Load the argument1 (1) into register a0
jal ra, my_clz # Jump-and-link to the 'my_clz' function to count leading zeros of argument1 (1)
# Prepare to print the result
lw a1, argument1 # Reload the original argument into a1 to print it
jal ra, printResult # Call the function to print the result
lw a0, argument2 # Load the argument2 (255) into register a0
jal ra, my_clz # Jump-and-link to the 'my_clz' function to count leading zeros of argument2 (255)
# Prepare to print the result
lw a1, argument2 # Reload the original argument into a1 to print it
jal ra, printResult # Call the function to print the result
lw a0, argument3 # Load the argument3 (536870911) into register a0
jal ra, my_clz # Jump-and-link to the 'my_clz' function to count leading zeros of argument3 (536870911)
# Prepare to print the result
lw a1, argument3 # Reload the original argument into a1 to print it
jal ra, printResult # Call the function to print the result
li a7, 10 # Exit system call
ecall
my_clz:
addi t0, zero, 0 # Initialize t0 (count) to zero
addi t1, zero, 31 # Set t1 = 31
loop:
blt t1, zero, exit # Check if t1 < 0 then exit loop
addi t2, zero, 1 # Set t2 = 1
sll t2, t2, t1 # Shift left logic on t2 by t1
and t3, a0, t2 # Do x & (1U << i)
bne t3, zero, exit # if x & (1U << i) != 0 then exit loop
addi t0, t0, 1 # count = count + 1
addi t1, t1, -1 # t1 = t1 - 1 (i--)
j loop # goto loop label
exit:
mv a0, t0 # Move result from t0 to a0
jr ra # Return to main
printResult:
mv t0, a0 # Save original input value (argument) in temporary register t0
mv t1, a1 # Save my_clz result in temporary register t1
# Print "There are "
la a0, str1 # Load the address of the second string (" There are ")
li a7, 4 # System call code for printing a string
ecall
# Print my_clz result (leading zero count)
mv a0, t1 # Move the `my_clz` result to a0 for printing
li a7, 1 # System call code for printing an integer
ecall
# Print " leading zeros in "
la a0, str2 # Load the address of the second string (" leading zeros in ")
li a7, 4 # System call code for printing a string
ecall
# Print the original argument
mv a0, t0 # Move the input value (argument) to a0 for printing
li a7, 1 # System call code for printing an integer
ecall
# Print new_line
la a0, new_line # Load the address of the third string ("\n")
li a7, 4 # System call code for printing a string
ecall
jr ra # Return to main
Console
There are 31 leading zeros in 1
There are 24 leading zeros in 255
There are 3 leading zeros in 536870911
#include <stdint.h>
int my_clz(uint32_t x) {
int count = 0;
if (x <= 0x0000FFFF) { count += 16; x <<= 16;} // check the first 16 bits
if (x <= 0x00FFFFFF) { count += 8; x <<= 8; } // check the first 8 bits
if (x <= 0x0FFFFFFF) { count += 4; x <<= 4; } // check the first 4 bits
if (x <= 0x3FFFFFFF) { count += 2; x <<= 2; } // check the first 2 bits
if (x <= 0x7FFFFFFF) { count += 1;} // check the first 1 bits
return count;
}
The function starts with a counter count
initialized to zero, which will keep track of the number of leading zeros.
Instead of using a loop, it checks the most significant bits in chunks (16 bits, 8 bits, 4 bits, etc.), reducing the problem size with each step.
Once it’s determined that the leading bits are zeros, the value is shifted left by the corresponding number of bits to continue searching within the remaining bits.
After the condition checks are complete, the function returns count
, which indicates the number of leading zeros in x
.
.data
argument1: .word 1
argument2: .word 255
argument3: .word 536870911
str1: .string "There are "
str2: .string " leading zeros in "
new_line: .string "\n"
.text
main:
lw a0, argument1 # Load the argument1 (1) into register a0
jal ra, my_clz # Jump-and-link to the 'my_clz' function to count leading zeros of argument1 (1)
# Prepare to print the result
lw a1, argument1 # Reload the original argument into a1 to print it
jal ra, printResult # Call the function to print the result
lw a0, argument2 # Load the argument2 (255) into register a0
jal ra, my_clz # Jump-and-link to the 'my_clz' function to count leading zeros of argument2 (255)
# Prepare to print the result
lw a1, argument2 # Reload the original argument into a1 to print it
jal ra, printResult # Call the function to print the result
lw a0, argument3 # Load the argument3 (536870911) into register a0
jal ra, my_clz # Jump-and-link to the 'my_clz' function to count leading zeros of argument3 (536870911)
# Prepare to print the result
lw a1, argument3 # Reload the original argument into a1 to print it
jal ra, printResult # Call the function to print the result
li a7, 10 # Exit system call
ecall
my_clz:
addi t0, zero, 0 # Initialize t0 (count) to zero
mv t1, a0 # Move a0 input (argument) into t1
# Condition Check
li t2, 0x0000FFFF # Load 0x0000FFFF into t2
bleu t1, t2, step_16 # if x <= 0x0000FFFF then jump to step_16
j check_8 # jump to check_8
step_16:
addi t0, t0, 16 # count += 16
slli t1, t1, 16 # x <<= 16
check_8:
li t2, 0x00FFFFFF # Load 0x00FFFFFF into t2
bleu t1, t2, step_8 # if x <= 0x00FFFFFF then jump to step_8
j check_4 # jump to check_4
step_8:
addi t0, t0, 8 # count += 8
slli t1, t1, 8 # x <<= 8
check_4:
li t2, 0x0FFFFFFF # Load 0x0FFFFFFF into t2
bleu t1, t2, step_4 # if x <= 0x0FFFFFFF then jump to step_4
j check_2 # jump to check_2
step_4:
addi t0, t0, 4 # count += 4
slli t1, t1, 4 # x <<= 4
check_2:
li t2, 0x3FFFFFFF # Load 0x3FFFFFFF into t2
bleu t1, t2, step_2 # if x <= 0x3FFFFFFF then jump to step_2
j check_1 # jump to check_1
step_2:
addi t0, t0, 2 # count += 2
slli t1, t1, 2 # x <<= 2
check_1:
li t2, 0x7FFFFFFF # Load 0x7FFFFFFF into t2
bleu t1, t2, step_1 # if x <= 0x7FFFFFFF then jump to step_1
j exit # jump to exit
step_1:
addi t0, t0, 1 # count += 1
exit:
mv a0, t0 # Move result from t0 to a0
jr ra # Return to main
printResult:
mv t0, a0 # Save original input value (argument) in temporary register t0
mv t1, a1 # Save my_clz result in temporary register t1
# Print "There are "
la a0, str1 # Load the address of the second string (" There are ")
li a7, 4 # System call code for printing a string
ecall
# Print my_clz result (leading zero count)
mv a0, t1 # Move the `my_clz` result to a0 for printing
li a7, 1 # System call code for printing an integer
ecall
# Print " leading zeros in "
la a0, str2 # Load the address of the second string (" leading zeros in ")
li a7, 4 # System call code for printing a string
ecall
# Print the original argument
mv a0, t0 # Move the input value (argument) to a0 for printing
li a7, 1 # System call code for printing an integer
ecall
# Print new_line
la a0, new_line # Load the address of the third string ("\n")
li a7, 4 # System call code for printing a string
ecall
jr ra # Return to main
mv rd, rs
It translates to addi rd, rs, 0
.
j offset
It translates to jal x0, offset
.
jr rs
It translates to jalr x0, 0(rs)
.
li rd, immediate
It is composed of two actual instructions: lui
and addi
.
For example, the pseudo-instruction li t2, 0x0000FFFF
is translated into the following two instructions:
lui t2, 0x1
-> t2: 0x00010000addi t2, t2, -1
-> t2: 0x0000FFFFbleu rs, rt, offset
It translates to bgeu
(branch if greater than or equal, unsigned) with the positions of rt and rs swapped.
For example, the pseudo-instruction bleu t1, t2, step_16
is translated into bgeu t2, t1, step_16
.
I think there is an error in Ripes. The instruction
lui x7, 0x10
should actually belui x7, 0x1
.
In the image above, you can see that the instruction li t2, 0x0000FFFF
is translated into lui x7, 0x1
and addi x7, x7, -1
. Similarly, the instruction bleu t1, t2, step_16
is translated into bgeu x7, x6, 8<step_16>
.
Reference: RISC-V pseudoinstructions page 139
We used Ripes to simulate a 5-Stage RISC-V Processor
At the 4th clock cycle (C4), a control hazard (also known as a branch hazard) occurs, since the branch outcome is unknown, we must flush two instructions in the Instruction Fetch (IF) and Instruction Decode (ID) stages to prevent incorrect execution. This process is repeated for each subsequent branch instruction.
Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
---|---|---|---|---|---|---|---|---|---|
lw x10, 0(x10) | IF | ID | EXE | MEM | WB | ||||
jal x1, 84 <my_clz> | IF | ID | EXE | MEM | WB | ||||
addi x5, x0 0 | stall | stall | IF | ID | EXE | MEM | WB |
As shown in the picture above, there are two nop
instructions present to resolve the control hazard.
In the following list, the RAW (Read After Write) data hazard is highlighted in red and blue. To resolve this issue, a hardware component known as the forwarding unit is required. This unit connects across pipeline stages and controls the multiplexer in the execution stage, which selecting the correct value to forward to the ALU. By forwarding the result from either the memory or write-back stage. The forwarding unit ensures that the result of the last executed instruction is immediately available.
Instruction | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 |
---|---|---|---|---|---|---|---|---|---|
addi x5, x0 0 | IF | ID | EXE | MEM | WB | ||||
addi x6, x10, 0 | IF | ID | EXE | MEM | WB | ||||
lui x7, 0x01 | IF | ID | EXE | MEM | WB | ||||
addi, x7, x7, -1 | IF | ID | EXE | MEM | WB | ||||
bgeu x7 x6 8 <step_16> | IF | ID | EXE | MEM | WB |
The left picture below illustrates the RISC-V architecture with a forwarding circuit, but branch detection has not been moved earlier to the execution stage. Typically, moving branch detection to an earlier stage, such as the execution stage, would require flushing or stalling only two instructions (or clock cycles) in the IF and ID stages, respectively, upon a branch misprediction. This optimization can significantly reduce the number of wasted cycles, improving overall execution efficiency by minimizing the performance penalties associated with control hazards.
In the right picture, the value of x7 becomes immediately available in the memory stage and is forwarded to the execution stage as the source value for the next instruction.
Original Version:
Bit mask Version:
You can clearly see that the cycle count for the bit mask version is lower than that of the original version.
Here’s a problem that involves using the clz
(Count Leading Zeros) function: Smallest power of 2 greater than or equal to n
Description: Given a number
n
.Find the nearest number which is a power of 2 and greater than or equal ton
.Constraints: 1 <=
n
<= 109
x = 1
nearestPowerOfTwo(1)
returns 1
x = 5
nearestPowerOfTwo(5)
returns 8
x = 255
nearestPowerOfTwo(255)
returns 256
clz
to determine the number of leading zeros in the binary representation of n
.n
.Since the maximum value of n
is limited by the constraints of what an int
can store, I chose to use int
as the return type for the function.
#include <stdint.h>
int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; i--) {
if (x & (1U << i)) {
break;
}
count++;
}
return count;
}
int nearestPowerOfTwo(uint32_t n) {
// If n is already a power of two, return n directly
if ((n & (n - 1)) == 0) return n;
// Use my_clz to count the leading zeros of n
int leading_zeros = my_clz(n);
// Calculate the smallest power of 2 greater than or equal to n
int result = 1U << (32 - leading_zeros);
return result;
}
This problem in GeeksforGeeks is designed for a 64-bit system, but we are required to implement it on RV32I architecture. As a result, the C program must be adapted for a 32-bit system. If you attempt to use the following C program directly in GeeksforGeeks, it may result in errors due to this difference.
.data
argument1: .word 1
argument2: .word 5
argument3: .word 255
str1: .string "The nearest power of two to "
str2: .string " is "
new_line: .string "\n"
.text
main:
lw a0, argument1 # Load the argument1 (1) into register a0
jal ra, nearestPowerOfTwo # Jump-and-link to the 'my_clz' function to count leading zeros of argument1 (1)
# Prepare to print the result
lw a1, argument1 # Reload the original argument into a1 to print it
jal ra, printResult # Call the function to print the result
lw a0, argument2 # Load the argument2 (255) into register a0
jal ra, nearestPowerOfTwo # Jump-and-link to the 'my_clz' function to count leading zeros of argument2 (255)
# Prepare to print the result
lw a1, argument2 # Reload the original argument into a1 to print it
jal ra, printResult # Call the function to print the result
lw a0, argument3 # Load the argument3 (536870911) into register a0
jal ra, nearestPowerOfTwo # Jump-and-link to the 'my_clz' function to count leading zeros of argument3 (536870911)
# Prepare to print the result
lw a1, argument3 # Reload the original argument into a1 to print it
jal ra, printResult # Call the function to print the result
li a7, 10 # Exit system call
ecall
my_clz:
addi t0, zero, 0 # Initialize t0 (count) to zero
mv t1, a0 # Move a0 input (argument) into t1
# Condition Check
li t2, 0x0000FFFF # Load 0x0000FFFF into t2
bleu t1, t2, step_16 # if x <= 0x0000FFFF then jump to step_16
j check_8 # jump to check_8
step_16:
addi t0, t0, 16 # count += 16
slli t1, t1, 16 # x <<= 16
check_8:
li t2, 0x00FFFFFF # Load 0x00FFFFFF into t2
bleu t1, t2, step_8 # if x <= 0x00FFFFFF then jump to step_8
j check_4 # jump to check_4
step_8:
addi t0, t0, 8 # count += 8
slli t1, t1, 8 # x <<= 8
check_4:
li t2, 0x0FFFFFFF # Load 0x0FFFFFFF into t2
bleu t1, t2, step_4 # if x <= 0x0FFFFFFF then jump to step_4
j check_2 # jump to check_2
step_4:
addi t0, t0, 4 # count += 4
slli t1, t1, 4 # x <<= 4
check_2:
li t2, 0x3FFFFFFF # Load 0x3FFFFFFF into t2
bleu t1, t2, step_2 # if x <= 0x3FFFFFFF then jump to step_2
j check_1 # jump to check_1
step_2:
addi t0, t0, 2 # count += 2
slli t1, t1, 2 # x <<= 2
check_1:
li t2, 0x7FFFFFFF # Load 0x7FFFFFFF into t2
bleu t1, t2, step_1 # if x <= 0x7FFFFFFF then jump to step_1
j exit_my_clz # jump to exit_my_clz
step_1:
addi t0, t0, 1 # count += 1
exit_my_clz:
mv a0, t0 # Move result from t0 to a0
jr ra # Return to main
nearestPowerOfTwo:
addi sp, sp, -8 # Allocate stack space for local variables (ra)
sw ra, 0(sp) # Save return address (ra) on the stack
mv t0, a0 # Move argument into t0
addi t1, t0, -1 # n - 1
and t2, t0, t1 # n & (n - 1)
beq t2, zero, exit # if t2 == 0 then return n
jal ra, my_clz # Store the result of my_clz(n) in a0
mv t0, a0 # Move the result of my_clz(n) into t0
sub t0, zero, t0 # Store the negation of my_clz(n) in t0
addi t1, t0, 32 # 32 - my_clz(n)
addi t2, zero, 1 # Set t2 as 1U
sll t0, t2, t1 # 1U << (32 - my_clz(n))
exit:
mv a0, t0 # Move the result of nearestPowerOfTwo into a0
lw ra, 0(sp) # Restore return address of main (ra)
jr ra # Return to main
printResult:
mv t0, a0 # Save original input value (argument) in temporary register t0
mv t1, a1 # Save my_clz result in temporary register t1
# Print "There are "
la a0, str1 # Load the address of the second string (" There are ")
li a7, 4 # System call code for printing a string
ecall
# Print my_clz result (leading zero count)
mv a0, t1 # Move the `my_clz` result to a0 for printing
li a7, 1 # System call code for printing an integer
ecall
# Print " leading zeros in "
la a0, str2 # Load the address of the second string (" leading zeros in ")
li a7, 4 # System call code for printing a string
ecall
# Print the original argument
mv a0, t0 # Move the input value (argument) to a0 for printing
li a7, 1 # System call code for printing an integer
ecall
# Print new_line
la a0, new_line # Load the address of the third string ("\n")
li a7, 4 # System call code for printing a string
ecall
jr ra # Return to main
Since the nearestPowerOfTwo
function calls another function my_clz
, we must allocate stack space to store the return address (ra) of the main
function. This ensures that after the my_clz
function finishes execution, the program can correctly return to the calling function nearestPowerOfTwo
, and eventually back to main
. Therefore, stack management is important for preserving the correct control flow in the program.
Console:
The nearest power of two to 1 is 1
The nearest power of two to 5 is 8
The nearest power of two to 255 is 256