contributed by < paulpeng-popo >
Reverse bits of a given 32 bits unsigned integer.
Example:
Input: 0000 0010 1001 0100 0001 1110 1001 1100
Output: 0011 1001 0111 1000 0010 1001 0100 0000
For a vanilla reverseBits() function, we usually use a loop to iterate through each bit of the input integer, and then use a variable to store the reversed bits. However, if the integer has a lot of leading zeros, the loop will iterate through all the leading zeros, which is a waste of time.
First, we count the number of leading zeros of the input integer. Then, we use a loop to iterate through remaining bits of the input integer, and use a variable to store the reversed bits. Finally, we shift the reversed bits to the left by the number of leading zeros.
uint16_t count_leading_zeros(uint32_t x)
{
/* make sure all bits to the right of the MSB are 1 */
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x3f));
}
uint32_t reverseBits(uint32_t x)
{
uint16_t n = count_leading_zeros(x);
uint32_t y = x & 1;
while (x >>= 1)
{
y = ((y << 1) | (x & 1));
}
return (y << n);
}
Store the result of counting leading zeros (clz) in the variable n
. We can disregard the leftmost n bits since they are zeros. During each iteration, append the rightmost bit of x
to y
until all set bits in x
have been shifted. Finally, correct the result by shifting y
back by n positions to the right.
clz:
# s0 = x
# make sure all bits to the right of the MSB are 1
srli t0, s0, 1
or s0, s0, t0
srli t0, s0, 2
or s0, s0, t0
srli t0, s0, 4
or s0, s0, t0
srli t0, s0, 8
or s0, s0, t0
srli t0, s0, 16
or s0, s0, t0
# count ones (population count)
srli t0, s0, 1
li t1, 0x55555555
and s1, t0, t1
sub s0, s0, s1
srli t0, s0, 2
li t1, 0x33333333
and s1, t0, t1
and s2, s0, t1
add s0, s1, s2
srli t0, s0, 4
add s0, s0, t0
li t1, 0x0f0f0f0f
and s0, s0, t1
srli t0, s0, 8
add s0, s0, t0
srli t0, s0, 16
add s0, s0, t0
# mask 0x3f for popcount
andi s0, s0, 0x3f
# 32 - popcnt
li t0, 0x20
sub s0, t0, s0
bitrev:
# s0 = x
# s1 = y
# s2 = n
jal ra, clz
mv s2, a0
andi s1, s0, 1
bitrev_loop:
srli s0, s0, 1
beq s0, x0, bitrev_end
slli s1, s1, 1
andi t0, s0, 1
or s1, s1, t0
j bitrev_loop
bitrev_end:
sll s1, s1, s2
I divided my code into two sub-functions. Initially, I load the argument into a0
. In the first function call to bitrev()
, I use the value stored in a0
as a parameter. Following that, I proceed to make a second function call to retrieve the count of leading zeros (clz) for the value in a0
. The resulting clz value is then returned to the calling function, bitrev()
. Subsequently, bitrev()
continues its bit reversal process using the clz value.
Code tests on Ripes simulator
In function clz
, I noticed that instruction li
sometimes would be extended by lui
and addi
My code
srli t0, s0, 1
li t1, 0x55555555
and s1, t0, t1
sub s0, s0, s1
Executable code
00145293 srli x5 x8 1
55555337 lui x6 0x55555
55530313 addi x6 x6 1365
0062f4b3 and x9 x5 x6
40940433 sub x8 x8 x9
The li
instruction is designed to load a 32-bit integer into a register. However, it's important to note that RV32I architecture cannot represent all the necessary information within a single 32-bit instruction format.
To address this limitation, we can make use of the lui
and addi
instructions. But how do we split the immediate value into two instructions?
For the lui
instruction, which stands for 'load upper immediate' and follows the U-type format, its structure is as follows:
[31:12] | [11:7] | [6:0] |
---|---|---|
immediate[31:12] | rd | opcode |
Here, we have 20 bits to store the upper part of the immediate value.
As for the addi
instruction, which adds a sign-extended immediate value to a register and follows the I-type format, its structure is as follows:
[31:20] | [19:15] | [14:12] | [11:7] | [6:0] |
---|---|---|---|---|
immediate[11:0] | rs1 | 000 | rd | opcode |
In this format, the lower 12 bits of the immediate value are stored.
Now, let's consider the value 0x55555555 as an example. We can break it down into two parts: 0x55555 and 0x555, which is equivalent to 1365 in decimal, as reflected in the executable code snippet above.
A crucial observation to make is that the addi
instruction takes into account signed integers. This means that if the 11th bit of the immediate value is set, it won't perform a standard extension, as shown in the code snippet below:
lui s0, immediate >> 12
addi s0, s0, (immediate & 0xfff)
Instead, it performs a specialized extension as follows:
lui s0, ((immediate >> 12) + 1)
addi s0, s0, ((immediate & 0xfff) - 2^12)
In this specialized extension, the overflow is moved to the upper part of the immediate value, and the redundant values are subtracted. This behavior results in code like the following:
My code
srli t0, s0, 4
add s0, s0, t0
li t1, 0x0f0f0f0f
and s0, s0, t1
Executable code
00445293 srli x5 x8 4
00540433 add x8 x8 x5
0f0f1337 lui x6 0xf0f1
f0f30313 addi x6 x6 -241
00647433 and x8 x8 x6
This specialized behavior ensures that the signed integer handling in the addi
instruction aligns with the desired outcome.
Loop unrolling can expose more opportunities for parallelism within the loop body. When multiple iterations of the loop are executed in parallel, the processor can potentially execute instructions from different iterations concurrently, leading to better utilization of execution units in the CPU.
before
bitrev_loop:
srli s0, s0, 1
beq s0, x0, bitrev_end
slli s1, s1, 1
andi t0, s0, 1
or s1, s1, t0
j bitrev_loop
after
bitrev_loop:
srli s0, s0, 1
beq s0, x0, bitrev_end
slli s1, s1, 1
andi t0, s0, 1
or s1, s1, t0
srli s0, s0, 1
beq s0, x0, bitrev_end
slli s1, s1, 1
andi t0, s0, 1
or s1, s1, t0
srli s0, s0, 1
beq s0, x0, bitrev_end
slli s1, s1, 1
andi t0, s0, 1
or s1, s1, t0
srli s0, s0, 1
beq s0, x0, bitrev_end
slli s1, s1, 1
andi t0, s0, 1
or s1, s1, t0
j bitrev_loop