contributed by < paulpeng-popo >
Reverse bits of a given 32 bits unsigned integer.
Example:
For a vanilla reverseBits() function, we usually use a loop to iterate through each bit of the input integer, and then use a variable to store the reversed bits. However, if the integer has a lot of leading zeros, the loop will iterate through all the leading zeros, which is a waste of time.
First, we count the number of leading zeros of the input integer. Then, we use a loop to iterate through remaining bits of the input integer, and use a variable to store the reversed bits. Finally, we shift the reversed bits to the left by the number of leading zeros.
Store the result of counting leading zeros (clz) in the variable n
. We can disregard the leftmost n bits since they are zeros. During each iteration, append the rightmost bit of x
to y
until all set bits in x
have been shifted. Finally, correct the result by shifting y
back by n positions to the right.
I divided my code into two sub-functions. Initially, I load the argument into a0
. In the first function call to bitrev()
, I use the value stored in a0
as a parameter. Following that, I proceed to make a second function call to retrieve the count of leading zeros (clz) for the value in a0
. The resulting clz value is then returned to the calling function, bitrev()
. Subsequently, bitrev()
continues its bit reversal process using the clz value.
Code tests on Ripes simulator
In function clz
, I noticed that instruction li
sometimes would be extended by lui
and addi
My code
Executable code
The li
instruction is designed to load a 32-bit integer into a register. However, it's important to note that RV32I architecture cannot represent all the necessary information within a single 32-bit instruction format.
To address this limitation, we can make use of the lui
and addi
instructions. But how do we split the immediate value into two instructions?
For the lui
instruction, which stands for 'load upper immediate' and follows the U-type format, its structure is as follows:
[31:12] | [11:7] | [6:0] |
---|---|---|
immediate[31:12] | rd | opcode |
Here, we have 20 bits to store the upper part of the immediate value.
As for the addi
instruction, which adds a sign-extended immediate value to a register and follows the I-type format, its structure is as follows:
[31:20] | [19:15] | [14:12] | [11:7] | [6:0] |
---|---|---|---|---|
immediate[11:0] | rs1 | 000 | rd | opcode |
In this format, the lower 12 bits of the immediate value are stored.
Now, let's consider the value 0x55555555 as an example. We can break it down into two parts: 0x55555 and 0x555, which is equivalent to 1365 in decimal, as reflected in the executable code snippet above.
A crucial observation to make is that the addi
instruction takes into account signed integers. This means that if the 11th bit of the immediate value is set, it won't perform a standard extension, as shown in the code snippet below:
Instead, it performs a specialized extension as follows:
In this specialized extension, the overflow is moved to the upper part of the immediate value, and the redundant values are subtracted. This behavior results in code like the following:
My code
Executable code
This specialized behavior ensures that the signed integer handling in the addi
instruction aligns with the desired outcome.
Loop unrolling can expose more opportunities for parallelism within the loop body. When multiple iterations of the loop are executed in parallel, the processor can potentially execute instructions from different iterations concurrently, leading to better utilization of execution units in the CPU.
before
after