Bits reversal using clz

Functionality

Reverse bits of a given 32 bits unsigned integer.

Example:

Input:  0000 0010 1001 0100 0001 1110 1001 1100
Output: 0011 1001 0111 1000 0010 1001 0100 0000

Motivation

For a vanilla reverseBits() function, we usually use a loop to iterate through each bit of the input integer, and then use a variable to store the reversed bits. However, if the integer has a lot of leading zeros, the loop will iterate through all the leading zeros, which is a waste of time.

Idea

First, we count the number of leading zeros of the input integer. Then, we use a loop to iterate through remaining bits of the input integer, and use a variable to store the reversed bits. Finally, we shift the reversed bits to the left by the number of leading zeros.

Implementations

C code

bitr.c

uint16_t count_leading_zeros(uint32_t x)
{
    /* make sure all bits to the right of the MSB are 1 */
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);

    /* count ones (population count) */
    x -= ((x >> 1) & 0x55555555);
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);

    return (32 - (x & 0x3f));
}

uint32_t reverseBits(uint32_t x)
{
    uint16_t n = count_leading_zeros(x);
    uint32_t y = x & 1;

    while (x >>= 1)
    {
        y = ((y << 1) | (x & 1));
    }

    return (y << n);
}

Store the result of counting leading zeros (clz) in the variable n. We can disregard the leftmost n bits since they are zeros. During each iteration, append the rightmost bit of x to y until all set bits in x have been shifted. Finally, correct the result by shifting y back by n positions to the right.

Assembly code

bitr.s

clz:    
    # s0 = x
    # make sure all bits to the right of the MSB are 1
    srli t0, s0, 1
    or   s0, s0, t0
    srli t0, s0, 2
    or   s0, s0, t0
    srli t0, s0, 4
    or   s0, s0, t0
    srli t0, s0, 8
    or   s0, s0, t0
    srli t0, s0, 16
    or   s0, s0, t0

    # count ones (population count)
    srli t0, s0, 1
    li   t1, 0x55555555
    and  s1, t0, t1
    sub  s0, s0, s1

    srli t0, s0, 2
    li   t1, 0x33333333
    and  s1, t0, t1
    and  s2, s0, t1
    add  s0, s1, s2

    srli t0, s0, 4
    add  s0, s0, t0
    li   t1, 0x0f0f0f0f
    and  s0, s0, t1
    
    srli t0, s0, 8
    add  s0, s0, t0

    srli t0, s0, 16
    add  s0, s0, t0

    # mask 0x3f for popcount
    andi s0, s0, 0x3f
    
    # 32 - popcnt
    li  t0, 0x20
    sub s0, t0, s0
    
    
bitrev:
    # s0 = x
    # s1 = y
    # s2 = n
    jal  ra, clz
    mv   s2, a0
    andi s1, s0, 1

bitrev_loop:
    srli s0, s0, 1
    beq  s0, x0, bitrev_end
    slli s1, s1, 1
    andi t0, s0, 1
    or   s1, s1, t0
    j bitrev_loop

bitrev_end:
    sll s1, s1, s2

I divided my code into two sub-functions. Initially, I load the argument into a0. In the first function call to bitrev(), I use the value stored in a0 as a parameter. Following that, I proceed to make a second function call to retrieve the count of leading zeros (clz) for the value in a0. The resulting clz value is then returned to the calling function, bitrev(). Subsequently, bitrev() continues its bit reversal process using the clz value.

Analysis

Code tests on Ripes simulator

In function clz, I noticed that instruction li sometimes would be extended by lui and addi

My code

srli t0, s0, 1
li   t1, 0x55555555
and  s1, t0, t1
sub  s0, s0, s1

Executable code

00145293        srli x5 x8 1
55555337        lui  x6 0x55555
55530313        addi x6 x6 1365
0062f4b3        and  x9 x5 x6
40940433        sub  x8 x8 x9

The li instruction is designed to load a 32-bit integer into a register. However, it's important to note that RV32I architecture cannot represent all the necessary information within a single 32-bit instruction format.

To address this limitation, we can make use of the lui and addi instructions. But how do we split the immediate value into two instructions?

For the lui instruction, which stands for 'load upper immediate' and follows the U-type format, its structure is as follows:

[31:12]	[11:7]	[6:0]
immediate[31:12]	rd	opcode

Here, we have 20 bits to store the upper part of the immediate value.

As for the addi instruction, which adds a sign-extended immediate value to a register and follows the I-type format, its structure is as follows:

[31:20]	[19:15]	[14:12]	[11:7]	[6:0]
immediate[11:0]	rs1	000	rd	opcode

In this format, the lower 12 bits of the immediate value are stored.

Now, let's consider the value 0x55555555 as an example. We can break it down into two parts: 0x55555 and 0x555, which is equivalent to 1365 in decimal, as reflected in the executable code snippet above.

A crucial observation to make is that the addi instruction takes into account signed integers. This means that if the 11th bit of the immediate value is set, it won't perform a standard extension, as shown in the code snippet below:

lui  s0, immediate >> 12
addi s0, s0, (immediate & 0xfff)

Instead, it performs a specialized extension as follows:

lui  s0, ((immediate >> 12) + 1)
addi s0, s0, ((immediate & 0xfff) - 2^12)

In this specialized extension, the overflow is moved to the upper part of the immediate value, and the redundant values are subtracted. This behavior results in code like the following:

My code

srli t0, s0, 4
add  s0, s0, t0
li   t1, 0x0f0f0f0f
and  s0, s0, t1

Executable code

00445293        srli x5 x8 4
00540433        add  x8 x8 x5
0f0f1337        lui  x6 0xf0f1
f0f30313        addi x6 x6 -241
00647433        and  x8 x8 x6

This specialized behavior ensures that the signed integer handling in the addi instruction aligns with the desired outcome.

Optimizations

Loops unroll

Loop unrolling can expose more opportunities for parallelism within the loop body. When multiple iterations of the loop are executed in parallel, the processor can potentially execute instructions from different iterations concurrently, leading to better utilization of execution units in the CPU.

before

bitrev_loop:
    srli s0, s0, 1
    beq s0, x0, bitrev_end
    slli s1, s1, 1
    andi t0, s0, 1
    or s1, s1, t0
    j bitrev_loop

after

bitrev_loop:
    srli s0, s0, 1
    beq s0, x0, bitrev_end
    
    slli s1, s1, 1
    andi t0, s0, 1
    or s1, s1, t0
    
    srli s0, s0, 1
    beq s0, x0, bitrev_end
    
    slli s1, s1, 1
    andi t0, s0, 1
    or s1, s1, t0
    
    srli s0, s0, 1
    beq s0, x0, bitrev_end
    
    slli s1, s1, 1
    andi t0, s0, 1
    or s1, s1, t0
    
    srli s0, s0, 1
    beq s0, x0, bitrev_end
    
    slli s1, s1, 1
    andi t0, s0, 1
    or s1, s1, t0
    
    j bitrev_loop