owned this note
owned this note
Published
Linked with GitHub
# Implement next power of two using clz
contributed by < [terry23304](https://github.com/terry23304/next-power-of-2-with-clz) >
Determines the smallest power of two that is greater than a given integer. In this operation, the input integer is rounded up to the nearest power of two.
## C Code
```c
uint16_t count_leading_zeros(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x7f));
}
uint32_t next_pow2(uint32_t x)
{
if (!x)
return 1;
int n = 32 - count_leading_zeros(x);
return 1 << n;
}
int main()
{
uint32_t test_data[] = {0, 1, 9};
for (int i = 0; i < 3; i++)
printf("%d\n", next_pow2(test_data[i]));
return 0;
}
```
## Assembly Code
Please check the assembly code on [Github](https://github.com/terry23304/next-power-of-2-with-clz/blob/master/next_pow2.s)
# Analysis
Using [Ripes](https://github.com/mortbopet/Ripes), a 5-stage execution instruction pipeline simulator.
## 5-stage pipelined processor
Takes insturction `auipc x5 0x10000` for example:
The `AUIPC`(Add Upper Immediate to PC) instruction in the RISC-V assembly language is used to add an immediate value to the value in the Program Counter. The result is then placed into the specified destination register.
### IF Stage

PC of instruction `auipc x5 0x10000` is `0x00000000`, the instruction memory will fetch the instruction at PC and translate it to machine code.
`auipc` is U-Format instruction:
| imm[31:12] | rd | opcode |
| -------- | -------- | -------- |
| 00010000000000000000 | 00101 | 0010111 |
Therefore, the output of the instruction memory would be `0x100000297`
### ID Stage

Since there are no source registers, the output of decode will be:
- `R1 idx = 0x00`
- `R2 idx = 0x00`
- `Reg 1 = 0x00000000`
- `Reg 2 = 0x00000000`
The immedate is also send to the next stage.
### EX Stage

There are four multiplexer to choose which input will be used.
- `Op 1`: `0x00000000`
- `Op 2`: `0x10000000`
`Op 1` is program counter, `Op 2` is the immdiate value, added together will be the corresponding result.
### MEM Stage

`Wr en` is red means that won't store data to memory.
The output of ALU will also be send to the next stage.
### WB stage

The result will be store to the destination register.

# Assembly Improvement
## Loop unrolling
Loop unrolling is a technique used to optimize loops by reducing loop control overhead and potentially increasing instruction-level parallelism.
Ripes will always loads next instructions. Therefore, when encountering `j Loop` instruction, which is in the execution stage, there will be two instructions fetched after the `j Loop` instruction that will be flushed. Hence it would be beneficial to unroll the loop and avoid the flush operation.
```c
li s0, 32
li s1, 1
add s2, zero, a0
Loop:
beq s0, s1, End
srl t0, s2, s1
or s2, s2, t0
slli s1, s1, 1
j Loop
End:
```
Unrolling 4 times:
```c
srli t0, s2, 1 # x >> 1
or s2, s2, t0 # x |= (x >> 1)
srli t0, s2, 2 # x >> 2
or s2, s2, t0 # x |= (x >> 1)
srli t0, s2, 4 # x >> 4
or s2, s2, t0 # x |= (x >> 1)
srli t0, s2, 6 # x >> 6
or s2, s2, t0 # x |= (x >> 1)
srli t0, s2, 8 # x >> 8
or s2, s2, t0 # x |= (x >> 1)
```
Without unrolling:

Unrolling:

## Using saved registers
The callee will save the saved registers if they will be modifed, so using saved register instead of temporary registers can eliminate unnecssary stack operations.
:::spoiler before
```c
main:
la t0, list
add t1, x0, x0
addi t2, x0, 12
test:
beq t1, t2, exit
add t3, t0, t1
lw a0, 0(t3)
addi sp, sp, -12
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
jal ra, next_pow2
jal ra, print
lw t2, 8(sp)
lw t1, 4(sp)
lw t0, 0(sp)
addi s1, s1, 4
j test
```
:::
after:
```c
main:
la s0, list
add s1, x0, x0
addi s2, x0, 12
test:
beq s1, s2, exit
add s3, s0, s1
lw a0, 0(s3)
jal ra, next_pow2
jal ra, print
addi s1, s1, 4
j test
```

## Loop unrolling again
I missed a part of the code that can be unrolled in the LOOP UNROLLING above, so I'm doing the unrolling again here.
```c
main:
la s0, list
add s1, x0, x0
addi s2, x0, 12
test:
beq s1, s2, exit
add s3, s0, s1
lw a0, 0(s3)
jal ra, next_pow2
jal ra, print
addi s1, s1, 4
j test
```
```c
main:
la s0, list
lw a0, 0(s0)
jal ra, next_pow2
jal ra, print
lw a0, 4(s0)
jal ra, next_pow2
jal ra, print
lw a0, 8(s0)
jal ra, next_pow2
jal ra, print
j exit
```
