# Assignment1: Optimizing division with powers of two using CLZ (RISC-V)
contributed by [PuYuanHsu](https://github.com/PuYuanHsu/Optimizing-division-with-powers-of-two-using-CLZ)
# Motivation
Division is a fundamental operation in mathematics, but in the computational realm, its computational cost can often be more than anticipated. Especially when handling vast amounts of data, even a simple division operation can become a bottleneck. This article aims to explore optimizing division by powers of 2 using CLZ (Count Leading Zeros) and binary operations.
# Dividing by powers of two
### Compute the division quotient using bit operations
Let's take a general binary number and break it down:

Where each b represents a binary bit (0 or 1) and the position represents its power of 2 (starting from the rightmost as the least significant bit).
This number can be represented as:

Now, when you right shift, each bit moves one position to the right. So the new representation becomes:

Comparing these two representations, you can observe that every term of the second representation is half of the corresponding term in the first representation (because the power of 2 has been reduced by 1). This reduction by half is equivalent to dividing by 2.
### Remainder is also calculated using bit operations
Subtracting 1 from divisor(a power of two) results in a binary number consisting of all ones, which, when applied with the bitwise AND operation ('&') to the original number, effectively masks the higher bits of the number, isolating the lower bits that represent the remainder.
* ex: divisor = 8 (0b01000), dividend = 19 (0b010011)
-> divisor -1 = 7 (0b0111)
* remainder = dividend & (divisor - 1) = 0b011 = 3
From the above examples, it can be seen that these operations can retrieve the number lost in the bitwise right-shift operation, which is the remainder in division.
# implement (first version)
### [C code](https://github.com/PuYuanHsu/Optimizing-division-with-powers-of-two-using-CLZ/blob/main/devision%20using%20CLZ%20-%20C.cpp)
``` c=
#include <stdint.h>
#include<iostream>
using namespace std;
uint16_t count_leading_zeros(uint32_t x) {
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
int a= (32 - (x & 0x3f));
return (32 - (x & 0x3f));
}
int main(void)
{
int divisor = 8, dividend = 19;
int move_right = 31 - count_leading_zeros(divisor);
int result = dividend >> move_right;
printf("%d divided by %d = %d\n" , dividend, divisor, result);
int remainder = dividend & (divisor - 1);
printf("remainder = %d",remainder);
return 0;
}
```
### Assembly Code
``` Assembly=
.data
str1: .string "result = "
str2: .string "\nremainder = "
.text
divisor:
.word 8
dividend:
.word 19
.globl main
main:
# Load data
lw a0, divisor # Load divisor into a0
lw t0, dividend # Load dividend into t3
# Call count_leading_zeros function
call count_leading_zeros
# Calculate move_right
li t2, 31 # Adjusted to 31 for 32-bit
sub a0, t2, a0 # move_right = 31 - clz(divisor)
# Shift the number
srl a2, t0, a0 # a2 = result
#calculate remainder
lw a0, divisor
li t1 1
sub t1, a0, t1
and a3, t0, t1 # a3 = remainder
#print result
li a7, 4
la a0, str1
ecall
li a7, 1
mv a0, a2
ecall
li a7, 4
la a0, str2
ecall
li a7, 1
mv a0, a3
ecall
li a7, 10
ecall
count_leading_zeros:
# Load immediate values into registers
li t1, 0x55555555
li t2, 0x33333333
li t3, 0x0f0f0f0f
li t4, 0x3f
# x |= (x >> 1)
srli t5, a0, 1
or a0, a0, t5
# x |= (x >> 2)
srli t5, a0, 2
or a0, a0, t5
# x |= (x >> 4)
srli t5, a0, 4
or a0, a0, t5
# x |= (x >> 8)
srli t5, a0, 8
or a0, a0, t5
# x |= (x >> 16)
srli t5, a0, 16
or a0, a0, t5
# x -= ((x >> 1) & 0x55555555)
srli t5, a0, 1
and t5, t5, t1
sub a0, a0, t5
# x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
srli t5, a0, 2
and t5, t5, t2
and t6, a0, t2
add a0, t5, t6
# x = ((x >> 4) + x) & 0x0f0f0f0f
srli t5, a0, 4
add a0, t5, a0
and a0, a0, t3
# x += (x >> 8)
srli t5, a0, 8
add a0, a0, t5
# x += (x >> 16)
srli t5, a0, 16
add a0, a0, t5
# 32 - (x & 0x3f)
and a0, a0, t4
li t5, 32
sub a0, t5, a0
# Return
ret
```
# Assembly code optimizer
### Register usage: Reduce the usage of registers.
* Original version:
```Assembly=
# Load immediate values into registers
li t1, 0x55555555
li t2, 0x33333333
li t3, 0x0f0f0f0f
# x -= ((x >> 1) & 0x55555555)
srli t5, a0, 1
and t5, t5, t1
sub a0, a0, t5
# x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
srli t5, a0, 2
and t5, t5, t2
and t4, a0, t2
add a0, t5, t4
# x = ((x >> 4) + x) & 0x0f0f0f0f
srli t5, a0, 4
add a0, t5, a0
and a0, a0, t3
```
* improved version
```Assmebly=
# x -= ((x >> 1) & 0x55555555)
li t1, 0x55555555
srli t2, a0, 1
and t2, t2, t1
sub a0, a0, t2
# x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
li t1, 0x33333333 #Reuse t1
srli t2, a0, 2 # Reuse t2 here, since its previous value is not needed anymore
and t2, t2, t1
and t3, a0, t1
add a0, t2, t3
# x = ((x >> 4) + x) & 0x0f0f0f0f
li t1, 0x0f0f0f0f #Reuse t1
srli t2, a0, 4 #Reuse t2 again
add a0, t2, a0
```
Originally, registers a0, t1, t2, t3, t4, t5 etc. were used. After optimization, only a0, t1, t2 and t3 are needed.
---
### Loop: Express repetitive operations using loops to reduce the code size.
* Original version:
```Assembly=
# x |= (x >> 1)
srli t1, a0, 1
or a0, a0, t1
# x |= (x >> 2)
srli t1, a0, 2
or a0, a0, t1
# x |= (x >> 4)
srli t1, a0, 4
or a0, a0, t1
# x |= (x >> 8)
srli t1, a0, 8
or a0, a0, t1
# x |= (x >> 16)
srli t1, a0, 16
or a0, a0, t1
```
* improved version
```Assembly=
# Load immediate values
li t1, 17
li t2, 1 # Start shift value
shift_or_loop:
srl t3, a0, t2 # Use t3 for intermediate shifts
or a0, a0, t3
slli t2, t2, 1 # Double the shift value
# If shift value is less than 17, continue loop
blt t2, t1, shift_or_loop
```
Using loops significantly reduced the code size.
# Result
### Improved complete [assembly code](https://github.com/PuYuanHsu/Optimizing-division-with-powers-of-two-using-CLZ/blob/main/devision%20using%20CLZ%20-%20Assembly%20.s)
```Assembly=
.data
str1: .string "result = "
str2: .string "\nremainder = "
.text
divisor:
.word 8
dividend:
.word 19
.globl main
main:
# Load data
lw a0, divisor # Load divisor into a0
lw t0, dividend # Load dividend into t3
# Call count_leading_zeros function
call count_leading_zeros
# Calculate move_right
li t2, 31 # Adjusted to 31 for 32-bit
sub a0, t2, a0 # move_right = 31 - clz(divisor)
# Shift the number
srl a2, t0, a0 # a2 = result
#calculate remainder
lw a0, divisor
li t1 1
sub t1, a0, t1
and a3, t0, t1 # a3 = remainder
#print result
li a7, 4
la a0, str1
ecall
li a7, 1
mv a0, a2
ecall
li a7, 4
la a0, str2
ecall
li a7, 1
mv a0, a3
ecall
li a7, 10
ecall
count_leading_zeros:
# Load immediate values
li t1, 17
li t2, 1 # Start shift value
shift_or_loop:
srl t3, a0, t2 # Use t3 for intermediate shifts
or a0, a0, t3
slli t2, t2, 1 # Double the shift value
blt t2, t1, shift_or_loop # If shift value is less than 17, continue loop
# x -= ((x >> 1) & 0x55555555)
li t1, 0x55555555
srli t2, a0, 1
and t2, t2, t1
sub a0, a0, t2
# x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
li t1, 0x33333333
srli t2, a0, 2
and t2, t2, t1
and t3, a0, t1
add a0, t2, t3
# x = ((x >> 4) + x) & 0x0f0f0f0f
li t1, 0x0f0f0f0f
srli t2, a0, 4
add a0, t2, a0
and a0, a0, t1
# x += (x >> 8)
srli t2, a0, 8
add a0, a0, t2
# x += (x >> 16)
srli t2, a0, 16
add a0, a0, t2
# 32 - (x & 0x3f)
andi a0, a0, 0x3f
li t1, 32
sub a0, t1, a0
# Return
ret
```
### C code output
* test data 1: divisor = 4, number = 16

* test data 2: divisor = 8, number = 19

* test data 3: divisor = 16, number = 1546

### Assembly output
* test data 1: divisor = 4, number = 16

* test data 2: divisor = 8, number = 19

* test data 3: divisor = 16, number = 1546

# Analysis
### Execution info

### 5-stage RISC-V processor

The 5-stage RISC-V processor simulated in RIPES follows the classic **RISC pipeline** architecture, and these stages include:
1. Instruction Fetch (IF):
* In this stage, the processor retrieves the next instruction to be executed from memory. This involves using the program counter (PC) to locate the instruction, and then fetching the instruction from memory.
* The program counter is updated to point to the address of the next instruction.
2. Instruction Decode (ID) and Register Read:
* In this stage, the processor decodes (or interprets) the fetched instruction, determining the exact operation to be performed (e.g., addition, jump, etc.).
* It also reads the necessary operands, which are typically stored in a register file.
3. Execute (EX):
* This is the stage where arithmetic and logical operations are performed, such as addition, subtraction, multiplication, or logical operations (AND, OR, NOT, etc.).
* For some instructions, it might also change the value of the program counter (e.g., for jump or branch instructions).
4. Memory Access (MEM):
* In this stage, the processor may read data from memory (for load instructions) or write data to memory (for store instructions).
* For instructions that do not involve memory access, this stage would be idle.
5. Write Back (WB):
* In the final stage, the result (from the previous stage) is written back to the register file so it can be used by subsequent instructions.
In this five-stage pipeline, each stage operates within its own clock cycle, meaning multiple instructions can be processed at different stages simultaneously. This design allows for efficient instruction execution, thereby improving the overall performance of the processor.
### Experiment
**Take or x10 X10 X28 as an example**
**1. IF**

* The Program Counter (PC) is set to the address of the current instruction, which is 0x0000008c.
* The processor fetches the instruction "or x10, x10, x28" from address 0x0000008c, and its encoding is 0X01c56533.
**2. ID**

* The processor decodes the fetched instruction, determining it's an "OR" operation involving registers x10 and x28, and that the result should be stored in x10.
* x10: 0x0a
* x28: 0x1c
**3. EX**

* At this stage, the processor performs the "OR" operation: 0x00000001 OR 0x00000006, resulting in 0x00000007.
* in binary, 0b0110 OR 0b0001 equals 0b0111
**4. MEM**

* Since this particular "OR" operation doesn't involve any memory access, this stage does nothing and remains idle during the execution of the current instruction.
**5. WB**

* Finally, the result of the operation, 0x00000007, is written back to the register file, stored in register x10 (0x0a), so it can be used by subsequent instructions.
**6. register table**

after the operation, the register x10 = 0x00000007
---
### Disassembly code is [here](https://github.com/PuYuanHsu/Optimizing-division-with-powers-of-two-using-CLZ/blob/main/Disassembly%20code%20(devision%20using%20CLZ).s)
``` Assemvly=
00000000 <divisor>:
0: 0008 c.addi4spn x10 0
2: 0000 c.addi4spn x8 0
00000004 <dividend>:
4: 00000013 addi x0 x0 0
00000008 <main>:
8: 00000517 auipc x10 0x0 <divisor>
c: ff852503 lw x10 -8 x10
10: 00000297 auipc x5 0x0 <divisor>
14: ff42a283 lw x5 -12 x5
18: 00000097 auipc x1 0x0 <divisor>
1c: 068080e7 jalr x1 x1 104
20: 01f00393 addi x7 x0 31
24: 40a38533 sub x10 x7 x10
28: 00a2d633 srl x12 x5 x10
2c: 00000517 auipc x10 0x0 <divisor>
30: fd452503 lw x10 -44 x10
34: 00100313 addi x6 x0 1
38: 40650333 sub x6 x10 x6
3c: 0062f6b3 and x13 x5 x6
40: 00400893 addi x17 x0 4
44: 10000517 auipc x10 0x10000
48: fbc50513 addi x10 x10 -68
4c: 00000073 ecall
50: 00100893 addi x17 x0 1
54: 00060513 addi x10 x12 0
58: 00000073 ecall
5c: 00400893 addi x17 x0 4
60: 10000517 auipc x10 0x10000
64: faa50513 addi x10 x10 -86
68: 00000073 ecall
6c: 00100893 addi x17 x0 1
70: 00068513 addi x10 x13 0
74: 00000073 ecall
78: 00a00893 addi x17 x0 10
7c: 00000073 ecall
00000080 <count_leading_zeros>:
80: 01100313 addi x6 x0 17
84: 00100393 addi x7 x0 1
00000088 <shift_or_loop>:
88: 00755e33 srl x28 x10 x7
8c: 01c56533 or x10 x10 x28
90: 00139393 slli x7 x7 1
94: fe63cae3 blt x7 x6 -12 <shift_or_loop>
98: 55555337 lui x6 0x55555
9c: 55530313 addi x6 x6 1365
a0: 00155393 srli x7 x10 1
a4: 0063f3b3 and x7 x7 x6
a8: 40750533 sub x10 x10 x7
ac: 33333337 lui x6 0x33333
b0: 33330313 addi x6 x6 819
b4: 00255393 srli x7 x10 2
b8: 0063f3b3 and x7 x7 x6
bc: 00657e33 and x28 x10 x6
c0: 01c38533 add x10 x7 x28
c4: 0f0f1337 lui x6 0xf0f1
c8: f0f30313 addi x6 x6 -241
cc: 00455393 srli x7 x10 4
d0: 00a38533 add x10 x7 x10
d4: 00657533 and x10 x10 x6
d8: 00855393 srli x7 x10 8
dc: 00750533 add x10 x10 x7
e0: 01055393 srli x7 x10 16
e4: 00750533 add x10 x10 x7
e8: 03f57513 andi x10 x10 63
ec: 02000313 addi x6 x0 32
f0: 40a30533 sub x10 x6 x10
f4: 00008067 jalr x0 x1 0
```
# Conclusion
* Utilizing bit manipulation techniques, we can compute the quotient and remainder of a number divided by a power of two with high efficiency. This method relies on the properties of binary representation and the characteristics of bitwise operations, enabling the rapid acquisition of results without incurring the substantial computational costs associated with division operations. However, the correct functioning of this technique is confined exclusively to cases where the divisor is a power of two; it will not yield correct results when a divisor that is not a power of two is employed. Consequently, special caution is required when applying these techniques.