# Assignment1: Optimizing division with powers of two using CLZ (RISC-V) contributed by [PuYuanHsu](https://github.com/PuYuanHsu/Optimizing-division-with-powers-of-two-using-CLZ) # Motivation Division is a fundamental operation in mathematics, but in the computational realm, its computational cost can often be more than anticipated. Especially when handling vast amounts of data, even a simple division operation can become a bottleneck. This article aims to explore optimizing division by powers of 2 using CLZ (Count Leading Zeros) and binary operations. # Dividing by powers of two ### Compute the division quotient using bit operations Let's take a general binary number and break it down: ![](https://hackmd.io/_uploads/S1hFYalbT.png) Where each b represents a binary bit (0 or 1) and the position represents its power of 2 (starting from the rightmost as the least significant bit). This number can be represented as: ![](https://hackmd.io/_uploads/S1tpY6l-T.png) Now, when you right shift, each bit moves one position to the right. So the new representation becomes: ![](https://hackmd.io/_uploads/SJMJqalb6.png) Comparing these two representations, you can observe that every term of the second representation is half of the corresponding term in the first representation (because the power of 2 has been reduced by 1). This reduction by half is equivalent to dividing by 2. ### Remainder is also calculated using bit operations Subtracting 1 from divisor(a power of two) results in a binary number consisting of all ones, which, when applied with the bitwise AND operation ('&') to the original number, effectively masks the higher bits of the number, isolating the lower bits that represent the remainder. * ex: divisor = 8 (0b01000), dividend = 19 (0b010011) -> divisor -1 = 7 (0b0111) * remainder = dividend & (divisor - 1) = 0b011 = 3 From the above examples, it can be seen that these operations can retrieve the number lost in the bitwise right-shift operation, which is the remainder in division. # implement (first version) ### [C code](https://github.com/PuYuanHsu/Optimizing-division-with-powers-of-two-using-CLZ/blob/main/devision%20using%20CLZ%20-%20C.cpp) ``` c= #include <stdint.h> #include<iostream> using namespace std; uint16_t count_leading_zeros(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); int a= (32 - (x & 0x3f)); return (32 - (x & 0x3f)); } int main(void) { int divisor = 8, dividend = 19; int move_right = 31 - count_leading_zeros(divisor); int result = dividend >> move_right; printf("%d divided by %d = %d\n" , dividend, divisor, result); int remainder = dividend & (divisor - 1); printf("remainder = %d",remainder); return 0; } ``` ### Assembly Code ``` Assembly= .data str1: .string "result = " str2: .string "\nremainder = " .text divisor: .word 8 dividend: .word 19 .globl main main: # Load data lw a0, divisor # Load divisor into a0 lw t0, dividend # Load dividend into t3 # Call count_leading_zeros function call count_leading_zeros # Calculate move_right li t2, 31 # Adjusted to 31 for 32-bit sub a0, t2, a0 # move_right = 31 - clz(divisor) # Shift the number srl a2, t0, a0 # a2 = result #calculate remainder lw a0, divisor li t1 1 sub t1, a0, t1 and a3, t0, t1 # a3 = remainder #print result li a7, 4 la a0, str1 ecall li a7, 1 mv a0, a2 ecall li a7, 4 la a0, str2 ecall li a7, 1 mv a0, a3 ecall li a7, 10 ecall count_leading_zeros: # Load immediate values into registers li t1, 0x55555555 li t2, 0x33333333 li t3, 0x0f0f0f0f li t4, 0x3f # x |= (x >> 1) srli t5, a0, 1 or a0, a0, t5 # x |= (x >> 2) srli t5, a0, 2 or a0, a0, t5 # x |= (x >> 4) srli t5, a0, 4 or a0, a0, t5 # x |= (x >> 8) srli t5, a0, 8 or a0, a0, t5 # x |= (x >> 16) srli t5, a0, 16 or a0, a0, t5 # x -= ((x >> 1) & 0x55555555) srli t5, a0, 1 and t5, t5, t1 sub a0, a0, t5 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) srli t5, a0, 2 and t5, t5, t2 and t6, a0, t2 add a0, t5, t6 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t5, a0, 4 add a0, t5, a0 and a0, a0, t3 # x += (x >> 8) srli t5, a0, 8 add a0, a0, t5 # x += (x >> 16) srli t5, a0, 16 add a0, a0, t5 # 32 - (x & 0x3f) and a0, a0, t4 li t5, 32 sub a0, t5, a0 # Return ret ``` # Assembly code optimizer ### Register usage: Reduce the usage of registers. * Original version: ```Assembly= # Load immediate values into registers li t1, 0x55555555 li t2, 0x33333333 li t3, 0x0f0f0f0f # x -= ((x >> 1) & 0x55555555) srli t5, a0, 1 and t5, t5, t1 sub a0, a0, t5 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) srli t5, a0, 2 and t5, t5, t2 and t4, a0, t2 add a0, t5, t4 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t5, a0, 4 add a0, t5, a0 and a0, a0, t3 ``` * improved version ```Assmebly= # x -= ((x >> 1) & 0x55555555) li t1, 0x55555555 srli t2, a0, 1 and t2, t2, t1 sub a0, a0, t2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) li t1, 0x33333333 #Reuse t1 srli t2, a0, 2 # Reuse t2 here, since its previous value is not needed anymore and t2, t2, t1 and t3, a0, t1 add a0, t2, t3 # x = ((x >> 4) + x) & 0x0f0f0f0f li t1, 0x0f0f0f0f #Reuse t1 srli t2, a0, 4 #Reuse t2 again add a0, t2, a0 ``` Originally, registers a0, t1, t2, t3, t4, t5 etc. were used. After optimization, only a0, t1, t2 and t3 are needed. --- ### Loop: Express repetitive operations using loops to reduce the code size. * Original version: ```Assembly= # x |= (x >> 1) srli t1, a0, 1 or a0, a0, t1 # x |= (x >> 2) srli t1, a0, 2 or a0, a0, t1 # x |= (x >> 4) srli t1, a0, 4 or a0, a0, t1 # x |= (x >> 8) srli t1, a0, 8 or a0, a0, t1 # x |= (x >> 16) srli t1, a0, 16 or a0, a0, t1 ``` * improved version ```Assembly= # Load immediate values li t1, 17 li t2, 1 # Start shift value shift_or_loop: srl t3, a0, t2 # Use t3 for intermediate shifts or a0, a0, t3 slli t2, t2, 1 # Double the shift value # If shift value is less than 17, continue loop blt t2, t1, shift_or_loop ``` Using loops significantly reduced the code size. # Result ### Improved complete [assembly code](https://github.com/PuYuanHsu/Optimizing-division-with-powers-of-two-using-CLZ/blob/main/devision%20using%20CLZ%20-%20Assembly%20.s) ```Assembly= .data str1: .string "result = " str2: .string "\nremainder = " .text divisor: .word 8 dividend: .word 19 .globl main main: # Load data lw a0, divisor # Load divisor into a0 lw t0, dividend # Load dividend into t3 # Call count_leading_zeros function call count_leading_zeros # Calculate move_right li t2, 31 # Adjusted to 31 for 32-bit sub a0, t2, a0 # move_right = 31 - clz(divisor) # Shift the number srl a2, t0, a0 # a2 = result #calculate remainder lw a0, divisor li t1 1 sub t1, a0, t1 and a3, t0, t1 # a3 = remainder #print result li a7, 4 la a0, str1 ecall li a7, 1 mv a0, a2 ecall li a7, 4 la a0, str2 ecall li a7, 1 mv a0, a3 ecall li a7, 10 ecall count_leading_zeros: # Load immediate values li t1, 17 li t2, 1 # Start shift value shift_or_loop: srl t3, a0, t2 # Use t3 for intermediate shifts or a0, a0, t3 slli t2, t2, 1 # Double the shift value blt t2, t1, shift_or_loop # If shift value is less than 17, continue loop # x -= ((x >> 1) & 0x55555555) li t1, 0x55555555 srli t2, a0, 1 and t2, t2, t1 sub a0, a0, t2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) li t1, 0x33333333 srli t2, a0, 2 and t2, t2, t1 and t3, a0, t1 add a0, t2, t3 # x = ((x >> 4) + x) & 0x0f0f0f0f li t1, 0x0f0f0f0f srli t2, a0, 4 add a0, t2, a0 and a0, a0, t1 # x += (x >> 8) srli t2, a0, 8 add a0, a0, t2 # x += (x >> 16) srli t2, a0, 16 add a0, a0, t2 # 32 - (x & 0x3f) andi a0, a0, 0x3f li t1, 32 sub a0, t1, a0 # Return ret ``` ### C code output * test data 1: divisor = 4, number = 16 ![](https://hackmd.io/_uploads/Hy9UYzzbT.png) * test data 2: divisor = 8, number = 19 ![](https://hackmd.io/_uploads/rJHIM7fbp.png) * test data 3: divisor = 16, number = 1546 ![](https://hackmd.io/_uploads/rkNftGfZp.png) ### Assembly output * test data 1: divisor = 4, number = 16 ![](https://hackmd.io/_uploads/rkI_gSfZT.png) * test data 2: divisor = 8, number = 19 ![](https://hackmd.io/_uploads/rJhWxBfZa.png) * test data 3: divisor = 16, number = 1546 ![](https://hackmd.io/_uploads/H1tHxBf-T.png) # Analysis ### Execution info ![](https://hackmd.io/_uploads/ByvmmBM-a.png) ### 5-stage RISC-V processor ![](https://hackmd.io/_uploads/SkkpC3GWp.png) The 5-stage RISC-V processor simulated in RIPES follows the classic **RISC pipeline** architecture, and these stages include: 1. Instruction Fetch (IF): * In this stage, the processor retrieves the next instruction to be executed from memory. This involves using the program counter (PC) to locate the instruction, and then fetching the instruction from memory. * The program counter is updated to point to the address of the next instruction. 2. Instruction Decode (ID) and Register Read: * In this stage, the processor decodes (or interprets) the fetched instruction, determining the exact operation to be performed (e.g., addition, jump, etc.). * It also reads the necessary operands, which are typically stored in a register file. 3. Execute (EX): * This is the stage where arithmetic and logical operations are performed, such as addition, subtraction, multiplication, or logical operations (AND, OR, NOT, etc.). * For some instructions, it might also change the value of the program counter (e.g., for jump or branch instructions). 4. Memory Access (MEM): * In this stage, the processor may read data from memory (for load instructions) or write data to memory (for store instructions). * For instructions that do not involve memory access, this stage would be idle. 5. Write Back (WB): * In the final stage, the result (from the previous stage) is written back to the register file so it can be used by subsequent instructions. In this five-stage pipeline, each stage operates within its own clock cycle, meaning multiple instructions can be processed at different stages simultaneously. This design allows for efficient instruction execution, thereby improving the overall performance of the processor. ### Experiment **Take or x10 X10 X28 as an example** **1. IF** ![](https://hackmd.io/_uploads/r1WN90GZT.png) * The Program Counter (PC) is set to the address of the current instruction, which is 0x0000008c. * The processor fetches the instruction "or x10, x10, x28" from address 0x0000008c, and its encoding is 0X01c56533. **2. ID** ![](https://hackmd.io/_uploads/r1cK9Rz-p.png) * The processor decodes the fetched instruction, determining it's an "OR" operation involving registers x10 and x28, and that the result should be stored in x10. * x10: 0x0a * x28: 0x1c **3. EX** ![](https://hackmd.io/_uploads/ry6ADaMbp.png) * At this stage, the processor performs the "OR" operation: 0x00000001 OR 0x00000006, resulting in 0x00000007. * in binary, 0b0110 OR 0b0001 equals 0b0111 **4. MEM** ![](https://hackmd.io/_uploads/rJ4UO6M-a.png) * Since this particular "OR" operation doesn't involve any memory access, this stage does nothing and remains idle during the execution of the current instruction. **5. WB** ![](https://hackmd.io/_uploads/r1D_Y6MZp.png) * Finally, the result of the operation, 0x00000007, is written back to the register file, stored in register x10 (0x0a), so it can be used by subsequent instructions. **6. register table** ![](https://hackmd.io/_uploads/rJHuW1X-a.png) after the operation, the register x10 = 0x00000007 --- ### Disassembly code is [here](https://github.com/PuYuanHsu/Optimizing-division-with-powers-of-two-using-CLZ/blob/main/Disassembly%20code%20(devision%20using%20CLZ).s) ``` Assemvly= 00000000 <divisor>: 0: 0008 c.addi4spn x10 0 2: 0000 c.addi4spn x8 0 00000004 <dividend>: 4: 00000013 addi x0 x0 0 00000008 <main>: 8: 00000517 auipc x10 0x0 <divisor> c: ff852503 lw x10 -8 x10 10: 00000297 auipc x5 0x0 <divisor> 14: ff42a283 lw x5 -12 x5 18: 00000097 auipc x1 0x0 <divisor> 1c: 068080e7 jalr x1 x1 104 20: 01f00393 addi x7 x0 31 24: 40a38533 sub x10 x7 x10 28: 00a2d633 srl x12 x5 x10 2c: 00000517 auipc x10 0x0 <divisor> 30: fd452503 lw x10 -44 x10 34: 00100313 addi x6 x0 1 38: 40650333 sub x6 x10 x6 3c: 0062f6b3 and x13 x5 x6 40: 00400893 addi x17 x0 4 44: 10000517 auipc x10 0x10000 48: fbc50513 addi x10 x10 -68 4c: 00000073 ecall 50: 00100893 addi x17 x0 1 54: 00060513 addi x10 x12 0 58: 00000073 ecall 5c: 00400893 addi x17 x0 4 60: 10000517 auipc x10 0x10000 64: faa50513 addi x10 x10 -86 68: 00000073 ecall 6c: 00100893 addi x17 x0 1 70: 00068513 addi x10 x13 0 74: 00000073 ecall 78: 00a00893 addi x17 x0 10 7c: 00000073 ecall 00000080 <count_leading_zeros>: 80: 01100313 addi x6 x0 17 84: 00100393 addi x7 x0 1 00000088 <shift_or_loop>: 88: 00755e33 srl x28 x10 x7 8c: 01c56533 or x10 x10 x28 90: 00139393 slli x7 x7 1 94: fe63cae3 blt x7 x6 -12 <shift_or_loop> 98: 55555337 lui x6 0x55555 9c: 55530313 addi x6 x6 1365 a0: 00155393 srli x7 x10 1 a4: 0063f3b3 and x7 x7 x6 a8: 40750533 sub x10 x10 x7 ac: 33333337 lui x6 0x33333 b0: 33330313 addi x6 x6 819 b4: 00255393 srli x7 x10 2 b8: 0063f3b3 and x7 x7 x6 bc: 00657e33 and x28 x10 x6 c0: 01c38533 add x10 x7 x28 c4: 0f0f1337 lui x6 0xf0f1 c8: f0f30313 addi x6 x6 -241 cc: 00455393 srli x7 x10 4 d0: 00a38533 add x10 x7 x10 d4: 00657533 and x10 x10 x6 d8: 00855393 srli x7 x10 8 dc: 00750533 add x10 x10 x7 e0: 01055393 srli x7 x10 16 e4: 00750533 add x10 x10 x7 e8: 03f57513 andi x10 x10 63 ec: 02000313 addi x6 x0 32 f0: 40a30533 sub x10 x6 x10 f4: 00008067 jalr x0 x1 0 ``` # Conclusion * Utilizing bit manipulation techniques, we can compute the quotient and remainder of a number divided by a power of two with high efficiency. This method relies on the properties of binary representation and the characteristics of bitwise operations, enabling the rapid acquisition of results without incurring the substantial computational costs associated with division operations. However, the correct functioning of this technique is confined exclusively to cases where the divisor is a power of two; it will not yield correct results when a divisor that is not a power of two is employed. Consequently, special caution is required when applying these techniques.