Given a number $x$, $clz(x)$ returns the number of leading zeros in the binary representation of $x$. Below show some examples where the word size storing $x$ is 8: Table 1. Examples of $x$ and $clz(x)$ of word-size 8 | $x$ | $(x)_2$ | $clz(x)$ | |:--------:| :----: | :--------: | | 2 | 00000010 |6 | | 41 | 00101001 | 2 | | 135 | 10000111 | 0 | Suppose that the word size storing $x$ is set as WORDSIZE. We may count the leading zeros of $x$ (i.e., $clz(x)$) by examining bits of $(x)_2$ one by one from the most significant bit to the least one. We simply shift every bit to the position of the most significant bit to see whether it is 0 or not. For instance, x=001010 has two leading zeros, the firt of which can be counted by !(x&100000) and the second by !((x<<1)&100000). Furthermore, !((x<<2)&100000) stops the counting. ```cpp= // Algorithm 1 unsigned int mBit = 1 << WORDSIZE-1; while (!(x & mBit) && clz<WORDSIZE) { x = (x << 1); clz ++; } ``` Using for-loop, the aformentioned codes could be compacted as follows. ```cpp= for ( clz=0; !(x & mBit) && clz<WORDSIZE; x<<=1, clz++); ``` Observing the case that shifting 001010 right 4 bits obtains a zero, which leads $clz$ to be 2 (=6-4), we receive another algorithm: By examining whether $x$ is NOT a zero, if so, add 1 to $cls$; else shift $x$ one bit to the right and continue (to examine whether $x$ is NOT a zero), we acquire another implementation: ```cpp= // Algorithm 2 for ( clz = WORDSIZE; x; x>>=1, --clz); ``` It is easily seen that the time complexity of both algorithms would be $O(w)$ where $w$ equals to WORDSIZE. Concering computing $clz(n)$ for number $n$, the time complexity of algorithms 1 and 2 becomes $O(log_2n).$ A sophisticated skill to compute $clz(x)$ depends on the function (in GCC compiler [https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html]) `__builtin_popcount(x),` which returns the number of 1's in $(x)_2$. Consider the above example x=001010. If we could find x'=001111, which keeps the leanding zeros and sets the others bits 0's, then $clz(x)$ = 6-__builtin_popcount(x). x' and $clz(x)$ can be computed as below. ```cpp= // Algorithm 3 (WORDSIZE=32) x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >>16); clz = WORDSIZE-__builtin_popcount(x); ``` Furthermore, Algorithm 3 coul be implemented using for-loop as follows. ```cpp= // Algorithm 3-1 (WORDSIZE=32) for (offset=1; offset<WORDSIZE; x=x|(x>>offset), offset<<=1); clz = WORDSIZE-__builtin_popcount(x); ``` It is not hard to realize that the time complexity of Algorithm 3 is $O($log$_2w)$. In fact, the GCC-compiler function `__builtin_clz(x)` returns the number of leading zeros in $(x)_2$. Therefore ```cpp= // Algorithm 4 clz = __builtin_clz(x); ``` On condition that x is within the storage of computer word (e.g. 32-bit or 64-bit), Algorithm 4 can be accomplished in $O(1)$ time. On the test platform [CPU: i7-9700KF @ 3.60GHz/64GB/Windows10], the implementation results in C using Code::Blocks are shown in Table 2. Table 2. CPU time (sec.) for computing $clz(x)$'s for $x$ = 0~$10^k-1$ (as unsigned int) where $k$ = 6-9 | uint | 10^6 | 10^7 | 10^8 | 10^9 | |:----------- | ----- | ----- | ----- | ------:| | A1_builtin | 0.015 | 0.023 | 0.148 | 1.523 | | A2_D&C | 0 | 0.049 | 0.484 | 4.905 | | A2_loop | 0 | 0.082 | 0.801 | 7.995 | | A3_backward | 0.034 | 0.248 | 2.883 | 32.998 | | A4_forward | 0.017 | 0.127 | 0.808 | 4.098 | Note that each value in Table 2 is the totl time for computing all $clz(x)$'s for $x$ = 0~$10^k-1$. Figure 1 illustrates the CPU time informtion in Table 2 where the x-axis denotes the size of the test data sets. ![](https://hackmd.io/_uploads/HJ083ctqh.png) Figure 1. CPU time informtion in Table 2 The following discussion ignores the reults of the case of $10^6$ because the time information is too fast to be ignificant. It is easily seen from Table 2 and Figure 1 that 1. The efficiency of A1 is the bast among the test approaches. 2. The execution time grows as the data size increases for each of the approches. 3. The order of the performances for the cases of $10^7$ and $10^8$ is $\qquad$ A1, A2_D&C, A2_loop, A4_forward and A3_backward, while that for $10^9$ becomes $\qquad$ A1, A4_forward, A2_D&C, A2_loop and A3_backward.* Let us focus on the results of A3 and A4. Both of them are the time complexity of $O(w)$ (where $w$ is WORDSIZE). However, the tendency of the growth of the execution time of A3 seems to be higher than that of A4. In fact, when dealing with 10^9, A4's execution time is bit better than A2; whereas, for cases of 10^7 and 10^8, A4 is worse than A2. This phenomenon would be explored from Figure 2, which reveals the superiority of A1 and the comparability of A4 to A2. ![](https://hackmd.io/_uploads/Hy_j35Fc2.png) Figure 2. Performance informtion in Table 2 Table 3 demontrates such a phenomenon by using the relative ratios of the computational time. We address the results in column 4, which shows the ratios of the running times of $t(10^9)$ over those of $t(10^8)$ for all implementations, which are summrized in Figure 3. Table 3. Ratios of the computational times in Table 2 | uint | $t(10^8)/t(10^7)$ | $t(10^9)/t(10^7)$ | $t(10^9)/t(10^8)$ | |:----------- | -----: | -----: | ------:| | A1_builtin | 66.2173913 | 2879.017013 | 10.29054054 | | A2_D&C | 100.1020408 | 2042.898792 | 10.13429752| | A2_loop | 97.5 | 1189.02439 | 9.981273408| | A3_backward | 133.0564516 | 536.5179501 | 11.44571627| | A4_forward | 32.26771654 | 254.0765082 | 5.071782178| ![](https://hackmd.io/_uploads/ryS3v5Kq2.png) Figure 3. Performance informtion in Table 3 Note that the ratios of $t(10^9)/t(10^8)$ of A1-A3 are about 10. It is reasonable since the data size grows 10 times (from $10^8$ to $10^9$). For A4, the ratio is about 5; that is, the time for A4 dealing with the case of $10^9$ is about 5 times (instead of 10 times) for that of $10^8$. The performance of A4 grows as the data size increases. We speculate that the C++ compiler optimizes the codes to a certain degree when the loop constitutes ++, <<, etc. operations. Yet, such optimizations are not found in A3 (with --, etc.). To further realize the performances of unsigned int (uint) against unsigned long long (ull) for all algorithms, Table 4 lists the CPU times executed. Table 4. CPU time (sec.) for computing $clz(x)$'s for $x$ = 0~$10^k-1$ (as both uint an ull) where $k$ = 6-9 | | 10^6 | | 10^7 | | 10^8 | | 10^9 | | |:-- | -- | -- | -- | -- | -- | -- | -- | -- | | | uint | ull | uint | ull | uint | ull | uint | ull | | A1_builtin | 0.015 | 0.000 | 0.023 | 0.015 | 0.148 | 0.156 | 1.523 | 1.509 | | A2_D&C | 0.000 | 0.003 | 0.049 | 0.052 | 0.484 | 0.564 | 4.905 | 5.613 | | A2_loop | 0.000 | 0.000 | 0.082 | 0.098 | 0.801 | 0.927 | 7.995 | 9.285| | A3_backward | 0.034 | 0.035 | 0.248 | 0.248 | 2.883 | 2.848 | 32.998 | 32.163| | A4_forward | 0.017 | 0.066 | 0.127 | 0.630 | 0.808 | 5.856 | 4.098 | 50.723| We depict the time informtion of using unsigned long long (ull) in Figure 4. ![](https://hackmd.io/_uploads/S1QCxsYc3.png) Figure 4. Execution time informtion for ull in Table 4 As the findings from Table 2 and Figure 1 (for uint), we see from Table 3 and Figure 4 (for ull) that 1. The efficiency of A1 is the bast among the test approaches. 2. The execution time grows as the data size increases for each of the approches. 3. The order of the performances for all cases is $\qquad$A1, A2_D&C, A2_loop, A3_backward and A4_forward. Note that the performance of A4 for ull is not so appealing as that for uint (found from Table 3 and Figures 2-3). We conjuct that the compiler optimizes the loops/codes for 32-bit (uint) operations better than those for 64-bit (ull) operations. More results for testing ull are summarized in Figure 5 where the times of $clz(x)$ of $0$~$10^k-1$ ($x\in$ull) for $k$ = 10, 11 and 12 are reported. ![](https://hackmd.io/_uploads/HyVBfphc2.png) Figure 5. CPU times of $clz(x)$ for $0$~$10^k-1$ ($x\in$ull) where $k$ = 10, 11 and 12 It is seen from Figure 5 that the order of the performances becomes A1, A2_D&C, A2_loop and A3_forward/A4_backward. Table 5 explores the performance raios of $t(10^{10})$, $t(10^{11})$ and $t(10^{12})$ over $t(10^{10})$. Table 5. Performance raios of $t(10^{10})$, $t(10^{11})$ and $t(10^{12})$ over $t(10^{10})$ | ull | $t(10^{10})/t(10^{10})$ | $t(10^{11})/t(10^{10})$ | $t(10^{12})/t(10^{10})$ | | :--- | ----: | ----: | ----: | | A1_builtin | 1 | 9.960061444 | 99.40960395 | | A2_D&C | 1 | 9.905054961 | 100.3043293| | A2_loop | 1 | 9.965533252 | 100.7686992| | A3_backward | 1 | 10.27814302 | 129.0900602| | A4_forward | 1 | 8.694913497 | 75.93657812| We may conject from Table 5 that the compiler optimizes the codes of A4_forward better than that of A3_backward when deling the word size of ull. Still, comparing Tables 2 and 4, we may say that such an opimization on uint is much better than that on ull. // Below give some results during the debug-stage, which cn be ignored. ![](https://hackmd.io/_uploads/SkeFNVF5n.png) ![](https://hackmd.io/_uploads/SkiKoIE92.png) ![](https://hackmd.io/_uploads/r1WH0CE52.png) ![](https://hackmd.io/_uploads/H1PooLVc3.png) CPU: i7-10510U 1.8GHz/16GB/Windows ![](https://hackmd.io/_uploads/HkeRNQ_5h.png) ![](https://hackmd.io/_uploads/HJA8SQd93.png) ![](https://hackmd.io/_uploads/S11YF7Oqh.png)