2020q3 Homework3 (quiz3)

# 2020q3 Homework3 (quiz3) contributed by < `guaneec` > ###### tags: `sysprog2020` Quick links: * [Problem description](https://hackmd.io/@sysprog/2020-quiz3) ## Q1 - `asr_i` ### Q1.1 - Explanation ```cpp= int asr_i(signed int m, unsigned int n) { const int logical = (((int) -1) >> 1) > 0; unsigned int fixu = -(logical & m < 0); int fix = *(int *) &fixu; return (m >> n) | (fix ^ (fix >> n)); } ``` > “The result of E1 >> E2 is E1 right-shifted E2 bit positions. … If E1 has a signed type and a negative value, the resulting value is implementation-defined.” Technically, implementation defined means that the result of the shift depends entirely on the implementation. In practice, it's probably either the logical shift or arithmetic shift. `logical` indicates whether a negative number shifted right is positive, which only happens if the shift is logical. `fixu` evaluates to a bitmask 0xff..ff when a fix is needed, which is when the input is negative and the implementation of shift is logical. I'm not sure why line 5 is needed. Replacing `fix` with `fixu` in line 6 works on my machine. ### Q1.2 - Generic `asr` ```cpp #define ASR(t, suffix) \ t asr_##suffix(t m, unsigned t n) \ { \ const int logical = ((((int) -1) >> 1)) > 0; \ unsigned t fixu = -(logical && m < 0); \ return (m >> n) | (fixu ^ (fixu >> n)); \ } ASR(long long, ll) ASR(long, l) ASR(int, i) ASR(short, s) #define asr(m, n) \ _Generic((m), long long \ : asr_ll, long \ : asr_l, int \ : asr_i, short \ : asr_s)(m, n) ``` ## Q2 - isPowerOfFour ### Q2.1 - Explanation Powers of four has the form `0b1, 0b100, 0b10000, ...`. A number is a power of four iff it: * is positive * is a power of two * has an even number of trailing zeros ```cpp bool isPowerOfFour(int num) { return num > 0 && (num & (num - 1))==0 && !(__builtin_ctz(num) & 1); } ``` ### Q2.2 - Implementation with less branches [x86-64 assembly, gcc -O2](https://godbolt.org/z/q83Exb) ``` isPowerOfFour(int): xor eax, eax # <retval> test edi, edi # num jle .L1 #, lea edx, [rdi-1] # tmp90, test edx, edi # tmp90, num jne .L1 #, xor eax, eax # _4 rep bsf eax, edi # _4, num not eax # tmp94 and eax, 1 # <retval>, .L1: ret ``` The jumps come from the `&&` operations. The fix is simple -- just change them to `&`s. ```cpp bool isPowerOfFourNb(int num) { return (num > 0) & ((num & (num - 1)) == 0) & !(__builtin_ctz(num) & 1); } ``` ``` isPowerOfFourNb: lea eax, [rdi-1] # tmp94, xor edx, edx # _7 rep bsf edx, edi # _7, num test eax, edi # tmp94, num sete al #, tmp97 test edi, edi # num not edx # tmp101 setg cl #, tmp99 and eax, ecx # tmp100, tmp99 and eax, edx # tmp93, tmp101 ret ``` ### Q2.3 - Applications in leetcode #### Q2.3.1 - [1009. Complement of Base 10 Integer](https://leetcode.com/problems/complement-of-base-10-integer/) ```cpp int bitwiseComplement(int N) { return N ? ((unsigned) -1 >> __builtin_clz(N)) & ~N : 1; } ``` ![](https://i.imgur.com/TaUDzI5.png) #### Q2.3.2 - [41. First Missing Positive](https://leetcode.com/problems/first-missing-positive/) ```cpp int cmp(const void *x, const void *y) { return (*(int *) x > *(int *) y) - (*(int *) x < *(int *) y); } int firstMissingPositive(int *nums, int numsSize) { qsort(nums, numsSize, sizeof(int), cmp); int i = 0; while (i < numsSize && nums[i] <= 0) ++i; int ret = 1; for (; i < numsSize; ++i) { if (nums[i] > ret) return ret; if (nums[i] == ret) ++ret; } return ret; } ``` ![](https://i.imgur.com/JKwy3u1.png) Didn't use bitwise operators. ### Q2.4 #### Q2.4.1 - Implementations of `clz` [Generated assembly](https://godbolt.org/z/aK4rxq) The various implementations in [2017q3 Homework4 (改善 clz)](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz) generates assembly much longer than the builtin version. I haven't run benchmarks but I suppose they're also slower. The builtin version utilizes the `bsr` instruction. #### Q2.4.2 - Exponential Golomb coding ## Q3 - numberOfSteps ### Q3.1 - Explanation ``` int numberOfSteps (int num) { return num ? __builtin_popcount(num) + 31 - __builtin_clz(num) : 0; } ``` Number of steps for an individual bit: 0: `is_leading? 0 : 1` 1: `is_leading? 1 : 2` ### Q3.2 - Alternative implementation I searched for alternative formulas in [OEIS](https://oeis.org/A056792) but to no avail. Guess I'll write a naive implementation. ```cpp int numberOfSteps (int num) { int i = 0; while (num) { num = num & 1 ? num - 1 : num / 2; ++i; } return i; } ``` It's just as fast! ![](https://i.imgur.com/o2lS3Oj.png) ### Q3.3 - Branchless implementation I copied the implementations of [clz](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz#Harley%E2%80%99s-algorithm-5) and [numberOfSetBits](https://stackoverflow.com/a/109025). ```cpp #include <stdint.h> int numberOfSetBits(uint32_t i) { // Java: use int, and use >>> instead of >> // C or C++: use uint32_t i = i - ((i >> 1) & 0x55555555); i = (i & 0x33333333) + ((i >> 2) & 0x33333333); return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24; } uint8_t clz(uint32_t x) { static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF, }; /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return 31 - Table[x >> 26]; } int numberOfSteps(int num) { return (numberOfSetBits(num) + 31 - clz(num)) & (unsigned) ((num == 0) - 1); } ``` [Generated assembly](https://godbolt.org/z/ceTTjd) The assembly code for `numberOfSteps` is long because the compiler inlined `clz` and `numberOfSetBits`. ![](https://i.imgur.com/vs4SrmP.png) ## Q4 - `gcd64` ```cpp #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } while (!(u & 1)) u /= 2; do { while (!(v & 1)) v /= 2; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (v); return u << shift; } ``` ### Q4.1 - Explanation Commented code found on Wikipedia: ```cpp unsigned int gcd(unsigned int u, unsigned int v) { unsigned int shift = 0; /* GCD(0,v) == v; GCD(u,0) == u, GCD(0,0) == 0 */ if (u == 0) return v; if (v == 0) return u; /* Let shift := lg K, where K is the greatest power of 2 dividing both u and v. */ while (((u | v) & 1) == 0) { shift++; u >>= 1; v >>= 1; } while ((u & 1) == 0) u >>= 1; /* From here on, u is always odd. */ do { /* remove all factors of 2 in v -- they are not common */ /* note: v is not zero, so while will terminate */ while ((v & 1) == 0) v >>= 1; /* Now u and v are both odd. Swap if necessary so u <= v, then set v = v - u (which is even). For bignums, the swapping is just pointer movement, and the subtraction can be done in-place. */ if (u > v) { unsigned int t = v; v = u; u = t; // Swap u and v. } v -= u; // Here v >= u. } while (v != 0); /* restore common factors of 2 */ return u << shift; } ``` :::info References: * [Binary GCD algorithm](https://en.wikipedia.org/wiki/Binary_GCD_algorithm) ::: ### Q4.2 - Implementation with `__builtin_ctz` The part where the trailing zeros are popped off can be replaced with a single right shift. ```cpp #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift = __builtin_ctz(u | v); u >>= shift; v >>= shift; u >>= __builtin_ctz(u); do { v >>= __builtin_ctz(v); if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (v); return u << shift; } ``` To test for the performance difference, I decided to try out the [google benchmark library](https://github.com/google/benchmark) for the first time instead of throwing in some timing functions ad hoc. Inputs are drawn randomly from the full range of `uint64_t`; ![](https://i.imgur.com/YPNb9mC.png) The ctz version is slightly faster. ## Q5 - Extract indices of set bits ```cpp= #include <stddef.h> size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = bitset & -bitset; int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } ``` ### Q5.1 - Explanation In each loop, the rightmost bit `t` is computed. The index of the set bit is found with `__builtin_ctzll`. ### Q5.2 - Performance The number at the end of the name a benchmark indicates the bit density (5 means $d \approx 5/16$, bits drawn form a Bernoulli process). In each iteration, a bitmap with 10000 `uint64_t`s is scanned. ![](https://i.imgur.com/J5owYfH.png) Observations: * Time of `naive` depends largely on the entropy of the bitmap * Time of `improved` depends on the bit density * The worst case of `improved` is almost as fast as the best case of `naive` * `naive` is >10 times slower when d=1/16 ### Q5.3 - Improvements ```cpp= size_t improved2(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; int n = __builtin_popcount(bitset); while (n--) { uint64_t t = bitset & -bitset; int r = __builtin_ctzl(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } ``` I removed the data dependency between iterations of the inner loop. Previously, the loop condition `while (bitset != 0)` depends on the last statement in the loop `bitset ^= t`. This means the processor can only flush the pipeline after `bitset ^= t` evaluates to 0. By pre-calculating the number of iterations with `__builtin_popcount`, the compiler and processor can reorder the instructions more freely and achieve better performance with CPU Magic(TM). This is not a universal improvement though. With low bit densities, the cost of popcount outweighs the performance gain. ![](https://i.imgur.com/ATehinG.png) ![](https://i.imgur.com/zDwyXmf.png)