# 2020q3 Homework3 (quiz3)
contributed by < `guaneec` >
###### tags: `sysprog2020`
Quick links:
* [Problem description](https://hackmd.io/@sysprog/2020-quiz3)
## Q1 - `asr_i`
### Q1.1 - Explanation
```cpp=
int asr_i(signed int m, unsigned int n)
{
const int logical = (((int) -1) >> 1) > 0;
unsigned int fixu = -(logical & m < 0);
int fix = *(int *) &fixu;
return (m >> n) | (fix ^ (fix >> n));
}
```
> “The result of E1 >> E2 is E1 right-shifted E2 bit positions. … If E1 has a signed type and a negative value, the resulting value is implementation-defined.”
Technically, implementation defined means that the result of the shift depends entirely on the implementation. In practice, it's probably either the logical shift or arithmetic shift.
`logical` indicates whether a negative number shifted right is positive, which only happens if the shift is logical.
`fixu` evaluates to a bitmask 0xff..ff when a fix is needed, which is when the input is negative and the implementation of shift is logical.
I'm not sure why line 5 is needed. Replacing `fix` with `fixu` in line 6 works on my machine.
### Q1.2 - Generic `asr`
```cpp
#define ASR(t, suffix) \
t asr_##suffix(t m, unsigned t n) \
{ \
const int logical = ((((int) -1) >> 1)) > 0; \
unsigned t fixu = -(logical && m < 0); \
return (m >> n) | (fixu ^ (fixu >> n)); \
}
ASR(long long, ll)
ASR(long, l)
ASR(int, i)
ASR(short, s)
#define asr(m, n) \
_Generic((m), long long \
: asr_ll, long \
: asr_l, int \
: asr_i, short \
: asr_s)(m, n)
```
## Q2 - isPowerOfFour
### Q2.1 - Explanation
Powers of four has the form `0b1, 0b100, 0b10000, ...`. A number is a power of four iff it:
* is positive
* is a power of two
* has an even number of trailing zeros
```cpp
bool isPowerOfFour(int num)
{
return num > 0 && (num & (num - 1))==0 &&
!(__builtin_ctz(num) & 1);
}
```
### Q2.2 - Implementation with less branches
[x86-64 assembly, gcc -O2](https://godbolt.org/z/q83Exb)
```
isPowerOfFour(int):
xor eax, eax # <retval>
test edi, edi # num
jle .L1 #,
lea edx, [rdi-1] # tmp90,
test edx, edi # tmp90, num
jne .L1 #,
xor eax, eax # _4
rep bsf eax, edi # _4, num
not eax # tmp94
and eax, 1 # <retval>,
.L1:
ret
```
The jumps come from the `&&` operations.
The fix is simple -- just change them to `&`s.
```cpp
bool isPowerOfFourNb(int num)
{
return (num > 0) & ((num & (num - 1)) == 0) & !(__builtin_ctz(num) & 1);
}
```
```
isPowerOfFourNb:
lea eax, [rdi-1] # tmp94,
xor edx, edx # _7
rep bsf edx, edi # _7, num
test eax, edi # tmp94, num
sete al #, tmp97
test edi, edi # num
not edx # tmp101
setg cl #, tmp99
and eax, ecx # tmp100, tmp99
and eax, edx # tmp93, tmp101
ret
```
### Q2.3 - Applications in leetcode
#### Q2.3.1 - [1009. Complement of Base 10 Integer](https://leetcode.com/problems/complement-of-base-10-integer/)
```cpp
int bitwiseComplement(int N) {
return N ? ((unsigned) -1 >> __builtin_clz(N)) & ~N : 1;
}
```
![](https://i.imgur.com/TaUDzI5.png)
#### Q2.3.2 - [41. First Missing Positive](https://leetcode.com/problems/first-missing-positive/)
```cpp
int cmp(const void *x, const void *y)
{
return (*(int *) x > *(int *) y) - (*(int *) x < *(int *) y);
}
int firstMissingPositive(int *nums, int numsSize)
{
qsort(nums, numsSize, sizeof(int), cmp);
int i = 0;
while (i < numsSize && nums[i] <= 0)
++i;
int ret = 1;
for (; i < numsSize; ++i) {
if (nums[i] > ret) return ret;
if (nums[i] == ret) ++ret;
}
return ret;
}
```
![](https://i.imgur.com/JKwy3u1.png)
Didn't use bitwise operators.
### Q2.4
#### Q2.4.1 - Implementations of `clz`
[Generated assembly](https://godbolt.org/z/aK4rxq)
The various implementations in [2017q3 Homework4 (改善 clz)](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz) generates assembly much longer than the builtin version. I haven't run benchmarks but I suppose they're also slower.
The builtin version utilizes the `bsr` instruction.
#### Q2.4.2 - Exponential Golomb coding
## Q3 - numberOfSteps
### Q3.1 - Explanation
```
int numberOfSteps (int num)
{
return num ? __builtin_popcount(num) + 31 - __builtin_clz(num) : 0;
}
```
Number of steps for an individual bit:
0: `is_leading? 0 : 1`
1: `is_leading? 1 : 2`
### Q3.2 - Alternative implementation
I searched for alternative formulas in [OEIS](https://oeis.org/A056792) but to no avail.
Guess I'll write a naive implementation.
```cpp
int numberOfSteps (int num)
{
int i = 0;
while (num) {
num = num & 1 ? num - 1 : num / 2;
++i;
}
return i;
}
```
It's just as fast!
![](https://i.imgur.com/o2lS3Oj.png)
### Q3.3 - Branchless implementation
I copied the implementations of [clz](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz#Harley%E2%80%99s-algorithm-5) and [numberOfSetBits](https://stackoverflow.com/a/109025).
```cpp
#include <stdint.h>
int numberOfSetBits(uint32_t i)
{
// Java: use int, and use >>> instead of >>
// C or C++: use uint32_t
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
uint8_t clz(uint32_t x)
{
static uint8_t const Table[] = {
0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF,
0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF,
12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14,
0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF,
5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF,
25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF,
};
/* Propagate leftmost 1-bit to the right */
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >> 16);
/* x = x * 0x6EB14F9 */
x = (x << 3) - x; /* Multiply by 7. */
x = (x << 8) - x; /* Multiply by 255. */
x = (x << 8) - x; /* Again. */
x = (x << 8) - x; /* Again. */
return 31 - Table[x >> 26];
}
int numberOfSteps(int num)
{
return (numberOfSetBits(num) + 31 - clz(num)) & (unsigned) ((num == 0) - 1);
}
```
[Generated assembly](https://godbolt.org/z/ceTTjd)
The assembly code for `numberOfSteps` is long because the compiler inlined `clz` and `numberOfSetBits`.
![](https://i.imgur.com/vs4SrmP.png)
## Q4 - `gcd64`
```cpp
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v) {
if (!u || !v) return u | v;
int shift;
for (shift = 0; !((u | v) & 1); shift++) {
u /= 2, v /= 2;
}
while (!(u & 1))
u /= 2;
do {
while (!(v & 1))
v /= 2;
if (u < v) {
v -= u;
} else {
uint64_t t = u - v;
u = v;
v = t;
}
} while (v);
return u << shift;
}
```
### Q4.1 - Explanation
Commented code found on Wikipedia:
```cpp
unsigned int gcd(unsigned int u, unsigned int v)
{
unsigned int shift = 0;
/* GCD(0,v) == v; GCD(u,0) == u, GCD(0,0) == 0 */
if (u == 0) return v;
if (v == 0) return u;
/* Let shift := lg K, where K is the greatest power of 2
dividing both u and v. */
while (((u | v) & 1) == 0) {
shift++;
u >>= 1;
v >>= 1;
}
while ((u & 1) == 0)
u >>= 1;
/* From here on, u is always odd. */
do {
/* remove all factors of 2 in v -- they are not common */
/* note: v is not zero, so while will terminate */
while ((v & 1) == 0)
v >>= 1;
/* Now u and v are both odd. Swap if necessary so u <= v,
then set v = v - u (which is even). For bignums, the
swapping is just pointer movement, and the subtraction
can be done in-place. */
if (u > v) {
unsigned int t = v; v = u; u = t; // Swap u and v.
}
v -= u; // Here v >= u.
} while (v != 0);
/* restore common factors of 2 */
return u << shift;
}
```
:::info
References:
* [Binary GCD algorithm](https://en.wikipedia.org/wiki/Binary_GCD_algorithm)
:::
### Q4.2 - Implementation with `__builtin_ctz`
The part where the trailing zeros are popped off can be replaced with a single right shift.
```cpp
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
if (!u || !v)
return u | v;
int shift = __builtin_ctz(u | v);
u >>= shift;
v >>= shift;
u >>= __builtin_ctz(u);
do {
v >>= __builtin_ctz(v);
if (u < v) {
v -= u;
} else {
uint64_t t = u - v;
u = v;
v = t;
}
} while (v);
return u << shift;
}
```
To test for the performance difference, I decided to try out the [google benchmark library](https://github.com/google/benchmark) for the first time instead of throwing in some timing functions ad hoc.
Inputs are drawn randomly from the full range of `uint64_t`;
![](https://i.imgur.com/YPNb9mC.png)
The ctz version is slightly faster.
## Q5 - Extract indices of set bits
```cpp=
#include <stddef.h>
size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
bitset = bitmap[k];
while (bitset != 0) {
uint64_t t = bitset & -bitset;
int r = __builtin_ctzll(bitset);
out[pos++] = k * 64 + r;
bitset ^= t;
}
}
return pos;
}
```
### Q5.1 - Explanation
In each loop, the rightmost bit `t` is computed. The index of the set bit is found with `__builtin_ctzll`.
### Q5.2 - Performance
The number at the end of the name a benchmark indicates the bit density (5 means $d \approx 5/16$, bits drawn form a Bernoulli process). In each iteration, a bitmap with 10000 `uint64_t`s is scanned.
![](https://i.imgur.com/J5owYfH.png)
Observations:
* Time of `naive` depends largely on the entropy of the bitmap
* Time of `improved` depends on the bit density
* The worst case of `improved` is almost as fast as the best case of `naive`
* `naive` is >10 times slower when d=1/16
### Q5.3 - Improvements
```cpp=
size_t improved2(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
bitset = bitmap[k];
int n = __builtin_popcount(bitset);
while (n--) {
uint64_t t = bitset & -bitset;
int r = __builtin_ctzl(bitset);
out[pos++] = k * 64 + r;
bitset ^= t;
}
}
return pos;
}
```
I removed the data dependency between iterations of the inner loop. Previously, the loop condition `while (bitset != 0)` depends on the last statement in the loop `bitset ^= t`. This means the processor can only flush the pipeline after `bitset ^= t` evaluates to 0. By pre-calculating the number of iterations with `__builtin_popcount`, the compiler and processor can reorder the instructions more freely and achieve better performance with CPU Magic(TM).
This is not a universal improvement though. With low bit densities, the cost of popcount outweighs the performance gain.
![](https://i.imgur.com/ATehinG.png)
![](https://i.imgur.com/zDwyXmf.png)