# 2020q3 Homework4 (quiz4)

contributed by < `guaneec` >

###### tags: `sysprog2020`

Quick links:

* [Problem description](https://hackmd.io/@sysprog/2020-quiz4)
* [Submissions](https://hackmd.io/@sysprog/2020-homework4)

## Q1 - Hamming distance

### Q1.1 - Explanation

```cpp
int hammingDistance(int x, int y)
{
    return __builtin_popcount(x ^ y);
}
```

The Hamming distance is the number of bit positions in which the two numbers differ. `x ^ y` sets exactly those bits, and `__builtin_popcount` counts them.

### Q1.2 - [LeetCode 477. Total Hamming Distance](https://leetcode.com/problems/total-hamming-distance/)

```cpp
// sizeof(int) * 8
#define l 32

int totalHammingDistance(int *nums, int numsSize)
{
    int counts[l] = {0};
    int s = 0;
    for (int i = 0; i < numsSize; ++i) {
        for (int j = 0; j < l; ++j) {
            bool b = (nums[i] & (1u << j)) > 0;
            s += b ? i - counts[j] : counts[j];
            counts[j] += b;
        }
    }
    return s;
}
```

Each bit position is handled independently: when element `i` is processed, `counts[j]` holds how many earlier elements have bit `j` set, so element `i` contributes `i - counts[j]` (the number of earlier elements with bit `j` clear) if its bit `j` is set, and `counts[j]` otherwise.

![](https://i.imgur.com/Xp6do3F.png)

Swapping the order of the loops eliminates the need for a separate counter per bit:

```cpp
// sizeof(int) * 8
#define l 32

int totalHammingDistance(int *nums, int numsSize)
{
    int s = 0;
    for (int j = 0; j < l; ++j) {
        short count = 0;
        for (int i = 0; i < numsSize; ++i) {
            bool b = (nums[i] & (1u << j)) > 0;
            s += b ? i - count : count;
            count += b;
        }
    }
    return s;
}
```

![](https://i.imgur.com/rgd5www.png)

### Q1.3 - Readings

* [錯誤更正碼簡介](https://w3.math.sinica.edu.tw/math_media/d184/18404.pdf) (an introduction to error-correcting codes)
* [抽象代數的實務應用](https://www.math.sinica.edu.tw/www/file_upload/summer/crypt2017/data/2017/%E4%B8%8A%E8%AA%B2%E8%AC%9B%E7%BE%A9/[20170710][%E6%8A%BD%E8%B1%A1%E4%BB%A3%E6%95%B8%E7%9A%84%E5%AF%A6%E5%8B%99%E6%87%89%E7%94%A8].pdf) (practical applications of abstract algebra)

Todo

### Q1.4 - Reed-Solomon error correction

Todo

## Q2 - [LeetCode 1483. Kth Ancestor of a Tree Node](https://leetcode.com/problems/kth-ancestor-of-a-tree-node/)

```cpp
#include <stdlib.h>

typedef struct {
    int **parent;
    int max_level;
    int n;
} TreeAncestor;

TreeAncestor *treeAncestorCreate(int n, int *parent, int parentSize)
{
    TreeAncestor *obj = malloc(sizeof(TreeAncestor));
    obj->parent = malloc(n * sizeof(int *));
    obj->n = n;
    int max_level = 32 - __builtin_clz(n) + 1;
    for (int i = 0; i < n; i++) {
        obj->parent[i] = malloc(max_level * sizeof(int));
        for (int j = 0; j < max_level; j++)
            obj->parent[i][j] = -1;
    }
    /* level 0: the direct parent of each node */
    for (int i = 0; i < parentSize; i++)
        obj->parent[i][0] = parent[i];
    /* level j: the 2^j-th ancestor is the 2^(j-1)-th ancestor of the
     * 2^(j-1)-th ancestor */
    for (int j = 1;; j++) {
        int quit = 1;
        for (int i = 0; i < parentSize; i++) {
            obj->parent[i][j] = obj->parent[i][j - 1] == -1 ?
                -1 : obj->parent[obj->parent[i][j - 1]][j - 1];
            if (obj->parent[i][j] != -1)
                quit = 0;
        }
        if (quit)
            break;
    }
    obj->max_level = max_level - 1;
    return obj;
}

int treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k)
{
    int max_level = obj->max_level;
    /* decompose k into powers of two and jump level by level */
    for (int i = 0; i < max_level && node != -1; ++i)
        if (k & (1 << i))
            node = obj->parent[node][i];
    return node;
}

void treeAncestorFree(TreeAncestor *obj)
{
    for (int i = 0; i < obj->n; i++)
        free(obj->parent[i]);
    free(obj->parent);
    free(obj);
}
```

### Q2.1 - Explanation

The parent-child relationship can be thought of as a function:

$$
f(x) \triangleq \mathrm{parent\_of}(x)
$$

To get the kth parent, apply the [iterated function](https://en.wikipedia.org/wiki/Iterated_function):

$$
\mathrm{kth\_parent\_of}(x) = f^k(x)
$$

A naive algorithm simply applies the function repeatedly:

```cpp
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    int *parent;
} TreeAncestor;

TreeAncestor *treeAncestorCreate(int n, int *parent, int parentSize)
{
    TreeAncestor *obj = malloc(sizeof(TreeAncestor));
    obj->parent = parent;
    return obj;
}

int treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k)
{
    while (true) {
        if (!k)
            return node;
        if (!node) /* node 0 is the root; it has no ancestors */
            return -1;
        --k;
        node = obj->parent[node];
    }
}

void treeAncestorFree(TreeAncestor *obj)
{
    free(obj);
}
```

However, this gets a "Time Limit Exceeded" error when n is 50000.

In the reference implementation, the iterates $f^x$ where $x$ is a power of two are pre-calculated:

$$
f, f^2, f^4, f^8, f^{16}, \dots
$$

With the [property](https://en.wikipedia.org/wiki/Iterated_function#Abelian_property_and_iteration_sequences) $f^m \circ f^n = f^{m+n}$, $f^k$ can be calculated with $O(\log k)$ function applications.

Take $k=49999$ for example. With the naive method, this takes 49999 table lookups. With a pre-calculated table,

$$
\begin{aligned}
f^{49999} &= f^{\mathtt{0b1100001101001111}}\\
&= f^{32768} \circ f^{16384} \circ f^{512} \circ f^{256} \circ f^{64} \circ f^{8} \circ f^{4} \circ f^{2} \circ f^{1}
\end{aligned}
$$

That's only 9 lookups.

### Q2.2 - Memory usage

### Q2.3 - Alternative implementation

```cpp
#include <stdlib.h>

typedef struct {
    int **parent;
    int max_level;
    int n;
} TreeAncestor;

TreeAncestor *treeAncestorCreate(int n, int *parent, int parentSize)
{
    TreeAncestor *obj = malloc(sizeof(TreeAncestor));
    obj->parent = malloc((n + 1) * sizeof(int *));
    obj->n = n;
    int max_level = 32 - __builtin_clz(n) + 1;
    for (int i = 0; i < n + 1; i++) {
        obj->parent[i] = malloc(max_level * sizeof(int));
        for (int j = 0; j < max_level; j++)
            obj->parent[i][j] = 0;
    }
    /* node indices are shifted by one: index 0 stands for "no ancestor"
     * and maps to itself, so no -1 checks are needed while filling the
     * table */
    for (int i = 0; i < parentSize; i++)
        obj->parent[i + 1][0] = parent[i] + 1;
    for (int j = 1;; j++) {
        int quit = 1;
        for (int i = 0; i < parentSize; i++) {
            obj->parent[i + 1][j] =
                obj->parent[obj->parent[i + 1][j - 1]][j - 1];
            if (obj->parent[i + 1][j])
                quit = 0;
        }
        if (quit)
            break;
    }
    obj->max_level = max_level - 1;
    return obj;
}

int treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k)
{
    int max_level = obj->max_level;
    node = node + 1;
    while (k) {
        int i = __builtin_ctzl(k); /* index of the lowest set bit of k */
        k ^= k & -k;               /* clear that bit */
        node = obj->parent[node][i];
    }
    return node - 1;
}

void treeAncestorFree(TreeAncestor *obj)
{
    for (int i = 0; i < obj->n + 1; i++) /* n + 1 rows were allocated */
        free(obj->parent[i]);
    free(obj->parent);
    free(obj);
}
```

![](https://i.imgur.com/veiKzOS.png)

Update: It was reported that [Leetcode's timings are unstable](https://hackmd.io/@RinHizakura/BkoGM5V8v#%E5%89%8D%E6%83%85%E6%8F%90%E8%A6%81). I tried submitting the same code a couple of times:

![](https://i.imgur.com/MLKu6xB.png)

Well, isn't that kinda silly?
So before I conclude that my changes to the code actually work, I need to test them with a more robust method. I could test with random data on my machine, but optimizations that help on random data do not necessarily help on Leetcode's test cases, and without assumptions about the inputs there is little to optimize for.

I grabbed the test cases from Leetcode by intentionally exceeding the time limit on the n-th test case. You can download all 10 test cases [here](https://raw.githubusercontent.com/guaneec/sysprog-2020q3-quiz4/407654b514f4e55394f8cb68d1d8fbea8f7f9a3f/kth_parent.txt).

`n: 7, 5, 5, 5, 10, 10, 1000, 10000, 50000, 50000`

#### Attempt 1 - Create a node for `-1`

[Source](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/kp/offset.hpp)

Create a node for the "nothing node" `-1` and offset all other node indices by 1. This eliminates some boundary checks.

##### Result

![](https://i.imgur.com/aAwwQ6T.png)

<15% slowdown for small n, <6% speedup for large n. Maybe an overall speedup?

#### Attempt 2 - Count trailing zeros

[Source](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/kp/ctzl.hpp)

Use `__builtin_ctzl` to skip over the zero bits of `k` in the loop of `treeAncestorGetKthAncestor`.

##### Result

![](https://i.imgur.com/8OYHTAs.png)

A minor optimization.

#### Attempt 3 - Flatten the 2D array

[Source](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/kp/flatten.hpp)

Flatten the 2D table into a 1D array to reduce memory accesses (one indirection instead of two).

##### Result

![](https://i.imgur.com/tB7mPGp.png)

Massive speedup.

#### Attempt 3.1 - Flatten, transposed

[Source](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/kp/ftrans.hpp)

Transpose the table before flattening, so that entries of the same level are contiguous.

##### Result

![](https://i.imgur.com/SV0xpiq.png)

A surprising amount of improvement, especially for large n.

#### Attempt 3.1.1 - Use memcpy

[Source](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/kp/memcpy.hpp)

Use `memcpy` to copy over the input array instead of iterating element by element.

##### Result

![](https://i.imgur.com/achJVdA.png)

<10% speedup. The effect is largest at n=10, for some reason. Unfortunately, this is not compatible with the `uint16_t` table of attempt 5, since the input array holds `int`s.

#### Attempt 4 - Use naive method for low n

[Source](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/kp/hybrid.hpp)

Use the linear algorithm for `n <= 100` and the logarithmic version for larger n.

##### Result

![](https://i.imgur.com/qyoXT7z.png)

Huge speedup for low n, minor overhead for large n. Since the large-n cases dominate the total runtime, this probably worsens the overall time.

#### Attempt 5 - Use `unsigned short` instead of `int`

[Source](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/kp/short.hpp)

Since `n` does not exceed 50000, the parent table may as well use `uint16_t`s. Special care needs to be taken to handle `-1`.

##### Result

![](https://i.imgur.com/wlDKuRP.png)

A major optimization, especially for large n.

#### Attempt 6 - Remove wasted row

[Source](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/kp/level.hpp)

In the original implementation, one more row (level) of the `**parent` table is allocated than is ever used, because of the `+ 1` in `32 - __builtin_clz(n) + 1`. Removing it slightly reduces memory usage and should improve cache behavior.

##### Result

![](https://i.imgur.com/WoLkT77.png)

<8% improvement.

#### All together - Combining 2, 3.1, 5, 6

[Source](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/kp/full.hpp)

##### Result

![](https://i.imgur.com/DP9Y4Jp.png)
![](https://i.imgur.com/yztsfoT.png)
![](https://i.imgur.com/81XBs7K.png)

New record!
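To make the table layout concrete, here is a rough sketch of how attempts 1 (the offset "nothing" node), 2 (count trailing zeros), 3.1 (flattened, transposed table), and 6 (no wasted level) fit together, using plain `int` rather than the `uint16_t` of attempt 5. It is reconstructed from the descriptions above, not the exact code behind the measurements (the measured sources are in the linked repository), and for simplicity it always fills every level instead of quitting early.

```cpp
#include <stdlib.h>

typedef struct {
    int *table;  /* flattened, transposed: table[level * stride + node] */
    int levels;  /* number of levels actually stored */
    int stride;  /* n + 1: index 0 is the virtual "no ancestor" node */
} TreeAncestor;

TreeAncestor *treeAncestorCreate(int n, int *parent, int parentSize)
{
    TreeAncestor *obj = malloc(sizeof(TreeAncestor));
    int levels = 32 - __builtin_clz(n); /* bit length of n; no extra level */
    int stride = n + 1;
    obj->table = calloc((size_t) levels * stride, sizeof(int));
    obj->levels = levels;
    obj->stride = stride;

    /* level 0: direct parents, every index shifted by one so that the
     * missing parent (-1) becomes node 0, which maps to itself */
    for (int i = 0; i < parentSize; i++)
        obj->table[i + 1] = parent[i] + 1;

    /* level j applies level j - 1 twice; one level is one contiguous row */
    for (int j = 1; j < levels; j++) {
        int *prev = obj->table + (size_t) (j - 1) * stride;
        int *cur = obj->table + (size_t) j * stride;
        for (int i = 1; i <= n; i++)
            cur[i] = prev[prev[i]];
    }
    return obj;
}

int treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k)
{
    node += 1;
    while (k && node) {
        int i = __builtin_ctz(k); /* jump by the lowest power of two in k */
        if (i >= obj->levels)     /* jumping this far always leaves the tree */
            return -1;
        k &= k - 1;               /* clear that bit */
        node = obj->table[(size_t) i * obj->stride + node];
    }
    return node - 1;
}

void treeAncestorFree(TreeAncestor *obj)
{
    free(obj->table);
    free(obj);
}
```

The query loop is the same bit-by-bit walk as in the alternative implementation above; the structural change is that one level of the table is now a single contiguous block of `n + 1` ints, so building level `j` from level `j - 1` scans memory sequentially.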
## Q3 - Fizz Buzz

### Q3.1 - Explanation and performance analysis

```cpp
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MSG_LEN 8

static inline bool is_divisible(uint32_t n, uint64_t M)
{
    return n * M <= M - 1;
}

static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;

int main(int argc, char *argv[])
{
    for (size_t i = 1; i <= 100; i++) {
        uint8_t div3 = is_divisible(i, M3);
        uint8_t div5 = is_divisible(i, M5);
        unsigned int length = (2 << div3) << div5;

        char fmt[MSG_LEN + 1];
        strncpy(fmt, &"FizzBuzz%u"[(MSG_LEN >> div5) >> (div3 << 2)], length);
        fmt[length] = '\0';

        printf(fmt, i);
        printf("\n");
    }
    return 0;
}
```

The format string is selected by offsetting into the literal `"FizzBuzz%u"` and truncating it to `length` characters:

| div3 | div5 | shift amount `div5 + (div3 << 2)` | string offset `(MSG_LEN >> div5) >> (div3 << 2)` | length `(2 << div3) << div5` | result |
|------|------|------|------|------|--------|
| 0 | 0 | 0 | 8 | 2 | `"%u"` |
| 0 | 1 | 1 | 4 | 4 | `"Buzz"` |
| 1 | 0 | 4 | 0 | 4 | `"Fizz"` |
| 1 | 1 | 5 | 0 | 8 | `"FizzBuzz"` |

The original naive implementation does not give the correct output. Here's a corrected version:

```cpp
#include <stdio.h>

int main()
{
    for (unsigned int i = 1; i <= 100; i++) {
        if (i % 15 == 0)
            printf("FizzBuzz");
        else if (i % 3 == 0)
            printf("Fizz");
        else if (i % 5 == 0)
            printf("Buzz");
        else // if ((i % 3) && (i % 5))
            printf("%u", i);
        printf("\n");
    }
    return 0;
}
```

```shell
$ diff <(./fizz_naive) <(./fizz_bitwise)
$ time for i in `seq 1000`; do ./fizz_naive; done > /dev/null

real	0m7.270s
user	0m4.635s
sys	0m2.744s

$ time for i in `seq 1000`; do ./fizz_bitwise; done > /dev/null

real	0m7.271s
user	0m4.815s
sys	0m2.572s
```

No difference in performance is observed with the `printf` versions. I rewrote them with `sprintf` and ran them through [Google Benchmark](https://github.com/google/benchmark):

| name | iterations | real_time | cpu_time | time_unit |
|--------------|------------|-----------|----------|-----------|
| BM_naive     | 135787     | 5089.09   | 5088.94  | ns        |
| BM_bitwise   | 119672     | 5801.36   | 5801.13  | ns        |
| BM_naive_s   | 229481     | 3048.54   | 3048.39  | ns        |
| BM_bitwise_s | 142929     | 4866.44   | 4866.35  | ns        |

Seems like I/O is indeed the bottleneck. But wait, isn't the bitwise version supposed to be faster?

:::warning
No, bitwise implementation is not supposed to be faster. Instead, it is meant to be *deterministic* for instruction counts and branches. You shall pick up randomized test cases instead of computing the sum of execution time for all available inputs.
:notes: jserv
:::

Let's dive into [the assembly code](https://godbolt.org/z/vxxMzT) (it's long, so I'm not pasting it here). The first thing one notices is that in `naive`, no integer divide instruction is generated. The magic constants are (presumably) derived from:

```
1431655765 = 0x55555555 = 0xFFFFFFFF / 3
 858993459 = 0x33333333 = 0xFFFFFFFF / 5
 286331153 = 0x11111111 = 0xFFFFFFFF / 15
```

Another interesting thing: no `mul` instruction is found in `bitwise`! How? It's [strength reduction](https://en.wikipedia.org/wiki/Strength_reduction) in action: since `i` only increments, the multiplications `i * M3` and `i * M5` are replaced with running sums kept in the `r13` and `r14` registers.

Anyways, let's look at what `perf` has to say (assembly of `bitwise`):

![](https://i.imgur.com/H0qcfTW.png)
![](https://i.imgur.com/NmeTxA4.png)

My impression: modifying memory and loading 64-bit constants are slow; `naive` does neither, so it's faster.

#### Testing with random cases

In each iteration, 1500 elements are fed into the fizzbuzz function.
They're generated:

* Randomly: drawn from $\mathcal{U}(1, 1500)$
* Sequentially: 1..1500, in order

![](https://i.imgur.com/FIJJRgz.png)

The source can be found [here](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/40904b00359a1407e1d9b01fe285e6755d3651b9/benchfizz.cpp).

Naive still outperforms bitwise. The sequential versions are slightly faster than their random counterparts. This is understandable for naive, but somewhat odd for bitwise, since the bitwise version is supposedly branchless. I suspected that the compiler might do something clever when loading inputs from an ordered vector, but no: the assembly for loading inputs is identical for the sequential and random cases. `perf` to the rescue! The number of branch misses can be found with `perf stat`.

**Number of branch-misses**

| | naive | bitwise |
|-|-|-|
| **Random** | 11,130,739 [[src]](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/fizz_naive_random.cpp) | 12,465,348 [[src]](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/fizz_bitwise_random.cpp) |
| **Ordered** | 8,562,249 [[src]](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/fizz_naive_seq.cpp) | 7,046,116 [[src]](https://github.com/guaneec/sysprog-2020q3-quiz4/blob/master/fizz_bitwise_seq.cpp) |

So the bitwise version isn't as branchless as one would expect. To find out where the branch misses actually occur, use `perf record -e branch-misses`.

**`perf report`, bitwise, random**

![](https://i.imgur.com/nxgYeyV.png)

**`perf report`, bitwise, ordered**

![](https://i.imgur.com/0248pv4.png)

Even though our bitwise fizzbuzz itself is branchless, it calls `sprintf`, which is not. The library calls are responsible for a large share of the branch misses.

### Q3.2 - Implementation with bitmasks

## Q4 - ffs

### Q4.1 - Explanation

The cases we need to consider are the numbers 1-8 with their trailing zeros chopped off, which are the odd numbers 1, 3, 5, 7. If the dividend is 1, no operation is needed. Therefore the necessary cases are 3, 5, 7.

### Q4.2 - Application in [LeetCode 51. N-Queens](https://leetcode.com/problems/n-queens/)

Baseline implementation:

```cpp
int N;
bool *l;     /* l[j]: column j occupied */
bool *p;     /* p[i + j]: one diagonal direction occupied */
bool *m;     /* m[N - i + j]: the other diagonal direction occupied */
bool *board;
int *rcur;
char ***cur;
/* number of N-queens solutions for N = 0, 1, 2, ... */
size_t sizes[] = {1, 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200,
                  73712, 365596, 2279184, 14772512, 95815104, 666090624,
                  4968057848, 39029188884, 314666222712, 2691008701644,
                  24233937684440, 227514171973736, 2207893435808352,
                  22317699616364044, 234907967154122528};

void output()
{
    *cur = malloc(N * sizeof(char *));
    for (int i = 0; i < N; ++i) {
        char *line = calloc(N + 1, sizeof(char));
        for (int j = 0; j < N; ++j) {
            line[j] = board[i * N + j] ?
                'Q' : '.';
        }
        (*cur)[i] = line;
    }
    *rcur = N;
    ++rcur;
    ++cur;
}

void bt(int i)
{
    for (int j = 0; j < N; ++j) {
        if (l[j] || p[i + j] || m[N - i + j])
            continue;
        board[i * N + j] = l[j] = p[i + j] = m[N - i + j] = 1;
        if (i == N - 1)
            output();
        else
            bt(i + 1);
        board[i * N + j] = l[j] = p[i + j] = m[N - i + j] = 0;
    }
}

char ***solveNQueens(int n, int *returnSize, int **returnColumnSizes)
{
    N = n;
    l = calloc(sizeof(bool), N);
    p = calloc(sizeof(bool), (2 * N + 1));
    m = calloc(sizeof(bool), (2 * N + 1));
    board = calloc(sizeof(bool), N * N);
    *returnColumnSizes = malloc(sizes[N] * sizeof(int));
    *returnSize = sizes[N];
    rcur = &(*returnColumnSizes)[0];
    char ***ret = malloc(sizes[N] * sizeof(char **));
    cur = &ret[0];
    bt(0);
    return ret;
}
```

![](https://i.imgur.com/jm5NtA4.png)

By intentionally entering an infinite loop, I grabbed Leetcode's test cases for this problem:

```
4
1
2
3
5
6
7
8
9
```

With this information, we know that 16 bits are enough to store the state of a single row of the chessboard. I'm not sure how to apply `__builtin_ffs` here, though. Most of the memory usage is for storing the output `char ***`, I think. Fiddling with the representation of the board and flags should not change the memory usage by more than about 1 kB, which is definitely not a "significant decrease of memory usage". Maybe I should try a fundamentally different approach.
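One way `__builtin_ffs` could come into play is the standard bitmask backtracking: keep the occupied columns and the two diagonal directions as bitmasks and peel off the lowest free column with `__builtin_ffs` at each step. Below is a minimal counting-only sketch under that assumption; it does not build the `char ***` output LeetCode requires, and the names (`solve`, `full`, `avail`, ...) are mine, not from the quiz.

```cpp
#include <stdio.h>

static int n, count;

/* cols: columns already occupied (bit j = column j)
 * d1, d2: cells attacked along the two diagonal directions, already
 *         shifted into the current row
 * full: the low n bits set, i.e. a completely occupied row */
static void solve(unsigned cols, unsigned d1, unsigned d2, unsigned full)
{
    if (cols == full) { /* all n queens placed */
        ++count;
        return;
    }
    unsigned avail = full & ~(cols | d1 | d2);
    while (avail) {
        int j = __builtin_ffs(avail) - 1; /* index of the lowest free column */
        unsigned bit = 1u << j;
        avail ^= bit;
        /* shifting the diagonal masks by one moves the attacked cells
         * down to the next row */
        solve(cols | bit, (d1 | bit) << 1, (d2 | bit) >> 1, full);
    }
}

int main(void)
{
    n = 8;
    solve(0, 0, 0, (1u << n) - 1);
    printf("%d solutions for n = %d\n", count, n); /* expect 92 for n = 8 */
    return 0;
}
```

The shifted diagonal masks play the same role as the `p` and `m` arrays above, but a whole row of candidate columns is now a single `unsigned`, and `__builtin_ffs` turns the lowest free bit back into a column index (which would be needed for placing the `'Q'` when producing the board output).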