2020q3 Homework4 (quiz4)

# 2020q3 Homework4 (quiz4) contributed by <`Tim096`> > [指標＆數值系統＆ Bitwise 練習題](https://hackmd.io/@sysprog/2020-quiz4) ## Q1. LeetCode [461. Hamming Distance](https://leetcode.com/problems/hamming-distance/) * The [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between two integers is the number of positions at which the corresponding bits are different. #### 使用 `__builtin_popcount` 實作。 ```cpp= int hammingDistance(int x, int y) { return __builtin_popcount(x OP y); // OP = ^ } ``` OP = `^` `__builtin_popcount` : 用於計算一個 32 位元無符號整數有多少個，位元是 1 | A XOR B | A = 0 | A = 1 | | ----- | ----- | ----- | | B = 0 | 0 | 1 | | B = 1 | 1 | 0 | * `A ^ B` 可以將 `A` `B`兩者對應位置不同的 bits 顯示出來，其 1 的個數即為 Hamming distance ## Q1. 進階題 #### 不使用 gcc extension 實作在此先解釋一下何謂 gcc extension ：編譯器提供的內建函式。 :::success 延伸資料 [gcc extension 其他的涵式庫](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) ::: 我上網查和看其他人的共筆甚至是自己一開始想到進階題一開始的作法的確都是這樣但是我自己一直在想難道沒有一種方法可以不要用到迴圈的方式嘛，單純使用運算元符號來達成目標目前想法是 loop unrolling ，但是這樣根本就不會比較快一直在想要一個突破的點就是程式碼例如：（1000..01）這種數字如果用大家的方法就是肯定要跑 32 次，難道中間那些 `0` 沒有一種運算元的方式來去一次前進多一點甚至是可以是可以自動跳到下一個 `1` 所出現的位置。實際方法目前想不到 ```cpp= int hammingDistance(int x, int y) { x ^= y; int i = 0; while(x){ if (x & 1) i++; x >>= 1; } return i; } ``` 我想到了一個很酷的方法，雖然沒有完全解決我上面提出的問題，但是部份解決了。 ```c= int hammingDistance(int x, int y) { int a = x ^ y; int b0 = (a>>0)&0x55555555; int b1 = (a>>1)&0x55555555; int c = b0 + b1; int d0 = (c>>0)&0x33333333; int d2 = (c>>2)&0x33333333; int e = d0 + d2; int f0 = (e>>0)&0x0F0F0F0F; int f4 = (e>>4)&0x0F0F0F0F; int g = f0 + f4; int h0 = (g>>0)&0x00FF00FF; int h8 = (g>>8)&0x00FF00FF; int i = h0 +h8; int j0 = (i>>0)&0x0000FFFF; int j16 = (i>>16)&0x0000FFFF; int k = j0 + j16; return k; } ``` ![](https://i.imgur.com/FTPhy1X.png) #### LeetCode [477. Total Hamming Distance](https://leetcode.com/problems/total-hamming-distance/) * find the total Hamming distance between all pairs of the given numbers. :::spoiler 範例 >Input: 4, 14, 2 > >Output: 6 > >the 4 is 0100, 14 is 1110, and 2 is 0010 >So the answer will be: HammingDistance(4, 14) + HammingDistance(4, 2) + HammingDistance(14, 2) = 2 + 2 + 2 = 6. ::: 最一開始很直覺的想到使用階層式的想法 + 前幾堂課所教的 `__builtin_popcount` ```c= int totalHammingDistance(int* nums, int numsSize){ int total = 0,i,j; for(i=0;i<numsSize;i++) for(j=i+1;j<numsSize;j++) total +=__builtin_popcount(nums[i] ^ nums[j]); return total; } ``` 但是很可惜的是 leetcode 說超時了，不能夠這樣。 ##### 不超時版 ```cpp= int totalHammingDistance(int* nums, int numsSize){ int total = 0; for(int i=0; i < 31 ; i++){ int count = 0; for (int j= 0 ; j < numsSize ; j++){ count += (nums[j]) & (1); nums[j] = nums[j] >> 1; } total += count * (numsSize - count); } return total; } ``` 假設在某一位元共有 Count 個數字為1，那麼就有 (numsSize - Count) 個數字為 0，因為陣列中的每個數字皆會與其他數字計算過一次 hamming distance，所以就會使 hamming distance 總和增加 Count * (numsSize - Count)。如此一來，遍歷所有位元的總和即為最後的答案。 :::spoiler 舉例 Input : (numsSize = 4) > 04 = ( ==0==100 ) > 14 = ( ==1==110 ) > 02 = ( ==0==010 ) > 06 = ( ==0==110 ) 舉例 : 拿第一位元來舉例 Count = 1 (共有 1 個 1) numsSize - Count = 4 - 1 = 3 (共有 3 個 0) 因此 total = count * (numsSize - count) = 1 * 3 =3 (在第一個位元時，會有3個相異的位元) ::: * 效能評估 ![](https://i.imgur.com/iZsRZ9w.png) ## Q2. LeetCode [1483. Kth Ancestor of a Tree Node](https://leetcode.com/problems/kth-ancestor-of-a-tree-node/) * You are given a tree with `n` nodes numbered from `0` to `n-1` in the form of a parent array where `parent[i]` is the parent of node `i`. The root of the tree is node `0`. * Implement the function `getKthAncestor(int node, int k)` to **return the `k`-th ancestor of the given node**. If there is no such ancestor, return `-1`. * The k-th ancestor of a tree node is the `k`-th node in the path from that node to the root. #### [Binary Lifting](https://cp-algorithms.com/graph/lca_binary_lifting.html#toc-tgt-1) 的實作： ```cpp= #include <stdlib.h> typedef struct { AAA; // AAA = int **parent int max_level; } TreeAncestor; TreeAncestor *treeAncestorCreate(int n, int *parent, int parentSize) { TreeAncestor *obj = malloc(sizeof(TreeAncestor)); obj->parent = malloc(n * sizeof(int *)); int max_level = 32 - __builtin_clz(n) + 1; for (int i = 0; i < n; i++) { obj->parent[i] = malloc(max_level * sizeof(int)); for (int j = 0; j < max_level; j++) obj->parent[i][j] = -1; } for (int i = 0; i < parentSize; i++) obj->parent[i][0] = parent[i]; for (int j = 1;; j++) { int quit = 1; for (int i = 0; i < parentSize; i++) { obj->parent[i][j] = obj->parent[i][j + BBB] == -1 // BBB = (-1) ? -1 : obj->parent[obj->parent[i][j - 1]][j - 1]; if (obj->parent[i][j] != -1) quit = 0; } if (quit) break; } obj->max_level = max_level - 1; return obj; } int treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k) { int max_level = obj->max_level; for (int i = 0; i < max_level && node != -1; ++i) if (k & (CCC)) // CCC = 1 << i node = obj->parent[node][i]; return node; } void treeAncestorFree(TreeAncestor *obj) { for (int i = 0; i < obj->n; i++) free(obj->parent[i]); free(obj->parent); free(obj); } ``` AAA = `int **parent` BBB = `(-1)` CCC = `1 << i` ### 前導知識 : 1. 假設要找某 node 的第 k 個祖先，假設 k = 3 ( $2^0 + 2^1$) 為例，那麼等同於找此 node ==的==第 $2^0$ 個祖先==的==第 $2^1$ 個祖先 2. parent[i][j] 的意義就是第 i 個 node 的第 $2^j$ 個祖先是誰。 3. 每個 node 可以記錄 `max_level` 個 parents 4. 前提假設保證此為 Heap 結構。 5. line 11 和 line 15 把二維陣列建構出來 ### 程式碼解析： line 3~6 定義出來每一個 node 它所需的資訊需要 `int **parent` ， `int max_level` ```cpp=3 typedef struct { int **parent; int max_level; } TreeAncestor; ``` line 10~11 配置記憶體空間 ```cpp=10 TreeAncestor *obj = malloc(sizeof(TreeAncestor)); obj->parent = malloc(n * sizeof(int *)); ``` line 13 初始化 `max_level` ， **max_level =** $(\lfloor log_2{n} \rfloor + 1) + 1$ ```cpp=13 int max_level = 32 - __builtin_clz(n) + 1; ``` line 14~20 就在對此二維陣列初始化及設定每個 node 的第一個祖先。 ```cpp=14 for (int i = 0; i < n; i++) { obj->parent[i] = malloc(max_level * sizeof(int)); for (int j = 0; j < max_level; j++) obj->parent[i][j] = -1; } ``` line 22-31 在建立每個 node i 的第 $2^j$ 個祖先，會先看此 node 是否存在第 $2^{j-1}$ 個祖先， - 若不存在，則此 node 必不存在第 $2^j$ 個祖先; - 若存在，則此 node 的第 $2^j$ 個祖先 = 此 node 的第 $2^{j-1}$ 個祖先的第 $2^{j-1}$ 個祖先，因此 `BBB = -1`。 ```cpp=22 for (int j = 1;; j++) { int quit = 1; for (int i = 0; i < parentSize; i++) { obj->parent[i][j] = obj->parent[i][j-1] == -1 ? -1 : obj->parent[obj->parent[i][j - 1]][j - 1]; if (obj->parent[i][j] != -1) quit = 0; } if (quit) break; } ``` 最後呼叫 `treeAncestorGetKthAncestor` 時，會去拆解 k (即 k = $2^{m_0}$+$2^{m_1}$+...+$2^{m_k}$)，其中 $0<=m_0<m_1<...<m_k$，$m_i \in N$，因此 `CCC = 1 << i`。這邊有一點遞迴呼叫的概念 ## Q3. [FizzBuzz](https://en.wikipedia.org/wiki/Fizz_buzz) * 從 1 數到 n * 如果是 3的倍數，印出 “Fizz” * 如果是 5 的倍數，印出 “Buzz” * 如果是 15 的倍數，印出 “FizzBuzz” * 如果都不是，就印出數字本身 #### 直覺的實作: (`naive.c`) ```c= #include <stdio.h> int main() { for (unsigned int i = 1; i < 100; i++) { if (i % 3 == 0) printf("Fizz"); if (i % 5 == 0) printf("Buzz"); if ((i % 3) && (i % 5)) printf("%u", i); printf("\n"); } return 0; } ``` #### bitwise 觀察 ```cpp #define MSG_LEN 8 char fmt[MSG_LEN + 1]; strncpy(fmt, &"FizzBuzz%u"[start], length); fmt[length] = '\0'; printf(fmt, i); printf("\n"); ``` 我們若能精準從給定輸入 `i` 的規律去控制 `start` 及 `length`，即可符合 FizzBuzz 題意: ``` string literal: "fizzbuzz%u" offset: 0 4 8 ``` #### bitwise 的實作: (`bitwise.c`) ```c= #include <stdio.h> #include <stdbool.h> #include <stdint.h> ``` ```c= #define MSG_LEN 8 static inline bool is_divisible(uint32_t n, uint64_t M) { return n * M <= M - 1; } static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1; static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1; int main(int argc, char *argv[]) { for (size_t i = 1; i <= 100; i++) { uint8_t div3 = is_divisible(i, M3); uint8_t div5 = is_divisible(i, M5); unsigned int length = (2 << div3) << div5; char fmt[MSG_LEN + 1]; strncpy(fmt, &"FizzBuzz%u"[(MSG_LEN >> KK1) >> (KK2 << KK3)], length); // KK1 = div5 ; KK2 = div3 ; KK3 = 2 ; fmt[length] = '\0'; printf(fmt, i); printf("\n"); } return 0; } ``` KK1 = `div5` KK2 = `div3` KK3 = `2` * 程式碼解析： ```cpp=3 static inline bool is_divisible(uint32_t n, uint64_t M) { return n * M <= M - 1; } static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1; static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1; ``` line 3-8 可以參考之前 [quiz 2 - 測驗3](https://hackmd.io/5fY2oVqyRqadTcO6Y7YUEA?both#3-%E5%BF%AB%E9%80%9F%E9%99%A4%E6%B3%95) 如何實現==快速除法== ```cpp=12 uint8_t div3 = is_divisible(i, M3); uint8_t div5 = is_divisible(i, M5); ``` line 12-13 `div3`、`div5` 分別可知道 `i` 是否為 3 或 5 的倍數 * 若是，則為 1 * 若否，則為 0 ```c=14 unsigned int length = (2 << div3) << div5; ``` * 如果 `i` 為 3 或 5 的==其中之一==倍數則 `2` 往左位移 1 次 * length = 4 * 如果 `i` 為 15 的倍數則 `2` 往左位移 2 次 * length = 8 * 如果 `i` 以上皆不是則 `2` 往左位移 0 次 * length = 2 ```c=16 char fmt[MSG_LEN + 1]; strncpy(fmt, &"FizzBuzz%u"[(MSG_LEN >> div5) >> (div3 << 2)], length); ``` 首先創造出一個陣列最後可以印出需求。 >“%u” : 2 “fizz” : 4 “buzz” : 4 “fizzbuzz” : 8 * 若是 3 的倍數 * `strncpy(fmt, &"FizzBuzz%u"[0], 4);` * 若是 5 的倍數 * `strncpy(fmt, &"FizzBuzz%u"[4], 4);` * 若是 15 的倍數 * `strncpy(fmt, &"FizzBuzz%u"[0], 8);` * 若以上皆不是 * `strncpy(fmt, &"FizzBuzz%u"[8], 2);` :::danger Q : 根據 [C data types wiki](https://en.wikipedia.org/wiki/C_data_types) 理論上 `%u` 應該頂多到 65,535 若我的 Input 大於此值時，應該會是不能夠被 printf 出來的，但是我實際使用各種 compiler 做實驗去執行，是為甚麼 ? (作為字串) A : @jserv ::: ## Q4. offset / PAGE_QUEUES 考慮到整數 PAGE_QUEUES 可能有以下數值 : > 1. (1 or 2 or 3 or 4) > 2. (5 or 6 or 7 or 8) $×$ (${2^n}$), n from 1 to 14 給定 size_t offset，試著將原本運算 : ``` #include <stdint.h> size_t blockidx = offset / PAGE_QUEUES; ``` 因為 PAGE_QUEUES 的數值最大可能是 $8*2^n$ 所以若不進行化簡而直接做整數除法的話會運算的成本，因此我們可以使用 __builtin_ffs 將被除數與除數先行化簡由於 PAGE_QUEUES 的數值分佈已知，在整數除法時，可思考加速運算的機制。需要注意，某些硬體平台缺乏整數除法指令 (如 Arm Cortex-A8)，即使 Intel 公司出品的 Core 處裡器 Sandy Bridge 微架構中，針對 64 位元整數的除法運算，會佔用 40 到 103 個處理器週期，運算成本仍屬沈重。我們可預先進行計算，從而避免整數除法指令的使用: (假設在 x86_64 硬體執行 GNU/Linux 並透過 gcc/clang 編譯) ```c= #include <stdint.h> #define ffs32(x) ((__builtin_ffs(x)) - 1) size_t blockidx; size_t divident = PAGE_QUEUES; blockidx = offset >> ffs32(divident); divident >>= ffs32(divident); switch (divident) { CASES } ``` ```c= case 2: blockidx /= 2; break; ``` 這樣的組合，請用最少的 case-break 敘述做出同等 `size_t blockidx = offset / PAGE_QUEUES;` 效果的程式碼。 >— Built-in Function: >int __builtin_ffs (unsigned int x) >Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero. ```c= for(int i=0;i<10;i++) printf("__builtin_ffs (%d) = %d \n ",i,__builtin_ffs (i)); ``` 先從 gcc 擴充函數了解 : ``` __builtin_ffs (0) = 0 __builtin_ffs (1) = 1 __builtin_ffs (2) = 2 __builtin_ffs (3) = 1 __builtin_ffs (4) = 3 __builtin_ffs (5) = 1 __builtin_ffs (6) = 2 __builtin_ffs (7) = 1 __builtin_ffs (8) = 4 __builtin_ffs (9) = 1 ``` 我們可以知道最後幾個 0 也就是可以得知該數是否含有 ${2^n}$ 的因數，計算時可以先提取容易觀察計算的 ${2^n}$ | divident | binary | after __builtin_ffs(divident) | |:------:|:------:|:---------------:| | 1 | 0001 | 0001 = 1 | | 2 | 0010 | 0001 = 1 | | 3 | 0011 | 0011 = 3 | | 4 | 0100 | 0001 = 1 | | 5 | 0101 | 0101 = 5 | | 6 | 0110 | 0011 = 3 | | 7 | 0111 | 0111 = 7 | | 8 | 1000 | 0001 = 1 | 列舉每種可能之後會發現其實需要額外判斷的 case 其實只有 3, 5, 7 三數而已 ```c= case 3: blockidx /= 3; break; case 5: blockidx /= 5; break; case 7: blockidx /= 7; break; ``` ### 延伸問題 1. 解釋程式運算原理，可搭配 Microsoft Research 的專案 [snmalloc](https://github.com/microsoft/snmalloc) 來解說 2. 對應的原始程式碼 src/mem/sizeclass.h 練習 LeetCode 51. [N-Queens](https://leetcode.com/problems/n-queens/)，應選擇 C 語言並善用上述的 __builtin_ffs，大幅降低記憶體開銷; #### LeetCode 51. [N-Queens](https://leetcode.com/problems/n-queens/) 皇后問題 > 善用 `int __builtin_ffs (unsigned int x)`，以此降低記憶體開銷 > * `int __builtin_ffs (unsigned int x)` 作用 Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.(返回右起第一個1的位置) [N 皇后經典問題](https://zh.wikipedia.org/wiki/%E5%85%AB%E7%9A%87%E5%90%8E%E9%97%AE%E9%A2%98) ![](https://i.imgur.com/aYyyL7S.png) 題目敘述 : 八皇后問題是一個以西洋棋為背景的問題：如何能夠在8×8的西洋棋棋盤上放置八個皇后，使得任何一個皇后都無法直接吃掉其他的皇后？為了達到此目的，任兩個皇后都不能處於同一條橫行、縱行或斜線上。八皇后問題可以推廣為更一般的 N 皇后擺放問題：這時棋盤的大小變為 N×N ，而皇后個數也變成 N 。若且唯若 N = 1或N ≥ 4時問題有解。 ```graphviz digraph A { node_A [shape=record label=" {||X|||X||}| {|||X||X||X}| {||||X|X|X|}| {X|X|X|X|X|Q|X|X}| {||||X|X|X|}| {|||X||X||X}| {||X|||X||}| {|X||||X||}"]; } ``` **分析** 1. 窮舉 * 最直觀想法為列舉全全部的可能放置位置，再逐一判斷該放置位置是否皇后彼此不會受到影響。缺點為非常慢，必須改善此點。 2. Backtracking Algorithm * 可以參考 [RusselCK](https://hackmd.io/@RusselCK/sysprog2020_quiz4) 同學的寫得非常詳細以下試著用 `__builtin_ffs` 以此來大幅降低記憶體開銷。 ```cpp= static int N; static int sol; static int upperlim; int sizes[] = {1, 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624}; static char ***ans; static char **ret; //vd: Vertical //ld: Left slash //rd: Right slash void solve(int level, int vd, int ld, int rd){ if (level == N){ for (int i = 0; i < N; i++) memcpy(ans[sol][i], ret[i], N + 1); sol++; return; } int limit = (vd | ld | rd) ^ upperlim; int pos; while(limit){ pos = limit & (-limit); //LSB limit = limit & (limit - 1); //cutLSB ret[level][N - __builtin_ffs(pos)] = 'Q'; solve(level + 1, vd | pos, ((ld | pos) >> 1) & upperlim, ((rd | pos) << 1) & upperlim); ret[level][N - __builtin_ffs(pos)] = '.'; } } char *** solveNQueens(int n, int *returnSize, int **returnColumnSizes){ *returnSize = sizes[n]; *returnColumnSizes = malloc(sizes[n] * sizeof(int)); for (int i = 0; i < sizes[n]; i++) (*returnColumnSizes)[i] = n; N = n; sol = 0; upperlim = 1; ans = malloc(sizes[n] * sizeof(char **)); for (int i = 0; i < sizes[n]; i++) ans[i] = malloc(n * sizeof(char *)); for (int i = 0; i < sizes[n]; i++){ for(int j = 0; j < n; j++) ans[i][j] = malloc((n + 1) * sizeof(char)); // 1 for '\0' } ret = malloc(n * sizeof(char *)); for(int i = 0; i < n; i++){ ret[i] = malloc((n + 1) * sizeof(char)); for(int j = 0; j < n; j++) ret[i][j] = '.'; ret[i][n] = '\0'; } upperlim = (upperlim << n) - 1; solve(0, 0, 0, 0); for (int i = 0; i < n; i++) free(ret[i]); free(ret); return ans; } ``` ###### tags: `進階電腦系統應用2020` `quiz4`