2020q3 Homework4 (quiz4)

--- tags: Linux2020 --- # 2020q3 Homework4 (quiz4) contributed by < `YLowy` > ## [測驗 1](https://hackmd.io/@sysprog/2020-quiz4#%E6%B8%AC%E9%A9%97-1) [Hamming Distance](https://leetcode.com/problems/hamming-distance/) 兩個整數間的 Hamming distance 為其二進位的每個位元的差 Example : 1 (0 0 0 1) 4 (0 1 0 0) hamming distance = 2 | A $\oplus$ B | B = 0 | B = 1 | | -------- | -------- | -------- | | A = 0 | 0 | 1 | | A = 1 | 1 | 0 | `A ^ B`　就為二進位中兩數字的每個位元的差，再透過　popcount() 就可以得到答案 ```cpp int hammingDistance(int x, int y) { return __builtin_popcount(x ^ y); } ``` ### 延伸問題 1. 解釋上述程式碼運作原理，並舉出不使用 gcc extension 的 C99 實作程式碼其實看到這題挺眼熟的，原來 3 個月前有碰到過 ![](https://i.imgur.com/1WwNKSo.png) 當時寫法也是相似不過當時還不知道 gcc extension popcount() ```c= int hammingDistance(int x, int y){ unsigned int Z; Z=x^y; int result= 0,i; for(i=0;i<32;i++){ result+= (Z>>i)&1; } return result; } ``` 透過計數每一個位元得到結果 3. 練習 [LeetCode 477. Total Hamming Distance](https://leetcode.com/problems/total-hamming-distance/)，你應當提交 (submit) 你的實作 (C 語言實作)，並確保執行結果正確且不超時一開始想法很簡單，兩兩互相取 Hamming Distance 再除以 2 ```c= int totalHammingDistance(int* nums, int numsSize){ int total = 0,i,j; for(i=0;i<numsSize;i++) for(j=0;j<numsSize;j++) total +=__builtin_popcount(nums[i] ^ nums[j]); return total>>1; } ``` 不過超出時間 ![](https://i.imgur.com/pUQakOM.png) 於是想說用減少一半計算量的方法修改 ```c= int totalHammingDistance(int* nums, int numsSize){ int total = 0,i,j; for(i=0;i<numsSize;i++) for(j=i;j<numsSize;j++) total +=__builtin_popcount(nums[i] ^ nums[j] ); return total; } ``` ![](https://i.imgur.com/XrXYtnK.png) 快，還需要再快似乎從 pop_count 角度出發，時間複雜度就被限制在 O($n^{2}$) 透過觀察可以看看能不能從 " bit " 角度去觀察 Total Hamming Distance ```graphviz digraph A{ b [shape=record fontsize=24 label="{ n 'th bit|input 7| input 5| input 10|input 17 }|{ 4|0| 0| 0|1 }|{ 3|0| 0| 1 |0}|{ 2|1| 1| 0 |0}|{ 1|1| 0| 0 |0}|{ 0|1| 1| 1 |1}"] } ``` 1. 第 0 個 bit 觀察 : 全部都是 1 -> Hamming Distance = 0 2. 第 1 個 bit 觀察 : 1 有一個，其餘皆為 0 可以直接用單個 bit 判斷 Hamming Distance (1)* (numsize - 1) 3. 第 2 個 bit 觀察 : 1 有兩個，其餘皆為 0 Hamming Distance (2)* (numsize - 2) 以 bit 找規則似乎就找到了 : > 所有 Number 中從第 0 個 bit 開始計算全部該 bit 的 Hamming Distance ，在全部累加起來就是 Total Hamming Distance 之後程式改寫 : ```c= int totalHammingDistance(int* nums, int numsSize) { int bit_1_counter =0; int total=0; for(int i=0;i<32;i++){ bit_1_counter=0; // count how many '1' in nums[bit 'i'] in total for(int j=0;j<numsSize;j++){ bit_1_counter+=nums[j] & 1; nums[j]>>=1; } total+=bit_1_counter*(numsSize-bit_1_counter); } return total; } ``` ![](https://i.imgur.com/CacC7Xv.png) 4. 研讀[錯誤更正碼簡介](https://w3.math.sinica.edu.tw/math_media/d184/18404.pdf)及[抽象代數的實務應用](https://www.math.sinica.edu.tw/www/file_upload/summer/crypt2017/data/2017/%E4%B8%8A%E8%AA%B2%E8%AC%9B%E7%BE%A9/[20170710][%E6%8A%BD%E8%B1%A1%E4%BB%A3%E6%95%B8%E7%9A%84%E5%AF%A6%E5%8B%99%E6%87%89%E7%94%A8].pdf)，摘錄其中 Hamming Distance 和 Hamming Weight 的描述並重新闡述，舉出實際的 C 程式碼來解說 >搭配閱讀 [Hamming codes, Part I](https://www.youtube.com/watch?v=X8jsijhllIA&feature=youtu.be&ab_channel=3Blue1Brown) 及 [Hamming codes, Part II](https://www.youtube.com/watch?v=b3NxrZOu_CE&feature=youtu.be&ab_channel=3Blue1Brown)，你可適度摘錄影片中的素材，但應當清楚標註來源 Hamming Distance : 兩個向量中有多少 bit 不同 Hamming Weight : 向量中有多少 bit 為 '1' ( 非 0 ) 對於以此可以定義 minimum distance 以及 minimum weight 考慮到所有 Domain 中，最小值的 Hamming Distance 、Hamming Weight 會是相同的在 (7,4) Hamming code 中 minimum distance , minimum weight 皆等於 3 並且 codeword 可有 7 個在任何 Hamming code 中，如果想改其中 t 個錯誤，則該它的 minimum distance 至少要2t+ 1 ![](https://i.imgur.com/hdh9do0.png) Hamming Distance 說明了 Hamming code 理論上改錯最大值 (15, 7) BCH code 想成 minimum distance = 5 的 Hamming code BCH code 實作上十分簡樸優雅，可以非常快速的找到更正出現的一個錯誤，並且可以偵測到如果出現兩個錯誤時的情況 ![](https://i.imgur.com/0CwzIQQ.png) 圖片來源 : [Hamming codes, h■w to ov■rco■e n■ise.](https://www.youtube.com/watch?v=X8jsijhllIA&feature=youtu.be&ab_channel=3Blue1Brown) 0 : 用來判斷錯兩個的情況 1 : {1,5,9,13} {3,7,11,15} 該兩行的錯誤更正碼，使其為偶數 2 : {2,6,10,14} {3,7,11,15} 該兩行的錯誤更正碼，使其為偶數 4 : {4,5,6,7} {12,13,14,15} 該兩行的錯誤更正碼，使其為偶數 8 : {8,9,10,11} {12,13,14,15} 該兩行的錯誤更正碼，使其為偶數 #### 編碼給定 11 bit 資料，封裝進此陣列中，在預設 0 , 1 , 2 , 4 , 8 皆為 0 下算出這些錯誤更正碼 uint16_t hammingcode = 0b000x0xxx0xxxxxxx (x 為資料 0 or 1) 以 1 為例 : bit 1 = Hamming Weight(hammingcode & 0b0101010101010101) mod 2 用 C 語言寫起來會長 : ```c= hammingcode |=(((__builtin_popcount(hammingcode & 0b0101010101010101)&1))<<14); ``` bit 2, 4, 8 也是類似方法，最後再由計算完的整體進行最後的錯誤更正碼並指派給 bit 0 #### 解碼異常簡單，用 4 個判斷即可，讓我對這個編碼感到驚艷的是找到錯誤位置就是該四個判斷的分支，設計的非常優雅現在收到一組資料 hammingcode 現在想判斷錯誤點的最低 bit : ```c= bool bit1 = ((__builtin_popcount(hammingcode & 0b0101010101010101)&1)?1:0; ``` 當然如果 4 個判斷都是 0 的情況下表示沒有錯誤基本原理如上述所說，往後延伸應用包含像是分散式取block避免連續區域毀損、( 63 , 57 ) Hamming code #### Example [ LoRa-SDR/LoRaCodes.hpp ](https://github.com/myriadrf/LoRa-SDR/blob/master/LoRaCodes.hpp) LoRa 在物流網系統中為一種通訊系統，低耗電以及訊號覆蓋範圍大可以運用在中長距離通訊 [錯誤更正碼簡介](https://w3.math.sinica.edu.tw/math_media/d184/18404.pdf) 文中提到的(7,4)Hamming 實作如下 : ```cpp= /*********************************************************************** * Encode a 4 bit word into a 7 bits with parity. * Non standard version used in sx1272. **********************************************************************/ static inline unsigned char encodeHamming74sx(const unsigned char x) { auto d0 = (x >> 0) & 0x1; auto d1 = (x >> 1) & 0x1; auto d2 = (x >> 2) & 0x1; auto d3 = (x >> 3) & 0x1; unsigned char b = x & 0xf; b |= (d0 ^ d1 ^ d2) << 4; b |= (d1 ^ d2 ^ d3) << 5; b |= (d0 ^ d1 ^ d3) << 6; return b; } ``` ```cpp= /*********************************************************************** * Decode 7 bits into a 4 bit word with single bit correction. * Non standard version used in sx1272. * Set error to true when a parity error was detected **********************************************************************/ static inline unsigned char decodeHamming74sx(const unsigned char b, bool &error) { auto b0 = (b >> 0) & 0x1; auto b1 = (b >> 1) & 0x1; auto b2 = (b >> 2) & 0x1; auto b3 = (b >> 3) & 0x1; auto b4 = (b >> 4) & 0x1; auto b5 = (b >> 5) & 0x1; auto b6 = (b >> 6) & 0x1; auto p0 = (b0 ^ b1 ^ b2 ^ b4); auto p1 = (b1 ^ b2 ^ b3 ^ b5); auto p2 = (b0 ^ b1 ^ b3 ^ b6); auto parity = (p0 << 0) | (p1 << 1) | (p2 << 2); if (parity != 0) error = true; switch (parity) { case 0x5: return (b ^ 1) & 0xf; case 0x7: return (b ^ 2) & 0xf; case 0x3: return (b ^ 4) & 0xf; case 0x6: return (b ^ 8) & 0xf; case 0x0: case 0x1: case 0x2: case 0x4: return b & 0xF; } return b & 0xf; } ``` ![](https://i.imgur.com/wbbzAM4.png) 可以非常簡單看出 Hamming 應用 5. 研讀 [Reed–Solomon error correction](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction)，再閱讀 Linux 核心原始程式碼的 [lib/reed_solomon](https://github.com/torvalds/linux/tree/master/lib/reed_solomon) 目錄，抽離後者程式碼為單獨可執行的程式，作為 ECC 的示範。 [Reed-Solomon Library Programming Interface](https://www.kernel.org/doc/html/latest/core-api/librs.html) [里德所羅門碼的編 / 解碼方式](https://csie.ntut.edu.tw/labaspl/edu/Convolution_RS.pdf) page 23 ~ 38 Reed–Solomon encode: 1. 編碼 : 如果想傳送 km 個 bit 也就是所謂 k 個 block ( m 個 bit 且Domain ∈ GF(${2^m}$) ) , 後面附上的 parity (額外附加) 可以協助修正傳送資料。而能修改的資料(block 數量) 的最大值為 $\dfrac {n-k}{2}$ 傳出的資料計算 : $S(x)= p(x) * {x^t}$ - $S_r$$(x)$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | | -------- | -------- | -------- | -------- | -------- | -------- | -------- | | 3 | 2 | 1 | 382 | 191 | 487 | 474 | 2. 傳送過程中產生些錯誤 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | | -------- | -------- | -------- | -------- | -------- | -------- | -------- | | 3 | 2 | 123 | 456 | 191 | 487 | 474 | 3. 接收端收到錯誤資料要進行兩件事 > 1. 找出錯誤位置 > The syndromes are calculated by evaluating r at powers of α > 2. 在錯誤位置找到正確的值 [Forney algorithm](https://en.wikipedia.org/wiki/Forney_algorithm) 方法將正確值帶回原本資料參考 [Reed-Solomon Library Programming Interface](https://www.kernel.org/doc/html/latest/core-api/librs.html) 若是要實作該 de/encode 需要先 create 且 Inititalize ```cpp rs_decoder = init_rs (10, 0x409, 0, 1, 6); ``` 對照到 linux kernel [lib/reed_solomon/reed_solomon.c](https://github.com/torvalds/linux/blob/master/lib/reed_solomon/reed_solomon.c) init_rs 存在不同 initialize 行為 (init_rs_XXX )，但是均為帶入預設參數進 init_rs_internal 中 ***initrsinternal() 目的在於指派 rs control ，找尋對應的 codec 或者建立的新的，一個系統中未必只由一種 codec*** 先定義 rs_codec 結構 ```cpp struct rs_codec { int mm; //徵狀值 (Syndromes) 每個 Syndromes 的bit 數，也決定GF(X) int nn; //block 有多少 Syndromes (= (1<<mm)-1) uint16_t *alpha_to; // 用以計算log uint16_t *index_of; // 用以取log uint16_t *genpoly; //生成函數 int nroots; //多少位額外的 parity ，同時也代表生成函數的 root 數量 int fcr; //First consecutive root, index form int prim; //Primitive element, index form int iprim; //prim-th root of 1, index form int gfpoly; //原本的生成函數 g(x) 已就是 parity 決定之函數 int (*gffunc)(int); //Function to generate the field, if non-canonical representation int users; //使用者，這邊我認為應該是對於使用的 Thread ，實作中有使用到 mutex_lock struct list_head list; // Linux Kernel 雙向鏈結的封裝，定義在linux/list.h中 }; ``` >>[Linux鏈結串列struct list_head 研究](https://myao0730.blogspot.com/2016/12/linux.html) 還挺特別的 >>![](https://i.imgur.com/5ELCzji.png) ```c= #define LIST_HEAD_INIT(name) { &(name), &(name) } #define LIST_HEAD(name) \ struct list_head name = LIST_HEAD_INIT(name) ``` reed_solomon.c 有宣告此 List: ```c=53 static LIST_HEAD(codec_list); ``` ```c=231 static struct rs_control *init_rs_internal(int symsize, int gfpoly, int (*gffunc)(int), int fcr, int prim, int nroots, gfp_t gfp) { struct list_head *tmp; struct rs_control *rs; unsigned int bsize; /* Sanity checks */ if (symsize < 1) return NULL; if (fcr < 0 || fcr >= (1<<symsize)) return NULL; if (prim <= 0 || prim >= (1<<symsize)) return NULL; if (nroots < 0 || nroots >= (1<<symsize)) return NULL; ``` 上述檢查輸入合理性與否 ```c=248 /* * The decoder needs buffers in each control struct instance to * avoid variable size or large fixed size allocations on * stack. Size the buffers to arrays of [nroots + 1]. */ ``` [Reed-Solomon Library Programming Interface](https://www.kernel.org/doc/html/latest/core-api/librs.html) >The encoder calculates the Reed-Solomon code over the given data length and stores the result in the parity buffer. Note that the parity buffer must be initialized before calling the encoder. 故先在 create 與 initialize 中先設定好 ```c=255 bsize = sizeof(uint16_t) * RS_DECODE_NUM_BUFFERS * (nroots + 1); rs = kzalloc(sizeof(*rs) + bsize, gfp); if (!rs) return NULL; mutex_lock(&rslistlock); /* Walk through the list and look for a matching entry */ list_for_each(tmp, &codec_list) { struct rs_codec *cd = list_entry(tmp, struct rs_codec, list); ``` [kmalloc](https://www.kernel.org/doc/htmldocs/kernel-api/API-kmalloc.html) >bsize = sizeof(uint16_t) * RS_DECODE_NUM_BUFFERS * (nroots + 1) >sizeof(uint16_t) * RS_DECODE_NUM_BUFFERS = 64 ，剛好就是 64 位元架構的大小 ` RS_DECODE_NUM_BUFFERS //RS_DECODE_NUM_BUFFERS = 8 ` >void *kmalloc(size_t size, gfp_t flags) gfp -> get_free_pages rs = kzalloc(sizeof(*rs) + bsize, gfp) 可以將 buffer 與 rs 接在一起 kmalloc 是 kernel 中最常用的一種記憶體分配方式，它通過呼叫kmem_cache_alloc函數來實現。 de/encode 通常不會少量次數，在 cache 中肯定快些，但是要注意 cache page 大小 kzalloc 會在取得記憶體後預先填入 0 flags argument 運在在多種情況 EG: >>GFP_ATOMIC : 用於interrupt handler 或任何行程外環境的程式..絕不休眠 >>GFP_KERNEL : 一般核心記憶體配置..有可能休眠 >>... > mutex_lock(&rslistlock) 其他 Thread 若也想做 encode initialization 會被擋住，確保對於該初始化行為是正確不被干擾的 >list_for_each(pos, head) >`#define list_for_each(pos, head) for (pos = (head)->next; pos != (head); pos = pos->next)` LIST_HEAD 串接 codec 的 list ```cpp=265 if (symsize != cd->mm) continue; if (gfpoly != cd->gfpoly) continue; if (gffunc != cd->gffunc) continue; if (fcr != cd->fcr) continue; if (prim != cd->prim) continue; if (nroots != cd->nroots) continue; /* We have a matching one already */ cd->users++; rs->codec = cd; goto out; } ``` >找尋適訂的 codec 這邊他寫的條件判斷挺特別的，透過 continue 讓判別錯的回到 for判斷式找下一個 codec ```c=284 /* Create a new one */ rs->codec = codec_init(symsize, gfpoly, gffunc, fcr, prim, nroots, gfp); if (!rs->codec) { kfree(rs); rs = NULL; } out: mutex_unlock(&rslistlock); return rs; } ``` >如果一開始 for 迴圈就離開表示該 codec list 並未存在任何 codec ,就可以用原本尋找的條件自行創一個 codec >並且找到指定的 codec 之後 unlock mutex_lock >找到所需的 codec 再回傳 :::success 這裡的實作挺像物件導向設計模式的簡單工廠模式，有需要再做。 ::: 建立 rs_codec 以及初始化 ( 這裡幾乎看不懂了zzz ) ```c=70 static struct rs_codec *codec_init(int symsize, int gfpoly, int (*gffunc)(int), int fcr, int prim, int nroots, gfp_t gfp) { int i, j, sr, root, iprim; struct rs_codec *rs; rs = kzalloc(sizeof(*rs), gfp); if (!rs) return NULL; INIT_LIST_HEAD(&rs->list); ``` 對 cache 建立空間，並且初始化 LIST_HEAD ```c=81 rs->mm = symsize; rs->nn = (1 << symsize) - 1; rs->fcr = fcr; rs->prim = prim; rs->nroots = nroots; rs->gfpoly = gfpoly; rs->gffunc = gffunc; /* Allocate the arrays */ rs->alpha_to = kmalloc_array(rs->nn + 1, sizeof(uint16_t), gfp); if (rs->alpha_to == NULL) goto err; rs->index_of = kmalloc_array(rs->nn + 1, sizeof(uint16_t), gfp); if (rs->index_of == NULL) goto err; rs->genpoly = kmalloc_array(rs->nroots + 1, sizeof(uint16_t), gfp); if(rs->genpoly == NULL) goto err; ``` 對照一開始輸入的參數建立 codec 物件 ```c=101 /* Generate Galois field lookup tables */ rs->index_of[0] = rs->nn; /* log(zero) = -inf */ rs->alpha_to[rs->nn] = 0; /* alpha**-inf = 0 */ if (gfpoly) { sr = 1; for (i = 0; i < rs->nn; i++) { rs->index_of[sr] = i; rs->alpha_to[i] = sr; sr <<= 1; if (sr & (1 << symsize)) sr ^= gfpoly; sr &= rs->nn; } } else { sr = gffunc(0); for (i = 0; i < rs->nn; i++) { rs->index_of[sr] = i; rs->alpha_to[i] = sr; sr = gffunc(sr); } } ``` >[An Introduction to Galois Fields and Reed-Solomon Coding](https://people.cs.clemson.edu/~westall/851/rs-code.pdf) 先前提到的 GF(${2^m}$) >gfpoly (Generator polynomial) 代表其生成多項式 [參考來源](http://apwcs2014.nsysu.edu.tw/course/pdfdownload/95_2/%E9%8C%AF%E8%AA%A4%E6%9B%B4%E6%AD%A3%E7%A2%BC/CC-05-BCH.pdf) >若是一開始呼叫 init_rs_non_canonical 此 rs_control 則會使用到 gffunc >>gffunc : pointer to function to generate the next field element, or the multiplicative identity element if given 0. Used instead of gfpoly if gfpoly is 0 ```c=121 /* If it's not primitive, exit */ if(sr != rs->alpha_to[0]) goto err; /* Find prim-th root of 1, used in decoding */ for(iprim = 1; (iprim % prim) != 0; iprim += rs->nn); /* prim-th root of 1, index form */ rs->iprim = iprim / prim; ``` >不太了解卡一個 for 在這裡 >prim : Primitive element, index form >iprim : prim-th root of 1, index form ```c=129 /* Form RS code generator polynomial from its roots */ rs->genpoly[0] = 1; for (i = 0, root = fcr * prim; i < nroots; i++, root += prim) { rs->genpoly[i + 1] = 1; /* Multiply rs->genpoly[] by @**(root + x) */ for (j = i; j > 0; j--) { if (rs->genpoly[j] != 0) { rs->genpoly[j] = rs->genpoly[j -1] ^ rs->alpha_to[rs_modnn(rs, rs->index_of[rs->genpoly[j]] + root)]; } else rs->genpoly[j] = rs->genpoly[j - 1]; } /* rs->genpoly[0] can never be zero */ rs->genpoly[0] = rs->alpha_to[rs_modnn(rs, rs->index_of[rs->genpoly[0]] + root)]; } ``` 計算 generator polynomial 可是我看不懂為啥是這樣 ```c=148 /* convert rs->genpoly[] to index form for quicker encoding */ for (i = 0; i <= nroots; i++) rs->genpoly[i] = rs->index_of[rs->genpoly[i]]; rs->users = 1; list_add(&rs->list, &codec_list); return rs; err: kfree(rs->genpoly); kfree(rs->index_of); kfree(rs->alpha_to); kfree(rs); return NULL; } ``` :::success initial_codec 有點... ::: **encode** ```c=315 #ifdef CONFIG_REED_SOLOMON_ENC8 /** * encode_rs8 - Calculate the parity for data values (8bit data width) * @rsc: the rs control structure * @data: data field of a given type * @len: data length * @par: parity data, must be initialized by caller (usually all 0) * @invmsk: invert data mask (will be xored on data) * * The parity uses a uint16_t data type to enable * symbol size > 8. The calling code must take care of encoding of the * syndrome result for storage itself. */ int encode_rs8(struct rs_control *rsc, uint8_t *data, int len, uint16_t *par, uint16_t invmsk) { #include "encode_rs.c" } EXPORT_SYMBOL_GPL(encode_rs8); #endif ``` >原本只有看到 encode_rs.c 裡面一個 {......} 連函數名稱都沒有，原來可以直接這樣用 lib/reed_solomon/encode_rs.c ```c=12 { struct rs_codec *rs = rsc->codec; int i, j, pad; int nn = rs->nn; int nroots = rs->nroots; uint16_t *alpha_to = rs->alpha_to; uint16_t *index_of = rs->index_of; uint16_t *genpoly = rs->genpoly; uint16_t fb; uint16_t msk = (uint16_t) rs->nn; ``` >檢查輸入 >invmsk (invert data mask) XOR on data, not on parity >這裡用意為辨別輸入為 RS 8 或者 RS 16 利用此種 bitwise 方式操作 >par 為 parity data ```c=22 /* Check length parameter for validity */ pad = nn - nroots - len; if (pad < 0 || pad >= nn) return -ERANGE; for (i = 0; i < len; i++) { fb = index_of[((((uint16_t) data[i])^invmsk) & msk) ^ par[0]]; /* feedback term is non-zero */ if (fb != nn) { for (j = 1; j < nroots; j++) { par[j] ^= alpha_to[rs_modnn(rs, fb + genpoly[nroots - j])]; } } /* Shift */ memmove(&par[0], &par[1], sizeof(uint16_t) * (nroots - 1)); if (fb != nn) { par[nroots - 1] = alpha_to[rs_modnn(rs, fb + genpoly[0])]; } else { par[nroots - 1] = 0; } } return 0; } ``` ... ## 測驗 2 [Kth Ancestor of a Tree Node](https://leetcode.com/problems/kth-ancestor-of-a-tree-node/) >給你一棵樹，樹上有 n 個節點，編號自 0 到 `n−1`。樹以父節點陣列的形式提供，其中 `parent[i]` 是節點 i 的父節點。樹的根節點是編號為 0 的節點。 >請設計 `treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k)` 函式，回傳節點 node 的第 k 個祖先節點。若不存在這樣的祖先節點，回傳 -1。樹節點的第 k 個祖先節點是從該節點到根節點路徑上的第 k 個節點 ![](https://i.imgur.com/JI25RwH.png) input : ```c= ["TreeAncestor","getKthAncestor","getKthAncestor","getKthAncestor"] [[7,[-1,0,0,1,1,2,2]],[3,1],[5,2],[6,3]] ``` output ``` [null,1,0,-1] ``` C 程式實作建構一個題目所提供的程式可以有兩種方式 : 1. 建立 tree 的時候一起建立個 table 以記錄該點以及每點父節點，優點是存取速度快 O(1) ，而缺點是存在節點上限，且不能無限動態新增 2. structure 裡面新增 parent_node 指向父節點，但在找尋時候需要逐一向上找，時間複雜度 worst case 為 O(n) ，而平均情況為 O( log n ) 本實作使用第一種 : 建立記錄節點的 table 首先要考慮其大小，在已知 n 個節點，在 worst case 情況下會形成一條 list , 在最差情況下每個 node 對應的 table 需要 n-1 個紀錄空間大小，也就是說理論上我們需要建立 n * ( n-1 ) 個空間的 Table 但是考量到空間成本效益，除非是例如 Heap 這樣的資料結構，Table 空間利用率可能會很差，~~所以此處是利用 "只記錄第${2^x}$ 個祖先" 方式記錄在表中，剩下的透過計算算出的折衷方式。~~ ```c= #include <stdlib.h> typedef struct { int **parent; int max_level; } TreeAncestor; TreeAncestor *treeAncestorCreate(int n, int *parent, int parentSize) { TreeAncestor *obj = malloc(sizeof(TreeAncestor)); obj->parent = malloc(n * sizeof(int *)); //上述這行可以先確定 parent 為 a pointer to pointer to integer int max_level = 32 - __builtin_clz(n) + 1; for (int i = 0; i < n; i++) { obj->parent[i] = malloc(max_level * sizeof(int)); for (int j = 0; j < max_level; j++) obj->parent[i][j] = -1; } ``` 這邊可以想像我們創建一個 table 以記錄每個點的 parent ，而每個 node 可以記錄 max_level 個 parents 取得空間後先在每個欄位預設填入 -1 | node | parent 1 | parent 2 | parent 3 | | ---- | -------- | -------- | -------- | | 0 | XX | XX | XX | | 1 | XX | XX | XX | | 2 | XX | XX | XX | | ... | XX | XX | XX | | n | XX | XX | XX | --- ```c=20 for (int i = 0; i < parentSize; i++) obj->parent[i][0] = parent[i]; ``` 這邊將每個 node 所對應的 table 第一個位置填入第一個父節點 ```c=22 for (int j = 1;; j++) { int quit = 1; for (int i = 0; i < parentSize; i++) { obj->parent[i][j] = obj->parent[i][j + (-1)] == -1 ? -1 : obj->parent[obj->parent[i][j - 1]][j - 1]; if (obj->parent[i][j] != -1) quit = 0; } if (quit) break; } obj->max_level = max_level - 1; return obj; } ``` ~~22-31 for 迴圈中主要是填入每個節點的 ${2^j}$ 個parent node ~~ 值得注意 25 行 ```c=25 obj->parent[i][j] = obj->parent[i][j + (-1)] == -1 ? -1 : obj->parent[obj->parent[i][j - 1]][j - 1]; ``` ~~這邊是先檢查第 ${2^{j-1}}$ 個祖先存在與否~~ | 第${2^0}$個祖先 | 第${2^1}$個祖先 | 第${2^2}$個祖先 | 第${2^3}$個祖先 | 第${2^4}$個祖先 | | ---------- | ---------- | ---------- | ---------- | ---------- | --- | --- | --- | --- |:---:| --- | | XXX | XXX | XXX | XXX | XXX | ~~挺特別的設計為第 ${2^{x+1}}$ 個祖先就是第${2^x}$ 個祖先的第${2^x}$ 個祖先~~ 透過上述程式設計成為可以 O(1) 查詢的 Table ```c=35 int treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k) { int max_level = obj->max_level; for (int i = 0; i < max_level && node != -1; ++i) if (k & (1 << i)) node = obj->parent[node][i]; return node; } void treeAncestorFree(TreeAncestor *obj) { for (int i = 0; i < obj->n; i++) free(obj->parent[i]); free(obj->parent); free(obj); } ``` 注意到這裡反而覺得特別，treeAncestorGetKthAncestor() 直接取 Table 的值後來發現原本想法錯了，此處建立 table 時候會根據前面所建立的資料建立 table 所以並不需要親自探訪也就是說 treeAncestorCreate() 裡面我們可以想說有張表格 | 第 X 個節點 | parent[X][0] | parent[X][1] | | -------- | -------- | -------- | | 0 | -1| -1 | | 1 | 0 | -1 | | 2 | 0 | -1 | | 3 | 2 | -1 | treeAncestorCreate() 建立時就會知道要放在哪個父點 (int *parent) 下，並把父點紀錄在 parent[X][0] 中，當要建立下個資料時，其實會透過知道本身父點位置再去找尋本身父點，用自己本身的父點建立祖先 Table 的值，有點像可是又不是動態規劃的方式。透過這個方法可以讓 table 直接查對應值就可以達到找到的祖先點，而且確保為 Heap 結構情況時資料空間深度只需要 O(log n)，而且 create table 實作速度快 O(1) ### 延伸問題 1. 在 `treeAncestorCreate` 函式內部，若干記憶體被浪費，請撰寫完整測試程式碼，並透過工具分析首先在 leetcode 中 "可能" 建立 tree 時候就確定為 Heap 架構了，如果考慮 tree 的 worst case 情況應該需要 n*(n-1) 的空間大小，而如果資料結構只存在為 heap ，我們可以透過計算階層計算算出該節點對應表格位置，畢竟像是 `obj->parent[0][i] ` i=1,2,3,... 都只會是 -1 ，而且沒意義。原先 table | 第 X 個節點 | parent[X][0] | parent[X][1] | | -------- | -------- | -------- | | 0 | -1| -1 | | 1 | 0 | -1 | | 2 | 0 | -1 | | 3 | 1 | -1 | 我們可以想成 | Table[0] | Table[1] | Table[2] | Table[3] | Table[4] | Table[5] | Table[6] | Table[7] | | -------- | -------- | -------- | -------- | -------- | -------- | -------- |:--------:| | -1 | 0 | -1 | 0 | -1 | 1 | 0 | -1 | 以此類推該 x 節點資料位置為Table[${1+2*2+3*2*2+4*2*2*2...}$] 而讀取到 -1 表示無父點 2. 在 LeetCode 1483. Kth Ancestor of a Tree Node 提交你的實作，你應該要在 C 語言項目中，執行時間領先 75% 的提交者; 原先部分 ![](https://i.imgur.com/vFuXFWD.png) 其實也很明顯，最後的 "-1" 只是為了判斷停止點浪費了改動部分 : int max_level = 32 - __builtin_clz(n); ... for (int j = 1; j<max_level; j++) obj->max_level = max_level; 更改完程式碼: ```c= #include <stdlib.h> typedef struct { int **parent; int max_level; } TreeAncestor; TreeAncestor *treeAncestorCreate(int n, int *parent, int parentSize) { TreeAncestor *obj = malloc(sizeof(TreeAncestor)); obj->parent = malloc(n * sizeof(int *)); //上述這行可以先確定 parent 為 a pointer to pointer to integer int max_level = 32 - __builtin_clz(n); for (int i = 0; i < n; i++) { obj->parent[i] = malloc(max_level * sizeof(int)); for (int j = 0; j < max_level; j++) obj->parent[i][j] = -1; } for (int i = 0; i < parentSize; i++) obj->parent[i][0] = parent[i]; for (int j = 1; j<max_level; j++) { int quit = 1; for (int i = 0; i < parentSize; i++) { obj->parent[i][j] = obj->parent[i][j + (-1)] == -1 ? -1 : obj->parent[obj->parent[i][j - 1]][j - 1]; if (obj->parent[i][j] != -1) quit = 0; } if (quit) break; } obj->max_level = max_level; return obj; } int treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k) { int max_level = obj->max_level; for (int i = 0; i < max_level && node != -1; ++i) if (k & (1 << i)) node = obj->parent[node][i]; return node; } void treeAncestorFree(TreeAncestor *obj) { for (int i = 0; i < obj->max_level; i++) free(obj->parent[i]); free(obj->parent); free(obj); } ``` ![](https://i.imgur.com/PoGtdcl.png) 改善空間結果連同時間一起改善了，此生或許再無悲喜 ## 測驗 3 [Fizz buzz](https://en.wikipedia.org/wiki/Fizz_buzz) 從 1 數到 n，如果是 3的倍數，印出 “Fizz” 如果是 5 的倍數，印出 “Buzz” 如果是 15 的倍數，印出 “FizzBuzz” 如果都不是，就印出數字本身 ```c = #define MSG_LEN 8 char fmt[MSG_LEN + 1]; strncpy(fmt, &"FizzBuzz%u"[start], length); fmt[length] = '\0'; printf(fmt, i); printf("\n"); ``` 我們若能精準從給定輸入 i 的規律去控制 start 及 length，即可符合 FizzBuzz 題意: ``` string literal: "fizzbuzz%u" offset: 0 4 8 ``` bitwise.c ```c= #define MSG_LEN 8 static inline bool is_divisible(uint32_t n, uint64_t M) { return n * M <= M - 1; } static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1; static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1; int main(int argc, char *argv[]) { for (size_t i = 1; i <= 100; i++) { uint8_t div3 = is_divisible(i, M3); uint8_t div5 = is_divisible(i, M5); unsigned int length = (2 << div3) << div5; char fmt[MSG_LEN + 1]; strncpy(fmt, &"FizzBuzz%u"[(MSG_LEN >> div5) >> (div3 << 2)], length); fmt[length] = '\0'; printf(fmt, i); printf("\n"); } return 0; } ``` ( FizzBuzz optimization)[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853] 分析 : i div3 div5 > length ``` 1 0 0 > 2 2 0 0 > 2 3 1 0 > 4 4 0 0 > 2 5 0 1 > 4 6 1 0 > 4 7 0 0 > 2 8 0 0 > 2 9 1 0 > 4 10 0 1 > 4 11 0 0 > 2 12 1 0 > 4 13 0 0 > 2 14 0 0 > 2 15 1 1 > 8 ``` length 決定該輸出字串長度 "%u" : 2 "fizz" : 4 "buzz" : 4 "fizzbuzz" : 8 故再來希望知道 " 從哪個位置開始 " `&"FizzBuzz%u" [0]` : 若 length = 4 => "Fizz" 若 length = 8 => "FizzBuzz" `&"FizzBuzz%u" [4]` : 若 length = 4 => "Buzz" `&"FizzBuzz%u" [8]` : 若 length = 2 => "%u" 考慮程式碼中，對於字串 "fizzbuzz%u" 的位移 ```c= &"FizzBuzz%u"[(MSG_LEN >> div5) >> (div3 << 2)] ``` 1. 非 3,5 倍數，offset = 8 (MSG_LEN >> 0) >> (0 << 2) = MSG_LEN >> 0 = 8 2. 為 5 倍數，但是非 3 的倍數，offset = 4 (MSG_LEN >> 1) >> (0 << 2) = MSG_LEN >> 1 = 4 3. 為 3 倍數，但是非 5 的倍數，offset = 4 (MSG_LEN >> 0) >> (1 << 2) = MSG_LEN >> 4 = 0 再透過 length 判斷是否為 15 倍數 ### 延伸問題 1. 評估 naive.c 和 bitwise.c 效能落差 fizzbuzz count 測試每次讓 fizzbuzz 判斷 1000 數 (0-999) 各測試 100 次 naive.c: ![](https://i.imgur.com/7sE3joo.png) bitwise.c: ![](https://i.imgur.com/6y3PuJW.png) 挺意外的其實都差不多一樣，就算算更多次的數字其實也都差不多。使用 bitwise 造成預期的效能提升並沒有出現。 2. 分析 [Faster remainders when the divisor is a constant: beating compilers and libdivide](https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/) 一文的想法 (可參照同一篇網誌下方的評論)，並設計另一種 bitmask，如「可被 3 整除則末位設為 1」「可被 5 整除則倒數第二位設定為 1」，然後再改寫 bitwise.c 程式碼，試圖運用更少的指令來實作出 branchless; 參照[fastmod: A header file for fast 32-bit division remainders on 64-bit hardware](https://github.com/lemire/fastmod) ## 測驗 4 offset / PAGE_QUEUES 考慮到整數 PAGE_QUEUES 可能有以下數值 : > 1. (1 or 2 or 3 or 4) > 2. (5 or 6 or 7 or 8) × (${2^n}$), n from 1 to 14 給定 size_t offset，試著將原本運算 : ``` #include <stdint.h> size_t blockidx = offset / PAGE_QUEUES; ``` 因為 PAGE_QUEUES 的數值最大可能是 8*${2^n}$ 所以若不進行化簡而直接做整數除法的話會運算的成本，因此我們可以使用 __builtin_ffs 將被除數與除數先行化簡由於 PAGE_QUEUES 的數值分佈已知，在整數除法時，可思考加速運算的機制。需要注意，某些硬體平台缺乏整數除法指令 (如 Arm Cortex-A8)，即使 Intel 公司出品的 Core 處裡器 Sandy Bridge 微架構中，針對 64 位元整數的除法運算，會佔用 40 到 103 個處理器週期，運算成本仍屬沈重。我們可預先進行計算，從而避免整數除法指令的使用: (假設在 x86_64 硬體執行 GNU/Linux 並透過 gcc/clang 編譯) ```c= #include <stdint.h> #define ffs32(x) ((__builtin_ffs(x)) - 1) size_t blockidx; size_t divident = PAGE_QUEUES; blockidx = offset >> ffs32(divident); divident >>= ffs32(divident); switch (divident) { CASES } ``` ```c= case 2: blockidx /= 2; break; ``` 這樣的組合，請用最少的 case-break 敘述做出同等 `size_t blockidx = offset / PAGE_QUEUES;` 效果的程式碼。 >— Built-in Function: >int __builtin_ffs (unsigned int x) >Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero. ```c= for(int i=0;i<10;i++) printf("__builtin_ffs (%d) = %d \n ",i,__builtin_ffs (i)); ``` 先從 gcc 擴充函數了解 : ``` __builtin_ffs (0) = 0 __builtin_ffs (1) = 1 __builtin_ffs (2) = 2 __builtin_ffs (3) = 1 __builtin_ffs (4) = 3 __builtin_ffs (5) = 1 __builtin_ffs (6) = 2 __builtin_ffs (7) = 1 __builtin_ffs (8) = 4 __builtin_ffs (9) = 1 ``` 我們可以知道最後幾個 0 也就是可以得知該數含有 ${2^n}$ 的因數這題目與 [Quiz 3 Question 4](https://hackmd.io/0dQsfZYVQaCf9H0_W2weQA#%E6%B8%AC%E9%A9%97-4) 觀念很像，再計算時可以先提取容易觀察計算的 ${2^n}$ | divident | binary | after __builtin_ffs(divident) | |:--------:|:------:|:-----------------------------:| | 1 | 0001 | 0001 = 1 | | 2 | 0010 | 0001 = 1 | | 3 | 0011 | 0011 = 3 | | 4 | 0100 | 0001 = 1 | | 5 | 0101 | 0101 = 5 | | 6 | 0110 | 0011 = 3 | | 7 | 0111 | 0111 = 7 | | 8 | 1000 | 0001 = 1 | 列舉每種可能之後會發現其實需要額外判斷的 case 其實只有 3, 5, 7 三數而已 ```c= case 3: blockidx /= 3; break; case 5: blockidx /= 5; break; case 7: blockidx /= 7; break; ``` ### 延伸問題 1. 解釋程式運算原理，可搭配 Microsoft Research 的專案 [snmalloc](https://github.com/microsoft/snmalloc) 來解說對應的原始程式碼 src/mem/sizeclass.h 2. 練習 LeetCode 51. [N-Queens](https://leetcode.com/problems/n-queens/)，應選擇 C 語言並善用上述的 __builtin_ffs，大幅降低記憶體開銷 N-Queen Problem 算是挺經典的問題，不過過去一直沒有機會碰到關於此題全部整理在 [YNQueens](https://hackmd.io/@YLowy/r1wIe9-DD)