2018q3 Homework6

# 2018q3 Homework6 contributed by < `TerryShu` > 、 < `ofAlpaca` > ###### tags: `CSIE5006` `HW` ## [Huffman Coding](https://hackmd.io/s/S1azdiUiX#) 此程式碼可以分為兩個部分，第一是**建立 Min Heap** ，第二是**建立 Huffman Tree**。 ### 建立 Min Heap * **Min Heap** : 父節點的值會小於下方 subtree 的所有節點的值，且 root 為所有節點中最小值。 * 步驟大致上可分為 : 1. 計算文件中各字元出現的次數，此紀錄會存在 `dictionary` 中。其中需要注意的是 `dictionary` 結構並非單純的陣列而已，它是**以字元的 ASCII 編碼**來作為 index 存入對應的位置，換句話說，如果讀到 ABC 這三個字元，則會分別會增加 `dictionary[65]` 、 `dictionary[66]` 、 `dictionary[67]` 的 weight ，其餘位置的 weight 則為 0 。 **附註 : `calloc` 會對陣列做初始化，所以沒用到的 node_t 的 weight 會是 0 ，而非 garbage value 。** ![](https://i.imgur.com/yVozVJ1.png) ```clike= min_heap_t *scan_file(FILE *in) { node_t *dictionary = calloc(ASCII_SIZE, sizeof(node_t)); min_heap_t *heap = calloc(1, sizeof(min_heap_t)); int ch; for (;;) { if ((ch = fgetc(in)) == EOF) break; dictionary[ch].weight++; printf("%d\n", ch); } for (ch = 0; ch < ASCII_SIZE; ch++) { if (dictionary[ch].weight == 0) continue; dictionary[ch].data = ch; add_to_heap(&(dictionary[ch]), heap); } rewind(in); return heap; } ``` 2. 建立好 `dictionary` table 後，會逐一加入 weight 不為 0 的 ASCII 字元來建立 Min Heap 。最後生成的 Min Heap 的 root 為最小值，父節點的值小於子節點的。 * 使用一維陣列來表示 Binary Heap : ![](https://i.imgur.com/afStepN.png) * 下方迴圈負責**維護 Min Heap 的特性**，每次加入新節點時都去檢查其 weight 是否小於前面節點的，如果是則往前交換，換到有更小的 weight 出現為止，或是到達 root 。 ```clike= static void add_to_heap(node_t *to, min_heap_t *heap) { int pos = heap->size++; heap->array[pos] = to; if (heap->size > 2) { // XXE while ((heap->array[pos - 1])->weight > (heap->array[pos])->weight) { swap_node(&(heap->array[pos - 1]), &(heap->array[pos])); if (--pos == 0) break; } } } ``` :::info **問題**:question: 第四行有個 `XXE` 的填空題，答案為 `heap->size > 2` ，看到這裡會想，為什麼 size 為 2 的時候不用檢查，那如果今天只有兩個節點，先加入的 weight 較大，後加入的 weight 較小，如果後者不往前交換不就違背了 Min Heap 定義嗎 ? 如果前後兩個節點的 weight 相差極大的話，還會影響壓縮率。 ::: **實驗** : 比較 `heap->size > XXE` 中的 `XXE` 為 2 或 1 時的壓縮率。 * 測試文件 input 的內容 : ``` AAAAAAAAAAAAAAAAABBCC ``` * 如果 XXE = 2 : ``` $ ./huff input C : 0 A : 10 B : 11 Compressing... Achieved 71.428571 % compression. ``` * 圖解： ![](https://i.imgur.com/9DFxKLq.png) * 如果 XXE = 1 : ``` $ ./huff input B : 00 C : 01 A : 1 Compressing... Achieved 80.952381 % compression. ``` * 圖解： ![](https://i.imgur.com/z5EfTSy.png) * 經過測試後發現當 `XXE` 為 1 時壓縮率反而優於 2 ，因為當 `XXE` 為 2 時所求出的 Huffman tree 不正確，照理來說，出現頻率越高的字元應該有更短的 Huffman code 。 --- ### 建立 Huffman Tree * 可分為兩步驟，一是 `build_hufftree` ，二是 `build_pairings` : 1. `build_hufftree` : * 一開始的 Min Heap 。 ![](https://i.imgur.com/qdpLEBJ.png) * 先將 Min Heap 中最小的前兩個節點合而為一，並再重新加入 Heap 中。這裡的最小前兩個節點是指 `array[0]` 與 `array[1]` ，前者為 lighter node(left node) ，後者為 heavier node(right node) 。 ![](https://i.imgur.com/PopDotC.png) * 為維護 Min Heap ，故往上交換。 ![](https://i.imgur.com/H68PnFT.png) * 因為已經將前兩小的節點結合，所以需要將原本位置的他們移除，故做 `shift_up_2()` 也就是將所有節點往上移兩個單位。 ![](https://i.imgur.com/hnofHAA.png) * `shift_up_2()` 第八行有 `XXC` 的填空題，因為所有節點向上位移兩個單位，所以 size 要減少 2 。 ```clike= static void shift_up_2(min_heap_t *heap) { int i = 0; while (i < heap->size) { (heap->array)[i] = (heap->array)[i + 2]; (heap->array)[i + 1] = (heap->array)[i + 3]; i++; } heap->size -= 2; // XXC } ``` * 重複同樣步驟，最後的 `array[0]` 即是 Huffman tree 的 root 。 ![](https://i.imgur.com/KYo0pBD.png) * 第六行有 `XXD` 的填空題，根據上述可得答案為 `array[0]` 。 * Huffman tree 的建置到此告一段落。 ```clike= static node_t *build_hufftree(min_heap_t *heap) { while (heap->size > 1) { add_to_heap(combine_node_ts(heap->array[0], heap->array[1]), heap); shift_up_2(heap); } return heap->array[0]; // XXD } ``` 2. `build_pairings` : * 這部分負責產生編碼用的 table ，換句話說，就是要找出所有 ASCII 字元符號相對應的 Huffman code 。 * `build_pariings` 函式從 root 開始遞迴呼叫左子樹與右子樹，在進入遞迴前會先記錄進入子樹的方向，往左子樹則紀錄 0 ，往右子樹則紀錄 1 ，這些記錄會寫在陣列 `array` 上， `top` 為此陣列的高度，代表目前節點的 depth。 * 第四、八行有 `XXA` 與 `XXB` 的填空題，答案都為 `build_pairings(root->right, arr, top + 1, pairs)` ，只需要增加陣列的高度 `top` 即可， `arr` 是不用去增加的。 * 當遞迴至左子樹與右子樹都為 `null` 的節點時，代表此節點為葉節點，可以輸出某字元的 Huffman code 至 `pairs` 的陣列中。 * `pairs` 陣列與 `dictionary` 相似，都是以字元的 ASCII 編碼來作為 index 。 * 最後的 `pairs` 會是一個 ASCII 編碼對應 Huffman code 的 table ，接下來做 encode 時只需要將文件中的各字元轉換為 Huffman code 即可。 ```clike= void build_pairings(node_t *root, int arr[], int top, pair_t *pairs) { if (root->left) { arr[top] = 0; build_pairings(root->left, arr, top + 1, pairs); // XXA } if (root->right) { arr[top] = 1; build_pairings(root->right, arr, top + 1, pairs); // XXB } if (!(root->left) && !(root->right)) { uint8_t index = root->data; for (int i = 0; i < top; i++) pairs[index].arr[i] = arr[i]; pairs[index].length = top; pairs[index].data = root->data; } } ``` --- ### Encode & Decode * Encode * `encode` : * 一開始先讀入一個 `character` * 接著在建立好的 ASCII 配對表 `pairs` 中找到其對應的 `Huffman code` * 並用填一 bit ，左移一位的方式填入 `buffer` 中 * 最後當填完 8 bit (1 byte) 時，透過 `stream` 寫入檔案中 * 則最後一位若不足 8 bit 則左移直到填滿 8 bit 為止 ```clike= void encode(FILE *in, FILE *out, pair_t *pairs) { int curr_size = 0; unsigned buffer = 0; for (;;) { int ch = fgetc(in); if (ch == EOF) break; int i = 0; while (i++ < pairs[ch].length) { buffer <<= 1; buffer += (pairs[ch].arr)[i]; curr_size++; if (curr_size == 8) { printf("%d\n",buffer); fwrite(&buffer, 1, 1, out); curr_size = 0; buffer = 0; } } } while (curr_size < 8) { buffer <<= 1; curr_size++; } rewind(in); printf("%d\n",buffer); fwrite(&buffer, 1, 1, out); fclose(out); } ``` :::info **Bug** :question: 此處提供之 `source code` 應該有誤第 10 行的判斷條件 ```clike while (i++ < pairs[ch].length) ``` 此寫法會先用 `i` 去跟 `pairs[ch].length` 比較是否小於之後做 `++` 這個動作會導致第一次進入迴圈時 i 的值就是 `1` 則後續取 `Huffman code` 時會出現錯誤 ```clike= buffer += (pairs[ch].arr)[i]; // should start from 0 not 1 ``` 故將上面程式碼中的 10~19 行的 `while` 改成以下 ```clike= while (i < pairs[ch].length) { buffer <<= 1; buffer += (pairs[ch].arr)[i]; curr_size++; if (curr_size == 8) { printf("%d\n",buffer); fwrite(&buffer, 1, 1, out); curr_size = 0; buffer = 0; } i++; } ``` ::: * Decode * `decode` : 利用走訪 MinHeap 達到 Decode 要求 * 一開始從要解密檔案的 `stream` 內讀入一個 `byte` * 接著將 `byte` 轉換為 `bit` 遵循以下規則走訪 `MinHeap` 1. 當 `bit` 為 `1` 時往右 `0` 時往左 2. 若走到的節點為 `leaf node` 則為結果 * 當解碼出來的字是 `EOF` 時表示完成解碼 ```clike void decode(FILE *in, min_heap_t *heap) { unsigned char buffer; long filelen; fseek(in, 0, SEEK_END); // Jump to the end of the file filelen = ftell(in); // Get the current byte offset in the file rewind(in); // Jump back to the beginning of the file node_t *root = heap->array[0]; node_t *temp = root; bool stop = false; printf("Decoding...\nPlaintext:\n"); for (int i = 0; i < filelen; i++) { fread(&buffer, 1, 1, in); // Read One Byte for (int j = 0; j < 8; j++) { unsigned first_bit = buffer >> 7; buffer <<= 1; if (first_bit) temp = temp->right; else temp = temp->left; if (((int)temp->data)) { if (((int)temp->data) == ASCII_EOF) { stop = true; break; } printf("%c", temp->data); temp = root; } } if (stop) { printf("\n"); break; } } } ``` :::info **問題** :question: 一開始的 `Encode` 並無加入一些控制字元如 `EOF` 則有可能出現以下狀況 * 輸入內容 : ``` AAAABBCC ``` * 經過編碼後 ``` 編碼表 : A : 1 B : 00 C : 01 編碼 : 11110000 0101(0000) ↑ 補0 ``` * 解碼 ``` 11110000 01010000 AAAABBCCBB ``` 會發現最後面的補0 會導致錯誤所以解決方法是補上一個 `EOF` 符號一起去建立 `Huffman code` 則可解決問題修改後 * 輸入內容 ``` AAAABBCC(EOF) ``` * 經過編碼後 ``` 編碼表 : A : 1 B : 011 C : 00 EOF : 010 編碼 : 11110110 11000001 0(0000000) ↑ 補0 ``` * 解碼 ``` 11110110 11000001 00000000 AAAABBCC(EOF) 當 EOF 出現後後面就不用再進行解碼 ``` 經過加入一個 EOF 去處理結尾的問題後發現有可能會因為新增了一個新的 node 導致壓縮率變差但若不加則無法保證正確性 ::: --- ## 延伸題目 1. 回顧 dict 作業，提出資料壓縮的方案； * 深入理解 dict 中所建構的 ternary tree 後可以發現其建立的 tree 並非如 [Ternary Search Tree Visualization](https://www.cs.usfca.edu/~galles/visualization/TST.html) 裡頭所顯示的那樣，以綠色標記出一個單字的結尾， **[dict](https://hackmd.io/s/B1Bq5Jat7) 是以空白的節點來當作單字的結尾**，比較如下圖 : (輸入順序 `CODE` -> `CODES`) * 原本所認知的 Ternary tree : ![](https://i.imgur.com/hnf5JkR.png) * dict 中所產生的 Ternary tree : * `S` 節點為何會連向上個結尾節點的 right child ? 那是因為當 `CODES` 比較到 `S` 時是和空白節點比 key value ，因為 `S` 的 key value 大於空白，所以會往 `high child` 移動。 ![](https://i.imgur.com/EBms1YM.png) * 但是更好玩的還在後頭，如果改變插入順序是先 `CODES` 再 `CODE` ，就會變成下圖。當 `CODE` 單字在 tree 中找完後會停留在 `S` ，接著由空白字元與 `S` 節點的 key value 做比較，想當然空白字元會較小，所以接下來才會往 `left child` 移動。 ![](https://i.imgur.com/FT6ANeW.png) * 順帶一提，每個結尾節點的 `middle child` 還會指向當時 input 的那個單字，所以整體的資料結構應該如下。 ![](https://i.imgur.com/KWIkjTG.png) * 壓縮方法 : 由上圖可以知道總共花了 7 個節點來建立整棵樹，如果把 `C` 到 `E` 的部分合成一個節點，那就只需要 4 個節點就可以建立整棵樹，藉此達到壓縮的目的。 ![](https://i.imgur.com/lIfmi3J.png) * 實作方面需要先從資料結構做修改，將原本的 key value 改為 pointer to character ，好讓節點的 key value 可以儲存超過一個的 key 。 ```clike= typedef struct tst_node { char *key; unsigned refcnt; struct tst_node *lokid; /* ternary low child pointer */ struct tst_node *eqkid; /* ternary equal child pointer */ struct tst_node *hikid; /* ternary high child pointer */ } tst_node; ``` * 接下來開始對建構出來的 ternary tree 做壓縮，其步驟可分為下 : 1. 由 `root` 開始往下移動直到**有分岔或是結尾的節點**出現 : * 有分岔的節點意思是其 `left child` 與 `right child` 不為 `NULL` ，這代表這些單字的最長 prefix 就此結束，所以可壓縮節點到此為止。 * 結尾節點出現代表說這些單字的 prefix 也是一個單字，所以此單字可以壓縮成一個節點。 2. 將這些節點的 key value 取出並寫入至第一個節點的 key value ，如此一來在壓縮 tree 後才能保有原本的 key value 。 ![](https://i.imgur.com/ac4OnnQ.png) 3. 把除了第一個節點外的節點刪除，並將第一個節點的 `middle child` 指向分岔或結尾的節點。 ![](https://i.imgur.com/0sMmaOo.png) 4. 就此，已經壓縮完從 `root` 到分岔或結尾前的節點，接下來從分岔或結尾的節點的 child 開始做 DFS 遍歷，重複上述步驟，遍歷完整棵樹後即停止。如果遍歷途中遇到結尾節點則不走訪其 `middle child` ，因為結尾節點的 `middle child` 都是指向一個字串。 ![](https://i.imgur.com/G4c362a.png) * 較多節點的例子，從原本的 15 個節點壓縮成 11 個節點。 ![](https://i.imgur.com/Bh9jihj.png) ```c= void tst_compress(tst_node **p) { if (!(*p)) return; char buffer[BUFFER_SIZE]; memset(buffer, '\0', BUFFER_SIZE); int i = 0, free_cnt = 0; tst_node *root = *p; tst_node *curp = *p; // find the longest prefix and collect it while ((curp->hikid == NULL) && (curp->lokid == NULL) && *(curp->key)) { buffer[i++] = curp->key[0]; tst_node *free_ptr = curp; curp = curp->eqkid; if (root != free_ptr) { free(free_ptr->key); free(free_ptr); free_cnt++; } } // No released node,then no need to concat key value if (free_cnt > 0) { root->eqkid = curp; free(root->key); root->key = strndup(buffer, i); } tst_compress(&(curp->lokid)); if (curp->key[0]) tst_compress(&(curp->eqkid)); tst_compress(&(curp->hikid)); } ``` --- 2. 實作程式正確驗證和效能分析機制; * **完整性驗證** : 經過壓縮後的樹是否有遺失或增加單字 * 驗證方法 : 使用 DFS 尋訪原本的樹與壓縮後的樹，一旦找到了「單字」，就將其寫入各自的檔案，最後使用 md5 hash function 去驗證兩檔案是否擁有一樣的 hash value ，有則表示壓縮後的樹是完整的。 * 此函式會透過 DFS 尋訪各節點，找到「單字」的結尾節點後就寫入檔案。 ```c= void tst_traverse_seq(const tst_node *p, char arr[], int top, FILE *fp) { if (!p) return; tst_traverse_seq(p->lokid, arr, top, fp); if (p->key[0]) { int size = strlen(p->key); strncpy(arr + top, p->key, size); tst_traverse_seq(p->eqkid, arr, top + size, fp); } else { arr[top] = '\n'; fwrite(arr, top + 1, 1, fp); } tst_traverse_seq(p->hikid, arr, top, fp); } ``` * 執行 `test_tst` 後會產生 `raw_data.txt` 紀錄著原本樹的走訪順序，與 `compress_data.txt` 紀錄壓縮後樹的走訪順序，比較兩者經過 md5 的 hash value 即可得知是否有遺失或增加任何單字。 * 就結果而言，兩者的 hash value 皆為 `99cfdb4c44ea28808cdbc41eccc1d2ea` ，故壓縮後的樹應該是完整的。 ``` $ md5sum raw_data.txt 99cfdb4c44ea28808cdbc41eccc1d2ea raw_data.txt $ md5sum compress_data.txt 99cfdb4c44ea28808cdbc41eccc1d2ea compress_data.txt ``` * **壓縮率** : * 計算方法 : 建制原本的 Ternary tree 與壓縮後的樹，計算各自花費的節點與其內部 key value 所 malloc 的空間，以下為計算方式： $Compress\ ratio = 100 - ((\dfrac{compressed\ tree\ size} {original\ tree\ size})\cdot100)$ * 結果 : 計算結果為壓縮了 `51.763%` ，但尚未找到能正確算出耗費記憶體多寡的方法。 ``` $ ./test_tst ternary_tree, loaded 259112 words in 0.151419 sec Compressing ternary_tree, compressed tree in 0.026532 sec Compress ratio : 51.762904 % ``` * **CPY 與 REF 差異** * 前面的結果都是使用 `test_ref.c` 的方式來配置記憶體( memory pool )，以下測試為比較使用一般 `malloc` 與 memory pool 的差別。 * 使用 memory pool : ``` $ perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./test_tst -s Performance counter stats for './test_tst -s' (100 runs): 5,550,240 cache-misses # 58.325 % of all cache refs ( +- 0.12% ) 9,516,017 cache-references ( +- 0.06% ) 988,553,351 instructions # 0.86 insn per cycle ( +- 0.00% ) 1,142,911,602 cycles ( +- 0.08% ) 0.310643358 seconds time elapsed ( +- 0.09% ) ``` * 使用一般 `malloc` ``` $ perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./test_tst_cpy -s Performance counter stats for './test_tst_cpy -s' (100 runs): 5,935,431 cache-misses # 61.311 % of all cache refs ( +- 0.43% ) 9,680,913 cache-references ( +- 0.24% ) 1,016,060,042 instructions # 0.84 insn per cycle ( +- 0.01% ) 1,207,350,409 cycles ( +- 0.34% ) 0.332096963 seconds time elapsed ( +- 0.43% ) ``` * 由上圖可見，使用 memory pool 在各方面都比使用一般 `malloc` 的稍微好一點，原因應該如 [dict](https://hackmd.io/s/r1eSxvo5m) 中所述的和 data locality 有關。 --- ## 參考資料 [Huffman Encoding and Data Compression](https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1126/handouts/220%20Huffman%20Encoding.pdf) [ASCII WIKI](https://en.wikipedia.org/wiki/ASCII)