重做 quiz3 測驗二的延伸問題，實作資料壓縮器並提出改進 x-compressor 的方案

# 重做 quiz3 測驗二的延伸問題，實作資料壓縮器並提出改進 x-compressor 的方案 contributed by < `blueskyson` > ###### tags: `linux2020` ## [quiz3](https://hackmd.io/@sysprog/2020-quiz3) 測驗二的延伸問題在 [C 語言：數值系統篇](https://hackmd.io/@sysprog/c-numerics) 中，我們介紹過 power of 2 (漢字可寫作「二的冪」)。 [GCC extension](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 其中兩個是 ctz 和 clz: >Built-in Function: `int __builtin_ctz (unsigned int x)` >- Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined. >Built-in Function: `int __builtin_clz (unsigned int x)` >- Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. 我們可用 `__builtin_ctz` 來實作 [LeetCode 342. Power of Four](https://leetcode.com/problems/power-of-four/) ，假設 int 為 32 位元寬度。實作程式碼如下: ```cpp= bool isPowerOfFour(int num) { return num > 0 && (num & (num - 1))==0 && !(__builtin_ctz(num) OPQ); } ``` ### 1. 解釋上述程式運作原理首先第一個判斷條件是 `num > 0` 不用多作解釋。因為 2 的冪次有以下特性: - 2 = `00010`, (2 - 1) = `00001`, 2 & (2 - 1) = `0` - 4 = `00100`, (4 - 1) = `00011`, 4 & (4 - 1) = `0` - 8 = `01000`, (8 - 1) = `00111`, 8 & (8 - 1) = `0` - 16 = `10000`, (16 - 1) = `01111`, 16 & (16 - 1) = `0` 所以 `(num & (num - 1))==0` 可用來判定 2 的冪次最後，我們需要從所有 2 的冪次中挑出 4 的冪次。觀察上面 2 的冪次的規律，可以發現只要是 4 的冪次，其 trailing 0-bits 的個數都為偶數，如: 4 的 trailing 0-bits 有 2 個、 16 的 trailing 0-bits 有 4 個。如果 `num` 的 trailing 0-bits 為偶數，則 `!(__builtin_ctz(num) OPQ)` 回傳 `TRUE` ，很明顯的 ==OPQ== 為 `& 0x1` 。 ### 2. 改寫上述程式碼，提供等價功能的不同實作，儘量降低 branch 的數量首先使用 perf 分析本題解法的 branch 數量，我用以下程式碼做測試 ```c= #include <stdbool.h> #include <stdlib.h> bool isPowerOfFour(int num) { return num > 0 && (num & (num - 1))==0 && !(__builtin_ctz(num) & 0x1); } int main(int argc, char *argv[]) { int n = atoi(argv[1]); isPowerOfFour(n); return 0; } ``` 編譯，並將執行檔命名為 `pow4` 後使用以下腳本計算 branch 數量: ```non $ cat test.sh for ((i = 0; i < 32; i++)) do ((num = 1<<i)) perf stat -r 10 -e branches -o "perf/$num" ./pow4 $num grep branches "perf/$num" >> branch.txt done ``` 執行完這份 shell script ，會得到一份 branch.txt ，紀錄 $2^{0}$, $2^{1}$,..., $2^{31}$ 所用的 branch 數量，可以觀察出原版程式所產生的 branch 數量介於 135000 至 138000 之間: ``` $ tr -d ' ,' < branch.txt 137565branches(+-0.53%) 137251branches(+-0.41%) 135862branches(+-0.53%) 136473branches(+-0.76%) 136823branches(+-0.50%) 136616branches(+-0.55%) 136150branches(+-0.56%) 137170branches(+-0.53%) 136848branches(+-0.34%) 137641branches(+-0.50%) 136450branches(+-0.56%) 135561branches(+-0.38%) 135748branches(+-0.40%) 136889branches(+-0.36%) 137397branches(+-0.45%) 138100branches(+-0.64%) 137186branches(+-0.50%) 135619branches(+-0.51%) 136723branches(+-0.46%) 137738branches(+-0.37%) 135568branches(+-0.61%) 136818branches(+-0.37%) 136434branches(+-0.51%) 135632branches(+-0.45%) 136480branches(+-0.56%) 137798branches(+-0.48%) 136101branches(+-0.35%) 136059branches(+-0.43%) 136315branches(+-0.57%) 136510branches(+-0.64%) 135684branches(+-0.42%) ``` 接下來我寫了一個等價功能的實作: 首先檢查 `num` 中為 1 的 bit 的數量是否為 1 ，然後檢查 trailing zero 的數量是否為偶數，如果兩個條件都符合就回傳 `true`。 ```c= #include <stdbool.h> #include <stdlib.h> bool isPowerOfFour(int num) { return __builtin_popcount(num) == 1 && num & 0b01010101010101010101010101010101; } int main(int argc, char *argv[]) { int n = atoi(argv[1]); isPowerOfFour(n); return 0; } ``` 一樣編譯成 `pow4` 後透過 shell script 執行，印出檔案，平均 branch 數量介於 135700 至 138000 之間，沒有把 branch 數量壓得更低: ``` 136751branches(+-0.50%) 136739branches(+-0.52%) 137065branches(+-0.54%) 138113branches(+-0.45%) 136818branches(+-0.44%) 137006branches(+-0.38%) 136894branches(+-0.51%) 137206branches(+-0.51%) 135915branches(+-0.38%) 137785branches(+-0.53%) 137061branches(+-0.42%) 135992branches(+-0.49%) 136021branches(+-0.39%) 137311branches(+-0.53%) 136040branches(+-0.35%) 137076branches(+-0.47%) 137580branches(+-0.37%) 136060branches(+-0.39%) 136346branches(+-0.44%) 137884branches(+-0.53%) 137547branches(+-0.39%) 136204branches(+-0.40%) 136971branches(+-0.44%) 136752branches(+-0.55%) 137632branches(+-0.52%) 137220branches(+-0.51%) 137152branches(+-0.60%) 136830branches(+-0.47%) 136015branches(+-0.44%) 136820branches(+-0.27%) 135756branches(+-0.45%) ``` ## 練習 [LeetCode 1009. Complement of Base 10 Integer](https://leetcode.com/problems/complement-of-base-10-integer/) ### 題目描述 Every non-negative integer `N` has a binary representation. For example, `5` can be represented as `"101"` in binary, `11` as `"1011"` in binary, and so on. Note that except for `N = 0`, there are no leading zeroes in any binary representation. The complement of a binary representation is the number in binary you get when changing every `1` to a `0` and `0` to a `1`. For example, the complement of `"101"` in binary is `"010"` in binary. For a given number `N` in base-10, return the complement of it's binary representation as a base-10 integer. **Example 1:** **Input:** `5` **Output:** `2` **Explanation:** `5` is `"101"` in binary, with complement `"010"` in binary, which is `2` in base-10. **Example 2:** **Input:** `7` **Output:** `0` **Explanation:** `7` is `"111"` in binary, with complement `"000"` in binary, which is `0` in base-10. **Example 3:** **Input:** `10` **Output:** `5` **Explanation:** `10` is `"1010"` in binary, with complement `"0101"` in binary, which is `5` in base-10. ### 解法 ```c= int bitwiseComplement(int N){ if (N == 0) return 1; unsigned int mask = 0xffffffff << (32 - __builtin_clz(N)); return ~(N | mask); } ``` ### 思路以 `N = 5` 為例，其二進制表示為 `"101"` ，我想要求得 `5` 的補數，也就是 `"010"` 然而在 32 位元的 `int` 中， `5` 會表示成 `"00...00101"` ，直接對其取補數會得到 `"11...11010"` ，其二補數表示為 `-6` 不是我們想要的結果所以我要做的事為: 將 N 的 leading zero 全部變成 `1` ，再進行取補數，也就是將 `"00...00101"` 變換為 `"11...11101"` 然後反轉為 `"010"` 即為正確答案逐步解釋程式碼的意義: ``` Suppose N = 5 then: 1. s = 32 - __builtin_clz(N) = 3 2. mask = 0xffffffff << s = "11111111111111111111111111111000" 3. N | mask = "11111111111111111111111111111101" 4. ~(N | mask) = "00000000000000000000000000000010" #solution ``` ### 提交 ![](https://i.imgur.com/1TNf2Vq.png) ## 練習 [LeetCode 41. First Missing Positive](https://leetcode.com/problems/first-missing-positive/) ### 題目描述 Given an unsorted integer array `nums`, find the smallest missing positive integer. **Follow up:** Could you implement an algorithm that runs in `O(n)` time and uses constant extra space? **Example 1:** **Input:** `nums = [1,2,0]` **Output:** `3` **Example 2:** **Input:** `nums = [3,4,-1,1]` **Output:** `2` **Example 3:** **Input:** `nums = [7,8,9,11,12]` **Output:** `1` ### 解法 ```c= void sort(int arr[], int front, int end) { if(front<end) { int i, index=front; for(i=front; i<end; i++) { if(arr[i]<arr[end]) { int tmp=arr[i]; //swap arr[i]=arr[index]; // arr[index]=tmp; // index++; } } int tmp=arr[i]; //swap pivot arr[i]=arr[index]; // arr[index]=tmp; // sort(arr,front,index-1); sort(arr,index+1,end); } } int firstMissingPositive(int* nums, int numsSize){ sort(nums, 0, numsSize - 1); int i = 0, min = 1; while(i < numsSize) { if (nums[i] == min) min++; else if (nums[i] > min) break; i++; } return min; } ``` ### 思路先對 `nums` 做 quick sort ，然後將 `min` 設為 `1` ，從排序好的 `nums` 中依序檢查由 `1` 開始的連續正整數，每當檢查到一個連續正整數便將 `min` 加 1 ，否則跳出迴圈並回傳 `min` ### 提交 ![](https://i.imgur.com/zBuVfJP.png) ### 另一種解法使用 bit array ，還沒實作 ## 研讀 [2017 年修課學生報告](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz) ，理解 `clz` 的實作方式，並舉出 [Exponential Golomb coding](https://en.wikipedia.org/wiki/Exponential-Golomb_coding) 的案例說明 >[x-compressor](https://github.com/jserv/x-compressor) 是個以 [Golomb-Rice coding](https://en.wikipedia.org/wiki/Golomb_coding) 為基礎的資料壓縮器，實作中也用到 clz/ctz 指令，可參見 [Selecting the Golomb Parameter in Rice Coding](https://ipnpr.jpl.nasa.gov/progress_report/42-159/159E.pdf)。 ### `clz` 效能測試 [2017 年修課學生報告](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz) 列出了 8 種 `clz` 的實做，分別為: - binary search (iterative) - binary search (bit mask) - binary search (byte shfit) - binary search (recursive) - De Bruijn method - Harley’s algorithm - Harley’s algorithm inline assembly - binary search (recursive) inline assembly 因為不知道[2017 年修課學生報告](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz)效能分析的原始碼，而且也不確定我的 CPU 的行為會不會符合他們效能分析的結果，所以我先用自己的方法測試一次，測試環境: - CPU: Intel i7-10750H - OS: Ubuntu 20.04 LTS 我使用 `rdtsc` 取得 clz 運算前後的 time stamp ，再將其相減取得 cycle 數，即下方程式碼的 `end - begin` 。測試時，將參數 `argv[1]` 作為 2 的冪次傳入並將其轉成 `int` 並存入 `shift` ，我針對每個 $2^{shift}$ 作為 `clz` 的 input 測試。以 binary search (iterative) 為例，其測試程式碼如下: ```c= #include <stdio.h> #include <stdlib.h> #include <stdint.h> #pragma optimize( "", off ) int clz(uint32_t x) { // l := left, r := right uint32_t n = 32, c = 16, l, r = x; do { l = r >> c; if (l) { n -= c; r = l; } c >>= 1; } while (c); return (n - r); } int main(int argc, char* argv[]) { int shift = atoi(argv[1]); /* input */ uint32_t x = 1 << shift; uint64_t begin, end; uint32_t begin_lo, begin_hi, end_lo, end_hi; __asm__ __volatile__ ("rdtsc" : "=a" (begin_lo), "=d" (begin_hi)); /* clz implementation */ clz(x); __asm__ __volatile__ ("rdtsc" : "=a" (end_lo), "=d" (end_hi)); begin = ((uint64_t)begin_hi << 32) | begin_lo; end = ((uint64_t)end_hi << 32) | end_lo; printf("%d %lu\n", shift, end - begin); return 0; } ``` 將 binary search (iterative) 的測試程式碼命名為 `clz1.c` ，編譯為 `clz1` 並藉由以下 shell script 對 `clz1` 至 `clz8` 進行效能測試，測試方式為對 $2^0$ 至 $2^{31}$ 分別執行 100 次 `clz`: ```non rm clz*.txt for (( i = 0; i < 32; i++ )) do echo "testing for 1 << $i" for (( j = 0; j < 100; j++)) do ./clz1 "$i" >> clz1.txt ./clz2 "$i" >> clz2.txt ./clz3 "$i" >> clz3.txt ./clz4 "$i" >> clz4.txt ./clz5 "$i" >> clz5.txt ./clz6 "$i" >> clz6.txt ./clz7 "$i" >> clz7.txt ./clz8 "$i" >> clz8.txt done done ``` 再來用 gnuplot 畫出 2 的冪次對 cpu cycle 的 scatter ，以下是 gnuplot 指令: :::spoiler ```non set size 0.5, 0.5 set term png size 900, 900 set output "figure1.png" set multiplot layout 2, 2 rowsfirst title "Binary Search CLZ" font ",18" set xlabel "power of 2" font ",10" set ylabel "rdtsc cycles" font ",10" set xtics font ",10" set ytics font ",10" # unset key set title "iterative" font ",14" plot [-1:32][:500] 'clz1.txt' with points pt 7 ps 1 lt rgb "red" # unset key set title "bit mask" font ",14" plot [-1:32][:500] 'clz2.txt' with points pt 7 ps 1 lt rgb "orange" # unset key set title "byte shfit" font ",14" plot [-1:32][:500] 'clz3.txt' with points pt 7 ps 1 lt rgb "blue" # unset key set title "recursive" font ",14" plot [-1:32][:500] 'clz4.txt' with points pt 7 ps 1 lt rgb "violet" unset multiplot ``` ::: ![](https://i.imgur.com/GuowsEi.png) 上圖 Binary Search 表示的結果大致和 [2017 年修課學生報告](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz) 測試的結果差不多，效能由優至劣排序依序是： $$\text{bit mask} \approx \text{byte shift} > \text{iteration} > \text{recursive}$$ 接下來再看另外 4 種 `clz` 的時間分佈圖: :::spoiler ```non set size 0.5, 0.5 set term png size 900, 900 set output "figure2.png" set multiplot layout 2, 2 rowsfirst title "Other CLZ" font ",18" set xlabel "power of 2" font ",10" set ylabel "rdtsc cycles" font ",10" set xtics font ",10" set ytics font ",10" # unset key set title "De Bruijn method" font ",14" plot [-1:32][:500] 'clz5.txt' with points pt 7 ps 1 lt rgb "red" # unset key set title "Harley’s algorithm" font ",14" plot [-1:32][:500] 'clz6.txt' with points pt 7 ps 1 lt rgb "orange" # unset key set title "Harley’s algorithm asm" font ",14" plot [-1:32][:500] 'clz7.txt' with points pt 7 ps 1 lt rgb "blue" # unset key set title "bnary search (recursive) asm" font ",14" plot [-1:32][:500] 'clz8.txt' with points pt 7 ps 1 lt rgb "violet" unset multiplot ``` ::: ![](https://i.imgur.com/R46iIK4.png) 可以發現 De Bruijn method 是 8 種方法中速度最慢的，我推測其原因為函式內使用了一次乘法拖慢了速度，此外，將 Harley’s algorithm 和 binary search (recursive) 用組合語言實做的確有改善效能，不過具體改善多少需要比較他們的平均。以上實驗是以 2 的冪次作為 input ，會有盲點，未來可能需要加一些隨機性較高的輸入值來觀察這些 `clz` 的實做是否會表現其他特性。我順手也測試了 `__builtin_clz` 所耗的時間，發現速度比上述實做都要快得多 ![](https://i.imgur.com/KPSneGW.png) 我將 `__builtin_clz` 包在函式 `clz` 中，並且用 gcc -S 編譯，查看其 x86 的實做: ```c // c implementation int clz(int x) { return __builtin_clz(x); } ``` ```asm ; x86 assembly implementation clz: .LFB6: .cfi_startproc endbr64 pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movl %edi, -4(%rbp) movl -4(%rbp), %eax ; load input value bsrl %eax, %eax ; get the position of most significant set bit xorl $31, %eax ; get complement of eax popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc ``` 由上面 x86 實做可以看出和 `__builtin_clz` 有關的指令只有 3 行，所以會比 2017 報告中實做的快，至於 `bsrl` 是否有獨立的電路或是有被硬體優化則需要再探討。最後我畫出 `clz` 消耗的 cycle 的 10% trimed mean 比較圖: ![](https://i.imgur.com/tywCIqy.png) ### Exponential Golomb coding 摘錄自維基百科，[Exponential-Golomb coding](https://en.wikipedia.org/wiki/Exponential-Golomb_coding) (或稱 Exp-Golomb code) 是一種 [Universal code](https://en.wikipedia.org/wiki/Universal_code_(data_compression))，也就是 Exp-Golomb code 能夠映射到所有正整數。假設輸入為 `x` ，其編碼步驟為: 1. 以二進位寫下 `x + 1` 2. 數 `x + 1` 的二進位數字總共有多少位數，並在 `x + 1` 前面補上等同其位數減 1 的數量的 `0` **Example 1:** ``` x = 5 1. Write down 5 + 1 = 6 as binary "110". 2. "110" has 3 digits, we write 2 "0"s preceding "110". The Exp-Golomb code of 5 is "00110". ``` **Example 2:** ``` x = 24 1. Write down 24 + 1 = 25 as binary "11001". 2. "11001" has 5 digits, we write 4 "0"s preceding "11001". The Exp-Golomb code of 24 is "000011001". ``` 按照上述規則計算 0 ~ 8 的 Exp-Golomb code: ``` x step 1 step 2 0 ⇒ 1 ⇒ 1 1 ⇒ 10 ⇒ 010 2 ⇒ 11 ⇒ 011 3 ⇒ 100 ⇒ 00100 4 ⇒ 101 ⇒ 00101 5 ⇒ 110 ⇒ 00110 6 ⇒ 111 ⇒ 00111 7 ⇒ 1000 ⇒ 0001000 8 ⇒ 1001 ⇒ 0001001 ``` 如果要讓 Exp-Golomb code 編碼**有號**整數，只需要讓有號數一一對應無號數即能達到目的: ``` x map step 1 step 2 0 ⇒ 0 ⇒ 1 ⇒ 1 1 ⇒ 1 ⇒ 10 ⇒ 010 −1 ⇒ 2 ⇒ 11 ⇒ 011 2 ⇒ 3 ⇒ 100 ⇒ 00100 −2 ⇒ 4 ⇒ 101 ⇒ 00101 3 ⇒ 5 ⇒ 110 ⇒ 00110 −3 ⇒ 6 ⇒ 111 ⇒ 00111 4 ⇒ 7 ⇒ 1000 ⇒ 0001000 −4 ⇒ 8 ⇒ 1001 ⇒ 0001001 ``` 我們可以按照順序，每 $2^i$ 個 Exp-Golomb code 分為一組。對於每個編碼，前面有 $i$ 個 `0` 就代表該編碼屬於 $S_i$ ，每個編碼正中央紅色的 `1` 作為分界點，後綴的編碼以二進位表示為第 $S_i$ 組的第 $j$ 個元素，記為 $G_{i,j}$ : $\ S_0 \quad \quad \quad S_1 \quad \quad \quad \quad \quad \quad \quad \quad \ \ S_2 \\ \{\color{red}1\},\ \{0\color{red}10,\ 0\color{red}11\},\ \{00\color{red}100,\ 00\color{red}101,\ 00\color{red}110,\ 00\color{red}111\},\ ... \\ G_{0,0} \ \ \ G_{1,0} \ \ \ G_{1,1} \quad \quad G_{2,0} \quad \ \ G_{2,1} \quad \ \ G_{2,2} \quad \ G_{2,3}$ ### Order-k Exponential Golomb coding 接下來，對 Exp-Golomb code 的編碼方式一般化，前綴的 `0` 的數量一樣代表該編碼在第幾組，但是每一組的元素數量變為 $2^{i+k}$ 個，後綴的編碼也加長 $k$ 位，我們將這種編碼方式稱為 $k$ 階 Exp-Golomb code。以下為範例: **order-0** $\ S_0 \quad \quad \quad S_1 \quad \quad \quad \quad \quad \quad \quad \quad \ \ S_2 \\ \{\color{red}1\},\ \{0\color{red}10,\ 0\color{red}11\},\ \{00\color{red}100,\ 00\color{red}101,\ 00\color{red}110,\ 00\color{red}111\},\ ... \\ G_{0,0} \ \ \ G_{1,0} \ \ \ G_{1,1} \quad \quad G_{2,0} \quad \ \ G_{2,1} \quad \ \ G_{2,2} \quad \ G_{2,3}$ **order-1** $\ \quad S_0 \quad \quad \quad \quad \quad \quad \quad S_1 \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad S_2 \\ \{\color{red}10,\ \color{red}11\},\ \{0\color{red}100,\ 0\color{red}101, \ 0\color{red}110,\ 0\color{red}111\},\ \{00\color{red}1000,\ 00\color{red}1001,\ 00\color{red}1010,\ ... \\ G_{0,0} \ \ G_{0,1} \quad \ G_{1,0} \quad G_{1,1} \quad G_{1,2} \quad G_{1,3} \quad \quad G_{3,0} \quad \quad \ G_{3,1} \quad \quad G_{3,2}$ **order-2** $\ \quad S_0 \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad S_1 \\ \{\color{red}100,\ \color{red}101,\ \color{red}110,\ \color{red}111\},\ \{0\color{red}1000,\ 0\color{red}1001, \ 0\color{red}1010,\ 0\color{red}1011,\ 0\color{red}1100,\ 0\color{red}1101,\ 0\color{red}1110,\ 0\color{red}1111\},\ ...\\ G_{0,0} \ \ \ G_{0,1} \ \ G_{0,2} \ \ G_{0,3} \quad \ \ \ G_{1,0} \quad \ \ G_{1,1} \quad \ \ G_{1,2} \quad \ \ G_{1,3} \quad \ \ G_{1, 4} \quad \ \ G_{1, 5} \quad \ \ G_{1, 6} \quad \ G_{1, 7}$ 產生 $k$ 階 Exp-Golomb code 的步驟為: 1. 以二進位寫下 $x+2^k$ 2. 數 $x+2^k$ 的二進位數字總共有多少位數，並在 $x+2^k$ 前面補上等同其位數減 $(k+1)$ 的數量的 `0` 以下是 order-0 到 order-4 的前 13 個編碼: |x |order-0 |order-1 |order-2 |order-3 |order-4 | |--|-----------|-----------|-----------|-----------|-----------| |0 |**1** |**1**0 |**1**00 |**1**000 |**1**0000 | |1 |0**1**0 |**1**1 |**1**01 |**1**001 |**1**0001 | |2 |0**1**1 |0**1**00 |**1**10 |**1**010 |**1**0010 | |3 |00**1**00 |0**1**01 |**1**11 |**1**011 |**1**0011 | |4 |00**1**01 |0**1**10 |0**1**000 |**1**100 |**1**0100 | |5 |00**1**10 |0**1**11 |0**1**001 |**1**101 |**1**0101 | |6 |00**1**11 |00**1**000 |0**1**010 |**1**110 |**1**0110 | |7 |000**1**000|00**1**001 |0**1**011 |**1**111 |**1**0111 | |8 |000**1**001|00**1**010 |0**1**100 |0**1**0000 |**1**1000 | |9 |000**1**010|00**1**011 |0**1**101 |0**1**0001 |**1**1001 | |10|000**1**011|00**1**100 |0**1**110 |0**1**0010 |**1**1010 | |11|000**1**100|00**1**101 |0**1**111 |0**1**0011 |**1**1011 | |12|000**1**101|00**1**110 |00**1**0000|0**1**0100 |**1**1100 | ### 實做 Exponential Golomb coding **repo:** https://github.com/blueskyson/Exponential-Golomb-coding **如何使用:** 首先將 repo 中的 `exp-golomb.c` 編譯成 `encode` ，再將 `decode` 連結至 `encode` 即完成編譯: ``` $ gcc exp-golomb.c -o encode $ ln -s encode decode ``` `encode` 和 `decode` 的參數格式如下，以下將 `sample_text.txt` 轉成 order-4 exp-golomb code 的方法: ``` # Usage encode [input file] [output file] [order-k] decode [input file] [output file] [order-k] # Example ./encode sample_text.txt text.encode 4 ./decode text.encode text.decode 4 ``` **主程式:** 這個小程式能將資料以 `uint8_t` 為一個單位編碼成 exp-golomb code ，在程式第 7 行，先設定 `MAX_ORDER` ，也就是程式能容許的 order-k 的限度，此程式最大的限度為 order-31 ，因為在轉換的過程是以 1 到 2 個 `uint32_t` 作為中繼的容器，如果 MAX_ORDER 大於等於 32 會造成編碼後的資料跨越 3 個 `uint32_t` ，我的沒有設計相應機制處理。接下來設定 `BUFFER_SIZE` ，這個 macro 決定暫存陣列的長度。在 `encode` 函式中，暫存陣列為 `uint32_t buffer[BUFFER_SIZE]` ；在 `decode` 函式中，暫存陣列為 `uint8_t buffer[BUFFER_SIZE]`。當 buffer 的元素達到 `WRITE_SIZE` 時便會將 buffer 的內容寫入 output file 並的清空 buffer 。接下來我將 input file 以 file descriptor 的形式開啟，之後使用 `mmap` 讀檔、 output file 以 FILE* 的形式開啟，以 `fwrite` 寫入。 ```c= #include <stdio.h> #include <stdlib.h> #include <string.h> #include <stdint.h> #include <libgen.h> //basename #include <fcntl.h> //open #include <unistd.h> //close #include <sys/mman.h> //mmap #include <sys/stat.h> #define MAX_ORDER 7 // must be smaller than 32 #define BUFFER_SIZE 100 #define WRITE_SIZE (BUFFER_SIZE - 2) /* function status */ #define FAIL 0 #define SUCCESS 1 int encode(int, FILE*, int); int decode(int, FILE*, int); int main(int argc, char *argv[]) { if (argc < 3) { puts("Usage:\n" "encode [input file] [output file] [order-k]\n" "decode [input file] [output file] [order-k]"); return 0; } /* open argv[1] as input file */ int in_fd = open(argv[1], O_RDONLY); if (in_fd < 0) { puts("cannot open input file"); return 0; } /* open argv[2] as output file */ FILE *out_file = fopen(argv[2], "wb"); if (!out_file) { puts("cannot open output file"); return 0; } /* use argv[3] as order */ long int order = 0; if (argc >= 4) { char* endptr; order = strtol(argv[3], &endptr, 10); if (endptr == argv[3]) { puts("order is not a decimal number"); return 0; } if (order > MAX_ORDER) { puts("order larger than max order"); return 0; } } int status; if (strcmp(basename(argv[0]), "encode") == 0) { status = encode(in_fd, out_file, order); } else { status = decode(in_fd, out_file, order); } fclose(out_file); close(in_fd); if (status == FAIL) { remove(argv[2]); } return 0; } ``` **Encode** 以 `uint8_t*` 型態打開 input file 的 `mmap` ，轉換後的 exp-golomb code 會緊密排列在 uint32_t 的陣列中，如下圖示: ![](https://i.imgur.com/NJm4IPD.png) 這裡所使用的編碼方法跟維基百科提到的一樣: 1. 以二進位寫下 $x+2^k$ 2. 數 $x+2^k$ 的二進位數字總共有多少位數，並在 $x+2^k$ 前面補上等同其位數減 $(k+1)$ 的數量的 `0` 當轉換完所有資料後，在檔案尾加上一個 (uint32_t) 1 作為結束檔案的記號。 ```c=73 int encode(int in_fd, FILE* out_file, int order) { struct stat s; if (fstat(in_fd, &s) < 0) { puts("cannot get status of iniput file"); return FAIL; } uint8_t *map = (uint8_t*) mmap(0, s.st_size, PROT_READ, MAP_PRIVATE, in_fd, 0); if (map == MAP_FAILED) { puts("cannot open mmap"); return FAIL; } uint32_t buffer[BUFFER_SIZE]; memset(buffer, 0, sizeof(buffer)); int buffer_index = 0; int cursor = 32; uint32_t offset = 1 << order; /* start to encode */ for (long int i = 0; i < s.st_size; i++) { uint32_t current = (uint32_t) map[i]; current += offset; int clz = __builtin_clz(current); int unary_width = 31 - clz - order; int binary_width = 32 - clz; /* write unary */ cursor -= unary_width; if (cursor <= 0) { cursor += 32; buffer_index++; } /* write binary */ if (cursor < binary_width) { // truncate in uint32_t buffer[buffer_index++] |= current >> (binary_width - cursor); buffer[buffer_index] |= current << (32 - (binary_width - cursor)); cursor = cursor + 32 - binary_width; } else { buffer[buffer_index] |= current << (cursor - binary_width); cursor -= binary_width; } /* write buffer */ if (buffer_index >= WRITE_SIZE) { fwrite(buffer, 4, WRITE_SIZE, out_file); uint32_t tail = buffer[buffer_index]; memset(buffer, 0, sizeof(buffer)); buffer[0] = tail; buffer_index = 0; } } /* finalize */ buffer[buffer_index + 1] = (uint32_t) 1; //end signal fwrite(buffer, 4, buffer_index + 2, out_file); return SUCCESS; } ``` **Decode** 這裡只要將 encode 的步驟反著做就好了，以 `uint32_t*` 型態打開 input file 的 `mmap` ，轉換後的資料會轉型成 `uint8_t` 。唯一需要注意的是，讓 `uint32_t` 左移或右移 32 bit 時，會發生 [-Wshift-count-overflow] ，所以在 191 行寫了一個判斷去規避平移 32 bit 。 ```c=134 int decode(int in_fd, FILE* out_file, int order) { struct stat s; if (fstat(in_fd, &s) < 0) { puts("cannot get status of iniput file"); return FAIL; } uint32_t *map = (uint32_t*) mmap(0, s.st_size, PROT_READ, MAP_PRIVATE, in_fd, 0); if (map == MAP_FAILED) { puts("cannot open mmap"); return FAIL; } uint8_t buffer[BUFFER_SIZE]; memset(buffer, 0, sizeof(buffer)); int buffer_index = 0; int cursor = 32; uint32_t offset = 1 << order; int map_index = 0; uint32_t current = map[0]; while (1) { /* read unary */ int unary_width = 0; if (current == 0) { // truncate in unary field current = map[++map_index]; int clz = __builtin_clz(current); unary_width = cursor + clz; cursor = 32 - clz; } else { int clz = __builtin_clz(current); unary_width = clz - (32 - cursor); cursor = 32 - clz; } /* end of file is (uint32_t) 1 , * the leading zero of end of file is 31 */ if (unary_width >= 31) { break; } /* read binary */ int binary_width = unary_width + 1 + order; uint32_t tmp = 0; if (binary_width > cursor) { // truncate in binary field tmp = current << (binary_width - cursor); current = map[++map_index]; binary_width -= cursor; cursor = 32; } tmp |= current >> (cursor - binary_width); tmp -= offset; buffer[buffer_index++] = (uint8_t) tmp; /* be careful for left shift 32 bits */ cursor -= binary_width; if (cursor == 0) { current = 0; } else { int shift = 32 - cursor; current = (current << shift) >> shift; } if (buffer_index == BUFFER_SIZE) { fwrite(buffer, 1, BUFFER_SIZE, out_file); buffer_index = 0; } } /* finalize */ if (buffer_index != 0) { fwrite(buffer, 1, buffer_index, out_file); } return SUCCESS; } ``` ### 使用 valgrind 檢查程式因為整個程式完全沒使用 malloc ，所以很難發生 memory leak ，使用 `valgrind -q --leak-check=full` 編解碼都沒有被偵測到 memory leak。下面為 encode 和 decode 的記憶體使用狀況: ![](https://i.imgur.com/QhkG4Pi.png) ![](https://i.imgur.com/znfd4SS.png) ### 使用 perf 觀察 **Encode** ``` # perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles,context-switches,branches,branch-misses ./encode 74-0.txt 74-0.encode Performance counter stats for './encode 74-0.txt 74-0.encode' (5 runs): 5,4589 cache-misses # 32.056 % of all cache refs ( +- 20.69% ) (45.64%) 17,0295 cache-references ( +- 4.66% ) (97.83%) 2515,6815 instructions # 1.46 insn per cycle ( +- 0.56% ) 1718,9863 cycles ( +- 0.80% ) 1 context-switches ( +- 40.82% ) 267,6514 branches ( +- 3.97% ) (54.36%) <not counted> branch-misses ( +-100.00% ) (2.17%) 0.006105 +- 0.000246 seconds time elapsed ( +- 4.02% ) ``` **Decode** ``` # perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles,context-switches,branches,branch-misses ./decode 74-0.encode 74-0.decode Performance counter stats for './decode 74-0.encode 74-0.decode' (5 runs): 7,6270 cache-misses # 32.452 % of all cache refs ( +- 20.52% ) (34.36%) 23,5021 cache-references ( +- 12.40% ) (77.71%) 2955,5866 instructions # 1.32 insn per cycle ( +- 2.98% ) (96.42%) 2245,0086 cycles ( +- 5.05% ) 1 context-switches ( +- 25.00% ) 298,9814 branches ( +- 2.96% ) (69.22%) <not counted> branch-misses ( +- 28.11% ) (22.29%) 0.00845 +- 0.00136 seconds time elapsed ( +- 16.08% ) ``` 為了讓程式執行的夠久，讓 perf 能有效偵測效能，我使用 repo 中的 74-0.txt 進行測試，但是仍然無法測出 branch-mises 不知道是系統權限沒設定好或是要讓程式執行更久才能統計出來。我對 order-0 的 encode 和 decode 測試 5 次，得到上面兩份數據，很明顯發現我的 cache-misses 非常高，平均高達 32% ，代表我程式寫法對變數的 locality 處理非常差，我認為這是目前值得我去改進的部份。我初步猜測可能原因是類似 `dict` 作業 copy 和 reference 機制的問題，也就是我常常存取的陣列沒有配置在較近的 heap 中。 ### Exponential Golomb coding 效能分析下圖是 8-bit 數值經過 order 0 到 4 的編碼後所佔據的位元長度，可以看到除了 order-4 之外，其他 exp-golomb code 都在原數值大於 50 之前達到 8 bit 的長度。 ASCII code 中常用的數字和英文字母介於 48 的 `'0'` 到 122 的 `'z'` 之間，幾乎都是會讓 exp-golomb code 超過 8 bit 的區間，若沒有修改 exp-golomb 的機制，根本無法壓縮。 x-compressor 以不斷更新模型，盡量讓出現頻率最高的數字使用位元長度較少的 golomb-rice code 達成壓縮的效果 ![](https://i.imgur.com/SjRCT5l.png) ## 閱讀論文 [Selecting the Golomb Parameter in Rice Coding](https://ipnpr.jpl.nasa.gov/progress_report/42-159/159E.pdf) 這篇論文的目標是針對一個資源，找出 golomb-power-of-2 (GPO2) 最佳的參數 $k$ ，使得此資源的 GPO2 的 bit rate 期望值 (編碼的平均長度) 最小。找出參數 $k$ 的過程必須快、節省計算資源。 ### Analysis **A. Minimizing Code Rate** 首先，我們把**非負整數**設定為目標資源，用 GPO2 編碼非負整數的隨機變數 $δ$ 時，前綴的一進位值需要 $\left\lfloor\dfrac{j}{2^k}\right\rfloor + 1$ 個 bit ，後綴二進值則固定為 $k$ 個 bit ，此時的編碼率 (對所有非負整數編碼的長度) 為: $$ R_k = \sum_{j=0}^\infty(\left\lfloor\dfrac{j}{2^k}\right\rfloor+1+k) \quad \text{Prob}[δ=j]$$ 觀察上面的函數，在固定 $j$ 的情況下 $\left\lfloor\dfrac{j}{2^k}\right\rfloor+1+k$ 為一凸函數，又 $R_k$ 為這些凸函數的總和，所以 $R_k$ 也是凸函數，因此 $R_k$ 的區域最小值亦為全域最小值。任何非負整數 $k^*$ 滿足 $$R_{k^*-1} \leq R_{k^*} \leq R_{k^*+1} \quad \quad (1)$$ 即為 GPO2 編碼長度最小的解。為了方便計算，將 $R_{-1}$ 視為 $\infty$ 。接下來，我們考慮已知 $µ=E[δ]$ 的情況下，如何依據先前的編碼估算參數 $k$ 的值。給定 $µ$ ， GPO2 的最佳參數 $k$ 一定會在區間 $[k^*_{min}(µ), \ k^*_{max}(µ)]$ ，其中 $$\begin{align*} &k^*_{min}(µ) = \text{max}\left\lbrace 0,\left\lfloor \text{log}_2\left(\dfrac{2}{3}(µ+1)\right) \right\rfloor \right\rbrace \\ &k^*_{max}(µ) = \text{max}\lbrace 0, \lceil \text{log}_2 µ \rceil\rbrace \end{align*} \quad \quad (2)$$ 使得 $R_{k^*}$ 接近最低點 (編碼率最佳) ，下圖顯示了 $k^*_{min}(µ)$ 和 $k^*_{max}(µ)$ 的成長趨勢 ![](https://i.imgur.com/knA8lsX.png) 再來由 (2) 可以推得 $k^*_{max}(µ)-k^*_{min}(µ) \leq 2 \ \text{for all} \ µ$ ，其中的 $µ$ 是樣本的平均值。以 $µ$ 計算編碼率 $R_k$ 和 $R_{k-1}$ 的差值 $$\begin{align*} R_k - R_{k-1} &= 1 + \sum_j(\left\lfloor \dfrac{j}{2^k} \right\rfloor - \left\lfloor \dfrac{j}{2^{k-1}} \right\rfloor) \\ &= 1 + \sum_j \left\lfloor \dfrac{1}{2} - \dfrac{j+1}{2^k} \right\rfloor \quad \text{Prob}[δ=j] \end{align*}$$ 其中 $$-\dfrac{1}{2}-\dfrac{j}{2^k} ≤ \left\lfloor \dfrac{1}{2} - \dfrac{j+1}{2^k} \right\rfloor ≤ \dfrac{1}{2}-\dfrac{j+1}{2^k}$$ 所以我們得到 $$\dfrac{1}{2} - \dfrac{µ}{2^k} ≤ R_k - R_{k-1} ≤ \dfrac{3}{2} - \dfrac{µ+1}{2^k} \quad \quad (3)$$ 我們可以將編碼率表示為 $$\begin{align*} R_k &= k+1+E\left[ \left\lfloor\dfrac{δ}{2^k}\right\rfloor \right]\\ &= k+1+\dfrac{1}{2^k}\left(E[δ]-E\left[ δ-2^k \left\lfloor\dfrac{δ}{2^k} \right\rfloor \right]\right)\\ &=k+1+\dfrac{1}{2^k}(µ-\bar{r}_k) \quad \quad (4) \end{align*}$$ 其中 $\bar{r}_k$ 的值等於 $$\bar{r}_k=δ-2^k\left\lfloor \dfrac{δ}{2^k} \right\rfloor \quad \quad (5)$$ 其值剛好等於 $δ \ \text{mod} \ 2^k$ 。由 (4) 可以進一步推出 $R_k^{\text{up}}$ 和 $R_k^{\text{low}}$ 代表 $R_k$ 的上下界 $$\begin{align*} &R_k ≤ R_k^{\text{up}}(µ) \triangleq k+1+\dfrac{µ}{2^k}\\ &R_k ≥ R_k^{\text{low}}(µ) \triangleq k+\text{max} \left\lbrace 1, \dfrac{µ+1}{2^k}\right\rbrace \end{align*}$$ 我們由 (4) 推導出的算式進一步推論在滿足 (1) 的 $R_k ≤ R_{k-1}$ 時， $µ$ 的範圍為 (6) $$\begin{align*} &k+1+\dfrac{1}{2^k}(µ-\bar{r}_k) ≤ (k-1)+1+\dfrac{1}{2^{k-1}}(µ-\bar{r}_{k-1})\\ \Rightarrow \quad &µ≥2^k+2\bar{r}_{k-1}-\bar{r}_{k} \quad \quad (6) \end{align*}$$ 最後，針對 (6) 我們討論邊界問題: 因為 $\bar{r}_0 = 0$ ，當 $µ < 2^1+2\bar{r}_0-\bar{r}_1$ (即無法滿足 (6)) 時，就將參數 $k$ 設為 0 ；否則，選擇滿足 (6) 的最大的 $k$ 值。如此一來，便可以得到最佳的 $k$ 值，我們可以預先計算每個 $µ$ 在 (6) 中應該要對應哪個 $k$ 值，並做一個表格記錄下來。 **B. Dynamic Range Limit and Uncoded Data** 有兩個因素影響了 GPO2 在現實的應用: 1. 儘管 Golomb code 允許任意大小的輸入，但在實際應用中，樣本通常有最大值，縮小了需要考慮的編碼範圍。 2. 許多 entropy 編碼器可以選擇不要編碼樣本，參數選擇 ( $k$ ) 的過程中應該要包含不予編碼的狀況。再來，當樣本有 $N$ bit 時， $k ≥ N − 1$ 會使得編碼率大於等於原始樣本，這樣就沒有壓縮的效果，所以我們預先施加約束: $k ≤ N − 2$ ### Coding a Geometric Source **A. Optimum Code Selection** 將 $δ$ 以幾何分布建模如下，評估 $\bar{r}_k$ $$\text{Prob}[δ=j]=(1-α)α^j \quad \quad α ∈ (0,1)$$ ## 改進 x-compressor 的方案 ### 解析 x-compressor 原始碼 **compress** 首先 input file 藉由 `stdin` 進入程序，在 `x.c` 的 main function 中，宣告 `FILE* istream` 指向 `stdin` 的資料流；`FILE* ostream` 指向 `stdout` 的資料流。接下來，將資料分成兩個 `layer` ，將 `stdin` 的資料載入 heap 中，並用 `layer[0].data` 指向它，之後 `layer[1].data` 會被分配到一塊夠大的空間儲存轉換後的資料。透過 `x_init()` 初始化 `table` ，也就是字元分佈的模型，首先按照每個字元的大小順序排好。然後進入到 `x_compress` 函式，使用 `initiate` 初始化 `io` ， `io` 負責將轉換好的編碼緊密的排在 `layer[1].data` 中， `io.ptr` 指向 `layer[1].data` 中即將被寫入的位址，功能類似我前面的程式的 `buffer_index` 、 `io.c` 指向即將被寫入的 bit ，功能類似我前面的程式的 `cursor` **todo:**