2020q1 Homework3 (quiz4)

# 2020q1 Homework3 (quiz4) contributed by < `Yu-Wei-Chang` > > [2020q1 第 4 週測驗題](https://hackmd.io/@sysprog/ByJyk6HrL) ### 測驗 `1` #### 程式運作原理 * 輸入資料的位元位移處理 bitcpy() 函式中使用 uint8_t data 來處理輸入輸出的資料，所以一次處理 8 個位元。由這個條件切入去分析位元位移的處理: * 超過 8 位元的位移直接對 `_src buf` 做指標位移即可移動到我們要讀取輸入資料的位置。 ```cpp const uint8_t *source = _src + (_read / 8); ``` * 少於 8 位元的位移處理 : `read_lhs` 表示需要對資料左移多少位元 (個人感覺改寫成 _read % 8 比較直覺一點)，`read_rhs` 表示還需要往下讀多少位元 ```cpp size_t read_lhs = _read & 7; size_t read_rhs = 8 - read_lhs; ``` * 處理輸入資料位移的實作 : 從適當指標位移後的 source buf 中讀出 8 bits 資料，有必要的話做左位移，丟掉不要的 bit，然後看需要需要再從 buf 中補 bit。 while 迴圈中一次處理最多 8 bits 的資料，直到所有 bit 處理完成，每次處理完一輪理當要扣除已經處理過的 bit 數量，所以 `KK2` = _count -= bitsize;。 ```cpp while (_count > 0) { data = *source++; bitsize = (_count > 8) ? 8 : _count; if (read_lhs > 0) { data <<= read_lhs; if (bitsize > read_rhs) data |= (*source >> read_rhs); } ... _count -= bitsize; /* KK2 */ } ``` * 輸出資料的位元位移處理 * `_dest but`、`write_lhs`、`write_rhs` 的初始化原理和上面輸入的部分相同，`KK1` = `_write & 7` 表示資料寫入 buf 時需要位移多少位元。 ```cpp size_t write_lhs = _write & 7; /* KK1 */ size_t write_rhs = 8 - write_lhs; uint8_t *dest = _dest + (_write / 8); ``` * 處理輸出資料位移的實作 : 主要的原理還是將變數 `data` 存入 `_dest` buffer 之中，當中要考慮需要需要跨 byte ? 透過 bitwise mask 來處理要位移多少位元..等。 ```cpp while (_count > 0) { ... original = *dest; if (write_lhs > 0) { mask = read_mask[write_lhs]; if (bitsize > write_rhs) { *dest++ = (original & mask) | (data >> write_lhs); original = *dest & write_mask[bitsize - write_rhs]; *dest = original | (data << write_rhs); } else { if ((bitsize - write_lhs) > 0) mask = mask | write_mask[8 - (bitsize - write_lhs)]; *dest++ = (original & mask) | (data >> write_lhs); } } else { if (bitsize < 8) data = data | (original & write_mask[bitsize]); *dest++ = data; } _count -= bitsize; /* KK2 */ } ``` * 嘗試精簡 read mask 和 write mask 的程式碼數量 * 原始程式碼 `read_mask` 和 `write_mask` 宣告為 `static const uint8_t` 的陣列，而且已經初始化，所以這些資料會儲存在 VFS 的 `Data Segment` 之中。 * 仔細觀察，發現可以用 bitwise 操作來替換，如下。read mask 剛好是 write mask 的 reverse。 ```cpp #define WRITE_MASK(x) ((1 << (8 - ((x) % 9))) - 1) #define READ_MASK(x) (~(WRITE_MASK(x))) ``` * 結果 : * 在效能上沒有特別明顯的改善，推論 **即時運算 mask** 和 **從 `Data segment` 讀 mask** 相比，在指令上沒有明顯優勢。 * 雖然程式碼減少了一些，可是可讀性不見的比較好。 * 有發現儘管已經指定 CPU 來專門執行行程，並且排除干擾的可能性 ([參考 2020q1 Homework 2](https://hackmd.io/@sysprog/linux2020-fibdrv#-Linux-%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E6%8F%90%E7%A4%BA))，==但是行程的執行耗時還是有很大範圍的浮動，每次跑數值都可能有很大的差異。== (WIP..) ![](https://i.imgur.com/0wdg4b8.png) * 改善 `bitcpy()` 函式的效能參考 [eecheng](https://hackmd.io/@eecheng/SyOd56BBI) 同學和 [oscarshiang](https://hackmd.io/@oscarshiang/linux_quiz4) 同學的作業，發現有 `branch-missed` 這樣的狀況。回頭看[你所不知道的 C 語言：編譯器和最佳化原理篇](https://hackmd.io/@sysprog/c-compiler-optimization?type=view#From-Source-to-Binary-How-A-Compiler-Works-GNU-Toolchain)，其中也確實有提到 `Branch Optimization`。再參考[資料](https://en.wikibooks.org/wiki/Optimizing_C%2B%2B/Code_optimization/Pipeline)，其中解釋到 `if-else` 條件判斷式會產生 jump 的機器語言指令 (a.k.a. branch)。==處理器會根據前一次 iteration 的結果來預測 branch 的走向，預測成功的話會增加效能，但是預測失敗的話則會產生更多的開銷==。所以應該避免在很難預測的情況下使用 `if-else` 條件判斷，又或是在程式中減少不必要的 `if-else` 條件判斷式。 ``` Modern processors handle branches efficiently only if they can predict them. In case of prediction error, the steps already done by the pipeline on the following instructions are useless and the processor must restart from the branch destination instruction. ... In bottlenecks, the hard-to-predict branches should be avoided. If a branch is predicted very badly, even replacing it with a rather slow sequence of instructions may result in a speed up. ``` 改善 `bitcpy()` 減少不必要的分支 : ```cpp while (_count > 0) { bitsize = (_count > 8) ? 8 : _count; /* Don't care if read_lhs > 0 and * bitsize > read_rhs are matched or not. * Just read 8 bits data. */ data = (*source++ << read_lhs) | (*source >> read_rhs); /* Don't care if bitsize < 8 or not. * data doesn't changed when it AND 11111111b */ data &= READ_MASK(bitsize); original = *dest; /* Handle left part of data: * Mask original value out (*dest), and OR data with it. */ mask = READ_MASK(write_lhs); *dest++ = (original & mask) | (data >> write_lhs); /* Handle right part of data: */ if ((bitsize - write_rhs) >= 0) { *dest &= WRITE_MASK(bitsize - write_rhs); *dest |= (data << write_rhs); } _count -= bitsize; /* KK2 */ } ``` 用 `perf` 去量測 branch-misses 和 branch-instruction，看起來沒有什麼差別。 ```shell $ sudo perf record -e branch-misses:u,branch-instructions:u -F 10000 ./bitcpy $ sudo pert report Original bitcpy Available samples 1K branch-misses:u 1K branch-instructions:u Branch lessed bitcpy Available samples 1K branch-misses:u 1K branch-instructions:u ``` 更改實驗方式，複製多點 bit 再量測 branch-misses 和 branch-instruction，就可以發現 branch-misses 的數量下降了，==但是 branch-instruction 的數值沒有變化。== (WIP..) ```cpp static uint8_t output[1000], input[1000]; memset(&input[0], 0xFF, sizeof(input)); for (int i = 1; i <= 7900; ++i) { memset(&output[0], 0x00, sizeof(output)); clock_gettime(CLOCK_REALTIME, &t1); bitcpy(&output[0], 1, &input[0], 2, i); clock_gettime(CLOCK_REALTIME, &t2); fprintf(fptr, "%d %ld", i, timespec_diff(&t1, &t2)); memset(&output[0], 0x00, sizeof(output)); clock_gettime(CLOCK_REALTIME, &t1); bitcpy_branch_less(&output[0], 1, &input[0], 2, i); clock_gettime(CLOCK_REALTIME, &t2); fprintf(fptr, " %ld\n", timespec_diff(&t1, &t2)); } ``` ```shell Original bitcpy Available samples 546 branch-misses:u 1K branch-instructions:u Branch lessed bitcpy Available samples 198 branch-misses:u 1K branch-instructions:u ``` 來看看執行效能: ==減少分支後的版本執行效率有沒比較好 (WIP..)== ![](https://i.imgur.com/00MqmGP.png) :::danger 1. 改善測量時間的方式，避免在迴圈內呼叫 `fprintf` 一類的函式; 2. 避免在同一個迴圈內執行兩種 bitcpy 實作，注意 locality 的影響; 3. 改進對用語的精準掌握; :notes: jserv ::: > 收到，感謝老師的建議。 > [name=Y.W.Chang] [time=Sat, May 30, 2020 4:14 PM] ### 測驗 `2` #### 程式運作原理 * `NEXT_POWER_OF_2(s)` 巨集由字面上可以猜出巨集是要找出大於 `s` 的最小 $2^n$。 `VV1` 條件成立時指定 vector 長度為 `s`，反之指定 vector 長度為 $2^{64-builtin_clzl(s)}$，所以可以合理推測 `VV1` 的條件為 ==輸入 `s` 是 2 的冪嗎 ?== 從選項中可以選出適當的答案為 `(s & (s - 1)) == 0`，分析如下 : * 2 的冪表示為二進位一定是長這樣 $00000001_b$、$00000010_b$、$00000100_b$、$00001000_b$... * 當它和其減 1 的數值做位元運算結果一定為 `False`，E.g. $00001000_b$ & $00000111_b$ == 0 從[你所不知道的 C 語言 - 記憶體管理、對齊及硬體特性篇](https://hackmd.io/@sysprog/c-memory?type=view#data-alignment)中有提到，因為處理器從記憶體抓取資料的特性，所以資料的對齊會影響其效能的發揮。編譯器在編譯時會對 struct 做資料對齊，以 struct 中所有成員中最大的 size 為 base，然後對 struct 做 padding，直到其大小為 $base size*n$ 。從這個例子來看，若今天指定 vector 配給 3 個 int ，但是因為資料對齊的關係，編譯器最後會配給 vector 的大小為 24 byte (size_t + 3 個 int + padding，最大成員是 size_t 8 byte，為了資料對齊，所以會 padding 其大小為 8、16、24 ... byte)，推測是因為不浪費 padding 的空間，所以巨集才這樣設計。 * vec_pop_back(v) 的實作目的是刪除 vector 中最尾端的資料，最尾端的資料為 v.buf[v.size] 或是 v.ptr[v.size]，想要刪除它直接遞減 v.size 即可，故 `VV2` 恰當的選項為 `v.size -= 1`。可是這樣有點危險，因為當 v.size 為 0 時還是會繼續遞減，會產生錯誤的 size，目前想到可以改寫成下面的巨集，但是 gcc 會報出警告，有看到一個說法說省略第二個參數的 `ternary operator` 是 GCC extension (待查)。 ```cpp #define vec_pop_back(v) (void)((v.size <= 0) ?: (v.size -= 1)) ``` ```shell vector.c:48:47: warning: the omitted middle operand in ?: will always be ‘true’, suggest explicit middle operand [-Wparentheses] #define vec_pop_back(v) (void)((v.size <= 0) ?: (v.size -= 1)) ^ ``` * GCC extension: [attribute cleanup](https://stackoverflow.com/questions/42025488/resource-acquisition-is-initialization-in-c-lang) 程式中的 `v` 巨集在定義上使用了 `__attribute__((cleanup(vec_free)))`，目的是達到自動回收資源。當透過此巨集宣告的 vector 的 life time 結束後 (超出其可用的範圍，也就 main() 函式的 `}` )，vec_free() 函式會自動被呼叫，將配給給 vector 的資源釋放掉。 ```cpp #define v(t, s, name, ...) \ union { \ STRUCT_BODY(t); \ struct { \ size_t filler; \ t buf[NEXT_POWER_OF_2(s)]; \ }; \ } name __attribute__((cleanup(vec_free))) = {.buf = {__VA_ARGS__}}; \ ``` ###### tags: `Linux核心課程筆記 - 隨堂測驗` ###### tags: `Linux核心課程筆記 - Homework`