2022q1 Homework2 (quiz2)

--- tags: linux2022 --- # 2022q1 Homework2 (quiz2) contributed by < `SmallHanley` > > [linux2022-quiz2](https://hackmd.io/@sysprog/linux2022-quiz2) ## 測驗 `1` :::success 解釋上述程式碼運作的原理 ::: 考慮以下對二個無號整數取平均值的程式碼： ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a + b) / 2; } ``` 這個直覺的解法會有 overflow 的問題，若我們已知 a, b 數值的大小，可用下方程式避免 overflow： ```c #include <stdint.h> uint32_t average(uint32_t low, uint32_t high) { return low + (high - low) / 2; } ``` 接著我們可改寫為以下等價的實作： ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + (a & b & 1); } ``` **原理：** 如果是我們平常使用的兩實數取平均，可以寫成： \begin{gather} \frac{a + b}{2} = \frac{a}{2} + \frac{b}{2} \end{gather} 而上述的程式碼考慮的是非負整數，因此我們可以使用右移運算子 `>>` 代替上式的除以 2。不過這還不是等價的程式碼，會忽略到兩數都是奇數時的進位問題，因此要加上 `a & b & 1`，使得當兩數都是奇數時會再加一。我們再次改寫為以下等價的實作： ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a & b) + ((a ^ b) >> 1); } ``` **原理：** 上述程式碼讓我們很容易聯想到加法器，我們先觀察一下半加器，以下是真值表 ({C, S} = A + B)： | A | B | C | S | | --- | --- | --- |:---:| | 0 | 0 | 0 | 0 | | 0 | 1 | 0 | 1 | | 1 | 0 | 0 | 1 | | 1 | 1 | 1 | 0 | 得到以下結果： \begin{gather} S = A \oplus B \\ C = A \times B \end{gather} 而全加器則是除了兩個一位元的輸入訊號，還需要低一位的進位訊號 (即 {C~out~, S} = A + B + C~in~)。當我們要計算多位元的非負整數加法，可以利用多個全加器串接 (或稱級聯，cascade)。若要使用到半加器的結果，當前位數的運算結果，可以看作是當前位數的兩輸入訊號相加再加上前一位的進位訊號 (即 S + C~in~，注意，這裡的 S 為半加器的 Sum，C~in~ 為前一位的 C~out~)，也就是當前位數兩輸入訊號的 bitwise XOR 加上前一位數兩輸入訊號的 bitwise AND。因此，兩多位數 (分別為 a 和 b) 相加的結果可以寫作： ```c ((a & b) << 1) + (a ^ b) ``` 兩數平均可以寫作： ```c (a & b) + ((a ^ b) >> 1) ``` :::success 比較上述實作在編譯器最佳化開啟的狀況，對應的組合語言輸出，並嘗試解讀 (可研讀 [CS:APP 第 3 章](https://hackmd.io/@sysprog/CSAPP-ch3)) ::: ==暫略==，我目前沒有觀察到什麼有趣的現象。 :::success 研讀 Linux 核心原始程式碼 [include/linux/average.h](https://github.com/torvalds/linux/blob/master/include/linux/average.h)，探討其 [Exponentially weighted moving average](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average) (EWMA) 實作，以及在 Linux 核心的應用 > 移動平均（Moving average)，又稱滾動平均值、滑動平均，在統計學中是種藉由建立整個資料集合中不同子集的一系列平均數，來分析資料點的計算方法。移動平均通常與時間序列資料一起使用，以消除短期波動，突出長期趨勢或周期。短期和長期之間的[閾值](https://terms.naer.edu.tw/detail/1085244/)取決於應用，移動平均的參數將相應地設置。例如，它通常用於對財務數據進行技術分析，如股票價格、收益率或交易量。它也用於經濟學中研究國內生產總值、就業或其他宏觀經濟時間序列。 ::: 在 [include/linux/average.h](https://github.com/torvalds/linux/blob/master/include/linux/average.h) 中使用巨集 `DECLARE_EWMA(name, _precision, _weight_rcp)` 用來生成相關的物件和函式。其中，`_precision` 是用來表示小數部份的精準度，也就是我們使用多少位元表示。而 `_weight_rcp` 則是權重。而巨集內部包含以下結構體： ```c struct ewma_##name { \ unsigned long internal; \ }; ``` 以及以下函式： * **void ewma_##name##_init(struct ewma_##name *e)*** 將 `internal` 初始化為 `0`。 * **unsigned long ewma_##name##_read(struct ewma_##name *e)*** 將 `internal` 右移 `_precision` 的位元數回傳。 * **void ewma_##name##_add(struct ewma_##name *e, unsigned long val)*** 將 `val` 新增進來，新的 `internal` 會是舊的 `internal` 和 `val` 的線性組合 (`val` 佔 1 / _weight_rcp，舊的 `internal` 佔 1 - 1 / _weight_rcp)。舊的 `internal` 佔比的部份，原本應寫作： \begin{gather} \mbox{(internal << precision) - (internal << precision >> weight_rcp)} \end{gather}經化簡可寫成： \begin{gather} \mbox{(internal << weight_rcp) - internal} \end{gather} **應用：** 比方說 [drivers/net/wireless/realtek/rtw88/main.h](https://github.com/torvalds/linux/blob/8efd0d9c316af470377894a6a0f9ff63ce18c177/drivers/net/wireless/realtek/rtw88/main.h) 有使用到： ```c DECLARE_EWMA(tp, 10, 2); struct ewma_tp tx_ewma_tp; ``` 可以推測跟一些訊號處理有關係。 ## 測驗 `2` :::success 解釋上述程式碼運作的原理 ::: 改寫〈[解讀計算機編碼](https://hackmd.io/@sysprog/binary-representation)〉一文的「不需要分支的設計」一節提供的程式碼 `min`，我們得到以下實作 (`max`)： ```c #include <stdint.h> uint32_t max(uint32_t a, uint32_t b) { return a ^ ((a ^ b) & -(a < b)); } ``` **原理：** 我們知道 a 和 b 做 bitwise XOR，再和 a 做 bitwise XOR 會得到 b，因此我們可以運用這個原理讓 a 和 (a ^ b) 或是 0 做 bitwise XOR 而分別得到 b 和 a。至於要怎麼做到不使用分支，我們可以利用 a < b 的結果，加上負號，會產生兩種結果 (0x00...00 和 0xff...ff)，再加上 bitwise AND 可以巧妙的做到無分支判斷最大值。 :::success 針對 32 位元無號/有號整數，撰寫同樣 [branchless](https://en.wikipedia.org/wiki/Branch_(computer_science)) 的實作 ::: 當我們要判斷一個 32 位元無號整數是否為 2 的冪，我們可以用以下程式碼判斷： ```c #include <stdint.h> #include <stdbool.h> bool is_power_of_2(uint32_t n) { uint32_t t = 0x80000000; while (t) { if (n == t) { return true; } t >>= 1; } return false; } ``` 我們可以改寫成以下 branchless 的實作： ```c #include <stdint.h> #include <stdbool.h> bool is_power_of_2(uint32_t n) { return !(n & (n - 1)) && n; } ``` :::success Linux 核心也有若干 branchless / branch-free 的實作，例如 [lib/sort.c](https://github.com/torvalds/linux/blob/master/lib/sort.c): ```c /* * Logically, we're doing "if (i & lsbit) i -= size;", but since the * branch is unpredictable, it's done with a bit of clever branch-free * code instead. */ __attribute_const__ __always_inline static size_t parent(size_t i, unsigned int lsbit, size_t size) { i -= size; i -= size & -(i & lsbit); return i / 2; } ``` 請在 Linux 核心原始程式碼中，找出更多類似的實作手法。請善用 `git log` 檢索。 ::: 在 [Replace int2float() with an optimized version.](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=747f49ba67b8895a5831ab539de551b916f3738c) (commit 747f49ba) 中，將 32 位元整數的 bit pattern 轉成浮點數的 bit pattern，有使用到類似的實作手法 (不過目前 r600_blit 相關程式已背棄用)。以下是 [IEEE 754-2008](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4610935) 定義的 `binary32` 格式，其中藍、綠、紅分別代表 sign、exponent 和 fraction，分別佔用 1、8、23 位元： ```graphviz digraph G { n [ shape = none label = <<table border="0" cellspacing="0"><tr> <td border="1" bgcolor="blue">0</td> <td border="1" bgcolor="green">1</td> <td border="1" bgcolor="green">1</td> <td border="1" bgcolor="green">0</td> <td border="1" bgcolor="green">1</td> <td border="1" bgcolor="green">1</td> <td border="1" bgcolor="green">0</td> <td border="1" bgcolor="green">0</td> <td border="1" bgcolor="green">0</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">1</td> <td border="1" bgcolor="red">0</td> <td border="1" bgcolor="red">0</td> </tr></table>> ] } ``` 完整實作如下： ```c /* 23 bits of float fractional data */ #define I2F_FRAC_BITS 23 #define I2F_MASK ((1 << I2F_FRAC_BITS) - 1) /* * Converts unsigned integer into 32-bit IEEE floating point representation. * Will be exact from 0 to 2^24. Above that, we round towards zero * as the fractional bits will not fit in a float. (It would be better to * round towards even as the fpu does, but that is slower.) */ uint32_t int2float(uint32_t x) { uint32_t msb, exponent, fraction; /* Zero is special */ if (!x) return 0; /* Get location of the most significant bit */ msb = __fls(x); /* * Use a rotate instead of a shift because that works both leftwards * and rightwards due to the mod(32) behaviour. This means we don't * need to check to see if we are above 2^24 or not. */ fraction = ror32(x, (msb - I2F_FRAC_BITS) & 0x1f) & I2F_MASK; exponent = (127 + msb) << I2F_FRAC_BITS; return fraction + exponent; } ``` 其中用到一個 `ror32` 的函式，是被定義在 [include/linux/bitops.h](https://github.com/torvalds/linux/blob/a48b0872e69428d3d02994dcfad3519f01def7fa/include/linux/bitops.h#L116)，程式碼如下： ```c static inline __u32 ror32(__u32 word, unsigned int shift) { return (word >> (shift & 31)) | (word << ((-shift) & 31)); } ``` 有些指令集架構有支援 rotate 指令，比方說 [x86_64 指令集](https://en.wikipedia.org/wiki/X86_instruction_listings) 具備 `ror`、`rol` 指令，上述程式碼經過編譯器最佳化，確實會產生出 `ror` 指令。新版 `int2float` 的程式碼先使用 `__fls` 來找出該整數的指數部份。原本的實作 fraction 會用左移加上條件判斷來求值，新版使用 `(msb - I2F_FRAC_BITS) & 0x1f)`，搭配 `ror32` 來實作，減少檢查是否大於 $2^{24}$ 所需的分支，前者利用 unsigned integer 減法的溢位來決定輸入小於 $2^{24}$ 時的位移數目，搭配後者 rotate 指令，不管輸入大於或是小於 $2^{24}$，皆可以做到保留 MSB 之後的 23 個位元給 `fraction` 的效果。 ## 測驗 `3` :::success 解釋上述程式運作原理; ::: 考慮以下實作： ```c #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } while (!(u & 1)) u /= 2; do { while (!(v & 1)) v /= 2; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (v); return u << shift; } ``` 性質： 1. If both x and y are 0, gcd is zero $gcd(0, 0) = 0$. 2. $gcd(x, 0) = x$ and $gcd(0, y) = y$ because everything divides $0$. 3. If x and y are both even, $gcd(x, y) = 2 * gcd(\frac{x}{2}, \frac{y}{2})$ because $2$ is a common divisor. 4. If x is even and y is odd, $gcd(x, y) = gcd(\frac{x}{2}, y)$. 5. On similar lines, if x is odd and y is even, then $gcd(x, y) = gcd(x, \frac{y}{2})$. It is because $2$ is not a common divisor. 一開始的 if 判斷式對應到性質 1、2，如果 x、y 皆為 0，回傳 0；如果 y 為 0，回傳 x；如果 x 為 0，回傳 y。接著，for 迴圈對應到性質 3，並用 `shift` 記錄乘了多少次 2。while 迴圈對應到性質 4。do while 迴圈則是用到性質 5 以及 [Euclid's_algorithm](https://en.wikipedia.org/wiki/Greatest_common_divisor#Euclid's_algorithm)。做到最後 `u` 和 `v` 會相等，會跳到 else 區塊，做完該區塊後 `u` 為一非 0 數值，`v` 為 0，跳出迴圈，再將 `u` 左移 `shift`。 ## 測驗 `4` :::success 解釋上述程式運作原理，並舉出這樣的程式碼用在哪些真實案例中; ::: 在影像處理中，[bit array](https://en.wikipedia.org/wiki/Bit_array) (也稱 `bitset`) 廣泛使用，考慮以下程式碼: ```c #include <stddef.h> size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; for (size_t k = 0; k < bitmapsize; ++k) { uint64_t bitset = bitmap[k]; size_t p = k * 64; for (int i = 0; i < 64; i++) { if ((bitset >> i) & 0x1) out[pos++] = p + i; } } return pos; } ``` 考慮 GNU extension 的 [`__builtin_ctzll`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 的行為是回傳由低位往高位遇上連續多少個 `0` 才碰到 `1`。 > 範例: 當 `a = 16` `16` 這個十進位數值的二進位表示法為 00000000 00000000 00000000 00010000 從低位元 (即右側) 往高位元，我們可發現 $0 \rightarrow 0 \rightarrow 0 \rightarrow 0 \rightarrow 1$，於是 ctz 就為 4，表示最低位元往高位元有 4 個 `0` 用以改寫的程式碼如下: ```c #include <stddef.h> size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = bitset & -bitset; int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } ``` 在 `naive()` 這個函式中，需要將 bitmap 中的所有 `1` 的位置記錄到 `out` 中，並回傳 `pos` 代表 `out` 的 size。其中，我們可以使用 `__builtin_ctzll` 優化，如 `improved()` 函式所示。使用 `ctz` 找到目前 bitmap 中的 least significant `1`，並將該 `1` clear，進行下一次迭代。我們可以觀察到 `bitset ^= t` 這一行程式就是 clear bit 的操作。至於 `t = bitset & -bitset` 則是利用二補數的特性 (可觀察 [解讀計算機編碼](https://hackmd.io/@sysprog/binary-representation) 的圖)，取二補數相當於對原數作 bitwise NOT 後加 1。二補數再和原數作 bitwise AND 可以得到一個數，該數的第 r 個 bit 為 set，其餘為 clear (等價於 `1 << r`)。 :::success 設計實驗，檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進？應考慮到不同的 [bitmap density](http://www.vki.com/2013/Support/docs/vgltools-3.html); ::: :::success 思考進一步的改進空間; ::: 可改寫成以下等價實作： ```c #include <stddef.h> size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= 1UL << r; } } return pos; } ``` :::success 閱讀 [Data Structures in the Linux Kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-3.html) 並舉出 Linux 核心使用 bitmap 的案例; ::: Linux 的 [kernel/sched/sched.h](https://elixir.bootlin.com/linux/latest/source/kernel/sched/sched.h#L254)，有關 RT scheduling 就有使用到 bitmap： ```c /* * This is the priority-queue data structure of the RT scheduling class: */ struct rt_prio_array { DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */ struct list_head queue[MAX_RT_PRIO]; }; ``` 搭配下述程式碼可以找出 bitmap 中的 first set： ```c /* * Every architecture must define this function. It's the fastest * way of searching a 100-bit bitmap. It's guaranteed that at least * one of the 100 bits is cleared. */ static inline int sched_find_first_bit(const unsigned long *b) { #if BITS_PER_LONG == 64 if (b[0]) return __ffs(b[0]); return __ffs(b[1]) + 64; #elif BITS_PER_LONG == 32 if (b[0]) return __ffs(b[0]); if (b[1]) return __ffs(b[1]) + 32; if (b[2]) return __ffs(b[2]) + 64; return __ffs(b[3]) + 96; #else #error BITS_PER_LONG not defined #endif } ```

Read more

2022q1 Homework1 (quiz1)

Lab1: RV32I Simulator

BMQ 排程器研究

2022q1 Homework4 (quiz4)