2022q1 Homework2 (quiz2)

# 2022q1 Homework2 (quiz2) contributed by < [`RoyWFHuang`](https://github.com/RoyWFHuang) > ###### tags: `linux_class` > [2022q1 第 2 週測驗題](https://hackmd.io/@sysprog/linux2022-quiz2?fbclid=IwAR3DOwHxkJil7Jbr7UG8liqY9yPVOqfNcQQQ2w9dGDobJXz0HbmisETHvEE#2022q1-%E7%AC%AC-2-%E9%80%B1%E6%B8%AC%E9%A9%97%E9%A1%8C1) 開發環境為: OS: ubuntu 20.04 kernel ver: 5.4.0-99-generic CPU arch: x86_64 ```shell roy@roy-ThinkPad-T460s:$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 20.04.4 LTS Release: 20.04 Codename: focal roy@roy-ThinkPad-T460s:~$ uname -a Linux roy-ThinkPad-T460s 5.4.0-100-generic ``` --- ## 測驗 1 ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + (EXP1); } uint32_t average(uint32_t a, uint32_t b) { return (EXP2) + ((EXP3) >> 1); } ``` EXP1 = a & b & 1 EXP2 = a & b EXP3 = a ^ b :::success 延伸問題: 1. 解釋上述程式碼運作的原理 2. 比較上述實作在編譯器最佳化開啟的狀況，對應的組合語言輸出，並嘗試解讀 (可研讀 CS:APP 第 3 章) 3. 研讀 Linux 核心原始程式碼 [include/linux/average.h](https://github.com/torvalds/linux/blob/master/include/linux/average.h)，探討其 [Exponentially weighted moving average (EWMA)](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average) 實作，以及在 Linux 核心的應用 > 移動平均（Moving average)，又稱滾動平均值、滑動平均，在統計學中是種藉由建立整個資料集合中不同子集的一系列平均數，來分析資料點的計算方法。移動平均通常與時間序列資料一起使用，以消除短期波動，突出長期趨勢或周期。短期和長期之間的閾值取決於應用，移動平均的參數將相應地設置。例如，它通常用於對財務數據進行技術分析，如股票價格、收益率或交易量。它也用於經濟學中研究國內生產總值、就業或其他宏觀經濟時間序列。 ::: ### 1. 解釋上述程式碼運作的原理 1. EXP1 = a & b & 1 $\dfrac{a+b}{2}=\dfrac{a}{2}+\dfrac{b}{2}$ 而 $\dfrac{x}{2}$ 可以換成 (x >> 1) 所以運算式變成 (a >> 1) + (b >> 1), 但是 x >> 1 之後, 原本的 $\dfrac{x}{2}$.5 中的 0.5會因為位移而消失, 而當兩個 odd 相加的時候, 兩個的 0.5 就會造成結果進位, 所以最後要在多一個補償 (a & b & 1) 判別兩個數字是否都為奇數. 2. EXP2 = a & b, EXP3 = a ^ b 根據加法器的邏輯 a + b = (a & b) << 1 + a ^b 所以 (a + b)/2 = ((a & b) << 1 + a ^b) >> 1 展開後就變成上面的式子 ### 2. 對應的組合語言輸出，並嘗試解讀 O1 下 (a >> 1) + (b >> 1) + (EXP1) ```asm endbr64 movl %edi, %eax shrl %eax movl %esi, %edx shrl %edx addl %edx, %eax andl %esi, %edi andl $1, %edi addl %edi, %eax ret ``` (a & b) + ((a ^ b) >> 1) ```asm average2: endbr64 movl %edi, %eax xorl %esi, %eax shrl %eax andl %esi, %edi addl %edi, %eax ret ``` O3 / Os 下 (a >> 1) + (b >> 1) + (EXP1) ```asm movl %edi, %eax movl %esi, %edx andl %esi, %edi shrl %eax shrl %edx andl $1, %edi addl %edx, %eax addl %edi, %eax ret ``` (a & b) + ((a ^ b) >> 1) ```asm movl %edi, %eax andl %esi, %edi xorl %esi, %eax shrl %eax addl %edi, %eax ret ``` ## 測驗 2 ```c #include <stdint.h> uint32_t max(uint32_t a, uint32_t b) { return a ^ ((EXP4) & -(EXP5)); } ``` EXP4 = a ^ b EXP5 = a < b :::success 延伸問題: 解釋上述程式碼運作的原理針對 32 位元無號/有號整數，撰寫同樣 branchless 的實作 Linux 核心也有若干 branchless / branch-free 的實作，例如 lib/sort.c: > ```c > /* > * Logically, we're doing "if (i & lsbit) i -= size;", but since the > * branch is unpredictable, it's done with a bit of clever branch-free > * code instead. > */ > __attribute_const__ __always_inline > static size_t parent(size_t i, unsigned int lsbit, size_t size) > { > i -= size; > i -= size & -(i & lsbit); > return i / 2; > } > ``` 請在 Linux 核心原始程式碼中，找出更多類似的實作手法。請善用 git log 檢索。 ::: ## 測驗 3 ```c #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } while (!(u & 1)) u /= 2; do { while (!(v & 1)) v /= 2; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (COND); return RET; } ``` COND = v RET = (u << shift) :::success 延伸問題: 解釋上述程式運作原理; 在 x86_64 上透過 __builtin_ctz 改寫 GCD，分析對效能的提升; Linux 核心中也內建 GCD (而且還不只一種實作)，例如 lib/math/gcd.c，請解釋其實作手法和探討在核心內的應用場景。 ::: ## 測驗 4 ```c #include <stddef.h> size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = EXP6; int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } ``` EXP6 = bitset & (-bitset) :::success 延伸問題: 解釋上述程式運作原理，並舉出這樣的程式碼用在哪些真實案例中; 設計實驗，檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進？應考慮到不同的 bitmap density; 思考進一步的改進空間; 閱讀 Data Structures in the Linux Kernel 並舉出 Linux 核心使用 bitmap 的案例; ::: ## [測驗 5](https://hackmd.io/@sysprog/linux2022-quiz2?fbclid=IwAR3DOwHxkJil7Jbr7UG8liqY9yPVOqfNcQQQ2w9dGDobJXz0HbmisETHvEE#%E6%B8%AC%E9%A9%97-5) ```c #include <stdbool.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include "list.h" struct rem_node { int key; int index; struct list_head link; }; static int find(struct list_head *heads, int size, int key) { struct rem_node *node; int hash = key % size; list_for_each_entry (node, &heads[hash], link) { if (key == node->key) return node->index; } return -1; } char *fractionToDecimal(int numerator, int denominator) { int size = 1024; char *result = malloc(size); char *p = result; if (denominator == 0) { result[0] = '\0'; return result; } if (numerator == 0) { result[0] = '0'; result[1] = '\0'; return result; } /* using long long type make sure there has no integer overflow */ long long n = numerator; long long d = denominator; /* deal with negtive cases */ if (n < 0) n = -n; if (d < 0) d = -d; bool sign = (float) numerator / denominator >= 0; if (!sign) *p++ = '-'; long long remainder = n % d; long long division = n / d; sprintf(p, "%ld", division > 0 ? (long) division : (long) -division); if (remainder == 0) return result; p = result + strlen(result); *p++ = '.'; /* Using a map to record all of reminders and their position. * if the reminder appeared before, which means the repeated loop begin, */ char *decimal = malloc(size); memset(decimal, 0, size); char *q = decimal; size = 1333; struct list_head *heads = malloc(size * sizeof(*heads)); for (int i = 0; i < size; i++) INIT_LIST_HEAD(&heads[i]); for (int i = 0; remainder; i++) { int pos = find(heads, size, remainder); if (pos >= 0) { while (PPP > 0) *p++ = *decimal++; *p++ = '('; while (*decimal != '\0') *p++ = *decimal++; *p++ = ')'; *p = '\0'; return result; } struct rem_node *node = malloc(sizeof(*node)); node->key = remainder; node->index = i; MMM(&node->link, EEE); *q++ = (remainder * 10) / d + '0'; remainder = (remainder * 10) % d; } strcpy(p, decimal); return result; } ``` PPP = pos-- MMM = list_add EEE = &heads[remainder % size] :::success 延伸問題: 1. 解釋上述程式碼運作原理，指出其中不足，並予以改進 >例如，判斷負號只要寫作 bool isNegative = numerator < 0 ^ denominator < 0; 搭配研讀 The simple math behind decimal-binary conversion algorithms 2. 在 Linux 核心原始程式碼的 mm/ 目錄 (即記憶體管理) 中，找到特定整數比值 (即 fraction) 和 bitwise 操作的程式碼案例，予以探討和解說其應用場景 ::: ## [測驗 6](https://hackmd.io/@sysprog/linux2022-quiz2?fbclid=IwAR3DOwHxkJil7Jbr7UG8liqY9yPVOqfNcQQQ2w9dGDobJXz0HbmisETHvEE#%E6%B8%AC%E9%A9%97-6) ```c /* * ALIGNOF - get the alignment of a type * @t: the type to test * * This returns a safe alignment for the given type. */ #define ALIGNOF(t) \ ((char *)(&((struct { char c; t _h; } *)0)->M) - (char *)X) ``` M = _h X = 0 :::success 延伸問題: 解釋上述程式碼運作原理在 Linux 核心原始程式碼中找出 __alignof__ 的使用案例 2 則，並針對其場景進行解說在 Linux 核心源使程式碼找出 ALIGN, ALIGN_DOWN, ALIGN_UP 等巨集，探討其實作機制和用途，並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻，儘量共用相同的巨集 ::: ## [測驗 7](https://hackmd.io/@sysprog/linux2022-quiz2?fbclid=IwAR3DOwHxkJil7Jbr7UG8liqY9yPVOqfNcQQQ2w9dGDobJXz0HbmisETHvEE#%E6%B8%AC%E9%A9%97-7) ```c static inline bool is_divisible(uint32_t n, uint64_t M) { return n * M <= M - 1; } static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1; static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1; int main(int argc, char **argv) { for (size_t i = 1; i <= 100; i++) { uint8_t div3 = is_divisible(i, M3); uint8_t div5 = is_divisible(i, M5); unsigned int length = (2 << KK1) << KK2; char fmt[9]; strncpy(fmt, &"FizzBuzz%u"[(9 >> div5) >> (KK3)], length); fmt[length] = '\0'; printf(fmt, i); printf("\n"); } return 0; } ``` KK1 = div3 KK2 = div5 KK3 = div3 << 2 :::success 1. 解釋上述程式運作原理並評估 naive.c 和 bitwise.c 效能落差 * 避免 stream I/O 帶來的影響，可將 printf 更換為 sprintf 2. 分析 Faster remainders when the divisor is a constant: beating compilers and libdivide 一文的想法 (可參照同一篇網誌下方的評論)，並設計另一種 bitmask，如「可被 3 整除則末位設為 1」「可被 5 整除則倒數第二位設定為 1」，然後再改寫 bitwise.c 程式碼，試圖運用更少的指令來實作出 branchless; * 參照 fastmod: A header file for fast 32-bit division remainders on 64-bit hardware 3. 研讀 The Fastest FizzBuzz Implementation 及其 Hacker News 討論，提出 throughput (吞吐量) 更高的 Fizzbuzz 實作 4. 解析 Linux 核心原始程式碼 kernel/time/timekeeping.c 裡頭涉及到除法運算的機制，探討其快速除法的實作 (注意: 你可能要對照研讀 kernel/time/ 目錄的標頭檔和程式碼) > 過程中，你可能會發現可貢獻到 Linux 核心的空間，請充分討論 ::: ### 1. 解釋上述程式運作原理並評估首先先來分析 ==UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1== 這段的作用, 參考([Faster remainders when the divisor is a constant: beating compilers and libdivide](https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/)) 中的範例 :::info > The intuition is as follows. To divide by four, you might choose to multiply by 0.25 instead. Take 5 * 0.25, you get 1.25. The integer part (1) gives you the quotient, and the decimal part (0.25) is indicative of the remainder: multiply 0.25 by 4 and you get 1 ::: ==(0xFFFFFFFFFFFFFFFF) / 3== 這個其實就可以視為是 0.25的意義, 用圖來解釋如下 (以下縮短至16bit解釋) ``` 0xFFFF 0xAAAA 0x5555 0x0000 |---------|---------|---------| ```` 在系統中超過 16bit (overflow) 會自動被切斷, 所以在此數值系統中, 若落在 0x0000 ~ 0x5555之間, 則會被視為可以被整除, 0x5556 ~ 0xAAAA 則是餘一, 0xAAAB ~ 0xFFFF 則是餘二. 而 ==+1== 的目的, 是為了調整讓他確保可以落於區間內. ``` e.g. 2/3 -> 可以視為 2 * (0x5555 + 1) = 0xAAAC, 所以就是餘2 4/3 -> 可以視為 4 * (0x5555 + 1) = 0x5558, 所以就是餘1 ``` PS: 這也就是為什麼需要 +1 做為調整, 必須讓數值完全落於區段內. 如果沒有 +1, 2 * (0x5555) = 0xAAAA 則會剛好落於邊界, 無法判別是 1 or 2. 基於這個原理, 於是我們可以改寫 ```c static inline bool is_divisible(uint32_t n, uint64_t M) { return n * M <= M - 1; } ``` 這段變成 ```c static inline uint32_t mod(uint32_t n, uint64_t M) { if(!(n * M <= M - 1)) { return n * M / M; } else { return 0; } } ``` 可以變成求取餘數. 但問題來了, 為什麼他只能用來計算 uint32_t 以內, 超過不行呢? 以下我們來做個實驗 n = 0x7FFFFFFFFFFFFFFF 跟 n = 0x8000000000000000 套入看看 ```c printf("M3 0x7FFFFFFFFFFFFFFF (%d)\n", mod(0x7FFFFFFFFFFFFFFF, M3)); printf("M3 0x8000000000000000 (%d)\n", mod(0x8000000000000000, M3)); ``` ```shell M3 0x7FFFFFFFFFFFFFFF (1) M3 0x8000000000000000 (0) ``` 很明顯的錯誤了, 至於未什麼會出現這錯誤, 原因就在於 ==+1==, 因為每次運算的數值都會被多1出來 e.g 同樣以 16bit為例 |num | hex after mul M3| over num(compare with 0x5555)| |-|-|-| |1 | 0x5556 | 1 | |2 | 0xAAAC | 2 | |3 | 0x0002 | 2 | |4 | 0x5558 | 3 | |5 | 0xAAAE | 4 | |6 | 0x0004 | 4 | 我們換個陳列方式 |N|hex|over||N|hex|over| |-|-|-|-|-|-|-| |1 | 0x5556 | 1 ||4 | 0x5558 | 3 | |2 | 0xAAAC | 2 ||5 | 0xAAAE | 4 | |3 | 0x0002 | 2 ||6 | 0x0004 | 4 | 很明顯以2為倍數去增加, 那什麼時候會 overflow? 當他超過到下個區間的時候, 也就是 0x5555 + 0x5555(over flow) 此時的數值就會是錯誤的了. 所以就可以推算出他的最大可運算的數字是多少了(0x5555/2 + 0x5555 = 0x7FFF) :::warning 但問題來了 ==/3== ==/5== 都可以正確算出數字是多少 ```shell M3 0x7FFFFFFFFFFFFFFF (1) M3 0x8000000000000000 (0) M5 0x3FFFFFFFFFFFFFFF (3) M5 0x4000000000000000 (0) ``` 但是 ==/7== 跟 ==/9== 卻不是此規則 ```shell M7 0x3333333333333335 (5) M7 0x3333333333333336 (0) ``` 猜測是 0xFFFF...FFFF/7 之後不是整除, 而是 FFFFFFFFFFFFFFFE / 7 才是整除造成, 但我找不出公式可以解決, 有人可以幫忙解決嗎? :::