# 2022q1 Homework2 (quiz2)
contributed by < [`chengingx`](https://github.com/chengingx) >
###### tags: `linux2022`
> [測驗題目](https://hackmd.io/@sysprog/linux2022-quiz2)
## 測驗 1
考慮以下對二個無號整數取平均值的程式碼:
```c
#include <stdint.h>
uint32_t average(uint32_t a, uint32_t b)
{
return (a + b) / 2;
}
```
這個直覺的解法會有 overflow 的問題,若我們已知 a, b 數值的大小,可用下方程式避免 overflow:
```c
uint32_t average(uint32_t low, uint32_t high)
{
return low + (high - low) / 2;
}
```
接著我們可改寫為以下等價的實作:
```c
uint32_t average(uint32_t a, uint32_t b)
{
return (a >> 1) + (b >> 1) + (a & b & 1);
}
```
我們知道往右 shift 一位元可以用於除以 2 的操作,但當我們在作曲平均運算是兩數皆是奇數時,則會少 1,因此我們能專注於兩數的最後一個位元是否皆為 1 即可。
我們再次改寫為以下等價的實作:
```c
uint32_t average(uint32_t a, uint32_t b)
{
return a & b + ((a ^ b) >> 1);
}
```
計算過程如下
```c
a + b = a ^ b + (a & b) << 1
(a + b) / 2 = a & b + ((a ^ b) >> 1)
```
:::info
- [X] 解釋上述程式碼運作的原理
- [X] 比較上述實作在編譯器最佳化開啟的狀況,對應的組合語言輸出,並嘗試解讀 (可研讀 CS:APP 第 3 章)
- [X] 研讀 Linux 核心原始程式碼 include/linux/average.h,探討其 Exponentially weighted moving average (EWMA) 實作,以及在 Linux 核心的應用
:::
### 編譯器最佳化開啟
首先研究 x86-64 的暫存器有哪些,根據 [CS356: A short guide to x86-64 assembly](https://hackmd.io/@paolieri/x86_64),We also need to use instructions (e.g., movX or addX) where the suffix X (one of q, l, w, b) matches the size of the register:


* %rax, %rbx, %rcx, %rdx (the “a, b, c, d”)
* %rsi, %rdi (source index, destination index)
* %rsp, %rbp (stack pointer, base pointer)
* %r8, %r9, …, %r15 (just numbers here)
接著可以觀察以下函式的 assembly code
##### (a + b) / 2
```c
uint32_t average(uint32_t a, uint32_t b)
{
return (a + b) / 2;
}
```
```shell
$ gcc -S -O2 -o test1.s test1.c
```
> The command-line option -O1 instructs the compiler to apply level-one optimizations. In general, increasing the level of optimization makes the final program run faster, but at a risk of increased compilation time and difficulties running debugging tools on the code.
> In practice, level-two optimization (specified with the option -O2) is considered a better choice in terms of the resulting program performance.
```
average:
leal (%rdi,%rsi), %eax
shrl %eax
ret
```
首先了解 `leal (%rdi,%rsi), %eax` 為何物?
> All of the lines beginning with ‘.’ are directives to guide the assembler and linker. We can generally ignore these. On the other hand, there are no explanatory
remarks about what the instructions do or how they relate to the source code.
[Introduction to the leal instruction](https://www.cs.swarthmore.edu/~newhall/cs31/resources/addrleal.php) 提到 leal looks like a mov instr, but does not access Memory. Load Effective Address only loads the address.
> leal (%eax),%ecx # R[%ecx]<--&(M[R[%eax]])
> \# this moves the value stored in %eax to %ecx
>
> Assume: %eax: x %edx: y
> leal 10(%eax, %edx, 5), %ecx # R[%ecx] <-- x + 5y + 10
>
> Load Effective Address: leaq 10(%rax,%rbx,4),%rcx saves 10 + %rax + %rbx * 4 into %rcx. This is quite useful to combine simple arithmetic operations into one, or to compute a memory address and store it in a register %rcx for later use. Only the leaq / leal variants exist in x86-64 and they are used quite frequently by the compiler:
`%rdi` 是存放 `a` `%rsi` 存放 `b`,`leal (%rdi,%rsi), %eax` 則是將 `a` 與 `b` 的值做相加放入 `%eax`
而 `shrl %eax` 則是將 `%eax` 右移 1 個 bit (Logical right shift by 1)
##### low + (high - low) / 2
```c
uint32_t average(uint32_t low, uint32_t high)
{
return low + (high - low) / 2;
}
```
```
average:
.LFB0:
movl %esi, %eax
subl %edi, %eax
shrl %eax
addl %edi, %eax
ret
```
`%esi` 將值放入 `%eax`,也就是 `high`,因為`subl Src, Dest # Dest = Dest - Src`,所以`%eax = %eax - %edi`,其中`%edi` 放的是 `low`,透過 right shift 後,加 `a`
##### (a >> 1) + (b >> 1) + (a & b & 1)
```c
uint32_t average(uint32_t a, uint32_t b)
{
return (a >> 1) + (b >> 1) + (a & b & 1);
}
```
```
average:
.LFB0:
movl %esi, %edx
shrl %esi # a >> 1
andl $1, %edx # a & 1
andl %edi, %edx # a & 1 & b
shrl %edi # b >> 1
addl %edx, %edi # b >> 1 + a & 1 & b
leal (%rdi,%rsi), %eax # a >> 1 + b >> 1 + a & 1 & b
ret
```
##### a & b + ((a ^ b) >> 1)
```c
uint32_t average(uint32_t a, uint32_t b)
{
return a & b + ((a ^ b) >> 1);
}
```
```
average:
.LFB0:
movl %edi, %eax
xorl %esi, %eax # a ^ b
shrl %eax # (a ^ b) >> 1
leal (%rax,%rsi), %eax # (a ^ b) >> 1 + b
andl %edi, %eax
ret
```
### 研讀 Linux 核心原始程式碼 [include/linux/average.h](https://github.com/torvalds/linux/blob/master/include/linux/average.h),探討其 Exponentially weighted moving average (EWMA) 實作,以及在 Linux 核心的應用
[EWMA](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average) is a first-order infinite impulse response filter that applies weighting factors which decrease exponentially.
$S_t = \begin{cases}
Y_0, & t = 0 \\
\alpha Y_t + (1 - \alpha) \cdot S_{t-1}, & t > 0
\end{cases}$
* The coefficient $\alpha$ represents the degree of weighting decrease, a constant smoothing factor between 0 and 1. A higher $\alpha$ discounts older observations faster.
* $Y_t$ is the value at a time period $t$.
* $S_t$ is the value of the EMA at any time period $t$.
```c
#define DECLARE_EWMA(name, _precision, _weight_rcp)
```
* `_precision` : expresses how many bits are used for the fractional part of the fixed-precision values.
* `_weight_rcp` : new values will get weight 1/weight_rcp ($\alpha$) and old values 1-1/weight_rcp (1-$\alpha$). (reciprocal : 倒數)
* Note that this parameter must be ==a power of two== for efficiency.
首先建立一個含有 `internal` 的 `ewma_##name` 的 struct
```c
struct ewma_##name {
unsigned long internal;
};
```
`ewma_##name##_init` 函式有些檢查機制,其中`__builtin_constant_p` 為 gcc 提供的函式
```c
static inline void ewma_##name##_init(struct ewma_##name *e)
{
BUILD_BUG_ON(!__builtin_constant_p(_precision));
BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp));
BUILD_BUG_ON((_precision) > 30);
BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp);
e->internal = 0;
}
```
`BUILD_BUG_ON` 巨集主要是希望能在編譯時期就找發現問題,繼續追蹤下去會發現 `__compiletime_assert` 函示
[include/linux/build_bug.h](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/tools/include/linux/build_bug.h)
```c
#define BUILD_BUG_ON(condition) \
BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
```
[include/linux/build_bug.h](https://github.com/torvalds/linux/blob/6f38be8f2ccd9babf04b9b23539108542a59fcb8/include/linux/compiler_types.h)
```c
#ifdef __OPTIMIZE__
# define __compiletime_assert(condition, msg, prefix, suffix)
do {
__noreturn extern void prefix ## suffix(void)
__compiletime_error(msg);
if (!(condition))
prefix ## suffix();
} while (0)
#else
# define __compiletime_assert(condition, msg, prefix, suffix) do { } while (0)
#endif
```
> [你所不知道的 C 語言:前置處理器應用篇](https://hackmd.io/@sysprog/c-preprocessor?type=view) 提到 do { ... } while(0) 巨集用於避開 dangling else
在檢查機制後,`ewma_##name##_init` 主要做的事是 `e->internal = 0`
```c
static inline unsigned long
ewma_##name##_read(struct ewma_##name *e)
{
BUILD_BUG_ON(!__builtin_constant_p(_precision));
BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp));
BUILD_BUG_ON((_precision) > 30);
BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp);
return e->internal >> (_precision);
}
```
主要做 `e->internal >> (_precision)`
```c
static inline void ewma_##name##_add(struct ewma_##name *e,
unsigned long val)
{
unsigned long internal = READ_ONCE(e->internal);
unsigned long weight_rcp = ilog2(_weight_rcp);
unsigned long precision = _precision;
BUILD_BUG_ON(!__builtin_constant_p(_precision));
BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp));
BUILD_BUG_ON((_precision) > 30);
BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp);
WRITE_ONCE(e->internal, internal ?
(((internal << weight_rcp) - internal) +
(val << precision)) >> weight_rcp :
(val << precision));
}
```
`READ_ONCE` 相關資料整理與實驗我記錄在 [`READ_ONCE` 實驗](https://hackmd.io/@chenging/read_once)
`ilog2` : log base 2 of 32-bit or a 64-bit unsigned value
```c
#define ilog2(n)
(
__builtin_constant_p(n) ?
((n) < 2 ? 0 :
63 - __builtin_clzll(n)) :
(sizeof(n) <= 4) ?
__ilog2_u32(n) :
__ilog2_u64(n)
)
```
:::info
TODO : 了解 ilog2(n) 細部操作
:::
```c
WRITE_ONCE(e->internal, internal ?
(((internal << weight_rcp) - internal) +
(val << precision)) >> weight_rcp :
(val << precision));
```
`e->internal` 用於存取過去權重過後的數值
* 當沒有 `internal` 時,也就是當 $t=0$:
$e->internal=value$
* 如果有 `internal` 時:
$e->internal\\=((\dfrac{1}{\alpha} - 1)\cdot internal+value)\cdot{\alpha}\\={\alpha}\cdot value+(1-{\alpha}) \cdot internal$
#### 在 Linux 核心的應用
尋找 `DECLARE_EWMA` 關鍵字 `17 code results in torvalds/linux or view all results on GitHub
`,看到蠻多 `wireless` 與 `gpu` 的字眼,推測==常使用在計算機網路與 GPU==,應是用於測量執行時間,且讓曲線平滑
[linux/drivers/net/wireless/realtek/rtw88/main.h](https://github.com/torvalds/linux/blob/8efd0d9c316af470377894a6a0f9ff63ce18c177/drivers/net/wireless/realtek/rtw88/main.h)
```c
DECLARE_EWMA(tp, 10, 2);
// _precision = 10, _weight_rcp = 2
```
[linux/drivers/net/wireless/mediatek/mt76/mt76x02_mac.h](https://github.com/torvalds/linux/blob/dbe69e43372212527abf48609aba7fc39a6daa27/drivers/net/wireless/mediatek/mt76/mt76x02_mac.h)
```c
DECLARE_EWMA(pktlen, 8, 8);
// _precision = 8, _weight_rcp = 8
```
[linux/drivers/gpu/drm/i915/gt/intel_context_types.h](https://github.com/torvalds/linux/blob/8d0749b4f83bf4768ceae45ee6a79e6e7eddfc2a/drivers/gpu/drm/i915/gt/intel_context_types.h)
```c
DECLARE_EWMA(runtime, 3, 8);
// _precision = 3, _weight_rcp = 8
```
```c
/* Time on GPU as tracked by the hw. */
struct {
struct ewma_runtime avg;
u64 total;
u32 last;
I915_SELFTEST_DECLARE(u32 num_underflow);
I915_SELFTEST_DECLARE(u32 max_underflow);
} runtime;
```
:::info
TODO : _precision, _weight_rcp 怎麼定義的?
:::
## 測驗 2
改寫[〈解讀計算機編碼〉](https://hackmd.io/@sysprog/binary-representation)一文的「不需要分支的設計」一節提供的程式碼 min,我們得到以下實作 (max):
```c
#include <stdint.h>
uint32_t max(uint32_t a, uint32_t b)
{
return a ^ ((a ^ b) & -(a < b));
}
```
由 XOR 操作可以解決 [Leetcode 136. Single Number](https://leetcode.com/problems/single-number/) , 而今天可以用下面兩個特性解決此問題
```c
a ^ a ^ b = b
a ^ 0 = a
```
因此,在這裡我們可以分成兩種情況
* 當 $a\geq b$ 時
```c
a ^ ((a ^ b) & 0x00000000) = a
```
* 當 $a<b$ 時
```c
a ^ ((a ^ b) & 0xFFFFFFFF) = b
```
最後,將上述兩種情況合併
```c
a ^ ((a ^ b) & -(a < b))
```
:::info
延伸問題
- [x] 解釋上述程式碼運作的原理
- [x] 針對 32 位元無號/有號整數,撰寫同樣 branchless 的實作
- [ ] Linux 核心也有若干 branchless / branch-free 的實作,例如 lib/sort.c:
```c
/*
* Logically, we're doing "if (i & lsbit) i -= size;", but since the
* branch is unpredictable, it's done with a bit of clever branch-free
* code instead.
*/
__attribute_const__ __always_inline
static size_t parent(size_t i, unsigned int lsbit, size_t size)
{
i -= size;
i -= size & -(i & lsbit);
return i / 2;
}
```
請在 Linux 核心原始程式碼中,找出更多類似的實作手法。請善用 git log 檢索。
:::
### 針對 32 位元無號/有號整數,撰寫同樣 branchless 的實作
```c
int32_t max(int32_t a, int32_t b)
{
return a ^ ((a ^ b) & -(a < b));
}
```
與無號整數概念相同,當 `a < b` 時, `-(a < b)` 會等於 `-1` , 在有號整數觀點來看,根據二補數就可以得到 `0xFFFFFFFF`,也就此操作就是有號整數的運算
### Linux 核心中若干 branchless / branch-free 的實作
```shell
$ grep -rnw './' -e 'branch-free'
./lib/sort.c:169: * branch is unpredictable, it's done with a bit of clever branch-free
./arch/arc/lib/strchr-700.S:9: words branch-free. */
$ grep -rnw './' -e 'branchless'
./arch/x86/kvm/mmu.h:269: * It is important to keep this branchless.
./arch/arm64/lib/memcpy.S:48: from a single entry point. It uses unaligned accesses and branchless
./arch/arm64/lib/memcpy.S:99: /* Copy 0..3 bytes using a branchless sequence. */
./include/linux/if_vlan.h:482: * Copies VLAN information from @src to @dst (for branchless code)
```
## 測驗 3
考慮以下 64 位元 GCD (greatest common divisor, 最大公因數) 求值函式:
```c
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
if (!u || !v) return u | v;
while (v) {
uint64_t t = v;
v = u % v;
u = t;
}
return u;
}
```
改寫為以下等價實作:
```c
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
if (!u || !v) return u | v;
int shift;
// u 與 v 皆偶數就 shift
for (shift = 0; !((u | v) & 1); shift++) {
u /= 2, v /= 2;
}
// 如果 u 是偶數
while (!(u & 1))
u /= 2;
do {
// 如果 v 是偶數
while (!(v & 1))
v /= 2;
// 當 v > u 時, v = v - u
if (u < v) {
v -= u;
} else {
// 當 u >= v 時, v = u - v
uint64_t t = u - v;
u = v;
v = t;
}
} while (v);
return u << shift;
}
```
將求最大公因數想成疊積木問題即可解決,[Greatest Common Divisor: Binary GCD Algorithm](http://web.ntnu.edu.tw/~algo/Divisor.html)
* 如果 a b 有一個是零 ⇒ gcd(a, b) = a和b當中不是零的那個數
* 如果 a b 都是偶數 ⇒ gcd(a, b) = gcd(a/2, b/2) × 2
* 如果 a 是偶數 b 是奇數 ⇒ gcd(a, b) = gcd(a/2, b)
* 如果 a 是奇數 b 是偶數 ⇒ gcd(a, b) = gcd(a, b/2)
* 如果 a b 都是奇數 ⇒ gcd(a, b) = gcd(a和b比較小那個數, a和b的差)
:::info
延伸問題
- [X] 解釋上述程式運作原理;
- [X] 在 x86_64 上透過 __builtin_ctz 改寫 GCD,分析對效能的提升;
- [ ] Linux 核心中也內建 GCD (而且還不只一種實作),例如 lib/math/gcd.c,請解釋其實作手法和探討在核心內的應用場景。
:::
### __builtin_ctz 改寫 GCD
根據 [6.59 Other Built-in Functions Provided by GCC](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)
Built-in Function: **`int __builtin_ctz (unsigned int x)`**
* Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.
`u` 是 `uint64_t` 的,因此使用
Built-in Function: **`int __builtin_ctzll (unsigned long long)`**
* Similar to __builtin_ctz, except the argument type is unsigned long long.
```c
for (shift = 0; !((u | v) & 1); shift++) {
u /= 2, v /= 2;
}
```
改成
```c
shift = __builtin_ctzll(u | v);
u >> shift;
v >> shift;
```
```c
while (!(u & 1))
u /= 2;
```
改成
```c
u = u >> __builtin_ctzll(u);
```
最終程式碼如下
```c
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
if (!u || !v)
return u | v;
int shift = __builtin_ctzll(u | v);
u >> shift;
v >> shift;
u = u >> __builtin_ctzll(u);
do {
v = v >> __builtin_ctzll(v);
if (u < v) {
v -= u;
} else {
uint64_t t = u - v;
u = v;
v = t;
}
} while (v);
return u << shift;
}
```
### Perf 效能分析
[perf 原理和實務](https://hackmd.io/@sysprog/gnu-linux-dev/https%3A%2F%2Fhackmd.io%2Fs%2FB11109rdg) 提到藉由 perf,應用程式可以利用 PMU (Performance Monitoring Unit), tracepoint 和核心內部的特殊計數器 (counter) 來進行統計,另外還能同時分析運行中的核心程式碼,從而更全面了解應用程式中的效能瓶頸
#### 前置作業
```shell
$ cat /proc/sys/kernel/perf_event_paranoid
```
得到 `4` : 不允許任何量測,利用以下命令設定成 `-1` : 權限全開
```shell
$ echo '-1' | sudo tee -a /proc/sys/kernel/perf_event_paranoid
```
[How to insert text into a root-owned file using sudo?](https://unix.stackexchange.com/questions/4335/how-to-insert-text-into-a-root-owned-file-using-sudo) 的方法
#### 效能分析
```c
gcd64(rand64(), rand64());
```
原始版本效能
```shell
$ perf stat --repeat 100 ./gcd
Performance counter stats for './gcd' (100 runs):
0.34 msec task-clock # 0.612 CPUs utilized ( +- 8.06% )
0 context-switches # 29.312 /sec ( +-100.00% )
0 cpu-migrations # 0.000 /sec
47 page-faults # 136.651 K/sec ( +- 0.27% )
78,5779 cycles # 2.303 GHz ( +- 0.31% )
77,5276 instructions # 0.99 insn per cycle ( +- 0.16% )
15,2036 branches # 445.641 M/sec ( +- 0.16% )
4281 branch-misses # 2.82% of all branches ( +- 1.30% )
0.0005577 +- 0.0000435 seconds time elapsed ( +- 7.80% )
```
使用 `__builtin_ctzll()` 版本效能
```shell
$ perf stat --repeat 100 ./gcd2
Performance counter stats for './gcd2' (100 runs):
0.20 msec task-clock # 0.583 CPUs utilized ( +- 0.59% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
47 page-faults # 232.071 K/sec ( +- 0.26% )
77,9769 cycles # 3.890 GHz ( +- 0.43% )
77,3578 instructions # 0.99 insn per cycle ( +- 0.15% )
15,1437 branches # 755.462 M/sec ( +- 0.15% )
3746 branch-misses # 2.47% of all branches ( +- 1.00% )
0.00034358 +- 0.00000270 seconds time elapsed ( +- 0.79% )
```
可以清楚看見所花時間減少一定程度
### Linux 內建 GCD
[linux/lib/math/gcd.c](https://github.com/torvalds/linux/blob/master/lib/math/gcd.c)
## 測驗 4
在影像處理中,[bit array](https://en.wikipedia.org/wiki/Bit_array) (也稱 bitset) 廣泛使用,考慮以下程式碼:
```c
#include <stddef.h>
size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
for (size_t k = 0; k < bitmapsize; ++k) {
uint64_t bitset = bitmap[k];
size_t p = k * 64;
for (int i = 0; i < 64; i++) {
if ((bitset >> i) & 0x1)
out[pos++] = p + i;
}
}
return pos;
}
```
測試資料為
```c
uint64_t bitmap[2];
bitmap[0] = 0x0000000000000001;
bitmap[1] = 0x0000000000000001;
size_t count = naive(bitmap, 2, out);
```
count 為 1 的個數,out 為 1 的位置
```shell
$ ./main
count = 2
out = 0, 64
```
以上程式碼缺點為要全部掃過一次,若 bitmap 越鬆散 (即 `1` 越少),`improved` 的效益就更高
考慮 GNU extension 的 [__builtin_ctzll](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 的行為是回傳由低位往高位遇上連續多少個 `0` 才碰到 `1`。
> 範例: 當 a = 16
16 這個十進位數值的二進位表示法為 00000000 00000000 00000000 00010000
從低位元 (即右側) 往高位元,我們可發現 0→0→0→0→1,於是 ctz 就為 4,表示最低位元往高位元有 4 個 0
用以改寫的程式碼如下:
```c=
#include <stddef.h>
size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
bitset = bitmap[k];
while (bitset != 0) {
uint64_t t = EXP6;
int r = __builtin_ctzll(bitset);
out[pos++] = k * 64 + r;
bitset ^= t;
}
}
return pos;
}
```
其中第 9 行的作用是找出目前最低位元的 1,並紀錄到 t 變數。舉例來說,若目前 bitset 為 $000101_b$,那 `t` 就會變成 $000001_b$,接著就可以透過 XOR 把該位的 1 清掉,其他保留 (此為 XOR 的特性)。
$101000_b = x$
$100100_b = x-1$
$011011_b = ~(x-1) = -x$
因此 `EXP6 = bitset & -bitset`
:::info
延伸問題
- [X] 解釋上述程式運作原理,並舉出這樣的程式碼用在哪些真實案例中;
- [X] 設計實驗,檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進?應考慮到不同的 bitmap density;
- [ ] 思考進一步的改進空間;
- [ ] 閱讀 Data Structures in the Linux Kernel 並舉出 Linux 核心使用 bitmap 的案例;
:::
#### [Find First Set](https://en.wikipedia.org/wiki/Find_first_set) 應用
* Gosper's loop-detection algorithm
* find the period of a function of finite range using limited resources.
* The [binary GCD algorithm](https://en.wikipedia.org/wiki/Binary_GCD_algorithm)
* spends many cycles removing trailing zeros; this can be replaced by a count trailing zeros (ctz) followed by a shift. A similar loop appears in computations of the hailstone sequence.
* Priority Queue
* ffs is useful in implementing the "pop" or "pull highest priority element" operation efficiently.
* The Linux kernel real-time scheduler internally uses [`sched_find_first_bit()`](https://github.com/torvalds/linux/blob/master/include/asm-generic/bitops/sched.h)
### Perf 效能分析
```c
int main(){
srand(time(NULL));
size_t bitmapsize = 100000;
uint64_t *bitmap = malloc(sizeof(uint64_t) * bitmapsize);
uint32_t *out = malloc(sizeof(uint32_t) * 64 * bitmapsize);
for (int i = 0; i < bitmapsize; i++)
bitmap[i] = rand64();
for (int i = 0; i < bitmapsize * 64; i++)
out[i] = 0;
size_t count = naive(bitmap, bitmapsize, out);
// size_t count = imporved(bitmap, bitmapsize, out);
free(bitmap);
free(out);
return 0;
}
```
其中 `rand64()` 採用 [How to generate random 64-bit unsigned integer in C](https://stackoverflow.com/questions/33010010/how-to-generate-random-64-bit-unsigned-integer-in-c) 提到的
```c
#define IMAX_BITS(m) ((m)/((m)%255+1) / 255%255*8 + 7-86/((m)%255+12))
#define RAND_MAX_WIDTH IMAX_BITS(RAND_MAX)
_Static_assert((RAND_MAX & (RAND_MAX + 1u)) == 0, "RAND_MAX not a Mersenne number");
uint64_t rand64(void) {
uint64_t r = 0;
for (int i = 0; i < 64; i += RAND_MAX_WIDTH) {
r <<= RAND_MAX_WIDTH;
r ^= (unsigned) rand();
}
return r;
}
```
#### 效能分析
```shell
$ perf stat --repeat 10 -e cache-misses,cache-references,instructions,cycles ./bitmap
Performance counter stats for './bitmap' (10 runs):
92,8232 cache-misses # 78.178 % of all cache refs ( +- 0.32% )
118,7329 cache-references ( +- 0.11% )
2,3843,0969 instructions # 1.04 insn per cycle ( +- 0.00% )
2,2884,3892 cycles ( +- 0.25% )
0.058516 +- 0.000600 seconds time elapsed ( +- 1.03% )
```
```shell
$ perf stat --repeat 10 -e cache-misses,cache-references,instructions,cycles ./bitmap_r
Performance counter stats for './bitmap_r' (10 runs):
92,5642 cache-misses # 77.866 % of all cache refs ( +- 0.39% )
118,8755 cache-references ( +- 0.16% )
2,0279,3516 instructions # 1.49 insn per cycle ( +- 0.01% )
1,3593,8963 cycles ( +- 0.56% )
0.034934 +- 0.000375 seconds time elapsed ( +- 1.07% )
```
* **IPC of naive version = 1.04**
* **IPC of imporved version = 1.49**
### 使用 SIMD 改善
使用 [lemire](https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2018/03/07) 的方式建立 SIMD 版本 ( 但我還沒搞懂 )
閱讀
* [硬科技:淺談x86的SIMD指令擴張史(上):MMX到SSE](https://www.cool3c.com/article/152918)
* [硬科技:淺談x86的SIMD指令擴張史(中):SSE2到SSE4](https://www.cool3c.com/article/152919)
* [硬科技:淺談x86的SIMD指令擴張史(下):AVX到AVX-512](https://www.cool3c.com/article/152953)
```c
int bitmap_decode_avx2(uint64_t * array, size_t sizeinwords, uint32_t *out) {
uint32_t *initout = out;
__m256i baseVec = _mm256_set1_epi32(-1);
__m256i incVec = _mm256_set1_epi32(64);
__m256i add8 = _mm256_set1_epi32(8);
for (int i = 0; i < sizeinwords; ++i) {
uint64_t w = array[i];
if (w == 0) {
baseVec = _mm256_add_epi32(baseVec, incVec);
} else {
for (int k = 0; k < 4; ++k) {
uint8_t byteA = (uint8_t) w;
uint8_t byteB = (uint8_t)(w >> 8);
w >>= 16;
__m256i vecA = _mm256_load_si256((const __m256i *) vecDecodeTable[byteA]);
__m256i vecB = _mm256_load_si256((const __m256i *) vecDecodeTable[byteB]);
uint8_t advanceA = lengthTable[byteA];
uint8_t advanceB = lengthTable[byteB];
vecA = _mm256_add_epi32(baseVec, vecA);
baseVec = _mm256_add_epi32(baseVec, add8);
vecB = _mm256_add_epi32(baseVec, vecB);
baseVec = _mm256_add_epi32(baseVec, add8);
_mm256_storeu_si256((__m256i *) out, vecA);
out += advanceA;
_mm256_storeu_si256((__m256i *) out, vecB);
out += advanceB;
}
}
}
return out - initout;
}
```
### 考慮 density 的效能分析
以下為平均 CPI (Cycle Per Instruction)
| Density | Naive | Improved | SIMD |
|:--- |:--- |:--- |:--- |
| **0.03** | **93.367** | 10.921 | 12.085 |
| 0.12 | 37.259 | 4.591 | 3.257 |
| 0.25 | 28.108 | 3.124 | 1.621 |
| 0.5 | 18.264 | 2.426 | 0.809 |
| 0.9 | 3.915 | 2.062 | 0.563 |
可以看到越稀疏 Improved (`__builtin_ctzll`) 與 SIMD 版本比 Naive 版本越快 CPI 越低
## 測驗 5
[LeetCode 166. Fraction to Recurring Decimal ](https://leetcode.com/problems/fraction-to-recurring-decimal/)
> [5.frac.c](https://github.com/chengingx/linux2022-quiz2/blob/main/5.frac.c)
`struct rem_node` 與 `find` 可看出待會要在 `rem_node` 的 linked list 中找相對應的值
```c
struct rem_node {
int key;
int index;
struct list_head link;
};
static int find(struct list_head *heads, int size, int key)
{
struct rem_node *node;
int hash = key % size;
list_for_each_entry (node, &heads[hash], link) {
if (key == node->key)
return node->index;
}
return -1;
}
```
```c
char *fractionToDecimal(int numerator, int denominator)
{
int size = 1024;
char *result = malloc(size);
char *p = result;
// 分母為 0
if (denominator == 0) {
result[0] = '\0';
return result;
}
// 分子為 0
if (numerator == 0) {
result[0] = '0';
result[1] = '\0';
return result;
}
/* using long long type make sure there has no integer overflow */
long long n = numerator;
long long d = denominator;
/* deal with negtive cases */
if (n < 0)
n = -n;
if (d < 0)
d = -d;
bool sign = (float) numerator / denominator >= 0;
if (!sign)
*p++ = '-'; // 若sign 為負,result 的第一個 assign 負號
long long remainder = n % d; // 餘數
long long division = n / d; // 商
sprintf(p, "%ld", division > 0 ? (long) division : (long) -division);
if (remainder == 0) // 整除
return result;
p = result + strlen(result);
*p++ = '.'; // 沒有整除帶有小數
/* Using a map to record all of reminders and their position.
* if the reminder appeared before, which means the repeated loop begin,
*/
char *decimal = malloc(size);
memset(decimal, 0, size); // 清空,一堆 '\0'
char *q = decimal;
size = 1333; // ?
struct list_head *heads = malloc(size * sizeof(*heads));
for (int i = 0; i < size; i++)
INIT_LIST_HEAD(&heads[i]);
for (int i = 0; remainder; i++) {
// pos 為餘數的位置
int pos = find(heads, size, remainder);
if (pos >= 0) { // 大等於 0 表示開始循環小數
while (PPP > 0)
*p++ = *decimal++;
*p++ = '(';
while (*decimal != '\0')
*p++ = *decimal++;
*p++ = ')';
*p = '\0';
return result;
}
struct rem_node *node = malloc(sizeof(*node));
node->key = remainder;
node->index = i;
MMM(&node->link, EEE);
*q++ = (remainder * 10) / d + '0';
remainder = (remainder * 10) % d;
}
strcpy(p, decimal);
return result;
}
```
`PPP = pos--`
`MMM = list_add`
`EEE = &heads[remainder % size]`
一開始看不懂以下程式碼,查 ASCII 碼
```c
*q++ = (remainder * 10) / d + '0';
```
``'0'`` 的 ASCII 碼為 `0x30`,以此類推 `'9'` 的 ASCII 碼為 `0x39`, 假設 `(remainder * 10) / d = 2`, `2 + '0' = 0x32`,cast 成 `char` 就會顯示 `2`
:::info
延伸問題
- [X] 解釋上述程式碼運作原理,指出其中不足,並予以改進
> 例如,判斷負號只要寫作 `bool isNegative = numerator < 0 ^ denominator < 0`;
搭配研讀 [The simple math behind decimal-binary conversion algorithms](https://indepth.dev/posts/1019/the-simple-math-behind-decimal-binary-conversion-algorithms)
- [ ] 在 Linux 核心原始程式碼的 mm/ 目錄 (即記憶體管理) 中,找到特定整數比值 (即 fraction) 和 bitwise 操作的程式碼案例,予以探討和解說其應用場景
:::
### 可改進之處
#### 靜態動態程式碼檢查
首先使用 `cppcheck` 出現以下字樣
```c
$ cppcheck ./frac2
Checking frac2 ...
frac2:1:25: error: The code contains unhandled character(s) (character code=208). Neither unicode nor extended ascii is supported. [syntaxError]
ELF>�@`@8
@$#@@@h������� ���-�=�=���-�=�=����DDP�td
LLQ�tdR�td�-�=�=pp/lib64/ld-linux-x86-64.so.2GNU6�l����}6��g"�2�N�GNU
�
�e�mTq
B� � 4%"libc.so.6strcpyputscallocmalloc__cxa_finalize__sprintf_chk__libc_start_mainfreeGLIBC_2.3.4GLIBC_2.2.5_ITM_deregisterTMCloneTable__gmon_start___ITM_registerTMCloneTableti Yu�i e��@�?�?�?�? �?
�?�?�?�?��?
```
參考 [Fails to suppress the syntaxError](https://sourceforge.net/p/cppcheck/discussion/general/thread/d4463c60/) 方法解決
```shell
$ cppcheck --suppress=syntaxError ./frac2
Checking frac2 ...
```
接著用 valgrind 測試出是有 definitely lost 的問題
```shell
$ valgrind -q --leak-check=full ./frac2
==148760== 1,024 bytes in 1 blocks are definitely lost in loss record 1 of 1
==148760== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==148760== by 0x109347: fractionToDecimal (frac2.c:72)
==148760== by 0x1090AF: main (frac2.c:127)
==148760==
```
接著查看是哪裡出錯
```c
char *decimal = malloc(size);
```
倘若直接加 `free(decimal)` 是不對的,`decimal` 早已不在原本的位置上
```c
if (pos >= 0) {
while (pos-- > 0)
*p++ = *decimal++;
*p++ = '(';
while (*decimal != '\0')
*p++ = *decimal++;
*p++ = ')';
*p = '\0';
return result;
}
```
直接想到的 naive 方法就是加
```c
char *decimal_origin = decimal;
```
然後在最後 `free(decimal_origin)`,這樣 valgrind 測試就會通過
#### 不用除法判斷負號
```c
bool sign = (float) numerator / denominator >= 0;
if (!sign)
*p++ = '-';
```
可以改進成
```c
bool isNegative = numerator < 0 ^ denominator < 0
if (isNegative)
*p++ = '-';
```
#### Branch-free 版本的 abs
```c
/* deal with negtive cases */
if (n < 0)
n = -n;
if (d < 0)
d = -d;
```
可以改進成 branchless 版本,講解參考 [你所不知道的 C 語言: bitwise 操作](https://hackmd.io/@sysprog/c-bitwise)
```c
n= ((n >> 63) ^ n) - (n >> 63);
d = ((d >> 63) ^ d) - (d >> 63);
```
#### strcpy 可能會產生的問題
使用 `man strcpy` 查閱相關資料,在描述中提到
* The strcpy() function copies the string pointed to by src, including the terminating null byte ('\0'), to the buffer pointed to by dest.
* The strings may not overlap, and ==the destination string dest must be large enough== to receive the copy. Beware of ==buffer overruns==! (See BUGS.)
使用 `strcpy` 需要小心的就是 buffer overruns,接著查看 `BUGS`
* If the destination string of a strcpy() is not large enough, then anything might happen.
* Overflowing fixed-length string buffers is a favorite cracker technique for taking complete control of the machine.
也就是說要小心 `dest` 的 `size` 是否會足夠,即時有辦法證明其足夠,仍然要注意時間推移後仍有資訊安全的危險
使用 `strncpy` 則是要注意要記得加 `\0`
* The strncpy() function is similar, except that at most n bytes of src are copied. Warning: If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.
使用 `strncpy` 可以採取以下寫法
```c
if (buflen > 0) {
strncpy(buf, str, buflen - 1);
buf[buflen - 1]= '\0';
}
```
> 那 buflen 要加在哪?
#### 增加 free_list 函式
```c
void free_list(struct list_head *heads, size_t size)
{
for (int i = 0; i < size; i++)
{
struct rem_node *n, *s;
list_for_each_entry_safe(n, s, &heads[i], link)
{
free(n);
}
}
free(heads);
}
```
#### 使用 Goto
原先以下這段程式碼沒有釋放記憶體,但又想要寫個好看的方式
```c
if (pos >= 0) {
while (pos-- > 0)
*p++ = *decimal++;
*p++ = '(';
while (*decimal != '\0')
*p++ = *decimal++;
*p++ = ')';
*p = '\0';
return result;
}
```
改成以下
```c
if (pos >= 0) {
while (pos-- > 0)
*p++ = *decimal++;
*p++ = '(';
while (*decimal != '\0')
*p++ = *decimal++;
*p++ = ')';
*p = '\0';
goto finish;
}
```
```c
finish:
free_list(heads, size);
free(decimal_origin);
```
## 測驗 6
[`__alignof__`](https://gcc.gnu.org/onlinedocs/gcc/Alignment.html) 是 GNU extension,以下是其可能的實作方式:
```c
/*
* ALIGNOF - get the alignment of a type
* @t: the type to test
*
* This returns a safe alignment for the given type.
*/
#define ALIGNOF(t) \
((char *)(&((struct { char c; t _h; } *)0)->M) - (char *)X)
```
`M = _h`
`X = 0`
利用 `struct` 為了記憶體對齊會有 padding bytes 的行為得到此 `type` 在計算機上真正的對齊,[C 語言規格書 6.2.6.1](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf) 寫到一些關於 `struct`, `union` 有趣的觀念
> 為什麼要 ` - (char *)0`,刪掉會怎樣?
使用此巨集要注意的是,[`__alignof__`](https://gcc.gnu.org/onlinedocs/gcc/Alignment.html) 提到,如果 `__alignof__` 的 operand 是 `lvalue` 而不是 `type`,那麼它的值會是其類型所需的對齊方式,考慮到由 `attribute` 對齊指定的任何最小對齊方式,閱讀 [Common Variable Attributes](https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#Common-Variable-Attributes)
```c
struct foo { int x; char y; } foo1;
```
`__alignof__ (foo1.y) ` 會是 `1`,儘管真正的對齊會是 `2` 或是 `4`,[C 語言規格書閱讀](https://hackmd.io/@chenging/c_basic) 整理 `lvalue`, `object`, `type` 等基本定義
:::info
延伸問題
- [X] 解釋上述程式碼運作原理
- [ ] 在 Linux 核心原始程式碼中找出 [`__alignof__`](https://gcc.gnu.org/onlinedocs/gcc/Alignment.html) 的使用案例 2 則,並針對其場景進行解說
- [ ] 在 Linux 核心源使程式碼找出 ALIGN, ALIGN_DOWN, ALIGN_UP 等巨集,探討其實作機制和用途,並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻,儘量共用相同的巨集
:::
### `__alignof__` 使用案例
直接查詢 `__alignof__` 使用案例會看見許多 `__attribute__`,因此先了解下 `__attribute__` 的用處
#### [`__attribute__`](https://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html) 簡介
specify special properties of
* variables
* function parameters
* structure, union
舉個例子,將 foo 指定成 8 bytes 對齊
```c
// create a double-word aligned int pair
struct foo { int x[2] __attribute__ ((aligned (8))); };
```
另外 GCC 提供 `__BIGGEST_ALIGNMENT__` (largest alignment)
```c
short array[3] __attribute__ ((aligned (__BIGGEST_ALIGNMENT__)));
```
Doing this can often make copy operations more efficient, because the compiler can use whatever instructions copy the biggest chunks of memory when performing copies to or from the variables or fields that you have aligned this way.
### `ARCH_SLAB_MINALIGN`
[linux/arch/sparc/include/asm/cache.h](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/arch/sparc/include/asm/cache.h) 中定義了以下這段程式碼
```c
#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
```
> 希望在未來的記憶體管理認真好好學,把這段空白填上
## 測驗 7
考慮貌似簡單卻蘊含實作深度的 [FizzBuzz](https://en.wikipedia.org/wiki/Fizz_buzz) 題目:
* 從 1 數到 n,如果是 3的倍數,印出 “Fizz”
* 如果是 5 的倍數,印出 “Buzz”
* 如果是 15 的倍數,印出 “FizzBuzz”
* 如果都不是,就印出數字本身
直覺的實作程式碼如下: (naive.c)
```c
#include <stdio.h>
int main() {
for (unsigned int i = 1; i < 100; i++) {
if (i % 3 == 0) printf("Fizz");
if (i % 5 == 0) printf("Buzz");
if (i % 15 == 0) printf("FizzBuzz");
if ((i % 3) && (i % 5)) printf("%u", i);
printf("\n");
}
return 0;
}
```
觀察 printf 的(格式)字串,可分類為以下三種:
1. 整數格式字串 "%d" : 長度為 2 B
2. “Fizz” 或 “Buzz” : 長度為 4 B
3. “FizzBuzz” : 長度為 8 B
考慮下方程式碼:
```c
char fmt[M9];
strncpy(fmt, &"FizzBuzz%u"[start], length);
fmt[length] = '\0';
printf(fmt, i);
printf("\n");
```
我們若能精準從給定輸入 i 的規律去控制 start 及 length,即可符合 FizzBuzz 題意:
```c
string literal: "fizzbuzz%u"
offset: 0 4 8
```
以下是利用 bitwise 和上述技巧實作的 FizzBuzz 程式碼: (bitwise.c)
```c
static inline bool is_divisible(uint32_t n, uint64_t M)
{
return n * M <= M - 1; // n * M < M
}
// c = ceil((1 << 64) / d)
static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;
int main(int argc, char **argv)
{
for (size_t i = 1; i <= 100; i++) {
uint8_t div3 = is_divisible(i, M3);
uint8_t div5 = is_divisible(i, M5);
unsigned int length = (2 << KK1) << KK2;
char fmt[9];
strncpy(fmt, &"FizzBuzz%u"[(8 >> div5) >> (KK3)], length);
fmt[length] = '\0';
printf(fmt, i);
printf("\n");
}
return 0;
}
```
`KK1 = div3`
`KK2 = div5`
`KK3 = div3 << 2`
> 其中 is_divisible 函式技巧來自 [Faster remainders when the divisor is a constant: beating compilers and libdivide](https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/),甚至 gcc-9 還內建了 [FizzBuzz optimization](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853) (Bug 82853 - Optimize x % 3 == 0 without modulo)
根據 [Faster remainders when the divisor is a constant: beating compilers and libdivide](https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/) 的 [重點整理](https://hackmd.io/@chenging/SkgnXHjW9),加上 uint32_t * uint64_t 會是 uint32_t 剛好是 fraction part 接著若小於 `c` 則代表可以被整除
### 運算成本測量
對於處理器來說,每個運算所花的成本是不同的,比如 add, sub 就低於 mul,而這些運算的 cost 又低於 div 。依據[〈Infographics: Operation Costs in CPU Clock Cycles〉](http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/),可發現整數除法的成本幾乎是整數加法的 50 倍

```c
void count35(uint32_t N, uint32_t *count3, uint32_t *count5) {
for (uint32_t i = 0; i < N; i++) {
if ((i % 3) == 0)
*count3 += 1;
if ((i % 5) == 0)
*count5 += 1;
}
}
```
```c
// M = ceil( (1<<64) / d ), d > 0
static inline uint64_t computeM_u32(uint32_t d) {
return UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;
}
// given precomputed M, checks whether n % d == 0
static inline bool is_divisible(uint32_t n, uint64_t M) {
return n * M <= M - 1;
}
void fastcount35(uint32_t N, uint32_t *count3, uint32_t *count5) {
uint64_t M3 = computeM_u32(3);
uint64_t M5 = computeM_u32(5);
for (uint32_t i = 0; i < N; i++) {
if (is_divisible(i, M3))
*count3 += 1;
if (is_divisible(i, M5))
*count5 += 1;
}
}
```
```shell
$ ./fizzbuzz
count35(N, &count3, &count5) : 7.044 cycles per input word (best) 7.123 cycles per input word (avg)
1666666700 1000000000
fastcount35(N, &count3, &count5) : 11.434 cycles per input word (best) 11.498 cycles per input word (avg)
1666666700 1000000000
```
```c
unsigned int mod3(unsigned int a) { return 0==(a%3); }
int main(){
unsigned int a = mod3(22);
return 0;
}
```
test.s : 沒有最佳化
test_o.s : 有最佳化
```c
$ colordiff -u test.s test_o.s
mod3:
.LFB0:
- pushq %rbp
- .cfi_def_cfa_offset 16
- .cfi_offset 6, -16
- movq %rsp, %rbp
- .cfi_def_cfa_register 6
- movl %edi, -4(%rbp)
- movl -4(%rbp), %ecx
+ movl %edi, %eax
movl $-1431655765, %edx
- movl %ecx, %eax
mull %edx
- movl %edx, %eax
- shrl %eax
- movl %eax, %edx
- addl %edx, %edx
- addl %eax, %edx
- movl %ecx, %eax
- subl %edx, %eax
- testl %eax, %eax
+ shrl %edx
+ leal (%rdx,%rdx,2), %eax
+ cmpl %eax, %edi
sete %al
movzbl %al, %eax
- popq %rbp
ret
```
:::info
延伸問題:
- [x] 解釋上述程式運作原理並評估 naive.c 和 bitwise.c 效能落差
- [ ] 避免 stream I/O 帶來的影響,可將 printf 更換為 sprintf
- [ ] 分析 [Faster remainders when the divisor is a constant: beating compilers and libdivide](https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/) 一文的想法 (可參照同一篇網誌下方的評論),並設計另一種 bitmask,如「可被 3 整除則末位設為 1」「可被 5 整除則倒數第二位設定為 1」,然後再改寫 bitwise.c 程式碼,試圖運用更少的指令來實作出 branchless;
* 參照 [fastmod: A header file for fast 32-bit division remainders on 64-bit hardware](https://github.com/lemire/fastmod)
- [ ] 研讀 [The Fastest FizzBuzz Implementation](https://tech.marksblogg.com/fastest-fizz-buzz.html) 及其 [Hacker News](https://news.ycombinator.com/item?id=29413656) 討論,提出 throughput (吞吐量) 更高的 Fizzbuzz 實作
- [ ] 解析 Linux 核心原始程式碼 [kernel/time/timekeeping.c](https://github.com/torvalds/linux/blob/master/kernel/time/timekeeping.c) 裡頭涉及到除法運算的機制,探討其快速除法的實作 (注意: 你可能要對照研讀 kernel/time/ 目錄的標頭檔和程式碼)
> 過程中,你可能會發現可貢獻到 Linux 核心的空間,請充分討論
:::
## 參考資料
* [Greatest Common Divisor: Binary GCD Algorithm](http://web.ntnu.edu.tw/~algo/Divisor.html)
* [Introduction to the leal instruction](https://www.cs.swarthmore.edu/~newhall/cs31/resources/addrleal.php)
* [x64 Cheat Sheet](https://cs.brown.edu/courses/cs033/docs/guides/x64_cheatsheet.pdf)
* [How to insert text into a root-owned file using sudo?](https://unix.stackexchange.com/questions/4335/how-to-insert-text-into-a-root-owned-file-using-sudo)
* [perf 原理和實務](https://hackmd.io/@sysprog/gnu-linux-dev/https%3A%2F%2Fhackmd.io%2Fs%2FB11109rdg)
* [How to generate random 64-bit unsigned integer in C](https://stackoverflow.com/questions/33010010/how-to-generate-random-64-bit-unsigned-integer-in-c)
* [你所不知道的 C 語言: bitwise 操作](https://hackmd.io/@sysprog/c-bitwise)