2022q1 Homework2 (quiz2)

# 2022q1 Homework2 (quiz2) contributed by <shawn5141> ## [測驗1](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-1) ### 1. 解釋上述程式碼運作的原理 ```cpp= #include <stdint.h> uint32_t average(uint32_t low, uint32_t high) { return low + (high - low) / 2; } ``` 可以改寫成 ```cpp= #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + (EXP1); } ``` 舉例的話就可以知道right shift 之後就會發現有LSB消失後造成carrier也一併消失的可能。所以需要把它加回去 ``` a = 9, b= 13 平均為(9+13)/2 = 11 => 0b1011 a = 9 => 0b1001 >> 1 => 0b0100 b = 13 => 0b1101 >> 1 => 0b0110 0b0100 | 0b0110 => 0b1010 => 10 ``` 使用`a & b & 1` 可以留下 `a & b`皆為1的部分，再透過`&1`取最後一位。所以只要當兩者最後一位都為1的情況下（代表會有進位產生）需要把此進位加回去。 #### EX1 = `a & b & 1` ```cpp= #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + (a & b & 1); } ``` 接著我們改寫為以下函式（限定用 OR, AND, XOR 這三個運算子） ```cpp= uint32_t average(uint32_t a, uint32_t b) { return (EXP2) + ((EXP3) >> 1); } ``` 這次我們使用 `a=11 (0b1011)`和`b=13 (0b1101)`的例子，兩者的平均為 12 (0b1100)。因為題目提示說`EXP2`和`EXP3`接為a,b 或AND, OR, XOR operator 的組合。所以透過排列組合發現 ```cpp= a & b = 0b1001 (9) a | b = 0b1111 (15) a ^ b = 0b0110 (6) ``` 接著把此三數都右移1 ```cpp= a & b >>1 = 0b100 (4) a | b >>1 = 0b111 (7) a ^ b >>1 = 0b011 (3) ``` 可以發現在 `(a & b) + ((a ^ b) >> 1)`的情況下會等於`0b1100`也就是我們在找的平均。從此篇[文章中](https://stackoverflow.com/questions/2385389/xor-using-mathematical-operators#:~:text=TL%3BDR,of%20a%20%26%20b.)(amazing!)，得知`a XOR b` in binary為`a + b - 2ab` 而 `a AND b` in binary 是`ab`。 ``` (a AND b) + (a XOR b) >> 1 => (ab) + (a+b-2ab)/2 => (a+b)/2 ``` #### EX2 = `a & b` #### EX3 = `(a ^ b) >> 1` ### 比較上述實作在編譯器最佳化開啟的狀況，對應的組合語言輸出，並嘗試解讀 (可研讀 CS:APP 第 3 章) ```cpp= uint32_t avg(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + (a & b & 1); } ``` ```cpp= 0x64b <avg> mov %edi,%eax // $edi (a) value move to $eax │ 0x64d <avg+2> shr %eax // right shift %eax │ 0x64f <avg+4> mov %esi,%edx // $esi (b) value move to $edx │ 0x651 <avg+6> shr %edx // right shift %edx │ 0x653 <avg+8> add %edx,%eax // add %edx and $eax => (a >> 1) + (b >> 1) │ 0x655 <avg+10> and %esi,%edi // %esi AND $edi => a & b and store in $edi │ 0x657 <avg+12> and $0x1,%edi // $0x1 AND $edi => 0x1 & (a & b) store in %edi │ 0x65a <avg+15> add %edi,%eax // add $edi with $eax │ 0x65c <avg+17> retq ``` - SHR 是[right shift](https://www.csie.ntu.edu.tw/~acpang/course/asm_2004/slides/chapt_07_PartISolve.pdf) ```cpp= uint32_t avg2(uint32_t a, uint32_t b) { return (a&b) + ((a^b)>>1); } ``` ```cpp= 0x65d <avg2> mov %edi,%edx // move $edi (a) to $edx │ 0x65f <avg2+2> xor %esi,%edx // $esi(b) XOR $edx => (a ^ b) and store in $edx │ 0x661 <avg2+4> shr %edx // right shift $edx => (a ^ b) >> 1 │ 0x663 <avg2+6> and %esi,%edi // $esi AND $edi => (a & b) │ 0x665 <avg2+8> lea (%rdx,%rdi,1),%eax // load address (%rdx, %rdi, 1) to $eax │ 0x668 <avg2+11> retq ``` - 可以看出使用較少的instruction - 不是很確定為什麼要使用lea，不直接使用mov就好 ### 2. 研讀 Linux 核心原始程式碼 include/linux/average.h，探討其 Exponentially weighted moving average (EWMA) 實作，以及在 Linux 核心的應用 - 先回答第二題，EWMA有使用在[/include/linux/sched.h](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#:~:text=enqueued%3B%0A%09unsigned%20int-,ewma%3B,-%23define%20UTIL_EST_WEIGHT_SHIFT%09%092) - 再來看程式碼 [include/linux/average.h](https://elixir.bootlin.com/linux/v4.1/source/include/linux/average.h) ```cpp= #ifndef _LINUX_AVERAGE_H #define _LINUX_AVERAGE_H /* Exponentially weighted moving average (EWMA) */ /* For more documentation see lib/average.c */ struct ewma { unsigned long internal; unsigned long factor; unsigned long weight; }; extern void ewma_init(struct ewma *avg, unsigned long factor, unsigned long weight); extern struct ewma *ewma_add(struct ewma *avg, unsigned long val); /** * ewma_read() - Get average value * @avg: Average structure * * Returns the average value held in @avg. */ static inline unsigned long ewma_read(const struct ewma *avg) { return avg->internal >> avg->factor; } #endif /* _LINUX_AVERAGE_H */ ``` [average.c](https://docs.huihoo.com/doxygen/linux/kernel/3.7/average_8c_source.html) ```cpp= /* * lib/average.c * * This source code is licensed under the GNU General Public License, * Version 2. See the file COPYING for more details. */ #include <linux/export.h> #include <linux/average.h> #include <linux/kernel.h> #include <linux/bug.h> #include <linux/log2.h> void ewma_init(struct ewma *avg, unsigned long factor, unsigned long weight) { WARN_ON(!is_power_of_2(weight) || !is_power_of_2(factor)); avg->weight = ilog2(weight); avg->factor = ilog2(factor); avg->internal = 0; } EXPORT_SYMBOL(ewma_init); struct ewma *ewma_add(struct ewma *avg, unsigned long val) { avg->internal = avg->internal ? (((avg->internal << avg->weight) - avg->internal) + (val << avg->factor)) >> avg->weight : (val << avg->factor); return avg; } EXPORT_SYMBOL(ewma_add); ``` :::info Math definition of [EWMA](https://towardsdatascience.com/time-series-from-scratch-exponentially-weighted-moving-averages-ewma-theory-and-implementation-607661d574fe) - EWMA applies weights to the values of a time series. More weight is applied to more recent data points, making them more relevant for future forecasts. That’s the most significant improvement over simple moving averages. ![](https://i.imgur.com/39L9J54.png) ![](https://i.imgur.com/C8Ryzdo.png) [image source](https://towardsdatascience.com/time-series-from-scratch-exponentially-weighted-moving-averages-ewma-theory-and-implementation-607661d574fe) ::: #### [ilog2](https://docs.huihoo.com/doxygen/linux/kernel/3.7/log2_8h.html#:~:text=%23define%20ilog2,) ilog2: [Implement a general integer log2 facility in the kernel](https://lwn.net/Articles/203596/) - log of base 2 of 32-bit or a 64-bit unsigned value - parameter constant-capable log of base 2 calculation - this can be used to initialise global variables from constant data, hence the massive ternary operator construction selects the appropriately-sized optimised version depending on sizeof(n) ```cpp= /** * ilog2 - log base 2 of 32-bit or a 64-bit unsigned value * @n: parameter * * constant-capable log of base 2 calculation * - this can be used to initialise global variables from constant data, hence * the massive ternary operator construction * * selects the appropriately-sized optimised version depending on sizeof(n) */ #define ilog2(n) \ ( \ __builtin_constant_p(n) ? \ ((n) < 2 ? 0 : \ 63 - __builtin_clzll(n)) : \ (sizeof(n) <= 4) ? \ __ilog2_u32(n) : \ __ilog2_u64(n) \ ) ``` - 這邊出現了`ternary oeprator`的vartion. [c-nested-ternary-operator](https://www.geeksforgeeks.org/c-nested-ternary-operator/)。例子如下： ```cpp= a ? b: c ? d : e ? f : g ? h : i a ? b : c ? d : e ? f : g ? h : i if a then b else if c then d else if e then f else if g then h else i ``` - 所以上面ilog2就是在說： ```cpp= if n is const: if (n) < 2: return 0 else: return 63 - __builtin_clzll(n) \\ 63 - count of leading 0 => 剩下的位數 else: \\ n is not const if sizeof(n) <= 4: return __ilog2_u32(n) else: return __ilog2_u64(n) ``` :::info [__builtin_constant_p](https://www.daemon-systems.org/man/__builtin_constant_p.3.html) The __builtin_constant_p() is a GNU extension for determining whether a value is known to be constant at compile time. The function is closely related to the concept of "constant folding" used by modern optimizing compilers. ::: - What is `__builtin_clzll`, similar to `__builtin_clz`, except the argument type is unsigned long long. Then what is [__builtin_clz](https://stackoverflow.com/questions/9353973/implementation-of-builtin-clz) :::info CLZ (count leading zero) and BSR (bit-scan reverse) are related but different. CLZ equals (type bit width less one) - BSR. CTZ (count trailing zero), also know as FFS (find first set) is the same as BSF (bit-scan forward.) ::: - 同樣是在ilog2.h中有定義`__ilog2_u32` 和 `__ilog2_u64`。 ```cpp= /* SPDX-License-Identifier: GPL-2.0-or-later */ /* Integer base 2 logarithm calculation * * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved. * Written by David Howells (dhowells@redhat.com) */ #ifndef _LINUX_LOG2_H #define _LINUX_LOG2_H #include <linux/types.h> #include <linux/bitops.h> /* * non-constant log of base 2 calculators * - the arch may override these in asm/bitops.h if they can be implemented * more efficiently than using fls() and fls64() * - the arch is not required to handle n==0 if implementing the fallback */ #ifndef CONFIG_ARCH_HAS_ILOG2_U32 static inline __attribute__((const)) int __ilog2_u32(u32 n) { return fls(n) - 1; } #endif #ifndef CONFIG_ARCH_HAS_ILOG2_U64 static inline __attribute__((const)) int __ilog2_u64(u64 n) { return fls64(n) - 1; } #endif ``` 他們分別是呼叫不同的function `fls` or `fls64`。這兩個function 都在 [fls64.h](include/asm-generic/bitops/fls64.h)中。簡單來說，就是要找第一個set bit。[ref](https://hackmd.io/@cccccs100203/linux2021-quiz2#fls-ampamp-fls64) ```cpp= static inline int fls64(unsigned long word) { return 64 - __kernel_ctlz(word); } static inline int fls(unsigned int x) { return fls64(x); } static inline unsigned long __fls(unsigned long x) { return fls64(x) - 1; } static inline int fls(unsigned int x) { return fls64(x); } ``` ## [測驗二](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-2) ### 1. 解釋上述程式碼運作的原理 ```cpp= #include <stdint.h> uint32_t max(uint32_t a, uint32_t b) { return a ^ ((EXP4) & -(EXP5)); } ``` 參考[XOR-Trick](https://florian.github.io/xor-trick/)。 - 假設x是`((EXP4) & -(EXP5))`，`a^x` 要得到 `a`，`x`必須是0。同理，要是想要得到b的話，`x`必須是 `a^b^a`，這樣才會剩下`b`。所以當 `a>b` 時，取0。當`a<b` 就取 `b^a`。 - 如果在`a>b`的情況下，要讓`((EXP4) & -(EXP5))`變成0。需要讓EXP5是`a<b`，這樣`-(a<b)`就會變成0，進而導致`((EXP4) & -(EXP5))`為零。 - 相對的在`a<b`的情況，考慮`EXP5=(a<b)`，可以得到`((EXP4) & -(1))`，因為`-1` 是 `0xFFFFFFFF`，所以EXP4只要是`b^a`就可以讓最前面的`a`消掉。 ```cpp= #include <stdint.h> uint32_t max(uint32_t a, uint32_t b) { return a ^ ((a^b) & -(a < b)); } ``` ### 2. 針對 32 位元無號/有號整數，撰寫同樣 branchless 的實作 ```cpp= #include <stdint.h> uint32_t max(int32_t a, int32_t b) { return something; } ``` ### 3. Linux 核心也有若干 branchless / branch-free 的實作，例如 lib/sort.c。請在 Linux 核心原始程式碼中，找出更多類似的實作手法。請善用 git log 檢索。 - One [result](https://github.com/torvalds/linux/commit/747f49ba67b8895a5831ab539de551b916f3738c) on `branchless` (`git log --grep="branchless"`) ```cpp= commit 747f49ba67b8895a5831ab539de551b916f3738c Author: Steven Fuerst <svfuerst@gmail.com> Date: Wed Aug 15 15:07:15 2012 -0700 Replace int2float() with an optimized version. We use __fls() to find the most significant bit. Using that, the loop can be avoided. A second trick is to use the behaviour of the rotate instructions to expand the range of the unsigned int to float conversion to the full 32 bits in a branchless way. The routine is now exact up to 2^24. Above that, we truncate which is equivalent to rounding towards zero. Signed-off-by: Steven Fuerst <svfuerst@gmail.com> ``` drivers/gpu/drm/radeon/r600_blit.c ```cpp= /* 23 bits of float fractional data */ #define I2F_FRAC_BITS 23 #define I2F_MASK ((1 << I2F_FRAC_BITS) - 1) /* * Converts unsigned integer into 32-bit IEEE floating point representation. * Will be exact from 0 to 2^24. Above that, we round towards zero * as the fractional bits will not fit in a float. (It would be better to * round towards even as the fpu does, but that is slower.) */ uint32_t int2float(uint32_t x) { uint32_t msb, exponent, fraction; /* Zero is special */ if (!x) return 0; /* Get location of the most significant bit */ msb = __fls(x); /* * Use a rotate instead of a shift because that works both leftwards * and rightwards due to the mod(32) behaviour. This means we don't * need to check to see if we are above 2^24 or not. */ fraction = ror32(x, (msb - I2F_FRAC_BITS) & 0x1f) & I2F_MASK; exponent = (127 + msb) << I2F_FRAC_BITS; return fraction + exponent; } ``` ## [測驗三](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-3) 改寫此程式 ```cpp= #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; while (v) { uint64_t t = v; v = u % v; u = t; } return u; } ``` 成為以下程式 ```cpp= #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } while (!(u & 1)) u /= 2; do { while (!(v & 1)) v /= 2; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (COND); return RET; } ``` ### 1. 解釋上述程式運作原理; ![](https://i.imgur.com/eGNI2uX.png) - #6-#7 `（(u | v) & 1)` 去找 u 或 v 的最後一位如果有1的話，代表其中必定有一個為奇數，所以就停下來（取not） - #9-#10 把二除淨，確保 u 是奇數 - #11-#21 - #12-#13 把 v 的二除淨，確定v是奇數 - #14-#19 如果`u < v`的話，就把 `v-=u`，不然就讓`t`等於`u-v`，然後把v assign給 u，餘數t assign給 v (交換後，把小的變成餘數t) - 按照上面所給的演算法，COND應該是當u 或者是 v其中一個為零的時候。所以應該是 `COND= ` - #22 而RET則是要回傳剩下不為 0 的數。 `RET = d << shift` ### 2. 在 x86_64 上透過 __builtin_ctz 改寫 GCD，分析對效能的提升; ### 3. Linux 核心中也內建 GCD (而且還不只一種實作)，例如 lib/math/gcd.c，請解釋其實作手法和探討在核心內的應用場景。 ```cpp= // SPDX-License-Identifier: GPL-2.0-only #include <linux/kernel.h> #include <linux/gcd.h> #include <linux/export.h> /* * This implements the binary GCD algorithm. (Often attributed to Stein, * but as Knuth has noted, appears in a first-century Chinese math text.) * * This is faster than the division-based algorithm even on x86, which * has decent hardware division. */ #if !defined(CONFIG_CPU_NO_EFFICIENT_FFS) /* If __ffs is available, the even/odd algorithm benchmarks slower. */ /** * gcd - calculate and return the greatest common divisor of 2 unsigned longs * @a: first value * @b: second value */ unsigned long gcd(unsigned long a, unsigned long b) { unsigned long r = a | b; if (!a || !b) return r; // 當 a or b其中一個為0 時，可以回傳 r （代表另一個數是兩者的gcd） b >>= __ffs(b); if (b == 1) return r & -r; //當 b 成為1時，回傳 r & (-r)，會找出r在binary form中的first bits。因為在b right shift __ffs(b)之後變成1代表b原本是2的倍數。而使用r & -r可以找出兩者的最小的bits for (;;) { a >>= __ffs(a); if (a == 1) return r & -r; if (a == b) return a << __ffs(r); if (a < b) swap(a, b); a -= b; } } #else /* If normalization is done by loops, the even/odd algorithm is a win. */ unsigned long gcd(unsigned long a, unsigned long b) { unsigned long r = a | b; if (!a || !b) return r; /* Isolate lsbit of r */ r &= -r; while (!(b & r)) b >>= 1; if (b == r) return r; for (;;) { while (!(a & r)) a >>= 1; if (a == r) return r; if (a == b) return a; if (a < b) swap(a, b); a -= b; a >>= 1; if (a & r) a += b; a >>= 1; } } #endif EXPORT_SYMBOL_GPL(gcd); ``` - [__ffs](https://github.com/spotify/linux/blob/master/include/asm-generic/bitops/__ffs.h) is used to find the first set (which is the least significant set in word). In first gcd function, `__ffs` 被大量的使用。 - 使用例子來解釋。設a = 6 (0b0110)和 b=3 (0b0011)並使用第一個gcd function。 - #25 r= a | b = 0b0111 => r = 7 - !a || !b 不為`True` 因為 !(6) 和 !(3) is 0。 - #30 `b >>= __ffs(b)`。__ffs(b)是0。所以3不用right shift。 - #35 `a >> __ffs(a)`會讓a變成3。因為a不等於1，所以不會trigger #36，反而是會讓conditino in #38成立，也就是 a==b。這時候會回傳`a << __ffs(r)`，也就是 `a <<__ffs(7)` (__ffs(7) = 0)。所以答案回傳3。 - #41-#43 在a!=b的情況下，則會把大的減去小的，然後繼續下個循環。 :::success [r&(-r)](https://stackoverflow.com/questions/15395317/meaning-of-bitwise-and-of-a-positive-and-negative-number#:~:text=N%26(%2DN)%20will%20give%20you%20position%20of%20the%20first%20bit%20%271%27%20in%20binary%20form%20of%20N) will give you position of the first bit '1' in binary form of N。 ::: :::success b>>__ffs(b) or a>>__ffs(a)在這邊使用就是把偶數除竟成為奇數。 ::: - 使用場景 [/arch/mips/ar7/clock.c](https://elixir.bootlin.com/linux/latest/source/arch/mips/ar7/clock.c)中的[calcualte](https://elixir.bootlin.com/linux/latest/source/arch/mips/ar7/clock.c#:~:text=tmp_gcd%20%3D%20gcd(target%2C%20tmp_base)%3B) ```cpp= static void calculate(int base, int target, int *prediv, int *postdiv, int *mul) { int tmp_gcd, tmp_base, tmp_freq; for (*prediv = 1; *prediv <= 32; (*prediv)++) { tmp_base = base / *prediv; tmp_gcd = gcd(target, tmp_base); *mul = target / tmp_gcd; *postdiv = tmp_base / tmp_gcd; if ((*mul < 1) || (*mul >= 16)) continue; if ((*postdiv > 0) & (*postdiv <= 32)) break; } if (base / *prediv * *mul / *postdiv != target) { approximate(base, target, prediv, postdiv, mul); tmp_freq = base / *prediv * *mul / *postdiv; printk(KERN_WARNING "Adjusted requested frequency %d to %d\n", target, tmp_freq); } printk(KERN_DEBUG "Clocks: prediv: %d, postdiv: %d, mul: %d\n", *prediv, *postdiv, *mul); } ``` - 需要知道`target`跟`tmp_base`代表的分別是什麼，舉例來說，[calculate](https://elixir.bootlin.com/linux/latest/source/arch/mips/ar7/clock.c#:~:text=calculate(cpu_base%2C%20TNETD7200_DEF_CPU_CLK%2C%20%26cpu_prediv%2C%0A%09%09%09%26cpu_postdiv%2C%20%26cpu_mul)%3B)有在此情景下被呼叫。target，是`TNETD7200_DEF_CPU_CLK`然後tmp_base則是由base 除prediv來的，base從caller得知是`cpu_base`。所以可以得知gcd在calculate的應用為找出 `CPU_CLK`與`cpu_base`的gcd。 ```cpp= printk(KERN_INFO "Clocks: Setting CPU clock\n"); calculate(cpu_base, TNETD7200_DEF_CPU_CLK, &cpu_prediv, &cpu_postdiv, &cpu_mul); ``` ## [測驗4](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-4) ```cpp= #include <stddef.h> size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = EXP6; int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } ``` ### 1. 解釋上述程式運作原理，並舉出這樣的程式碼用在哪些真實案例中; 如題所述，#9要找出目前最低位元的 1，並紀錄到 t 變數。所以可以使用[n & -n]((https://stackoverflow.com/questions/15395317/meaning-of-bitwise-and-of-a-positive-and-negative-number#:~:text=N%26(%2DN)%20will%20give%20you%20position%20of%20the%20first%20bit%20%271%27%20in%20binary%20form%20of%20N))得到最低位元。所以`EXP6=bitset & -bitset` ### 2. 設計實驗，檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進？應考慮到不同的 bitmap density; #### Version 1: w/o considering about bitmap density. ```cpp= #include <stddef.h> #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <time.h> #define BILLION 1000000000L; typedef size_t (*func_ptr)(uint64_t *, size_t, uint32_t *); size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; for (size_t k = 0; k < bitmapsize; ++k) { uint64_t bitset = bitmap[k]; size_t p = k * 64; for (int i = 0; i < 64; i++) { if ((bitset >> i) & 0x1) out[pos++] = p + i; } } return pos; } size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = bitset & -bitset; int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } void test(func_ptr func) { struct timespec start, stop; double accum; size_t bitmapsize = 10000000; uint64_t *bitmap = malloc(bitmapsize * sizeof(uint64_t)); for (int i =0; i < bitmapsize; i++) bitmap[i] = i; if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) { perror( "clock gettime" ); exit(1); } uint32_t *out1 = malloc((bitmapsize + 64) * 64 * sizeof(uint32_t)); func(bitmap, bitmapsize, out1); if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) { perror( "clock gettime" ); exit(1); } accum = ( stop.tv_sec - start.tv_sec ) + (double)( stop.tv_nsec - start.tv_nsec ) / (double)BILLION; printf( "%lf\n", accum ); } int main() { test(naive); test(improved); } ``` - result: naive: 1.767586 s improved: 0.489726 s #### Version 2 with bitmap density ```cpp= #include <stddef.h> #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <time.h> #include <stdlib.h> #define BILLION 1000000000L; typedef size_t (*func_ptr)(uint64_t *, size_t, uint32_t *); size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; for (size_t k = 0; k < bitmapsize; ++k) { uint64_t bitset = bitmap[k]; size_t p = k * 64; for (int i = 0; i < 64; i++) { if ((bitset >> i) & 0x1) out[pos++] = p + i; } } return pos; } size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = bitset & -bitset; int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } void test(size_t bitmapsize, uint64_t *bitmap, func_ptr func) { struct timespec start, stop; uint32_t *out = malloc((bitmapsize + 64) * 64 * sizeof(uint32_t)); double accum; if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) { perror( "clock gettime" ); exit(1); } func(bitmap, bitmapsize, out); if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) { perror( "clock gettime" ); exit(1); } accum = ( stop.tv_sec - start.tv_sec ) + (double)( stop.tv_nsec - start.tv_nsec ) / (double)BILLION; printf( "%lf\n", accum ); } int main() { srand(time(NULL)); double Thres[] = {0, 0.25, 0.5, 0.75, 1}; size_t bitmapsize = 10000000; uint64_t *bitmap = malloc(bitmapsize * sizeof(uint64_t)); for (int r=0; r < sizeof(Thres)/sizeof(Thres[0]);r++) { for (int i =0; i < bitmapsize; i++) { for (int j = 0; j < 64; j++) { double ran = ((rand() % 10000) / 10000.0); if (ran > Thres[r]) bitmap[i] | = 1UL <<j; } } printf("===Thres %f===\n", ratios[r]); test(bitmapsize, bitmap, naive); test(bitmapsize, bitmap, improved); for (int i =0; i < bitmapsize; i++) { bitmap[i] = 0; } } } ``` - use raitos as 0, 0.25, 0.5, 0.75, 1 (越來越sparse)。然後產生亂數如果比這個比例還大的話就assign給bitmap。這樣得到的結果如下圖。其實可以看improved algorithm隨著越sparse有更好的效率。 ```cpp= when bitmapsize = 10 ===Thres 0.000000=== 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff naive: 0.000003 improvied: 0.000002, ratio: 1.181237 ===Thres 0.250000=== 0xfb1f7bfafbb3f87e 0xbffb3ec4ff97f7cb 0xefd57bcd7cf7d7de 0xbe1bef6edd302bf6 0xdf70fffafffedded 0xed7f7effdce7fee2 0xff263befa7ff7fef 0x6bf5dfb77cb7b74e 0x4fff4fde1dbbffff 0x7fdd7b3bf08d7fdd naive: 0.000005 improvied: 0.000002, ratio: 2.640046 ===Thres 0.500000=== 0x779d0ea094e810d8 0x70f5fdd8d8773263 0xdb3088c05ec73284 0xd577d9868d002f6e 0xcf1a104ca796ac5f 0xe5f2aaaafc8a7ea 0xf25f78b0106ddc2c 0xdf7befaa53054f76 0x4b2f9810a5c63c03 0xe075191e08e17748 naive: 0.000005 improvied: 0.000001, ratio: 4.247843 ===Thres 0.750000=== 0x81501048a400c0 0x881042d0c05622e8 0x5610801a1054340 0x8280405d88159749 0x810442260d0000e8 0x27031204e069a121 0x2004b6405007411a 0x82606a4605098001 0x320130b01080850 0x8944045016140505 naive: 0.000003 improvied: 0.000001, ratio: 4.338624 ===Thres 1.000000=== 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 naive: 0.000001 improvied: 0.000000, ratio: 11.253968 ``` - When bitmapsize = 10000000 ```csvpreview {header="true"} Density, naive, improved, naive/improved Ratio 1, 1.796758, 0.487978, 3.682048 0.75, 1.615117, 0.436827, 3.697380 0.5, 1.480783, 0.341373, 4.337730 0.25, 1.363405, 0.186559, 7.308186 0, 1.220360, 0.023148, 52.719746 ``` ### 3. 思考進一步的改進空間; ### 4. 閱讀 Data Structures in the Linux Kernel 並舉出 Linux 核心使用 bitmap 的案例; - bitmap 可以用在[priority array](https://www.informit.com/articles/article.aspx?p=101760&seqNum=2#:~:text=The%20Priority%20Arrays,this%20priority%20array.) in kernel/sched.c # [測驗五](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-5) 以下是 LeetCode 166. [Fraction to Recurring Decimal](https://leetcode.com/problems/fraction-to-recurring-decimal/) 的可能實作: ```cpp= #include <stdbool.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include "list.h" struct rem_node { int key; int index; struct list_head link; }; static int find(struct list_head *heads, int size, int key) { struct rem_node *node; int hash = key % size; list_for_each_entry (node, &heads[hash], link) { if (key == node->key) return node->index; } return -1; } char *fractionToDecimal(int numerator, int denominator) { int size = 1024; char *result = malloc(size); char *p = result; if (denominator == 0) { result[0] = '\0'; return result; } if (numerator == 0) { result[0] = '0'; result[1] = '\0'; return result; } /* using long long type make sure there has no integer overflow */ long long n = numerator; long long d = denominator; /* deal with negtive cases */ if (n < 0) n = -n; if (d < 0) d = -d; bool sign = (float) numerator / denominator >= 0; if (!sign) *p++ = '-'; long long remainder = n % d; long long division = n / d; sprintf(p, "%ld", division > 0 ? (long) division : (long) -division); if (remainder == 0) return result; p = result + strlen(result); *p++ = '.'; /* Using a map to record all of reminders and their position. * if the reminder appeared before, which means the repeated loop begin, */ char *decimal = malloc(size); memset(decimal, 0, size); char *q = decimal; size = 1333; struct list_head *heads = malloc(size * sizeof(*heads)); for (int i = 0; i < size; i++) INIT_LIST_HEAD(&heads[i]); for (int i = 0; remainder; i++) { int pos = find(heads, size, remainder); if (pos >= 0) { while (PPP > 0) *p++ = *decimal++; *p++ = '('; while (*decimal != '\0') *p++ = *decimal++; *p++ = ')'; *p = '\0'; return result; } struct rem_node *node = malloc(sizeof(*node)); node->key = remainder; node->index = i; MMM(&node->link, EEE); *q++ = (remainder * 10) / d + '0'; remainder = (remainder * 10) % d; } strcpy(p, decimal); return result; ``` PPP = `pos--` (表示式) MMM = `list_add_tail` (巨集) EEE = `heads[remainder % size]` (表示式) ## 1. 解釋上述程式碼運作原理，指出其中不足，並予以改進 :::info 例如，判斷負號只要寫作 bool isNegative = numerator < 0 ^ denominator < 0; 搭配研讀 [The simple math behind decimal-binary conversion algorithms](https://indepth.dev/posts/1019/the-simple-math-behind-decimal-binary-conversion-algorithms) ::: - 利用linked list hash map去紀錄remainder。當出現的remainder時，`find(heads, size, remainder)`會大於零。這時候就要找出開始循環的第一個index，pos前面的都不會是循環小數。利用`while (pos--)`的性質可以把前面的都先加到result中。 ## 2. 在 Linux 核心原始程式碼的 mm/ 目錄 (即記憶體管理) 中，找到特定整數比值 (即 fraction) 和 bitwise 操作的程式碼案例，予以探討和解說其應用場景 - [linux/mm/page-writeback.c](https://github.com/torvalds/linux/blob/master/mm/page-writeback.c) ```cpp= static ssize_t max_ratio_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int ratio; ssize_t ret; ret = kstrtouint(buf, 10, &ratio); if (ret < 0) return ret; ret = bdi_set_max_ratio(bdi, ratio); if (!ret) ret = count; return ret; } ``` ```cpp= /* * setpoint - dirty 3 * f(dirty) := 1.0 + (----------------) * limit - setpoint * * it's a 3rd order polynomial that subjects to * * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast * (2) f(setpoint) = 1.0 => the balance point * (3) f(limit) = 0 => the hard limit * (4) df/dx <= 0 => negative feedback control * (5) the closer to setpoint, the smaller |df/dx| (and the reverse) * => fast response on large errors; small oscillation near setpoint */ static long long pos_ratio_polynom(unsigned long setpoint, unsigned long dirty, unsigned long limit) { long long pos_ratio; long x; x = div64_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT, (limit - setpoint) | 1); pos_ratio = x; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT; pos_ratio += 1 << RATELIMIT_CALC_SHIFT; return clamp(pos_ratio, 0LL, 2LL << RATELIMIT_CALC_SHIFT); } ``` # [測驗六](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-6) __alignof__ 是 GNU extension，以下是其可能的實作方式: ```cpp= /* * ALIGNOF - get the alignment of a type * @t: the type to test * * This returns a safe alignment for the given type. */ #define ALIGNOF(t) \ ((char *)(&((struct { char c; t _h; } *)0)->M) - (char *)X) ``` 請補完上述程式碼，使得功能與 [__alignof__](https://gcc.gnu.org/onlinedocs/gcc/Alignment.html) 等價。請補完，使得程式碼符合預期，儘量以最簡短且符合一致排版的形式來撰寫。作答區 M = ? X = ? ((size_t)&((TYPE *)0)->MEMBER) ### 1. 解釋上述程式碼運作原理: [參考隔壁同學寫得很清楚的解釋](https://hackmd.io/@sternacht09/SkzgUIcec#%E6%B8%AC%E9%A9%97-6) ALIGNOF() 要回傳 t 的長度，也就是t的長度 - `(char *)（&((struct { char c; t _h; } *)0)->_h)）` - `(struct { char c; t _h; } *)0` 也就是將這個結構的指標放到記憶體 0 的位址。 - 因為 alignment 的緣故， _h 會根據型態 t 的不同，與 char c 的距離有所改變。而 _h 的記憶體起始位址減去 c的記憶體起始位址，就是 alignment 的距離。 - 前面的部份就是 `_h` 的位址 - 後面的部份就是 `char c` 的起始位址，本題題目將整個結構放在 0 的位址上， `c` 的起始位置在 `0` - 所以`M=_h`然後`X=0` ### 2. 在 Linux 核心原始程式碼中找出 __alignof__ 的使用案例 2 則，並針對其場景進行解說 1. [/arch/arm/mm/mmu.c](https://github.com/torvalds/linux/blob/3bf7edc84a9eb4007dd9a0cb8878a7e1d5ec6a3b/arch/arm/mm/mmu.c#L992) ```cpp= /* * Create the architecture specific mappings */ void __init iotable_init(struct map_desc *io_desc, int nr) { struct map_desc *md; struct vm_struct *vm; struct static_vm *svm; if (!nr) return; svm = memblock_alloc(sizeof(*svm) * nr, __alignof__(*svm)); if (!svm) panic("%s: Failed to allocate %zu bytes align=0x%zx\n", __func__, sizeof(*svm) * nr, __alignof__(*svm)); for (md = io_desc; nr; md++, nr--) { create_mapping(md); vm = &svm->vm; vm->addr = (void *)(md->virtual & PAGE_MASK); vm->size = PAGE_ALIGN(md->length + (md->virtual & ~PAGE_MASK)); vm->phys_addr = __pfn_to_phys(md->pfn); vm->flags = VM_IOREMAP | VM_ARM_STATIC_MAPPING; vm->flags |= VM_ARM_MTYPE(md->type); vm->caller = io _init; add_static_vm_early(svm++); } } ``` - MMU（Memory Management Unit) 與內存管理相關，延伸閱讀：[Linux 核心 Copy On Write 實作機制](https://hackmd.io/@linD026/Linux-kernel-COW-Copy-on-Write) 先來了解 `map_desc`，`vm_struct`和`static_vm`分別代表什麼 - static_vm 。在[mm.h](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/arch/arm/mm/mm.h#L68) 中是把vm_struct串成linked list。 ```cpp= struct static_vm { struct vm_struct vm; struct list_head list; }; ``` - vm_struct ```cpp= struct vm_struct { struct vm_struct *next; void *addr; unsigned long size; unsigned long flags; struct page **pages; #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC unsigned int page_order; #endif unsigned int nr_pages; phys_addr_t phys_addr; const void *caller; }; ``` - #13 `svm = memblock_alloc(sizeof(*svm), __alignof__(*svm));` 中使用`memblock_alloc` 去allocate memory block. ```cpp= phys_addr_t __init memblock_alloc(phys_addr_t size, phys_addr_t align) { return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE); } ``` 在一連串的function call之後會align這個值會傳到以下這個function ```cpp= /* * __memblock_find_range_bottom_up - find free area utility in bottom-up * @start: start of candidate range * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE} * @size: size of free area to find * @align: alignment of free area to find * @nid: nid of the free area to find, %NUMA_NO_NODE for any node * @flags: pick from blocks based on memory attributes * * Utility called from memblock_find_in_range_node(), find free area bottom-up. * * RETURNS: * Found address on success, 0 on failure. */ static phys_addr_t __init_memblock __memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end, phys_addr_t size, phys_addr_t align, int nid, ulong flags) { phys_addr_t this_start, this_end, cand; u64 i; for_each_free_mem_range(i, nid, flags, &this_start, &this_end, NULL) { this_start = clamp(this_start, start, end); this_end = clamp(this_end, start, end); cand = round_up(this_start, align); if (cand < this_end && this_end - cand >= size) return cand; } return 0; } ``` 然後會call `cand = round_up(this_start, align);` ```cpp= /* * This looks more complex than it should be. But we need to * get the type for the ~ right in round_down (it needs to be * as wide as the result!), and we want to evaluate the macro * arguments just once each. */ #define __round_mask(x, y) ((__typeof__(x))((y)-1)) #define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1) ``` 根據[此篇](https://buaq.net/go-72700.html#:~:text=split_mem_range()%E6%A0%B9%E6%8D%AE%E4%BC%A0%E5%85%A5%E7%9A%84%E5%86%85%E5%AD%98%20start%20%E5%92%8C%20end%20%E5%81%9A%E5%9B%9B%E8%88%8D%E4%BA%94%E5%85%A5%E7%9A%84%E5%AF%B9%E9%BD%90%E6%93%8D%E4%BD%9C%EF%BC%88%E5%8D%B3%20round_up%20%E5%92%8C%20round_down%EF%BC%89)文章，得知round_up 宏依靠整数除法来完成这项工作，當兩個參數均为整數時，它才有效。x 是需要四捨五入的数字，y 是应该四捨五入的間隔，也就是说，round_up(12,5)應該返回 15，因为 15 是大於 12 的 5 的第一个间隔，而 round_down(12,5)應返回 10。所以`__memblock_find_range_bottom_up`看起來是會找到可以塞進去的candidate memory location. ### 3. 在 Linux 核心源使程式碼找出 ALIGN, ALIGN_DOWN, ALIGN_UP 等巨集，探討其實作機制和用途，並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻，儘量共用相同的巨集 - ALIGN 定義在 [include/linux/kernel.h](https://elixir.bootlin.com/linux/v4.5/source/include/linux/kernel.h#:~:text=%23define%20ALIGN(x%2C%20a)%09%09__ALIGN_KERNEL((x)%2C%20(a))) ```cpp= #define ALIGN(x, a) __ALIGN_KERNEL((x), (a)) ``` 然後__ALIGN_KERNEL是在 [include/uapi/linux/kernel.h](https://elixir.bootlin.com/linux/v4.5/source/include/uapi/linux/kernel.h#L9), ```cpp= /* * 'kernel.h' contains some often-used function prototypes etc */ #define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1) #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask)) ``` 用[例子](https://stackoverflow.com/questions/13122846/align-macro-kernel#:~:text=Say%20you%20have%20a%20number%3A%200x1006)來看。假設你想要找`0x1006`跟4 byte alignment的情況。所以接近的memory是`0x1000`, `0x1004`, `0x1008`。在這個例子中，會選`0x1008`。在這邊alignment mask 對4來說是 `0x03` ( (typeof(x))(4) - 1 = 0x03)，所以 0x1006 + 0x03 = 0x1009 然後 0x1009 & ~0x03 = 0x1008。 - ALIGN_DOWN 定義在 [arch/metag/kernel/stacktrace.c]((https://elixir.bootlin.com/linux/v4.5/source/arch/metag/kernel/stacktrace.c#:~:text=%23define%20ALIGN_DOWN(addr%2C%20size)%20%20((addr)%26(~((size)%2D1))))) ```cpp= #define ALIGN_DOWN(addr, size) ((addr)&(~((size)-1))) ``` 假設addr是`0x1006`，然後size 是4。`ALIGN_DOWN`會得到((0x1006)&(~((4)-1))) = ((0x1006)&(~(3))) = ((0x1006)&(0xFFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFC))= 0x1004。所以就會往下得到最接近的memory。 ALIGN_UP 其中一定義在 [tools/testing/selftests/net/tcp_mmap.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/tools/testing/selftests/net/tcp_mmap.c#L123) ```cpp= #define ALIGN_UP(x, align_to) (((x) + ((align_to)-1)) & ~((align_to)-1)) ``` 相反的ALIGN_UP就會往上得到最近的值。 ### [測驗七](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-7) #### naive ```cpp= #include <stdio.h> int main() { for (unsigned int i = 1; i < 100; i++) { if (i % 3 == 0) printf("Fizz"); if (i % 5 == 0) printf("Buzz"); if (i % 15 == 0) printf("FizzBuzz"); if ((i % 3) && (i % 5)) printf("%u", i); printf("\n"); } return 0; } ``` #### improved ```cpp= static inline bool is_divisible(uint32_t n, uint64_t M) { return n * M <= M - 1; } static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1; static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1; int main(int argc, char **argv) { for (size_t i = 1; i <= 100; i++) { uint8_t div3 = is_divisible(i, M3); uint8_t div5 = is_divisible(i, M5); unsigned int length = (2 << KK1) << KK2; char fmt[9]; strncpy(fmt, &"FizzBuzz%u"[(9 >> div5) >> (KK3)], length); fmt[length] = '\0'; printf(fmt, i); printf("\n"); } return 0; } ``` #### 1. 解釋上述程式運作原理並評估 naive.c 和 bitwise.c 效能落差 - 避免 stream I/O 帶來的影響，可將 printf 更換為 sprintf 先分析在不同div3 和div5 的情況下，`FizzBuzz%u`的start和length分別必須是多少 ```csvpreview {header="true"} ,length,start 3 的倍數 ,4, 0 5 的倍數, 4, 4 15 的倍數, 8, 0 都不是, 2, 8 ``` 這邊可以看出來，對於`length = (2 << KK1) << KK2`，KK1和KK2分別是`div3`和`div5`。因為div3=1和div5=1時代表是15倍數，所以要`<<2`。反之，若只有其中一個的倍數，則只要<<1。都不是的話就<<0。再來考慮start=`9 >> div5) >> (KK3)`。 - 先找都不是兩者的倍數的情況，得知start需要是8。但光靠right shift，是不可能把9>>KK3變成8的。 - 再來考慮div3=1, div5=0的情況，這時候start必須是0。也就是說`(9 >> 0) >> (KK3)` 需要等於0。也就是KK3必須是 4 (9>>4 = 0) - 接著是考慮div3=0, div5=1 的情況，這時候start必須是4，也就是說`(9 >> 1) >> (KK3)`必須是4。（8 >> KK3 = 4）。 KK3 = 1 - 最後則是div3=1, div5=1，這時候start是0，也就是KK3必須是4。 ```csvpreview {header="true"} ,(9 >> div5), 9 >> (KK3),KK3 3 的倍數(div5=0) , 9, 0, 4 5 的倍數(div5=1), 4, 4, 0 15 的倍數(div5=1), 4, 0, 3 都不是(div5=0), 9, 8, ? ``` #### 2. 分析 [Faster remainders when the divisor is a constant: beating compilers and libdivide](https://hackmd.io/@sysprog/linux2022-quiz2#:~:text=%E5%88%86%E6%9E%90%20Faster%20remainders%20when%20the%20divisor%20is%20a%20constant%3A%20beating%20compilers%20and%20libdivide%20%E4%B8%80%E6%96%87%E7%9A%84%E6%83%B3%E6%B3%95%20(%E5%8F%AF%E5%8F%83%E7%85%A7%E5%90%8C%E4%B8%80%E7%AF%87%E7%B6%B2%E8%AA%8C%E4%B8%8B%E6%96%B9%E7%9A%84%E8%A9%95%E8%AB%96)%EF%BC%8C%E4%B8%A6%E8%A8%AD%E8%A8%88%E5%8F%A6%E4%B8%80%E7%A8%AE%20bitmask%EF%BC%8C%E5%A6%82%E3%80%8C%E5%8F%AF%E8%A2%AB%203) 一文的想法 (可參照同一篇網誌下方的評論)，並設計另一種 bitmask，如「可被 3 整除則末位設為 1」「可被 5 整除則倒數第二位設定為 1」，然後再改寫 bitwise.c 程式碼，試圖運用更少的指令來實作出 branchless; - 參照 [fastmod: A header file for fast 32-bit division remainders on 64-bit hardware](https://github.com/lemire/fastmod) #### 3. 研讀 [The Fastest FizzBuzz Implementation](https://hackmd.io/@sysprog/linux2022-quiz2#:~:text=The%20Fastest%20FizzBuzz%20Implementation) 及其 [Hacker News](https://news.ycombinator.com/item?id=29413656) 討論，提出 throughput (吞吐量) 更高的 Fizzbuzz 實作 #### 4. 解析 Linux 核心原始程式碼 [kernel/time/timekeeping.c](https://github.com/torvalds/linux/blob/master/kernel/time/timekeeping.c) 裡頭涉及到除法運算的機制，探討其快速除法的實作 (注意: 你可能要對照研讀 kernel/time/ 目錄的標頭檔和程式碼) 過程中，你可能會發現可貢獻到 Linux 核心的空間，請充分討論