# 2022q1 Homework2 (quiz2)
contributed by <shawn5141>
## [測驗1](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-1)
### 1. 解釋上述程式碼運作的原理
```cpp=
#include <stdint.h>
uint32_t average(uint32_t low, uint32_t high)
{
return low + (high - low) / 2;
}
```
可以改寫成
```cpp=
#include <stdint.h>
uint32_t average(uint32_t a, uint32_t b)
{
return (a >> 1) + (b >> 1) + (EXP1);
}
```
舉例的話就可以知道right shift 之後就會發現有LSB消失後造成carrier也一併消失的可能。所以需要把它加回去
```
a = 9, b= 13 平均為(9+13)/2 = 11 => 0b1011
a = 9 => 0b1001 >> 1 => 0b0100
b = 13 => 0b1101 >> 1 => 0b0110
0b0100 | 0b0110 => 0b1010 => 10
```
使用`a & b & 1` 可以留下 `a & b`皆為1的部分,再 透過`&1`取最後一位。所以只要當兩者最後一位都為1的情況下(代表會有進位產生)需要把此進位加回去。
#### EX1 = `a & b & 1`
```cpp=
#include <stdint.h>
uint32_t average(uint32_t a, uint32_t b)
{
return (a >> 1) + (b >> 1) + (a & b & 1);
}
```
接著我們改寫為以下函式(限定用 OR, AND, XOR 這三個運算子)
```cpp=
uint32_t average(uint32_t a, uint32_t b)
{
return (EXP2) + ((EXP3) >> 1);
}
```
這次我們使用 `a=11 (0b1011)`和`b=13 (0b1101)`的例子,兩者的平均為 12 (0b1100)。因為題目提示說`EXP2`和`EXP3`接為a,b 或AND, OR, XOR operator 的組合。
所以透過排列組合發現
```cpp=
a & b = 0b1001 (9)
a | b = 0b1111 (15)
a ^ b = 0b0110 (6)
```
接著把此三數都右移1
```cpp=
a & b >>1 = 0b100 (4)
a | b >>1 = 0b111 (7)
a ^ b >>1 = 0b011 (3)
```
可以發現在 `(a & b) + ((a ^ b) >> 1)`的情況下會等於`0b1100`也就是我們在找的平均。從此篇[文章中](https://stackoverflow.com/questions/2385389/xor-using-mathematical-operators#:~:text=TL%3BDR,of%20a%20%26%20b.)(amazing!),得知`a XOR b` in binary為`a + b - 2ab` 而 `a AND b` in binary 是`ab`。
```
(a AND b) + (a XOR b) >> 1
=> (ab) + (a+b-2ab)/2
=> (a+b)/2
```
#### EX2 = `a & b`
#### EX3 = `(a ^ b) >> 1`
### 比較上述實作在編譯器最佳化開啟的狀況,對應的組合語言輸出,並嘗試解讀 (可研讀 CS:APP 第 3 章)
```cpp=
uint32_t avg(uint32_t a, uint32_t b)
{
return (a >> 1) + (b >> 1) + (a & b & 1);
}
```
```cpp=
0x64b <avg> mov %edi,%eax // $edi (a) value move to $eax │
0x64d <avg+2> shr %eax // right shift %eax │
0x64f <avg+4> mov %esi,%edx // $esi (b) value move to $edx │
0x651 <avg+6> shr %edx // right shift %edx │
0x653 <avg+8> add %edx,%eax // add %edx and $eax => (a >> 1) + (b >> 1) │
0x655 <avg+10> and %esi,%edi // %esi AND $edi => a & b and store in $edi │
0x657 <avg+12> and $0x1,%edi // $0x1 AND $edi => 0x1 & (a & b) store in %edi │
0x65a <avg+15> add %edi,%eax // add $edi with $eax │
0x65c <avg+17> retq
```
- SHR 是[right shift](https://www.csie.ntu.edu.tw/~acpang/course/asm_2004/slides/chapt_07_PartISolve.pdf)
```cpp=
uint32_t avg2(uint32_t a, uint32_t b)
{
return (a&b) + ((a^b)>>1);
}
```
```cpp=
0x65d <avg2> mov %edi,%edx // move $edi (a) to $edx │
0x65f <avg2+2> xor %esi,%edx // $esi(b) XOR $edx => (a ^ b) and store in $edx │
0x661 <avg2+4> shr %edx // right shift $edx => (a ^ b) >> 1 │
0x663 <avg2+6> and %esi,%edi // $esi AND $edi => (a & b) │
0x665 <avg2+8> lea (%rdx,%rdi,1),%eax // load address (%rdx, %rdi, 1) to $eax │
0x668 <avg2+11> retq
```
- 可以看出使用較少的instruction
- 不是很確定為什麼要使用lea,不直接使用mov就好
### 2. 研讀 Linux 核心原始程式碼 include/linux/average.h,探討其 Exponentially weighted moving average (EWMA) 實作,以及在 Linux 核心的應用
- 先回答第二題,EWMA有使用在[/include/linux/sched.h](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#:~:text=enqueued%3B%0A%09unsigned%20int-,ewma%3B,-%23define%20UTIL_EST_WEIGHT_SHIFT%09%092)
- 再來看程式碼 [include/linux/average.h](https://elixir.bootlin.com/linux/v4.1/source/include/linux/average.h)
```cpp=
#ifndef _LINUX_AVERAGE_H
#define _LINUX_AVERAGE_H
/* Exponentially weighted moving average (EWMA) */
/* For more documentation see lib/average.c */
struct ewma {
unsigned long internal;
unsigned long factor;
unsigned long weight;
};
extern void ewma_init(struct ewma *avg, unsigned long factor,
unsigned long weight);
extern struct ewma *ewma_add(struct ewma *avg, unsigned long val);
/**
* ewma_read() - Get average value
* @avg: Average structure
*
* Returns the average value held in @avg.
*/
static inline unsigned long ewma_read(const struct ewma *avg)
{
return avg->internal >> avg->factor;
}
#endif /* _LINUX_AVERAGE_H */
```
[average.c](https://docs.huihoo.com/doxygen/linux/kernel/3.7/average_8c_source.html)
```cpp=
/*
* lib/average.c
*
* This source code is licensed under the GNU General Public License,
* Version 2. See the file COPYING for more details.
*/
#include <linux/export.h>
#include <linux/average.h>
#include <linux/kernel.h>
#include <linux/bug.h>
#include <linux/log2.h>
void ewma_init(struct ewma *avg, unsigned long factor, unsigned long weight)
{
WARN_ON(!is_power_of_2(weight) || !is_power_of_2(factor));
avg->weight = ilog2(weight);
avg->factor = ilog2(factor);
avg->internal = 0;
}
EXPORT_SYMBOL(ewma_init);
struct ewma *ewma_add(struct ewma *avg, unsigned long val)
{
avg->internal = avg->internal ?
(((avg->internal << avg->weight) - avg->internal) +
(val << avg->factor)) >> avg->weight :
(val << avg->factor);
return avg;
}
EXPORT_SYMBOL(ewma_add);
```
:::info
Math definition of [EWMA](https://towardsdatascience.com/time-series-from-scratch-exponentially-weighted-moving-averages-ewma-theory-and-implementation-607661d574fe)
- EWMA applies weights to the values of a time series. More weight is applied to more recent data points, making them more relevant for future forecasts. That’s the most significant improvement over simple moving averages.
![](https://i.imgur.com/39L9J54.png)
![](https://i.imgur.com/C8Ryzdo.png)
[image source](https://towardsdatascience.com/time-series-from-scratch-exponentially-weighted-moving-averages-ewma-theory-and-implementation-607661d574fe)
:::
#### [ilog2](https://docs.huihoo.com/doxygen/linux/kernel/3.7/log2_8h.html#:~:text=%23define%20ilog2,)
ilog2: [Implement a general integer log2 facility in the kernel](https://lwn.net/Articles/203596/)
- log of base 2 of 32-bit or a 64-bit unsigned value
- parameter constant-capable log of base 2 calculation
- this can be used to initialise global variables from constant data, hence the massive ternary operator construction
selects the appropriately-sized optimised version depending on sizeof(n)
```cpp=
/**
* ilog2 - log base 2 of 32-bit or a 64-bit unsigned value
* @n: parameter
*
* constant-capable log of base 2 calculation
* - this can be used to initialise global variables from constant data, hence
* the massive ternary operator construction
*
* selects the appropriately-sized optimised version depending on sizeof(n)
*/
#define ilog2(n) \
( \
__builtin_constant_p(n) ? \
((n) < 2 ? 0 : \
63 - __builtin_clzll(n)) : \
(sizeof(n) <= 4) ? \
__ilog2_u32(n) : \
__ilog2_u64(n) \
)
```
- 這邊出現了`ternary oeprator`的vartion. [c-nested-ternary-operator](https://www.geeksforgeeks.org/c-nested-ternary-operator/)。例子如下:
```cpp=
a ? b: c ? d : e ? f : g ? h : i
a ? b
: c ? d
: e ? f
: g ? h
: i
if a then b
else if c then d
else if e then f
else if g then h
else i
```
- 所以上面ilog2就是在說:
```cpp=
if n is const:
if (n) < 2:
return 0
else:
return 63 - __builtin_clzll(n) \\ 63 - count of leading 0 => 剩下的位數
else: \\ n is not const
if sizeof(n) <= 4:
return __ilog2_u32(n)
else:
return __ilog2_u64(n)
```
:::info
[__builtin_constant_p](https://www.daemon-systems.org/man/__builtin_constant_p.3.html)
The __builtin_constant_p() is a GNU extension for determining whether a
value is known to be constant at compile time. The function is closely
related to the concept of "constant folding" used by modern optimizing
compilers.
:::
- What is `__builtin_clzll`, similar to `__builtin_clz`, except the argument type is unsigned long long. Then what is [__builtin_clz](https://stackoverflow.com/questions/9353973/implementation-of-builtin-clz)
:::info
CLZ (count leading zero) and BSR (bit-scan reverse) are related but different. CLZ equals (type bit width less one) - BSR. CTZ (count trailing zero), also know as FFS (find first set) is the same as BSF (bit-scan forward.)
:::
- 同樣是在ilog2.h中有定義`__ilog2_u32` 和 `__ilog2_u64`。
```cpp=
/* SPDX-License-Identifier: GPL-2.0-or-later */
/* Integer base 2 logarithm calculation
*
* Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
* Written by David Howells (dhowells@redhat.com)
*/
#ifndef _LINUX_LOG2_H
#define _LINUX_LOG2_H
#include <linux/types.h>
#include <linux/bitops.h>
/*
* non-constant log of base 2 calculators
* - the arch may override these in asm/bitops.h if they can be implemented
* more efficiently than using fls() and fls64()
* - the arch is not required to handle n==0 if implementing the fallback
*/
#ifndef CONFIG_ARCH_HAS_ILOG2_U32
static inline __attribute__((const))
int __ilog2_u32(u32 n)
{
return fls(n) - 1;
}
#endif
#ifndef CONFIG_ARCH_HAS_ILOG2_U64
static inline __attribute__((const))
int __ilog2_u64(u64 n)
{
return fls64(n) - 1;
}
#endif
```
他們分別是呼叫不同的function `fls` or `fls64`。這兩個function 都在 [fls64.h](include/asm-generic/bitops/fls64.h)中。簡單來說,就是要找第一個set bit。[ref](https://hackmd.io/@cccccs100203/linux2021-quiz2#fls-ampamp-fls64)
```cpp=
static inline int fls64(unsigned long word)
{
return 64 - __kernel_ctlz(word);
}
static inline int fls(unsigned int x)
{
return fls64(x);
}
static inline unsigned long __fls(unsigned long x)
{
return fls64(x) - 1;
}
static inline int fls(unsigned int x)
{
return fls64(x);
}
```
## [測驗二](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-2)
### 1. 解釋上述程式碼運作的原理
```cpp=
#include <stdint.h>
uint32_t max(uint32_t a, uint32_t b)
{
return a ^ ((EXP4) & -(EXP5));
}
```
參考[XOR-Trick](https://florian.github.io/xor-trick/)。
- 假設x是`((EXP4) & -(EXP5))`,`a^x` 要得到 `a`,`x`必須是0。同理,要是想要得到b的話,`x`必須是 `a^b^a`,這樣才會剩下`b`。所以當 `a>b` 時,取0。當`a<b` 就取 `b^a`。
- 如果在`a>b`的情況下,要讓`((EXP4) & -(EXP5))`變成0。需要讓EXP5是`a<b`,這樣`-(a<b)`就會變成0,進而導致`((EXP4) & -(EXP5))`為零。
- 相對的在`a<b`的情況,考慮`EXP5=(a<b)`,可以得到`((EXP4) & -(1))`,因為`-1` 是 `0xFFFFFFFF`,所以EXP4只要是`b^a`就可以讓最前面的`a`消掉。
```cpp=
#include <stdint.h>
uint32_t max(uint32_t a, uint32_t b)
{
return a ^ ((a^b) & -(a < b));
}
```
### 2. 針對 32 位元無號/有號整數,撰寫同樣 branchless 的實作
```cpp=
#include <stdint.h>
uint32_t max(int32_t a, int32_t b)
{
return something;
}
```
### 3. Linux 核心也有若干 branchless / branch-free 的實作,例如 lib/sort.c。請在 Linux 核心原始程式碼中,找出更多類似的實作手法。請善用 git log 檢索。
- One [result](https://github.com/torvalds/linux/commit/747f49ba67b8895a5831ab539de551b916f3738c) on `branchless` (`git log --grep="branchless"`)
```cpp=
commit 747f49ba67b8895a5831ab539de551b916f3738c
Author: Steven Fuerst <svfuerst@gmail.com>
Date: Wed Aug 15 15:07:15 2012 -0700
Replace int2float() with an optimized version.
We use __fls() to find the most significant bit. Using that, the
loop can be avoided. A second trick is to use the behaviour of the
rotate instructions to expand the range of the unsigned int to float
conversion to the full 32 bits in a branchless way.
The routine is now exact up to 2^24. Above that, we truncate which
is equivalent to rounding towards zero.
Signed-off-by: Steven Fuerst <svfuerst@gmail.com>
```
drivers/gpu/drm/radeon/r600_blit.c
```cpp=
/* 23 bits of float fractional data */
#define I2F_FRAC_BITS 23
#define I2F_MASK ((1 << I2F_FRAC_BITS) - 1)
/*
* Converts unsigned integer into 32-bit IEEE floating point representation.
* Will be exact from 0 to 2^24. Above that, we round towards zero
* as the fractional bits will not fit in a float. (It would be better to
* round towards even as the fpu does, but that is slower.)
*/
uint32_t int2float(uint32_t x)
{
uint32_t msb, exponent, fraction;
/* Zero is special */
if (!x) return 0;
/* Get location of the most significant bit */
msb = __fls(x);
/*
* Use a rotate instead of a shift because that works both leftwards
* and rightwards due to the mod(32) behaviour. This means we don't
* need to check to see if we are above 2^24 or not.
*/
fraction = ror32(x, (msb - I2F_FRAC_BITS) & 0x1f) & I2F_MASK;
exponent = (127 + msb) << I2F_FRAC_BITS;
return fraction + exponent;
}
```
## [測驗三](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-3)
改寫此程式
```cpp=
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
if (!u || !v) return u | v;
while (v) {
uint64_t t = v;
v = u % v;
u = t;
}
return u;
}
```
成為以下程式
```cpp=
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
if (!u || !v) return u | v;
int shift;
for (shift = 0; !((u | v) & 1); shift++) {
u /= 2, v /= 2;
}
while (!(u & 1))
u /= 2;
do {
while (!(v & 1))
v /= 2;
if (u < v) {
v -= u;
} else {
uint64_t t = u - v;
u = v;
v = t;
}
} while (COND);
return RET;
}
```
### 1. 解釋上述程式運作原理;
![](https://i.imgur.com/eGNI2uX.png)
- #6-#7 `((u | v) & 1)` 去找 u 或 v 的最後一位如果有1的話,代表其中必定有一個為奇數,所以就停下來(取not)
- #9-#10 把二除淨,確保 u 是奇數
- #11-#21
- #12-#13 把 v 的二除淨,確定v是奇數
- #14-#19 如果`u < v`的話,就把 `v-=u`,不然就讓`t`等於`u-v`,然後把v assign給 u,餘數t assign給 v (交換後,把小的變成餘數t)
- 按照上面所給的演算法,COND應該是當u 或者是 v其中一個為零的時候。所以應該是 `COND= `
- #22 而RET則是要回傳剩下不為 0 的數。 `RET = d << shift`
### 2. 在 x86_64 上透過 __builtin_ctz 改寫 GCD,分析對效能的提升;
### 3. Linux 核心中也內建 GCD (而且還不只一種實作),例如 lib/math/gcd.c,請解釋其實作手法和探討在核心內的應用場景。
```cpp=
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/kernel.h>
#include <linux/gcd.h>
#include <linux/export.h>
/*
* This implements the binary GCD algorithm. (Often attributed to Stein,
* but as Knuth has noted, appears in a first-century Chinese math text.)
*
* This is faster than the division-based algorithm even on x86, which
* has decent hardware division.
*/
#if !defined(CONFIG_CPU_NO_EFFICIENT_FFS)
/* If __ffs is available, the even/odd algorithm benchmarks slower. */
/**
* gcd - calculate and return the greatest common divisor of 2 unsigned longs
* @a: first value
* @b: second value
*/
unsigned long gcd(unsigned long a, unsigned long b)
{
unsigned long r = a | b;
if (!a || !b)
return r; // 當 a or b其中一個為0 時,可以回傳 r (代表另一個數是兩者的gcd)
b >>= __ffs(b);
if (b == 1)
return r & -r; //當 b 成為1時,回傳 r & (-r),會找出r在binary form中的first bits。因為在b right shift __ffs(b)之後變成1代表b原本是2的倍數。而 使用r & -r可以找出兩者的最小的bits
for (;;) {
a >>= __ffs(a);
if (a == 1)
return r & -r;
if (a == b)
return a << __ffs(r);
if (a < b)
swap(a, b);
a -= b;
}
}
#else
/* If normalization is done by loops, the even/odd algorithm is a win. */
unsigned long gcd(unsigned long a, unsigned long b)
{
unsigned long r = a | b;
if (!a || !b)
return r;
/* Isolate lsbit of r */
r &= -r;
while (!(b & r))
b >>= 1;
if (b == r)
return r;
for (;;) {
while (!(a & r))
a >>= 1;
if (a == r)
return r;
if (a == b)
return a;
if (a < b)
swap(a, b);
a -= b;
a >>= 1;
if (a & r)
a += b;
a >>= 1;
}
}
#endif
EXPORT_SYMBOL_GPL(gcd);
```
- [__ffs](https://github.com/spotify/linux/blob/master/include/asm-generic/bitops/__ffs.h) is used to find the first set (which is the least significant set in word). In first gcd function, `__ffs` 被大量的使用。
- 使用例子來解釋。設a = 6 (0b0110)和 b=3 (0b0011)並使用第一個gcd function。
- #25 r= a | b = 0b0111 => r = 7
- !a || !b 不為`True` 因為 !(6) 和 !(3) is 0。
- #30 `b >>= __ffs(b)`。__ffs(b)是0。所以3不用right shift。
- #35 `a >> __ffs(a)`會讓a變成3。因為a不等於1,所以不會trigger #36,反而是會讓conditino in #38成立,也就是 a==b。這時候會回傳`a << __ffs(r)`,也就是 `a <<__ffs(7)` (__ffs(7) = 0)。所以答案回傳3。
- #41-#43 在a!=b的情況下,則會把大的減去小的,然後繼續下個循環。
:::success
[r&(-r)](https://stackoverflow.com/questions/15395317/meaning-of-bitwise-and-of-a-positive-and-negative-number#:~:text=N%26(%2DN)%20will%20give%20you%20position%20of%20the%20first%20bit%20%271%27%20in%20binary%20form%20of%20N) will give you position of the first bit '1' in binary form of N。
:::
:::success
b>>__ffs(b) or a>>__ffs(a)在這邊使用就是把偶數除竟成為奇數。
:::
- 使用場景 [/arch/mips/ar7/clock.c](https://elixir.bootlin.com/linux/latest/source/arch/mips/ar7/clock.c)中的[calcualte](https://elixir.bootlin.com/linux/latest/source/arch/mips/ar7/clock.c#:~:text=tmp_gcd%20%3D%20gcd(target%2C%20tmp_base)%3B)
```cpp=
static void calculate(int base, int target, int *prediv, int *postdiv,
int *mul)
{
int tmp_gcd, tmp_base, tmp_freq;
for (*prediv = 1; *prediv <= 32; (*prediv)++) {
tmp_base = base / *prediv;
tmp_gcd = gcd(target, tmp_base);
*mul = target / tmp_gcd;
*postdiv = tmp_base / tmp_gcd;
if ((*mul < 1) || (*mul >= 16))
continue;
if ((*postdiv > 0) & (*postdiv <= 32))
break;
}
if (base / *prediv * *mul / *postdiv != target) {
approximate(base, target, prediv, postdiv, mul);
tmp_freq = base / *prediv * *mul / *postdiv;
printk(KERN_WARNING
"Adjusted requested frequency %d to %d\n",
target, tmp_freq);
}
printk(KERN_DEBUG "Clocks: prediv: %d, postdiv: %d, mul: %d\n",
*prediv, *postdiv, *mul);
}
```
- 需要知道`target`跟`tmp_base`代表的分別是什麼,舉例來說,[calculate](https://elixir.bootlin.com/linux/latest/source/arch/mips/ar7/clock.c#:~:text=calculate(cpu_base%2C%20TNETD7200_DEF_CPU_CLK%2C%20%26cpu_prediv%2C%0A%09%09%09%26cpu_postdiv%2C%20%26cpu_mul)%3B)有在此情景下被呼叫。target,是`TNETD7200_DEF_CPU_CLK`然後tmp_base則是由base 除prediv來的,base從caller得知是`cpu_base`。所以可以得知gcd在calculate的應用為找出 `CPU_CLK`與`cpu_base`的gcd。
```cpp=
printk(KERN_INFO "Clocks: Setting CPU clock\n");
calculate(cpu_base, TNETD7200_DEF_CPU_CLK, &cpu_prediv,
&cpu_postdiv, &cpu_mul);
```
## [測驗4](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-4)
```cpp=
#include <stddef.h>
size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
bitset = bitmap[k];
while (bitset != 0) {
uint64_t t = EXP6;
int r = __builtin_ctzll(bitset);
out[pos++] = k * 64 + r;
bitset ^= t;
}
}
return pos;
}
```
### 1. 解釋上述程式運作原理,並舉出這樣的程式碼用在哪些真實案例中;
如題所述,#9要找出目前最低位元的 1,並紀錄到 t 變數。所以可以使用[n & -n]((https://stackoverflow.com/questions/15395317/meaning-of-bitwise-and-of-a-positive-and-negative-number#:~:text=N%26(%2DN)%20will%20give%20you%20position%20of%20the%20first%20bit%20%271%27%20in%20binary%20form%20of%20N))得到最低位元。所以`EXP6=bitset & -bitset`
### 2. 設計實驗,檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進?應考慮到不同的 bitmap density;
#### Version 1: w/o considering about bitmap density.
```cpp=
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#define BILLION 1000000000L;
typedef size_t (*func_ptr)(uint64_t *, size_t, uint32_t *);
size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
for (size_t k = 0; k < bitmapsize; ++k) {
uint64_t bitset = bitmap[k];
size_t p = k * 64;
for (int i = 0; i < 64; i++) {
if ((bitset >> i) & 0x1)
out[pos++] = p + i;
}
}
return pos;
}
size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
bitset = bitmap[k];
while (bitset != 0) {
uint64_t t = bitset & -bitset;
int r = __builtin_ctzll(bitset);
out[pos++] = k * 64 + r;
bitset ^= t;
}
}
return pos;
}
void test(func_ptr func)
{
struct timespec start, stop;
double accum;
size_t bitmapsize = 10000000;
uint64_t *bitmap = malloc(bitmapsize * sizeof(uint64_t));
for (int i =0; i < bitmapsize; i++)
bitmap[i] = i;
if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit(1);
}
uint32_t *out1 = malloc((bitmapsize + 64) * 64 * sizeof(uint32_t));
func(bitmap, bitmapsize, out1);
if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" );
exit(1);
}
accum = ( stop.tv_sec - start.tv_sec )
+ (double)( stop.tv_nsec - start.tv_nsec )
/ (double)BILLION;
printf( "%lf\n", accum );
}
int main()
{
test(naive);
test(improved);
}
```
- result:
naive: 1.767586 s
improved: 0.489726 s
#### Version 2 with bitmap density
```cpp=
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdlib.h>
#define BILLION 1000000000L;
typedef size_t (*func_ptr)(uint64_t *, size_t, uint32_t *);
size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
for (size_t k = 0; k < bitmapsize; ++k) {
uint64_t bitset = bitmap[k];
size_t p = k * 64;
for (int i = 0; i < 64; i++) {
if ((bitset >> i) & 0x1)
out[pos++] = p + i;
}
}
return pos;
}
size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
bitset = bitmap[k];
while (bitset != 0) {
uint64_t t = bitset & -bitset;
int r = __builtin_ctzll(bitset);
out[pos++] = k * 64 + r;
bitset ^= t;
}
}
return pos;
}
void test(size_t bitmapsize, uint64_t *bitmap, func_ptr func)
{
struct timespec start, stop;
uint32_t *out = malloc((bitmapsize + 64) * 64 * sizeof(uint32_t));
double accum;
if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit(1);
}
func(bitmap, bitmapsize, out);
if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" );
exit(1);
}
accum = ( stop.tv_sec - start.tv_sec )
+ (double)( stop.tv_nsec - start.tv_nsec )
/ (double)BILLION;
printf( "%lf\n", accum );
}
int main()
{
srand(time(NULL));
double Thres[] = {0, 0.25, 0.5, 0.75, 1};
size_t bitmapsize = 10000000;
uint64_t *bitmap = malloc(bitmapsize * sizeof(uint64_t));
for (int r=0; r < sizeof(Thres)/sizeof(Thres[0]);r++) {
for (int i =0; i < bitmapsize; i++) {
for (int j = 0; j < 64; j++) {
double ran = ((rand() % 10000) / 10000.0);
if (ran > Thres[r])
bitmap[i] | = 1UL <<j;
}
}
printf("===Thres %f===\n", ratios[r]);
test(bitmapsize, bitmap, naive);
test(bitmapsize, bitmap, improved);
for (int i =0; i < bitmapsize; i++) {
bitmap[i] = 0;
}
}
}
```
- use raitos as 0, 0.25, 0.5, 0.75, 1 (越來越sparse)。然後產生亂數如果比這個比例還大的話就assign給bitmap。這樣得到的結果如下圖。其實可以看improved algorithm隨著越sparse有更好的效率。
```cpp=
when bitmapsize = 10
===Thres 0.000000===
0xffffffffffffffff
0xffffffffffffffff
0xffffffffffffffff
0xffffffffffffffff
0xffffffffffffffff
0xffffffffffffffff
0xffffffffffffffff
0xffffffffffffffff
0xffffffffffffffff
0xffffffffffffffff
naive: 0.000003 improvied: 0.000002, ratio: 1.181237
===Thres 0.250000===
0xfb1f7bfafbb3f87e
0xbffb3ec4ff97f7cb
0xefd57bcd7cf7d7de
0xbe1bef6edd302bf6
0xdf70fffafffedded
0xed7f7effdce7fee2
0xff263befa7ff7fef
0x6bf5dfb77cb7b74e
0x4fff4fde1dbbffff
0x7fdd7b3bf08d7fdd
naive: 0.000005 improvied: 0.000002, ratio: 2.640046
===Thres 0.500000===
0x779d0ea094e810d8
0x70f5fdd8d8773263
0xdb3088c05ec73284
0xd577d9868d002f6e
0xcf1a104ca796ac5f
0xe5f2aaaafc8a7ea
0xf25f78b0106ddc2c
0xdf7befaa53054f76
0x4b2f9810a5c63c03
0xe075191e08e17748
naive: 0.000005 improvied: 0.000001, ratio: 4.247843
===Thres 0.750000===
0x81501048a400c0
0x881042d0c05622e8
0x5610801a1054340
0x8280405d88159749
0x810442260d0000e8
0x27031204e069a121
0x2004b6405007411a
0x82606a4605098001
0x320130b01080850
0x8944045016140505
naive: 0.000003 improvied: 0.000001, ratio: 4.338624
===Thres 1.000000===
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
naive: 0.000001 improvied: 0.000000, ratio: 11.253968
```
- When bitmapsize = 10000000
```csvpreview {header="true"}
Density, naive, improved, naive/improved Ratio
1, 1.796758, 0.487978, 3.682048
0.75, 1.615117, 0.436827, 3.697380
0.5, 1.480783, 0.341373, 4.337730
0.25, 1.363405, 0.186559, 7.308186
0, 1.220360, 0.023148, 52.719746
```
### 3. 思考進一步的改進空間;
### 4. 閱讀 Data Structures in the Linux Kernel 並舉出 Linux 核心使用 bitmap 的案例;
- bitmap 可以用在[priority array](https://www.informit.com/articles/article.aspx?p=101760&seqNum=2#:~:text=The%20Priority%20Arrays,this%20priority%20array.) in kernel/sched.c
# [測驗五](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-5)
以下是 LeetCode 166. [Fraction to Recurring Decimal](https://leetcode.com/problems/fraction-to-recurring-decimal/) 的可能實作:
```cpp=
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "list.h"
struct rem_node {
int key;
int index;
struct list_head link;
};
static int find(struct list_head *heads, int size, int key)
{
struct rem_node *node;
int hash = key % size;
list_for_each_entry (node, &heads[hash], link) {
if (key == node->key)
return node->index;
}
return -1;
}
char *fractionToDecimal(int numerator, int denominator)
{
int size = 1024;
char *result = malloc(size);
char *p = result;
if (denominator == 0) {
result[0] = '\0';
return result;
}
if (numerator == 0) {
result[0] = '0';
result[1] = '\0';
return result;
}
/* using long long type make sure there has no integer overflow */
long long n = numerator;
long long d = denominator;
/* deal with negtive cases */
if (n < 0)
n = -n;
if (d < 0)
d = -d;
bool sign = (float) numerator / denominator >= 0;
if (!sign)
*p++ = '-';
long long remainder = n % d;
long long division = n / d;
sprintf(p, "%ld", division > 0 ? (long) division : (long) -division);
if (remainder == 0)
return result;
p = result + strlen(result);
*p++ = '.';
/* Using a map to record all of reminders and their position.
* if the reminder appeared before, which means the repeated loop begin,
*/
char *decimal = malloc(size);
memset(decimal, 0, size);
char *q = decimal;
size = 1333;
struct list_head *heads = malloc(size * sizeof(*heads));
for (int i = 0; i < size; i++)
INIT_LIST_HEAD(&heads[i]);
for (int i = 0; remainder; i++) {
int pos = find(heads, size, remainder);
if (pos >= 0) {
while (PPP > 0)
*p++ = *decimal++;
*p++ = '(';
while (*decimal != '\0')
*p++ = *decimal++;
*p++ = ')';
*p = '\0';
return result;
}
struct rem_node *node = malloc(sizeof(*node));
node->key = remainder;
node->index = i;
MMM(&node->link, EEE);
*q++ = (remainder * 10) / d + '0';
remainder = (remainder * 10) % d;
}
strcpy(p, decimal);
return result;
```
PPP = `pos--` (表示式)
MMM = `list_add_tail` (巨集)
EEE = `heads[remainder % size]` (表示式)
## 1. 解釋上述程式碼運作原理,指出其中不足,並予以改進
:::info
例如,判斷負號只要寫作 bool isNegative = numerator < 0 ^ denominator < 0;
搭配研讀 [The simple math behind decimal-binary conversion algorithms](https://indepth.dev/posts/1019/the-simple-math-behind-decimal-binary-conversion-algorithms)
:::
- 利用linked list hash map去紀錄remainder。 當出現的remainder時,`find(heads, size, remainder)`會大於零。這時候就要找出開始循環的第一個index,pos前面的都不會是循環小數。利用`while (pos--)`的性質可以把前面的都先加到result中。
## 2. 在 Linux 核心原始程式碼的 mm/ 目錄 (即記憶體管理) 中,找到特定整數比值 (即 fraction) 和 bitwise 操作的程式碼案例,予以探討和解說其應用場景
- [linux/mm/page-writeback.c](https://github.com/torvalds/linux/blob/master/mm/page-writeback.c)
```cpp=
static ssize_t max_ratio_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
{
struct backing_dev_info *bdi = dev_get_drvdata(dev);
unsigned int ratio;
ssize_t ret;
ret = kstrtouint(buf, 10, &ratio);
if (ret < 0)
return ret;
ret = bdi_set_max_ratio(bdi, ratio);
if (!ret)
ret = count;
return ret;
}
```
```cpp=
/*
* setpoint - dirty 3
* f(dirty) := 1.0 + (----------------)
* limit - setpoint
*
* it's a 3rd order polynomial that subjects to
*
* (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
* (2) f(setpoint) = 1.0 => the balance point
* (3) f(limit) = 0 => the hard limit
* (4) df/dx <= 0 => negative feedback control
* (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
* => fast response on large errors; small oscillation near setpoint
*/
static long long pos_ratio_polynom(unsigned long setpoint,
unsigned long dirty,
unsigned long limit)
{
long long pos_ratio;
long x;
x = div64_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
(limit - setpoint) | 1);
pos_ratio = x;
pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
return clamp(pos_ratio, 0LL, 2LL << RATELIMIT_CALC_SHIFT);
}
```
# [測驗六](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-6)
__alignof__ 是 GNU extension,以下是其可能的實作方式:
```cpp=
/*
* ALIGNOF - get the alignment of a type
* @t: the type to test
*
* This returns a safe alignment for the given type.
*/
#define ALIGNOF(t) \
((char *)(&((struct { char c; t _h; } *)0)->M) - (char *)X)
```
請補完上述程式碼,使得功能與 [__alignof__](https://gcc.gnu.org/onlinedocs/gcc/Alignment.html) 等價。
請補完,使得程式碼符合預期,儘量以最簡短且符合一致排版的形式來撰寫。
作答區
M = ?
X = ?
((size_t)&((TYPE *)0)->MEMBER)
### 1. 解釋上述程式碼運作原理:
[參考隔壁同學寫得很清楚的解釋](https://hackmd.io/@sternacht09/SkzgUIcec#%E6%B8%AC%E9%A9%97-6)
ALIGNOF() 要回傳 t 的長度,也就是t的長度
- `(char *)(&((struct { char c; t _h; } *)0)->_h))`
- `(struct { char c; t _h; } *)0` 也就是將這個結構的指標放到記憶體 0 的位址。
- 因為 alignment 的緣故, _h 會根據型態 t 的不同,與 char c 的距離有所改變。而 _h 的記憶體起始位址減去 c的記憶體起始位址,就是 alignment 的距離。
- 前面的部份就是 `_h` 的位址
- 後面的部份就是 `char c` 的起始位址,本題題目將整個結構放在 0 的位址上, `c` 的起始位置在 `0`
- 所以`M=_h`然後`X=0`
### 2. 在 Linux 核心原始程式碼中找出 __alignof__ 的使用案例 2 則,並針對其場景進行解說
1. [/arch/arm/mm/mmu.c](https://github.com/torvalds/linux/blob/3bf7edc84a9eb4007dd9a0cb8878a7e1d5ec6a3b/arch/arm/mm/mmu.c#L992)
```cpp=
/*
* Create the architecture specific mappings
*/
void __init iotable_init(struct map_desc *io_desc, int nr)
{
struct map_desc *md;
struct vm_struct *vm;
struct static_vm *svm;
if (!nr)
return;
svm = memblock_alloc(sizeof(*svm) * nr, __alignof__(*svm));
if (!svm)
panic("%s: Failed to allocate %zu bytes align=0x%zx\n",
__func__, sizeof(*svm) * nr, __alignof__(*svm));
for (md = io_desc; nr; md++, nr--) {
create_mapping(md);
vm = &svm->vm;
vm->addr = (void *)(md->virtual & PAGE_MASK);
vm->size = PAGE_ALIGN(md->length + (md->virtual & ~PAGE_MASK));
vm->phys_addr = __pfn_to_phys(md->pfn);
vm->flags = VM_IOREMAP | VM_ARM_STATIC_MAPPING;
vm->flags |= VM_ARM_MTYPE(md->type);
vm->caller = io
_init;
add_static_vm_early(svm++);
}
}
```
- MMU(Memory Management Unit) 與內存管理相關,
延伸閱讀:[Linux 核心 Copy On Write 實作機制](https://hackmd.io/@linD026/Linux-kernel-COW-Copy-on-Write)
先來了解 `map_desc`,`vm_struct`和`static_vm`分別代表什麼
- static_vm 。在[mm.h](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/arch/arm/mm/mm.h#L68) 中是把vm_struct串成linked list。
```cpp=
struct static_vm {
struct vm_struct vm;
struct list_head list;
};
```
- vm_struct
```cpp=
struct vm_struct {
struct vm_struct *next;
void *addr;
unsigned long size;
unsigned long flags;
struct page **pages;
#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
unsigned int page_order;
#endif
unsigned int nr_pages;
phys_addr_t phys_addr;
const void *caller;
};
```
- #13 `svm = memblock_alloc(sizeof(*svm), __alignof__(*svm));` 中使用`memblock_alloc` 去allocate memory block.
```cpp=
phys_addr_t __init memblock_alloc(phys_addr_t size, phys_addr_t align)
{
return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
}
```
在一連串的function call之後會align這個值會傳到以下這個function
```cpp=
/*
* __memblock_find_range_bottom_up - find free area utility in bottom-up
* @start: start of candidate range
* @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
* @size: size of free area to find
* @align: alignment of free area to find
* @nid: nid of the free area to find, %NUMA_NO_NODE for any node
* @flags: pick from blocks based on memory attributes
*
* Utility called from memblock_find_in_range_node(), find free area bottom-up.
*
* RETURNS:
* Found address on success, 0 on failure.
*/
static phys_addr_t __init_memblock
__memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
phys_addr_t size, phys_addr_t align, int nid,
ulong flags)
{
phys_addr_t this_start, this_end, cand;
u64 i;
for_each_free_mem_range(i, nid, flags, &this_start, &this_end, NULL) {
this_start = clamp(this_start, start, end);
this_end = clamp(this_end, start, end);
cand = round_up(this_start, align);
if (cand < this_end && this_end - cand >= size)
return cand;
}
return 0;
}
```
然後會call `cand = round_up(this_start, align);`
```cpp=
/*
* This looks more complex than it should be. But we need to
* get the type for the ~ right in round_down (it needs to be
* as wide as the result!), and we want to evaluate the macro
* arguments just once each.
*/
#define __round_mask(x, y) ((__typeof__(x))((y)-1))
#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
```
根據[此篇](https://buaq.net/go-72700.html#:~:text=split_mem_range()%E6%A0%B9%E6%8D%AE%E4%BC%A0%E5%85%A5%E7%9A%84%E5%86%85%E5%AD%98%20start%20%E5%92%8C%20end%20%E5%81%9A%E5%9B%9B%E8%88%8D%E4%BA%94%E5%85%A5%E7%9A%84%E5%AF%B9%E9%BD%90%E6%93%8D%E4%BD%9C%EF%BC%88%E5%8D%B3%20round_up%20%E5%92%8C%20round_down%EF%BC%89)文章,得知round_up 宏依靠整数除法来完成这项工作,當兩個參數均为整數時,它才有效。x 是需要四捨五入的数字,y 是应该四捨五入的間隔,也就是说,round_up(12,5)應該返回 15,因为 15 是大於 12 的 5 的第一个间隔,而 round_down(12,5)應返回 10。
所以`__memblock_find_range_bottom_up`看起來是會找到可以塞進去的candidate memory location.
### 3. 在 Linux 核心源使程式碼找出 ALIGN, ALIGN_DOWN, ALIGN_UP 等巨集,探討其實作機制和用途,並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻,儘量共用相同的巨集
- ALIGN 定義在 [include/linux/kernel.h](https://elixir.bootlin.com/linux/v4.5/source/include/linux/kernel.h#:~:text=%23define%20ALIGN(x%2C%20a)%09%09__ALIGN_KERNEL((x)%2C%20(a)))
```cpp=
#define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
```
然後__ALIGN_KERNEL是在 [include/uapi/linux/kernel.h](https://elixir.bootlin.com/linux/v4.5/source/include/uapi/linux/kernel.h#L9),
```cpp=
/*
* 'kernel.h' contains some often-used function prototypes etc
*/
#define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
```
用[例子](https://stackoverflow.com/questions/13122846/align-macro-kernel#:~:text=Say%20you%20have%20a%20number%3A%200x1006)來看。假設你想要找`0x1006`跟4 byte alignment的情況。所以接近的memory是`0x1000`, `0x1004`, `0x1008`。在這個例子中,會選`0x1008`。在這邊alignment mask 對4來說是 `0x03` ( (typeof(x))(4) - 1 = 0x03),所以 0x1006 + 0x03 = 0x1009 然後 0x1009 & ~0x03 = 0x1008。
- ALIGN_DOWN 定義在 [arch/metag/kernel/stacktrace.c]((https://elixir.bootlin.com/linux/v4.5/source/arch/metag/kernel/stacktrace.c#:~:text=%23define%20ALIGN_DOWN(addr%2C%20size)%20%20((addr)%26(~((size)%2D1)))))
```cpp=
#define ALIGN_DOWN(addr, size) ((addr)&(~((size)-1)))
```
假設addr是`0x1006`,然後size 是4。`ALIGN_DOWN`會得到((0x1006)&(~((4)-1))) = ((0x1006)&(~(3))) = ((0x1006)&(0xFFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFC))= 0x1004。所以就會往下得到最接近的memory。
ALIGN_UP 其中一定義在 [tools/testing/selftests/net/tcp_mmap.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/tools/testing/selftests/net/tcp_mmap.c#L123)
```cpp=
#define ALIGN_UP(x, align_to) (((x) + ((align_to)-1)) & ~((align_to)-1))
```
相反的ALIGN_UP就會往上得到最近的值。
### [測驗七](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-7)
#### naive
```cpp=
#include <stdio.h>
int main() {
for (unsigned int i = 1; i < 100; i++) {
if (i % 3 == 0) printf("Fizz");
if (i % 5 == 0) printf("Buzz");
if (i % 15 == 0) printf("FizzBuzz");
if ((i % 3) && (i % 5)) printf("%u", i);
printf("\n");
}
return 0;
}
```
#### improved
```cpp=
static inline bool is_divisible(uint32_t n, uint64_t M)
{
return n * M <= M - 1;
}
static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;
int main(int argc, char **argv)
{
for (size_t i = 1; i <= 100; i++) {
uint8_t div3 = is_divisible(i, M3);
uint8_t div5 = is_divisible(i, M5);
unsigned int length = (2 << KK1) << KK2;
char fmt[9];
strncpy(fmt, &"FizzBuzz%u"[(9 >> div5) >> (KK3)], length);
fmt[length] = '\0';
printf(fmt, i);
printf("\n");
}
return 0;
}
```
#### 1. 解釋上述程式運作原理並評估 naive.c 和 bitwise.c 效能落差
- 避免 stream I/O 帶來的影響,可將 printf 更換為 sprintf
先分析在不同div3 和div5 的情況下,`FizzBuzz%u`的start和length分別必須是多少
```csvpreview {header="true"}
,length,start
3 的倍數 ,4, 0
5 的倍數, 4, 4
15 的倍數, 8, 0
都不是, 2, 8
```
這邊可以看出來,對於`length = (2 << KK1) << KK2`,KK1和KK2分別是`div3`和`div5`。因為div3=1和div5=1時代表是15倍數,所以要`<<2`。反之,若只有其中一個的倍數,則只要<<1。都不是的話就<<0。
再來考慮start=`9 >> div5) >> (KK3)`。
- 先找都不是兩者的倍數的情況,得知start需要是8。但光靠right shift,是不可能把9>>KK3變成8的。
- 再來考慮div3=1, div5=0的情況,這時候start必須是0。也就是說`(9 >> 0) >> (KK3)` 需要等於0。也就是KK3必須是 4 (9>>4 = 0)
- 接著是考慮div3=0, div5=1 的情況,這時候start必須是4,也就是說`(9 >> 1) >> (KK3)`必須是4。(8 >> KK3 = 4)。 KK3 = 1
- 最後則是div3=1, div5=1,這時候start是0,也就是KK3必須是4。
```csvpreview {header="true"}
,(9 >> div5), 9 >> (KK3),KK3
3 的倍數(div5=0) , 9, 0, 4
5 的倍數(div5=1), 4, 4, 0
15 的倍數(div5=1), 4, 0, 3
都不是(div5=0), 9, 8, ?
```
#### 2. 分析 [Faster remainders when the divisor is a constant: beating compilers and libdivide](https://hackmd.io/@sysprog/linux2022-quiz2#:~:text=%E5%88%86%E6%9E%90%20Faster%20remainders%20when%20the%20divisor%20is%20a%20constant%3A%20beating%20compilers%20and%20libdivide%20%E4%B8%80%E6%96%87%E7%9A%84%E6%83%B3%E6%B3%95%20(%E5%8F%AF%E5%8F%83%E7%85%A7%E5%90%8C%E4%B8%80%E7%AF%87%E7%B6%B2%E8%AA%8C%E4%B8%8B%E6%96%B9%E7%9A%84%E8%A9%95%E8%AB%96)%EF%BC%8C%E4%B8%A6%E8%A8%AD%E8%A8%88%E5%8F%A6%E4%B8%80%E7%A8%AE%20bitmask%EF%BC%8C%E5%A6%82%E3%80%8C%E5%8F%AF%E8%A2%AB%203) 一文的想法 (可參照同一篇網誌下方的評論),並設計另一種 bitmask,如「可被 3 整除則末位設為 1」「可被 5 整除則倒數第二位設定為 1」,然後再改寫 bitwise.c 程式碼,試圖運用更少的指令來實作出 branchless;
- 參照 [fastmod: A header file for fast 32-bit division remainders on 64-bit hardware](https://github.com/lemire/fastmod)
#### 3. 研讀 [The Fastest FizzBuzz Implementation](https://hackmd.io/@sysprog/linux2022-quiz2#:~:text=The%20Fastest%20FizzBuzz%20Implementation) 及其 [Hacker News](https://news.ycombinator.com/item?id=29413656) 討論,提出 throughput (吞吐量) 更高的 Fizzbuzz 實作
#### 4. 解析 Linux 核心原始程式碼 [kernel/time/timekeeping.c](https://github.com/torvalds/linux/blob/master/kernel/time/timekeeping.c) 裡頭涉及到除法運算的機制,探討其快速除法的實作 (注意: 你可能要對照研讀 kernel/time/ 目錄的標頭檔和程式碼)
過程中,你可能會發現可貢獻到 Linux 核心的空間,請充分討論