2022q1 Homework2 (quiz2)

# 2022q1 Homework2 (quiz2) contributed by < [cantfindagoodname](https://github.com/cantfindagoodname) > > [題目](https://hackmd.io/@sysprog/linux2022-quiz2) ## 實驗環境 ```shell $ cat /proc/version Linux version 5.13.0-30-generic (buildd@lcy02-amd64-003) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022 $ gcc --version gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 ``` ## 測驗1 When finding addition of two numbers, overflow may occur : ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a + b) / 2; } ``` Use subtraction instead, addition in this case won't have overflow : ```c #include <stdint.h> uint32_t average(uint32_t low, uint32_t high) { return low + (high - low) / 2; } ``` \begin{equation} \begin{split} low + high - low &= high \\ low + \frac{high - low}{2} &< high < max(uint32) \end{split} \end{equation} --- Rewrite with bit-wise operators : ```c #include <stdint.h> uint32_t average(uint32_t a, uint32_t b) { return (a >> 1) + (b >> 1) + (EXP1); } ``` `a >> 1` $+$ `b >> 1` would be equivalent to $\frac{a}{2}+\frac{b}{2}$, except the expression will never take the least significant bit into consideration at all. So we compute the first bit seperately, with the average being $f(x) = \lfloor \frac{a_0 + b_0}{2}\rfloor$, which coincidently is equal to `&` operation : |$a_0$|$b_0$|$f(x)$|$\&$| |:-:|:-:|:-:|:-:| |$0$|$0$|$0$|$0$| |$0$|$1$|$0$|$0$| |$1$|$0$|$0$|$0$| |$1$|$1$|$1$|$1$| We are only interested in the first bit, so we would do `res & 1` at the end. `EXP1` = `a & b & 1` --- Rewrite with bit-wise operators : ```c uint32_t average(uint32_t a, uint32_t b) { return (EXP2) + ((EXP3) >> 1); } ``` Taking the logic behind [half adder](https://en.wikipedia.org/wiki/Adder_(electronics)), we can solve this problem. |$x$|$y$|$s$|$c$| |:-:|:-:|:-:|:-:| |$0$|$0$|$0$|$0$| |$0$|$1$|$1$|$0$| |$1$|$0$|$1$|$0$| |$1$|$1$|$0$|$1$| Where $c$ is carry and $s$ is sum. Sum is equivalent to XOR operator `^` and carry is equilvalent to AND operator `&`. We know that sum of two value is `x + y` = `(x & y) << 1 + (x ^ y)` \begin{array}{cccc}&&&x\\&+&&y\\\hline&&c&s\\\hline\end{array} Divide the equation by $2$ by shifting right once and we can get average. `(a & b) + ((a ^ b) >> 1)` `EXP2` = `a & b` `EXP3` = `a ^ b` --- ### 延伸問題 #### 比較實作對應的組合語言輸出 Get Assembly codes. ```shell $ gcc -c -O3 main.c $ objdump -drCS -Mintel main.o ``` --- ``` 0000000000000000 <average_naive>: 0: f3 0f 1e fa endbr64 4: 8d 04 37 lea eax,[rdi+rsi*1] 7: d1 e8 shr eax,1 9: c3 ret a: 66 0f 1f 44 00 00 nop WORD PTR [rax+rax*1+0x0] ``` `lea` is load effective address. Operand2 of the instruction could compute address for `a + b * n`, then assign to Operand1. This instruction can be seen as an arithmetic operation but using address-calculation hardware. > [What's the purpose of the LEA instruction?](https://stackoverflow.com/questions/1658294/whats-the-purpose-of-the-lea-instruction) `shr` is shift logical right. Content of Operand1 is shifted right logically by number of bits specified by Operand2. Shifting right by 1 bit is essentially divide by 2. > [SAL/SAR/SHL/SHR - Shift](https://www.felixcloutier.com/x86/sal:sar:shl:shr) This block of code would do `lea`, then `shr`. --- ``` 0000000000000010 <average_noov>: 10: f3 0f 1e fa endbr64 14: 89 f0 mov eax,esi 16: 29 f8 sub eax,edi 18: d1 e8 shr eax,1 1a: 01 f8 add eax,edi 1c: c3 ret 1d: 0f 1f 00 nop DWORD PTR [rax] ``` `mov` is move. Content of Operand2 is copied to Operand1. > [MOV - Move](https://www.felixcloutier.com/x86/mov) `sub` is subtract. Content of Operand1 would subtract Operand2 then stores result in Operand1. > [SUB - Subtract](https://www.felixcloutier.com/x86/sub) `add` is addition. Content of Operand1 would add Operand2 then stores result in Operand1. > [ADD - Add](https://www.felixcloutier.com/x86/add) This block of code would do `mov`, `sub`, `shr`, then `add`. --- ``` 0000000000000020 <average_bitwise>: 20: f3 0f 1e fa endbr64 24: 89 f8 mov eax,edi 26: 89 f2 mov edx,esi 28: 21 f7 and edi,esi 2a: d1 e8 shr eax,1 2c: d1 ea shr edx,1 2e: 83 e7 01 and edi,0x1 31: 01 d0 add eax,edx 33: 01 f8 add eax,edi 35: c3 ret 36: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0] 3d: 00 00 00 ``` `and` is logical AND. Performs bitwise AND operation on Operand1 and Operand2 then stores result in Operand1. > [AND - Logical AND](https://www.felixcloutier.com/x86/and) This block of code would do 2 `mov`, 2 `and`, 2 `shr`, and 2 `add`. --- ``` 0000000000000040 <average_bitwise2>: 40: f3 0f 1e fa endbr64 44: 89 f8 mov eax,edi 46: 21 f7 and edi,esi 48: 31 f0 xor eax,esi 4a: d1 e8 shr eax,1 4c: 01 f8 add eax,edi 4e: c3 ret ``` `xor` is logical exclusive OR. Performs bitwise XOR operation on Operand1 and Operand2 then stores result in Operand1. > [XOR - Logical Exclusive OR](https://www.felixcloutier.com/x86/xor) This block of code would do `mov`, `and`, `xor`, `shr`, and `add`. #### Linux 核心原始程式碼 include/linux/average.h The main part of the program resides in `ewma_#name$$_add()`. ```c unsigned long internal = READ_ONCE(e->internal); unsigned long weight_rcp = ilog2(_weight_rcp); unsigned long precision = _precision; /* code for checking */ WRITE_ONCE(e->internal, internal ? \ (((internal << weight_rcp) - internal) + \ (val << precision)) >> weight_rcp : \ (val << precision)); ``` Which can be simplified to : ```c if (internal) e->internal = (((internal << weight_rcp) - internal) + (val << precision)) >> weight_rcp; else e->internal = val << precision; ``` Note that `weight_rcp` is assigned value `ilog2(_weight_rcp)` and `_weight_rcp` is always power of 2. So that `x << weight_rcp` would be equivalent to `x * _weight_rcp`. Using mathematics expression to represent the code : $internal_0 = val * 2^{precision}$ $internal_t = \dfrac{val * 2^{precision}}{{\_weight\_rcp}} + internal_{t-1} * (\_weight\_rcp - 1)$ Say, we denote the weight as $\alpha$ and `_weight_rcp` would be reciprocal of weight. $\alpha = \dfrac{1}{\_weight\_rcp}$ `val << precision` is what known as [binary scaling](https://en-academic.com/dic.nsf/enwiki/3724299), which is used to use integer arithmetic to perform floating floating point instruction, which is both faster and more accurate. When writing the value into `e->internal`, `val << precision` would be use as a pseudo floating point. After the arithmetic is finished, we would use `val >> precision` to access the value. ```c static inline unsigned long \ ewma_##name##_read(struct ewma_##name *e) \ { \ BUILD_BUG_ON(!__builtin_constant_p(_precision)); \ BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp)); \ BUILD_BUG_ON((_precision) > 30); \ BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp); \ return e->internal >> (_precision); \ } ``` We use : symbol $Y_t$ to represent the value at time $t$, symbol $S_t$ to represent interval at time $t$. The mathematic expression now looks like this : $S_0 = Y_0$ $S_t = \alpha Y_t + S_{t-1} * (\dfrac{1}{\alpha} - 1) = \alpha Y_t + (1 - \alpha )S_{t-1}$ Which is the [exponentially weighted moving average](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average). :::spoiler moving average The moving average is a value what is the expected value $E[X]$. Where $X$ is random variable that are consist of the current value and history of $X$, weighted by a constant number. Moving average is a [inifinite impulse response](https://en.wikipedia.org/wiki/Infinite_impulse_response). Where each history of the result would contribute to the current result by a certain weighting factor. The result would be an estimation converging to a value if the data follows a certain distribution. ::: There are few techniques in the program that make good use of features from C. Such as: - using preprocessor to generate codes. The program uses concatenation `##` symbol to generate code upon preprocessing. The code would be generated when preprocessing without specifiying a name for the code. - test value of constant at compile time The program uses macros to test value at compile time, such as : ```c BUILD_BUG_ON(!__builtin_constant_p(_precision)); ``` In which some bit-wise operation are used. ```c #define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)])) ``` The macro would do `!!(condition)` to make sure the value is either `0` or `1`. Which `likely()` from `compiler.h` also did. When the condition are `true`, the program would give a error message. as `char[-1]` is illegal. --- ##### 在 Linux 核心的應用 > [Elixir Cross Referencer](https://elixir.bootlin.com/linux/latest/A/ident/DECLARE_EWMA) The moving average function from what I saw would be used on some driver. For what I assume the reason to use is to have a more relevant average on programs that would run for a long time without halt. That, or they just wanted a average that take its history into consideration by a self-given weight configuration. --- For example : ###### Direct Rendering Manager > These averages will be used when calculating how long to delay before entering self refresh mode after activity. ###### Virtio net > RX packet size EWMA. The average packet size is used to determine the packet buffer size when refilling RX rings. As the entire RX ring may be refilled at once, the weight is chosen so that the EWMA will be insensitive to short-term, transient changes in packet size. --- ## 測驗2 Fill in `EXP4` and `EXP5` ```c #include <stdint.h> uint32_t max(uint32_t a, uint32_t b) { return a ^ ((EXP4) & -(EXP5)); } ``` --- Some properties of XOR `^` `x ^ y` would set bit where bit pattern of `x` and `y` differs. |$x$|$y$|$\land$| |:-:|:-:|:-:| |$0$|$0$|$0$| |$0$|$1$|$1$| |$1$|$0$|$1$| |$1$|$1$|$0$| "Difference" of "difference" would result in the complementing value. `x ^ (x ^ y)` = `y` `y ^ (x ^ y)` = `x` "Difference" with `0` would result in its original value, as the bit is set when the bit pattern are different. `x ^ 0` = `x` --- Some properties of AND `&` |$x$|$y$|$\&$| |:-:|:-:|:-:| |$0$|$0$|$0$| |$0$|$1$|$0$| |$1$|$0$|$0$| |$1$|$1$|$1$| `-1` has all its bit set, any arbitary value doing bitwise `&` with `-1` would result in itself. `0` has all its bit cleared, any arbitary value doing bitwise `&` with `0` would result in `0`. --- We intend to return the maximum value with the statement : ```c return a ^ ((EXP4) & -(EXP5)); ``` When `a` is larger, the function would return `a`. The function would return `b` otherwise. From the properties of `&` and `^`, we can derive the solution. As `EXP5` had a unary `-` operator before itself, `EXP5` would be `0` or `1`, which is the result of how C represents boolean value. And so, `EXP4` would be `a ^ b`. `a` would be switched to `b` when `b` is larger, `EXP5` would be `a < b`. `EXP4` = `a ^ b` `EXP5` = `a < b` --- ### 延伸問題 #### 在 Linux 核心原始程式碼中，找出更多類似的實作手法 Pretty much every expression with ternary operation with a certain structure `? :` could be replaced with a branchless expression. `var1 = cond ? var2 : var3` `var1 = var3 ^ ((var2 ^ var3) & -(cond))` > 請善用 git log 檢索 > [git log Source](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=branchless) --- [int2float](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=747f49ba67b8895a5831ab539de551b916f3738c) :::spoiler diff ```diff +uint32_t int2float(uint32_t x) { - u32 result, i, exponent, fraction; - - if ((input & 0x3fff) == 0) - result = 0; - else { - exponent = 140; - fraction = (input & 0x3fff) << 10; - for (i = 0; i < 14; i++) { - if (fraction & 0x800000) - break; - else { - fraction = fraction << 1; - exponent = exponent - 1; - } - } - result = exponent << 23 | (fraction & 0x7fffff); - } - return result; -} + uint32_t msb, exponent, fraction; + + if (!x) return 0; + + /* Get location of the most significant bit */ + msb = __fls(x); + fraction = ror32(x, (msb - I2F_FRAC_BITS) & 0x1f) & I2F_MASK; + exponent = (127 + msb) << I2F_FRAC_BITS; + + return fraction + exponent; +} ``` ::: > IEEE 754 source : [wikipedia](https://en.wikipedia.org/wiki/IEEE_754) > ![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Float_example.svg/590px-Float_example.svg.png) The original way implementation is by shifting the integer in a loop until the msb is set. ```c for (i = 0; i < 14; i++) { if (fraction & 0x800000) break; else { fraction = fraction << 1; exponent = exponent - 1; } } ``` Then set with bit-wise or `|` and bit mask. ```c result = exponent << 23 | (fraction & 0x7fffff); ``` The other implementation use find last set `__fls(x)` to get location of last set bit. Then is uses rotation to rotate the bits, which will insert bit shifted out to the other side. ```c msb = __fls(x); fraction = ror32(x, (msb - I2F_FRAC_BITS) & 0x1f) & I2F_MASK; exponent = (127 + msb) << I2F_FRAC_BITS; return fraction + exponent; ``` Definition of rotation : ```c static inline __u32 ror32(__u32 word, unsigned int shift) { return (word >> shift) | (word << (32 - shift)); } ``` Dumb down to 8 bits to see why rotation works : ``` d : destination m : left most set bit -------------------- 0 0 m 0 d 0 0 0 7-6-5-4-3-2-1-0 5 - 3 = 2 rotate right twice -------------------- 0 0 d 0 m 0 0 0 7-6-5-4-3-2-1-0 3 - 5 = -2 11111110 & 111 = 110 rotate right 6 times -------------------- ``` Behaviour of mod(bit-size) would cause rotate right and rotate left works the same. --- In [KVM : Add SMAP support](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.h?id=97ec8c067d322d32effdc1701760d3babbc5595f), :::spoiler diff ```diff #define PT_DIRECTORY_LEVEL 2 #define PT_PAGE_TABLE_LEVEL 1 -#define PFERR_PRESENT_MASK (1U << 0) -#define PFERR_WRITE_MASK (1U << 1) -#define PFERR_USER_MASK (1U << 2) -#define PFERR_RSVD_MASK (1U << 3) -#define PFERR_FETCH_MASK (1U << 4) +#define PFERR_PRESENT_BIT 0 +#define PFERR_WRITE_BIT 1 +#define PFERR_USER_BIT 2 +#define PFERR_RSVD_BIT 3 +#define PFERR_FETCH_BIT 4 + +#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT) +#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT) +#define PFERR_USER_MASK (1U << PFERR_USER_BIT) +#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT) +#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT) int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]); void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask); @@ -73,6 +79,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct); void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context); void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context, bool execonly); +void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, + bool ept); static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm) { @@ -110,10 +118,30 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu) * Will a fault with a given page-fault error code (pfec) cause a permission * fault with the given access (in ACC_* format)? */ -static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access, - unsigned pfec) +static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, + unsigned pte_access, unsigned pfec) { - return (mmu->permissions[pfec >> 1] >> pte_access) & 1; + int cpl = kvm_x86_ops->get_cpl(vcpu); + unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); + + /* + * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1. + * + * If CPL = 3, SMAP applies to all supervisor-mode data accesses + * (these are implicit supervisor accesses) regardless of the value + * of EFLAGS.AC. + * + * This computes (cpl < 3) && (rflags & X86_EFLAGS_AC), leaving + * the result in X86_EFLAGS_AC. We then insert it in place of + * the PFERR_RSVD_MASK bit; this bit will always be zero in pfec, + * but it will be one in index if SMAP checks are being overridden. + * It is important to keep this branchless. + */ + unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC); + int index = (pfec >> 1) + + (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)); + + return (mmu->permissions[index] >> pte_access) & 1; } void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm); ``` ::: The branchless segment is around the end of `mmu.h`. ```c unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC); ``` The code would need do a branch logically : > If CPL < 3, SMAP prevention are disabled if EFLAGS\\.AC = 1. > If CPL = 3, SMAP applies to all supervisor-mode data accesses (these are implicit supervisor accesses) regardless of the value of EFLAGS\\.AC. The code uses `cpl - 3` to assign `0` to `smap` with bit-wise AND `&` operator when smap is 3. Otherwise, smap become `(rflags & X86_EFLAGS_AC)` : - CPL could only be `0`, `1`, or `2`, after `cpl - 3`, only the 2 least significant bits is possible to be cleared. - rflags and X86_EFLAGS_AC, last 4 bits is always cleared. [Reference](https://en.wikipedia.org/wiki/FLAGS_register) Hence, the code would works properly. --- [xfrm: branchless addr4_match() on 64-bit](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c786bcb29dd684533ec165057f437d5bb34a4b2) :::spoiler diff ```diff static inline bool addr4_match(__be32 a1, __be32 a2, u8 prefixlen) { /* C99 6.5.7 (3): u32 << 32 is undefined behaviour */ - if (prefixlen == 0) + if (sizeof(long) == 4 && prefixlen == 0) return true; - return !((a1 ^ a2) & htonl(0xFFFFFFFFu << (32 - prefixlen))); + return !((a1 ^ a2) & htonl(~0UL << (32 - prefixlen))); } ``` ::: ```diff - return !((a1 ^ a2) & htonl(0xFFFFFFFFu << (32 - prefixlen))); + return !((a1 ^ a2) & htonl(~0UL << (32 - prefixlen))); ``` The original shift is undefined behavior, the compiler would generate more code as assembly to deal with the specified behavior. > [你所不知道的 C 語言: bitwise 操作](https://hackmd.io/@sysprog/c-bitwise) > - 左移超過變數長度，其結果未定義; > ```c > int i=0xFFFFFFFF; > i = i << 32; // 此結果未定義 > ``` The fix they did is use a 64-bit register to do the shift operation then truncate the value. :::spoiler When I test the program in my local machine though, there is no branch nor more generate code. ```c uint32_t fun1() { return (uint32_t)(~0UL << 32); } uint32_t fun2() { return (uint32_t)(~0U << 32); } ``` The compiler would generate warning for `fun2()` ```shell main.c:6:27: warning: left shift count >= width of type [-Wshift-count-overflow] ``` And the assembly code : ``` 0000000000000000 <fun1>: 0: f3 0f 1e fa endbr64 4: 55 push rbp 5: 48 89 e5 mov rbp,rsp 8: b8 00 00 00 00 mov eax,0x0 d: 5d pop rbp e: c3 ret 000000000000000f <fun2>: 0: f3 0f 1e fa endbr64 4: 55 push rbp 5: 48 89 e5 mov rbp,rsp 8: b8 00 00 00 00 mov eax,0x0 d: 5d pop rbp e: c3 ret ``` If the commit is not recent enough, this should means that the compiler is more powerful than before. ::: One point to note : `htonl()` is a function to swap byte order, its implementation would use `__b_swap_constant_64` from [byteswap.h](https://github.com/lattera/glibc/blob/master/bits/byteswap.h). Which its implementation is very similar to `bit_reverse` function that was mentioned in class. --- // TODO [CRC32](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=efdb25efc7645b326cd5eb82be5feeabe167c24e) [powerpc/ebpf32](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e7de0023e1232f42a10ef6af03352538cc27eaf6) ## 測驗3 We are given a function returning [greatest common divisor](https://en.wikipedia.org/wiki/Greatest_common_divisor) of two values. ```c #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; while (v) { uint64_t t = v; v = u % v; u = t; } return u; } ``` The function find the value using [Euclid's algorithm](https://en.wikipedia.org/wiki/Euclidean_algorithm). :::spoiler Euclid's Algorithm The algorithm can be summarized with few equations. \begin{align} r_0&=q_1r_1+r_2\tag{0}\\ r_1&=q_2r_2+r_3\tag{1}\\ ...\\ r_{n-2}&=q_{n-1}r_{n-1}+r_{n}\tag{2}\\ r_{n-1}&=q_{n}r_{n}\tag{3}\\ \end{align} Notice that each steps is just each remainders replacing operands to the left for the next equation. As an example, from $(0)$ and $(1)$, $r_1$ and $r_2$ in $(0)$ would replace $r_0$ and $r_1$ in $(1)$ respectively. Each equation is essentially a division. Taking $(1)$ as example : $r_0\ /\ r_1 = q_1\tag*{}$ $r_0\ \%\ r_1=r_2\tag*{}$ $r_n$ would be the greatest common divisor as no more remainder remains in the equation. We could substitute $r_n$ to $(2)$ which would result in : $r_{n-2} = (q_{n-1}q_{n+1}+1)r_n\tag*{}$ and keep substituting until we reach ${(0)}$ to get : $r_0=(...)r_n\tag*{}$ Which means $r_n$ is the greatest common divisor of $r_0$ and $r_1$ or else something else would shows up earlier. The function above is the direct translation of the equations. It uses a while loop to move through the equations step by step until the remainder is zero. ::: --- We are required to fill-in `COND` and `RET` in the function below with the same goal. ```c= #include <stdint.h> uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } while (!(u & 1)) u /= 2; do { while (!(v & 1)) v /= 2; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (COND); return RET; } ``` --- To function make use of serveral properties. Take $gcd(a,b)$ > $o_n$ would represent any arbitary odd number > $e_n$ would represent any arbitary even number **If both $a$ and $b$ are even** $a = e_1*e_2*\ldots*o_a*o_1*o_2*\ldots$ $b = e_1*e_2*\ldots*o_b*o_1*o_2*\ldots$ $gcd = e_1*e_2*...*e_n*r$ $e_1*e_2*...*e_n$ must be part of the common divisor. $e_1*e_2*...*e_n$ would first be removed by dividing by $2$ repeatedly. $e_1*e_2*...*e_n$ would be recovered at the end of function. > This description would be refered as $1)$ **If exactly one of $a$ or $b$ is even** $a = e_1*e_2*\ldots*o_a*o_1*o_2*\ldots$ $b = o_b*o_1*o_2*\ldots$ $e_1*e_2*...*e_n$ is never a part of the common divisor. $e_1*e_2*...*e_n$ is removed by dividing by $2$ repeatedly. As they does not contributes to the common divisor, they won't be recovered. > This description would be refered as $2)$. --- `x & 1` would checking if `x` is even. The least significant bit of whole numbers in base-2 is $2^0$, any number with $2^0$ bit clear would be an even number. `line 6-8` records the count of the loop in `shift` to recover the common divisor afterwards : ```c=6 for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } ``` `line 9-10` and `line 12-13` would do what $2)$ is describing : ```c=9 while (!(u & 1)) u /= 2; ``` ```c=12 while (!(v & 1)) v /= 2; ``` `line 14-15` would try to do the modulo `%` operation without actually using the `%` operator. Which is repeatedly subtract the dividend by divisor until the dividend is smaller than divisor. ```c=14 if (u < v) v -= u; ``` Due to the nature of the program, `line 14-15` would never actually complete the modulo `%` operation. `line 9-10` would guarentees `u` is odd number. ```c=9 while (!(u & 1)) u /= 2; ``` At `line 14-15` and `line 17-19`, `u` is always an odd number. ```c=14 if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } ``` Both `u` and `v` is odd numbers. The result of subtraction must be an even number. `line 12-13` would make the code more efficient by applying $2)$ ```c=12 while (!(v & 1)) v /= 2; ``` If $2)$ is not applied, the program would waste cycles by doing subtration repeatedly. `line 17-19` is actually the process of swapping two values, following rules of [Euclid's Algorithm](https://en.wikipedia.org/wiki/Euclidean_algorithm). ```c=17 uint64_t t = u - v; u = v; v = t; ``` ``` swap() t=a a=b b=t ``` However, before swapping, the program subtracts `u` by `v`, saving 1 iteration in the while loop. `v` is always the value being subtracted. Hence, the remainder ( that could be zero ) would be `v`, which would be the termintating condition of the loop. `COND` = `v` `v` is the terminating condition of the loop, which is zero, hence only `u` could be the result. Before assigning the result, the program would first recover the common divisor shifted from $1)$. `RET` = `u << shift` --- ### 延伸問題 #### 透過 __builtin_ctz 改寫 GCD :::spoiler `gcd64()` ```c uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift = __builtin_ctzll(u|v); u >>= shift; v >>= shift; u >>= __builtin_ctzll(u); do { v >>= __builtin_ctzll(v); if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (v); return u << shift; } uint64_t gcd64(uint64_t u, uint64_t v) { if (!u || !v) return u | v; int shift; for (shift = 0; !((u | v) & 1); shift++) { u /= 2, v /= 2; } while (!(u & 1)) u /= 2; do { while (!(v & 1)) v /= 2; if (u < v) { v -= u; } else { uint64_t t = u - v; u = v; v = t; } } while (v); return u << shift; } ``` ::: Replace divide by 2 loop with shift `__builtin_ctzll()` bit. ```diff - int shift; - for (shift = 0; !((u | v) & 1); shift++) { - u /= 2, v /= 2; - } - while (!(u & 1)) - u /= 2; + int shift = __builtin_ctzll(u|v); + u >>= shift; + v >>= shift; + u >>= __builtin_ctzll(u); do { - while (!(v & 1)) - v /= 2; + v >>= __builtin_ctzll(v); ``` Write a `main` function to run the gcd functions. ```c for (int i = 2000;i < 3000;++i) for (int j = 2000;j < 3000;++j) gcd64(i,j); ``` Result, we are interested in cycles reduced without the loop dividing 2 : Without `__builtin_ctz` ```shell $ # without __builtin_ctz $ perf stat -r 10 -e cycles,instructions ./gcd Performance counter stats for './gcd' (10 runs): 506,064,444 cycles ( +- 0.36% ) 269,162,907 instructions # 0.53 insn per cycle ( +- 0.24% ) 0.64403 +- 0.00915 seconds time elapsed ( +- 1.42% ) ``` With `__builtin_ctz` ```shell $ # with __builtin_ctz $ perf stat -r 10 -e cycles,instructions ./gcd2 Performance counter stats for './gcd2' (10 runs): 277,131,828 cycles ( +- 0.24% ) 151,498,335 instructions # 0.55 insn per cycle ( +- 0.01% ) 0.348384 +- 0.000891 seconds time elapsed ( +- 0.26% ) ``` There are about 1.8x improvement in performance. #### Linux 核心中內建 GCD ```c= unsigned long gcd(unsigned long a, unsigned long b) { unsigned long r = a | b; if (!a || !b) return r; b >>= __ffs(b); if (b == 1) return r & -r; for (;;) { a >>= __ffs(a); if (a == 1) return r & -r; if (a == b) return a << __ffs(r); if (a < b) swap(a, b); a -= b; } } ``` Some notes : `r & -r` would isolate the least significant bit. `__ffs` is find first set, starting from bit 0 ```c static __always_inline unsigned long __ffs(unsigned long word) { asm("rep; bsf %1,%0" : "=r" (word) : "rm" (word)); return word; } ``` > in `string.h`, `ffs` would start from 1 `line 5-6` : ```c if (!a || !b) return r; ``` return if any one of `a` or `b` is zero. `r` would be the non-zero variable as `x | 0` would be `x`. `line 8-10` : ```c b >>= __ffs(b); if (b == 1) return r & -r; ``` As stated before, any even number can be divided out from the variable. `b == 1` after means `b` is $2^k$, hence the smaller first set bit would be the gcd. `a > b` : $a = \sum_{i=k}\{0,1\}*2^i$, $b = 2^k$, `b` must be factor of `a`. `b > a` : $a = 2^k$, `a` must be factor of `b` $a = \sum_{i=0}\{0,1\}*2^i$, $b = 2^k$, first set bit of `a` would be factor of `a` and `b`. The for loop is similar to how test3 were implemented, combined with some of the explanation above. #### 核心內的應用場景 [factor polynomial using Berlekamp Trace algorithm](https://elixir.bootlin.com/linux/latest/source/lib/bch.c#L912) [... and more reference in Linux](https://elixir.bootlin.com/linux/latest/A/ident/gcd) Most of the time they would do. ```c g = gcd(a,b); a /= g; b /= g; ``` Which they find the **greatest common divisor** then divide it. Leaving two relative prime numbers. > $gcd(\dfrac{a}{gcd(a,b)},\dfrac{b}{gcd(a,b)})=1$ Though I had no idea what they are implementing. --- ## 測驗4 We are given a function that stores position of bits by setting to a array. ```c #include <stddef.h> size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; for (size_t k = 0; k < bitmapsize; ++k) { uint64_t bitset = bitmap[k]; size_t p = k * 64; for (int i = 0; i < 64; i++) { if ((bitset >> i) & 0x1) out[pos++] = p + i; } } return pos; } ``` Consider the following test program : ```c int main(){ size_t s; uint8_t out[64]; uint8_t bitmap[] = {119,166,77}; s = naive(bitmap, sizeof bitmap / sizeof *bitmap, out); for(size_t i = 0;i < s;++i) printf("%d ", out[i]); printf("\n"); return 0; } ``` The result is : `0 1 2 4 5 6 9 10 13 15 16 18 19 22 ` When we look at the binary pattern of the bit value : ```shell (gdb) x/3t bitmap 0x7fffffffdd5d: 01110111 10100110 01001101 ``` The function would stores the poisiton of bits in array `out`. \begin{split} &&&&\ _6&\ _5&\ _4&&\ _2&\ _1&\ _0\\ &199 &= &\ 0\ &\ 1\ &\ 1\ &\ 1\ &\ 0\ &\ 1\ &\ 1\ &\ 1_2\ \\\\ &&&\ _{15}&&\ _{13}&&&\ _{10}&\ _9&\\ &166 &= &\ 1\ &\ 0\ &\ 1\ &\ 0\ &\ 0\ &\ 1\ &\ 1\ &\ 0_2\ \\\\ &&&&\ _{22}&&&\ _{19}&\ _{18}&&\ _{16}\\ &77 &= &\ 0\ &\ 1\ &\ 0\ &\ 0\ &\ 1\ &\ 1\ &\ 0\ &\ 1_2\ \\\\ \end{split} The logic behind the function is simple : `(bitset >> i) & 0x1` The function test for every least most significant bit. The test is about an improved version of the function, where we have to fill in `EXP6`. ```c #include <stddef.h> size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) { size_t pos = 0; uint64_t bitset; for (size_t k = 0; k < bitmapsize; ++k) { bitset = bitmap[k]; while (bitset != 0) { uint64_t t = EXP6; int r = __builtin_ctzll(bitset); out[pos++] = k * 64 + r; bitset ^= t; } } return pos; } ``` The expression `bitset ^= t` would like to clear the last set bit of `bitset`, so `t` would be the lowest set bit of `bitset`. From the hint, we know that we would use the properties of `-bitwise`, where `-x` = `~x + 1`. `x & ~x` would result in `0`, as all their bit pattern complements each other. When we do `~x + 1`, the addition would set the right most clear bit, clearing all the consecutive set bit from the least significant bit in the process of addition. `EXP6` = `bitset & -bitset` >Reference [Low Level Bit Hacks](https://catonmat.net/low-level-bit-hacks) --- ### 延伸問題 #### 舉出程式碼用在哪些真實案例中 // TODO ## 測驗5 We are given a function that would print out result of division as decimal. Fill-in `PPP`, `MMM`, and `EEE` ```c #include <stdbool.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include "list.h" struct rem_node { int key; int index; struct list_head link; }; static int find(struct list_head *heads, int size, int key) { struct rem_node *node; int hash = key % size; list_for_each_entry (node, &heads[hash], link) { if (key == node->key) return node->index; } return -1; } char *fractionToDecimal(int numerator, int denominator) { int size = 1024; char *result = malloc(size); char *p = result; if (denominator == 0) { result[0] = '\0'; return result; } if (numerator == 0) { result[0] = '0'; result[1] = '\0'; return result; } /* using long long type make sure there has no integer overflow */ long long n = numerator; long long d = denominator; /* deal with negtive cases */ if (n < 0) n = -n; if (d < 0) d = -d; bool sign = (float) numerator / denominator >= 0; if (!sign) *p++ = '-'; long long remainder = n % d; long long division = n / d; sprintf(p, "%ld", division > 0 ? (long) division : (long) -division); if (remainder == 0) return result; p = result + strlen(result); *p++ = '.'; /* Using a map to record all of reminders and their position. * if the reminder appeared before, which means the repeated loop begin, */ char *decimal = malloc(size); memset(decimal, 0, size); char *q = decimal; size = 1333; struct list_head *heads = malloc(size * sizeof(*heads)); for (int i = 0; i < size; i++) INIT_LIST_HEAD(&heads[i]); for (int i = 0; remainder; i++) { int pos = find(heads, size, remainder); if (pos >= 0) { while (PPP > 0) *p++ = *decimal++; *p++ = '('; while (*decimal != '\0') *p++ = *decimal++; *p++ = ')'; *p = '\0'; return result; } struct rem_node *node = malloc(sizeof(*node)); node->key = remainder; node->index = i; MMM(&node->link, EEE); *q++ = (remainder * 10) / d + '0'; remainder = (remainder * 10) % d; } strcpy(p, decimal); return result; } ``` The function would return `""` if denominator is `0`, `"0"` if numerator is `0` : ```c if (denominator == 0) { result[0] = '\0'; return result; } if (numerator == 0) { result[0] = '0'; result[1] = '\0'; return result; } ``` Then the function would see if the result would be negative, then prepend result with `-` if it is : ```c bool sign = (float) numerator / denominator >= 0; if (!sign) *p++ = '-'; ``` The function would divide as if both numbers are positive : ```c if (n < 0) n = -n; if (d < 0) d = -d; ``` Then the function does the division : ```c long long remainder = n % d; long long division = n / d; ``` The whole number part is presented directly with `division`, however, the decimal part needs extra computation : ```c /* whole number part*/ sprintf(p, "%ld", division > 0 ? (long) division : (long) -division); if (remainder == 0) return result; p = result + strlen(result); *p++ = '.'; /* continue with remainder part */ ``` `p` would point after the decimal, waiting for the remainder part. ![](https://i.imgur.com/T6TFeU8.jpg) The remainder part of the program is computed with concept of long division. ```c for (int i = 0; remainder; i++) { int pos = find(heads, size, remainder); if (pos >= 0) { while (PPP > 0) *p++ = *decimal++; *p++ = '('; while (*decimal != '\0') *p++ = *decimal++; *p++ = ')'; *p = '\0'; return result; } struct rem_node *node = malloc(sizeof(*node)); node->key = remainder; node->index = i; MMM(&node->link, EEE); *q++ = (remainder * 10) / d + '0'; remainder = (remainder * 10) % d; } strcpy(p, decimal); return result; ``` Example of long division $19\div12$ : \begin{array}{} &&&1.&5&8&3&3\\\hline12&\vert&1&9\\&&1&2\\\hline&&&7&0\\&&&6&0\\\hline&&&1&0&0\\&&&&9&6\\\hline&&&&&4&0\\&&&&&3&6\\\hline&&&&&&4&0\\&&&&&&3&6\\\hline&&&&&&&4\\ \end{array} Notice that each remainder after would shift by one digit when we are filling in each number after the decimal point : ```c // 19 % 12 = 7 d = ((19 % 12) * 10) / 12; // 5 r = ((19 % 12) * 10) % 12; // 10 ``` In the actual code, it is as below : ```c *q++ = (remainder * 10) / d + '0'; remainder = (remainder * 10) % d; ``` Notice that when the remainder is a non-unique number, that means the division would start repeating itself. ```c // 100 % 96 = 4 r1 = ((100 % 96) * 10) % 12; // r1 = 4 // 40 % 36 = 4 r2 = ((40 % 36) * 10) % 12; // r2 = 4 // r1 == r2, the division would repeat itself ``` The function implements this using a hash table data structure. It stores the remainder into the data structure, then check if the remainder is unique. ```c for (int i = 0; remainder; i++) { int pos = find(heads, size, remainder); if (pos >= 0) { // remainder found; not unique /* ... */ return result; } /* ... */ node->key = remainder; /* add to data structure */ list_add(&node->link, &heads[remainder % size]); /* do long division */ *q++ = (remainder * 10) / d + '0'; remainder = (remainder * 10) % d; } ``` The function would then copy the computed string value to `result`. Without repeating remainder : ```c strcpy(p, decimal); ``` With repeating remainder, the function would first put non-repeating digit into the result : ```c while (pos-- > 0) *p++ = *decimal++; ``` |Before|After| |:--:|:--:| ![](https://i.imgur.com/T5CEttB.jpg)|![](https://i.imgur.com/LjhGcxe.jpg) Then put the repeating digit into the result, with parenthesis : ```c *p++ = '('; while (*decimal != '\0') *p++ = *decimal++; *p++ = ')'; *p = '\0'; return result; ``` Result : ![](https://i.imgur.com/74ZbhhQ.jpg) `PPP` = `pos--` `MMM` = `list_add` `EEE` = `&heads[remainder % size]` ### 延伸問題 #### 指出程式碼不足，並予以改進 hash uses `%` operator to `size`, which is very expensive. `size` is a constant which is 1333, which I am unsure of the reason. If there is no specific reason for `size` to be 1333, we can change `size` to a value that is power of 2, say $2^k$. The remainder could be `n & (1 << k - 1)` `division > 0 ? (long) division : (long) -division` could be implemented without branch. Assuming LP64 data model, `((division >> 63) ^ division) + (division < 0)` As the question states, `(float)numerator / denominator >= 0` could be changed to : `(numerator < 0) ^ (denominator < 0)` //TODO ## 測驗6 Fill-in `M` and `X` ```c #define ALIGNOF(t) \ ((char *)(&((struct { char c; t _h; } *)0)->M) - (char *)X) ``` :::spoiler To learn how to read the marco, I seperate the macro into few components. First off is what I suppose is an anonymous structure. ```c struct{ char c; t_h;}; ``` Following [ISO/IEC 9899:201x](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf) 6.7.2.1:13 [ Structure and union specifiers ] > An unnamed member of structure type with no tag is called an anonymous structure; an unnamed member of union type with no tag is called an anonymous union. The members of an anonymous structure or union are considered to be members of the containing structure or union. This applies recursively if the containing structure or union is also anonymous. [ISO/IEC 9899:1999](http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf) though, does not have anonymous structure defined. However, 6.7.2.2:6/112 notes as follow : > If there is no identifier , the type can, within the translation unit, only be referred to by the declaration of which it is a part. Of course, when the declaration is of a typedef name, subsequent declarations can mak e use of that typedef name to declare objects having the specified structure, union, or enumerated type. Which means `((struct { char c; t _h; } *)0)` is a type that can only be reffered from within the macro. Next off is null pointer constant. ```c ((struct { char c; t _h; } *)0) ``` The program would later do something on memory address `0x00`, but is completely legal. As defined in [ISO/IEC 9899:1999](http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf) 6.3.2.3:3 [Pointers] > An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant) If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function. and in 6.3.2.3:3/55 : > The macro NULL is defined in <stddef.h> (and other headers) as a null pointer constant; see 7.17. Then, in 7.17:3 [ Common definitions <stddef.h> ] > The macros are `NULL` which expands to an implementation-defined null pointer constant; and `offsetof(type, member-designator)` which expands to an integer constant expression that has type size_t, the value of which is the offset in bytes, to the structure member (designated by member-designator), from the beginning of its structure (designated by type). The type and member designator shall be such that given `static type t;` then the expression &(t.member-designator) evaluates to an address constant. (If the specified member is a bit-field, the behavior is undefined.) We know that the expression `&(t.member-designator)` (`&(t*->member-designator)`) evaluates to address constant, which would be the offset in bytes to struture member to beginning of structure. Following the path `/usr/lib/gcc/x86_64-linux/gnu/9/include` in my local machine I found : ```c #define offsetof(TYPE, MEMBER) __builtin_offsetof (TYPE, MEMBER) ``` Which shows that `offsetof()` would be implementation defined. If we go to [include/linux/stddef.h](https://github.com/torvalds/linux/blob/master/include/linux/stddef.h), we could find : ```c #undef offsetof #ifdef __compiler_offsetof #define offsetof(TYPE, MEMBER) __compiler_offsetof(TYPE, MEMBER) #else #define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER) #endif ``` We can see that `((size_t)&((TYPE *)0)->MEMBER)` is very similar to the `ALIGNOF()` macro. `ALIGNOF()` would do subtract by `(char *)0`. As stated before, null constant pointer would be implementation defined, so in case of the pointer not pointing to address `0x00`, the macro would try to subtract it by `(char *)0`, the base of address. The address of member would be `0 + sizeof(padding + char)`, and the padding would be, according to [System V Application Binary Interface](https://www.uclibc.org/docs/psABI-x86_64.pdf) > Structure and union objects can require padding to meet size and alignment constraints. ::: In [你所不知道的C語言：指標篇 / 延伸閱讀: C99 的 offsetof macro](http://blog.linux.org.tw/~jserv/archives/001399.html), which it would showcase [Learn a new trick with the offsetof() macro](https://www.embedded.com/learn-a-new-trick-with-the-offsetof-macro/), there was some implementation of `offsetof()` under various compilers. Under that, there is one that look almost identical to `ALIGNOF()` macro. ```c // Diab Coldfire compiler #define offsetof(s,memb) ((size_t)((char *)&((s *)0)->memb-(char *)0)) // Test6 macro #define ALIGNOF(t) \ ((char *)(&((struct { char c; t _h; } *)0)->_h)-(char *)0) ``` There are also some description about the macro, which describe them very well. > 1. ((s *)0) takes the integer zero and casts it as a pointer to s . > 2. ((s *)0)->m dereferences that pointer to point to structure member m . > 3. &(((s *)0)->m) computes the address of m . > 4. (size_t)&(((s *)0)->m) casts the result to an appropriate data type. `M` : `_h` `X` : `0` ### 延伸問題 #### __alignof__ 的使用案例 [Bootlin linux](https://elixir.bootlin.com/linux/latest/A/ident/__alignof__) ## 測驗7 First, look at the naive approach. ```c #include <stdio.h> int main() { for (unsigned int i = 1; i < 100; i++) { if (i % 3 == 0) printf("Fizz"); if (i % 5 == 0) printf("Buzz"); if (i % 15 == 0) printf("FizzBuzz"); if ((i % 3) && (i % 5)) printf("%u", i); printf("\n"); } return 0; } ``` The program would : print out `Fizz` when `i` is multiple of 3, print out `Buzz` when `i` is multiple of 5, print out `FizzBuzzFizzBuzz` when `i` is multiple of 15 (3 and 5), print out `i` when `i` is neither of the above statement. > Notes > To follow [Fizz buzz](https://en.wikipedia.org/wiki/Fizz_buzz) definition, `else if` should be used instead. Then, we are given a program using some tricks, and are required to fill in `KK1`, `KK2`, and `KK3` ```c= static inline bool is_divisible(uint32_t n, uint64_t M) { return n * M <= M - 1; } static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1; static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1; int main(int argc, char **argv) { for (size_t i = 1; i <= 100; i++) { uint8_t div3 = is_divisible(i, M3); uint8_t div5 = is_divisible(i, M5); unsigned int length = (2 << KK1) << KK2; char fmt[9]; strncpy(fmt, &"FizzBuzz%u"[(9 >> div5) >> (KK3)], length); fmt[length] = '\0'; printf(fmt, i); printf("\n"); } return 0; } ``` From the [description](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-7), >```c > string literal: "fizzbuzz%u" > offset: 0 4 8 >``` We can see that the starting position should be either `0`, `4`, or `8`. However, in `line 17`, ```c strncpy(fmt, &"FizzBuzz%u"[(9 >> div5) >> (KK3)], length); ``` For `(9 >> div5)`, The character `%` of `%u` should be at index 8 of the array. As `KK3` is enclosed inside `(` `)` brackets, and is right operand of `>>` operator. There is no possible bit-wise operations for `KK3` to make the value `(9 >> div5) >> (KK3)` become 8. :::spoiler `is_divisible()` moved to [延伸問題](https://hackmd.io/@cantfindagoodname/linux2022-quiz2#%E5%BB%B6%E4%BC%B8%E5%95%8F%E9%A1%8C6) ::: div3 and div5 would be assigned value from `is_divisible()`, which returns a bool value `0` and `1`. |div3|div5|length| |:--:|:--:|:--:| |$0$|$0$|2| |$0$|$1$|4| |$1$|$0$|4| |$1$|$1$|8| We assign a value to length at `line 14` : ```c unsigned int length = (2 << KK1) << KK2; ``` We could easily see that the value `2` would shift once if either `div3` or `div5` is set, shift twice when both `div3` and `div5` are set. `KK1` : `(div3 || div5)` `KK2` : `(div3 && div5)` Which they would be equivalent to shifting 2 by `div3` bit once, then shift `div5` bit once. And so, `KK1` : `div3` `KK2` : `div5` Note that `KK1` and `KK2` are interchangeable. Suppose that `line 17` is modified by a little : ```c strncpy(fmt, &"FizzBuzz%u"[(8 >> div5) >> (KK3)], length); ``` From the [description](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-7), >```c > string literal: "fizzbuzz%u" > offset: 0 4 8 >``` Give `(8 >> div5) >> (KK3)` an alias `start` |div3|div5|8 >> div5|start| |:--:|:--:|:--:|:--:| |$0$|$0$|8|8| |$0$|$1$|4|4| |$1$|$0$|8|0| |$1$|$1$|4|0| `KK3` = `0` when `div3` is clear. `KK3` > `3` when `div3` is `1`. `KK3` : `div3 << 2`, or `KK3` : `4 & -div3` ### 延伸問題 [Faster remainders when the divisor is a constant: beating compilers and libdivide](https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/)