2024q1 Homework4 (quiz3+4)

--- tags: linux2024 --- # 2024q1 Homework4 (quiz3+4) contributed by <[Howard Chen](https://github.com/howardjchen/linux2024-hw2)> ## Week 3 ### 測驗 1 #### Modify using `fls` > ffs, ffsl, ffsll, fls, flsl, flsll -- find first or last bit set in a bit string From the source code of [Linux kernel 6.8: include/asm-generic/bitops/fls.h](https://elixir.bootlin.com/linux/latest/source/include/asm-generic/bitops/fls.h#L13) ```c /** * fls - find last (most-significant) bit set * @x: the word to search * * This is defined the same way as ffs. * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32. */ ``` For example, `x = 0x20000000`, which means x[29]= 1. Since `fls` will return the position of msb bit, to express x using `fls`, we could use `1UL << (fls(x) - 1)`. Therefore, `__builtin_clz` could be replaced with `fls` as below, ```c int m = 1UL << ((fls(x) - 1) & ~1UL); ``` #### Application in Linux kernel From linux kernel 5.19, we could find `int_sqrt` source code in [lib/math/int_sqrt.c](https://elixir.bootlin.com/linux/v5.19.17/source/lib/math/int_sqrt.c#L20) In which use the same `fls` function to mentioned above. For application use case, under [/drivers/video/fbdev/core/fbmon.c](https://elixir.bootlin.com/linux/v5.19.17/source/drivers/video/fbdev/core/fbmon.c#L1115), it use `int_sqrt` to calculate the `h_period` ```c /** * fb_get_hblank_by_dclk - get horizontal blank time given pixelclock * @dclk: pixelclock in Hz * @xres: horizontal resolution in pixels * * DESCRIPTION: * * xres * duty_cycle * hblank = ------------------ * 100 - duty_cycle * * duty cycle = percent of htotal assigned to inactive display * duty cycle = C - (M * h_period) * * where: h_period = SQRT(100 - C + (0.4 * xres * M)/dclk) + C - 100 * ----------------------------------------------- * 2 * M * M = 300; * C = 30; */ static u32 fb_get_hblank_by_dclk(u32 dclk, u32 xres) { u32 duty_cycle, h_period, hblank; dclk /= 1000; h_period = 100 - C_VAL; h_period *= h_period; h_period += (M_VAL * xres * 2 * 1000)/(5 * dclk); h_period *= 10000; h_period = int_sqrt(h_period); h_period -= (100 - C_VAL) * 100; h_period *= 1000; h_period /= 2 * M_VAL; duty_cycle = C_VAL * 1000 - (M_VAL * h_period)/100; hblank = (xres * duty_cycle)/(100000 - duty_cycle) + 8; hblank &= ~15; return (hblank); } ``` ### 測驗2 :::warning Q:針對正整數在相鄰敘述進行 mod 10 和 div 10 操作，如何減少運算成本？ --> "相鄰敘述" 是什麼意思？ ::: ```c #include <stdint.h> void divmod_10(uint32_t in, uint32_t *div, uint32_t *mod) { uint32_t x = (in | 1) - (in >> 2); /* div = in/10 ==> div = 0.75*in/8 */ uint32_t q = (x >> 4) + x; x = q; q = (q >> 8) + x; q = (q >> 8) + x; q = (q >> 8) + x; q = (q >> 8) + x; *div = (q >> 3); *mod = in - ((q & ~0x7) + (*div << 1)); } ``` :::info TODO: 練習撰寫不依賴任何除法指令的 % 9 (modulo 9) 和 % 5 (modulo 5) 程式碼，並證明在 uint32_t 的有效範圍可達到等效。 ::: ### 測驗 3 The funciton `ilog2` stands for "integer logarithm base 2". It calculates the logarithm of a given integer with base 2. Actually the it's just finding the most significant bit of the given value `i` ```c static size_t ilog2_2(size_t i) { size_t result = 0; while (i >= 0xffff) { result += 16; i >>= 16; } while (i >= 0xff) { result += 8; i >>= 8; } while (i >= 0xf) { result += 4; i >>= 4; } while (i >= 2) { result += 1; i >>= 1; } return result; } ``` Therefore, we could use `__builtin_clz` to to find `msb` then subtract by 31. Assuming the given value was 32-bit. **Note that gcc `__builtin_clz` built-in function leads to the undefined result when input is 0, we should at least give it a number. So we choose `v | 1` as input.** **[Commimt: use Protection for clz()](https://github.com/howardjchen/linux2024-hw3-4/commit/f7dbe1f7ba4f64edfb0fcc8ce408560518bdd398)** ```diff int ilog32(uint32_t v) { printf("[%s]: %d\n", __func__, __builtin_clz(v)); - return (31 - __builtin_clz(v)); + return (31 - __builtin_clz(v | 1)); } ``` #### ilog2 in Lnux kernel The `ilog2` macro calculates the logarithm base 2 of a 32-bit or 64-bit unsigned integer value `n`. - It checks if the input `n` is a constant using `__builtin_constant_p(n)`. For 64 bit value, it calculates the logarithm using `63 - __builtin_clzll(n)`, which counts the number of leading zeros in a 64-bit integer. However, if `n` is not a constant or larger than 2, it selects an appropriately sized optimized version depending on the size of `n`. - If the size of `n` is 4 bytes or less, it calls the function `__ilog2_u32(n)`. - If the size of `n` is greater than 4 bytes, it calls the function `__ilog2_u64(n)`. And let's check deeply into `__ilog2_u32` and `__ilog2_u64`. The both function simply calls the `fls` and `fls64` (find last set) function and subtracts 1 from the result. The `fls` function counts the number of leading zeros before the first set bit and returns the position of the highest set bit. ```c /** * ilog2 - log base 2 of 32-bit or a 64-bit unsigned value * @n: parameter * * constant-capable log of base 2 calculation * - this can be used to initialise global variables from constant data, hence * the massive ternary operator construction * * selects the appropriately-sized optimised version depending on sizeof(n) */ #define ilog2(n) \ ( \ __builtin_constant_p(n) ? \ ((n) < 2 ? 0 : \ 63 - __builtin_clzll(n)) : \ (sizeof(n) <= 4) ? \ __ilog2_u32(n) : \ __ilog2_u64(n) \ ) /* * non-constant log of base 2 calculators * - the arch may override these in asm/bitops.h if they can be implemented * more efficiently than using fls() and fls64() * - the arch is not required to handle n==0 if implementing the fallback */ #ifndef CONFIG_ARCH_HAS_ILOG2_U32 static inline __attribute__((const)) int __ilog2_u32(u32 n) { return fls(n) - 1; } #endif #ifndef CONFIG_ARCH_HAS_ILOG2_U64 static inline __attribute__((const)) int __ilog2_u64(u64 n) { return fls64(n) - 1; } #endif ``` Regarding the application of `ilog2` in linux kernel, we took [pcie-rockchip-ep.c](https://elixir.bootlin.com/linux/v5.15.152/source/drivers/pci/controller) for example. It's a PCIe endpoint driver developed by rockchip. During PCIe enumeration, host will check the EP device BAR size and allocation corresponding memory address and MMIO space for the device. And how the host determine the device BAR size is by checking it's PCIe configuration space on BAR register. According to PCIe base spec, BAR size was describe as using `msb` in the BAR register to determine the MMIO size that the device want. Therefore, we could see the rockchip EP driver using `ilog2` to check the device BAR size after reading it's PCIe configuration space. ```c static int rockchip_pcie_ep_set_bar(struct pci_epc *epc, u8 fn, u8 vfn, struct pci_epf_bar *epf_bar) { struct rockchip_pcie_ep *ep = epc_get_drvdata(epc); struct rockchip_pcie *rockchip = &ep->rockchip; dma_addr_t bar_phys = epf_bar->phys_addr; enum pci_barno bar = epf_bar->barno; int flags = epf_bar->flags; u32 addr0, addr1, reg, cfg, b, aperture, ctrl; u64 sz; /* BAR size is 2^(aperture + 7) */ sz = max_t(size_t, epf_bar->size, MIN_EP_APERTURE); /* * roundup_pow_of_two() returns an unsigned long, which is not suited * for 64bit values. */ sz = 1ULL << fls64(sz - 1); aperture = ilog2(sz) - 7; /* 128B -> 0, 256B -> 1, 512B -> 2, ... */ ``` ### 測驗 4 The `weight` and `factor` parameters are required to be powers of 2 because this enables simplification of multiplication operations to left shifts, which are more efficient. In the `ewma_init` function, the `ilog2` function is used to calculate the base-2 logarithms of `weight` and `factor`, ensuring that multiplication operations can be replaced by left shifts. This optimization enhances the performance of the code. ```c void ewma_init(struct ewma *avg, unsigned long factor, unsigned long weight) { if (!is_power_of_2(weight) || !is_power_of_2(factor)) assert(0 && "weight and factor have to be a power of two!"); avg->weight = ilog2(weight); avg->factor = ilog2(factor); avg->internal = 0; } struct ewma *ewma_add(struct ewma *avg, unsigned long val) { avg->internal = avg->internal ? (((avg->internal << avg->weight) - avg->internal) + (val << avg->factor)) >> avg->weight : (val << avg->factor); return avg; } ``` #### EWMA in Linux kernel We could found that in the [/kernel/sched/fair.c](https://elixir.bootlin.com/linux/v5.15.153/source/kernel/sched/fair.c#L4037), the scheduling algorithm use EWMA to update task's estimation utilization. ```c /* * Update Task's estimated utilization * * When *p completes an activation we can consolidate another sample * of the task size. This is done by storing the current PELT value * as ue.enqueued and by using this value to update the Exponential * Weighted Moving Average (EWMA): * * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1) * = w * task_util(p) + ewma(t-1) - w * ewma(t-1) * = w * (task_util(p) - ewma(t-1)) + ewma(t-1) * = w * ( last_ewma_diff ) + ewma(t-1) * = w * (last_ewma_diff + ewma(t-1) / w) * * Where 'w' is the weight of new samples, which is configured to be * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT) */ ue.ewma <<= UTIL_EST_WEIGHT_SHIFT; ue.ewma += last_ewma_diff; ue.ewma >>= UTIL_EST_WEIGHT_SHIFT; ``` ### 測驗 5 The variable `r` is used to record the position of the first set bit of `x` from the most significant bit (MSB), which is essentially the same as the value recorded by `ilog2()`. However, `ilog2()` calculates the floor of log2. Since both `r` and `shift` are powers of 2, adding oprtation can be performed using bitwise OR (|). Thus, `r | shift` is equivalent to `r + shift`. Finally, when the condition `x > 1` is necessary because when checking `x > 0x3`, if `x` equals `0x2` or `0x3`, `shift` will still be `0`, leading to an undercount of 1. Hence, the additional check `x > 1` is required. The initial decrement of `x` with `x--` handles the case when `x` is exactly a power of 2. ==**[issue] However, this function only valid for `x` is power of 2. If `x` is not power of 2, there will be wrong valure returned. e.g. 0x0f456780**== ```c int ceil_ilog2(uint32_t x) { uint32_t r, shift; x--; r = (x > 0xFFFF) << 4; x >>= r; shift = (x > 0xFF) << 3; x >>= shift; r |= shift; shift = (x > 0xF) << 2; x >>= shift; r |= shift; shift = (x > 0x3) << 1; x >>= shift; return (r | shift | x > 1) + 1; } ``` ### when x = 0 Refer to [Vax-r](https://hackmd.io/@vax-r/linux2024-homework4#%E6%B8%AC%E9%A9%97%E4%BA%94), we could use `!` NOT to check if `x` equals to 0. In C, the `!` operator is the logical NOT operator. It is a unary operator that takes a single operand and returns 0 if the operand evaluates to a nonzero value, and 1 if the operand evaluates to 0. ```diff int ceil_ilog2(uint32_t x) { uint32_t r, shift; - x--; + x -= !!x; ``` ## Ｗeek 4 ### 測驗 1 : Enhance the hamming distance method to O(n) - Bit 0 observation: - All bits are 1 -> Hamming Distance = 0 - Bit 1 observation: - 1 occurrence, the rest are 0 - Hamming Distance = (1) * (numsize - 1) - Bit 2 observation: - 2 occurrences of 1, the rest are 0 - Hamming Distance = (2) * (numsize - 2) So firstly, we create a `bit_array` to calculate how many 1'b in each `nums` in bit[0:31] and store to each `bit_array[i]` Secondly, for each `bit_array[i]`, we use the methology to calculate the total hamming distance **Commit: [Enhance the hamming distance algorithm](https://github.com/howardjchen/linux2024-hw3-4/commit/d880ce6cc2a16466d90dcb1dde74253283991497)** #### Original hamming distance:$O(n^2)$ ![image](https://hackmd.io/_uploads/Hyj_E1dlR.png) #### ==Enhanced version: $O(n)$== ![image](https://hackmd.io/_uploads/rkOw4J_lR.png) ### 測驗 2 Refering to [SHChang-Anderson](https://hackmd.io/ahBQ35SOQyu1CmzVljikLQ?both), We could have better understanding on the `move_mask` ```c /* Enhance tic-tac-toe game performance through a strategic approach. * Rather than exclusively focusing on achieving three consecutive marks on a * 3x3 board, reimagine the game as aiming for three in a row across any of the * eight possible lines. In this variation, a single move can influence multiple * lines simultaneously. */ static const uint32_t move_masks[9] = { 0x40040040, 0x20004000, 0x10000404, 0x04020000, 0x02002022, 0x01000200, 0x00410001, 0x00201000, 0x00100110, }; ``` Each element in the `move_masks` array represents the impact on all potential line configurations after placing a mark at a specific position. The binary representation of each element describes how the line configurations change after placing a mark at that position. **Consider `0x20004000` is placing mark on at position 1:** ||1|2|3| |-|-|-|-| |**1**|0|==1==|2| |**2**|3|4|5| |**3**|6|7|8| There will be two possibile lines to win: #### First winning line: Reflect to `board` in binaray expression as shown： 0**1**11 0000 0000 0000 0000 0000 0000 0000 ||1|2|3| |-|-|-|-| |**1**|==**0**==|==1==|==2==| |**2**|3|4|5| |**3**|6|7|8| #### Second winning line: Reflect to `board` in binaray expression as shown： 0000 0000 0000 0000 0**1**11 0000 0000 0000 ||1|2|3| |-|-|-|-| |**1**|0|==**1**==|2| |**2**|3|==4==|5| |**3**|6|==7==|8| `0x20004000`: 00**1**0 0000 0000 0000 0**1**00 0000 0000 0000 The code $or$ the `player_board` and `move_mask` as shown `board |= move_masks[move];` The victory condition is determined by detecting the occurrence of `0111` in groups of four bits in the `player_board`. When this pattern is detected, the player wins. In the code, `(player_board + BBBB)` is bitwise $AND$ with `0x88888888`. This implies that when `0111` occurs, it needs to be transformed into `1000`. Adding `0111` to `1` achieves this effect. Therefore, `BBBB` should be replaced with `0x11111111`. Checking [Hacker Delight](https://web.archive.org/web/20180517023231/http://www.hackersdelight.org/divcMore.pdf), there is look up table for $n \ \ (mod \ \ 7) = popcount(n)$ ```c int remu7(unsigned n) { static char table [75] = {0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4}; n = (n >> 15) + (n & 0x7FFF); n = (n >> 9) + (n & 0x001FF); n = (n >> 6) + (n & 0x0003F); return table[n]; } ``` In [tictactoe.c](), it illustrate another method ```c static inline int mod7(uint32_t x) { x = (x >> 15) + (x & UINT32_C(0x7FFF)); /* Take reminder as (mod 8) by mul/shift. Since the multiplier * was calculated using ceil() instead of floor(), it skips the * value '7' properly. * M <- ceil(ldexp(8/7, 29)) */ return (int) ((x * UINT32_C(0x24924925)) >> 29); } ``` :::info TODO: cannot understand the meaning of 0x24924925 :::