--- tags: linux2024 --- # [2024q1](http://wiki.csie.ncku.edu.tw/linux/schedule) 第 7 週測驗題 :::info 目的: 檢驗學員對 bitwise 的認知 ::: ==[作答表單: 測驗 1](https://docs.google.com/forms/u/1/d/e/1FAIpQLScXfped9MFJXFHW_FgiC3U-ckya0ucrM9tTazUcuqMdPjcnUg/viewform)== (針對 Linux 核心「設計」課程) ==[作答表單: 測驗 2](https://docs.google.com/forms/d/e/1FAIpQLScuwxSA9aeaA8TN_DL-XVGDvB6lhH1OFUu63zSUKNUBYYI95g/viewform)== (針對 Linux 核心「實作」課程) ### 測驗 `1` [SIMD within a register](https://en.wikipedia.org/wiki/SWAR) (SWAR) 是軟體最佳化技巧之一,以下展示 SWAR 運用於 64 位元微處理器架構,原本判斷 2 個 32 位元寬度的整數是否都是奇數 (odd),可能會這樣撰寫: ```c #include <stdint.h> bool both_odd(uint32_t x, uint32_t y) { return (x & 1) && (y & 1); } ``` 但我們可先組合 (compound) 2 個 32 位元寬度的整數為 1 個 64 位元整數,再運用特製的 bitmask,從而減少運算量: ```c static uint64_t SWAR_ODD_MASK = (1L << 32) + 1; bool both_odd_swar(uint64_t xy) { return (xy & SWAR_ODD_MASK) == SWAR_ODD_MASK; } ``` 測試程式: ```c static inline uint64_t bit_compound(uint32_t x, uint32_t y) { return ((0L + x) << 32) | ((y + 0L) & (-1L >> 32)); } int main() { int x = 12345678; int y = 9012345; uint64_t xy = bit_compound(x, y); printf("%d, %d\n", both_odd(x, y), both_odd_swar(xy)); } ``` 延伸閱讀: [SIMD and SWAR Techniques](https://www.chessprogramming.org/SIMD_and_SWAR_Techniques) 在 Linux 核心原始程式碼中,[lib/string.c](https://github.com/torvalds/linux/blob/master/lib/string.c) 具備 [memchr](https://man7.org/linux/man-pages/man3/memchr.3.html) 的實作: ```c /** * memchr - Find a character in an area of memory. * @s: The memory area * @c: The byte to search for * @n: The size of the area. * * returns the address of the first occurrence of @c, or %NULL * if @c is not found */ void *memchr(const void *s, int c, size_t n) { const unsigned char *p = s; while (n-- != 0) { if ((unsigned char)c == *p++) { return (void *)(p - 1); } } return NULL; } ``` 測試程式: ```c int main() { const char str[] = "https://wiki.csie.ncku.edu.tw"; const char ch = '.'; char *ret = memchr_opt(str, ch, strlen(str)); printf("String after |%c| is - |%s|\n", ch, ret); return 0; } ``` 預期執行結果: ``` String after |.| is - |.csie.ncku.edu.tw| ``` 利用上述 [SIMD within a register](https://en.wikipedia.org/wiki/SWAR) (SWAR) 的技巧,我們可改寫為以下 `memchr_opt` 函式: ```c #include <limits.h> #include <stddef.h> #include <stdint.h> #include <string.h> /* Nonzero if X is not aligned on a "long" boundary */ #define UNALIGNED(X) ((long) X & (sizeof(long) - 1)) /* How many bytes are loaded each iteration of the word copy loop */ #define LBLOCKSIZE (sizeof(long)) /* Threshhold for punting to the bytewise iterator */ #define TOO_SMALL(LEN) ((LEN) < LBLOCKSIZE) #if LONG_MAX == 2147483647L #define DETECT_NULL(X) (((X) - (0x01010101)) & ~(X) & (0x80808080)) #else #if LONG_MAX == 9223372036854775807L /* Nonzero if X (a long int) contains a NULL byte. */ #define DETECT_NULL(X) \ (((X) - (0x0101010101010101)) & ~(X) & (0x8080808080808080)) #else #error long int is not a 32bit or 64bit type. #endif #endif /* @return nonzero if (long)X contains the byte used to fill MASK. */ #define DETECT_CHAR(X, mask) DETECT_NULL(AAAA) void *memchr_opt(const void *str, int c, size_t len) { const unsigned char *src = (const unsigned char *) str; unsigned char d = c; while (UNALIGNED(src)) { if (!len--) return NULL; if (*src == d) return (void *) src; src++; } if (!TOO_SMALL(len)) { /* If we get this far, len is large and src is word-aligned. */ /* The fast code reads the source one word at a time and only performs * a bytewise search on word-sized segments if they contain the search * character. This is detected by XORing the word-sized segment with a * word-sized block of the search character, and then checking for the * presence of NULL in the result. */ unsigned long *asrc = (unsigned long *) src; unsigned long mask = d << 8 | d; mask |= mask << 16; for (unsigned int i = 32; i < LBLOCKSIZE * 8; i <<= 1) mask |= BBBB; while (len >= LBLOCKSIZE) { if (DETECT_CHAR(CCCC, mask)) break; asrc++; len -= LBLOCKSIZE; } /* If there are fewer than LBLOCKSIZE characters left, then we resort to * the bytewise loop. */ src = (unsigned char *) asrc; } while (len--) { if (*src == d) return (void *) src; src++; } return NULL; } ``` 請補完程式碼,使上述 `memchr_opt` 的實作符合 [memchr](https://man7.org/linux/man-pages/man3/memchr.3.html) 行為,作答規範: 1. 以最精簡的程式碼撰寫 2. 以第一次作業規範的程式碼風格書寫 3. AAAA, BBBB, CCCC 皆為表示式 :::success 延伸問題: 1. 解釋上述程式碼運作原理 2. 比較 Linux 核心原本 (與平台無關) 的實作和 `memchr_opt`,設計實驗來觀察隨著字串長度和特定 pattern 變化的效能影響 3. 研讀 [2022 年學生報告](https://hackmd.io/@arthur-chang/linux2022-quiz8),在 Linux 核心原始程式碼找出 x86_64 或 arm64 對應的最佳化實作,跟上述程式碼比較,並嘗試舉出其中的策略和分析 > 下一步:貢獻程式碼到 Linux 核心 > [Phoronix 報導](https://www.phoronix.com/news/Linux-Kernel-Faster-memchr) ::: --- ### 測驗 `2` 若干處理器架構無法有效地處理除法,且我們想避免除法造成的例外狀況 (如 [Division by zero](https://en.wikipedia.org/wiki/Division_by_zero)),於是我們考慮以下實作: ```c #include <stdbool.h> #include <stdint.h> #define likely(x) __builtin_expect(!!(x), 1) static inline bool is_power_of_2(unsigned long n) { return (n != 0 && ((n & (n - 1)) == 0)); } static inline int __ilog2_u32(uint32_t x) { return x ? sizeof(x) * 8 - __builtin_clz(x) - 1 : 0; } static inline int __ilog2_u64(uint64_t x) { uint32_t h = x >> 32; if (h) return sizeof(h) * 8 - __builtin_clz(h) + 31; return x ? sizeof(x) * 8 - __builtin_clz(x) - 1 : 0; } /* Calculates the logarithm base 2 of a 32-bit or 64-bit unsigned integer. * This function performs a constant-capable log base 2 computation, allowing * it to initialize global variables with constant data, which explains the * extensive use of ternary operators. * It automatically chooses the optimized version suitable for the size of 'n', * based on whether it is 32-bit or 64-bit. */ #define ilog2(n) \ (__builtin_constant_p(n) ? ((n) < 2 ? 0 : 63 - __builtin_clzll(n)) \ : (sizeof(n) <= 4) ? __ilog2_u32(n) \ : __ilog2_u64(n)) /* In cases where the divisor is a constant, we compute its inverse during * compile time. This allows us to replace the division operation with several * inline multiplications, significantly improving performance. */ #define __div64_const32(n, ___b) \ ({ \ /* This approach multiplies by the reciprocal of b: to divide n by \ * b, we multiply n by (p / b) and then divide by p. The \ * efficiency comes from the compiler's ability to optimize much \ * of this code during compile time through constant propagation, \ * leaving behind only a few multiplication operations. This is \ * why we use this large macro, as a static inline function \ * does not consistently achieve the same optimization. \ */ \ uint64_t ___res, ___x, ___t, ___m, ___n = (n); \ bool ___bias; \ \ /* determine MSB of b */ \ uint32_t ___p = 1 << ilog2(___b); \ \ /* compute m = ((p << 64) + b - 1) / b */ \ ___m = (~0ULL / ___b) * ___p; \ ___m += (((~0ULL % ___b + 1) * ___p) + ___b - 1) / ___b; \ \ /* one less than the dividend with highest result */ \ ___x = ~0ULL / ___b * ___b - 1; \ \ /* test our ___m with res = m * x / (p << 64) */ \ ___res = ((___m & 0xffffffff) * (___x & 0xffffffff)) >> 32; \ ___t = ___res += (___m & 0xffffffff) * (___x >> 32); \ ___res += (___x & 0xffffffff) * (___m >> 32); \ ___t = (___res < ___t) ? (1ULL << 32) : 0; \ ___res = (___res >> 32) + ___t; \ ___res += (___m >> 32) * (___x >> 32); \ ___res /= ___p; \ \ /* Next, we will clean up and enhance the efficiency of our current \ * implementation. \ */ \ if (~0ULL % (___b / (___b & -___b)) == 0) { \ /* special case, can be simplified to ... */ \ ___n /= (___b & -___b); \ ___m = ~0ULL / (___b / (___b & -___b)); \ ___p = 1; \ ___bias = true; \ } else if (___res != ___x / ___b) { \ /* It is necessary to introduce a bias to counteract errors \ * caused by bit truncation. Without this bias, an extra bit \ * would be required to accurately represent the variable 'm', \ * which would exceed the capacity of a 64-bit variable. To \ * manage this, the solution involves setting 'm' equal to 'p' \ * divided by 'b', and 'n' divided by 'b' as equal to \ * (n * m + m) / p. This approach is taken because avoiding the \ * bias is not feasible due to the limitations imposed by bit \ * truncation errors and the 64-bit variable size constraint. \ */ \ ___bias = true; \ /* Compute m = (p << 64) / b */ \ ___m = (~0ULL / ___b) * ___p; \ ___m += ((~0ULL % ___b + AAAA) * ___p) / ___b; \ } else { \ /* Reduce m / p, and try to clear bit 31 of m when possible, \ * otherwise that will need extra overflow handling later. \ */ \ uint32_t ___bits = -(___m & -___m); \ ___bits |= ___m >> 32; \ ___bits = (~___bits) << 1; \ /* If ___bits == 0 then setting bit 31 is unavoidable. \ * Simply apply the maximum possible reduction in that \ * case. Otherwise the MSB of ___bits indicates the \ * best reduction we should apply. \ */ \ if (!___bits) { \ ___p /= (___m & -___m); \ ___m /= (___m & -___m); \ } else { \ BBBB; \ CCCC; \ } \ /* No bias needed. */ \ ___bias = false; \ } \ \ /* Now we have a combination of 2 conditions: \ * 1. whether or not we need to apply a bias, and \ * 2. whether or not there might be an overflow in the cross \ * product determined by (___m & ((1 << 63) | (1 << 31))). \ * \ * Select the best way to do (m_bias + m * n) / (1 << 64). \ * From now on there will be actual runtime code generated. \ */ \ ___res = __xprod_64(___m, ___n, ___bias); \ \ ___res /= ___p; \ }) /* * Semantic: retval = ((bias ? m : 0) + m * n) >> 64 * * The product is a 128-bit value, scaled down to 64 bits. * Assuming constant propagation to optimize away unused conditional code. */ static inline uint64_t __xprod_64(const uint64_t m, uint64_t n, bool bias) { uint32_t m_lo = m; uint32_t m_hi = m >> 32; uint32_t n_lo = n; uint32_t n_hi = n >> 32; uint64_t res; uint32_t res_lo, res_hi; if (!bias) { res = ((uint64_t) m_lo * n_lo) >> 32; } else if (!(m & ((1ULL << 63) | (1ULL << 31)))) { /* it is impossible for an overflow to occur in this case */ res = (m + (uint64_t) m_lo * n_lo) >> 32; } else { res = m + (uint64_t) m_lo * n_lo; res_lo = res >> 32; res_hi = (res_lo < m_hi); res = res_lo | ((uint64_t) res_hi << 32); } if (!(m & ((1ULL << 63) | (1ULL << 31)))) { /* it is impossible for an overflow to occur in this case */ res += (uint64_t) m_lo * n_hi; res += (uint64_t) m_hi * n_lo; res >>= 32; } else { res += (uint64_t) m_lo * n_hi; uint32_t tmp = res >> 32; res += (uint64_t) m_hi * n_lo; res_lo = res >> 32; res_hi = (res_lo < tmp); res = res_lo | ((uint64_t) res_hi << 32); } res += (uint64_t) m_hi * n_hi; return res; } static inline uint32_t __div64_32(uint64_t *n, uint32_t base) { uint64_t rem = *n; uint64_t b = base; uint64_t res, d = 1; uint32_t high = rem >> 32; /* Reduce the thing a bit first */ res = 0; if (high >= base) { high /= base; res = (uint64_t) high << 32; rem -= (uint64_t) (high * base) << 32; } while ((int64_t) b > 0 && b < rem) { b = b + b; d = d + d; } do { if (rem >= b) { rem -= b; res += d; } b >>= 1; d >>= 1; } while (d); *n = res; return rem; } /* The unnecessary pointer compare is there to check for type safety. * n must be 64bit. */ #define div64(n, base) \ ({ \ uint32_t __base = (base); \ uint32_t __rem; \ if (__builtin_constant_p(__base) && is_power_of_2(__base)) { \ __rem = (n) & (__base - 1); \ (n) >>= DDDD; \ } else if (__builtin_constant_p(__base) && __base != 0) { \ uint32_t __res_lo, __n_lo = (n); \ (n) = __div64_const32(n, __base); \ /* the remainder can be computed with 32-bit regs */ \ __res_lo = (n); \ __rem = __n_lo - __res_lo * __base; \ } else if (likely(((n) >> 32) == 0)) { \ __rem = (uint32_t) (n) % __base; \ (n) = (uint32_t) (n) / __base; \ } else { \ __rem = __div64_32(&(n), __base); \ } \ __rem; \ }) ``` 使用案例: ```c #include <stdio.h> int main(int argc, char *argv[]) { uint64_t res; res = argc + 5; div64(res, 3); printf("%lu\n", res); } ``` 作答規範: * 以最精簡的形式書寫,並依循第一次作業的程式碼風格 * AAAA 為整數 * BBBB 為包含 `___p` 的表示式 * CCCC 為包含 `___m` 的表示式 * BBBB, CCC, DDDD 都該包含 `ilog2` 的使用