2019q3 Homework2 (quiz2)

--- tags: sysprog --- # 2019q3 Homework2 (quiz2) contributed by < `yxguo2536` > ## 測驗6 延伸測驗 `5`，實作 branchless 的 `popcnt` 並附上對應的程式原理解說。 :::success 延伸問題: 1. 指出 `popcnt` 的應用場景; 2. 在 Linux 核心程式碼找出具體用法並解析; ::: --- ### 原理參考 [Hamming weight - Wikipedia](https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation) 的範例： :::danger 用 HackMD 內建的表格語法改寫下方 :notes: jserv ::: <table> <thead> <tr> <th>Expression</th> <th>Binary</th> <th>Decimal</th> <th>Comment</th> </tr> </thead> <tbody> <tr> <th>a</th> <th>01 10 11 00 10 11 10 10</th> <th> </th> <th>The original number</th> </tr> <tr> <th>b0 = (a >> 0) & 01 01 01 01 01 01 01 01</th> <th>01 00 01 00 00 01 00 00</th> <th>1,0,1,0,0,1,0,0</th> <th>every other bit from a</th> </tr> <tr> <th>b1 = (a >> 1) & 01 01 01 01 01 01 01 01</th> <th>00 01 01 00 01 01 01 01</th> <th>0,1,1,0,1,1,1,1</th> <th>the remaining bits from a</th> </tr> <tr> <th>c = b0 + b1</th> <th>01 01 10 00 01 10 01 01</th> <th>1,1,2,0,1,2,1,1</th> <th><font color=red>list giving # of 1s in each 2-bit slice of a</font></th> </tr> <tr> <th>d0 = (c >> 0) & 0011 0011 0011 0011</th> <th>0001 0000 0010 0001</th> <th>1,0,2,1</th> <th>every other count from c</th> </tr> <tr> <th>d2 = (c >> 2) & 0011 0011 0011 0011</th> <th>0001 0010 0001 0001</th> <th>1,2,1,1</th> <th>the remaining counts from c</th> </tr> <tr> <th>e = d0 + d2</th> <th>0010 0010 0011 0010</th> <th>2,2,3,2</th> <th><font color=red>list giving # of 1s in each 4-bit slice of a</font></th> </tr> <tr> <th>f0 = (e >> 0) & 00001111 00001111</th> <th>00000010 00000010</th> <th>2,2</th> <th>every other count from e</th> </tr> <tr> <th>f4 = (e >> 4) & 00001111 00001111</th> <th>00000010 00000011</th> <th>2,3</th> <th>the remaining counts from e</th> </tr> <tr> <th>g = f0 + f4</th> <th>00000100 00000101</th> <th>4,5</th> <th><font color=red>list giving # of 1s in each 8-bit slice of a</font></th> </tr> <tr> <th>h0 = (g >> 0) & 0000000011111111</th> <th>0000000000000101</th> <th>5</th> <th>every other count from g</th> </tr> <tr> <th>h8 = (g >> 8) & 0000000011111111</th> <th>0000000000000100</th> <th>4</th> <th>the remaining counts from g</th> </tr> <tr> <th>i = h0 + h8</th> <th>0000000000001001</th> <th>9</th> <th><font color=red>the final answer of the 16-bit word</font></th> </tr> </tbody> </table> <style> table { font-weight:normal; } table th:first-of-type { width: 150px; } table th:nth-of-type(2) { width: 220px; } </style> 可以看到他的基本流程就是： 1. 根據 a 計算出 c，c 中每個 pair ( 2 bits )紀錄的值代表「a 中每 2 bits 有幾個"1"」 2. 根據 c 計算出 e，e 中每個 nibble ( 4 bits )紀錄的值代表「a 中每 4 bits 有幾個"1"」 3. 根據 e 計算出 g，g 中每個 byte 紀錄的值代表「a 中每 8 bits 有幾個"1"」 4. 根據 g 計算出 i，i 中每 2 個 bytes 紀錄的值代表「a 中每 16 bits (也就是全部) 有幾個"1"」 ### 實做為了達成上述行為，需要兩「單位」為一組做**相加** ( 在上述範例，單位是1、 2、4、8 bits )。也因此實做上需要 bit mask 和 bitwise shift 的組合應用。 ```clike int popcnt_branchless(int n) { n = (n & 0x55555555) + ((n>>1) & 0x55555555); n = (n & 0x33333333) + ((n>>2) & 0x33333333); n = (n & 0x0f0f0f0f) + ((n>>4) & 0x0f0f0f0f); n = (n & 0x00ff00ff) + ((n>>8) & 0x00ff00ff); n = (n & 0x0000ffff) + ((n>>16) & 0x0000ffff); return n; } ``` ### 應用場景 population count 可以拿來計算 [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance)。假設兩個長度相等的字串，則 Hamming Distance 則是兩字串對應位置不相符的數量和： * 10<font color=red>1</font>1<font color=red>1</font>01與10<font color=green>0</font>1<font color=green>0</font>01的距離是2 * 2<font color=red>14</font>3<font color=red>8</font>96與2<font color=green>23</font>3<font color=green>7</font>96的距離是3 * "<font color=red>t</font>o<font color=red>n</font>e<font color=red>d</font>"與"<font color=green>r</font>o<font color=green>s</font>e<font color=green>s</font>"的距離是3 而在二進位中，尋找「兩個 bit array A 和 B 的 Hamming Distance 」則等同於計算「 A XOR B 後的population count」 ```clike int HammingDistance(int a, int b) { return __buildin_popcount(a ^ b); } ``` 而 Hamming distance 又是計算 [Hamming code](https://en.wikipedia.org/wiki/Hamming_code) 需要的步驟，後者是檢驗並校正位元錯誤的方法。 ### Linux 核心實際應用上網找了一下 counting bit set 在 linux 上的應用，首先從[網友的回覆](https://stackoverflow.com/questions/14366069/kernel-macro-for-counting-bits-set)鎖定一個 hweight_long，去 github 找一下，發現確實有這個函式： * [linux/tools/include/linux/bitops.h](https://github.com/torvalds/linux/blob/7e67a859997aad47727aff9c5a32e160da079ce3/tools/include/linux/bitops.h) ```clike static inline unsigned long hweight_long(unsigned long w) { return sizeof(w) == 4 ? hweight32(w) : hweight64(w); } ``` 但還不能確定這個函式真的就是 population count，所以繼續往底層追，找 hweight32 是怎麼實作的： * [linux/tools/include/asm-generic/bitops/const_hweight.h ](https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/tools/include/asm-generic/bitops/const_hweight.h) ```clike /* * Compile time versions of __arch_hweightN() */ #define __const_hweight8(w) \ ((unsigned int) \ ((!!((w) & (1ULL << 0))) + \ (!!((w) & (1ULL << 1))) + \ (!!((w) & (1ULL << 2))) + \ (!!((w) & (1ULL << 3))) + \ (!!((w) & (1ULL << 4))) + \ (!!((w) & (1ULL << 5))) + \ (!!((w) & (1ULL << 6))) + \ (!!((w) & (1ULL << 7))))) #define __const_hweight16(w) (__const_hweight8(w) + __const_hweight8((w) >> 8 )) #define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16)) #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32)) /* * Generic interface. */ #define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : __arch_hweight8(w)) #define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w)) #define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w)) #define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w)) ``` 原來 hweight32 還不是最底層，最底層是__const_hweight8(w) 和 __arch_hweight8(w)。從 __const_hweight8(w) 看來確實是 population count 。只不過這種寫法多了很多運算子，~~照理來講~~就我的理解，這樣運算會更慢，目前還搞不懂是什麼考量要採用這種寫法。 :::danger 不要濫用「照理說」這詞彙，你比較輸出/反組譯的組合語言列表了嗎？Linux 核心的實作需要支援多種硬體架構，因此巨集定義會有許多層次，但整體的效能還是得以兼顧 :notes: jserv ::: 另外 `__builtin_constant_p(w)` 的原理和應用場合還不清楚，從 [此篇文章](https://www.daemon-systems.org/man/__builtin_constant_p.3.html) 介紹來看: > The `__builtin_constant_p()` is a GNU extension for determining whether a value is known to be constant at compile time. > The function is closely related to the concept of "constant folding" used by modern optimizing compilers. 所以看起來此段程式碼是在做編譯器最佳化 : ```clike #define hweight8(w) (__builtin_constant_p(w) ? __const_hweight8(w) : __arch_hweight8(w)) #define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w)) #define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w)) #define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w)) ``` 另外，畢竟 `__builtin_constant_p(w)` 有分支，所以我也去找了一下 __arch_hweight8(w) 的實作: * [linux/arch/mips/include/asm/arch_hweight.h](https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/arch/mips/include/asm/arch_hweight.h) ```clike static inline unsigned int __arch_hweight32(unsigned int w) { return __builtin_popcount(w); } static inline unsigned int __arch_hweight16(unsigned int w) { return __builtin_popcount(w & 0xffff); } static inline unsigned int __arch_hweight8(unsigned int w) { return __builtin_popcount(w & 0xff); } static inline unsigned long __arch_hweight64(__u64 w) { return __builtin_popcountll(w); } ``` __builtin_popcount(w) 是 GCC 對 population count 做的 builtin function。至此才終於可以確定 hweight_long(w) 是在做 population count。 --- 現在，我們知道 hweight_long(w) 是在做 64 位元的 population count，不過，所以 hweight_long 到底可以用在什麼場合呢? 往上層應用追蹤，首先發現 bitmap_weight() * [linux/lib/bitmap.h](https://code.woboq.org/linux/linux/include/linux/bitmap.h.html#bitmap_weight) ```clike static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits) { if (small_const_nbits(nbits)) return hweight_long(*src & BITMAP_LAST_WORD_MASK(nbits)); return __bitmap_weight(src, nbits); } ``` * [linux/lib/bitmap.c](https://code.woboq.org/linux/linux/lib/bitmap.c.html#__bitmap_weight) ```clike int __bitmap_weight(const unsigned long *bitmap, unsigned int bits) { unsigned int k, lim = bits/BITS_PER_LONG; int w = 0; for (k = 0; k < lim; k++) w += hweight_long(bitmap[k]); if (bits % BITS_PER_LONG) w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits)); return w; } EXPORT_SYMBOL(__bitmap_weight); ``` * [linux/include/linux/bitmap.h](https://code.woboq.org/linux/linux/include/linux/bitmap.h.html#212) ```clike #define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1))) #define small_const_nbits(nbits) \ (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0) ``` 看起來，bitmap_weight(src, nbits) 就是延伸 hweight_long(w)，只要 `nbits` > 0，他就可以做出 `nbits` 的 population count。不過這樣問題還是沒有得到實質性的解決 : 到底 population count 在 linux 核心中有什麼實際應用 ? 所以，我再次查詢呼叫了 bitmap_weight() 的程式碼，但因為太多程式碼都有呼叫，多到看不完，所以我只能選擇其中一個去追蹤： * [linux/tools/perf/builtin-c2c.c](https://code.woboq.org/linux/linux/tools/perf/builtin-c2c.c.html#1109) ```clike // ... static int cpucnt_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp, struct hist_entry *he) { struct c2c_hist_entry *c2c_he; int width = c2c_width(fmt, hpp, he->hists); char buf[10]; c2c_he = container_of(he, struct c2c_hist_entry, he); scnprintf(buf, 10, "%d", bitmap_weight(c2c_he->cpuset, c2c.cpus_cnt)); return scnprintf(hpp->buf, hpp->size, "%*s", width, buf); } // ... ``` 找到程式碼後，首先要先回答 : 所以 builtin-c2c.c 這支程式到底是在做什麼? 後來找到一篇文章 [perf, c2c: Add new tool to analyze cacheline contention on NUMA systems](https://lwn.net/Articles/588866/)，確定了 builtin-c2c.c 是 linux 檢測工具 `perf c2c` 的主程式。再根據 builtin-c2c 的主要開發者 Joe Mario 撰寫的文章 [C2C - False Sharing Detection in Linux Perf](https://joemario.github.io/blog/2016/09/01/c2c-blog/) 描述，"c2c"代表"cacheline-to-cacheline"，這個工具主要是檢測程式在多執行續環境是否有 False Sharing 這種效能陷阱。不過關於 False Sharing 更詳細的發生原因就看不懂了，而且短時間內也看不懂 builtin-c2c.c。 :::warning 請複習計算機組織/結構教科書 :notes: jserv ::: ### 參考 1. [False Sharing 原理動態展示](https://blog.darkthread.net/blog/false-sharing-demo-with-jquery-animation/)