2023q1 Homework2 (quiz2)

contributed by < chiangkd >

開發環境

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            9
    CPU max MHz:         3800.0000
    CPU min MHz:         800.0000
    BogoMIPS:            5599.85

測驗一

延伸問題

解釋程式碼原理，並用 __builtin_clzl 改寫

int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
int __builtin_clzl (unsigned long)
Similar to __builtin_clz, except the argument type is unsigned long.
在 Linux 核心原始程式碼找出類似的使用案例並解釋
當上述 clz 內建函式已運用時，編譯器能否產生對應的 x86 指令？

提示: 可執行 cc -O2 -std=c99 -S next_pow2.c 並觀察產生的 next_pow2.s 檔案，確認 bsrq 指令 (表示 “Bit Scan Reverse”)

在原程式的實作方式中，透過 binary search 的方式，與輸入值做比較，直到找到符合輸入值的 2 的冪

#include <stdint.h>
static inline uint64_t pow2(uint8_t e) { return ((uint64_t)1) << e; }

uint64_t next_pow2_m1(uint64_t x)
{
    uint8_t lo = 0, hi = 63;
    while (lo < hi) {
        uint8_t test = (lo + hi)/2;
        if (x < pow2(test)) { hi = test; }
        else if (pow2(test) < x) { lo = test+1; }
        else { return pow2(test); }
    }
    return pow2(lo);
}

以輸入 x = 7 舉例，lo 和 hi 會經過以下迭代

lo	hi	test
0	63	31
0	31	15
0	15	7
0	7	3
0	3	1
2	3	2

然後觸發 else if 條件

else if (pow2(test) < x) { lo = test+1; }

並回傳 pow2(lo) = 8，上述程式碼具有分支，以下是沒有分支的版本

uint64_t next_pow2_m2(uint64_t x)
{
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return x + 1;
}

考慮到題目敘述給定 64 位元數值，故上述程式碼必須將將最高位是 1 的後面都填充為 1 ，在前 7 個 x |= x >> 1 操作中已將包含最高位的頭 8 個 bit 填補為 1，考慮函數功能， AAAA 為 8， BBBB 為 32 即可填補完成。

但上述沒有分支的實作中在遇到 2 的冪次方輸入時，會產生錯誤，假設輸入 8(0b1000)，經過上述運算過後將會回傳 15(0b1111)+1 = 16，而預期輸出為 8，故我對程式碼稍做修改，新增一個 logical and (&&) 判斷輸入是否為 0 並對修正 2 的冪次方輸入所造成的問題。

uint64_t next_pow2_m2(uint64_t x)
{
    x -= x && 1;    /* if x != 0, x = x-1, for 2^k,it will borrow 1 bit from the top */
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return x + 1;
}

用 `builtin_clzl` 改寫

__builtin_clzl 回傳有幾個 leading 0，當輸入為 0 時，輸出 undefined。

有幾個 leading 0 代表有值第一個 1 出現在 64 - __builting_clzl(x)，也就是說最接近的大於等於 2 的冪的值為 1 << (64 - __builting_clzl(x))

將上述的 branch less 用此函式可改寫為

uint64_t next_pow2_m3(uint64_t x)
{
    x -= x && 1;
    return 1 << (64 - __builtin_clzl(x));
}

同樣的考慮到上述提到的輸入為 2 的冪次方以及輸入為 0 的問題，不過當前版本在輸入 x 為 0 或 1 時會造成 __builtin_clzl(0) ，為 undefined ，且預期輸出應為 1，故可將函式改寫為

uint64_t next_pow2_m3(uint64_t x)
{
    x -= x && 1;
    return x ? 1 << (64 - __builtin_clzl(x)) : 1;
}

編譯器能否產生對應的 x86 指令

gcc -O2 -std=c99 -S next_pow2.c

函式名稱為 next_pow2_m3，產生的對應 x86 指令

next_pow2_m3:
.LFB16:
	.cfi_startproc
	endbr64
	cmpq	$1, %rdi
	movl	$1, %eax
	adcq	$-1, %rdi
	testq	%rdi, %rdi
	je	.L10
	bsrq	%rdi, %rdi
	movl	$64, %ecx
	xorq	$63, %rdi
	subl	%edi, %ecx
	sall	%cl, %eax
	cltq
.L10:
	ret
	.cfi_endproc

有產生對應的 bsr 指令，其中後綴可以為 b、w、l、q 分別代表 8 位，16 位，32 位，64 位。因本函式處理 uint64_t ，所以會看到大量的 q 後綴

在 Linux 核心原始程式碼找出類似的使用案例並解釋

TODO

測驗二

延伸問題

解釋程式碼運作原理
嘗試使用 __builtin_{clz,ctz,ffs} 改寫，並改進
$10^{9} + 7$ 的運算

int concatenatedBinary(int n){
    const int M = 1e9 + 7;
    int len = 0;    /* bit length to be shift */
    long ans = 0;
    for(int i = 1; i <= n; i++) {
        /* removing the rightmost set bit
         * e.g. 100100 -> 100000
         *      000001 -> 000000
         *      000000 -> 000000
         * after removal, if it is 0, then it means it is power of 2
         * as all power of 2 only contains 1 set bit
         * if it is power of 2, we increase the bit length
         */
        if(!(i & (i - 1)))
            len++;
        ans = (i | (ans << len)) % M;
    }
    return ans;
}

首先考慮題目要求回傳輸入 1 ~ n 轉換程 binary 表示後的串接，先用 n=4 推演看看

value        binary
n = 1    ->    1
n = 2    ->    10
n = 3    ->    11
n = 4    ->    100

concatenated -> 11011100 = 220

接著搭配題目註解分析，將 rightmost 的 set bit 移除後即為 2 的冪次方，且遇到 2 的冪次時，要增加 bit 的長度，對應上述，在遇到 n=2 時，二進位值由 1 變為 10 ，也就是在二進制下進位，所以這裡才會需要增加 bit 的長度，同理遇到 n=4 時也一樣，故 DDDD 就必須處理移除 rightmost set bit 這件事情，也就是 i & (i-1) ，這裡如果 i 為 2 的冪次時，若減一則會向 leftmost set bit 借位，此時取 bitwise and 若為 0 ，代表 set bit 僅有 leftmost set bit 一位，即代表 i 屬於 2 的冪次，需增加 bit 長度。

接著 EEEE 為將二進位資料串接的過程，故 EEEE 為 ans << len，用同樣的例子推演一次

value        binary
n = 1    ->    1
n = 2    ->    10
n = 3    ->    11
n = 4    ->    100

/* iteration */
ans = 0b00000000    (len = 1)
ans = 0b00000001    (len = 1)
ans = 0b00000110    (len = 2)
ans = 0b00011011    (len = 2)
ans = 0b11011100    (len = 3)

嘗試使用 `__builtin_{clz,ctz,ffs}` 改寫並改進
$10^{9} + 7$ 運算

__builtin_clz: 回傳 leading 0 的數量
__builtin_ctz: 回傳 trailing 0 的數量
__builtin_ffs: 回傳從右邊數起第一個 1 的位置

在原方法中的 len 是根據是否有進位來判斷 left shift 的長度為多少，在這裡可以直接用 32 - __builtin_clz(i) 來判斷 left shift 長度，可改寫如下

int concatenatedBinary_m2(int n){
    long ans = 0;
    const int M = 1e9 + 7;
    for(int i = 1; i <= n; i++)
        ans = (i | ans << (32 - __builtin_clz(i))) % M;
    return ans;

一樣可以通過 leetcode 1680

文章 Modulo 10^9+7 (1000000007) 提到為何要進行 mod 運算。

TODO:
暫時還沒想到 mod 該怎麼優化，感覺可以從 modulo 的等價性 (identities) 下手

測驗三

延伸問題

解釋程式碼運作原理，比較 SWAR 和原本的實作效能落差
在 Linux 核心原始程式碼找出 UTF-8 和 Unicode 相關字串處理的程式碼，探討其原理，並指出可能的改進空間

size_t swar_count_utf8(const char *buf, size_t len)
{
    const uint64_t *qword = (const uint64_t *) buf;
-   const uint64_t *end = qword + len >> 3;
+   const uint64_t *end = qword + (len >> 3);

    size_t count = 0;
    for (; qword != end; qword++) {
        const uint64_t t0 = *qword;
        const uint64_t t1 = ~t0;
        const uint64_t t2 = t1 & 0x04040404040404040llu;
        const uint64_t t3 = t2 + t2;
        const uint64_t t4 = t0 & t3;
        count += __builtin_popcountll(t4);
    }

    count = (1 << 3) * (len / 8)  - count;
    count += (len & 7) ? count_utf8((const char *) end, len & 7) : 0;

    return count;
}

編譯時發現錯誤，根據 Precedence table ，+ 的優先權高於 >> ，根據函式功能應該要加個括號才合理。

透過 8 bytes 長度的 qword 將原先 1 byte 的 char 放進去計算，根據編碼規則， UTF-8 對於首位元組最高的兩個位元始終為 11，而後續位元組的最高兩個位元設定為 10

ASCII	0xxx.xxxx
2 bytes	110x.xxxx 10xx.xxxx
3 bytes	1110.xxxx 10xx.xxxx 10xx.xxxx
4 bytes	111.0xxx 10xx.xxxx 10xx.xxxx 10xx.xxxx

透過題目說明的規則將特定 pattern (not bit6 and bit7) 提取出來後利用 __builtin_popcount 計算有多少 1 (也就是符合規則的部份)

在 for 迴圈中處理超過

N * 8

bytes 的部份 (即可以放滿 uint64_t 的部份)，後續迴圈外的部份為處理剩下的 7 bytes

在迴圈外的 (1 << 3) * (len / 8) - count; 是為了計算出字元的數量，將總字串長度撿到 continuation bytes (符合 pattern 部份) 數量即可拿到 8 bytes 整數倍部份的字元總數。

最後再來處理不為 8 bytes 整數倍的部份，在前一行透過 len / 8 取出整數倍部份，此行則透過 len & 7 取出不為整數倍的部份，若不為 0，則直接透過 count_utf8 計算並加入結果之中。

效能測試

引入 cpucycles.h

測試 buffer size 為 10000，20000，50000，100000 下的 ticks

test code

Buffer Size	w/o SWAR	w/ SWAR
10000	257978	39937
20000	515033	76065
50000	1124807	169976
100000	2323698	361673

可以看到有使用 SWAR 的明顯較佳。

測驗四

延伸問題

解釋程式碼運作原理
在 Linux 核心原始程式碼找出上述 bitmask 及產生器，探討應用範疇
參見 Data Structures in the Linux Kernel

題目要求要找到的 pattern 為從 MSB 開始有 1 個或以上連續的 1 ，接著剩餘的位數都是 0 ，符合條件的 pattern 包含

pattern: 8000 (32768)    0b1000 0000 0000 0000
pattern: c000 (49152)    0b1100 0000 0000 0000
pattern: e000 (57344)    0b1110 0000 0000 0000
pattern: f000 (61440)    0b1111 0000 0000 0000
pattern: f800 (63488)    0b1111 1000 0000 0000
pattern: fc00 (64512)    0b1111 1100 0000 0000
pattern: fe00 (65024)    0b1111 1110 0000 0000
pattern: ff00 (65280)    0b1111 1111 0000 0000
pattern: ff80 (65408)    0b1111 1111 1000 0000
pattern: ffc0 (65472)    0b1111 1111 1100 0000
pattern: ffe0 (65504)    0b1111 1111 1110 0000
pattern: fff0 (65520)    0b1111 1111 1111 0000
pattern: fff8 (65528)    0b1111 1111 1111 1000
pattern: fffc (65532)    0b1111 1111 1111 1100
pattern: fffe (65534)    0b1111 1111 1111 1110
pattern: ffff (65535)    0b1111 1111 1111 1111

#include <stdint.h>
#include <stdbool.h>

bool is_pattern(uint16_t x)
{
    if (!x)
        return 0;

    for (; x > 0; x <<= 1) {
        if (!(x & 0x8000))
            return false;
    }

    return true;
}

原方法在 x > 0 的情況下 (迴圈中)，若檢測到 x & 0x8000(MSB) 為 0 ，則不滿足 pattern，即檢查若符合條件，最高位元為 0 的情況只有在大家都為 0，即代表 x=0

bool is_pattern(uint16_t x)
{
    const uint16_t n = EEEE;
    return (n ^ x) < FFFF;
}

此處回傳值為 n ^ x < FFFF，代表判斷是否符合 pattern 是透過比較大小，而符合 pattern 的值正好就是符合 pattern 的相同數量 1 中的最大值，考慮到若符合 pattern ，則其二補數為保留最靠近 LSB 的 1，其餘為 0，若不符合 pattern ，其二補數一樣會保留最靠近 LSB 的 1 ，但在這個 1 的左側 bit 會進行翻轉。

若符合 pattern ，取二補數後在和本身取 xor 必小於原本的值，因為最靠近 LSB 的 1 被消除了，若不符合 pattern ，則取二補數和本身取 xor 必大於原本的值，因為

x =         11110111
-x =        00001001
-x ^ x =    11111110
--------------------
x =         01101111
-x =        10010001
-x ^ x =    11111110
--------------------
x =         01010101
-x =        10101011
-x ^ x =    11111110

最靠近 LSB 的 1 的左側會被 1 填滿，且這個 1 本身會被 xor 成 0

EEEE 可為 -x or ~x + 1
FFFF 為 x

2023q1 Homework2 (quiz2)

開發環境

測驗一

用 builtin_clzl 改寫

編譯器能否產生對應的 x86 指令

在 Linux 核心原始程式碼找出類似的使用案例並解釋

測驗二

嘗試使用 __builtin_{clz,ctz,ffs} 改寫並改進 109+7 運算

測驗三

效能測試

測驗四

Read more

面試紀錄 & 練習 (聯發科)

2023q1 Homework1 (lab0)

將現有的 Raspberry PI 系統備份

Run FreeRTOS on VexRiscv and Measure/Tweak context switch overhead

用 `builtin_clzl` 改寫

嘗試使用 `__builtin_{clz,ctz,ffs}` 改寫並改進
$10^{9} + 7$ 運算