2023q1 Homework2 (quiz2)

contributed by < CYT701 >

開發環境

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
    CPU family:          6
    Model:               78
    Thread(s) per core:  2
    Core(s) per socket:  2
    Socket(s):           1
    Stepping:            3
    CPU max MHz:         2800.0000
    CPU min MHz:         400.0000
    BogoMIPS:            4800.00

測驗一

程式碼原理

#include <stdint.h>
static inline uint64_t pow2(uint8_t e) { return ((uint64_t)1) << e; }

uint64_t next_pow2(uint64_t x)
{
    uint8_t lo = 0, hi = 63;
    while (lo < hi) {
        uint8_t test = (lo + hi)/2;
        if (x < pow2(test)) { hi = test; }
        else if (pow2(test) < x) { lo = test+1; }
        else { return pow2(test); }
    }
    return pow2(lo);
}

pow2() 讀入一個 8 位元無號數 e ，並回傳將 64 位元無號數 1 左移 e 的值。
next_pow2() 讀入一個 64 位元無號數 x ，並回傳最接近且大於等於 x 的 2 的冪的值

原題目敘述中 最接近且大於等於 2 的冪的值 應寫為 最接近且大於等於 x 的 2 的冪的值才符合題意

這段程式碼是在 2^lo 與 2^hi 之間找尋欲求的值，當 lo < hi 時令 temp = (lo + hi) / 2 ，不斷比較 x 與 2^test 的大小直到 lo >= hi：

如果 x < 2^test ，則查找範圍變成 [2^lo, 2^test]
如果 2^test < x ，則查找範圍變成 [2^test+1, 2^hi]
如果 x 等於 2^test ，則回傳 2^test

uint64_t next_pow2(uint64_t x)
{               // x = 1000 0000 0000 0000 0000 0000 0000 0000
    x |= x >> 1;// x = 1100 0000 0000 0000 0000 0000 0000 0000
    x |= x >> 1;// x = 1110 0000 0000 0000 0000 0000 0000 0000
    x |= x >> 1;// x = 1111 0000 0000 0000 0000 0000 0000 0000
    x |= x >> 1;// x = 1111 1000 0000 0000 0000 0000 0000 0000
    x |= x >> 1;// x = 1111 1100 0000 0000 0000 0000 0000 0000
    x |= x >> 1;// x = 1111 1110 0000 0000 0000 0000 0000 0000
    x |= x >> 1;// x = 1111 1111 0000 0000 0000 0000 0000 0000
    x |= x >> AAAA;
    x |= x >> 16;
    x |= x >> BBBB;
    return CCCC;
}

題目中說明這段程式碼用以將原最高位元到最低位元中，所有的位元設定為 1 ，在這裡 x |= x >> 1; 同義於 x = x | (x >> 1);

所以可以推斷 AAAA 為 8 ， BBBB 為 32

由於回傳值為最接近且大於等於 x 的 2 的冪的值，而經過剛才的運算，此時可能出現兩種情況：

x 的初始值是 2 的冪次，此時應回傳 x 的初始值

在這裡因為 x 的初始值已經被運算過後的 x 值覆蓋，無法求出其初始值，判斷應該是題目要求有誤
在原程式碼中加入一個變數 temp 用以儲存 x 的初始值即可解決，利用運算後的 x 加 1 後右移 1 位元與 temp 作比較，若相等則表示 x 的初始值本來就是 2 的冪次，此時回傳 temp

x 的初始值不是 2 的冪次，此時回傳 x + 1

根據以上兩點可以得到以下判斷式

if(x + 1 >> 1 == temp)
    return temp;
else
    return x + 1;

也可寫成

return x + 1 >> 1 == temp ? temp : x + 1;

故原程式碼應改寫如下：

處理 x = 0 的情況有誤

uint64_t next_pow2(uint64_t x)
{     
    uint64_t temp = x;//store the value of x
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 1;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return (x + 1) >> 1 == temp ? temp : x + 1;
}

加入判斷條件來處理 x = 0 的情況

uint64_t next_pow2_bitshift_new(uint64_t x) {
  uint64_t temp = x; // store the value of x
  x |= x >> 1;
  x |= x >> 1;
  x |= x >> 1;
  x |= x >> 1;
  x |= x >> 1;
  x |= x >> 1;
  x |= x >> 1;
  x |= x >> 8;
  x |= x >> 16;
  x |= x >> 32;
  if (temp == 0)
    return 1;
  return (x + 1) >> 1 == temp ? temp : x + 1;
}

以 `__builtin_clzl` 改寫

在 6.59 Other Built-in Functions Provided by GCC 中說明：

Built-in Function: int __builtin_clzl (unsigned long)

    Similar to __builtin_clz, except the argument type is unsigned long.

所以要看到 __builtin_clz ：

Built-in Function: int __builtin_clz (unsigned int x)

    Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

簡單來說就是 __builtin_clz 會回傳最高位的 1 左邊有幾個 0 ，可以此判斷找出最高位的 1 在第幾個位元，而 __builtin_clzl 是 __builtin_clz 的 unsigned long 版本，實作如下：

uint64_t next_pow2_builtin(uint64_t x) {
  /*
  if((uint64_t)1 << 63 - __builtin_clzl(x) == x)
          return x;
  else
          return (uint64_t)1 << 63 - __builtin_clzl(x) + 1;
  */
  if (x == 0)
    return 1;
  return (uint64_t)1 << 63 - __builtin_clzl(x) == x
             ? x
             : (uint64_t)1 << 63 - __builtin_clzl(x) + 1;
}

在參考資料中提到： If x is 0, the result is undefined. 所以在實作時印出 __builtin_clzl(0) 的結果，得到 64

所以當 x = 0 時要回傳最接近且大於等於 0 的的 2 的冪次為 1
如果 x 不是 0 則要檢查 x 本身是否為 2 的冪次，如果是則回傳 x 否則回傳(uint64_t)1 << 63 - __builtin_clzl(x) + 1

結果如下：

$ ./next_pow
Enter a number x (uint64_t):0
__builtin_clzl(0) = 64
__builtin_clzl(x) = 63
Using dichotomy: 1
Using bitshift_origin: 1
Using bitshift_new: 1
Using builtin: 1

Enter a number x (uint64_t):23
__builtin_clzl(0) = 64
__builtin_clzl(x) = 59
Using dichotomy: 32
Using bitshift_origin: 32
Using bitshift_new: 32
Using builtin: 32

Enter a number x (uint64_t):128
__builtin_clzl(0) = 64
__builtin_clzl(x) = 56
Using dichotomy: 128
Using bitshift_origin: 256
Using bitshift_new: 128
Using builtin: 128

Using dichotomy 為題目原本給的函式所產生
Using bitshift_origin 為題目原本的 bitshift 程式碼產生
Using bitshift_new 為修正後的 bitshift 程式碼
Using builtin 為使用 __builtin_clzl() 所產生

可以看到題目原本的 bitshift 程式碼在輸入值原本就是2的冪次時，會得到錯誤的結果

這裡在輸入為 0 時，__builtin_clzl(0) 與 __builtin_clzl(x) 的輸出居然不同，不確定為什麼會有這種狀況

當上述 clz 內建函式已運用時，編譯器能否產生對應的 x86 指令？

next_pow2_builtin:
.LFB27:
    .cfi_startproc
    endbr64
    movl    $1, %eax
    testq   %rdi, %rdi
    je  .L15
    movabsq $-9223372036854775808, %rax
    bsrq    %rdi, %rdx
    xorq    $63, %rdx
    movl    %edx, %ecx
    shrq    %cl, %rax
    cmpq    %rax, %rdi
    je  .L15
    movl    $64, %ecx
    movl    $1, %eax
    subl    %edx, %ecx
    salq    %cl, %rax
.L15:
    ret
    .cfi_endproc
.LFE27:
    .size   next_pow2_builtin, .-next_pow2_builtin
    .section    .rodata.str1.1,"aMS",@progbits,1

測驗二

程式碼原理

LeetCode 1680. Concatenation of Consecutive Binary Numbers
給定一整數 n ，回傳將 1 到 n 的二進位表示法依序串接在一起所得到的二進位字串，其所代表的十進位數字 mod 10⁹ + 7 之值

int concatenatedBinary(int n)
{
    const int M = 1e9 + 7;
    int len = 0; /* the bit length to be shifted */
    /* use long here as it potentially could overflow for int */
    long ans = 0;
    for (int i = 1; i <= n; i++) {
        /* removing the rightmost set bit
         * e.g. 100100 -> 100000
         *      000001 -> 000000
         *      000000 -> 000000
         * after removal, if it is 0, then it means it is power of 2
         * as all power of 2 only contains 1 set bit
         * if it is power of 2, we increase the bit length
         */
        if (!(DDDD))
            len++;
        ans = (i | (EEEE)) % M;
    }
    return ans;
}

由給定程式碼中的提示可知， for 迴圈中所作的事情是將 i 的 rightmost set bit 設為 0 ，所得到的值如果為 0 表示 i 為 2 的冪次，因為 2 的冪次的二進位表示只包含一個 1 ，其餘為 0 ，`
此時 i 的長度會比 i-1 增加 1 ，所以 len 要加 1
len 代表的是此時 i 共有幾位元（不包含 leftmost set bit 左邊的 0），也同時代表在把下個數字串接上時 ans 需要左移的位元數

根據程式碼，可以看出當 !(DDDD) （也就是 DDDD == 0 ）時 len++ ，此時 i 為 2 的冪次，即 i 將 rightmost set bit 設為 0 後的值為 0 ，故 DDDD 為「將 rightmost set bit 設為 0 」這個動作。

你所不知道的 C 語言：數值系統篇在運用 bit-wise operator 處提到，在 C 語言中 x & (x - 1) == 0 的數學意義表示 x 為 2 的冪次

所以 DDDD 為 i & (i - 1)

接下來要將 i 串接到 ans ，所以要將ans 左移 len 後與 i 作 OR 運算

所以 EEEE 為 ans << len

使用 `__builtin_{clz,ctz,ffs}` 改寫

在 6.59 Other Built-in Functions Provided by GCC 中說明：

Built-in Function: int __builtin_ffs (int x)

    Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.

回傳 1 加上 rightmost set bit 的 index ，若 x 為 0 則回傳 0

Built-in Function: int __builtin_ctz (unsigned int x)

    Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.

回傳最低位的 1 右邊有幾個 0

改寫如下：

int using_builtin(int n) {
  const int M = 1e9 + 7;
  int len = 0;
  long ans = 0;
  for (int i = 1; i <= n; i++) {
    if (__builtin_clz(i) + __builtin_ctz(i) == 31)
      len = __builtin_ffs(i);
    ans = (i | (ans << len)) % M;
  }
  return ans;
}

由於 __built_clz() 回傳 leftmost set bit 左邊的 0 的數量，而 __builtin_ctz() 回傳 rightmost set bit 右邊的 0 數量，所以當 i 為 2 的冪次時，兩者的和會是總位元數減 1 ，又因為 i 是 int 型態，其範圍在 -2³¹ 與 2³¹ - 1 之間，也就是 32 bits ，所以判斷是否為 2 的冪次的條件可改寫成 if (__builtin_clz(i) + __builtin_ctz(i) == 31) ，而 len 為當前 i 的位元數（不包含 leftmost set bit 左邊的 0 ），也就是從右邊開始數到 leftmost set bit 的值，即為 LSB 的 index ，故使用 __builtin_ffs()

測驗三

程式碼原理

#include <stdint.h>
bool both_odd(uint32_t x, uint32_t y) {
    return (x & 1) && (y & 1);
}

要檢查 x 與 y 兩個 32 位元的數字是否都是奇數，將兩者分別與 1 作 AND 運算，兩者都是 1 則回傳true

static inline uint64_t bit_compound(uint32_t x, uint32_t y) {
    return ((0L + x) << 32) | ((y + 0L) & (-1L >> 32));
}

static uint64_t SWAR_ODD_MASK = (1L << 32) + 1;

bool both_odd_swar(uint64_t xy) {
    return (xy & SWAR_ODD_MASK) == SWAR_ODD_MASK;
}

利用特製的 bitmask SWAR_ODD_MASK ，將 1 賦型為 long 後左移 32 位元後加 1 ，然後檢查將 x 與 y 合併成 64 位元後與 SWAR_ODD_MASK 作 AND 運算，若與 SWAR_ODD_MASK 相等則回傳 true

ASCII	0xxx.xxxx
2 bytes	110x.xxxx 10xx.xxxx
3 bytes	1110.xxxx 10xx.xxxx 10xx.xxxx
4 bytes	1111.0xxx 10xx.xxxx 10xx.xxxx 10xx.xxxx

UTF-8 的多位元組字元有以下特性：

首位元組的最高 2 位元必為 11
後續位元組的最高 2 位元必為 10

所以要確認 UTF-8 的字串共有多少字元，可以用總字串長度減去後續位元組的數量

#include <stddef.h>
#include <stdint.h>

size_t count_utf8(const char *buf, size_t len)
{
    const int8_t *p = (const int8_t *) buf;
    size_t counter = 0;
    for (size_t i = 0; i < len; i++) {
        /* -65 is 0b10111111, anything larger in two-complement's should start
         * new code point.
         */
        if (p[i] > -65)
            counter++;
    }
    return counter;
}

將字串（ char * ）以二補數的形式轉換為有號整數的位元組（ int8_t ），再跟 -65 作比較以確認是否為後續位元組，若不是後續位元組則 counter++ ，最後回傳 counter 即為後續位元組數量

正數與 0 的二補數為期二進位表示再補上最高位元 0 ，負數的二補數為其正數按位元取反後加 1
用 -65 是因為其以二補數表示為 0b10111111 ，也就是說只要大於 -65 其最高 2 位元必為 11 ，即為首位元組

由於後續位元組的第 6 位元必為 0 ，第 7 位元必為 1 （最低位元為第 0 位元），也就是 not bit6 and bit7 ，所以可以以下步驟實作 SWAR

將輸入的位元組取反
與 0x40404040 作 AND 運算（隔離第 6 位元）
將第 6 位元左移 1 位元
再與原本輸入的位元組作 AND 運算
最後以 popcount() 作運算

Built-in Function: int __builtin_popcount (unsigned int x)

    Returns the number of 1-bits in x.

輸入一個無號整數 x ，回傳 x 的 set bit 數

SWAR 實作如下：

size_t swar_count_utf8(const char *buf, size_t len)
{
    const uint64_t *qword = (const uint64_t *) buf;
    const uint64_t *end = qword + len >> 3;

    size_t count = 0;
    for (; qword != end; qword++) {
        const uint64_t t0 = *qword;
        const uint64_t t1 = ~t0;
        const uint64_t t2 = t1 & 0x04040404040404040llu;
        const uint64_t t3 = t2 + t2;
        const uint64_t t4 = t0 & t3;
        count += __builtin_popcountll(t4);
    }

    count = (1 << AAAA) * (len / 8)  - count;
    count += (len & BBBB) ? count_utf8((const char *) end, len & CCCC) : DDDD;

    return count;
}

這裡首先將 buf 轉型為 uint64_t 的形態，並利用 end 指向位元組總長度除以 8 的位置 <補圖>
可以看出 for 迴圈結束後 count 即為後續位元組數量，所以再來要做的就是將總字串長度減去後續位元組數量

(1 << AAAA) * (len / 8)

所以可以推斷上面這段程式碼就是總字串長度，而 len 這個變數就是字串總長度，所以上述運算做完後依然會等於 len ，故 1 << AAAA 後為 8

所以 AAAA 為 3

count += (len & BBBB) ? count_utf8((const char *) end, len & CCCC) : DDDD;

這裡由於位元組數量有可能不整除 8 ，所以如果有剩下的位元組則利用 count_utf8() 來計算，沒有的話 count 就維持原本的值

所以 BBBB 為 7 ， CCCC 為 7 ， DDDD 為 0

SWAR 和原本的實作效能落差

原本的 count_utf8() 一次只能處理 8 bits ，但 SWAR 實作則可以一次處理 64 bits

測驗四

程式碼原理

#include <stdint.h>
#include <stdbool.h>

bool is_pattern(uint16_t x)
{
    if (!x)
        return 0;

    for (; x > 0; x <<= 1) {
        if (!(x & 0x8000))
            return fa   }

    return true;
}

利用 for 迴圈，每次將 x 左移 1 ，若 x 與 0x8000 作 AND 運算後為 0 則回傳 false ，若可以完整跑完 for 迴圈並不回傳 false ，則回傳 true

也就是說只有在 x 的 leftmost bit 為 1 且 leftmost bit 與其他的 set bit 為連續出現，也就是所有的1都是連續的才會回傳 true

符合的樣式如下：

pattern: 8000 (1000 0000 0000 0000)
pattern: c000 (1100 0000 0000 0000)
pattern: e000 (1110 0000 0000 0000)
.
.
.
pattern: ffff (1111 1111 1111 1111)

改寫如下：

bool is_pattern(uint16_t x)
{
    const uint16_t n = EEEE;
    return (n ^ x) < FFFF;
}

首先釐清程式碼要做到的是：

x 的 leftmost bit 為 1
x 的 1 必連續出現

滿足以上回傳 true ，否則 false

初步想法是將 x 與 0x0000 或是 0xffff 作 XOR 運算，但 x ^ 0x0000 = x ，對於判斷不會有幫助，而 x ^ 0xffff = ~x 會將 x 的 1 變為 0 ， 0 變為 1 ，所以若 x 符合樣式時， x ^ 0xffff 會是 rightmost bit 為 1 且所有 1 都連續的數，此時加 1 則會使整串位元組剩下一個 1 ，且這個 1 就是在 x 的 rightmost set bit ，故可知：

EEEE 為 ~(x) + 1 ， FFFF為 x

2023q1 Homework2 (quiz2)

開發環境

測驗一

程式碼原理

以 __builtin_clzl 改寫

當上述 clz 內建函式已運用時，編譯器能否產生對應的 x86 指令？

測驗二

程式碼原理

使用 __builtin_{clz,ctz,ffs} 改寫

測驗三

程式碼原理

SWAR 和原本的實作效能落差

測驗四

程式碼原理

Read more

2023q1 Homework1 (lab0)

以 `__builtin_clzl` 改寫

使用 `__builtin_{clz,ctz,ffs}` 改寫