2017q3 Homework1 (clz)

tags: `進階電腦系統理論與實作 (Fall 2017)`

contributed by < jcyztw >

注：繳交期限過後補交

開發環境

Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
每核心執行緒數：1
每通訊端核心數：2
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              15
Model name:            Intel(R) Core(TM)2 Duo CPU     T5750  @ 2.00GHz
製程：              13
CPU MHz：             1000.000
CPU max MHz:           2000.0000
CPU min MHz:           1000.0000
BogoMIPS:              3990.25
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           2048K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm

P.S. Flags 中找不到 RDTSC，不曉得我的機器有無支援 RDTSC 指令。

來自老師的提示：參閱 How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures，在一開始的 Abstract 以及 Introduction 部份(p.2, p.5-6)有提到 RDTSC 指令適用於 generic Intel architecture 處理器(32/64 bits)。謝謝老師！

變更開發環境 [2017.12.31 更新]

Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
每核心執行緒數：2
每通訊端核心數：4
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              158
Model name:            Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
製程：              9
CPU MHz：             900.000
CPU max MHz:           4200.0000
CPU min MHz:           800.0000
BogoMIPS:              7200.00
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

各個 CLZ 實作的運作原理

本次作業給的程式碼用到了以下這些 CLZ 的實作方法：

recursive version
iteration version
binary search technique
byte-shift version
Harley’s algorithm

先從 CLZ 實作的運作原理開始吧！(從 code 下手)

recursive

每一次呼叫 clz2() 函式，都是代表將該次關注的 bits「切一半」來檢查 leading zero。以下透過一個實際數值(0x0001F000)來輔助說明主要步驟：

0x0001F000 = 0000 0000 0000 0001 1111 0000 0000 0000 ₂

step 1

首先將此數等分為兩部份：較高位(upper)，以及較低位(lower)部份

upper	lower
0000 0000 0000 0001	1111 0000 0000 0000

step 2

此時由 upper 數值判斷下次遞迴呼叫要取那一部分來累計 leading zero 個數。若 upper 等於 0，下次遞迴呼叫則取 lower 部份繼續檢查 leading zero，並且用 counter 紀錄目前已有的 leading zeros(等於 upper bits 數)；若 upper 不等於 0，則下次遞迴呼叫拿 upper 部份繼續檢查。

upper = 0000 0000 0000 0001，下次遞迴呼叫取 upper 部份繼續檢查 leading zero

step 3
遞迴呼叫 clz2() 函式，直到只檢查 2 bits 的 leading zero 時再 return。

static const int mask[] =  { 0, 8, 12,14 };
static const int magic[] = { 2, 1,  0, 0 };

unsigned clz2(uint32_t x,int c)
{
    if (!x && !c) return 32;

    uint32_t upper = (x >> (16 >> c));
    uint32_t lower = (x & (0xFFFF >> mask[c]));
    if (c == 3) return upper ? magic[upper] : 2 + magic[lower];
    return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1);
}

iteration

代入數值跑跑看，x = 0x00000080，每一輪跑完的變數數值如下：

iteration	y	c	x	n	我們所關心的 x 中 bit(s)
1	0x00000000	8	0x00000080	32	0x 0000 0080
2	0x00000000	4	0x00000080	32	0x 0000 0080
3	0x00000008	2	0x00000008	28	0x 0000 0080
4	0x00000002	1	0x00000002	26	上述粗體 nibble (8)中 1000₂
5	0x00000001	0	0x00000001	25	上述粗體 nibble (8)中 1000₂

最後得到 leading zero 個數 n = 24 (= 25 - x = 25 - 1)

unsigned clz(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return (n - x);
}

binary search technique

主要精神就是每次檢查高位(前半段)的 bits 是否都為 0。若皆為 0，計算 leading zero 數的 counter 則加上未被 mask 的 bit 個數，並將該數左移未被 mask 的 bit 個數，接著檢查更高位的 bits。共需檢查 5 (=

\log_{2} 32

) 次。

unsigned clz(uint32_t x)
{
    if (x == 0) return 32;
    int n = 0;
    if (x <= 0x0000FFFF) {
        n += 16;
        x <<= 16;
    }
    if (x <= 0x00FFFFFF) {
        n += 8;
        x <<= 8;
    }
    if (x <= 0x0FFFFFFF) {
        n += 4;
        x <<= 4;
    }
    if (x <= 0x3FFFFFFF) {
        n += 2;
        x <<= 2;
    }
    if (x <= 0x7FFFFFFF) {
        n += 1;
        x <<= 1;
    }
    return n;
}

byte-shift

實際代入數值跑一次看看吧！代入 x = 0x00000010：

x 不等於 0，故執行下一個檢查。n 的預設值為 1，可想成至少有一個 leading zero
檢查前 16 (= 32 - 16) bits 是否皆為 0：都為 0，則 n = 17，x = 0x00100000
檢查前 8 (= 32 - 24) bits 是否皆為 0：都為 0，則 n = 25，x = 0x10000000
檢查前 4 (= 32 - 28) bits 是否皆為 0：都為 0，不全為 0，跳至下一個檢查。
檢查前 2 (= 32 - 30) bits 是否皆為 0：都為 0，則 n = 27，x = 0x40000000
檢查此時的 x 最高位元是否為 1，是的話 leading zero 個數減掉 1 (此時 n = 27)。最後算出 leading zero 個數為 27

unsigned clz(uint32_t x)
{
    if (x == 0) return 32;
    int n = 1;
    if ((x >> 16) == 0) {
        n += 16;
        x <<= 16;
    }
    if ((x >> 24) == 0) {
        n += 8;
        x <<= 8;
    }
    if ((x >> 28) == 0) {
        n += 4;
        x <<= 4;
    }
    if ((x >> 30) == 0) {
        n += 2;
        x <<= 2;
    }
    n = n - (x >> 31);
    return n;
}

Harley's algorithm

unsigned clz(uint32_t x)
{
#ifdef CTZ
    static uint8_t const Table[] = {
        0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF,
        16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF,
        0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF,
        0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF,
        14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF,
        18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13,
        26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25,
        0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF,
    };

#else
    static uint8_t const Table[] = {
        32,31, 0,16, 0,30, 3, 0,15, 0, 0, 0,29,10, 2, 0,
        0, 0,12,14,21, 0,19, 0, 0,28, 0,25, 0, 9, 1, 0,
        17, 0, 4, 0, 0, 0,11, 0,13,22,20, 0,26, 0, 0,18,
        5, 0, 0,23, 0,27, 0, 6,0,24, 7, 0, 8, 0, 0, 0
    };
#endif

    /* Propagate leftmost 1-bit to the right */
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);

    /* x = x * 0x6EB14F9 */
    x = (x << 3) - x;   /* Multiply by 7. */
    x = (x << 8) - x;   /* Multiply by 255. */
    x = (x << 8) - x;   /* Again. */
    x = (x << 8) - x;   /* Again. */

    return Table[(x >> 26)];
}

比較 32-bit 數值對各個實作的效能差異

編譯程式及測試執行檔

取得程式碼之後，首先編譯執行檔，並在編譯完成後隨便拿一個執行檔(binary)來執行：

$ make
...
$ ./binary

出現以下錯誤訊息：

binary: main.c 84: main: Assertion 'argv[1] && argv[2] && "insert argument"' failed.
已經終止 (core dumped)

仔細看過 Makefile 與 main.c 後，重新編譯並執行：

$ make PROFILE=1
...
$ ./binary 100 1000

這次沒有再出現前述的錯誤訊息，但依然發生問題(錯誤訊息如下)：

0:32
1:31
2:30
4:29
8:28
...
1073741824:1
2147483648:0
不合法的命令 (core dumped)

輸入 make run 也會出現類似的上述訊息：

Illegal Instruction (core dumped)
Makefile:35: recipe for target 'run' failed
make: *** [run] Error 132

目前的想法是參考 zhanyangch 同學在 prefix-search 作業的作法：用 gdb 看 core dump file 來除錯。參考相關文件，以做 gdb 看 core dump 的設定，但設定好之後重新 make run，我找不到 core dump file 且錯誤訊息少了core dumped 的字樣：

Illegal instruction
Makefile:35: recipe for target 'run' failed
make: *** [run] Error 132

重新 make run，這次總算找到 core dump file，使用 gdb 除錯，看到以下訊息：

Core was generated by `./recursive 67100000 67116384'.
Program terminated with signal SIGILL, Illegal instruction.
#0  get_cycles_end (low=0x7ffdf186fbac, high=0x7ffdf186fba8) at main.c:24
24	    asm volatile("RDTSCP\n\t"

再搭配此參考資料，推測我的機器不支援 RDTSCP 這個指令。接下來打算先了解 RDTSC 和 RDTSCP 兩個指令的差異，並找方法取代 RDTSCP。

[2017.12.31 更新]

更換開發環境後，重新編譯，執行 binary，得到以下結果：

$ ./binary 100 1000
...
990 51 cycles
991 81 cycles
992 51 cycles
993 52 cycles
994 49 cycles
995 52 cycles
996 48 cycles
997 50 cycles
998 51 cycles
999 51 cycles
took 0.056648 million cycles

更換開發環境後，之前會 core dump 的問題沒有再發生。

效能分析

執行以下指令編譯程式碼：

$ make
$ make run
$ make plot

make run 中測試的數值範圍如下，從 67100000 到 67116384。

run: $(EXEC)
	for method in $(EXEC); do\
		taskset -c 1 ./$$method 67100000 67116384; \
	done

得到以下這張圖，由圖可看到 byte 和 binary 兩種方法大部分的數值計算 leading zero 所需的 cycle 數都落在 25 ~ 60 之間；而 iteration，harley 以及 recursive 則分別落在 60 ~ 125，60 ~ 110 和 60 ~ 125。

參考資料

巨人的肩膀
- 修課共筆：ktvexe，0xff07
- 同學共筆：st9007a，HTYISABUG，ChiuYiTang，HMKRL，amikai
Makefile
- Makefile 語法和示範
- Makefile 參考資料
  - subst, strip 函式
  - shell 函式
使用 gdb 看 core dump file