2017q3 Homework1 (clz)

tags: `sysprogram`

contributed by <st9007a>

Github

Reviewed by `amikai`

726ac7e 此 commit 做的是對整個架構的大修改, 建議加上修改的動機, 讓其他協作者 (如果很多人開發) 了解為什麼要有這些修改
8 bit, 16bit, 32bit 三張圖, 沒有對三張圖的比較做詳細的解釋, 講述了數據不穩定, 但沒有說緣由

以上為小弟簡短的建議, 如有冒犯請見諒

開發環境

Linux 4.4.0-92-generic
Ubuntu 16.04.1 LTS
L1d cache:  32K
L1i cache:  32K
L2 cache:   256K
L3 cache:   6144K

針對 32-bits 數值對各版本之 clz 比較效能

利用 fork 過來的程式，可以快速的實測各版本的 clz 對 32-bits 整數的效能比較

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

從上圖可以知道 byte-shift 跟 binary search 所使用的 cycle 數平均是比較少的，然後 cycle 數 400 到 500 之間也有點分布，試著隨便測其他區間：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

分布狀況看起來都差不多，不過一部分的數據會跳到 400 至 500 cycles 之間，這是由於作業系統multi task，如果有其他 task 要優先值行的話，作業系統會先中斷現在執行的 task，等完成其他 task 在跳回來繼續執行，所以花費的 cycle 數會比較高

解析原始碼

執行完效能比較後，來看看 clz.h 裡面是怎麼實作個版本的 clz 演算法

recursive

static const int mask[] = {0, 8, 12, 14};
static const int magic[] = {2, 1, 0, 0};

//static inline __attribute((always_inline))
unsigned clz2(uint32_t x,int c)
{
    // *INDENT-OFF*
    if (!x && !c) return 32;

    uint32_t upper = (x >> (16 >> c));
    uint32_t lower = (x & (0xFFFF>>mask[c]));
    if (c == 3) return upper ? magic[upper] : 2 + magic[lower];
    return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1);
    // *INDENT-ON*
}

c 這個變數紀錄每次遞迴的層數，每次遞迴會將 x 分成前半部( upper )跟後半部( lower )，如果前半部不為零，則回傳前半部的 clz，若前半部為零，則回傳前半部的 bit 數量加上後半部的 clz，而倒數第二行的停止條件則是第四次遞迴時，前半部跟後半部只剩 2 個 bits，所以直接回傳查表對應的 clz

iteration

static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
    // *INDENT-OFF*
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) { n -= c; x = y; }
        c >>= 1;
    } while (c);
    return (n - x);
    // *INDENT-ON*
}

迴圈裡面的 y 用來記錄 x 的前半部，n 用來記錄 clz，如果前半部不為零就用前半部去做下一個迴圈，同時 n 減去後半部的 bit 的數量( 因為至少後半部的 bit 不會算進 clz )，迴圈做到最後當 c 變成 0 跳出迴圈時 x 還剩下一個 bit， n 直接減去最後一個 bit 的值就能得到 clz

byte shift

static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
    // *INDENT-OFF*
    if (x == 0) return 32;
    int n = 1;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8; x <<= 8; }
    if ((x >> 28) == 0) { n += 4; x <<= 4; }
    if ((x >> 30) == 0) { n += 2; x <<= 2; }
    n = n - (x >> 31);
    return n;
    // *INDENT-ON*
}

n 用來記錄 clz，利用 if 來分別確認 x 的前 16 , 8 , 4 , 2 個 bits 是否為零，如果為零則將他們加入 n，並且左移 x 確保下個 if 能檢查到正確的位置，如果不為零則直接做下一個 if 檢查，最後 x 剩下最前面的 1 bit 沒有被檢查到，做法跟 iteration 一樣，直接減去 bit 的值就是 clz

binary search

static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
    // *INDENT-OFF*
    if (x == 0) return 32;
    int n = 0;
    if (x <= 0x0000FFFF) { n += 16; x <<= 16; }
    if (x <= 0x00FFFFFF) { n += 8; x <<= 8; }
    if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; }
    if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; }
    if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; }
    return n;
    // *INDENT-ON*
}

原理跟 byte shift 一樣，只是改成用比大小的方式來確認前 16 , 8 , 4 , 2 個 bits 是否為零，最後一個 if 取代 byte shift 版本的 n = n - (x >> 31); 檢查最前面的 bit

harely

static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
    // *INDENT-OFF*
    static uint8_t const Table[] = {
        32, 31,  0, 16,  0, 30,  3, 0, 15,  0,  0,  0, 29, 10, 2,  0,
         0,  0, 12, 14, 21,  0, 19, 0,  0, 28,  0, 25,  0,  9, 1,  0,
        17,  0,  4,  0,  0,  0, 11, 0, 13, 22, 20,  0, 26,  0, 0, 18,
         5,  0,  0, 23,  0, 27,  0, 6,  0, 24,  7,  0,  8,  0, 0,  0
    };
    
    // *INDENT-ON*
    /* Propagate leftmost 1-bit to the right */
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);

    /* x = x * 0x6EB14F9 */
    x = (x << 3) - x;   /* Multiply by 7. */
    x = (x << 8) - x;   /* Multiply by 255. */
    x = (x << 8) - x;   /* Again. */
    x = (x << 8) - x;   /* Again. */

    return Table[(x >> 26)];
}

從演算法名稱可以預期到 0x6EB14F9 是 De Bruijn 數列，一開始前五個 or 運算將 x 的最高有效位往右的 bit 都變成 1，接著可以從作業共筆知道存在著 De Bruijn 數列 M 使得

(a \times M) >> n

出來的結果彼此不衝突，其中

a = 0. . .000111 . . {.1}_{2}

，在這裡

M = 6 E B 14 F 9_{16}, n = 26

，接著 hash 出來的結果直接去查表就能得到 clz

修改程式碼

由於作業要求有提到 C11 的 _Generic，研究了一下把程式改成支援 8-bit, 16-bit, 32-bit, 64-bit unsigned 整數，首先將標頭檔套用 _Generic

#ifndef CLZ_H
#define CLZ_H

#include <stdint.h>

#if defined(recursive)
#define clz(x) \
    _Generic((x), \
        uint8_t: clz_u8, \
        uint16_t: clz_u16, \
        uint32_t: clz_u32, \
        uint64_t: clz_u64 \
    )(x, 0)

unsigned clz_u8(uint8_t x, int c);
unsigned clz_u16(uint16_t x, int c);
unsigned clz_u32(uint32_t x, int c);
unsigned clz_u64(uint64_t x, int c);

#else
#define clz(x) \
    _Generic((x), \
        uint8_t: clz_u8, \
        uint16_t: clz_u16, \
        uint32_t: clz_u32, \
        uint64_t: clz_u64 \
    )(x)
unsigned clz_u8(uint8_t x);
unsigned clz_u16(uint16_t x);
unsigned clz_u32(uint32_t x);
unsigned clz_u64(uint64_t x);

#endif
#endif

接著分別在 iteration.c, recursive.c, byte.c , binary.c 根據原本的函式修改出 8-bit, 16-bit, 64-bit的版本，由於全部程式碼太多行，且只是按 32-bit 改出其它版本，所以就不附上程式碼，harely 還沒實作，需先找到對應的 De Bruijn 數列
改好程式碼後，把 Makefile 中的 -std=gnu99 改成 -std=c11，然後來編譯看看

cc -Wall -std=c11 -g -DDEBUG -O0 -o iteration \
        -MMD -MF .iteration.d \
        -Diteration main.c
main.c: In function ‘get_cycles’:
main.c:14:5: error: ‘asm’ undeclared (first use in this function)
     asm volatile ("CPUID\n\t"
     ^
main.c:14:5: note: each undeclared identifier is reported only once for each function it appears in
main.c:14:9: error: expected ‘;’ before ‘volatile’
     asm volatile ("CPUID\n\t"
         ^
main.c: In function ‘get_cycles_end’:
main.c:24:5: error: ‘asm’ undeclared (first use in this function)
     asm volatile("RDTSCP\n\t"
     ^
main.c:24:9: error: expected ‘;’ before ‘volatile’
     asm volatile("RDTSCP\n\t"
         ^
Makefile:30: recipe for target 'iteration' failed
make: *** [iteration] Error 1

會發現他不認得 asm 這個保留字，查了一下原因，知道了 asm 是 C 語言的 extension，要使用 extension 必須把 gcc 的 -std=c11 改成 -std=gnu11，接著再試著編譯看看

make: Warning: File 'Makefile' has modification time 0.76 s in the future
cc -Wall -std=gnu11 -g -DDEBUG -O0 -o iteration \
        -MMD -MF .iteration.d \
        -Diteration main.c
/tmp/ccU6IkJY.o: In function `main':
/home/st9007a/project/sysprog/clz-tests/main.c:107: undefined reference to `clz_u32'
collect2: error: ld returned 1 exit status
Makefile:30: recipe for target 'iteration' failed
make: *** [iteration] Error 1

發現它竟然認不得 clz_u32，所以打開 Makefile 看看是什麼狀況

%: $(SRCS_common) %.c clz.h
      $(CC) $(CFLAGS_common) -o $@ \
          -MMD -MF .$@.d \
          -D$(shell echo $(subst .o,,$@)) $(SRCS_common)

原來實作 clz function 的檔案沒有被一起編譯，所以改成如下

%: $(SRCS_common) %.c clz.h
    $(CC) $(CFLAGS_common) -o $@ \
        -MMD -MF .$@.d \
        -D$(shell echo $(subst .o,,$@)) $(SRCS_common) $@.c

在最後面加入 $@.c，把實作 clz function 的檔案一起編譯

$ make
cc -Wall -std=gnu11 -g -DDEBUG -O0 -o iteration \
        -MMD -MF .iteration.d \
        -Diteration main.c iteration.c
cc -Wall -std=gnu11 -g -DDEBUG -O0 -o binary \
        -MMD -MF .binary.d \
        -Dbinary main.c binary.c
cc -Wall -std=gnu11 -g -DDEBUG -O0 -o byte \
        -MMD -MF .byte.d \
        -Dbyte main.c byte.c
cc -Wall -std=gnu11 -g -DDEBUG -O0 -o recursive \
        -MMD -MF .recursive.d \
        -Drecursive main.c recursive.c

大功告成！來用原本的測試測試看看如何

跟先前的版本差不多，看來效能沒有因為我的修改被改爛，接著來測看看不同的bit width，首先把 main.c 做一點小修改

#if defined(u8)
    uint8_t i;
#elif defined(u16)
    uint16_t i;
#elif defined(u32)
    uint32_t i;
#else
    uint64_t i;
#endif

把要測試的數字根據不同 bit width 宣告成不同型態，Makefile 也加入對應 flag 來自動化

bit_width ?= 32
from ?= 67100000
to ?= 67110000

CFLAGS_common += -Du$(bit_width)

run: $(EXEC)
    for method in $(EXEC); do\
        taskset -c 1 ./$$method $(from) $(to); \
    done

8-bit 結果

16-bit 結果

64-bit 結果

可以看到效能是差不多的，8-bit 由於數字只有 256 個，所以只看 100 到 200 的範圍，其數據比較不穩定

應用場合

Division Algorithm

參考 A Fast Hi Precision Fixed Point Divide 中，對於硬體實作的除法演算法要做 32 個 iterations 花費 97 個 cycles，程式如下：

@ Entry  r0: numerator (lo) must be signed positive
@        r2: deniminator (den) must be non-zero and signed negative
idiv:
        lo .req r0; hi .req r1; den .req r2
        mov hi, #0 @ hi = 0
        adds lo, lo, lo
        .rept 32 @ repeat 32 times
          adcs hi, den, hi, lsl #1
          subcc hi, hi, den
          adcs lo, lo, lo
        .endr
        mov pc, lr @ return
@ Exit   r0: quotient (lo)
@        r1: remainder (hi)

在被除數與除數有 leading zero 的狀況下，這個演算法前幾個( 被除數與除數共有的leading zero 數量 ) iterations 就等同於只單純的左移 hi 跟 lo 這兩個暫存器，因此可以預先計算好 clz 然後先移動 hi , lo 來減少 iterations 的數量，參考資料中提到，在 ARM v5 架構下使用 clz 指令可以減少 9 到 18 個 cycles

Integer Square Root

整數平方根的算法有很多種，這裡指的是牛頓法( Newton's Method )，牛頓法的概念是數字

a

的平方根可以透過給定一個初始值

g_{0}

並反覆計算以下公式來找到：

g_{n + 1} = (g_{n} + a / g_{n}) \times 0.5

在計算過程中

g

的值會趨近於

a

的平方根，因此可以知道

g_{0}

的值會影響運算時間，在 Hacker's Delight 281頁中提到，應找大於等於

a

之平方根的

2^{n}

整數作為

g_{0}

，尋找的方法就可以透過 clz 去算出來

附上程式碼：

int isqrt(unsigned x)
{
    int s, g0, g1;
    
    if (x <= 1) return x;
    s = 16 - clz(x - 1) / 2;
    
    g0 = 1 << s;
    g1 = (g0 + x >> s) >> 1;
    
    while (g1 < g0) {
        g0 = g1;
        g1 = (g0 + x / g0) >> 1;
    }
    
    return g0;
}

關鍵在於 s = 16 - clz(x - 1) / 2 這個部分， s 的目的是算出最接近的數字符合

2^{s} >= \sqrt{x}

，可以將程式碼換成數學式如下：

s = 16 - ⌊ \frac{1}{2} (31 - ⌊ l o g_{2} (x - 1) ⌋) ⌋ >= 16 - \frac{1}{2} (31 - l o g_{2} (x - 1))

= \frac{1}{2} + l o g_{2} \sqrt{x - 1}

所以

2^{s} >= \sqrt{2} + \sqrt{x - 1} >= \sqrt{2} + \sqrt{x} - 1 > \sqrt{x}

參考資料

ktvexe 同學共筆
 baimao8437 同學共筆
 wrong clock cycle measurements with rdtsc
C11 - Generic Selections
stackoverflow – Writing function definition in header files in C++
what “inline __attribute__((always_inline))” means in the function?
C 語言的各整版本
 A Fast Hi Precision Fixed Point Divide
ARM Keil
carolc0708 同學共筆

2017q3 Homework1 (clz)

tags: sysprogram

Github

Reviewed by amikai

開發環境

針對 32-bits 數值對各版本之 clz 比較效能

解析原始碼

recursive

iteration

byte shift

binary search

harely

修改程式碼

8-bit 結果

16-bit 結果

64-bit 結果

應用場合

Division Algorithm

Integer Square Root

參考資料

tags: `sysprogram`

Reviewed by `amikai`