Try   HackMD

2016q3Homework1(CLZ)

tags: tundergod hw1 2016q3

contributed by <tundergod>

作業內容

作業要求

  • 閱讀 重新理解數值 裡頭關於 count leading zero (clz) 的描述,設計實驗,比較 32-bit 數值對以下實做的效能差異:
    • recursive version
    • iteration version
    • binary search technique
    • byte-shift version
    • Harley’s algorithm

測試方式

  • 走訪全部的 32-bit 數值,用上述演算法帶入計算 clz 值,先驗證正確性,如果演算法不正確,試圖改正
  • 比照 phonebook 和 compute-pi,設計一套 benchmark suite,得以針對所有的 32-bit 數值進行個別 clz 實做效能的分析,並透過 gnuplot 製圖
  • 要附上個別數值實際耗費時間,不能只列出平均值
  • 落點分析圖,類似 tcp-anaysis (with-code)
  • 為了避免編譯器最佳化的影響,務必指定編譯參數 -O0 來抑制最佳化

CLZ測試

  • 本次測試基於wikipeida上給的五個clz算法
  • clz 1:一個一個bit去位移,理論上是最慢的版本
int clz1()
{
        n=0;
        if (x == 0) return 32;
        for (n = 0; ((x & 0x80000000) == 0); n++, x <<= 1);
        return n;
}
  • clz 2:clk1的進階版,以每次4bit的方法去位移
int clz2()
{
        n=0;
         if (x == 0) return 32;
          for (n = 0; ((x & 0xF0000000) == 0); n += 4, x <<= 4);
          n += (int)clz_table_4bit[x >> (32-4)];
          return n;
}
  • clz3:通過binary search的原理,第一次查看16bit,第二次8bit以此類推的做下去
int clz3()
{
   if (x == 0) return(32);
   n = 0;
   if (x & 0x0000FFFF) {n = n +16; x = x <<16;}
   if (x & 0x00FFFFFF) {n = n + 8; x = x << 8;}
   if (x & 0x0FFFFFFF) {n = n + 4; x = x << 4;}
   if (x & 0x3FFFFFFF) {n = n + 2; x = x << 2;}
   if (x & 0x7FFFFFFF) {n = n + 1;}
   return n;
}
  • clz4:利用一個lookup table的幫助算法的增強
int clz4()
{
        n=0;
          if ((x & 0xFFFF0000) == 0) {n  = 16; x <<= 16;} else {n = 0;}
          if ((x & 0xFF000000) == 0) {n +=  8; x <<=  8;}
          if ((x & 0xF0000000) == 0) {n +=  4; x <<=  4;}
          n += (int)clz_table_4bit[x >> (32-4)];
          return n;
}
  • clz5:The fastest practical approach to simulate clz uses a precomputed 64KB lookup table
int clz5()
{
        n=0;
        if ((x & 0xFFFF0000) == 0)
                return (int)clz5_table_16bit[x] + 16;
        else
                return (int)clz5_table_16bit[x >> 16];
}
  • clz5提到的16bit loopup table:
static uint8_t clz5_table_16bit[65536] =
{16,15,14,14,13,13,13,13,12,12,12,12,..........
	..................................0,0,0,0,0,0};
  • 測試環境:
    • for i in `seq 160000 100000 20000000`;
  • 測試結果:
    • 當數值爲0x00000001時(ans = 31):
    • 當數值爲0x00000080時(ans = 24):
    • 當數值爲0x00008000時(ans = 16):
    • 當數值爲0x00800000時(ans = 8):
    • 當數值爲0x80000000時(ans = 0):

可以看到無論clz所求值爲多少,執行時間幾乎不變.
而意外的是最簡單的單位元位移算法竟然是最快的lim wen sheng

撰寫作業要求程式

Recursion Version

uint8_t clz(uint32_t x,int half)
{
	if(x == 0) return 32;
	//end recursive after fifth
	if(half == 0) return 0;
	/* shift upper half down, rest is filled up with 0s */
	uint16_t upper = (x >> half);
	// mask upper half away
	uint16_t lower = (x & (0x0000FFFF >> (16 - half)));
	//if upper have value move it to lower else ans+16 exe recursive for another half
	return upper ? clz(upper,half>>1) : (half) + clz(lower,half>>1);
}

Iteration Version

uint8_t clz(uint32_t x)
{
	int n = 0;
	if (x == 0) return 32;
	//every iterative left shift 1 bit if val = 0, ans++
	for (n = 0; ((x & 0x80000000) == 0); n++, x <<= 1);
	return n;
}

Binary Search Technique

uint8_t clz(uint32_t x)
{
	int n = 0;
	if (x == 0) return(32);
	//32 -> 16
	if (x & 0x0000FFFF) {n = n +16; x = x << 16;}
	//16 -> 8
	if (x & 0x00FFFFFF) {n = n + 8; x = x << 8;}
	//8 -> 4 
	if (x & 0x0FFFFFFF) {n = n + 4; x = x << 4;}
	//4 -> 2
	if (x & 0x3FFFFFFF) {n = n + 2; x = x << 2;}
	//2 -> 1
	if (x & 0x7FFFFFFF) {n = n + 1;}
	return n;
}

Byte-Shift Version

  • 單純用byte-shift做不出來

參考學長筆記看到的版本都長這樣但是不知道爲什麼這是byte-shiftlim wen sheng

unsigned clz(uint32_t x) 
{
    if (x == 0) return 32;
    int n = 1;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8; x <<= 8; }
    if ((x >> 28) == 0) { n += 4; x <<= 4; }
    if ((x >> 30) == 0) { n += 2; x <<= 2; }
    n = n - (x >> 31);
    return n;
}

Harley's Algorithm

在上述網站可以看到6種類似Harley's Algorithm的做法,原理還在研究中lim wen sheng

#include <stdio.h>
#include <stdint.h>

uint8_t clz(uint32_t x)
{
    static unsigned char table [48] = {
        32,  6,  5,  0,  4, 12,  0, 20,
        15,  3, 11,  0,  0, 18, 25, 31,
        8, 14,  2,  0, 10,  0,  0,  0,
        0,  0,  0, 21,  0,  0, 19, 26,
        7,  0, 13,  0, 16,  1, 22, 27,
        9,  0, 17, 23, 28, 24, 29, 30
    };

    x = x | (x >> 1);    // Propagate leftmost
    x = x | (x >> 2);    // 1-bit to the right.
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x & ~(x >> 16);
    x = x * 0x3EF5D037;
    return table[x >> 26];
}

圖表

  • recursive version

  • iteration version

  • binary search technique

  • byte-shift version

  • Harley’s algorithm

  • Compare All: