Try   HackMD

2017q3 Homework1 (phonebook)

contributed by <HMKRL>

系統環境

$ uname -a
Linux alfheim 4.10.0-33-generic #37-Ubuntu SMP Fri Aug 11 10:55:28 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping:              9
CPU MHz:               899.951
CPU max MHz:           3800.0000
CPU min MHz:           800.0000
BogoMIPS:              5616.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

$ free
              total        used        free      shared  buff/cache   available
Mem:       16235072     2420160    12605616      487012     1209296    13048120
Swap:       2097148           0     2097148

對預設 Makefile 進行修改

  • 進行測試時,發現雖然對程式碼做了修改,但 make plot 畫出的圖並沒有變化,因此在 cache-test 的部份加入了以下指令刪除原本的紀錄檔:
cache-test: $(EXEC)
	rm -f ./*.txt
	perf stat --repeat 100 \
		-e cache-misses,cache-references,instructions,cycles \
		./phonebook_orig
	perf stat --repeat 100 \
		-e cache-misses,cache-references,instructions,cycles \
		./phonebook_opt
  • 為了方便觀察結果,修改了 make plot 部份,利用 eog 在分析圖產生後直接顯示:
plot: output.txt
	gnuplot scripts/runtime.gp
	eog ./runtime.png

請注意到 Don't invoke program eog. You should not assume the availability of X11.
課程助教

減小 size of entry

  • 想法:除了 lastName 以外的欄位都不是搜尋時會用到的
    因此另外定義 struct __PHONE_BOOK_ADDITION 儲存這些欄位
    並在原本的資料結構中預留指標,有需要時再分配記憶體。

  • 優點:
    size of entry 縮小後由於佔用 cache 較小,應能降低cache-miss

  • 缺點:
    需要多一次 malloc 來分配記憶體給這些欄位,可能增加 append() 時間

  • 實測
    將資料結構改為以下型態

    ​​​​typedef struct __PHONE_BOOK_ADDITION {
    ​​​​    char firstName[16];
    ​​​​    char email[16];
    ​​​​    char phone[10];
    ​​​​    char cell[10];
    ​​​​    char addr1[16];
    ​​​​    char addr2[16];
    ​​​​    char city[16];
    ​​​​    char state[2];
    ​​​​    char zip[5];
    ​​​​} addition;
    
    ​​​​typedef struct __PHONE_BOOK_ENTRY {
    ​​​​    char lastName[MAX_LAST_NAME_SIZE];
    ​​​​    struct __PHONE_BOOK_ADDITION *addtionInfo;
    ​​​​    struct __PHONE_BOOK_ENTRY *pNext;
    ​​​​} entry;
    

    entry size 變為 32 bytes, cache-miss 則變為 64.499 %

    這樣的調整使得「entry size 變為 32 bytes」,是否還能縮減呢?與其用指標 addtionInfo 來指向完整資料的記憶體空間,不如換用可預先計算範圍的數值,並用 uint16_t (或更小的型態) 來記錄。
    "jserv"

    若能事先確認需額外儲存的資料量,則可以採用 uint16_t 等資料型態,但需特別注意:若為做額外處理,entry size 會因為 alignments 而保持在 32 bytes , 無法有效減小:

    ​​typedef struct __PHONE_BOOK_ENTRY {
    ​​  char lastName[MAX_LAST_NAME_SIZE];
    ​​  uint16_t index;
    ​​  struct __PHONE_BOOK_ENTRY *pNext;
    ​​} entry;
    

    執行結果:

    ​​size of entry : 32 bytes
    ​​execution time of append() : 0.033276 sec
    ​​execution time of findName() : 0.001591 sec
    

    為了能確實減小尺寸,可以加入 __attribute__((__packed__)) (reference)

    ​​typedef struct __attribute__((__packed__)) __PHONE_BOOK_ENTRY {
    ​​  char lastName[MAX_LAST_NAME_SIZE];
    ​​  uint16_t index;
    ​​  struct __PHONE_BOOK_ENTRY *pNext;
    ​​} entry;
    

    執行結果:

    ​​size of entry : 26 bytes
    ​​execution time of append() : 0.029881 sec
    ​​execution time of findName() : 0.001542 sec
    

應該用 perf 比較 cache miss 一類的落差
"jserv"

透過perf驗證後,發現雖然減小了entry size, 但cache miss及時間表現卻沒有進展(甚至有些微落後)

32-byte entry cache miss:

Performance counter stats for './phonebook_orig' (100 runs):

         1,665,405      cache-misses              #   72.560 % of all cache refs      ( +-  0.17% )
         2,295,210      cache-references                                              ( +-  0.25% )
       242,975,461      instructions              #    1.57  insn per cycle           ( +-  0.02% )
       155,039,764      cycles                                                        ( +-  0.55% )

26-byte entry cache miss:

Performance counter stats for './phonebook_opt' (100 runs):

         1,667,705      cache-misses              #   73.910 % of all cache refs      ( +-  0.19% )
         2,256,388      cache-references                                              ( +-  0.22% )
       242,905,795      instructions              #    1.55  insn per cycle           ( +-  0.02% )
       157,022,276      cycles                                                        ( +-  0.66% )

猜想原因:雖然entry size透過取消alignment減小,但CPU存取cache時仍然採用aligned模式,因此沒有幫助

嘗試採用hash table加速

實作 hash table, 採用 dj2b hash functionlastName 字串做 hash

並建立 Size 為2048的Hash table

entry *findName(char lastName[], entry **pArr)
{
    entry *pHead = pArr[hash(lastName) % HASH_TABLE_SIZE];
    while (pHead != NULL) {
        if (strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}
entry *append(char lastName[], entry **pArr)
{
    /* allocate memory for the new entry and put lastName */
    int index = hash(lastName) % HASH_TABLE_SIZE;
    entry *e = (entry *) malloc(sizeof(entry));
    e->pNext = pArr[index];
    pArr[index] = e;
    strcpy(e->lastName, lastName);

    return e;
}

結果:

orig opt hash
cache miss 94.140% 67.303% 73.737%
append time 0.039918 0.032211 0.038637
find time 0.005459 0.002070 0.000000

比起縮小 entry size 後的版本,append() 時間增加,findName() 時間則大幅縮短