2017q1 Homework1 (phonebook)

contributed by <zmke>

Reviewed by `ryanwang522`

參考 Cache 的說明，Cache 是為了保留最可能被存取的資料（Most likely to be referenced），避免較耗時的記憶體操作。
可以利用 Hash function 來增進 findName() 的效率。
這裡實作的 BST 缺乏對 Cache line 大小的配合，可以透過更改 BST node 結構以符合 Cache line，參考關於 CPU Cache。

開發環境

OS: Ubuntu 16.04 LTS
Architecture: x86_64
CPU 作業模式： 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel® Core™ i5-4200H CPU @ 2.80GHz
L1d 快取： 32K
L1i 快取： 32K
L2 快取： 256K
L3 快取： 3072K

cache

~~cache 是為了讓資料存取的速度適應 CPU 的處理速度~~

這句話不精確，請參照 eecs-370 的說明 Cache，搭配短片 What is cache memory，再來回顧你原本的論述 "jserv"

static random access memory (SRAM) 速度快、成本高
電腦的 main memory 主要是 DRAM ，跟不上 CPU 存取的速度， cache 採用的是 SRAM
CPU 在 cache 找不到需要的資料時，會到 main memory 去找，但 CPU 和 main memory 的速度差了好幾個數量級

Von Neumann bottleneck ： main memory 的速度跟不上 CPU 的速度，造成 CPU performance 的限制

察看記憶體資訊$ dmidecode --type 17 | more

Memory Device
Array Handle: 0x0042
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: SODIMM
Set: None
Locator: ChannelA-DIMM0
Bank Locator: BANK 0
Type: DDR3
Type Detail: Synchronous
Speed: 1600 MHz
Manufacturer: Transcend
Serial Number: 0009FB43
Asset Tag: 9876543210
Part Number: TS512MSK64W6H     
Rank: 1
Configured Clock Speed: 1600 MHz

以我的電腦為例， CPU 時脈是 2.8GHz 但 main memory 時脈只有 1600 MHz

function of the cache
- The cache will hold data that we think is most likely to be referenced.
- Minimize the average memory access latency
- Maximize the number of references that are serviced by the cache
N-Way Set Associative : 避免 Fully Associative 和 Direct Mapped 的缺陷，將 cache 分成 N 個 set
cache miss: 在 cache 找不到需要的資料，需要到 main memory 尋找
檢查電腦的 cache : $sudo dmidecode -t cache | more

Cache Information
	Socket Designation: CPU Internal L1
	Configuration: Enabled, Not Socketed, Level 1
	Operational Mode: Write Back
	Location: Internal
	Installed Size: 128 kB
	Maximum Size: 128 kB
	Supported SRAM Types:
		Unknown
	Installed SRAM Type: Unknown
	Speed: Unknown
	Error Correction Type: Single-bit ECC
	System Type: Other
	Associativity: 8-way Set-associative

原始版本

中英文字間請已空白隔開
課程助教

先清空 cache:$ echo 1 | sudo tee /proc/sys/vm/drop_caches
執行:$ ./phonebook_orig

size of entry : 136 bytes
execution time of append() : 0.066602 sec
execution time of findName() : 0.005839 sec

測試 cache miss rate: $ make cache-test
- 原始版本 cache miss rate 85%

Performance counter stats for './phonebook_orig' (100 runs):

         1,025,260      cache-misses              #   85.415 % of all cache refs      ( +-  0.35% )
         1,200,333      cache-references                                              ( +-  0.29% )
       262,101,582      instructions              #    1.40  insn per cycle           ( +-  0.02% )
       187,299,573      cycles                                                        ( +-  0.14% )

       0.057128980 seconds time elapsed                                          ( +-  1.18% )

利用 perf 找熱點
$ ./phonebook_orig & sudo perf top -p $!
- findName 佔19.16%， append 佔3.32%，但 append 實際執行時間卻較長

  30.72%  libc-2.23.so      [.] __strcasecmp_l_avx
  19.16%  phonebook_orig    [.] findName
   6.63%  [kernel]          [k] clear_page_c_e
   6.52%  phonebook_orig    [.] main
   5.67%  libc-2.23.so      [.] _IO_fgets
   5.30%  libc-2.23.so      [.] _IO_getline_info
   3.71%  libc-2.23.so      [.] _int_malloc
   3.32%  phonebook_orig    [.] append
   2.54%  [kernel]          [.] native_irq_return_iret
   2.46%  [kernel]          [k] unmap_page_range
   2.46%  [kernel]          [k] free_pcppages_bulk
   1.66%  [kernel]          [k] release_pages
   1.65%  phonebook_orig    [.] strcasecmp@plt
   1.35%  libc-2.23.so      [.] __memcpy_sse2
   0.93%  libc-2.23.so      [.] memchr
   0.91%  [kernel]          [k] do_page_fault
   0.86%  [kernel]          [k] __do_page_fault
   0.83%  [unknown]         [k] 0x00007f29e9ab0e02
   0.83%  [unknown]         [k] 0x00007f29e9ab23c8
   0.83%  [kernel]          [k] page_remove_rmap
   0.82%  [kernel]          [k] __find_get_block
   0.82%  [kernel]          [k] free_hot_cold_page
   0.00%  [kernel]          [k] native_write_msr

method1 - 減少 entry 大小

findName 的時候都只用 lastName 去找，原本的結構有很多資料，如果將 entry 縮小，就有更多 entry 可以被放進 cache ，預期能降低 cache miss rate

















typedef struct __ENTRY_DETAIL {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} entryDetail;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    entryDetail *detail;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

把~~沒有用到~~ lastName 以外的資料移到 entryDetail ，原本的結構內剩下 lastName 和 entryDetail 的指標， append() 和 findName() 和原始版本相同

$ ./phonebook_opt

不是「沒有用到」，請思考用語的精確性 "jserv"

size of entry : 32 bytes
execution time of append() : 0.066095 sec
execution time of findName() : 0.002160 sec

$ make plot

size of entry 從 136 bytes 縮小到 32 bytes ， append() 和 findName() 的時間都降低了
$ make cache-test

Performance counter stats for './phonebook_opt' (100 runs):

           145,400      cache-misses              #   46.061 % of all cache refs      ( +-  0.34% )
           315,669      cache-references                                              ( +-  0.56% )
       244,493,450      instructions              #    1.90  insn per cycle           ( +-  0.02% )
       128,986,123      cycles                                                        ( +-  0.31% )

       0.039869771 seconds time elapsed                                          ( +-  0.62% )

cache miss rate 從 84% 降低到 46%

method2 - Binary Search Tree

原本的結構是 singly linked list ，搜尋時是 liner search ，將結構改成 Binary Search Tree 預期能降低 findName 的時間
因為字典檔內的 lastName 是已經排序過的字串，直接用來建 BST 的話會使 BST 變成 list ， findName 無法跳脫 liner search ，因此嘗試建出左右平衡的 BST
- reference
  - Sorted Linked List to Balanced BST
  - Convert Sorted List to Binary Search Tree
























bst *list_to_BST(entry *pHead)
{
    int listLen = 0;
    entry *cur = pHead;
    while(cur) {
        listLen++;
        cur = cur->pNext;
    }
    return build(&pHead, listLen);
}

bst *build(entry **pHead, int listLen)
{
    if(listLen <= 0)
        return NULL;

    bst *left = build(pHead, listLen/2);
    bst *root = (bst *) malloc(sizeof(bst));
    strcpy(root->lastName, (*pHead)->lastName);
    root->left = left;
    *pHead = (*pHead)->pNext;
    root->right = build(pHead, listLen-listLen/2-1);
    return root;
}

維持原始版本的 entry structure ，使用二元搜尋樹進行實驗

$ ./phonebook_bst

size of entry : 136 bytes
execution time of append() : 0.066551 sec
execution time of findName() : 0.000000 sec

$ make plot

發現 findName() 的時間明顯降低，但 append() 較前兩個情形稍微上升

$ make cache-test


Performance counter stats for './phonebook_bst' (100 runs):

           770,479      cache-misses              #   80.189 % of all cache refs      ( +-  0.34% )
           960,834      cache-references                                              ( +-  1.09% )
       359,999,587      instructions              #    1.21  insn per cycle           ( +-  0.01% )
       297,319,379      cycles                                                        ( +-  1.15% )

       0.092091102 seconds time elapsed                                          ( +-  1.28% )

與原始版本相比， cache miss rate 有些微下降（85.41% -> 80.19%）

使用 method1 縮小過後的 entry structure 進行實驗

$ ./phonebook_bst

size of entry : 32 bytes
execution time of append() : 0.065951 sec
execution time of findName() : 0.000000 sec

$ make plot

append() 的時間較前兩個情況下降了

$ make cache-test

Performance counter stats for './phonebook_bst' (100 runs):

           302,300      cache-misses              #   65.713 % of all cache refs      ( +-  0.19% )
           460,034      cache-references                                              ( +-  0.57% )
       342,481,904      instructions              #    1.59  insn per cycle           ( +-  0.01% )
       214,799,756      cycles                                                        ( +-  0.32% )

       0.065648659 seconds time elapsed                                          ( +-  0.39% )

同時使用 bst 和較小的 entry ， cache miss rate 較只用 bst 明顯下降

依實驗結果來看，目前讓 findName() 進步最多的是將 linked list 轉成 binary search tree 有效節省 findName() 所需的時間，將 entry 縮小除了能夠降低 cache miss rate 之外，連 cache reference 的次數也跟著降低了

memory leak

valgrind ：可以用來檢查程式記憶體
$ valgrind --leak-check=yes ./phonebook_orig
修正 main.c 最後的 release memory 的部份

while(pHead) {
		entry *tmp = pHead;
		pHead = pHead->pNext;
		free(tmp);
	}

valgrind 結果

==9712== All heap blocks were freed -- no leaks are possible

為 binary search tree 增加 releaseTree() ，確保記憶體完整被釋放

void releaseTree(bst *root)
{
    if(root == NULL) return;
    releaseTree(root->left);
    releaseTree(root->right);
    free(root);
}

valfrind 結果

==9716== All heap blocks were freed -- no leaks are possible

本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題？

有代表性嗎？跟真實英文姓氏差異大嗎？
- 裡面有部份是沒意義的單字
資料難道「數大就是美」嗎？
- 並不完全正確，像是 AlphaGo 就是靠大量的對戰資料來進步，但資料的質量也很重要，在做訊號分析的時候常常需要 filter 來過濾，我想資料也是一樣，至少不能和現實相差太大
如果我們要建立符合真實電話簿情境的輸入，該怎麼作？
- 要去找找有沒有真實姓名統計資料，把這些名字做成字典檔
- 可以產生名字的網站
- English Baby Names
- 如果會爬蟲的技術，可以從這些網站把資料整理下來

Reference

Cache
von Neumann bottleneck
What is cache memory – Gary explains
Binary Tree
C Binary Tree with an Example C Code
ktvexe 筆記
 twzjwang 筆記
 檢查程式記憶體的小工具-valgrind