# 2017q1 Homework1 (phonebook) contributed by <`zmke`> ### Reviewed by `ryanwang522` * 參考 [Cache](https://stevenschmatz.gitbooks.io/eecs-370/content/Lecture_Notes/lecture_17.html) 的說明,Cache 是為了保留最可能被存取的資料(Most likely to be referenced),避免較耗時的記憶體操作。 * 可以利用 Hash function 來增進 `findName()` 的效率。 * 這裡實作的 BST 缺乏對 Cache line 大小的配合,可以透過更改 BST node 結構以符合 Cache line,參考 [關於 CPU Cache](http://cenalulu.github.io/linux/all-about-cpu-cache/)。 ## 開發環境 OS: Ubuntu 16.04 LTS Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 Model name: Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 3072K ## cache * ~~cache 是為了讓資料存取的速度適應 CPU 的處理速度~~ >> 這句話不精確,請參照 eecs-370 的說明 [Cache](https://stevenschmatz.gitbooks.io/eecs-370/content/Lecture_Notes/lecture_17.html),搭配短片 [What is cache memory](https://www.youtube.com/watch?v=roeZs-eL-lw),再來回顧你原本的論述 [name="jserv"] * static random access memory (SRAM) 速度快、成本高 * 電腦的 main memory 主要是 DRAM ,跟不上 CPU 存取的速度, cache 採用的是 SRAM * CPU 在 cache 找不到需要的資料時,會到 main memory 去找,但 CPU 和 main memory 的速度差了好幾個數量級 * Von Neumann bottleneck : main memory 的速度跟不上 CPU 的速度, 造成 CPU performance 的限制 * 察看記憶體資訊`$ dmidecode --type 17 | more` ``` Memory Device Array Handle: 0x0042 Error Information Handle: Not Provided Total Width: 64 bits Data Width: 64 bits Size: 4096 MB Form Factor: SODIMM Set: None Locator: ChannelA-DIMM0 Bank Locator: BANK 0 Type: DDR3 Type Detail: Synchronous Speed: 1600 MHz Manufacturer: Transcend Serial Number: 0009FB43 Asset Tag: 9876543210 Part Number: TS512MSK64W6H Rank: 1 Configured Clock Speed: 1600 MHz ``` 以我的電腦為例, CPU 時脈是 2.8GHz 但 main memory 時脈只有 1600 MHz * function of the cache * The cache will hold data that we think is most likely to be referenced. * Minimize the average memory access latency * Maximize the number of references that are serviced by the cache * N-Way Set Associative : 避免 Fully Associative 和 Direct Mapped 的缺陷,將 cache 分成 N 個 set * cache miss: 在 cache 找不到需要的資料,需要到 main memory 尋找 * 檢查電腦的 cache : `$sudo dmidecode -t cache | more` ``` Cache Information Socket Designation: CPU Internal L1 Configuration: Enabled, Not Socketed, Level 1 Operational Mode: Write Back Location: Internal Installed Size: 128 kB Maximum Size: 128 kB Supported SRAM Types: Unknown Installed SRAM Type: Unknown Speed: Unknown Error Correction Type: Single-bit ECC System Type: Other Associativity: 8-way Set-associative ``` ## 原始版本 >中英文字間請已空白隔開 >[name=課程助教][color=red] * 先清空 cache:`$ echo 1 | sudo tee /proc/sys/vm/drop_caches` * 執行:`$ ./phonebook_orig` ``` size of entry : 136 bytes execution time of append() : 0.066602 sec execution time of findName() : 0.005839 sec ``` * 測試 cache miss rate: `$ make cache-test` * 原始版本 cache miss rate 85% ``` Performance counter stats for './phonebook_orig' (100 runs): 1,025,260 cache-misses # 85.415 % of all cache refs ( +- 0.35% ) 1,200,333 cache-references ( +- 0.29% ) 262,101,582 instructions # 1.40 insn per cycle ( +- 0.02% ) 187,299,573 cycles ( +- 0.14% ) 0.057128980 seconds time elapsed ( +- 1.18% ) ``` * 利用 perf 找熱點 `$ ./phonebook_orig & sudo perf top -p $!` * findName 佔19.16%, append 佔3.32%,但 append 實際執行時間卻較長 ``` 30.72% libc-2.23.so [.] __strcasecmp_l_avx 19.16% phonebook_orig [.] findName 6.63% [kernel] [k] clear_page_c_e 6.52% phonebook_orig [.] main 5.67% libc-2.23.so [.] _IO_fgets 5.30% libc-2.23.so [.] _IO_getline_info 3.71% libc-2.23.so [.] _int_malloc 3.32% phonebook_orig [.] append 2.54% [kernel] [.] native_irq_return_iret 2.46% [kernel] [k] unmap_page_range 2.46% [kernel] [k] free_pcppages_bulk 1.66% [kernel] [k] release_pages 1.65% phonebook_orig [.] strcasecmp@plt 1.35% libc-2.23.so [.] __memcpy_sse2 0.93% libc-2.23.so [.] memchr 0.91% [kernel] [k] do_page_fault 0.86% [kernel] [k] __do_page_fault 0.83% [unknown] [k] 0x00007f29e9ab0e02 0.83% [unknown] [k] 0x00007f29e9ab23c8 0.83% [kernel] [k] page_remove_rmap 0.82% [kernel] [k] __find_get_block 0.82% [kernel] [k] free_hot_cold_page 0.00% [kernel] [k] native_write_msr ``` ## method1 - 減少 entry 大小 * findName 的時候都只用 lastName 去找,原本的結構有很多資料,如果將 entry 縮小,就有更多 entry 可以被放進 cache ,預期能降低 cache miss rate ``` == typedef struct __ENTRY_DETAIL { char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; } entryDetail; typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; entryDetail *detail; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` * 把~~沒有用到~~ lastName 以外的資料移到 entryDetail ,原本的結構內剩下 lastName 和 entryDetail 的指標, append() 和 findName() 和原始版本相同 `$ ./phonebook_opt` >> 不是「沒有用到」,請思考用語的精確性 [name="jserv"] ``` size of entry : 32 bytes execution time of append() : 0.066095 sec execution time of findName() : 0.002160 sec ``` `$ make plot` ![](https://i.imgur.com/0KMhkXn.png) * size of entry 從 136 bytes 縮小到 32 bytes , append() 和 findName() 的時間都降低了 `$ make cache-test` ``` Performance counter stats for './phonebook_opt' (100 runs): 145,400 cache-misses # 46.061 % of all cache refs ( +- 0.34% ) 315,669 cache-references ( +- 0.56% ) 244,493,450 instructions # 1.90 insn per cycle ( +- 0.02% ) 128,986,123 cycles ( +- 0.31% ) 0.039869771 seconds time elapsed ( +- 0.62% ) ``` * cache miss rate 從 84% 降低到 46% ## method2 - Binary Search Tree * 原本的結構是 singly linked list ,搜尋時是 liner search ,將結構改成 Binary Search Tree 預期能降低 findName 的時間 * 因為字典檔內的 lastName 是已經排序過的字串,直接用來建 BST 的話會使 BST 變成 list , findName 無法跳脫 liner search ,因此嘗試建出左右平衡的 BST * reference * [Sorted Linked List to Balanced BST](http://www.geeksforgeeks.org/sorted-linked-list-to-balanced-bst/) * [Convert Sorted List to Binary Search Tree](https://siddontang.gitbooks.io/leetcode-solution/content/tree/convert_sorted_listarray_to_binary_search_tree.html) ```== bst *list_to_BST(entry *pHead) { int listLen = 0; entry *cur = pHead; while(cur) { listLen++; cur = cur->pNext; } return build(&pHead, listLen); } bst *build(entry **pHead, int listLen) { if(listLen <= 0) return NULL; bst *left = build(pHead, listLen/2); bst *root = (bst *) malloc(sizeof(bst)); strcpy(root->lastName, (*pHead)->lastName); root->left = left; *pHead = (*pHead)->pNext; root->right = build(pHead, listLen-listLen/2-1); return root; } ``` * 維持原始版本的 entry structure ,使用二元搜尋樹進行實驗 `$ ./phonebook_bst` ``` size of entry : 136 bytes execution time of append() : 0.066551 sec execution time of findName() : 0.000000 sec ``` `$ make plot` ![](https://i.imgur.com/2FZIfk8.png) 發現 findName() 的時間明顯降低,但 append() 較前兩個情形稍微上升 `$ make cache-test` ``` Performance counter stats for './phonebook_bst' (100 runs): 770,479 cache-misses # 80.189 % of all cache refs ( +- 0.34% ) 960,834 cache-references ( +- 1.09% ) 359,999,587 instructions # 1.21 insn per cycle ( +- 0.01% ) 297,319,379 cycles ( +- 1.15% ) 0.092091102 seconds time elapsed ( +- 1.28% ) ``` 與原始版本相比, cache miss rate 有些微下降(85.41% -> 80.19%) * 使用 method1 縮小過後的 entry structure 進行實驗 `$ ./phonebook_bst` ``` size of entry : 32 bytes execution time of append() : 0.065951 sec execution time of findName() : 0.000000 sec ``` `$ make plot` ![](https://i.imgur.com/IBK3BgF.png) append() 的時間較前兩個情況下降了 `$ make cache-test` ``` Performance counter stats for './phonebook_bst' (100 runs): 302,300 cache-misses # 65.713 % of all cache refs ( +- 0.19% ) 460,034 cache-references ( +- 0.57% ) 342,481,904 instructions # 1.59 insn per cycle ( +- 0.01% ) 214,799,756 cycles ( +- 0.32% ) 0.065648659 seconds time elapsed ( +- 0.39% ) ``` 同時使用 bst 和較小的 entry , cache miss rate 較只用 bst 明顯下降 * 依實驗結果來看,目前讓 findName() 進步最多的是將 linked list 轉成 binary search tree 有效節省 findName() 所需的時間,將 entry 縮小除了能夠降低 cache miss rate 之外,連 cache reference 的次數也跟著降低了 ## memory leak * valgrind :可以用來檢查程式記憶體 `$ valgrind --leak-check=yes ./phonebook_orig` * 修正 main.c 最後的 release memory 的部份 ``` while(pHead) { entry *tmp = pHead; pHead = pHead->pNext; free(tmp); } ``` valgrind 結果 ``` ==9712== All heap blocks were freed -- no leaks are possible ``` * 為 binary search tree 增加 releaseTree() ,確保記憶體完整被釋放 ``` void releaseTree(bst *root) { if(root == NULL) return; releaseTree(root->left); releaseTree(root->right); free(root); } ``` valfrind 結果 ``` ==9716== All heap blocks were freed -- no leaks are possible ``` ## 本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題? * 有代表性嗎?跟真實英文姓氏差異大嗎? * 裡面有部份是沒意義的單字 * 資料難道「數大就是美」嗎? * 並不完全正確,像是 AlphaGo 就是靠大量的對戰資料來進步,但資料的質量也很重要,在做訊號分析的時候常常需要 filter 來過濾,我想資料也是一樣,至少不能和現實相差太大 * 如果我們要建立符合真實電話簿情境的輸入,該怎麼作? * 要去找找有沒有真實姓名統計資料,把這些名字做成字典檔 * [可以產生名字的網站](http://www.generatedata.com/#t1) * [English Baby Names](http://babynames.net/all/english) * 如果會爬蟲的技術,可以從這些網站把資料整理下來 ## Reference [Cache](https://stevenschmatz.gitbooks.io/eecs-370/content/Lecture_Notes/lecture_17.html) [von Neumann bottleneck](http://whatis.techtarget.com/definition/von-Neumann-bottleneck) [What is cache memory – Gary explains](http://www.androidauthority.com/what-is-cache-memory-gary-explains-681747/) [Binary Tree](http://www.csie.ntnu.edu.tw/~u91029/BinaryTree.html) [C Binary Tree with an Example C Code](http://www.thegeekstuff.com/2013/02/c-binary-tree") [ktvexe 筆記](https://hackmd.io/MYZgnCwGbALAtGArCADPWAjAhgRnpgOwAmY8UAbMJqsLgBxb2ZA=) [twzjwang 筆記](https://hackmd.io/s/HJsOjpYKl) [檢查程式記憶體的小工具-valgrind](http://daydreamer.idv.tw/rewrite.php/read-18.html)