# 2016q3 Homework1 (phonebook) contributed by <`beieve7028`> --- # 預期目標 * 準備 GNU/Linux 開發工具 * 學習使用 Git 與 GitHub * 學習效能分析工具 * 研究軟體最佳化 --- # 實驗環境 ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 94 Stepping: 3 CPU MHz: 3491.640 BogoMIPS: 6815.99 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 8192K NUMA node0 CPU(s): 0-7 ``` --- # 先前準備 1. [自己學習使用HackMD與Markdown的~~防痴呆~~筆記](https://hackmd.io/KYBgzAbA7ArAxgRgLQCYUBMRICzpTJAIxnQA4kBOKXCBQqRQkIA=) 1. 安裝與更新linux必須套件([reference](https://hackmd.io/s/B1DmtJfT)) `$ sudo apt-get update` `$ sudo apt-get install build-essential` `$ sudo apt-get install linux-tools-common linux-tools-generic` `$ sudo apt-get install astyle colordiff gnuplot` --- # perf 學習 ---- # perf 安裝筆記 使用[perf說明](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)提到的指令時有出現提到的錯誤 ``` $ perf list WARNING: perf not found for kernel 4.2.0-42 You may need to install the following packages for this specific kernel: linux-tools-4.2.0-42-generic linux-cloud-tools-4.2.0-42-generic You may also want to install one of the following packages to keep up to date: linux-tools-generic-lts-<series> linux-cloud-tools-generic-lts-<series> ``` 因此依照建議的指令安裝正確的perf ``` $ uname -a Linux os-2 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux $ sudo apt-get install linux-tools-4.2.0-42-generic linux-cloud-tools-4.2.0-42-generic ``` 再次使用 ``` $ perf list List of pre-defined events (to be used in -e): branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] ref-cycles [Hardware event] alignment-faults [Software event] context-switches OR cs [Software event] cpu-clock [Software event] cpu-migrations OR migrations [Software event] dummy [Software event] emulation-faults [Software event] major-faults [Software event] minor-faults [Software event] page-faults OR faults [Software event] task-clock [Software event] ``` 就可以正常執行了 ---- # perf 練習 在[video](https://www.youtube.com/watch?v=ZICRLKf_bVw)當中老師有對perf稍作說明,我先藉由老師計算pi的範例當作學習perf的進入點,在複製[perf說明](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)內的程式碼並進行編譯: ```clike= #include <stdio.h> #include <unistd.h> double compute_pi_baseline(size_t N) { double pi = 0.0; double dt = 1.0 / N; for (size_t i = 0; i < N; i++) { double x = (double) i / N; pi += dt / (1.0 + x * x); } return pi * 4.0; } int main() { printf("pid: %d\n", getpid()); sleep(10); compute_pi_baseline(50000000); return 0; } ``` 在經過編譯時遇到了狀況: ``` $ gcc -o pi pi.c pi.c: In function ‘compute_pi_baseline’: pi.c:7:5: error: ‘for’ loop initial declarations are only allowed in C99 mode for (size_t i = 0; i < N; i++) { ^ pi.c:7:5: note: use option -std=c99 or -std=gnu99 to compile your code ``` 得到一個錯誤,因為在範例中是使用g++來編譯,而我是採用gcc進行編譯,而從當中的說明可以見到應該是gcc預設編譯的版本不在c99之後的問題。要解決這個問題直覺上有兩個方法 1. 將迴圈內的變數宣告拉到迴圈外 2. 使用g++編譯 3. 從編譯器給予的指示說明,以c99標準指示編譯器編譯 為了能重複利用程式碼減少修改,我選擇指名c99的指示進行編譯後就可以順利編譯: `$ gcc -o pi pi.c -std=c99` 接著開兩個terminal視窗各別輸入指令: ``` $ ./pi pid: 25306 ``` ``` $ sudo perf top -p 25306 Samples: 2K of event 'cycles', Event count (approx.): 148490621 Overhead Shared O Symbol 99.96% pi [.] compute_pi_baseline 0.04% [kernel] [k] uncharge_list ``` 從perf的數據可以得知這支程式計算的開銷都在計算pi的部份 ---- # perf stat 練習 為了更熟悉perf的使用,我依照[參考資料](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)一步步操作,在能理解`$sudo perf top`的使用方式後,進一步以`perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ~/Desktop/cache_misses`去觀察cache miss的情況。 cache_misses.c: ```clike= static char array[10000][10000]; int main (void){ int i, j; for (i = 0; i < 10000; i++) for (j = 0; j < 10000; j++) array[j][i]++; return 0; } ``` ``` $ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ~/Desktop/cache_misses Performance counter stats for '/home/oslab/Desktop/cache_misses' (5 runs): 5,278,828 cache-misses # 2.125 % of all cache refs 248,157,964 cache-references 2,005,916,097 instructions # 1.88 insns per cycle 1,057,336,702 cycles 0.271109771 seconds time elapsed ( +- 0.57% ) ``` 依照[perf教學](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)把i, j對調如參考之料所述可以發現cache-references有顯著的下降,原因在space locality與陣列在線性記憶體內擺放的方式有關: ``` $ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ~/Desktop/cache_misses Performance counter stats for '/home/oslab/Desktop/cache_misses' (5 runs): 1,909,115 cache-misses # 37.448 % of all cache refs 5,289,898 cache-references 2,005,906,440 instructions # 2.49 insns per cycle 807,624,474 cycles 0.207120866 seconds time elapsed ( +- 0.89% ) ``` (練習到這時發現自己對於cache-references的意思很模糊,因此就去~~騷擾~~請問老師這方面概念的關鍵字,後來老師說要看資料不要從網路找,因此趁著颱風還不會把我吹走時趕緊去宿舍拿相關的書...QQ) ~~會減少cache-references的次數在於資料放進cache時是以一個cache line(64bytes)為單位,如果資料在存取時具有Spatial locality的特性,或是程式寫法能盡量保有cache friendly就能減少cache-misses與cache-references,因為之後要存取的資料已隨著前面的cache reference一起放置在cache內~~ 在紀錄上面文字敘述時,又冒出了一個新的疑問「cache-references怎麼也會降低?」,看了一本~~爛~~書結果還是充滿著疑惑,於是再度前往給老師~~電~~教導後才比較了解cache。這時深深覺得以前都是一知半解的,等到真的碰觸這塊就顯得以前學習都是無用的。 原本以為cache-references是對cache的存取次數,也是一開始抱持的疑問,因為每次access都會經過cache,那數量怎麼會有差別,這也是自己對名詞定義的不清楚,才使得自己布滿障礙。 ---- # perf record & perf report 在以[perf教學](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)進行當中perf record&perf report練習時,因為敘述比較簡短,因此從當中的參考資料[perf 性能分析实例——使用perf优化cache利用率](http://blog.csdn.net/trochiluses/article/details/17346803)進行學習,也順便了解code編寫方式對於cache miss的影響,在了解record&report的用法後,再回頭以[perf教學](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)的例子練習。 ```clike= #define N 5000000 static int array[N] = { 0 }; void normal_loop(int a) { int i; for (i = 0; i < N; i++) array[i] = array[i]+a; } void unroll_loop(int a) { int i; for (i = 0; i < N; i+=5){ array[i] = array[i]+1; array[i+1] = array[i+1]+a; array[i+2] = array[i+2]+a; array[i+3] = array[i+3]+a; array[i+4] = array[i+4]+a; } } int main() { normal_loop(1); unroll_loop(1); return 0; } ``` 經過上面程式以perf進行分析後,在instructions可以看到下面結果,會發現normal_loop佔了比較大的overhead。 ``` Samples: 64 of event 'branch-instructions:u', Event count (approx.): 6031975 Overhead Command Shared Object Symbol 84.68% perf_record_exa perf_record_example [.] normal_loop 14.94% perf_record_exa perf_record_example [.] unroll_loop 0.31% perf_record_exa ld-2.19.so [.] _dl_relocate_object 0.07% perf_record_exa ld-2.19.so [.] dl_main 0.01% perf_record_exa ld-2.19.so [.] _dl_start 0.00% perf_record_exa ld-2.19.so [.] _start ``` --- # 作業 先完成必要的準備 1. fork一份code到自己的repositories 2. clone 到 local `cd ~/cs2016q3/ && git clone https://github.com/believe7028/phonebook.git && cd phonebook/` 在README.md當中提到必須使用自動排版工具astyle以及使用一個hook`pre-commit`,因此先去查了[打造專屬於你的 Git 工作流程](http://tech.mozilla.com.tw/posts/5306/%E6%89%93%E9%80%A0%E5%B0%88%E5%B1%AC%E6%96%BC%E4%BD%A0%E7%9A%84-git-%E5%B7%A5%E4%BD%9C%E6%B5%81%E7%A8%8B-alias%E3%80%81commands%E3%80%81hooks)、[2016q1 Homework 1](https://embedded2016.hackpad.com/ep/pad/static/LARSLykI83L)與[Git 客製化 - Git Hooks](https://hackmd.io/EwYwrAHARgbAzATgLRwCZjkgLAUwGZhICGcMeKeWceqIUREEIQA=?both#) --- # [修改結構成員](https://github.com/believe7028/phonebook/tree/linkList) 在[老師講解的影片](https://www.youtube.com/watch?v=ZICRLKf_bVw)中,提到以前的學長先以修改資料結構大小的方式進行優化,我也以此概念先進行改善,因為這種方式是簡單且直觀的。在過程中共修改兩個相關的檔案: phonebook_opt.h ```clike= #ifndef _PHONEBOOK_H #define _PHONEBOOK_H #define MAX_LAST_NAME_SIZE 16 typedef struct __PHONE_BOOK_DETAIL { char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; } detail; typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; struct __PHONE_BOOK_DETAIL *pDetail; struct __PHONE_BOOK_ENTRY *pNext; } entry; entry *findName(char lastName[], entry *pHead); entry *append(char lastName[], entry *e); #endif ``` phonebook_opt.c ```clike= #include <stdio.h> #include <stdlib.h> #include <string.h> #include <ctype.h> #include "phonebook_opt.h" /* FILL YOUR OWN IMPLEMENTATION HERE! */ entry *findName(char lastName[], entry *pHead) { while (pHead != NULL) { if (strcasecmp(lastName, pHead->lastName) == 0) return pHead; pHead = pHead->pNext; } return NULL; } entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); e = e->pNext; strcpy(e->lastName, lastName); e->pDetail = (detail *) malloc(sizeof(detail)); e->pNext = NULL; return e; } ``` 在phonebook_opt.在當中,其實 ```clike e->pDetail = (detail *) malloc(sizeof(detail)); ``` 是沒有用到的,因為實際上沒有詳細的資料,但是為了符合真實情況,依然把這部份撰寫完成以貼近真實。 ### ori: ``` $ sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig Performance counter stats for './phonebook_orig' (10 runs): 4,579,669 cache-misses # 93.033 % of all cache refs 4,831,849 cache-references 0.047660639 seconds time elapsed ( +- 2.25% ) ``` ### opt: ``` $ sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt Performance counter stats for './phonebook_opt' (10 runs): 5,535,335 cache-misses # 90.737 % of all cache refs 6,082,763 cache-references 0.069276455 seconds time elapsed ( +- 2.46% ) ``` ![](https://i.imgur.com/2eL4nCl.png) 經由上面比較可以發現cache miss 降低的幅度很低,而且執行時間都反而增加,因此我回去影片找老師提到的[範例](https://github.com/charles620016/embedded-summer2015/tree/master/phonebook)來查詢原因,後來發現差異應該是在當中 ```clike= lastNameEntry *appendOptimal(char lastName[], lastNameEntry *lne) { /* allocate memory for the new entry and put lastName in it.*/ lne->pNext = (lastNameEntry *) malloc(sizeof(lastNameEntry)); lne = lne->pNext; strcpy(lne->lastName, lastName); lne->pNext = NULL; return lne; } ``` 雖然在概念上把詳細資料放在另一個結構裡,但在實作上他不會宣告詳細資料的結構。為了驗證是否差異在此,我註解了我會宣告詳細資料的那行並再次進行計算: ```clike= #include <stdio.h> #include <stdlib.h> #include <string.h> #include <ctype.h> #include "phonebook_opt.h" /* FILL YOUR OWN IMPLEMENTATION HERE! */ entry *findName(char lastName[], entry *pHead) { while (pHead != NULL) { if (strcasecmp(lastName, pHead->lastName) == 0) return pHead; pHead = pHead->pNext; } return NULL; } entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); e = e->pNext; strcpy(e->lastName, lastName); //e->pDetail = (detail *) malloc(sizeof(detail)); e->pNext = NULL; return e; } ``` ### ori: ``` $ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig Performance counter stats for './phonebook_orig' (10 runs): 4,466,097 cache-misses # 90.015 % of all cache refs 4,980,070 cache-references 0.048785512 seconds time elapsed ( +- 2.25% ) ``` ### opt: ``` $ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt Performance counter stats for './phonebook_opt' (10 runs): 1,450,470 cache-misses # 64.407 % of all cache refs 2,333,493 cache-references 0.031379584 seconds time elapsed ( +- 3.77% ) ``` ![](https://i.imgur.com/S56C6ny.png) 經過註解詳細資料結構的記憶體配置指令後,可以見到在cache miss有大幅的改善,且數據的表現也比較接近,執行時間的開銷也有降低。 --- # [字典樹](https://github.com/believe7028/phonebook/tree/trie) 在調整資料結構的大小之後,進一部想改變搜尋方式,直覺上有雜湊搜尋與字典樹兩種,於是我先採用字典樹看看是否在數據上有較好的表現: ### ori: ``` $ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig Performance counter stats for './phonebook_orig' (10 runs): 4,544,119 cache-misses # 92.462 % of all cache refs 4,789,443 cache-references 0.048927051 seconds time elapsed ( +- 2.43% ) ``` ### opt: ``` $ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt Performance counter stats for './phonebook_opt' (10 runs): 4,935,383 cache-misses # 87.514 % of all cache refs 5,649,657 cache-references 0.135761925 seconds time elapsed ( +- 0.92% ) ``` ![](https://i.imgur.com/QoyL5bM.png) 可以見到雖然雖然採用了字典樹,在搜尋的時間有很好的表現,但是在append的步驟需要較高的時間開銷,因為建構字典樹節點的步驟相較採用link list來的多,而所在意的cache miss反而增加到87%。 --- # [lastName 編碼壓縮](https://github.com/believe7028/phonebook/tree/compressLastName) 表示一個字符需要一個byte,lastName因為只有小寫英文,這表示8bits 共256種可能卻只用到其中26个可能,因此就想以此加壓縮,原本一開始是想採用26進位,但想到會有乘法的操作,因此用位移操作符在計算時會比較快速,因此定為每5bits表示一個小寫英文字,會先把連續字符lastName轉換為一數字,在比較時則以此數值為主要依據: ```clike= unsigned long long int name2Int(char lastName[]) { unsigned long long int num = 0; int index = -1; while(lastName[++index] != '\0') { num= (num << 5) + (lastName[index] - 'a' + 1); } return num; } ``` 當中 `-'a' + 1`會有`+1`是考慮到表示`'\0'`的問題,為了未來可以從數字轉換為字符串,因此小寫英文字符都從1開始,0則用來表示`'\0'`,才能區別一個數值為0是已無英文字符,如果英文從0開始編號,則一個數值為0有可能是`"aa\0"`的表示法。 ### ori: ``` $ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig Performance counter stats for './phonebook_orig' (10 runs): 4,376,367 cache-misses # 89.796 % of all cache refs 4,907,063 cache-references 0.049096653 seconds time elapsed ( +- 2.43% ) ``` ### opt: ``` $ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt Performance counter stats for './phonebook_opt' (10 runs): 7,314,511 cache-misses # 93.471 % of all cache refs 7,795,438 cache-references 0.092457792 seconds time elapsed ( +- 1.22% ) ``` ![](https://i.imgur.com/aiFA0xC.png) 經過分析後,可以發現不管在chche miss或是執行效率都變得比較差,還有許多可以在改進的地方。 --- # 參考資料 * [A01: phonebook - HackMD](https://hackmd.io/s/S1RVdgza) * [2016 年秋季作業說明 - HackMD](https://hackmd.io/s/B1DmtJfT) * [perf 性能分析实例——使用perf优化cache利用率](http://blog.csdn.net/trochiluses/article/details/17346803)