contributed by <beieve7028
>
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Stepping: 3
CPU MHz: 3491.640
BogoMIPS: 6815.99
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
安裝與更新linux必須套件(reference)
$ sudo apt-get update
$ sudo apt-get install build-essential
$ sudo apt-get install linux-tools-common linux-tools-generic
$ sudo apt-get install astyle colordiff gnuplot
使用perf說明提到的指令時有出現提到的錯誤
$ perf list
WARNING: perf not found for kernel 4.2.0-42
You may need to install the following packages for this specific kernel:
linux-tools-4.2.0-42-generic
linux-cloud-tools-4.2.0-42-generic
You may also want to install one of the following packages to keep up to date:
linux-tools-generic-lts-<series>
linux-cloud-tools-generic-lts-<series>
因此依照建議的指令安裝正確的perf
$ uname -a
Linux os-2 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ sudo apt-get install linux-tools-4.2.0-42-generic linux-cloud-tools-4.2.0-42-generic
再次使用
$ perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
alignment-faults [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
就可以正常執行了
在video當中老師有對perf稍作說明,我先藉由老師計算pi的範例當作學習perf的進入點,在複製perf說明內的程式碼並進行編譯:
#include <stdio.h>
#include <unistd.h>
double compute_pi_baseline(size_t N) {
double pi = 0.0;
double dt = 1.0 / N;
for (size_t i = 0; i < N; i++) {
double x = (double) i / N;
pi += dt / (1.0 + x * x);
}
return pi * 4.0;
}
int main() {
printf("pid: %d\n", getpid());
sleep(10);
compute_pi_baseline(50000000);
return 0;
}
在經過編譯時遇到了狀況:
$ gcc -o pi pi.c
pi.c: In function ‘compute_pi_baseline’:
pi.c:7:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (size_t i = 0; i < N; i++) {
^
pi.c:7:5: note: use option -std=c99 or -std=gnu99 to compile your code
得到一個錯誤,因為在範例中是使用g++來編譯,而我是採用gcc進行編譯,而從當中的說明可以見到應該是gcc預設編譯的版本不在c99之後的問題。要解決這個問題直覺上有兩個方法
為了能重複利用程式碼減少修改,我選擇指名c99的指示進行編譯後就可以順利編譯:
$ gcc -o pi pi.c -std=c99
接著開兩個terminal視窗各別輸入指令:
$ ./pi
pid: 25306
$ sudo perf top -p 25306
Samples: 2K of event 'cycles', Event count (approx.): 148490621
Overhead Shared O Symbol
99.96% pi [.] compute_pi_baseline
0.04% [kernel] [k] uncharge_list
從perf的數據可以得知這支程式計算的開銷都在計算pi的部份
為了更熟悉perf的使用,我依照參考資料一步步操作,在能理解$sudo perf top
的使用方式後,進一步以perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ~/Desktop/cache_misses
去觀察cache miss的情況。
cache_misses.c:
static char array[10000][10000];
int main (void){
int i, j;
for (i = 0; i < 10000; i++)
for (j = 0; j < 10000; j++)
array[j][i]++;
return 0;
}
$ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ~/Desktop/cache_misses
Performance counter stats for '/home/oslab/Desktop/cache_misses' (5 runs):
5,278,828 cache-misses # 2.125 % of all cache refs
248,157,964 cache-references
2,005,916,097 instructions # 1.88 insns per cycle
1,057,336,702 cycles
0.271109771 seconds time elapsed ( +- 0.57% )
依照perf教學把i, j對調如參考之料所述可以發現cache-references有顯著的下降,原因在space locality與陣列在線性記憶體內擺放的方式有關:
$ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ~/Desktop/cache_misses
Performance counter stats for '/home/oslab/Desktop/cache_misses' (5 runs):
1,909,115 cache-misses # 37.448 % of all cache refs
5,289,898 cache-references
2,005,906,440 instructions # 2.49 insns per cycle
807,624,474 cycles
0.207120866 seconds time elapsed ( +- 0.89% )
(練習到這時發現自己對於cache-references的意思很模糊,因此就去騷擾請問老師這方面概念的關鍵字,後來老師說要看資料不要從網路找,因此趁著颱風還不會把我吹走時趕緊去宿舍拿相關的書…QQ)
會減少cache-references的次數在於資料放進cache時是以一個cache line(64bytes)為單位,如果資料在存取時具有Spatial locality的特性,或是程式寫法能盡量保有cache friendly就能減少cache-misses與cache-references,因為之後要存取的資料已隨著前面的cache reference一起放置在cache內
在紀錄上面文字敘述時,又冒出了一個新的疑問「cache-references怎麼也會降低?」,看了一本爛書結果還是充滿著疑惑,於是再度前往給老師電教導後才比較了解cache。這時深深覺得以前都是一知半解的,等到真的碰觸這塊就顯得以前學習都是無用的。
原本以為cache-references是對cache的存取次數,也是一開始抱持的疑問,因為每次access都會經過cache,那數量怎麼會有差別,這也是自己對名詞定義的不清楚,才使得自己布滿障礙。
在以perf教學進行當中perf record&perf report練習時,因為敘述比較簡短,因此從當中的參考資料perf 性能分析实例——使用perf优化cache利用率進行學習,也順便了解code編寫方式對於cache miss的影響,在了解record&report的用法後,再回頭以perf教學的例子練習。
#define N 5000000
static int array[N] = { 0 };
void normal_loop(int a) {
int i;
for (i = 0; i < N; i++)
array[i] = array[i]+a;
}
void unroll_loop(int a) {
int i;
for (i = 0; i < N; i+=5){
array[i] = array[i]+1;
array[i+1] = array[i+1]+a;
array[i+2] = array[i+2]+a;
array[i+3] = array[i+3]+a;
array[i+4] = array[i+4]+a;
}
}
int main() {
normal_loop(1);
unroll_loop(1);
return 0;
}
經過上面程式以perf進行分析後,在instructions可以看到下面結果,會發現normal_loop佔了比較大的overhead。
Samples: 64 of event 'branch-instructions:u', Event count (approx.): 6031975
Overhead Command Shared Object Symbol
84.68% perf_record_exa perf_record_example [.] normal_loop
14.94% perf_record_exa perf_record_example [.] unroll_loop
0.31% perf_record_exa ld-2.19.so [.] _dl_relocate_object
0.07% perf_record_exa ld-2.19.so [.] dl_main
0.01% perf_record_exa ld-2.19.so [.] _dl_start
0.00% perf_record_exa ld-2.19.so [.] _start
先完成必要的準備
cd ~/cs2016q3/ && git clone https://github.com/believe7028/phonebook.git && cd phonebook/
在README.md當中提到必須使用自動排版工具astyle以及使用一個hookpre-commit
,因此先去查了打造專屬於你的 Git 工作流程、2016q1 Homework 1與Git 客製化 - Git Hooks
在老師講解的影片中,提到以前的學長先以修改資料結構大小的方式進行優化,我也以此概念先進行改善,因為這種方式是簡單且直觀的。在過程中共修改兩個相關的檔案:
phonebook_opt.h
#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H
#define MAX_LAST_NAME_SIZE 16
typedef struct __PHONE_BOOK_DETAIL {
char firstName[16];
char email[16];
char phone[10];
char cell[10];
char addr1[16];
char addr2[16];
char city[16];
char state[2];
char zip[5];
} detail;
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
struct __PHONE_BOOK_DETAIL *pDetail;
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);
#endif
phonebook_opt.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "phonebook_opt.h"
/* FILL YOUR OWN IMPLEMENTATION HERE! */
entry *findName(char lastName[], entry *pHead)
{
while (pHead != NULL) {
if (strcasecmp(lastName, pHead->lastName) == 0)
return pHead;
pHead = pHead->pNext;
}
return NULL;
}
entry *append(char lastName[], entry *e)
{
/* allocate memory for the new entry and put lastName */
e->pNext = (entry *) malloc(sizeof(entry));
e = e->pNext;
strcpy(e->lastName, lastName);
e->pDetail = (detail *) malloc(sizeof(detail));
e->pNext = NULL;
return e;
}
在phonebook_opt.在當中,其實
e->pDetail = (detail *) malloc(sizeof(detail));
是沒有用到的,因為實際上沒有詳細的資料,但是為了符合真實情況,依然把這部份撰寫完成以貼近真實。
$ sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig
Performance counter stats for './phonebook_orig' (10 runs):
4,579,669 cache-misses # 93.033 % of all cache refs
4,831,849 cache-references
0.047660639 seconds time elapsed ( +- 2.25% )
$ sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt
Performance counter stats for './phonebook_opt' (10 runs):
5,535,335 cache-misses # 90.737 % of all cache refs
6,082,763 cache-references
0.069276455 seconds time elapsed ( +- 2.46% )
經由上面比較可以發現cache miss 降低的幅度很低,而且執行時間都反而增加,因此我回去影片找老師提到的範例來查詢原因,後來發現差異應該是在當中
lastNameEntry *appendOptimal(char lastName[], lastNameEntry *lne)
{
/* allocate memory for the new entry and put lastName in it.*/
lne->pNext = (lastNameEntry *) malloc(sizeof(lastNameEntry));
lne = lne->pNext;
strcpy(lne->lastName, lastName);
lne->pNext = NULL;
return lne;
}
雖然在概念上把詳細資料放在另一個結構裡,但在實作上他不會宣告詳細資料的結構。為了驗證是否差異在此,我註解了我會宣告詳細資料的那行並再次進行計算:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "phonebook_opt.h"
/* FILL YOUR OWN IMPLEMENTATION HERE! */
entry *findName(char lastName[], entry *pHead)
{
while (pHead != NULL) {
if (strcasecmp(lastName, pHead->lastName) == 0)
return pHead;
pHead = pHead->pNext;
}
return NULL;
}
entry *append(char lastName[], entry *e)
{
/* allocate memory for the new entry and put lastName */
e->pNext = (entry *) malloc(sizeof(entry));
e = e->pNext;
strcpy(e->lastName, lastName);
//e->pDetail = (detail *) malloc(sizeof(detail));
e->pNext = NULL;
return e;
}
$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig
Performance counter stats for './phonebook_orig' (10 runs):
4,466,097 cache-misses # 90.015 % of all cache refs
4,980,070 cache-references
0.048785512 seconds time elapsed ( +- 2.25% )
$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt
Performance counter stats for './phonebook_opt' (10 runs):
1,450,470 cache-misses # 64.407 % of all cache refs
2,333,493 cache-references
0.031379584 seconds time elapsed ( +- 3.77% )
經過註解詳細資料結構的記憶體配置指令後,可以見到在cache miss有大幅的改善,且數據的表現也比較接近,執行時間的開銷也有降低。
在調整資料結構的大小之後,進一部想改變搜尋方式,直覺上有雜湊搜尋與字典樹兩種,於是我先採用字典樹看看是否在數據上有較好的表現:
$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig
Performance counter stats for './phonebook_orig' (10 runs):
4,544,119 cache-misses # 92.462 % of all cache refs
4,789,443 cache-references
0.048927051 seconds time elapsed ( +- 2.43% )
$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt
Performance counter stats for './phonebook_opt' (10 runs):
4,935,383 cache-misses # 87.514 % of all cache refs
5,649,657 cache-references
0.135761925 seconds time elapsed ( +- 0.92% )
可以見到雖然雖然採用了字典樹,在搜尋的時間有很好的表現,但是在append的步驟需要較高的時間開銷,因為建構字典樹節點的步驟相較採用link list來的多,而所在意的cache miss反而增加到87%。
表示一個字符需要一個byte,lastName因為只有小寫英文,這表示8bits 共256種可能卻只用到其中26个可能,因此就想以此加壓縮,原本一開始是想採用26進位,但想到會有乘法的操作,因此用位移操作符在計算時會比較快速,因此定為每5bits表示一個小寫英文字,會先把連續字符lastName轉換為一數字,在比較時則以此數值為主要依據:
unsigned long long int name2Int(char lastName[])
{
unsigned long long int num = 0;
int index = -1;
while(lastName[++index] != '\0') {
num= (num << 5) + (lastName[index] - 'a' + 1);
}
return num;
}
當中 -'a' + 1
會有+1
是考慮到表示'\0'
的問題,為了未來可以從數字轉換為字符串,因此小寫英文字符都從1開始,0則用來表示'\0'
,才能區別一個數值為0是已無英文字符,如果英文從0開始編號,則一個數值為0有可能是"aa\0"
的表示法。
$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig
Performance counter stats for './phonebook_orig' (10 runs):
4,376,367 cache-misses # 89.796 % of all cache refs
4,907,063 cache-references
0.049096653 seconds time elapsed ( +- 2.43% )
$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt
Performance counter stats for './phonebook_opt' (10 runs):
7,314,511 cache-misses # 93.471 % of all cache refs
7,795,438 cache-references
0.092457792 seconds time elapsed ( +- 1.22% )
經過分析後,可以發現不管在chche miss或是執行效率都變得比較差,還有許多可以在改進的地方。