contributed by < ryanpatiency
>
ryanpatiency
3d23f44 commit message 看不出做了哪些修改
3cce83d 未說明為什麼要修改 hash table size
課程助教git rebase 後 push 在 github 上了 ,但 3d23f44 是 merge 前的 commit ,我不是很明白要怎麼改訊息。下次上課再問助教
3cce83d
Update hash table size => Update hash table size to prime
ryanpatiency
老師的作業要求很多,但是如果你完全沒有頭緒,不知道從哪裡開始,這篇文章是針對你寫的
重點整理
第一步驟是:安裝 linux 、 學會用 git & github 、學會打 hackmd
因為作業要運行在 linux 上,作業繳交是用 github & hackmd,所以你一定要會這 3 個。
避免用 git commit -a -m “messadge”
,請用 git commit
然後仔細撰寫 Git commit message。
參見 How to Write a Git Commit Message
如果你花了超過3個小時還沒有辦法把上面的東西弄好,找一個已經會的同學教你,因為 ~~ 你知道的,這次作業要求你付出16個小時以上,你不能夠在預備上就花超過 3 個小時
如果你弄好了,就可以開始做作業了:
第二步驟是:看作業1的教學影片 link
看影片大概要花一個小時的時間,看完並且跟著操作後,配合之前所學,你就能在作業區上傳一份空白的作業了。(包含空白的 hackmd 和 同步後的 github)
現在你的 github 上是沒有任何進度的,接下來你要做的,就是 commit 並 push 你的進度。並且紀錄在 hackmd 上。加油!!
看完影片後,你應該要知道:
知道這一次的作業目的是減少 cache miss ,並且提高程式執行速度。
第三步驟是:站在巨人的肩膀上(尋找強者學長和同學的文件)。
首先我們已經知道目標是要減少 cache miss ,那麼我們就應該先找找歷屆同學的範例程式裡面;哪一個的 case-miss 最低?我把我整理的名單放在後面,但是你應該要自己尋找、整理。
第四步驟是:選個文件照著做
如果文件看不懂,比如你不知道 $ make cashe-test
你可以:
這裡我給你一些在讀文件遇到問題時的範例,供你參考
了解 make 的方法 (also perf , gcc, …)
$ man make
了解 memory pool 怎麼實作
了解 hash table 怎麼實作
你可以一直做第四步驟直到你有信心為止。
如果你還是不知道要怎麼做 接下來是我的作法記錄,你可以對照著看
最後最後,祝你修課順利:) 加油!!
ryanwang
同學的實驗 ( struct size, hash table ) 達到已知最低的 cache-miss: 21%ierosodin
同學的實驗 ( struct size, hash table, memory pool):cache-miss 不降反升、再次增快執行速度phonebook_opt.c
先不管加分題,我們先開始調查:
同學的成果,以 cache miss 為標準,以下是成績優異的同學
調查結果,ryan wang
同學 26.915% 第1名
我們的第 1 要務是了解 26.915% 是怎麼做到的。
ryanwang
同學的實驗以下的實驗可當作ryanwang
同學的筆記的補充,請大家可以先看他的筆記,如果遇到難處,可以參考以下我找到的資料
有許多部份我沒有一步一步照著做,而是自己上網找資料,然後用自己的方法做。如果大家看到有些步驟沒有說明,只有一些連結,那代表我所參考的資料
ryan wang
的文件中有 $ echo 1 | sudo tee /proc/sys/vm/drop_caches
解說如下:
不懂 | 和 > 和 &> 和 >&2 … 的話,他們的搜尋關鍵字是 bash pipe, redirection, 參考 link
tee
是用來給 stdout
權限的,,大家可以 man tee
為什麼不 run $ echo 1 > sudo /proc/sys/vm/drop_caches
呢?
實驗看看下面兩行就能理解
$ sudo echo 1 > /proc/sys/vm/drop_caches
$ echo 1 > sudo /proc/sys/vm/drop_caches
這個技巧再後來$ perf record
的時候也會用到
1, 2, 3 各有不同的意思:
echo 1 > /proc/sys/vm/drop_caches
echo 2 > /proc/sys/vm/drop_caches
echo 3 > /proc/sys/vm/drop_caches
我 $ make cache-test
的結果:
Performance counter stats for './phonebook_opt' (100 runs):
1,178,032 cache-misses # 78.818 % of all cache refs ( +- 0.97% )
1,494,619 cache-references ( +- 0.90% )
261,906,975 instructions # 1.17 insns per cycle ( +- 0.02% )
223,419,587 cycles ( +- 1.14% )
0.095444565 seconds time elapsed ( +- 1.55% )
$ make plot
的結果
可以發現:我的電腦速度本來就比這位ryanwang522
同學慢,他的初始執行時間是 0.04 秒 / 0.006 秒,我的是 0.06 秒 / 0.007 秒
我的實作方式:註解程式碼
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
// char firstName[16];
// char email[16];
// char phone[10];
// char cell[10];
// char addr1[16];
// char addr2[16];
// char city[16];
// char state[2];
// char zip[5];
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
ryanwang
提到的 l1_cache ,可用 lscpu
找到After Shrink the size I run make cache-miss
again
Performance counter stats for './phonebook_opt' (100 runs):
103,879 cache-misses # 32.039 % of all cache refs ( +- 1.26% )
324,226 cache-references ( +- 1.76% )
241,106,445 instructions # 1.72 insns per cycle ( +- 0.02% )
140,094,134 cycles ( +- 1.03% )
0.059512604 seconds time elapsed ( +- 1.29% )
cashe-miss 從 80% 降到 30%
phonebook_opt.c 中有一個 TODO, 請大家要按照 TODO 做!
改好後的 make plot
參考以下連結 DIY
可見我們老師多愛 C
ryanpatiency
我的實作很簡單,只要把 e
& pHead
改成 hash_e[key]
& hash_pHead[key]
就好了,github 上可以找到
當 hash_table 數量為 1000 時,就連 phonebook_orig 都能跑到 27% 的 cache-miss
為了對照,最好把 HASH_TABLE_SIZE 的定義 放在 #ifdef OPT #else 之中,並定義 orig 的 hash table size 為 1
21%!! 只是 follow 前人的腳步就能超過已知最佳成績,真的要感謝這位ryanwang522
同學
ierosodin
同學的實驗./phonebook_orig & perf top -p $!
&
make the program run in background. google key word: bash &$!
the last ip of command. google key word: bash $!perf stat
: please run $perf help top
yourselfperf record
perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./phonebook_orig
DIY the following to understand:
$ perf help stat
$ perf help list
We can find a spooky thing in the end of the $ perf help list
:
Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume
3B: System Programming Guide
http://www.intel.com/Assets/PDF/manual/253669.pdf
AMD64 Architecture Programmer’s Manual Volume 2: System Programming
http://support.amd.com/us/Processor_TechDocs/24593_APM_v2.pdf
it is a 500 page manual, I am terrified
go to wiki
Wiki also tells you to look manual, So to understand what is cache-misses, cache-references, I suggest you just go to look at the manual.
helpful 教學
可以學會其他不錯的 event: L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores
了解 cashe-store 和 cashe-miss 的差別:教學
ierosodin
also shrink struct size, but he add a struct __PHONE_BOOK_INFO
to keep the rest of the info, which looks more proper.
ierosodin
同學有提供 memory pool 的 code,但由於沒有針對這次的作業,我建議大家上網去找 link
我採用 link 中的方式實現 memory pool
typedef struct pool
{
char * next;
char * end;
} POOL;
POOL * pool_create( size_t size ) {
POOL * p = (POOL*)malloc( size + sizeof(POOL) );
p->next = (char*)&p[1];
p->end = p->next + size;
return p;
}
void pool_destroy( POOL *p ) {
free(p);
}
size_t pool_available( POOL *p ) {
return p->end - p->next;
}
void * pool_alloc( POOL *p, size_t size ) {
if( pool_available(p) < size ) return NULL;
void *mem = (void*)p->next;
p->next += size;
return mem;
}
要用 memory pool 取代 malloc, 所以 ctrl-F 找尋所有 malloc 出現的地方,把他們換成 pool_alloc, 當然一開始要先 pool_create() (除了 main.c 以外,不要動到 orig比較好)
由於我這次打算只觀察 memory pool 的影響,所以沒有 shrink entry size, 所以一個 node size = 136 bytes, dictionary 有 350000 筆資料,總共要 alloc 350000 * 136 = 47600000 bytes
Bug 紀錄:
Error: double free or corruption
解決方法:
在 pool_create
和 pool_destroy
外加上 #ifdef OPT
和 #endif
,把原本的 free 放在 #else
中
結果
改善成功! drop from 82 % to 57 %
但是可以發現在 cashe-miss 上的改善不如 Hashfunction (82% to 27%) 及 Shrink entry size (82% to 30%) 明顯
用法 valgrind ./phonebook_opt
可以發現產生許多有用的資訊,用一次就知道了。
範例:
$ valgrind ./phonebook_orig
...
==3341== HEAP SUMMARY:
==3341== in use at exit: 0 bytes in 0 blocks
==3341== total heap usage: 349,906 allocs, 349,906 frees, 47,596,856 bytes allocated
==3341==
==3341== All heap blocks were freed -- no leaks are possible
...
$ valgrind ./phonebook_opt
...
==3291== HEAP SUMMARY:
==3291== in use at exit: 0 bytes in 0 blocks
==3291== total heap usage: 1,006 allocs, 1,006 frees, 47,634,336 bytes allocated
==3291==
==3291== All heap blocks were freed -- no leaks are possible
可以發現 allocs 次數從 349906 降到 1006
現在要 alloc 的 memory 是 entry size (24 bytes) * 350000 = 8400000
但是上面的 terminal out 的時候忘記改了。所以 allocated bytes 才會相同
iero
同學使用 valgrind --leak-check=full
, 但我在 man valgrind 中沒看到 –leak-check=full
發現是執行 valgrind ./phonebook_o* 後,如果有 memory leak, valgrind
會告訴你請加上 –leak-check=full
實作 memory pool,並用 valgrind debug 完後,我想看看綜合之前的方法會怎麼樣
我將 mpool branch 與原本參考 ryanwang
的 master branch git-merge 並執行
執行結果
$ ./phonebook_opt
size of entry : 24 bytes
execution time of append() : 0.063161 sec
execution time of findName() : 0.000003 sec
...
$ ./phonebook_orig
size of entry : 136 bytes
execution time of append() : 0.088766 sec
execution time of findName() : 0.006353 sec
執行時間都變快不少
Performance counter stats for './phonebook_orig' (100 runs):
1,472,016 cache-misses # 88.575 % of all cache refs ( +- 0.46% )
1,661,879 cache-references ( +- 0.36% )
379,636,405 instructions # 1.49 insns per cycle ( +- 0.02% )
255,468,254 cycles ( +- 0.36% )
0.103079308 seconds time elapsed ( +- 0.75% )
Performance counter stats for './phonebook_opt' (100 runs):
73,289 cache-misses # 35.223 % of all cache refs ( +- 8.15% )
208,071 cache-references ( +- 5.94% )
177,799,568 instructions # 1.63 insns per cycle ( +- 0.29% )
108,785,516 cycles ( +- 0.91% )
0.044483962 seconds time elapsed ( +- 1.11% )
什麼?為什麼 cashe-miss 從 21% 上升到了 30% 了? 我不相信,於是 git checkout 以前的版本試試看
// 之前的版本
Performance counter stats for './phonebook_orig' (100 runs):
1,232,064 cache-misses # 85.881 % of all cache refs ( +- 0.48% )
1,434,611 cache-references ( +- 0.43% )
314,415,068 instructions # 1.39 insns per cycle ( +- 0.02% )
225,707,514 cycles ( +- 0.40% )
0.091467723 seconds time elapsed ( +- 0.76% )
Performance counter stats for './phonebook_opt' (100 runs):
77,994 cache-misses # 22.709 % of all cache refs ( +- 1.49% )
343,453 cache-references ( +- 2.01% )
238,434,085 instructions # 1.64 insns per cycle ( +- 0.02% )
145,711,315 cycles ( +- 0.68% )
0.059365206 seconds time elapsed ( +- 1.10% )
opt
的執行結果 ,這次是 22% 。看起來實驗是沒問題的。git diff 後也證實的確只有 memory pool 不同。可能是 memory pool 哪裡出了問題吧?
不過大家可以發現,memory pool 把 instruction 數目 降低了 ,所以時間也從 0.06 變為 0.04
實驗示範到這裡結束,從分隔線後,是我的實驗紀錄。就沒有特別寫給初學者看了。不過看到這裡,你應該有能力去挑戰加分題了吧~
加油!
iero
同學的最終版本還是快我 3 倍我想 git pull 他的 code 來跑看看
iero
同學有畫很特別的 hash_function plot 我也要學起來,並且了解他的橫軸是怎麼一回事?(MOD=17 時不是應該每一個要有 20000 左右的節點嗎?
根據網路上的建議,size 的選擇是質數比較好
改成 all-names.txt
add the line #define INPUT "john"
because now input may change
there is three john in the list, hm… I thought each will be unique
Match score rule:
author of gnu grep
's post: link
the way he use to measure grep
's speed:
$ time sh -c 'find . -type f -print | xargs grep -l 123456789abcdef'
real 0m1.530s
user 0m0.230s
sys 0m1.357s
$ time sh -c 'find . -type f -print | xargs grep --mmap -l 123456789abcdef'
real 0m1.201s
user 0m0.330s
sys 0m0.929s
Summary by the grep
author:
Use Boyer-Moore (and unroll its inner loop a few times).
Roll your own unbuffered input using raw system calls. Avoid copying
the input bytes before searching them. (Do, however, use buffered
*output*. The normal grep scenario is that the amount of output is
small compared to the amount of input, so the overhead of output
buffer copying is small, while savings due to avoiding many small
unbuffered writes can be large.)
Don't look for newlines in the input until after you've found a match.
Try to set things up (page-aligned buffers, page-sized read chunks,
optionally use mmap) so the kernel can ALSO avoid copying the bytes.
The key to making programs fast is to make them do practically nothing. ;-)
read "Fast String Searching", by Andrew Hume and Daniel Sunday, => free pdf online
My Makefile:
phonebook_fuzzy: $(SRCS_common) phonebook_fuzzy.c phonebook_fuzzy.h
$(CC) $(CFLAGS_common) $(CFLAGS_opt) \
-DIMPL="\"$@.h\"" -o $@ \
$(SRCS_common) $@.c
findName in phonebook_fuzzy.c:
entry *findName(char lastName[], entry *pHead)
{
int best_score = INT_MIN;
char *maybe = NULL;
while (pHead != NULL) {
if (strcasecmp(lastName, pHead->lastName) == 0)
return pHead;
int score = fuzzy_match(lastName, pHead->lastName);
// * fuzzy_match is the c version of the code link in the above
if(score > best_score) {
best_score = score;
maybe = pHead->pNext;
}
pHead = pHead->pNext;
}
printf("No result for %s, Did you mean: %s\n", lastName, maybe);
return NULL;
}
try search for my ip: ryanpatiency, the end:
$ ./phonebook_fuzzy
size of entry : 136 bytes
No result for ryanpatiency, Did you mean: ryan
I did it. It was super super fun to make a fuzzy search and search for my id.
just for fun, here is the output if you search for jserv:
size of entry : 136 bytes
No result for jserv, Did you mean: jose
hmm.. jose, quite unlike a name. I think jserv is better the jose.
challenge: using priority queue.
心得:看懂強者的 code 不好玩,看不懂更不好玩。最好玩的還是理解原理自己作
ps: after merge, current speed is:
$ ./phonebook_opt
size of entry : 24 bytes
execution time of append() : 0.003056 sec
execution time of findName() : 0.000000 sec
It is because this all-name.txt is much small
LHS >> RHS
is implementation-defined where in most implementations, this performs arithmetic right shift (so that the result remains negative)Amazing work:
from
uint32_t reverse_bits_32(uint32_t x)
{
uint32_t reverse_x = 0;
for (int i = 0; i < 32; i++) {
if((x & (1 << i)))
reverse_x |= 1 << ((32 - 1) - i);
}
return reverse_x;
}
to
uint32_t reverse_bits_32(uint32_t x)
{
uint32_t reverse_x = 0;
for (int i = 0; i < 32; i++){
reverse_x |= ( x >> ( ( 32 - 1 ) - i ) & 1 ) << i;
}
return reverse_x;
}
2's compliment:
positive -> +
negative -> -
carry out -> c
most significant bit => m
(+) + (+) =>
(+) + (-) => c = 0
(-) + (-) =>
conclution => overflow (x,y, x+y) = x'y'(x+y) + xy(x+y)'
&& & || left first
page fault: 3 kind
invalid: page
major: disk
minor : 20:20
ulimit -s stack size
stacksize observation
TODO run the program around 39:30
type size rule: 50: 06
Kill tail recurtion, Q-matrix , 1:20:00
upper string means late unix variable;
todo: modify find.c source code
what is pure function.
typedef struct {
int a[2];
double d;
} s
modify s.a[3] change the value of d;
unsigned short ux = 0xffff;
short x = 0xffff;
printf("x is %d, ux is %d, x==ux : %d\n", x, ux, x==ux);
$ x is -1, ux is 65535, x==ux:0
(gdb) ptype -2147483648
type = unsigned int
(gdb) ptype -2147483647-1
type = int
unsigned i ;
for( i = cnt-2, i<cnt, i--){
// do any thing
}
問:在
$ valgrind ./phonebook_orig
size of entry : 136 bytes
==3341== HEAP SUMMARY:
==3341== in use at exit: 0 bytes in 0 blocks
==3341== total heap usage: **349,906 allocs**, 349,906 frees, 47,596,856 bytes allocated
==3341==
==3341== All heap blocks were freed -- no leaks are possible
...
$ valgrind ./phonebook_opt
size of entry : 24 bytes
...
==3291== HEAP SUMMARY:
==3291== in use at exit: 0 bytes in 0 blocks
==3291== total heap usage: **1,006 allocs**, 1,006 frees, 47,634,336 bytes allocated
==3291==
==3291== All heap blocks were freed -- no leaks are possible
中 為什麼 明明 size of entry 不一樣,bytes allocated 卻差不多呢?
答:因為 pool_alloc()
的 size 沒變
問: linux 中 $! 代表什麼意思呢?
解答:
- $!
is the PID of the last backgrounded process.
- kill -0 $PID
checks whether it's still running.
- $$
is the PID of the current shell.
問: perf stat -e r1a8 -a sleep 1
中的 1a8 是 event, 但是前面的 r 是什麼意思
問: 為什麼初始情形,我執行 ./phonebook_orig & perf top -p $!
的結果和 ierosodin 不一樣,我執行了好幾次都是 100 % sample 到某個 kernel function 去,即使我的 sample rate 已經調到最高
問: 在 make cache-test 前面為什麼不用清除 cache?
問: 為什麼 main.c 不用 include "phone_book.o*.h" 就可以呼叫 append, findname 等 function 了呢?
問: 為什麼 valgrind 說我有以下的 error 呢?
==714== Conditional jump or move depends on uninitialised value(s)
==714== at 0x4E67E6F: tolower (ctype.c:46)
==714== by 0x4C31899: strcasecmp (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==714== by 0x400DE3: findName (phonebook_opt.c:13)
==714== by 0x400B74: main (main.c:80)
==714==
==714== Use of uninitialised value of size 8
==714== at 0x4E67E83: tolower (ctype.c:46)
==714== by 0x4C31899: strcasecmp (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==714== by 0x400DE3: findName (phonebook_opt.c:13)
==714== by 0x400B74: main (main.c:80)
是說 tolower 錯了嗎?
問: 為什麼使用 memory pool 後,cache-misses 竟然從 19.9% 增加到 30 %
問: 為什麼 git rerere
default enable = false ?
問: change git messadge git rebase
遇到 merge
問:
(gdb) ptype -2147483648
type = unsigned int
(gdb) ptype -2147483647-1
type = int
Ans: According to manual: If both operands have the same type, then no further conversion is needed.