Louie Lu
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    # 2016q3 Homework1 (phonebook) contributed by <`louielu`> ### Reviewed by `jserv` * 沒有嘗試不同的 data set,原本的程式輸入是已排序、非典型英文姓氏,這與現實不匹配 * 實作提到透過引入 hash 加速 `fineName()` 的操作,但缺乏不同 hash function 的效能比較和設計取捨 * 在 `append()` 中,`malloc()` 是個顯著的時間開銷,缺乏減緩效能衝擊的方案,而且沒考慮到 `malloc()` 失敗的情況 ![image](https://hackmd.io/_uploads/B1ArXW4pT.png) * 在上圖的環境中,可用記憶體不足以載入 35 萬筆電話資料,於是連 `phonebook_orig` 執行都會失敗: ```shell $ ./phonebook_orig size of entry : 136 bytes 程式記憶體區段錯誤 ``` * 卻乏搜尋演算法的評估和效能分析 * 考慮到電話簿需要作到動態資料新增和刪除,若引入 hash,面對大量資料時,會有什麼影響? * 儘管已經整理頗多 perf 和效能測量的資料,但並未反映到此程式效能分析,除了 cache miss,還請一併探討 branch prediction accuracy 等議題 * `main.c` 無法透過 function pointer 來切換和比較不同實作的效能落差,應該先設計一份可通用的軟體界面,然後將 binary tree, hash table, trie 等不同實作機制加入 * 將 `append()` 和 `findName()` 時間加入統計的意義不大,真實應用往往是個別操作,特別在圖表的呈現 ## 開發環境 * CPU: Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz * MEM: 8GB * Cache: * L1d cache: 32K * L1i cache: 32K * L2 cache: 256K * L3 cache: 4096K ![](https://i.imgur.com/jx9h9w8.png) * [Portable Hardware Locality](https://www.open-mpi.org/projects/hwloc/) (hwloc) ## 開發目標 * 精通 perf * 改善 phonebook 效能 * findName cache miss * append() * fuzzy ## perf * [perf Examples](http://www.brendangregg.com/perf.html) by Brendan Gregg * [lmbench](http://www.bitmover.com/lmbench/) * [如何有效測量 performance ? (FB: embedded2016)](https://www.facebook.com/groups/system.software2016/permalink/1124571464289024/) ### perf - raw counter 格式:`rUUEE` (U = Umask Value, E = Event num) 範例: * Umask: 02H * Event: 11H * Description:`Counts 256-bit packed double-percision floating-polint insturctions` * Event Mask Mnemonic:`SIMD_FP_256.PACKED_DOUBLE` 使用指令:`$ taskset -c 1 perf stat -e r0211 ./time_test_avx` (借用一下 compute-pi, -c 指定 core 1) 讀取register確認:`$ sudo rdmsr -p 1 0x186; output 110211` (-p 指定 core 1) perf 輸出: ``` Performance counter stats for './time_test_avx': 678,616,723 r0211:u 1.698941816 seconds time elapsed ``` 換一個沒有用到 SIMD 的 `$ taskset -c 1 perf stat -e r0211 ./time_test_baseline` perf 輸出: ``` Performance counter stats for './time_test_baseline': 0 r0211:u 5.186517630 seconds time elapsed ``` 明顯會是 0 (不是 0 就是你打錯raw counter,或是你用的不是 sandy bridge) ### Reading PERV_EVT_SEL MSRs (Model-specific register) value `rdmsr` 這個工具可以用來讀取 CPU MSRs (Model-specific register) 的資料。 根據 `Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B` p.3989 `34.7 MSRS IN INTEL PROCESSOR FAMILY (INTEL MICROARCHITECTURE CODE NAME SANDY BRIDGE)` 的 `Table 34-10 (p.3994)` 指出,Register Address 186H ~ 18DH 是 `IA32_PERFEVTSEL` 的暫存器位置。 ![](https://i.imgur.com/uxBc3X9.png) 所以剛才才會指定 `rdmsr 0x186`,因為他是 `IA32_PERFEVTSEL0` 的位置 ### Misleading perf event name! 既然知道怎麼看到現在 event 的 raw code,我們就要來驗證一下,到底 perf list 出來的 events 跟我們想像中的有沒有一樣? ### Reference * [Linux perf #raw counter](http://www.brendangregg.com/perf.html) by Brendan Gregg's * [Intel Sandy Bridge Microarchitecture events](http://oprofile.sourceforge.net/docs/intel-sandybridge-events.php) - 常用 profile events 整理 (可是還是有差,他沒有 event num. 跟 umask value) * [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](https://www.cs.uaf.edu/2011/fall/cs301/intel_x64_megadoc_2011.pdf ) - p.3158 `CHAPTER19 Performance-Monitoring Events`, p.3159 `19.2 Performance Monitoring Events for Intel Core Processor 2xxx Series` example of raw counter umask and event num. ![](https://i.imgur.com/u1YQ46c.png) * [About Events for Intel(R) Microarchitecture Code Name Sandy Bridge](https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm) (searchable) * [Understanding L1, L2, L3 cache misses](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/557604) v## 案例分析: phonebook ### 檔案分析 ### main.c `main.c` 分為兩個函數,一個是 `diff_in_second` 負責計算時間差,一個是 `main` 負責運行整著處理程序。 #### 7: `#include IMPL` 會使用到 gcc -DIMPL="" 的 macro,可以參考 gcc --help > -D name=definition > The contents of definition are tokenized and > processed as if they appearedduring translation > phase three in a `#define` directive. > In particular, the definition will be truncated > by embedded newline characters. 大意就是說 gcc -Dname=definition 的東西會被翻譯成 `#define name definition`。透過這樣的方式,我們可以只用一個 `main.c` 檔來運行 orig 或是 opt 版本的 phonebook。 !!必須要注意!! 在 shell 裏面的話 `gcc -DIMPL"\"phonebook_orig.h\""` 雙引號要做跳脫字元,要不然會一直出錯 ``` main.c:7:10: error: #include expects "FILENAME" or <FILENAME> #include IMPL ``` >> 延伸閱讀: [警告:"no newline at end of file"](http://blog.linux.org.tw/~jserv/archives/001933.html) [name=jserv] >> #### 9: `#define DICT_FILE "./dictionary/words.txt"` 這邊定義使用的字典檔,可以看到是使用 /dictionary/words.txt #### 46: `#if defined(__GNUC__)` gcc 特有的 predefine macro,可以用來偵測 compiler 是不是 gcc,搭配下一行 gcc 的 buildin funtion 來使用。 可以拿起手邊的 `tcc` 跟 `clang` 來玩,加個程式碼 ```c= #if defined(__GNUC__) puts("__GNUC__"); __builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry)); #endif ``` `gcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_gcc` `tcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_tcc` `clang -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_clang` 執行 `phonebook_orig_[gcc,tcc,clang]` 只有 `phonebook_orig_gcc` 會輸出 `__GNUC__` #### 47: `__builtin__clear_cache((char *) pHead, (char *) pHead + sizeof(entry))` gcc 的 [builtin function](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) `__builtin__clear_cache`,用來清除從 begin 到 end 之間的 instructions cache。 ``` (gdb) x/136x (char *) pHead 0x602240: 0x00000000 0x00000000 0x00000000 0x00000000 0x602250: 0x00000000 0x00000000 0x00000000 0x00000000 0x602260: 0x00000000 0x00000000 0x00000000 0x00000000 0x602270: 0x00000000 0x00000000 0x00000000 0x00000000 0x602280: 0x00000000 0x00000000 0x00000000 0x00000000 0x602290: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022a0: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022b0: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022c0: 0x00000000 0x00000000 0x00000411 0x00000000 0x6022d0: 0x657a6973 0x20666f20 0x72746e65 0x203a2079 0x6022e0: 0x20363331 0x65747962 0x00000a73 0x00000000 0x6022f0: 0x00000000 0x00000000 0x00000000 0x00000000 0x602300: 0x00000000 0x00000000 0x00000000 0x00000000 ``` ``` 50 clock_gettime(CLOCK_REALTIME, &start); (gdb) x/136x (char *) pHead 0x602240: 0x00000000 0x00000000 0x00000000 0x00000000 0x602250: 0x00000000 0x00000000 0x00000000 0x00000000 0x602260: 0x00000000 0x00000000 0x00000000 0x00000000 0x602270: 0x00000000 0x00000000 0x00000000 0x00000000 0x602290: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022a0: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022b0: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022c0: 0x00000000 0x00000000 0x00000411 0x00000000 0x6022d0: 0x4e475f5f 0x5f5f4355 0x72746e0a 0x203a2079 0x6022e0: 0x20363331 0x65747962 0x00000a73 0x00000000 0x6022f0: 0x00000000 0x00000000 0x00000000 0x00000000 0x602300: 0x00000000 0x00000000 0x00000000 0x00000000 ``` 中間 0x6022d0 的數值被改變了,不知道有何作用。 #### 49-59: `clock_gettime(CLOCK_REALTIME, &start)` 這邊負責 append 資料到 `*e` 裡面 ```c= e = append(line, e) ``` 使用這種方式來 append,可以降低一般 linked list append 需要 O(n) 的時間,可是必須要維護一個 list head,所以可以看到 line 63: `e = pHead`,把 pHead 指向的位置傳給 e,讓 e 指向 list head 的地方。 ![](https://i.imgur.com/8xCb7RO.png) #### 66-80: `char input[MAX_LAST_NAME_SIZE] = "zyxel";` 這邊先段來看 ```c=66 char input[MAX_LAST_NAME_SIZE] = "zyxel";` ``` 這行把等等要找的 last name 設定為 "zyxel",看到 `words.txt` 裏面,zyxel 是倒數第 10 個,代表` findName()` 一定要加班了 (幫QQ) ```c=67 e = pHead ``` 因為 e 在 append 完之後現在在 last element,我們要讓 e 指回 first element。 ```c=69 assert(findName(input, e) && "Did you implement findName() in " IMPL "?"); assert(0 == strcmp(findName(input, e)->lastName, "zyxel")); ``` ```c=73 #if defined(__GNUC__) __builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry)); #endif /* compute the execution time */ clock_gettime(CLOCK_REALTIME, &start); findName(input, e); clock_gettime(CLOCK_REALTIME, &end); cpu_time2 = diff_in_second(start, end); ``` ## 未優化版本 `$ make run` ``` size of entry : 136 bytes execution time of append() : 0.046524 sec execution time of findName() : 0.006071 sec ``` basic CPU statistic `$ perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles ./phone_orig` ``` Performance counter stats for './phonebook_orig': 175,357,733 cycles:u 229,437,098 instructions:u # 1.31 insn per cycle 1,301,277 cache-references:u 1,277,761 cache-misses:u # 98.193 % of all cache refs 5,058,973 bus-cycles:u 0.066305491 seconds time elapsed ``` CPU L1 cache staticstic ` $ perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-prefetch-misses,L1-dcache-store-misses,L1-icache-load-misses ./phonebook_orig` ``` Performance counter stats for './phonebook_orig': 1,315,576 cache-misses:u # 120.087 % of all cache refs (59.07%) 1,095,522 cache-references:u (60.37%) 1,231,509 L1-dcache-load-misses:u # 1.95% of all L1-dcache hits (78.15%) 63,186,159 L1-dcache-loads:u (81.73%) 3,369,852 L1-dcache-prefetch-misses:u (43.53%) 35,271 L1-dcache-store-misses:u (39.19%) 21,174 L1-icache-load-misses:u (56.70%) 0.068884934 seconds time elapsed ``` ## 縮減 entry size !!!WARNING!!! > 開始之前,不要被表了,先看 code,記得把 phonebook_opt.h 的 `#define OPT 1` 註解拿掉,要不然你怎麼跑 `make plot` 都是沒用的bb 將 entry size 從 136 bytes 縮小為 32 bytes,append 與 findName 的效率都有所提升。 `append` 提升 21.97% 速度,`findName` 提升 122.31 % 速度。 ```c typedef struct __LAST_NAME_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; phonebook_detail *detail; struct __LAST_NAME_ENTRY *pNext; } entry; ``` ![](https://i.imgur.com/4DR7hp3.png) orig cache-test ``` Performance counter stats for './phonebook_orig' (100 runs): 1,181,680 cache-misses:u # 95.942 % of all cache refs ( +- 0.49% ) 1,231,662 cache-references:u ( +- 0.41% ) 229,689,183 instructions:u # 1.24 insn per cycle ( +- 0.02% ) 185,425,108 cycles:u ( +- 0.46% ) 0.079698901 seconds time elapsed ( +- 0.83% ) ``` entry size cache-test ``` Performance counter stats for './phonebook_opt' (100 runs): 89,105 cache-misses:u # 32.843 % of all cache refs ( +- 0.98% ) 271,309 cache-references:u ( +- 0.70% ) 233,120,203 instructions:u # 1.62 insn per cycle ( +- 0.02% ) 143,480,575 cycles:u ( +- 0.35% ) 0.053849988 seconds time elapsed ( +- 0.78% ) ``` cache misses 從 95% 下降到 32%!太神啦~ ## Hash table ### Hash function 從記憶裏面找到這個網站: [各种字符串Hash函数比较](https://www.byvoid.com/blog/string-hash-compare) by byvoid 採用推薦的 `BKDR hash function` (就是 K&R,鼎鼎大名的 Brian Kernighan 和 Dennis Ritchie 在《[The C Programming Language](https://en.wikipedia.org/wiki/The_C_Programming_Language)》裏面寫的 hash function: ```c= unsigned int BKDRHash(char *str) { unsigned int seed = 131; unsigned int hash = 0; while (*str) { hash = hash * seed + (*str++); } return (hash & 0x7FFFFFFF); } ``` ### Hash table 因為很喜歡使用 Python,所以決定了解 Python 的 `dict` 是如何實作的,同時運用在這次的作業當中。 參考 [Python Dictionary Implementation](http://www.laurentluce.com/posts/python-dictionary-implementation/) by Laurent Luce,可以發現一件重要的事情: * Python dict 是採用 [open addressing](https://en.wikipedia.org/wiki/Open_addressing) 的方式處理 collision #### Open Addressing 系統程式的老師上課有教過,可是忘記了......。看到 wiki 就想起來了。 Open addressing 是當 hash table 出現撞擊的時候的處理方式,運用不同的探針 (probe) 來決定這個 value 的 index。 Open addressing 最大的好處就是,可以將 hash table 的查詢時間降低為 O(1),以往用 list 來處理 collision 的話會讓查詢時間出現 worse case O(n)。 常見的 probe 以下幾種: * Linear probing: +1 大法 * Quadratic probing : new_index = f(old_index); Python 使用 Quadratic probing,使用的 function 如下: ```c= perturb = hash; index = (index << 2) + index + perturb + 1; perturb >>= PERTURB_SHIFT; ``` PERTURB_SHIFT 預設是 5,不能小於 1。 假設 hash table size = 32,hash = 187875484, index = 28 跑 probing 的結果會是: ``` 28 -> 9 -> 18 -> 11 -> 29 -> 5 -> 31 -> 28 -> 13 -> 2 -> 11 -> 24 -> 25 -> 30 -> 23 -> 20 -> 5 -> 26 -> 3 -> 16 -> 17 -> .... ``` #### Dict Structure Python dict 的 structure 總共用到兩個,一個是 `PyDictEntry` 作為 key value structure、一個是 `PyDictObject` 作為整個 Dict 的 Object 可以在 [svn.python.org](http://svn.python.org/view/python/trunk/Objects/dictobject.c?view=markup) 找到 `dictobject.c` 的原始碼,如果你想要做的話,打開他,你會用到的。 在這邊把他特化為 phonebook 專用的 `DictEntry` 跟 `DictObject` 如下: ```c= typedef struct { unsigned int hash; char lastName[MAX_LAST_NAME_SIZE]; phonebook_detail *detail; } DictEntry; typedef struct _dictobject DictObject; struct _dictobject { unsigned int ma_mask; int ma_used; int ma_fill; int ma_size; DictEntry *ma_table; DictEntry *(*ma_lookup)(DictObject *mp, char key[], unsigned int hash); }; ``` * `unsigned int ma_mask`: hash mask, size - 1 * `ma_used`:目前使用的 slot 數量 * `ma_fill`:目前使用的 slot + dummy slot,dummy slot 是刪除 slot 之後,會將 slot 設定為 dummy slot * `ma_size`:目前 ma_table 的大小 * `ma_table`:hash table * `ma_lookup`:lookup function #### Dict function ```c= DictObject *Dict_New(); int Dict_SetItem(DictObject *mp, char key[]); int Dict_Lookup(DictObject *mp, char key[]); DictEntry *lookdict(DictObject *mp, char key[], unsigned int hash); int insertdict(DictObject *mp, char key[], const unsigned int hash); void insertdict_clean(DictObject *mp, char key[], unsigned int hash, phonebook_detail *value); int dictresize(DictObject *mp, int minused); unsigned int BKDRHash(char *s); ``` ### Hashtable 效能 ``` Performance counter stats for './phonebook_opt' (100 runs): 1,553,358 cache-misses:u # 68.207 % of all cache refs ( +- 0.28% ) 2,277,402 cache-references:u ( +- 0.20% ) 270,305,408 instructions:u # 0.62 insn per cycle ( +- 0.02% ) 437,887,064 cycles:u ( +- 0.34% ) 0.162588847 seconds time elapsed ( +- 0.60% ) ``` ![](https://i.imgur.com/2iq4E05.png) * cache-misses 提升到 68.207% * append 時間也上升到 0.147秒,跟 orig 相比慢了3倍 * findName 時間降到 0 秒 電話簿這種東西建完就可以用很久了,findName 降到 0 應該滿棒的啊! ## Performance 比較 - abandoned * append() all word in `dictionary/words.txt` * `findName("zyxel")` 100 times 這不是個好的實驗設計。 ``` Performance counter stats for './phonebook_orig' (100 runs): 42,431,897 cache-misses:u # 99.071 % of all cache refs ( +- 0.11% ) 42,829,855 cache-references:u ( +- 0.09% ) 2,164,254,681 instructions:u # 1.04 insn per cycle ( +- 0.00% ) 2,079,356,180 cycles:u ( +- 0.42% ) 0.653479071 seconds time elapsed ( +- 0.27% ) ``` ``` Performance counter stats for './phonebook_opt' (100 runs): 2,481,889 cache-misses # 33.237 % of all cache refs ( +- 0.36% ) 7,467,180 cache-references ( +- 0.29% ) 2,167,727,565 instructions # 2.21 insn per cycle ( +- 0.00% ) 981,860,196 cycles ( +- 0.18% ) 0.410072312 seconds time elapsed ( +- 1.42% ) ``` ``` Performance counter stats for './phonebook_opt_hash' (100 runs): 1,556,307 cache-misses # 68.441 % of all cache refs ( +- 0.43% ) 2,273,932 cache-references ( +- 0.21% ) 270,227,846 instructions # 0.65 insn per cycle ( +- 0.02% ) 412,602,985 cycles ( +- 0.29% ) 0.211090389 seconds time elapsed ( +- 1.22% ) ``` ![](https://i.imgur.com/ECYVJE8.png) ## Performance - Intel Hierarchical Top-Down Performance Characterization Methodology * Reference from: http://www.brendangregg.com/methodology.html * Are UOPs issued? * If yes: * Are UOPs retired? * If yes: retiring (good) * If no: investigate bad speculations * If no: * Allocation stall? * If yes: investigate back-end stalls * If no: investigate front-end stalls ### CPU front-end 意義 簡單來說前端就是:取得指令、以及把指令解碼成微指令。sandy bridge 系列可以一次傳 4 個 uops 給後端 [中文翻譯](https://blog.louie.lu/2016/09/07/pipeline-speak-learning-more-about-intel-microarchitechture-codename-sandy-bridge-%E4%B8%AD%E6%96%87%E7%BF%BB%E8%AD%AF/) by louielu ### CPU back-end 意義 簡單來說後端就是:執行 uops,有 parallelism [中文翻譯](https://blog.louie.lu/2016/09/07/pipeline-speak-part-2-the-second-part-of-the-sandy-bridge-pipeline-%E4%B8%AD%E6%96%87%E7%BF%BB%E8%AD%AF/) by louielu ### phonebook case ``` $ perf stat -d ./phonebook_oirg Performance counter stats for './phonebook_orig': 66.166969 task-clock:u (msec) # 0.995 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 12,346 page-faults:u # 0.187 M/sec 113,680,547 cycles:u # 1.718 GHz (34.24%) 82,373,177 stalled-cycles-frontend:u # 72.46% frontend cycles idle (48.06%) 64,255,629 stalled-cycles-backend:u # 56.52% backend cycles idle (50.65%) 186,811,699 instructions:u # 1.64 insn per cycle # 0.44 stalled cycles per insn (62.39%) 34,733,972 branches:u # 524.944 M/sec (63.81%) 789,273 branch-misses:u # 2.27% of all branches (62.64%) 49,760,981 L1-dcache-loads:u # 752.052 M/sec (44.50%) 56,790 L1-dcache-load-misses:u # 0.11% of all L1-dcache hits (18.82%) 4,932 LLC-loads:u # 0.075 M/sec (23.20%) <not supported> LLC-load-misses:u 0.066500871 seconds time elapsed ``` ``` $ perf stat -d ./phonebook_opt Performance counter stats for './phonebook_opt': 41.165953 task-clock:u (msec) # 0.993 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 4,147 page-faults:u # 0.101 M/sec 89,513,769 cycles:u # 2.174 GHz (30.69%) 47,728,782 stalled-cycles-frontend:u # 53.32% frontend cycles idle (45.97%) 30,544,573 stalled-cycles-backend:u # 34.12% backend cycles idle (50.45%) 186,816,945 instructions:u # 2.09 insn per cycle # 0.26 stalled cycles per insn (61.88%) 35,932,936 branches:u # 872.880 M/sec (63.67%) 848,250 branch-misses:u # 2.36% of all branches (70.17%) 54,843,255 L1-dcache-loads:u # 1332.248 M/sec (41.03%) 26,195 L1-dcache-load-misses:u # 0.05% of all L1-dcache hits (15.29%) 2,297 LLC-loads:u # 0.056 M/sec (22.03%) <not supported> LLC-load-misses:u 0.041471519 seconds time elapsed ``` ## A Re think of WHY? * 在不改變讀取方式以及 append code 的狀況下,為什麼降低 struct size 可以增加 append 效能? * sizeof((entry *) 0x0) -> 8 * entry pointer 的大小不會改變啊。 * 推測是 malloc size 對於速度有差異 * proof it. ## pthread boss/worker model [pthreads-example/example3-boss-worker-model.c](https://github.com/fbroom/pthreads-examples/blob/master/example3-boss-worker-model.c)

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully