2017q3 Homework1 (phonebook) === ###### tags: `sysprog2017` contributed by <`river85511`> --- ## [作業要求](https://hackmd.io/s/HJJUmdXsZ) ## 開發環境 CPU: ``` $ lscpu L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 3072K ``` ``` $ dmidecode -t cache # dmidecode 3.0 Getting SMBIOS data from sysfs. SMBIOS 2.6 present. Handle 0x0002, DMI type 7, 19 bytes Cache Information Socket Designation: L1-Cache Configuration: Enabled, Not Socketed, Level 1 Operational Mode: Write Through Location: Internal Installed Size: 64 kB Maximum Size: 64 kB Supported SRAM Types: Synchronous Installed SRAM Type: Synchronous Speed: Unknown Error Correction Type: Single-bit ECC System Type: Data Associativity: 8-way Set-associative Handle 0x0003, DMI type 7, 19 bytes Cache Information Socket Designation: L2-Cache Configuration: Enabled, Not Socketed, Level 2 Operational Mode: Write Through Location: Internal Installed Size: 256 kB Maximum Size: 256 kB Supported SRAM Types: Synchronous Installed SRAM Type: Synchronous Speed: Unknown Error Correction Type: Single-bit ECC System Type: Data Associativity: 8-way Set-associative Handle 0x0004, DMI type 7, 19 bytes Cache Information Socket Designation: L3-Cache Configuration: Enabled, Not Socketed, Level 3 Operational Mode: Write Back Location: Internal Installed Size: 3072 kB Maximum Size: 3072 kB Supported SRAM Types: Synchronous Installed SRAM Type: Synchronous Speed: Unknown Error Correction Type: Single-bit ECC System Type: Unified Associativity: 12-way Set-associative ``` * 可以得知 * `L1 cache` 的 `Associativity` 為 `8-way Set-associative` * `L2 cache` 的 `Associativity` 為 `8-way Set-associative` * `L3 cache` 的 `Associativity` 為 `12-way Set-associative` 系統版本: ``` $ uname -r 4.10.0-28-generic ``` ## Perf 配置 安裝 **perf** ``` $ sudo apt-get install linux-tools-4.10.0-37-generic $ sudo apt-get install linux-cloud-tools-4.10.0-37-generic ``` 接著輸入: ``` $ perf top ``` 出現以下畫面: ![perf_top_error_1](https://i.imgur.com/8fz3qZ8.png) 從此圖可以發現,我的系統的 `perf_event_paranoid` 的預設值為 `3`。 若想將 `perf_event_paranoid` 的值改為 `1`,可以輸入: `$ sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid' ` 如果要檢測 cache miss event ,需要先取消 kernel pointer 的禁用。 `$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"` 若想知道 `perf` 可以監測的效能種類: ``` $ perf list branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] ref-cycles [Hardware event] stalled-cycles-backend OR idle-cycles-backend [Hardware event] stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] alignment-faults [Software event] bpf-output [Software event] context-switches OR cs [Software event] cpu-clock [Software event] cpu-migrations OR migrations [Software event] dummy [Software event] emulation-faults [Software event] major-faults [Software event] minor-faults [Software event] page-faults OR faults [Software event] task-clock [Software event] L1-dcache-load-misses [Hardware cache event] L1-dcache-loads [Hardware cache event] L1-dcache-prefetch-misses [Hardware cache event] L1-dcache-store-misses [Hardware cache event] L1-dcache-stores [Hardware cache event] L1-icache-load-misses [Hardware cache event] LLC-loads [Hardware cache event] LLC-prefetches [Hardware cache event] LLC-stores [Hardware cache event] branch-load-misses [Hardware cache event] branch-loads [Hardware cache event] dTLB-load-misses [Hardware cache event] dTLB-loads [Hardware cache event] dTLB-store-misses [Hardware cache event] dTLB-stores [Hardware cache event] iTLB-load-misses [Hardware cache event] iTLB-loads [Hardware cache event] ``` > 這邊只約略列出 ## gnuplot 配置 安裝 **gnuplot** ``` $ sudo apt-get install gnuplot ``` 為了要看 gnuplot 畫出來的圖,得安裝 **eog** ``` $ sudo apt-get install eog ``` ## CPU Cache 原理探討 由於避免將整個版面拖的太長,將內容在轉到另外一篇[筆記](https://hackmd.io/s/H1U6NgK3Z)上。 ## 案例分析 ### 原始程式碼 首先是 `phonebook_orig.h` ``` C #ifndef _PHONEBOOK_H #define _PHONEBOOK_H #define MAX_LAST_NAME_SIZE 16 /* original version */ typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; struct __PHONE_BOOK_ENTRY *pNext; } entry; entry *findName(char lastName[], entry *pHead); entry *append(char lastName[], entry *e); #endif ``` 可以發現 `entry` : * 大小為 `136 bytes` * `123 * 1 (char array) + 4 * 1 (padding) + 8 * 1 (pointer) = 136 bytes` * 關於 padding,可以參考 [Data Alignment](https://www.youtube.com/watch?v=OKjOZBaKlOc) 和 [The Lost Art of C Structure Packing](http://www.catb.org/esr/structure-packing/) * 結構為 `linked-list` 接著是 `phonebook_orig.c` ``` C #include <stdio.h> #include <stdlib.h> #include <string.h> #include <ctype.h> #include "phonebook_orig.h" /* original version */ entry *findName(char lastName[], entry *pHead) { while (pHead != NULL) { if (strcasecmp(lastName, pHead->lastName) == 0) return pHead; pHead = pHead->pNext; } return NULL; } entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); e = e->pNext; strcpy(e->lastName, lastName); e->pNext = NULL; return e; } ``` 可以得知: * `findname()` 是 `linked-list` 的 `search` 的實作。 * `append()` 是 `linked-list` 的 `insert` 的實作。 接著透過 `perf` 來分析原始程式碼的效能。首先先到 `phonebook` 的路徑底下 ``` $ make cc -Wall -std=gnu99 -O0 \ -DIMPL="\"phonebook_orig.h\"" -o phonebook_orig \ main.c phonebook_orig.c cc -Wall -std=gnu99 -O0 \ -DIMPL="\"phonebook_opt.h\"" -o phonebook_opt \ main.c phonebook_opt.c ``` 在用 `perf` 測試之前,先將 cache 裡的資料清除 ``` $ echo 3 | sudo tee /proc/sys/vm/drop_caches ``` 接著,正式開始測試。 ``` $ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_orig ``` ``` Performance counter stats for './phonebook_orig' (100 runs): 135,4197 cache-misses # 64.058 % of all cache refs ( +- 1.32% ) (54.08%) 211,4019 cache-references ( +- 0.67% ) (55.73%) 2,6831,3537 instructions # 1.08 insn per cycle ( +- 0.28% ) (71.16%) 2,4873,8327 cycles ( +- 0.88% ) (72.97%) 264,7532 L1-dcache-load-misses ( +- 0.33% ) (75.54%) 324,3224 L1-dcache-prefetch-misses ( +- 0.89% ) (74.57%) 100,5760 L1-dcache-store-misses ( +- 0.86% ) (70.65%) 0.084016902 seconds time elapsed ( +- 1.13% ) ``` 發現到 * `cache-misses` 的次數高達 `135,4197` 次 * `L1-dcache-load-misses` 的次數為 `264,7532` 次 * `L1-dcache-prefetch-misses` 的次數為 `324,3224` 次 * `L1-dcache-store-misses` 的次數為 `100,5760` 次 很明顯的,若想要提高這個程式的效能,就必須減少 `cache-miss` 的次數。 ### 優化程式碼 #### **實驗一** 已知條件: * `CPU` 的 `L1d cache` 的大小為 `32K`。 * `L1 cache` 的 `Associativity` 為 `8-way Set-associative`。 * `entry` 這個結構體的大小為 `136 bytes`。 思考: * 由於 `L1d Cache` 的大小有限,因此若無法將所有的資料都儲存在 `L1d cache` 中,自然就會發生 `L1-dcache-load-misses`。 * 在 `./dictionary/words.txt` 中的資料共有 349900 筆。 * `L1d Cache` 能夠儲存 `entry` 的筆數為 ${(32 * 1024) / 136 = 240.9411}$ 筆 * 因此,若減小 `entry` 的大小,則代表 `L1d cache` 裡可以儲存更多筆 `entry` 的資料,即能降低 `L1-dcache-load-misses` 的次數。 作法: * 首先每次在搜尋時,只需要用到 `lastName` 的資訊,因此,為了驗證將 `entry` 的大小減少,可以減少 `cache-misses` 的次數,我將 `entry` 裡除了 `lastName` 以外的資訊用另一個 `entryDetail` 的 struct 來表示,以縮小 `entry` 的大小。 ``` C $ cat phonebook_opt.h #ifndef _PHONEBOOK_H #define _PHONEBOOK_H #define MAX_LAST_NAME_SIZE 16 /* TODO: After modifying the original version, uncomment the following * line to set OPT properly */ #define OPT 1 typedef struct __PHONE_BOOK_ENTRY_DETAIL { char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; } entryDetail; typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; entryDetail* detail; struct __PHONE_BOOK_ENTRY *pNext; } entry; entry *findName(char lastName[], entry *pHead); entry *append(char lastName[], entry *e); #endif ``` 至於 `phonebook_opt.c` 則維持與 `phonebook_orig.c` 相同。接著,開始測試。 預期結果: * 由於 `entry` 的大小從 `136 bytes` 縮減為 `32 bytes`,因此 `L1d Cache` 能夠儲存 `entry` 的筆數為 ${(32∗1024)/32=1024}$ 筆,約為 $240.9411$ 的 $4.2$ 倍。 * 因此預期能夠大幅減少 `cache-misses` 的次數。 實驗結果: ``` $ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_opt ``` ``` Performance counter stats for './phonebook_opt' (100 runs): 28,2969 cache-misses # 42.799 % of all cache refs ( +- 2.40% ) (52.79%) 66,1154 cache-references ( +- 2.04% ) (54.33%) 2,4136,2527 instructions # 1.50 insn per cycle ( +- 0.36% ) (70.09%) 1,6113,2612 cycles ( +- 1.29% ) (72.63%) 118,7456 L1-dcache-load-misses ( +- 2.22% ) (77.39%) 220,9291 L1-dcache-prefetch-misses ( +- 2.78% ) (77.45%) 36,1261 L1-dcache-store-misses ( +- 2.34% ) (71.55%) 0.054441195 seconds time elapsed ( +- 1.36% ) ``` ![gnuplot_test1](https://i.imgur.com/hatA50Z.png) 實驗結果觀察: * `cache-misses` 的次數從 `135,4197` 下降到 `28,2969` * `L1-dcache-load-misses` 的次數從 `264,7532` 下降到 `118,7456` * `L1-dcache-prefetch-misses` 的次數從 `324,3224` 下降到 `220,9291` * `L1-dcache-store-misss`的次數從 `100,5760` 下降到 `36,1261` * `append()` 的花費時間從 `0.056262` 秒 下降到 `0.042920` 秒。 * `findName()` 的花費時間從 `0.007348` 秒 下降到 `0.003211` 秒。 #### **實驗二**