2017q3 Homework1 (phonebook)
===
###### tags: `sysprog2017`
contributed by <`river85511`>
---
## [作業要求](https://hackmd.io/s/HJJUmdXsZ)
## 開發環境
CPU:
```
$ lscpu
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
```
```
$ dmidecode -t cache
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.6 present.
Handle 0x0002, DMI type 7, 19 bytes
Cache Information
Socket Designation: L1-Cache
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Through
Location: Internal
Installed Size: 64 kB
Maximum Size: 64 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Data
Associativity: 8-way Set-associative
Handle 0x0003, DMI type 7, 19 bytes
Cache Information
Socket Designation: L2-Cache
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Write Through
Location: Internal
Installed Size: 256 kB
Maximum Size: 256 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Data
Associativity: 8-way Set-associative
Handle 0x0004, DMI type 7, 19 bytes
Cache Information
Socket Designation: L3-Cache
Configuration: Enabled, Not Socketed, Level 3
Operational Mode: Write Back
Location: Internal
Installed Size: 3072 kB
Maximum Size: 3072 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: 12-way Set-associative
```
* 可以得知
* `L1 cache` 的 `Associativity` 為 `8-way Set-associative`
* `L2 cache` 的 `Associativity` 為 `8-way Set-associative`
* `L3 cache` 的 `Associativity` 為 `12-way Set-associative`
系統版本:
```
$ uname -r
4.10.0-28-generic
```
## Perf 配置
安裝 **perf**
```
$ sudo apt-get install linux-tools-4.10.0-37-generic
$ sudo apt-get install linux-cloud-tools-4.10.0-37-generic
```
接著輸入:
```
$ perf top
```
出現以下畫面:

從此圖可以發現,我的系統的 `perf_event_paranoid` 的預設值為 `3`。
若想將 `perf_event_paranoid` 的值改為 `1`,可以輸入:
`$ sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'
`
如果要檢測 cache miss event ,需要先取消 kernel pointer 的禁用。
`$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"`
若想知道 `perf` 可以監測的效能種類:
```
$ perf list
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-stores [Hardware cache event]
branch-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
dTLB-store-misses [Hardware cache event]
dTLB-stores [Hardware cache event]
iTLB-load-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
```
> 這邊只約略列出
## gnuplot 配置
安裝 **gnuplot**
```
$ sudo apt-get install gnuplot
```
為了要看 gnuplot 畫出來的圖,得安裝 **eog**
```
$ sudo apt-get install eog
```
## CPU Cache 原理探討
由於避免將整個版面拖的太長,將內容在轉到另外一篇[筆記](https://hackmd.io/s/H1U6NgK3Z)上。
## 案例分析
### 原始程式碼
首先是 `phonebook_orig.h`
``` C
#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H
#define MAX_LAST_NAME_SIZE 16
/* original version */
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
char firstName[16];
char email[16];
char phone[10];
char cell[10];
char addr1[16];
char addr2[16];
char city[16];
char state[2];
char zip[5];
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);
#endif
```
可以發現 `entry` :
* 大小為 `136 bytes`
* `123 * 1 (char array) + 4 * 1 (padding) + 8 * 1 (pointer) = 136 bytes`
* 關於 padding,可以參考 [Data Alignment](https://www.youtube.com/watch?v=OKjOZBaKlOc) 和 [The Lost Art of C Structure Packing](http://www.catb.org/esr/structure-packing/)
* 結構為 `linked-list`
接著是 `phonebook_orig.c`
``` C
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "phonebook_orig.h"
/* original version */
entry *findName(char lastName[], entry *pHead)
{
while (pHead != NULL) {
if (strcasecmp(lastName, pHead->lastName) == 0)
return pHead;
pHead = pHead->pNext;
}
return NULL;
}
entry *append(char lastName[], entry *e)
{
/* allocate memory for the new entry and put lastName */
e->pNext = (entry *) malloc(sizeof(entry));
e = e->pNext;
strcpy(e->lastName, lastName);
e->pNext = NULL;
return e;
}
```
可以得知:
* `findname()` 是 `linked-list` 的 `search` 的實作。
* `append()` 是 `linked-list` 的 `insert` 的實作。
接著透過 `perf` 來分析原始程式碼的效能。首先先到 `phonebook` 的路徑底下
```
$ make
cc -Wall -std=gnu99 -O0 \
-DIMPL="\"phonebook_orig.h\"" -o phonebook_orig \
main.c phonebook_orig.c
cc -Wall -std=gnu99 -O0 \
-DIMPL="\"phonebook_opt.h\"" -o phonebook_opt \
main.c phonebook_opt.c
```
在用 `perf` 測試之前,先將 cache 裡的資料清除
```
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
```
接著,正式開始測試。
```
$ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_orig
```
```
Performance counter stats for './phonebook_orig' (100 runs):
135,4197 cache-misses # 64.058 % of all cache refs ( +- 1.32% ) (54.08%)
211,4019 cache-references ( +- 0.67% ) (55.73%)
2,6831,3537 instructions # 1.08 insn per cycle ( +- 0.28% ) (71.16%)
2,4873,8327 cycles ( +- 0.88% ) (72.97%)
264,7532 L1-dcache-load-misses ( +- 0.33% ) (75.54%)
324,3224 L1-dcache-prefetch-misses ( +- 0.89% ) (74.57%)
100,5760 L1-dcache-store-misses ( +- 0.86% ) (70.65%)
0.084016902 seconds time elapsed ( +- 1.13% )
```
發現到
* `cache-misses` 的次數高達 `135,4197` 次
* `L1-dcache-load-misses` 的次數為 `264,7532` 次
* `L1-dcache-prefetch-misses` 的次數為 `324,3224` 次
* `L1-dcache-store-misses` 的次數為 `100,5760` 次
很明顯的,若想要提高這個程式的效能,就必須減少 `cache-miss` 的次數。
### 優化程式碼
#### **實驗一**
已知條件:
* `CPU` 的 `L1d cache` 的大小為 `32K`。
* `L1 cache` 的 `Associativity` 為 `8-way Set-associative`。
* `entry` 這個結構體的大小為 `136 bytes`。
思考:
* 由於 `L1d Cache` 的大小有限,因此若無法將所有的資料都儲存在 `L1d cache` 中,自然就會發生 `L1-dcache-load-misses`。
* 在 `./dictionary/words.txt` 中的資料共有 349900 筆。
* `L1d Cache` 能夠儲存 `entry` 的筆數為 ${(32 * 1024) / 136 = 240.9411}$ 筆
* 因此,若減小 `entry` 的大小,則代表 `L1d cache` 裡可以儲存更多筆 `entry` 的資料,即能降低 `L1-dcache-load-misses` 的次數。
作法:
* 首先每次在搜尋時,只需要用到 `lastName` 的資訊,因此,為了驗證將 `entry` 的大小減少,可以減少 `cache-misses` 的次數,我將 `entry` 裡除了 `lastName` 以外的資訊用另一個 `entryDetail` 的 struct 來表示,以縮小 `entry` 的大小。
``` C
$ cat phonebook_opt.h
#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H
#define MAX_LAST_NAME_SIZE 16
/* TODO: After modifying the original version, uncomment the following
* line to set OPT properly */
#define OPT 1
typedef struct __PHONE_BOOK_ENTRY_DETAIL {
char firstName[16];
char email[16];
char phone[10];
char cell[10];
char addr1[16];
char addr2[16];
char city[16];
char state[2];
char zip[5];
} entryDetail;
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
entryDetail* detail;
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);
#endif
```
至於 `phonebook_opt.c` 則維持與 `phonebook_orig.c` 相同。接著,開始測試。
預期結果:
* 由於 `entry` 的大小從 `136 bytes` 縮減為 `32 bytes`,因此 `L1d Cache` 能夠儲存 `entry` 的筆數為 ${(32∗1024)/32=1024}$ 筆,約為 $240.9411$ 的 $4.2$ 倍。
* 因此預期能夠大幅減少 `cache-misses` 的次數。
實驗結果:
```
$ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_opt
```
```
Performance counter stats for './phonebook_opt' (100 runs):
28,2969 cache-misses # 42.799 % of all cache refs ( +- 2.40% ) (52.79%)
66,1154 cache-references ( +- 2.04% ) (54.33%)
2,4136,2527 instructions # 1.50 insn per cycle ( +- 0.36% ) (70.09%)
1,6113,2612 cycles ( +- 1.29% ) (72.63%)
118,7456 L1-dcache-load-misses ( +- 2.22% ) (77.39%)
220,9291 L1-dcache-prefetch-misses ( +- 2.78% ) (77.45%)
36,1261 L1-dcache-store-misses ( +- 2.34% ) (71.55%)
0.054441195 seconds time elapsed ( +- 1.36% )
```

實驗結果觀察:
* `cache-misses` 的次數從 `135,4197` 下降到 `28,2969`
* `L1-dcache-load-misses` 的次數從 `264,7532` 下降到 `118,7456`
* `L1-dcache-prefetch-misses` 的次數從 `324,3224` 下降到 `220,9291`
* `L1-dcache-store-misss`的次數從 `100,5760` 下降到 `36,1261`
* `append()` 的花費時間從 `0.056262` 秒 下降到 `0.042920` 秒。
* `findName()` 的花費時間從 `0.007348` 秒 下降到 `0.003211` 秒。
#### **實驗二**