# 2017q1 Homework1 (phonebook)
###### tags: `sysprog2017` `phonebook`
contributed by <`stanleytazi`>
### Reviewed by `etc276`
- [這次]("https://github.com/stanleytazi/phonebook-1/commit/8facfecd969da56755e4a8e824ed311ca37ab0f0") commit message 的 title 是 "Replace single linked-list by hash table" ,但內文並沒有具體補充說明 hash 後會比 single linked-list 優化的原因(看完內容沒有比看完標題時更懂這次 commit 的理由,以及更改後的優缺點)
- 在 Hackmd 上放 code 時,盡量都使用 `clike =` ,以便使用者閱讀時知道行數(例如 memory pool 那邊),也較好提出建議,開發環境的內容也需要排版。
- 後面有張圖顯示優化後的 append() 時間並沒有下降,我猜測原因是 hash table size 的選擇,可以參考[這篇]("http://thuhak.blog.51cto.com/2891595/1352903")和[共筆]("https://hackmd.io/s/Hyr9PhFYx"),選擇特定數量級的質數以優化
## 開發環境
### CPU 基本資訊
指令 `$ lscpu`
```shell
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
每核心執行緒數:2
每通訊端核心數:2
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 69
Model name: Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
製程: 1
CPU MHz: 1400.312
CPU max MHz: 2700.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.20
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 3072K
NUMA node0 CPU(s): 0-3
```
### cache line size
>中英文字間請以空白隔開!
>[name=課程助教][color=red]
指令 `$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size`
得到 ==64==(bytes? 猜測,待文件或是實驗查證)
在補充資料中皆有提到目前主流是64B
+ 實驗查證
[cache補充資料](http://cenalulu.github.io/linux/all-about-cpu-cache/) 中有一個程式來觀察 cache line size 對不同 array size 做 read 操作時的影響。
結果如下:()
## 原始版本執行結果
+ 觀察cache使用狀況
```
$ make plot
```
可以看到 cache-misses 的比例很高,***首要目標降低 cache-misses rate***
```shell
Performance counter stats for './phonebook_orig' (100 runs):
1,024,254 cache-misses # 90.150 % of all cache refs
1,140,377 cache-references
263,685,295 instructions # 1.48 insns per cycle
178,278,149 cycles
0.067626164 seconds time elapsed ( +- 0.13% )
```
+ 執行時間 (optimized 尚未實作)

+ 現有append()&findName()只需要lastName就可以了
+ 原先的儲存人員資料的structure太多資訊在上述兩個function是不需要的
```clike
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
char firstName[16];
char email[16];
char phone[10];
char cell[10];
char addr1[16];
char addr2[16];
char city[16];
char state[2];
char zip[5];
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
```
+ perf top
+ 可以觀察到消耗CPU cycle最多的為strcasecmp,可以查看function實作看是否有更改的空間或是利用其他方式提前判斷兩個字串是不一樣的而不用呼叫此function
+ 第二名為findName,這邊會需要更改 single-linked list的連結方式以及改善cache misses rate

## 可實驗方向
+ 更改 **struct __PHONE_BOOK_ENTRY** 儲存資料方式
+ 減少呼叫strcasecmp
+ append內所需要的空間可以在程式一開始產生出來
+ 利用 hash function
## 實驗一 減少cache misses rate
### 更改 **struct __PHONE_BOOK_ENTRY** 儲存資料方式
+ in `phonebook_opt.h`
```clike
typedef struct {
char firstName[16];
char email[16];
char phone[10];
char cell[10];
char addr1[16];
char addr2[16];
char city[16];
char state[2];
char zip[5];
} entry_detail;
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
struct __PHONE_BOOK_ENTRY *pNext;
entry_detail *info;
} entry;
```
+ 與 orig 版本 append()&findName() 執行時間比較

+ phonebook_opt 的 cache使用狀況
+
```shell
Performance counter stats for './phonebook_opt' (100 runs):
114,443 cache-misses # 29.189 % of all cache refs
397,487 cache-references
245,583,148 instructions # 2.00 insns per cycle
122,015,515 cycles
0.047291252 seconds time elapsed ( +- 0.26% )
```
## 實驗二 - 利用 hash function
### 改用 hash table
+ 這邊使用作業說明([各种字符串Hash函数比较](https://www.byvoid.com/zhs/blog/string-hash-compare))提及的**BKDRHash**
```clike=
```
```clike=
unsigned int BKDRHash(char *str)
{
unsigned int seed = 131;
unsigned int hash = 0;
while (*str) {
hash = hash * seed + (*str++);
}
return (hash & 0x7FFFFFFF);
}
```
+ 把Hash value也儲存到 entry裡面
```clike=
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
unsigned int hashVal;
entry_detail *info;
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
```
+ findName() & append() 實作
```clike=
#define MAX_HASH_TABLE_SIZE 1024
entry *findName(char lastName[], entry *pHead)
{
unsigned int hVal = BKDRHash(lastName);
unsigned char hashTableIndex = hVal % MAX_HASH_TABLE_SIZE;
pHead = hashTable[hashTableIndex];
while (pHead != NULL) {
if (pHead->hashVal == hVal &&
strcasecmp(lastName, pHead->lastName) == 0)
return pHead;
pHead = pHead->pNext;
}
return NULL;
}
entry *append(char lastName[], entry *e)
{
unsigned char hashTableIndex ;
/* allocate memory for the new entry and put lastName */
entry *hashEntry = (entry *) malloc(sizeof(entry));
strcpy(hashEntry->lastName, lastName);
hashEntry->hashVal = BKDRHash(lastName);
hashTableIndex = hashEntry->hashVal % MAX_HASH_TABLE_SIZE;
if(hashTable[hashTableIndex]) {
hashEntry->pNext = hashTable[hashTableIndex];
}
hashTable[hashTableIndex] = hashEntry;
return hashEntry;
}
```
+ 與 phonebook_orig 版本實行時間比較圖
+ 可以看到 append()所花費的時間比實驗一的還多,與orig版本一樣,但是findName() 的執行時間縮短很多

+ 利用 perf 可以看到 cpu 主要花在 BKDRHash()
```shell
# Samples: 528 of event 'cycles'
# Event count (approx.): 131645130
#
# Overhead Command Shared Object Symbol
# ........ ............. ................. ......................................
#
24.16% phonebook_opt phonebook_opt [.] BKDRHash
18.10% phonebook_opt phonebook_opt [.] main
10.62% phonebook_opt libc-2.21.so [.] _int_malloc
9.17% phonebook_opt phonebook_opt [.] append
8.80% phonebook_opt libc-2.21.so [.] _IO_fgets
5.28% phonebook_opt libc-2.21.so [.] _IO_getline_info
4.85% phonebook_opt libc-2.21.so [.] __strcpy_sse2_unaligned
3.48% phonebook_opt libc-2.21.so [.] malloc
2.31% phonebook_opt libc-2.21.so [.] __memcpy_sse2
1.52% phonebook_opt [kernel.kallsyms] [k] page_fault
1.29% phonebook_opt [kernel.kallsyms] [k] clear_page_c_e
1.16% phonebook_opt libc-2.21.so [.] memchr
1.09% phonebook_opt [kernel.kallsyms] [k] _raw_spin_lock
```
+ cache misses rate
+ 可以看到cache misses rate 比實驗一還要糟糕,應該是在`entry`裡又多了一個`hashVal` element,使得每次每一條cache line所能儲存的entry個數減少
```shell
Performance counter stats for './phonebook_opt' (5 runs):
132,277,864 cycles
236,245,171 instructions # 1.80 insns per cycle
143,696 cache-references
68,468 cache-misses # 44.361 % of all cache refs
5,047,187 bus-cycles
0.057255147 seconds time elapsed ( +- 9.29% )
```
+ 因為 hash table 只有1024個 entry,所以對於 BKDRHash 產生的hash value 會有許多的 collision,想建立一個 memory pool 是對於 hash table中在同一個 index 下的所有 free entry 會是相鄰位置的,目的想在traverse 某一格 hash table 的 single-linked list 時,減少 cache misses rate,同時間也想利用memory pool 來減少每次 `malloc()`產生的負擔,實驗進行在下個項目
### 針對hash table 中每個index 建立鄰近位置的memory pool
+ 新增 `initFreeEntryPool()` 和 `allocEmptyEntry()`
```clike
// create a memory pool
void initFreeEntryPool(unsigned int freeBufferNum)
{
int i,j;
entry *freeEntry = NULL;
for(i=0; i < MAX_HASH_TABLE_SIZE; i++) {
freeEntry = (entry *) malloc(sizeof(entry));
freePool[i] = freeEntry;
for(j=1; j < freeBufferNum; j++) {
if(freeEntry) {
freeEntry->pNext = (entry *) malloc(sizeof(entry));
freeEntry = freeEntry->pNext;
freeEntry->pNext = NULL;
}
}
}
}
//allocate a free entry
entry *allocEmptyEntry(unsigned int tableIndex)
{
entry *output = NULL;
if(tableIndex < MAX_HASH_TABLE_SIZE) {
output = freePool[tableIndex];
if(output)
freePool[tableIndex] = output->pNext;
}
return output;
}
```
+ cache misses rate
```shell
Performance counter stats for './phonebook_opt' (100 runs):
305,065 cache-misses # 67.835 % of all cache refs
431,262 cache-references
292,667,383 instructions # 1.46 insns per cycle
199,296,907 cycles
0.078124204 seconds time elapsed ( +- 0.24% )
```
+ 執行時間比較

+ perf top
```
41.23% phonebook_opt [.] allocEmptyEntry
20.12% phonebook_opt [.] main
11.57% phonebook_opt [.] BKDRHash
7.13% phonebook_opt [.] append
4.33% libc-2.21.so [.] _IO_fgets
3.25% libc-2.21.so [.] _IO_getline_info
2.21% [kernel] [k] page_fault
2.13% libc-2.21.so [.] __memcpy_sse2
1.45% libc-2.21.so [.] malloc
```
+ `perf record -e cache-refereces,cache-misses -F 10000 ./phonebook_opt`
```
# Samples: 628 of event 'cache-references'
# Event count (approx.): 496297
#
# Overhead Command Shared Object Symbol
# ........ ............. ................. ..................................
#
44.82% phonebook_opt libc-2.21.so [.] _IO_fgets
9.20% phonebook_opt phonebook_opt [.] allocEmptyEntry
6.15% phonebook_opt [kernel.kallsyms] [k] copy_user_enhanced_fast_string
6.15% phonebook_opt phonebook_opt [.] append
4.81% phonebook_opt phonebook_opt [.] main
1.87% phonebook_opt [kernel.kallsyms] [k] clear_page_c_e
1.72% phonebook_opt [kernel.kallsyms] [k] free_pages_and_swap_cache
1.37% phonebook_opt [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.35% phonebook_opt [kernel.kallsyms] [k] __rmqueue
1.26% phonebook_opt phonebook_opt [.] BKDRHash
1.25% phonebook_opt libc-2.21.so [.] _IO_getline_info
1.03% phonebook_opt [kernel.kallsyms] [k] page_remove_rmap
#
# Total Lost Samples: 0
#
# Samples: 742 of event 'cache-misses'
# Event count (approx.): 331210
#
# Overhead Command Shared Object Symbol
# ........ ............. ................. ...................................
#
61.73% phonebook_opt libc-2.21.so [.] _IO_fgets
7.07% phonebook_opt phonebook_opt [.] append
5.49% phonebook_opt [kernel.kallsyms] [k] copy_user_enhanced_fast_string
2.40% phonebook_opt [kernel.kallsyms] [k] clear_page_c_e
2.36% phonebook_opt phonebook_opt [.] allocEmptyEntry
2.15% phonebook_opt [kernel.kallsyms] [k] __rmqueue
1.54% phonebook_opt [kernel.kallsyms] [k] unmap_page_range
1.52% phonebook_opt [kernel.kallsyms] [k] ext4_discard_preallocations
1.44% phonebook_opt [kernel.kallsyms] [k] free_pcppages_bulk
1.35% phonebook_opt [kernel.kallsyms] [k] __mod_zone_page_state
0.97% phonebook_opt phonebook_opt [.] BKDRHash
```
+ 結果討論
+ 可以從 perf 後的結果發現新增的 `allocEmptyEntry()`反而耗費cpu很多執行時間
+ 如何解釋這現象?
+ 多建立了一個特地想讓每個 hash bucket所串接的entry都是相鄰位置,反而會造成每次拿 free entry時都會有cache miss
+ 所以就單純建立一個linked list,每次存取都會是從head拿新的 free entry
+ ~~感覺~~對於cache的使用狀況還是沒有完全了解
+ 做了一個簡短的[筆記](https://hackmd.io/GYdgHAzAbBCsDGBaATAIwKYBZGdVAhoqnhIhPAIzITrKazBgCcQA)
:::danger
不懂就不懂,不要假文青用「感覺」,理工科系的學生說話要精確。
快去讀書,並且把你的認知貼出來 --jserv
:::
> 是! 會誠實面對自己!