2017q3 Homework1 (phonebook)

tags: `sysprog2017`

contributed by <river85511>

作業要求

開發環境

CPU：

$ lscpu

L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K

$ dmidecode -t cache

# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.6 present.

Handle 0x0002, DMI type 7, 19 bytes
Cache Information
	Socket Designation: L1-Cache
	Configuration: Enabled, Not Socketed, Level 1
	Operational Mode: Write Through
	Location: Internal
	Installed Size: 64 kB
	Maximum Size: 64 kB
	Supported SRAM Types:
		Synchronous
	Installed SRAM Type: Synchronous
	Speed: Unknown
	Error Correction Type: Single-bit ECC
	System Type: Data
	Associativity: 8-way Set-associative

Handle 0x0003, DMI type 7, 19 bytes
Cache Information
	Socket Designation: L2-Cache
	Configuration: Enabled, Not Socketed, Level 2
	Operational Mode: Write Through
	Location: Internal
	Installed Size: 256 kB
	Maximum Size: 256 kB
	Supported SRAM Types:
		Synchronous
	Installed SRAM Type: Synchronous
	Speed: Unknown
	Error Correction Type: Single-bit ECC
	System Type: Data
	Associativity: 8-way Set-associative

Handle 0x0004, DMI type 7, 19 bytes
Cache Information
	Socket Designation: L3-Cache
	Configuration: Enabled, Not Socketed, Level 3
	Operational Mode: Write Back
	Location: Internal
	Installed Size: 3072 kB
	Maximum Size: 3072 kB
	Supported SRAM Types:
		Synchronous
	Installed SRAM Type: Synchronous
	Speed: Unknown
	Error Correction Type: Single-bit ECC
	System Type: Unified
	Associativity: 12-way Set-associative

可以得知
- L1 cache 的 Associativity 為 8-way Set-associative
- L2 cache 的 Associativity 為 8-way Set-associative
- L3 cache 的 Associativity 為 12-way Set-associative

系統版本：

$ uname -r

4.10.0-28-generic

Perf 配置

安裝 perf

$ sudo apt-get install linux-tools-4.10.0-37-generic 
$ sudo apt-get install linux-cloud-tools-4.10.0-37-generic

接著輸入：

$ perf top

出現以下畫面：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

從此圖可以發現，我的系統的 perf_event_paranoid 的預設值為 3。
若想將 perf_event_paranoid 的值改為 1，可以輸入：

$ sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'

如果要檢測 cache miss event ，需要先取消 kernel pointer 的禁用。
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"

若想知道 perf 可以監測的效能種類：

$ perf list

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]

這邊只約略列出

gnuplot 配置

安裝 gnuplot

$ sudo apt-get install gnuplot

為了要看 gnuplot 畫出來的圖，得安裝 eog

$ sudo apt-get install eog

CPU Cache 原理探討

由於避免將整個版面拖的太長，將內容在轉到另外一篇筆記上。

案例分析

原始程式碼

首先是 phonebook_orig.h

#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H

#define MAX_LAST_NAME_SIZE 16

/* original version */
typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);

#endif

可以發現 entry ：

大小為 136 bytes
- 123 * 1 (char array) + 4 * 1 (padding) + 8 * 1 (pointer) = 136 bytes
- 關於 padding，可以參考 Data Alignment 和 The Lost Art of C Structure Packing
結構為 linked-list

接著是 phonebook_orig.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include "phonebook_orig.h"

/* original version */
entry *findName(char lastName[], entry *pHead)
{
    while (pHead != NULL) {
        if (strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}

entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;

    return e;
}

可以得知：

findname() 是 linked-list 的 search 的實作。
append() 是 linked-list 的 insert 的實作。

接著透過 perf 來分析原始程式碼的效能。首先先到 phonebook 的路徑底下

$ make

cc -Wall -std=gnu99 -O0 \
	-DIMPL="\"phonebook_orig.h\"" -o phonebook_orig \
	main.c phonebook_orig.c
cc -Wall -std=gnu99 -O0 \
	-DIMPL="\"phonebook_opt.h\"" -o phonebook_opt \
	main.c phonebook_opt.c

在用 perf 測試之前，先將 cache 裡的資料清除

$ echo 3 | sudo tee /proc/sys/vm/drop_caches

接著，正式開始測試。

$ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_orig

 Performance counter stats for './phonebook_orig' (100 runs):

          135,4197      cache-misses              #   64.058 % of all cache refs      ( +-  1.32% )  (54.08%)
          211,4019      cache-references                                              ( +-  0.67% )  (55.73%)
       2,6831,3537      instructions              #    1.08  insn per cycle           ( +-  0.28% )  (71.16%)
       2,4873,8327      cycles                                                        ( +-  0.88% )  (72.97%)
          264,7532      L1-dcache-load-misses                                         ( +-  0.33% )  (75.54%)
          324,3224      L1-dcache-prefetch-misses                                     ( +-  0.89% )  (74.57%)
          100,5760      L1-dcache-store-misses                                        ( +-  0.86% )  (70.65%)

       0.084016902 seconds time elapsed                                          ( +-  1.13% )

發現到

cache-misses 的次數高達 135,4197 次
L1-dcache-load-misses 的次數為 264,7532 次
L1-dcache-prefetch-misses 的次數為 324,3224 次
L1-dcache-store-misses 的次數為 100,5760 次

很明顯的，若想要提高這個程式的效能，就必須減少 cache-miss 的次數。

優化程式碼

實驗一

已知條件：

CPU 的 L1d cache 的大小為 32K。
L1 cache 的 Associativity 為 8-way Set-associative。
entry 這個結構體的大小為 136 bytes。

思考：

由於 L1d Cache 的大小有限，因此若無法將所有的資料都儲存在 L1d cache 中，自然就會發生 L1-dcache-load-misses。
- 在 ./dictionary/words.txt 中的資料共有 349900 筆。
- L1d Cache 能夠儲存 entry 的筆數為
  $(32 * 1024) / 136 = 240.9411$ 筆
因此，若減小 entry 的大小，則代表 L1d cache 裡可以儲存更多筆 entry 的資料，即能降低 L1-dcache-load-misses 的次數。

作法：

首先每次在搜尋時，只需要用到 lastName 的資訊，因此，為了驗證將 entry 的大小減少，可以減少 cache-misses 的次數，我將 entry 裡除了 lastName 以外的資訊用另一個 entryDetail 的 struct 來表示，以縮小 entry 的大小。

$ cat phonebook_opt.h

#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H

#define MAX_LAST_NAME_SIZE 16

/* TODO: After modifying the original version, uncomment the following
 * line to set OPT properly */
#define OPT 1
typedef struct __PHONE_BOOK_ENTRY_DETAIL {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} entryDetail;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    entryDetail* detail;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;
entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);

#endif

至於 phonebook_opt.c 則維持與 phonebook_orig.c 相同。接著，開始測試。

預期結果：

由於 entry 的大小從 136 bytes 縮減為 32 bytes，因此 L1d Cache 能夠儲存 entry 的筆數為
$(32 * 1024) / 32 = 1024$ 筆，約為
$240.9411$ 的
$4.2$ 倍。
因此預期能夠大幅減少 cache-misses 的次數。

實驗結果：

$ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_opt

 Performance counter stats for './phonebook_opt' (100 runs):

           28,2969      cache-misses              #   42.799 % of all cache refs      ( +-  2.40% )  (52.79%)
           66,1154      cache-references                                              ( +-  2.04% )  (54.33%)
       2,4136,2527      instructions              #    1.50  insn per cycle           ( +-  0.36% )  (70.09%)
       1,6113,2612      cycles                                                        ( +-  1.29% )  (72.63%)
          118,7456      L1-dcache-load-misses                                         ( +-  2.22% )  (77.39%)
          220,9291      L1-dcache-prefetch-misses                                     ( +-  2.78% )  (77.45%)
           36,1261      L1-dcache-store-misses                                        ( +-  2.34% )  (71.55%)

       0.054441195 seconds time elapsed                                          ( +-  1.36% )

gnuplot_test1

實驗結果觀察：

cache-misses 的次數從 135,4197 下降到 28,2969
L1-dcache-load-misses 的次數從 264,7532 下降到 118,7456
L1-dcache-prefetch-misses 的次數從 324,3224 下降到 220,9291
L1-dcache-store-misss的次數從 100,5760 下降到 36,1261
append() 的花費時間從 0.056262 秒下降到 0.042920 秒。
findName() 的花費時間從 0.007348 秒下降到 0.003211 秒。