Try   HackMD

2017q3 Homework1 (phonebook)

tags: sysprog2017

contributed by <river85511>


作業要求

開發環境

CPU:

$ lscpu

L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
$ dmidecode -t cache

# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.6 present.

Handle 0x0002, DMI type 7, 19 bytes
Cache Information
	Socket Designation: L1-Cache
	Configuration: Enabled, Not Socketed, Level 1
	Operational Mode: Write Through
	Location: Internal
	Installed Size: 64 kB
	Maximum Size: 64 kB
	Supported SRAM Types:
		Synchronous
	Installed SRAM Type: Synchronous
	Speed: Unknown
	Error Correction Type: Single-bit ECC
	System Type: Data
	Associativity: 8-way Set-associative

Handle 0x0003, DMI type 7, 19 bytes
Cache Information
	Socket Designation: L2-Cache
	Configuration: Enabled, Not Socketed, Level 2
	Operational Mode: Write Through
	Location: Internal
	Installed Size: 256 kB
	Maximum Size: 256 kB
	Supported SRAM Types:
		Synchronous
	Installed SRAM Type: Synchronous
	Speed: Unknown
	Error Correction Type: Single-bit ECC
	System Type: Data
	Associativity: 8-way Set-associative

Handle 0x0004, DMI type 7, 19 bytes
Cache Information
	Socket Designation: L3-Cache
	Configuration: Enabled, Not Socketed, Level 3
	Operational Mode: Write Back
	Location: Internal
	Installed Size: 3072 kB
	Maximum Size: 3072 kB
	Supported SRAM Types:
		Synchronous
	Installed SRAM Type: Synchronous
	Speed: Unknown
	Error Correction Type: Single-bit ECC
	System Type: Unified
	Associativity: 12-way Set-associative

  • 可以得知
    • L1 cacheAssociativity8-way Set-associative
    • L2 cacheAssociativity8-way Set-associative
    • L3 cacheAssociativity12-way Set-associative

系統版本:

$ uname -r

4.10.0-28-generic

Perf 配置

安裝 perf

$ sudo apt-get install linux-tools-4.10.0-37-generic 
$ sudo apt-get install linux-cloud-tools-4.10.0-37-generic

接著輸入:

$ perf top

出現以下畫面:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

從此圖可以發現,我的系統的 perf_event_paranoid 的預設值為 3
若想將 perf_event_paranoid 的值改為 1,可以輸入:

$ sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'

如果要檢測 cache miss event ,需要先取消 kernel pointer 的禁用。
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"

若想知道 perf 可以監測的效能種類:

$ perf list

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]

這邊只約略列出

gnuplot 配置

安裝 gnuplot

$ sudo apt-get install gnuplot

為了要看 gnuplot 畫出來的圖,得安裝 eog

$ sudo apt-get install eog

CPU Cache 原理探討

由於避免將整個版面拖的太長,將內容在轉到另外一篇筆記上。

案例分析

原始程式碼

首先是 phonebook_orig.h

#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H

#define MAX_LAST_NAME_SIZE 16

/* original version */
typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);

#endif

可以發現 entry

接著是 phonebook_orig.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include "phonebook_orig.h"

/* original version */
entry *findName(char lastName[], entry *pHead)
{
    while (pHead != NULL) {
        if (strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}

entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;

    return e;
}

可以得知:

  • findname()linked-listsearch 的實作。
  • append()linked-listinsert 的實作。

接著透過 perf 來分析原始程式碼的效能。首先先到 phonebook 的路徑底下

$ make

cc -Wall -std=gnu99 -O0 \
	-DIMPL="\"phonebook_orig.h\"" -o phonebook_orig \
	main.c phonebook_orig.c
cc -Wall -std=gnu99 -O0 \
	-DIMPL="\"phonebook_opt.h\"" -o phonebook_opt \
	main.c phonebook_opt.c

在用 perf 測試之前,先將 cache 裡的資料清除

$ echo 3 | sudo tee /proc/sys/vm/drop_caches

接著,正式開始測試。

$ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_orig
 Performance counter stats for './phonebook_orig' (100 runs):

          135,4197      cache-misses              #   64.058 % of all cache refs      ( +-  1.32% )  (54.08%)
          211,4019      cache-references                                              ( +-  0.67% )  (55.73%)
       2,6831,3537      instructions              #    1.08  insn per cycle           ( +-  0.28% )  (71.16%)
       2,4873,8327      cycles                                                        ( +-  0.88% )  (72.97%)
          264,7532      L1-dcache-load-misses                                         ( +-  0.33% )  (75.54%)
          324,3224      L1-dcache-prefetch-misses                                     ( +-  0.89% )  (74.57%)
          100,5760      L1-dcache-store-misses                                        ( +-  0.86% )  (70.65%)

       0.084016902 seconds time elapsed                                          ( +-  1.13% )

發現到

  • cache-misses 的次數高達 135,4197
  • L1-dcache-load-misses 的次數為 264,7532
  • L1-dcache-prefetch-misses 的次數為 324,3224
  • L1-dcache-store-misses 的次數為 100,5760

很明顯的,若想要提高這個程式的效能,就必須減少 cache-miss 的次數。

優化程式碼

實驗一

已知條件:

  • CPUL1d cache 的大小為 32K
  • L1 cacheAssociativity8-way Set-associative
  • entry 這個結構體的大小為 136 bytes

思考:

  • 由於 L1d Cache 的大小有限,因此若無法將所有的資料都儲存在 L1d cache 中,自然就會發生 L1-dcache-load-misses
    • ./dictionary/words.txt 中的資料共有 349900 筆。
    • L1d Cache 能夠儲存 entry 的筆數為
      (321024)/136=240.9411
  • 因此,若減小 entry 的大小,則代表 L1d cache 裡可以儲存更多筆 entry 的資料,即能降低 L1-dcache-load-misses 的次數。

作法:

  • 首先每次在搜尋時,只需要用到 lastName 的資訊,因此,為了驗證將 entry 的大小減少,可以減少 cache-misses 的次數,我將 entry 裡除了 lastName 以外的資訊用另一個 entryDetail 的 struct 來表示,以縮小 entry 的大小。
$ cat phonebook_opt.h

#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H

#define MAX_LAST_NAME_SIZE 16

/* TODO: After modifying the original version, uncomment the following
 * line to set OPT properly */
#define OPT 1
typedef struct __PHONE_BOOK_ENTRY_DETAIL {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} entryDetail;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    entryDetail* detail;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;
entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);

#endif

至於 phonebook_opt.c 則維持與 phonebook_orig.c 相同。接著,開始測試。

預期結果:

  • 由於 entry 的大小從 136 bytes 縮減為 32 bytes,因此 L1d Cache 能夠儲存 entry 的筆數為
    (321024)/32=1024
    筆,約為
    240.9411
    4.2
    倍。
  • 因此預期能夠大幅減少 cache-misses 的次數。

實驗結果:

$ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_opt
 Performance counter stats for './phonebook_opt' (100 runs):

           28,2969      cache-misses              #   42.799 % of all cache refs      ( +-  2.40% )  (52.79%)
           66,1154      cache-references                                              ( +-  2.04% )  (54.33%)
       2,4136,2527      instructions              #    1.50  insn per cycle           ( +-  0.36% )  (70.09%)
       1,6113,2612      cycles                                                        ( +-  1.29% )  (72.63%)
          118,7456      L1-dcache-load-misses                                         ( +-  2.22% )  (77.39%)
          220,9291      L1-dcache-prefetch-misses                                     ( +-  2.78% )  (77.45%)
           36,1261      L1-dcache-store-misses                                        ( +-  2.34% )  (71.55%)

       0.054441195 seconds time elapsed                                          ( +-  1.36% )

gnuplot_test1

實驗結果觀察:

  • cache-misses 的次數從 135,4197 下降到 28,2969
  • L1-dcache-load-misses 的次數從 264,7532 下降到 118,7456
  • L1-dcache-prefetch-misses 的次數從 324,3224 下降到 220,9291
  • L1-dcache-store-misss的次數從 100,5760 下降到 36,1261
  • append() 的花費時間從 0.056262 秒 下降到 0.042920 秒。
  • findName() 的花費時間從 0.007348 秒 下降到 0.003211 秒。

實驗二