sysprog2017
contributed by <river85511
>
CPU:
$ lscpu
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
$ dmidecode -t cache
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.6 present.
Handle 0x0002, DMI type 7, 19 bytes
Cache Information
Socket Designation: L1-Cache
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Through
Location: Internal
Installed Size: 64 kB
Maximum Size: 64 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Data
Associativity: 8-way Set-associative
Handle 0x0003, DMI type 7, 19 bytes
Cache Information
Socket Designation: L2-Cache
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Write Through
Location: Internal
Installed Size: 256 kB
Maximum Size: 256 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Data
Associativity: 8-way Set-associative
Handle 0x0004, DMI type 7, 19 bytes
Cache Information
Socket Designation: L3-Cache
Configuration: Enabled, Not Socketed, Level 3
Operational Mode: Write Back
Location: Internal
Installed Size: 3072 kB
Maximum Size: 3072 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: 12-way Set-associative
L1 cache
的 Associativity
為 8-way Set-associative
L2 cache
的 Associativity
為 8-way Set-associative
L3 cache
的 Associativity
為 12-way Set-associative
系統版本:
$ uname -r
4.10.0-28-generic
安裝 perf
$ sudo apt-get install linux-tools-4.10.0-37-generic
$ sudo apt-get install linux-cloud-tools-4.10.0-37-generic
接著輸入:
$ perf top
出現以下畫面:
從此圖可以發現,我的系統的 perf_event_paranoid
的預設值為 3
。
若想將 perf_event_paranoid
的值改為 1
,可以輸入:
$ sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'
如果要檢測 cache miss event ,需要先取消 kernel pointer 的禁用。
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
若想知道 perf
可以監測的效能種類:
$ perf list
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-stores [Hardware cache event]
branch-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
dTLB-store-misses [Hardware cache event]
dTLB-stores [Hardware cache event]
iTLB-load-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
這邊只約略列出
安裝 gnuplot
$ sudo apt-get install gnuplot
為了要看 gnuplot 畫出來的圖,得安裝 eog
$ sudo apt-get install eog
由於避免將整個版面拖的太長,將內容在轉到另外一篇筆記上。
首先是 phonebook_orig.h
#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H
#define MAX_LAST_NAME_SIZE 16
/* original version */
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
char firstName[16];
char email[16];
char phone[10];
char cell[10];
char addr1[16];
char addr2[16];
char city[16];
char state[2];
char zip[5];
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);
#endif
可以發現 entry
:
大小為 136 bytes
123 * 1 (char array) + 4 * 1 (padding) + 8 * 1 (pointer) = 136 bytes
結構為 linked-list
接著是 phonebook_orig.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "phonebook_orig.h"
/* original version */
entry *findName(char lastName[], entry *pHead)
{
while (pHead != NULL) {
if (strcasecmp(lastName, pHead->lastName) == 0)
return pHead;
pHead = pHead->pNext;
}
return NULL;
}
entry *append(char lastName[], entry *e)
{
/* allocate memory for the new entry and put lastName */
e->pNext = (entry *) malloc(sizeof(entry));
e = e->pNext;
strcpy(e->lastName, lastName);
e->pNext = NULL;
return e;
}
可以得知:
findname()
是 linked-list
的 search
的實作。append()
是 linked-list
的 insert
的實作。接著透過 perf
來分析原始程式碼的效能。首先先到 phonebook
的路徑底下
$ make
cc -Wall -std=gnu99 -O0 \
-DIMPL="\"phonebook_orig.h\"" -o phonebook_orig \
main.c phonebook_orig.c
cc -Wall -std=gnu99 -O0 \
-DIMPL="\"phonebook_opt.h\"" -o phonebook_opt \
main.c phonebook_opt.c
在用 perf
測試之前,先將 cache 裡的資料清除
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
接著,正式開始測試。
$ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_orig
Performance counter stats for './phonebook_orig' (100 runs):
135,4197 cache-misses # 64.058 % of all cache refs ( +- 1.32% ) (54.08%)
211,4019 cache-references ( +- 0.67% ) (55.73%)
2,6831,3537 instructions # 1.08 insn per cycle ( +- 0.28% ) (71.16%)
2,4873,8327 cycles ( +- 0.88% ) (72.97%)
264,7532 L1-dcache-load-misses ( +- 0.33% ) (75.54%)
324,3224 L1-dcache-prefetch-misses ( +- 0.89% ) (74.57%)
100,5760 L1-dcache-store-misses ( +- 0.86% ) (70.65%)
0.084016902 seconds time elapsed ( +- 1.13% )
發現到
cache-misses
的次數高達 135,4197
次L1-dcache-load-misses
的次數為 264,7532
次L1-dcache-prefetch-misses
的次數為 324,3224
次L1-dcache-store-misses
的次數為 100,5760
次很明顯的,若想要提高這個程式的效能,就必須減少 cache-miss
的次數。
已知條件:
CPU
的 L1d cache
的大小為 32K
。L1 cache
的 Associativity
為 8-way Set-associative
。entry
這個結構體的大小為 136 bytes
。思考:
L1d Cache
的大小有限,因此若無法將所有的資料都儲存在 L1d cache
中,自然就會發生 L1-dcache-load-misses
。
./dictionary/words.txt
中的資料共有 349900 筆。L1d Cache
能夠儲存 entry
的筆數為 entry
的大小,則代表 L1d cache
裡可以儲存更多筆 entry
的資料,即能降低 L1-dcache-load-misses
的次數。作法:
lastName
的資訊,因此,為了驗證將 entry
的大小減少,可以減少 cache-misses
的次數,我將 entry
裡除了 lastName
以外的資訊用另一個 entryDetail
的 struct 來表示,以縮小 entry
的大小。$ cat phonebook_opt.h
#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H
#define MAX_LAST_NAME_SIZE 16
/* TODO: After modifying the original version, uncomment the following
* line to set OPT properly */
#define OPT 1
typedef struct __PHONE_BOOK_ENTRY_DETAIL {
char firstName[16];
char email[16];
char phone[10];
char cell[10];
char addr1[16];
char addr2[16];
char city[16];
char state[2];
char zip[5];
} entryDetail;
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
entryDetail* detail;
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);
#endif
至於 phonebook_opt.c
則維持與 phonebook_orig.c
相同。接著,開始測試。
預期結果:
entry
的大小從 136 bytes
縮減為 32 bytes
,因此 L1d Cache
能夠儲存 entry
的筆數為 cache-misses
的次數。實驗結果:
$ perf stat -r 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses ./phonebook_opt
Performance counter stats for './phonebook_opt' (100 runs):
28,2969 cache-misses # 42.799 % of all cache refs ( +- 2.40% ) (52.79%)
66,1154 cache-references ( +- 2.04% ) (54.33%)
2,4136,2527 instructions # 1.50 insn per cycle ( +- 0.36% ) (70.09%)
1,6113,2612 cycles ( +- 1.29% ) (72.63%)
118,7456 L1-dcache-load-misses ( +- 2.22% ) (77.39%)
220,9291 L1-dcache-prefetch-misses ( +- 2.78% ) (77.45%)
36,1261 L1-dcache-store-misses ( +- 2.34% ) (71.55%)
0.054441195 seconds time elapsed ( +- 1.36% )
實驗結果觀察:
cache-misses
的次數從 135,4197
下降到 28,2969
L1-dcache-load-misses
的次數從 264,7532
下降到 118,7456
L1-dcache-prefetch-misses
的次數從 324,3224
下降到 220,9291
L1-dcache-store-misss
的次數從 100,5760
下降到 36,1261
append()
的花費時間從 0.056262
秒 下降到 0.042920
秒。findName()
的花費時間從 0.007348
秒 下降到 0.003211
秒。