本文旨在記錄 Computer Systems: A Programmer's Perspective (CS:APP) 一書第六章閱讀心得,該書是 CMU 的計算機系統概論的教材 (難度相當於台灣的大學高年級),該書的簡體中文翻譯名稱為《深入理解計算機系統》。
CS:APP 亦是 Linux Kernel Internals 2024 Spring 課程指定教材,並一同收錄於
Linux Kernel Internals 2024 Spring Collections。
在本章之前,CS:APP 專注討論 CPU 計算架構,如指令的執行、暫存器的存取等,並將記憶體簡化為一簡單線性的矩陣,但顯然與現代系統運作的方式相去甚遠;如同讀者所知,現代記憶體系統是一層級化 (hierarchical) 的結構,如 L1 L2 L3 以及主記憶體等,而每層有其取用的成本以及限制。
作為程式開發人員,了解記憶體對系統性能的衝擊是很必要的,因取用 register 僅需 0 cycle、cache 從 4 到 75 cycles 不等,而存取主記憶體可能需要上百個 cycles。
本章會聚焦 CPU 與主記憶體的互動,因為其是最頻繁存取也最耗費 cycles 的裝置之一,並且討論如何分析、改進 C programs 的 cache locality 等。
CPU 不與 memory 直接互動,而是透過 Cache Line;Cache Line 在 64-bits 系統中以 8 bytes 為單位,舉例而言
Cache Line 1: 0xffff888002a3aa00 - 0xffff888002a3aa3f
Cache Line 2: 0xffff888002a3aa40 - 0xffff888002a3aa7f
注意此處是記憶體位址層面 (memory layout) ,每一單位可以儲存 1 byte,故每一 Cache Line 可以儲存 64 bytes。
也就是說,在 GDB 中,以 x/8x
檢視 8 words 能夠看到
(gdb) x/8x 0xffff888002a3aa00
0xffff888002a3aa00: 0x0000000000004000 0x0000000000000000
0xffff888002a3aa10: 0x0000000100000000 0x0000000000000001
0xffff888002a3aa20: 0xffffc900000b8000 0x0420804000000002
0xffff888002a3aa30: 0x0000000000000000 0x0000000000000000
就是一 Cache Line;故此處將引入 False Sharing 之議題:
假設兩個核心存取相鄰的不同物件,並位於同一 Cache Line 上,將導致False Sharing;雖然對 Cache Line 上的不同物件讀寫操作,但可能導致 OS 不斷對該區段做更新或標註,導致頻繁的 cache 更新。
解決方法就是 Cache Alignment 或 Cache Line Padding。
嘗試實驗看看是否能夠復現 False Sharing,我們認為以下程式碼應該能夠觸發 False Sharing:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <fcntl.h>
#define NUM_INCREMENTS 10000000
#define CACHE_LINE_SIZE 64
typedef struct {
int counter1;
uint64_t i; // Padding to separate cache lines
int counter2;
} SharedCounters;
int main() {
SharedCounters *counters = mmap(NULL, sizeof(SharedCounters), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (counters == MAP_FAILED) {
perror("mmap");
exit(1);
}
counters->counter1 = 0;
counters->counter2 = 0;
pid_t pid = fork();
if (pid == -1) {
perror("fork");
exit(1);
}
if (pid == 0) { // Child process
for (int i = 0; i < NUM_INCREMENTS; ++i) {
counters->counter1++;
}
exit(0);
} else { // Parent process
for (int i = 0; i < NUM_INCREMENTS; ++i) {
counters->counter2++;
}
wait(NULL);
}
printf("counter1: %d, counter2: %d\n", counters->counter1, counters->counter2);
return 0;
}
透過 gdb 檢查該結構成員之記憶體位置,應落在同一 Cache Line
(gdb) p &counters->counter1
$3 = (int *) 0x555555558018 <counters>
(gdb) p &counters->counter2
$4 = (int *) 0x55555555801c <counters+4>
使用 padding 方法:
typedef struct {
int counter1;
char padding[CACHE_LINE_SIZE - sizeof(int)];
int counter2;
} SharedCounters;
在 Linux Kernel 的文檔中有專以 False Sharing 探討的文章,其中最重視的部分為 lock 之 False Sharing 所造成的效應;並推薦使用 perf-c2c
的方式檢測 False Sharing。
$ perf c2c record -F 60000 -a ./no_pad
$ perf c2c report -NN -c pid,iaddr --full-symbols --stdio
=================================================
Trace Event Information
=================================================
Total records : 108
Locked Load/Store Operations : 2
Load Operations : 34
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 0
Loads - no mapping : 1
Load Fill Buffer Hit : 9
Load L1D hit : 6
Load L2D hit : 0
Load LLC hit : 14
Load Local HITM : 0
Load Remote HITM : 0
Load Remote HIT : 0
Load Local DRAM : 4
Load Remote DRAM : 0
Load MESI State Exclusive : 0
Load MESI State Shared : 4
Load LLC Misses : 4
Load access blocked by data : 0
Load access blocked by address : 0
LLC Misses to Local DRAM : 100.0%
LLC Misses to Remote DRAM : 0.0%
LLC Misses to Remote cache (HIT) : 0.0%
LLC Misses to Remote cache (HITM) : 0.0%
Store Operations : 74
Store - uncacheable : 0
Store - no mapping : 0
Store L1D Hit : 74
Store L1D Miss : 0
No Page Map Rejects : 35
Unable to parse data source : 0
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 0
Load HITs on shared lines : 0
Fill Buffer Hits on shared lines : 0
L1D hits on shared lines : 0
L2D hits on shared lines : 0
LLC hits on shared lines : 0
Locked Access on shared lines : 0
Blocked Access on shared lines : 0
Store HITs on shared lines : 0
Store L1D hits on shared lines : 0
Total Merged records : 0
=================================================
c2c details
=================================================
Events : cpu/mem-loads,ldlat=30/P
: cpu/mem-stores/P
: dummy:HG
Cachelines sort on : Total HITMs
Cacheline data grouping : offset,pid,iaddr
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total ---- Stores ---- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- L>
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss FB L1 L2 LclHit LclHitm RmtHit RmtHitm >
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... .....>
#
=================================================
Shared Cache Line Distribution Pareto
=================================================
#
$ perf c2c record -F 60000 -a ./pad
$ perf c2c report -NN -c pid,iaddr --full-symbols --stdio
=================================================
Trace Event Information
=================================================
Total records : 109
Locked Load/Store Operations : 3
Load Operations : 34
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 0
Loads - no mapping : 2
Load Fill Buffer Hit : 10
Load L1D hit : 12
Load L2D hit : 0
Load LLC hit : 7
Load Local HITM : 0
Load Remote HITM : 0
Load Remote HIT : 0
Load Local DRAM : 3
Load Remote DRAM : 0
Load MESI State Exclusive : 0
Load MESI State Shared : 3
Load LLC Misses : 3
Load access blocked by data : 0
Load access blocked by address : 0
LLC Misses to Local DRAM : 100.0%
LLC Misses to Remote DRAM : 0.0%
LLC Misses to Remote cache (HIT) : 0.0%
LLC Misses to Remote cache (HITM) : 0.0%
Store Operations : 75
Store - uncacheable : 0
Store - no mapping : 0
Store L1D Hit : 75
Store L1D Miss : 0
No Page Map Rejects : 35
Unable to parse data source : 0
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 0
Load HITs on shared lines : 0
Fill Buffer Hits on shared lines : 0
L1D hits on shared lines : 0
L2D hits on shared lines : 0
LLC hits on shared lines : 0
Locked Access on shared lines : 0
Blocked Access on shared lines : 0
Store HITs on shared lines : 0
Store L1D hits on shared lines : 0
Total Merged records : 0
=================================================
c2c details
=================================================
Events : cpu/mem-loads,ldlat=30/P
: cpu/mem-stores/P
: dummy:HG
Cachelines sort on : Total HITMs
Cacheline data grouping : offset,pid,iaddr
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total ---- Stores ---- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- L>
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss FB L1 L2 LclHit LclHitm RmtHit RmtHitm >
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... .....>
#
=================================================
Shared Cache Line Distribution Pareto
=================================================
#
Store Operations : 120
Store - uncacheable : 0
Store - no mapping : 0
Store L1D Hit : 119
Store L1D Miss : 1
No Page Map Rejects : 39
Unable to parse data source : 0
可以看到是否 pad 幾乎沒有差別;實驗待驗證。
本文
Apr 4, 2025This is a note to record my understanding of EEVDF, for the writing of 《Demystifying The Linux CPU Scheduler》
Aug 8, 2024"*Demystifying the Linux CPU Scheduler*" 是 [Linux 核心設計/實作](https://wiki.csie.ncku.edu.tw/linux/schedule) 之課程教材,由授課老師黃敬群 (Jserv) 以英文編撰,以 CPU 排程器為主題,約 260 頁,仍未公開出版,更多資訊請關注課程行事曆或其粉絲專頁 [Jserv與他愉快的小夥伴 ](https://www.facebook.com/JservFans)。 撰寫本篇文章是起於 Linux 核心設計/實作 課程之[期末專題](),做為前置準備。
Aug 4, 2024本文旨在記錄 Computer Systems: A Programmer's Perspective (CS:APP) 一書第七章閱讀心得,該書是 CMU 的計算機系統概論的教材 (難度相當於台灣的大學高年級),該書的簡體中文翻譯名稱為《深入理解計算機系統》。
Jul 13, 2024or
By clicking below, you agree to our terms of service.
New to HackMD? Sign up