# Chapter 6: The Memory Hierarchy (Reading Notes)

These notes record my reading of Chapter 6 of *Computer Systems: A Programmer's Perspective* (CS:APP), the textbook of CMU's introductory computer-systems course (roughly upper-division undergraduate level in Taiwan). The book's Simplified Chinese translation is titled 《深入理解計算機系統》.

CS:APP is also the assigned textbook of the [Linux Kernel Internals 2024 Spring](https://wiki.csie.ncku.edu.tw/linux/schedule) course, and these notes are collected in the [Linux Kernel Internals 2024 Spring Collections](https://hackmd.io/@Kuanch/linux2024-collection).

---

Up to this chapter, CS:APP concentrates on the CPU side of the machine, such as instruction execution and register access, while modeling memory as a simple flat linear array. That model is far from how modern systems actually work: as the reader likely knows, a modern memory system is a hierarchical structure (L1, L2, L3 caches plus main memory), and each level has its own access cost and capacity constraints.

For programmers, understanding memory's impact on system performance is essential: reading a register costs effectively 0 cycles, a cache hit costs anywhere from about 4 to 75 cycles depending on the level, and an access to main memory can cost hundreds of cycles.

This chapter focuses on the interaction between the CPU and main memory, one of the most frequently accessed and most cycle-hungry parts of the system, and discusses how to analyze and improve the cache locality of C programs.

## 6.1 Storage Technologies

### 6.1.1 Random Access Memory

### 6.1.2 Disk Storage

## 6.2 Locality

### 6.2.1 Locality of References to Program Data

### 6.2.2 Locality of Instruction Fetches

## 6.3 The Memory Hierarchy

### 6.3.1 Caching in Memory Hierarchy

### 6.3.2 Summary of Memory Hierarchy Concepts

## 6.4 Cache Memories

### 6.4.1 Generic Cache Memory Organization

### 6.4.2 Direct Mapped Caches

### 6.4.3 Set Associative Caches

## 6.5 Writing Cache-Friendly Code

## Appendix

### Cache Lines and Cache Alignment

The CPU does not exchange individual bytes with memory; data moves between the caches and memory in whole cache lines. On typical 64-bit x86 systems a cache line is 64 bytes, for example:

```
Cache Line 1: 0xffff888002a3aa00 - 0xffff888002a3aa3f
Cache Line 2: 0xffff888002a3aa40 - 0xffff888002a3aa7f
```

Note that these ranges describe the memory layout: each address names 1 byte, so each cache line covers 64 bytes.

In other words, dumping those 64 bytes in GDB as eight 8-byte words

```
(gdb) x/8x 0xffff888002a3aa00
0xffff888002a3aa00:     0x0000000000004000      0x0000000000000000
0xffff888002a3aa10:     0x0000000100000000      0x0000000000000001
0xffff888002a3aa20:     0xffffc900000b8000      0x0420804000000002
0xffff888002a3aa30:     0x0000000000000000      0x0000000000000000
```

shows exactly one cache line. This brings us to the issue of **False Sharing**: suppose two cores access two different objects that happen to sit next to each other on the same cache line. Even though the cores read and write *different* objects, the cache coherence protocol tracks ownership at cache-line granularity, so each core's writes repeatedly invalidate the whole line in the other core's cache, causing constant cache-line traffic. The fix is **Cache Alignment** or **Cache Line Padding**.

To see whether false sharing can be reproduced, we expected the following program to trigger it:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <fcntl.h>

#define NUM_INCREMENTS 10000000
#define CACHE_LINE_SIZE 64

typedef struct {
    int counter1;
    uint64_t i;   // only 8 bytes of separation: counter2 still lands on the same 64-byte line
    int counter2;
} SharedCounters;

int main() {
    SharedCounters *counters = mmap(NULL, sizeof(SharedCounters),
                                    PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counters == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    counters->counter1 = 0;
    counters->counter2 = 0;

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        exit(1);
    }

    if (pid == 0) {
        // Child process
        for (int i = 0; i < NUM_INCREMENTS; ++i) {
            counters->counter1++;
        }
        exit(0);
    } else {
        // Parent process
        for (int i = 0; i < NUM_INCREMENTS; ++i) {
            counters->counter2++;
        }
        wait(NULL);
    }

    printf("counter1: %d, counter2: %d\n", counters->counter1, counters->counter2);
    return 0;
}
```

Checking the struct members' addresses in GDB, they should fall on the same cache line:

```
(gdb) p &counters->counter1
$3 = (int *) 0x555555558018 <counters>
(gdb) p &counters->counter2
$4 = (int *) 0x55555555801c <counters+4>
```

The padded variant of the struct:

```c
typedef struct {
    int counter1;
    char padding[CACHE_LINE_SIZE - sizeof(int)];
    int counter2;
} SharedCounters;
```

The Linux kernel documentation has an [article](https://github.com/torvalds/linux/blob/855684c7d938c2442f07eabc154e7532b4c1fbf9/Documentation/kernel-hacking/false-sharing.rst) dedicated to false sharing; its main concern is the effect of false sharing on locks, and it recommends detecting false sharing with `perf c2c`.

```bash
$ perf c2c record -F 60000 -a ./no_pad
$ perf c2c report -NN -c pid,iaddr --full-symbols --stdio
=================================================
            Trace Event Information              
=================================================
  Total records                     :        108
  Locked Load/Store Operations      :          2
  Load Operations                   :         34
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          0
  Loads - no mapping                :          1
Load Fill Buffer Hit : 9 Load L1D hit : 6 Load L2D hit : 0 Load LLC hit : 14 Load Local HITM : 0 Load Remote HITM : 0 Load Remote HIT : 0 Load Local DRAM : 4 Load Remote DRAM : 0 Load MESI State Exclusive : 0 Load MESI State Shared : 4 Load LLC Misses : 4 Load access blocked by data : 0 Load access blocked by address : 0 LLC Misses to Local DRAM : 100.0% LLC Misses to Remote DRAM : 0.0% LLC Misses to Remote cache (HIT) : 0.0% LLC Misses to Remote cache (HITM) : 0.0% Store Operations : 74 Store - uncacheable : 0 Store - no mapping : 0 Store L1D Hit : 74 Store L1D Miss : 0 No Page Map Rejects : 35 Unable to parse data source : 0 ================================================= Global Shared Cache Line Event Information ================================================= Total Shared Cache Lines : 0 Load HITs on shared lines : 0 Fill Buffer Hits on shared lines : 0 L1D hits on shared lines : 0 L2D hits on shared lines : 0 LLC hits on shared lines : 0 Locked Access on shared lines : 0 Blocked Access on shared lines : 0 Store HITs on shared lines : 0 Store L1D hits on shared lines : 0 Total Merged records : 0 ================================================= c2c details ================================================= Events : cpu/mem-loads,ldlat=30/P : cpu/mem-stores/P : dummy:HG Cachelines sort on : Total HITMs Cacheline data grouping : offset,pid,iaddr ================================================= Shared Data Cache Line Table ================================================= # # ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total ---- Stores ---- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- L> # Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss FB L1 L2 LclHit LclHitm RmtHit RmtHitm > # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... 
.....> # ================================================= Shared Cache Line Distribution Pareto ================================================= # ``` --- ```bash $ perf c2c record -F 60000 -a ./pad $ perf c2c report -NN -c pid,iaddr --full-symbols --stdio ================================================= Trace Event Information ================================================= Total records : 109 Locked Load/Store Operations : 3 Load Operations : 34 Loads - uncacheable : 0 Loads - IO : 0 Loads - Miss : 0 Loads - no mapping : 2 Load Fill Buffer Hit : 10 Load L1D hit : 12 Load L2D hit : 0 Load LLC hit : 7 Load Local HITM : 0 Load Remote HITM : 0 Load Remote HIT : 0 Load Local DRAM : 3 Load Remote DRAM : 0 Load MESI State Exclusive : 0 Load MESI State Shared : 3 Load LLC Misses : 3 Load access blocked by data : 0 Load access blocked by address : 0 LLC Misses to Local DRAM : 100.0% LLC Misses to Remote DRAM : 0.0% LLC Misses to Remote cache (HIT) : 0.0% LLC Misses to Remote cache (HITM) : 0.0% Store Operations : 75 Store - uncacheable : 0 Store - no mapping : 0 Store L1D Hit : 75 Store L1D Miss : 0 No Page Map Rejects : 35 Unable to parse data source : 0 ================================================= Global Shared Cache Line Event Information ================================================= Total Shared Cache Lines : 0 Load HITs on shared lines : 0 Fill Buffer Hits on shared lines : 0 L1D hits on shared lines : 0 L2D hits on shared lines : 0 LLC hits on shared lines : 0 Locked Access on shared lines : 0 Blocked Access on shared lines : 0 Store HITs on shared lines : 0 Store L1D hits on shared lines : 0 Total Merged records : 0 ================================================= c2c details ================================================= Events : cpu/mem-loads,ldlat=30/P : cpu/mem-stores/P : dummy:HG Cachelines sort on : Total HITMs Cacheline data grouping : offset,pid,iaddr ================================================= Shared Data Cache Line Table 
=================================================
#
#        ----------- Cacheline ----------      Tot  ------- Load Hitm -------    Total    Total    Total  ---- Stores ----  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- L>
# Index             Address  Node  PA cnt  Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss       FB       L1       L2   LclHit  LclHitm   RmtHit  RmtHitm  >
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  .......  ........  .......  .....>
#

=================================================
      Shared Cache Line Distribution Pareto      
=================================================
#
```

Padded or not, the two reports are nearly identical. Note that both runs show zero HITM events and zero shared cache lines; in other words, `perf c2c` never caught the two processes contending for a line, so this experiment has not actually reproduced false sharing and still needs to be verified.