CS:APP 第 6 章重點提示

為何你該深入理解記憶體

影片: What are Heads, Tracks, Cylinders & Sectors?
練習題 6.4 (Page 411): 假設 1 MB 的檔案由 512 bytes 的邏輯區塊組成，假設程式循序讀取檔案的邏輯區塊，將讀寫頭定位到第一個區塊的時間為 T^avg_seek + T^avg_rotation
參考答案指出 (Page 459):

最好的情況
- 區塊被映射到連續的 sector，同一個 cylinder，即可一個區塊接著一個區塊讀取，不用移動讀寫頭
- 5 + 3 + 12 = 20 ms
隨機的情況
- 讀取 2000 個區塊的每一部分都要計算
- 16000 ms (大幅增加到 16 秒)

Reference: Cylinder-head-sector

主軸

導讀
- 現代處理器設計: Cache 原理和實際影響
- Cache 原理和實際影響
- software-pipelining
- CPU caches by Ulrich Drepper

The Memory Hierarchy: 錄影 / 簡報

對應 CS:APP3e 的第 6 章: 6.3 儲存裝置的階層架構 (Page 421)

不要跟 Transactional memory 觀念搞混

從中央處理器和記憶體存取速度的演變可看出，對電腦程式影響極大的因素是 locality。

int sum_array_rows(int a[M][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

這樣的 locality 好嗎？參見 Row- and column-major order

         column
        ┌    ┐    ┌    ┐
    row │    │    │    │
        └    ┘    └    ┘

row, column 的中文到底怎麼翻譯呢？哪個是直的？哪個是橫的？
台灣的「列」(row) 是橫向；中國的「列」(column) 是直向，完全相反！為何有這樣的演化？因為新中國政府頒訂的稿紙格子是橫式，每一行是橫的，而台灣的稿紙是直式，每一行是直的。在英文用法，row 是「橫」的一條，column 則是像監獄欄杆「直」的一根。

台灣把 row 翻成列，column 翻成行 :o: 想成台灣人愛排隊，一行接著一「行」，攜伴並排就是「列」
中國把 row 翻成行，column 翻成列 :o: 想成六四天安門廣場上，一「行」人橫著對抗進犯的坦克車，每次損耗就豎直變「烈士」(列)

另外，在計數 row, column 時，因為 row 是橫的由上往下排，所以數橫的 row 要豎著數，又因 column 是直的由左往右排，所以 column 要橫著數

反之，下方的程式碼就不是 row-major:

int sum_array_cols(int a[M][N])
{
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

除非在 M 數值很小的狀況，才不會因此影響效能。

Cache Memories: 錄影 / 簡報

對應 CS:APP3e 的第 6 章: 6.4 快取記憶體 (Page 425)

cache miss 的時間代價是很高的

Memory Mountain

Page 444 到 Page 450

參照 CS:APP student site 取得 mountain.tar

$ tar xf mountain.tar
$ cd mountain
$ make clean all
$ ./mountain

對照 A Gallery of Memory Mountains
- 觀察 read throughput
mountain.c

/* test - Iterate over first "elems" elements of array "data" with
 *        stride of "stride", using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
    long i, sx2 = stride * 2, sx3 = stride * 3, sx4 = stride * 4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems;
    long limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];     
        acc1 = acc1 + data[i + stride];
        acc2 = acc2 + data[i + sx2]; 
        acc3 = acc3 + data[i + sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}

2009 年的 Nehalem (採用 45 奈米製程，後期改用 32 奈米製程)
2013 年的 Haswell (和 Ivy Bridge 微架構一樣採用 22 奈米製程)

Layout of C Arrays in Memory

C arrays 以 row-major 順序配置
- each row in contiguous memory locations
Stepping through columns in one row:
```
for (i = 0; i < N; i++)
    sum += a[0][i];
```
- accesses successive elements
- if block size (B) > sizeof(a_ij) bytes, exploit spatial locality
  - miss rate = sizeof(a_ij) / B
Stepping through rows in one column:
```
for (i = 0; i < n; i++)
    sum += a[i][0];
```
- accesses distant elements
- no spatial locality!
  - miss rate = 1 (i.e. 100%)

隨堂練習

6.37: 計算 N=64 和 N=60 的 cache miss rate (Page 455)
6.45: 撰寫更快的轉置矩陣實作 (Page 458)
6.46: 有向圖轉換為無向圖 (Page 458) / 參考: Graph, 相鄰矩陣 (Adjacency Matrix))

6.37

typedef int array_t[N][N];

int sumA(array_t a) {
  int i, j;
  int sum = 0;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      sum += a[i][j];

  return sum;
}

int sumB(array_t a) {
  int i, j;
  int sum = 0;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      sum += a[j][i];

  return sum;
}

int sumC(array_t a) {
  int i, j;
  int sum = 0;
  for (i = 0; i < N; i+=2)
    for (j = 0; j < N; j+=2)
      sum += (a[j][i] + a[j][i+1] + a[j+1][i] + a[j+1][i+1])
}

C = 4096, B = 16, E = 1, S = 256

N = 64

sizeof(array_t) == 64 * 64 == 4096 == 4*C

memory-cache graph

memory address start from 0 and end to 4096*4.

cell size is 16B. number(0-255) in cell means cache block number that the cell
will be cached.

    0    +---------+
         |    0    |
    16   +---------+
         |    1    |
    32   +---------+
         |    2    |
    48   +---------+
         |    .    |
         |    .    |
         |    .    |
         |    .    |
         |    .    |
4096-16  +---------+
         |   255   |
   4096  +---------+
         |    0    |
4096+16  +---------+
         |    1    |
4096+32  +---------+
         |    .    |
         |    .    |
         |    .    |
         |    .    |
         |    .    |
         |    .    |
4096*4-16+---------+
         |   255   |
4096*4   +---------+

A. sumA

sum += a[i][j];

read memory address order:

0, 4, 8, 12, …, 4096*4-4

read cache order:

0, 0, 0, 0, 1, 1, 1, 1,…255, 255, 255, 255, 0, 0, 0, 0,…255, 255, 255,
255

first cache read miss, next 3 time read hit.

miss rate: 25%

B. sumB

sum += a[j][i];

read memory address order:

0, 644, 6442, … 64463, 4, 644+4, 6442+4, … 4096*4-4

read cache order:

0, 16, 32, 48, … 240,(4 times) 1, 17, 33, … 241,(4 times)
15, 31, 47, … 255(4 times)

let's see first read loop:

read cache order loop 4 times

0, 16, 32, 48, … 240,(4 times)

first loop all miss, next 3 loop all hit

so miss rate is 25%.

C. sumC

for (i = 0; i < N; i+=2)
  for (j = 0; j < N; j+=2)
    sum += (a[j][i] + a[j][i+1] + a[j+1][i] + a[j+1][i+1]);

easy to see that read a[j][i+1] and a[j+1][i+1] always hit

same like

for (i = 0; i < N; i+=2)
  for (j = 0; j < N; j+=2)
    sum += (a[j][i] + a[j+1][i]);

same like

for (i = 0; i < N; i+=2)
  for (j = 0; j < N; j++)
    sum += a[j][i];

total read count = 64*64

because of i+=2,

read cache order only loop 2 times

0, 16, 32, 48, … 240,(2 times)

so miss rate is 50%

total read miss count = 64/2 * 64 * 50% = 64*64/4

so miss rate is still 25%.

N = 60

A. sumA

sum += a[i][j];

read memory by step 1

miss rate 25%

B. sumB

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    sum += a[j][i];

it's interesting.

let's see first inner loop a[0][0] -> a[59][0]

read memory address order:

0, 604, 6042, … 604*59

read cache order:

0, 15, 30, …, 225, (17 numbers)
255, 14, 29, …, 224, (17 numbers)
254, 13, 28, …, 223, (17 numbers)
253, 12, 27, 42, 57, 72, 87, 102, 117 (9 numbers)

all read miss and store into different blocks

next 3 loops: a[0][1] -> a[59][1], a[0][2] -> a[59][2], a[0][3] -> a[59][3]

all hit

althrough N is smaller and not power of 2, miss rate is still 25%

C. sumC

same as miss rate when N = 64

25%

CS:APP 第 6 章重點提示

為何你該深入理解記憶體

主軸

N = 64

N = 60

Read more

KVM: Linux 虛擬化基礎建設

Linux 核心設計: 不僅是個執行單元的 Process

2025 年 Linux 核心設計課程期末專題

並行程式設計: 概念