Try   HackMD

CS:APP 第 6 章重點提示

為何你該深入理解記憶體

影片: What are Heads, Tracks, Cylinders & Sectors?
練習題 6.4 (Page 411): 假設 1 MB 的檔案由 512 bytes 的邏輯區塊組成,假設程式循序讀取檔案的邏輯區塊,將讀寫頭定位到第一個區塊的時間為 Tavg_seek + Tavg_rotation
參考答案指出 (Page 459):

  • 最好的情況
    • 區塊被映射到連續的 sector,同一個 cylinder,即可一個區塊接著一個區塊讀取,不用移動讀寫頭
    • 5 + 3 + 12 = 20 ms
  • 隨機的情況
    • 讀取 2000 個區塊的每一部分都要計算
    • 16000 ms (大幅增加到 16 秒)

Reference: Cylinder-head-sector

主軸

  • 對應 CS:APP3e 的第 6 章: 6.3 儲存裝置的階層架構 (Page 421)


不要跟 Transactional memory 觀念搞混

從中央處理器和記憶體存取速度的演變可看出,對電腦程式影響極大的因素是 locality。


int sum_array_rows(int a[M][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

這樣的 locality 好嗎?參見 Row- and column-major order

         column
        ┌    ┐    ┌    ┐
    row │    │    │    │
        └    ┘    └    ┘

row, column 的中文到底怎麼翻譯呢?哪個是直的?哪個是橫的?
台灣的「列」(row) 是橫向;中國的「列」(column) 是直向,完全相反!為何有這樣的演化?因為新中國政府頒訂的稿紙格子是橫式,每一行是橫的,而台灣的稿紙是直式,每一行是直的。在英文用法,row 是「橫」的一條,column 則是像監獄欄杆「直」的一根。

  • 台灣把 row 翻成,column 翻成 :o: 想成台灣人愛排隊,一行接著一「行」,攜伴並排就是「列」
  • 中國把 row 翻成,column 翻成 :o: 想成六四天安門廣場上,一「行」人橫著對抗進犯的坦克車,每次損耗就豎直變「烈士」(列)

另外,在計數 row, column 時,因為 row 是橫的由上往下排,所以數橫的 row 要豎著數,又因 column 是直的由左往右排,所以 column 要橫著數

反之,下方的程式碼就不是 row-major:

int sum_array_cols(int a[M][N])
{
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

除非在 M 數值很小的狀況,才不會因此影響效能。




  • 對應 CS:APP3e 的第 6 章: 6.4 快取記憶體 (Page 425)



    cache miss 的時間代價是很高的


Memory Mountain

/* test - Iterate over first "elems" elements of array "data" with
 *        stride of "stride", using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
    long i, sx2 = stride * 2, sx3 = stride * 3, sx4 = stride * 4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems;
    long limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];     
        acc1 = acc1 + data[i + stride];
        acc2 = acc2 + data[i + sx2]; 
        acc3 = acc3 + data[i + sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}
  • 2009 年的 Nehalem (採用 45 奈米製程,後期改用 32 奈米製程)
  • 2013 年的 Haswell (和 Ivy Bridge 微架構一樣採用 22 奈米製程)



Layout of C Arrays in Memory

  • C arrays 以 row-major 順序配置
    • each row in contiguous memory locations
  • Stepping through columns in one row:
    ​​​​for (i = 0; i < N; i++)
    ​​​​    sum += a[0][i];
    
    • accesses successive elements
    • if block size (B) > sizeof(aij) bytes, exploit spatial locality
      • miss rate = sizeof(aij) / B
  • Stepping through rows in one column:
    ​​​​for (i = 0; i < n; i++)
    ​​​​    sum += a[i][0];
    
    • accesses distant elements
    • no spatial locality!
      • miss rate = 1 (i.e. 100%)




隨堂練習

  • 6.37: 計算 N=64N=60 的 cache miss rate (Page 455)
  • 6.45: 撰寫更快的轉置矩陣實作 (Page 458)
  • 6.46: 有向圖轉換為無向圖 (Page 458) / 參考: Graph, 相鄰矩陣 (Adjacency Matrix))

6.37

typedef int array_t[N][N];

int sumA(array_t a) {
  int i, j;
  int sum = 0;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      sum += a[i][j];

  return sum;
}

int sumB(array_t a) {
  int i, j;
  int sum = 0;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      sum += a[j][i];

  return sum;
}

int sumC(array_t a) {
  int i, j;
  int sum = 0;
  for (i = 0; i < N; i+=2)
    for (j = 0; j < N; j+=2)
      sum += (a[j][i] + a[j][i+1] + a[j+1][i] + a[j+1][i+1])
}

C = 4096, B = 16, E = 1, S = 256

N = 64

sizeof(array_t) == 64 * 64 == 4096 == 4*C

​​​​memory-cache graph

​​​​memory address start from 0 and end to 4096*4.

​​​​cell size is 16B. number(0-255) in cell means cache block number that the cell
​​​​will be cached.

​​​​    0    +---------+
​​​​         |    0    |
​​​​    16   +---------+
​​​​         |    1    |
​​​​    32   +---------+
​​​​         |    2    |
​​​​    48   +---------+
​​​​         |    .    |
​​​​         |    .    |
​​​​         |    .    |
​​​​         |    .    |
​​​​         |    .    |
​​​​4096-16  +---------+
​​​​         |   255   |
​​​​   4096  +---------+
​​​​         |    0    |
​​​​4096+16  +---------+
​​​​         |    1    |
​​​​4096+32  +---------+
​​​​         |    .    |
​​​​         |    .    |
​​​​         |    .    |
​​​​         |    .    |
​​​​         |    .    |
​​​​         |    .    |
​​​​4096*4-16+---------+
​​​​         |   255   |
​​​​4096*4   +---------+

A. sumA

​​​​sum += a[i][j];

read memory address order:

0, 4, 8, 12, , 4096*4-4

read cache order:

0, 0, 0, 0, 1, 1, 1, 1,255, 255, 255, 255, 0, 0, 0, 0,255, 255, 255,
255

first cache read miss, next 3 time read hit.

miss rate: 25%

B. sumB

​​​​sum += a[j][i];

read memory address order:

0, 644, 6442, 64463, 4, 644+4, 6442+4, 4096*4-4

read cache order:

0, 16, 32, 48, 240,(4 times) 1, 17, 33, 241,(4 times)
15, 31, 47, 255(4 times)

let's see first read loop:

read cache order loop 4 times

0, 16, 32, 48, 240,(4 times)

first loop all miss, next 3 loop all hit

so miss rate is 25%.

C. sumC

​​​​for (i = 0; i < N; i+=2)
​​​​  for (j = 0; j < N; j+=2)
​​​​    sum += (a[j][i] + a[j][i+1] + a[j+1][i] + a[j+1][i+1]);

easy to see that read a[j][i+1] and a[j+1][i+1] always hit

same like

​​​​for (i = 0; i < N; i+=2)
​​​​  for (j = 0; j < N; j+=2)
​​​​    sum += (a[j][i] + a[j+1][i]);

same like

​​​​for (i = 0; i < N; i+=2)
​​​​  for (j = 0; j < N; j++)
​​​​    sum += a[j][i];

total read count = 64*64

because of i+=2,

read cache order only loop 2 times

0, 16, 32, 48, 240,(2 times)

so miss rate is 50%

total read miss count = 64/2 * 64 * 50% = 64*64/4

so miss rate is still 25%.

N = 60

A. sumA

​​​​sum += a[i][j];

read memory by step 1

miss rate 25%

B. sumB

​​​​for (i = 0; i < N; i++)
​​​​  for (j = 0; j < N; j++)
​​​​    sum += a[j][i];

it's interesting.

let's see first inner loop a[0][0] -> a[59][0]

read memory address order:

0, 604, 6042, 604*59

read cache order:

0, 15, 30, , 225, (17 numbers)
255, 14, 29, , 224, (17 numbers)
254, 13, 28, , 223, (17 numbers)
253, 12, 27, 42, 57, 72, 87, 102, 117 (9 numbers)

all read miss and store into different blocks

next 3 loops: a[0][1] -> a[59][1], a[0][2] -> a[59][2], a[0][3] -> a[59][3]

all hit

althrough N is smaller and not power of 2, miss rate is still 25%

C. sumC

same as miss rate when N = 64

25%