---
tags: cs:app, csapp
---
# [CS:APP](https://hackmd.io/@sysprog/CSAPP) Chapter 6 Highlights
## Why you should understand memory in depth
Video: [What are Heads, Tracks, Cylinders & Sectors?](https://www.youtube.com/watch?v=gd5vKobmZ3Q)
Practice problem `6.4` (==Page 411==): suppose a 1 MB file is made up of 512-byte logical blocks and a program reads the file's logical blocks sequentially; positioning the read/write head over the first block takes T^avg_seek^ + T^avg_rotation^.
The reference answer (==Page 459==) points out:
* Best case
    * the blocks are mapped to contiguous sectors on the same cylinder, so they can be read one after another without moving the head
    * 5 + 3 + 12 = 20 ms
* Random case
    * each of the 2000 blocks has to pay the positioning cost separately
    * 16000 ms (a jump to 16 seconds)
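A worked version of the arithmetic behind this answer (a sketch; it assumes, as the book's answer does, that T^avg_seek^ = 5 ms, T^avg_rotation^ = 3 ms, the 1 MB file occupies about 2000 sectors whose sequential transfer takes 12 ms, and that in the random case each block pays a full seek plus rotation while its own transfer time is negligible):
$$
\begin{aligned}
T_{\text{best}} &\approx T_{\text{avg seek}} + T_{\text{avg rotation}} + T_{\text{transfer}} = 5 + 3 + 12 = 20 \text{ ms} \\
T_{\text{random}} &\approx 2000 \times (T_{\text{avg seek}} + T_{\text{avg rotation}}) = 2000 \times 8 \text{ ms} = 16000 \text{ ms}
\end{aligned}
$$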
Reference: [Cylinder-head-sector](https://en.wikipedia.org/wiki/Cylinder-head-sector)
## Main topics
* Background reading
    * [Modern processor design: cache principles and practical impact](https://hackmd.io/s/HkW3Dr1Rb)
    * [Cache principles and practical impact](https://hackmd.io/s/HkyscQn2z)
    * [software-pipelining](https://hackmd.io/s/HkbPkW86f)
    * [CPU caches](https://lwn.net/Articles/252125/) by Ulrich Drepper
- [ ] The Memory Hierarchy: [video](https://www.youtube.com/watch?v=vusQa4pfTFU) / [slides](https://www.cs.cmu.edu/~213/lectures/10-memory-hierarchy.pdf)
    * Corresponds to CS:APP3e §6.3 The Memory Hierarchy (==Page 421==)
![](https://i.imgur.com/KnmFqHf.png)
![](https://i.imgur.com/MmErdyO.png)
![](https://i.imgur.com/2i77x7n.png)
Do not confuse this with the concept of [Transactional memory](https://en.wikipedia.org/wiki/Transactional_memory).
![](https://i.imgur.com/rolKNaG.png)
The growing gap between CPU speed and memory access speed over the years shows why locality is the factor with the largest impact on program performance.
![](https://i.imgur.com/2iQVPYf.png)
![](https://i.imgur.com/mDV9T6R.png)
```c
int sum_array_rows(int a[M][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```
Is the locality of this loop good? See [Row- and column-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order)
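One way to see why: in C a two-dimensional array is stored in row-major order, so `a[i][j]` and `a[i][j+1]` are adjacent in memory and the inner loop above walks through consecutive addresses. A minimal sketch to check this (the `M` and `N` values are arbitrary, picked only for illustration):
```c
#include <assert.h>
#include <stdio.h>

#define M 4
#define N 6

int main(void)
{
    int a[M][N];
    int *base = &a[0][0];

    /* Row-major layout: element (i, j) lives at offset i*N + j from the base. */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            assert(&a[i][j] == base + (long) i * N + j);

    /* Consecutive j (the inner loop of sum_array_rows) touches consecutive addresses. */
    printf("&a[1][2] - &a[1][1] = %td element(s)\n", &a[1][2] - &a[1][1]);
    return 0;
}
```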
:::info
```
          column
        ┌ ┐   ┌ ┐
row     │ │   │ │
        └ ┘   └ ┘
```
How should row and column be translated into Chinese, and which one is vertical, which one horizontal?
In Taiwan a 「列」 (row) runs horizontally, while in China a 「列」 (column) runs vertically, which is ==exactly the opposite==! Why did the usage diverge? The manuscript paper standardized in China lays its grid out horizontally, so each 行 (line) runs across the page, whereas Taiwanese manuscript paper is vertical, so each 行 runs down the page. In English, a row is one "horizontal" strip, and a column is one "vertical" bar, like the bars of a prison cell.
* Taiwan translates row as ==列== and column as ==行== :o: think of how Taiwanese people love to queue: one queue (行) follows another, and standing side by side with a partner forms a 列
* China translates row as ==行== and column as ==列== :o: think of Tiananmen Square: a 「行」 of people standing abreast against the advancing tanks, each loss standing upright as a martyr, 「烈士」 (a homophone of 列)
Also, when counting rows and columns: rows run horizontally and are stacked top to bottom, so you count rows ==vertically==; columns run vertically and are laid out left to right, so you count columns ==horizontally==.
:::
By contrast, the code below does not traverse the array in row-major order:
```c
int sum_array_cols(int a[M][N])
{
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```
Unless M is very small (so the few blocks touched by one column sweep all stay cached), this access pattern hurts performance.
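A quick way to observe the gap on your own machine (a minimal sketch; the array dimensions, the `clock_gettime`-based timing, and the `volatile` variables used to keep the sums from being optimized away are all choices made here, not something from the book):
```c
#include <stdio.h>
#include <time.h>

#define M 4096
#define N 4096

static int a[M][N];

/* Row-major traversal: the inner loop walks consecutive addresses. */
static int sum_rows(void)
{
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: consecutive accesses are N * sizeof(int) bytes apart. */
static int sum_cols(void)
{
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;

    double t0 = seconds();
    volatile int s1 = sum_rows();   /* volatile: keep the result from being optimized out */
    double t1 = seconds();
    volatile int s2 = sum_cols();
    double t2 = seconds();

    printf("sum_rows = %d, %.3f s\n", s1, t1 - t0);
    printf("sum_cols = %d, %.3f s\n", s2, t2 - t1);
    return 0;
}
```
Compile with optimization (e.g. `gcc -O1 bench.c`, where the file name is only illustrative); on a typical machine with 64-byte cache lines the row-wise version is usually several times faster, with the exact ratio depending on the cache hierarchy.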
![](https://i.imgur.com/ob8n0S9.png)
![](https://i.imgur.com/fBl8zkf.png)
![](https://i.imgur.com/DrufsYA.png)
![](https://i.imgur.com/AQ8DkkY.png)
- [ ] Cache Memories: [video](https://youtu.be/Spg7iTYLkS4) / [slides](https://www.cs.cmu.edu/~213/lectures/12-cache-memories.pdf)
    * Corresponds to CS:APP3e §6.4 Cache Memories (==Page 425==)
![](https://i.imgur.com/KvQg6sO.png)
![](https://i.imgur.com/C1egnka.png)
![](https://i.imgur.com/HKqR10T.png)
The time penalty of a cache miss is high.
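To make the cost concrete (the figures are assumed, textbook-style numbers rather than values taken from the slides): with a 1-cycle hit time and a 100-cycle miss penalty, a 97% hit rate gives an average access time of 1 + 0.03 × 100 = 4 cycles, while a 99% hit rate gives 1 + 0.01 × 100 = 2 cycles, which is why it pays to reason in terms of the miss rate rather than the hit rate.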
![](https://i.imgur.com/jBHtGCa.png)
![](https://i.imgur.com/MID1W9D.png)
Memory Mountain
* ==Page 444== to ==Page 450==
* Get [mountain.tar](http://csapp.cs.cmu.edu/3e/mountain.tar) from the [CS:APP student site](http://csapp.cs.cmu.edu/3e/students.html)
```shell
$ tar xf mountain.tar
$ cd mountain
$ make clean all
$ ./mountain
```
* Compare with [A Gallery of Memory Mountains](https://csappbook.blogspot.tw/2017/05/a-gallery-of-memory-mountains.html)
* Observe the read throughput
* `mountain.c`
```c
/* test - Iterate over first "elems" elements of array "data" with
 *        stride of "stride", using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
    long i, sx2 = stride * 2, sx3 = stride * 3, sx4 = stride * 4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems;
    long limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i + stride];
        acc2 = acc2 + data[i + sx2];
        acc3 = acc3 + data[i + sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}
```
* 2009: [Nehalem](https://en.wikipedia.org/wiki/Nehalem_(microarchitecture)) (built on a 45 nm process, with later parts moving to 32 nm)
![](https://i.imgur.com/HKWcrKD.png)
* 2013: [Haswell](https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client)) (built on a 22 nm process, the same as the Ivy Bridge microarchitecture)
![](https://i.imgur.com/d1m3lvj.png)
![](https://i.imgur.com/5L0ZZid.png)
![](https://i.imgur.com/OZH8WSq.png)
Layout of C Arrays in Memory
* C arrays are laid out in row-major order
    * each row occupies contiguous memory locations
* Stepping through columns in one row:
```c
for (i = 0; i < N; i++)
    sum += a[0][i];
```
    * accesses successive elements
    * if `block size (B)` > sizeof(a~ij~) bytes, exploit spatial locality
        - miss rate = sizeof(a~ij~) / B
* Stepping through rows in one column:
```c
for (i = 0; i < N; i++)
    sum += a[i][0];
```
    * accesses distant elements
    * no spatial locality!
        - miss rate = 1 (i.e. 100%)
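A concrete instance (the block size here is an assumed, typical value, not a number from the slides): with 4-byte `int` elements and 64-byte cache blocks, stepping through the columns of one row misses only on the first element of each block, so the miss rate is sizeof(a~ij~) / B = 4 / 64 = 6.25%, whereas stepping down one column of a large array misses on every access (100%) as soon as each row spans at least one block and the array does not fit in the cache.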
![](https://i.imgur.com/UWUYFTY.png)
![](https://i.imgur.com/rhMXeWQ.png)
![](https://i.imgur.com/HEv4eaE.png)
![](https://i.imgur.com/gczHDZz.png)
![](https://i.imgur.com/56wnMPL.png)
:::success
In-class exercises
- [ ] 6.37: compute the cache miss rates for `N=64` and `N=60` (==Page 455==)
- [ ] 6.45: write a faster matrix transpose implementation (==Page 458==); a blocked-transpose sketch appears at the end of this note
- [ ] 6.46: convert a directed graph into an undirected graph (==Page 458==) / references: [Graph](http://www.csie.ntnu.edu.tw/~u91029/Graph.html), [Adjacency Matrix](https://hellolynn.hpd.io/2017/09/22/%E5%AF%A6%E4%BD%9Cgraph%E8%88%87dfs%E3%80%81bfs%E8%B5%B0%E8%A8%AA/)
:::
A worked answer for 6.37:
```c
typedef int array_t[N][N];

int sumA(array_t a) {
    int i, j;
    int sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

int sumB(array_t a) {
    int i, j;
    int sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[j][i];
    return sum;
}

int sumC(array_t a) {
    int i, j;
    int sum = 0;
    for (i = 0; i < N; i += 2)
        for (j = 0; j < N; j += 2)
            sum += (a[j][i] + a[j][i+1] + a[j+1][i] + a[j+1][i+1]);
    return sum;
}
```
Cache parameters for this problem: size C = 4096 bytes, block size B = 16 bytes, associativity E = 1 (direct-mapped), number of sets S = 256.
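Before doing the analysis by hand, a small model of this direct-mapped cache can be used to check the numbers (a sketch; it assumes the array starts at address 0, i.e. block-aligned, and that only the reads of `a` go through the cache; the helper names `touch`, `addr`, and `run` are made up for illustration):
```c
#include <stdio.h>
#include <string.h>

/* Direct-mapped cache from 6.37: C = 4096 B, B = 16 B, E = 1, S = 256 sets. */
#define BLOCK 16
#define SETS  256

static long tags[SETS];
static int valid[SETS];
static long accesses, misses;

/* Simulate one read of the byte at "addr". */
static void touch(long addr)
{
    long block = addr / BLOCK;
    long set = block % SETS;
    accesses++;
    if (!valid[set] || tags[set] != block) {   /* cold or conflict miss */
        misses++;
        valid[set] = 1;
        tags[set] = block;
    }
}

static void reset(void)
{
    memset(valid, 0, sizeof(valid));
    accesses = misses = 0;
}

/* Address of a[row][col] for an n x n array of 4-byte ints starting at 0. */
static long addr(int n, int row, int col) { return ((long) row * n + col) * 4; }

static void report(const char *name, int n)
{
    printf("%s, N = %d: miss rate = %.0f%%\n", name, n, 100.0 * misses / accesses);
}

static void run(int n)
{
    reset();                                   /* sumA: a[i][j] */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            touch(addr(n, i, j));
    report("sumA", n);

    reset();                                   /* sumB: a[j][i] */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            touch(addr(n, j, i));
    report("sumB", n);

    reset();                                   /* sumC: 2x2 groups */
    for (int i = 0; i < n; i += 2)
        for (int j = 0; j < n; j += 2) {
            touch(addr(n, j, i));
            touch(addr(n, j, i + 1));
            touch(addr(n, j + 1, i));
            touch(addr(n, j + 1, i + 1));
        }
    report("sumC", n);
}

int main(void)
{
    run(64);
    run(60);
    return 0;
}
```
Under these assumptions it should reproduce the miss rates derived below.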
### N = 64
sizeof(array_t) == 64 * 64 * 4 == 16384 == 4 * C, so the array is four times the size of the cache.
Memory-to-cache mapping: addresses run from 0 to 4096*4 (the array is assumed to start at address 0). Each cell below is one 16-byte block, and the number inside it is the cache set (0-255) the block maps to; the mapping repeats every 4096 bytes.
```
0         +---------+
          |    0    |
16        +---------+
          |    1    |
32        +---------+
          |    2    |
48        +---------+
          |    .    |
          |    .    |
          |    .    |
4096-16   +---------+
          |   255   |
4096      +---------+
          |    0    |
4096+16   +---------+
          |    1    |
4096+32   +---------+
          |    .    |
          |    .    |
          |    .    |
4096*4-16 +---------+
          |   255   |
4096*4    +---------+
```
A. sumA (`sum += a[i][j];`)
* Memory addresses are read in order: 0, 4, 8, 12, ..., 4096*4 - 4
* Cache sets touched, in order: 0, 0, 0, 0, 1, 1, 1, 1, ..., 255, 255, 255, 255, 0, 0, 0, 0, ..., 255, 255, 255, 255
* The first read of each 16-byte block misses and the next 3 reads of that block hit.
* Miss rate: 25%
B. sumB (`sum += a[j][i];`)
* Memory addresses read by the first inner loop (i = 0): 0, 64*4, 64*4*2, ..., 64*4*63
* Cache sets touched: 0, 16, 32, ..., 240, and then the same 16 sets three more times, because a row is 64*4 = 256 bytes = 16 blocks, so rows j and j+16 map to the same set.
* With E = 1 each set holds exactly one block, so rows 16 to 63 keep evicting the blocks loaded by earlier rows: every access of the first inner loop misses.
* When i advances to 1, a[0][1] lies in the same block as a[0][0], but that block has already been evicted (set 0 now holds the block of a[48][0..3]), so it misses again; the same conflict repeats for every element of every column.
* Miss rate: 100%. With N = 64 the rows alias perfectly in the direct-mapped cache, so the column-wise traversal thrashes.
C. sumC (`sum += (a[j][i] + a[j][i+1] + a[j+1][i] + a[j+1][i+1]);` with both loops stepping by 2)
* Since i is even, a[j][i] and a[j][i+1] always share one 16-byte block, and a[j+1][i] and a[j+1][i+1] share another, so the second access of each pair always hits.
* a[j][i] and a[j+1][i] map to sets that differ by 16 and do not evict each other, but rows j and j+16 still map to the same set. Within one inner loop each of those sets is reused by four different rows, so by the time the outer loop returns to the same block column the block has been evicted again.
* Therefore a[j][i] and a[j+1][i] miss on every iteration (cold the first time, a conflict miss afterwards) while the other two reads hit: 2 misses out of every 4 reads.
* Total reads = 64 * 64, of which half miss. Miss rate: 50%
### N = 60
Now a row is 60 * 4 = 240 bytes = 15 blocks, so consecutive rows no longer land on the same cache sets.
A. sumA (`sum += a[i][j];`)
* Memory is read sequentially with stride 1; the first read of each block misses and the next 3 hit.
* Miss rate: 25%
B. sumB (`sum += a[j][i];`)
* This is the interesting case. The first inner loop reads a[0][0] -> a[59][0] at addresses 0, 60*4, 60*4*2, ..., 60*4*59.
* The cache set of a[j][0] is (j * 15) mod 256, which for j = 0..59 gives:
    * 0, 15, 30, ..., 240 (17 sets, j = 0..16)
    * 255, 14, 29, ..., 239 (17 sets, j = 17..33)
    * 254, 13, 28, ..., 238 (17 sets, j = 34..50)
    * 253, 12, 27, 42, 57, 72, 87, 102, 117 (9 sets, j = 51..59)
* All 60 reads miss, but they land in 60 different sets, so none of the loaded blocks is evicted.
* The next 3 inner loops, a[0][1] -> a[59][1], a[0][2] -> a[59][2], and a[0][3] -> a[59][3], therefore all hit, and the same pattern repeats for every group of 4 columns.
* Although N is now smaller and not a power of 2, the rows no longer conflict, so the miss rate drops from 100% back to 25%.
C. sumC
* As in sumB, the 60 rows map to 60 distinct sets, so there are no conflict misses within one inner loop.
* When a block column is visited for the first time (i mod 4 == 0), a[j][i] and a[j+1][i] miss while a[j][i+1] and a[j+1][i+1] hit; on the following outer iteration (i mod 4 == 2) the same blocks are still resident, so all four reads hit.
* Averaged over the two cases, the miss rate is 25%.
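For 6.45 (listed in the exercise box above), one common direction, sketched here rather than the book's reference answer, is to transpose the matrix tile by tile so that both the reads of `src` and the writes of `dst` keep reusing cached blocks; the function signature and the tile size `BSIZE` below are assumptions for illustration:
```c
#define BSIZE 16   /* tuning assumption: a BSIZE x BSIZE tile of src and dst
                      should stay resident in the cache                      */

/* dim x dim transpose done tile by tile, so that both the row-wise reads of
 * src and the strided writes of dst stay within a small working set. */
void transpose_blocked(int *dst, int *src, int dim)
{
    for (int ii = 0; ii < dim; ii += BSIZE)
        for (int jj = 0; jj < dim; jj += BSIZE) {
            int imax = ii + BSIZE < dim ? ii + BSIZE : dim;
            int jmax = jj + BSIZE < dim ? jj + BSIZE : dim;
            for (int i = ii; i < imax; i++)
                for (int j = jj; j < jmax; j++)
                    dst[j * dim + i] = src[i * dim + j];
        }
}
```
Choosing `BSIZE` so that one tile of each matrix fits in the cache turns the scattered writes of the naive transpose into accesses that reuse each block before it is evicted.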