# [CS:APP3e](https://hackmd.io/c/S1vGugaDQ/) Chapter 6 Highlights

## Why You Should Understand Memory in Depth

Video: [What are Heads, Tracks, Cylinders & Sectors?](https://www.youtube.com/watch?v=gd5vKobmZ3Q)

Practice problem `6.4` (==Page 411==): a 1 MB file is made up of 512-byte logical blocks. Suppose a program reads the file's logical blocks sequentially, and that positioning the read/write head over the first block takes T^avg_seek^ + T^avg_rotation^.

The reference answer (==Page 459==) points out:
* Best case
    * The blocks are mapped to contiguous sectors on the same cylinder, so they can be read one after another without moving the head
    * 5 + 3 + 12 = 20 ms
* Random case
    * Each of the 2,000 blocks pays its own T^avg_seek^ + T^avg_rotation^, i.e. 2000 × (5 + 3) ms
    * 16,000 ms (a dramatic jump to 16 seconds)

Reference: [Cylinder-head-sector](https://en.wikipedia.org/wiki/Cylinder-head-sector)

## Main Topics

* Guided reading
    * [現代處理器設計: Cache 原理和實際影響](https://hackmd.io/s/HkW3Dr1Rb)
    * [Cache 原理和實際影響](https://hackmd.io/s/HkyscQn2z)
    * [software-pipelining](https://hackmd.io/s/HkbPkW86f)
    * [CPU caches](https://lwn.net/Articles/252125/) by Ulrich Drepper

- [ ] The Memory Hierarchy: [video](https://www.youtube.com/watch?v=vusQa4pfTFU) / [slides](https://www.cs.cmu.edu/~213/lectures/11-memory-hierarchy.pdf)
    * Corresponds to CS:APP3e Chapter 6: 6.3 The Memory Hierarchy (==Page 421==)

![](https://i.imgur.com/KnmFqHf.png)
![](https://i.imgur.com/MmErdyO.png)
![](https://i.imgur.com/2i77x7n.png)

Do not confuse this with [Transactional memory](https://en.wikipedia.org/wiki/Transactional_memory).

![](https://i.imgur.com/rolKNaG.png)

The diverging evolution of CPU and memory access speeds shows that locality is one of the factors with the greatest impact on program performance.

![](https://i.imgur.com/2iQVPYf.png)
![](https://i.imgur.com/mDV9T6R.png)

```c
int sum_array_rows(int a[M][N])
{
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```

Is the locality here good? See [Row- and column-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order).

:::info
```
       column
     ┌   ┐  ┌   ┐
row  │   │  │   │
     └   ┘  └   ┘
```
How should row and column be translated into Chinese? Which one is vertical, and which one is horizontal?
In Taiwan, 「列」 (row) runs horizontally; in China, 「列」 (column) runs vertically — ==exactly opposite==! How did this divergence evolve? The manuscript paper standardized by the PRC government uses horizontal ruling, so each 「行」 runs horizontally, whereas Taiwanese manuscript paper is ruled vertically, so each 「行」 runs vertically. In English usage, a row is a "horizontal" strip, while a column is a "vertical" bar, like the bars of a prison cell.
* Taiwan translates row as ==列== and column as ==行== :o: mnemonic: Taiwanese people love to queue — people lining up one behind another form a 「行」, and partners standing side by side form a 「列」
* China translates row as ==行== and column as ==列== :o: mnemonic: on Tiananmen Square, a 「行」 of people stretched horizontally against the advancing tanks; with each loss, one stands upright as a martyr (「烈士」, homophone of 「列」)

Also, when counting rows and columns: rows are horizontal and stacked from top to bottom, so you count rows ==vertically==; columns are vertical and arranged from left to right, so you count columns ==horizontally==.
:::

By contrast, the code below traverses the (row-major) array column by column:

```c
int sum_array_cols(int a[M][N])
{
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

Only when M is small — small enough that the rows touched by one column sweep all stay in cache — does this traversal avoid hurting performance.

![](https://i.imgur.com/ob8n0S9.png)
![](https://i.imgur.com/fBl8zkf.png)
![](https://i.imgur.com/DrufsYA.png)
![](https://i.imgur.com/AQ8DkkY.png)

- [ ] Cache Memories: [video](https://youtu.be/Spg7iTYLkS4) / [slides](https://www.cs.cmu.edu/~213/lectures/12-cache-memories.pdf)
    * Corresponds to CS:APP3e Chapter 6: 6.4 Cache Memories (==Page 425==)

![](https://i.imgur.com/KvQg6sO.png)
![](https://i.imgur.com/C1egnka.png)
![](https://i.imgur.com/HKqR10T.png)

The time penalty of a cache miss is very high.

![](https://i.imgur.com/jBHtGCa.png)
![](https://i.imgur.com/MID1W9D.png)

Memory Mountain
* ==Page 444== to ==Page 450==
* Get [mountain.tar](http://csapp.cs.cmu.edu/3e/mountain.tar) from the [CS:APP student site](http://csapp.cs.cmu.edu/3e/students.html)

```shell
$ tar xf mountain.tar
$ cd mountain
$ make clean all
$ ./mountain
```

* Compare against [A Gallery of Memory Mountains](https://csappbook.blogspot.tw/2017/05/a-gallery-of-memory-mountains.html)
* Observe the read throughput
* `mountain.c`

```c
/* test - Iterate over first "elems" elements of array "data" with
 * stride of "stride", using 4x4 loop unrolling.
 */
int test(int elems, int stride)
{
    long i, sx2 = stride * 2, sx3 = stride * 3, sx4 = stride * 4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems;
    long limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i + stride];
        acc2 = acc2 + data[i + sx2];
        acc3 = acc3 + data[i + sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}
```

* The 2009 [Nehalem](https://en.wikipedia.org/wiki/Nehalem_(microarchitecture)) microarchitecture (45 nm process, later shrunk to 32 nm)
![](https://i.imgur.com/HKWcrKD.png)
* The 2013 [Haswell](https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client)) microarchitecture (22 nm process, like the Ivy Bridge microarchitecture)
![](https://i.imgur.com/d1m3lvj.png)

![](https://i.imgur.com/5L0ZZid.png)
![](https://i.imgur.com/OZH8WSq.png)

Layout of C Arrays in Memory
* C arrays are allocated in row-major order
    * each row in contiguous memory locations
* Stepping through columns in one row:
  ```c
  for (i = 0; i < N; i++)
      sum += a[0][i];
  ```
  * accesses successive elements
  * if `block size (B)` > sizeof(a~ij~) bytes, exploit spatial locality
      - miss rate = sizeof(a~ij~) / B
* Stepping through rows in one column:
  ```c
  for (i = 0; i < n; i++)
      sum += a[i][0];
  ```
  * accesses distant elements
  * no spatial locality!
      - miss rate = 1 (i.e.
100%)

![](https://i.imgur.com/UWUYFTY.png)
![](https://i.imgur.com/rhMXeWQ.png)
![](https://i.imgur.com/HEv4eaE.png)
![](https://i.imgur.com/gczHDZz.png)
![](https://i.imgur.com/56wnMPL.png)

:::success
In-class exercises
- [ ] 6.37: compute the cache miss rates for `N=64` and `N=60` (==Page 455==)
- [ ] 6.45: write a faster matrix-transpose implementation (==Page 458==)
- [ ] 6.46: convert a directed graph into an undirected graph (==Page 458==) / References: [Graph](http://www.csie.ntnu.edu.tw/~u91029/Graph.html), [相鄰矩陣 (Adjacency Matrix)](https://hellolynn.hpd.io/2017/09/22/%E5%AF%A6%E4%BD%9Cgraph%E8%88%87dfs%E3%80%81bfs%E8%B5%B0%E8%A8%AA/)
:::

6.37

```c
typedef int array_t[N][N];

int sumA(array_t a)
{
    int i, j;
    int sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

int sumB(array_t a)
{
    int i, j;
    int sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[j][i];
    return sum;
}

int sumC(array_t a)
{
    int i, j;
    int sum = 0;
    for (i = 0; i < N; i += 2)
        for (j = 0; j < N; j += 2)
            sum += (a[j][i] + a[j][i + 1]
                  + a[j + 1][i] + a[j + 1][i + 1]);
    return sum;
}
```

Cache parameters: C = 4096 bytes, B = 16 bytes, E = 1 (direct-mapped), so S = 256 sets.

### N = 64

sizeof(array_t) = 64 × 64 × 4 = 16384 bytes = 4 × C

Memory-to-cache mapping: the array spans byte addresses 0 to 16384, each cell below is one 16-byte block, and the number inside a cell is the set it maps to, i.e. (address / 16) mod 256. The array's 1024 blocks wrap around the 256 sets four times:

```
0         +---------+
          |    0    |
16        +---------+
          |    1    |
32        +---------+
          |    2    |
          |    .    |
          |    .    |
          |    .    |
4096-16   +---------+
          |   255   |
4096      +---------+
          |    0    |
4096+16   +---------+
          |    1    |
          |    .    |
          |    .    |
          |    .    |
16384-16  +---------+
          |   255   |
16384     +---------+
```

A. sumA: `sum += a[i][j];`

The reads sweep addresses 0, 4, 8, ..., 16380 sequentially. Each block holds four ints, so the first read of every block misses and the next three hit.

Miss rate: 25%

B. sumB: `sum += a[j][i];`

For a fixed column i, the inner loop reads a[0][i], a[1][i], ..., a[63][i]. Row j starts at address 256j, so the block holding a[j][i] maps to set (16j + i/4) mod 256. Since 16 × 16 = 256, rows j and j + 16 collide in the same set: the 64 rows share only 16 distinct sets (0, 16, 32, ..., 240), four rows per set. Walking down a column therefore evicts each block before its set comes around again, and when the next column tries to reuse a block, it is already gone. (The tempting answer — "first column pass all miss, next three columns all hit" — is wrong: the conflict misses of one pass evict everything the next pass would reuse.)

Every access misses. Miss rate: 100%

C. sumC

```c
for (i = 0; i < N; i += 2)
    for (j = 0; j < N; j += 2)
        sum += (a[j][i] + a[j][i+1] + a[j+1][i] + a[j+1][i+1]);
```

In each group of four reads, a[j][i+1] lies in the same block as a[j][i], and a[j+1][i+1] in the same block as a[j+1][i], so those two always hit. The leading reads a[j][i] and a[j+1][i] always miss: exactly as in sumB, rows 16 apart collide in the same set, so every block loaded during one pass down the rows is evicted before any later pass can reuse it.

Per group: 2 misses, 2 hits. Miss rate: 50%

### N = 60

A row is now 60 × 4 = 240 bytes = 15 blocks, and sizeof(array_t) = 14400 bytes = 900 blocks.

A. sumA: `sum += a[i][j];`

Still a sequential sweep: one miss per block followed by three hits. Miss rate: 25%

B. sumB: `sum += a[j][i];`

This is the interesting case. Row j starts at address 240j, so its first block maps to set 15j mod 256. Because gcd(15, 256) = 1, the 60 values of 15j mod 256 (j = 0, ..., 59) are all distinct:

read set order: 0, 15, 30, ..., 240, 255, 14, 29, ..., 254, 13, ...

The first pass down a column thus takes 60 compulsory misses into 60 different sets with no conflicts, and the next three columns (the other three ints of each block) hit everywhere. Although N is smaller, the miss rate drops to 25% — precisely because 60 is not a power of 2, the conflict misses of the N = 64 case disappear.

C. sumC

The same reasoning applies: each of the 900 blocks is loaded exactly once and is never evicted before its remaining reads, so the miss rate is again 25%.

Summary:

|      | N = 64 | N = 60 |
|:----:|:------:|:------:|
| sumA |  25%   |  25%   |
| sumB | 100%   |  25%   |
| sumC |  50%   |  25%   |