# 計算機組織 Ch.5 Large and Fast: Exploiting Memory Hierarchy (A)
###### tags: `計算機組織`, `109-2`
[TOC]
## Introduction (5-1)
### Principle of Locality
#### Temporal Locality
- Items **accessed** recently are likely to be accessed again soon
e.g.
```cpp=
int arr[15] = {1, 2, 3};  // elements laid out contiguously in memory
int sum = 0;
for (int i = 0; i < 15; i++)
    sum += arr[i];
```
here `i` and `sum` are accessed on every iteration of the loop, so they exhibit temporal locality
#### Spatial locality
- Items **near those accessed** recently are likely to be accessed soon
- in the above example, array `arr` has spatial locality (its elements are accessed in order)
## Memory Hierarchy
### Taking Advantage of Locality
- Copy recently accessed (and nearby) items from **disk** to *smaller* **DRAM** memory
- Copy **more** recently accessed (and nearby) items from **DRAM** to *smaller* **SRAM** memory
- Some terms
- **Block** (line): unit of copying (may be multiple words)
- **Hit**: accessed data is present in upper level
- **Miss**: accessed data is absent
- **Hit ratio**: $\frac{hits}{accesses}$
### Memory Technology (5-2)
- Static RAM (SRAM)
- 0.5ns – 2.5ns, \$2000 – \$5000 per GB
- Dynamic RAM (DRAM)
- 50ns – 70ns, \$20 – \$75 per GB
- Magnetic disk
- 5ms – 20ms, \$0.20 – \$2 per GB
#### Ideal memory
- Access time of SRAM
- Capacity and cost/GB of disk

#### DRAM
- Used as main memory
- Data stored as a charge in a capacitor
- Bits in a DRAM are organized as a **rectangular array**
- SDR/ DDR/ QDR

##### DRAM Performance Factors
- Row Buffer
- Allows several words to be read and refreshed in parallel
- Fast access for data in Row Buffer
- SDRAM (Synchronous DRAM)
- Allows for consecutive accesses in bursts without needing to send each address
- Uses a synchronous clock to control the memory's I/O signals (matched to the CPU clock)
- Improves bandwidth
- DRAM banking
- Allows simultaneous access to multiple DRAMs
- Data is interleaved (multiplexed) across banks
- Improves bandwidth

#### Flash Storage
- Faster than disk, but slower than DRAM
- Smaller and lower power consumption
- Often used in mobile devices
##### Flash Types
- NOR flash: bit cell like a NOR gate
- Random read/write access
- Used for instruction memory in embedded systems
- NAND flash: bit cell like a NAND gate
- Denser (bits/area), but block-at-a-time access
- Cheaper per GB
- Used for USB sticks, media storage, etc
- Flash bits wear out after 1000's of accesses
- Not suitable for direct RAM or disk replacement
- Wear leveling: remap data to less used blocks
#### Disk Storage
##### Disk Sector
- Each Sector records:
- Sector ID
- Data (512/ 4096 bytes)
- Error Correcting code (ECC)
- Time of accessing a sector involves:
- Queue delay (if other accesses are pending)
- Seek
- Rotational latency
- Data Transfer
- Controller Overhead
:::info
**Example: Disk Access**
Given
- `512`B sector
- `15000`rpm
- `4` ms average seek time (i.e. to position the head over the proper track)
- `100`MB/s transfer rate
- `0.2`ms controller overhead, idle disk
Average Read Time will be:
- Seek time = `4` ms
- Rotational Latency = `0.5 rotation / (15000/60 rotations per second)` = `2` ms
- Transfer Time = `512 (B) / 100 (MB/s)` ≈ `0.005` ms
- Controller Overhead = `0.2 ms`
- TOTAL = $4 + 2 + 0.005 + 0.2 = 6.205$ ms
:::
##### Disk Performance Issue
- From the calculation above, most of the time is spent on the seek
- Practical improvements: locality, OS scheduling
- A smart controller allocates sectors on the disk cleverly
    - e.g. SCSI, ATA, SATA
- Caching
## Cache Memory (5-3)
- The level of the memory hierarchy closest to the CPU
- Much smaller than main memory, but much faster
- Has no addressing of its own (not byte-addressable by programs)
### Direct-Mapped Cache
- Main-memory block addresses are mapped directly into the cache with a modulo
- `Cache Block addr = Block addr % Cache Block Count`


- How do we know whether a given main-memory block is currently in the cache?
- Each cache entry keeps a `Tag` storing the upper bits of the main-memory address
:::info
**Example: Direct-Mapped Cache**
Suppose we have an 8-block direct-mapped cache, each block holding 1 word
The table is (data column omitted):
| (Index) | V | Tag |
|:-------:|:-------:|:---:|
| **000** | `false` | |
| **001** | `false` | |
| **010** | `false` | |
| **011** | `false` | |
| **100** | `false` | |
| **101** | `false` | |
| **110** | `false` | |
| **111** | `false` | |
*Accessing word address `22`*
1. Convert to binary: `10 110`
2. Go to index `110`
3. Compare the tag: `10 != -`, no match
4. Fetch the data, write it into the cache, and record a Miss
| 110 | `true` | 10 |
| --- | ------ | --- |
*Accessing word address `26`*
1. Convert to binary: `11 010`
2. Go to index `010`
3. Compare the tag: `11 != -`, no match
4. Fetch the data, write it into the cache, and record a Miss
| 010 | `true` | 11 |
| --- | ------ | --- |
*Accessing word address `22`*
1. Convert to binary: `10 110`
2. Go to index `110`
3. Compare the tag: `10 == 10`, match
4. Return the data directly and record a Hit
*Accessing word address `18`*
1. Convert to binary: `10 010`
2. Go to index `010`
3. Compare the tag: `10 != 11`, no match
4. Fetch the data, overwrite the cache entry, and record a Miss
| 010 | `true` | 10 |
| --- | ------ | --- |
:::
#### Block Size Considerations
- Increasing the block size
    - reduces the miss rate (captures more spatial locality)
    - but increases the miss penalty (more data transferred per miss)
- If the total cache size stays the same, however
    - fewer blocks => repeated overwriting (competition) => miss rate rises
    - unneeded data gets brought in (pollution)

#### Cache Miss
- Stalls the CPU pipeline
- The block must be fetched from the next level of the memory hierarchy
- For an instruction-cache miss, the instruction fetch must be restarted
### Write Policy
- On a write hit, the data in the cache is updated; the modified data must also be propagated to main memory
- On a write miss, the block is usually fetched into the cache first (write allocate)
    - write-around: skip the fetch and write directly to memory
#### Write-Through
- Memory is updated at the same time as the cache
- This makes writes slow
    - e.g., if base CPI = 1, 10% of instructions are stores, and writing to memory takes 100 cycles
    - Effective CPI = 1 + 0.1×100 = 11
- Solution: Write Buffer
    - Data is drained to memory block by block in the background, instead of the CPU writing continuously
    - The CPU keeps running; it only stalls when the buffer is full
#### Write-Back
- Memory is written only when the block is replaced (evicted)
- Needs an extra field recording whether the block has been modified (dirty bit)
### Cache Size
- The total number of bits in a direct-mapped cache
- $2^n \times (tag\ size+valid\ field\ size+block\ size)$


:::info
**Example: Cache size**
Given
- Direct-mapped Cache
- `16` KB data
- Block size: `4` word
- Address: `32-bit`
What is the total size of this cache?
- Block count: 16 KB = `16384` B = `4096` words = `1024` blocks
- Index = $\log_2 1024$ = `10` bits
- Tag = `32 - 10 - 2 - 2` = `18` bits (2-bit block offset + 2-bit byte offset)
- TOTAL = $Block\ Entries \times (tag+valid\ field+data) = 2^{10}\times(18+1+128) = 147$ Kbits
:::
## Memory Performance
- Bandwidth
- Bus width, Bus cycle time
- Number of channels
- Latency
- DRAM access time
- Interconnect latency
### Increasing memory performance
- Increase Memory Bandwidth

- Main Memory Supporting Cache
