# 計算機組織 Ch.5 Large and Fast: Exploiting Memory Hierarchy (A)
###### tags: `計算機組織`, `109-2`
[TOC]
## Introduction (5-1)
### Principle of Locality
#### Temporal Locality
- Items **accessed** recently are likely to be accessed again soon
e.g.
```cpp=
int arr[15] = {1, 2, 3};  // elements laid out contiguously in memory
int sum = 0;
for (int i = 0; i < 15; i++)
    sum += arr[i];
```
here `i` and `sum` are accessed on every iteration of the loop, so they exhibit temporal locality
#### Spatial locality
- Items **near those accessed** recently are likely to be accessed soon
- in the above example, array `arr` has spatial locality (its elements are accessed in order)
## Memory Hierarchy
### Taking Advantage of Locality
- Copy recently accessed (and nearby) items from **disk** to *smaller* **DRAM** memory
- Copy **more** recently accessed (and nearby) items from **DRAM** to *smaller* **SRAM** memory
- Some terms
- **Block** (line): unit of copying (may be multiple words)
- **Hit**: accessed data is present in upper level
- **Miss**: accessed data is absent
- **Hit ratio**: $\frac{hits}{accesses}$
### Memory Technology (5-2)
- Static RAM (SRAM)
- 0.5ns – 2.5ns, \$2000 – \$5000 per GB
- Dynamic RAM (DRAM)
- 50ns – 70ns, \$20 – \$75 per GB
- Magnetic disk
- 5ms – 20ms, \$0.20 – \$2 per GB
#### Ideal memory
- Access time of SRAM
- Capacity and cost/GB of disk

#### DRAM
- Used as main memory
- Data stored as a charge in a capacitor
- Bits in a DRAM are organized as a **rectangular array**
- SDR/ DDR/ QDR

##### DRAM Performance Factors
- Row Buffer
- Allows several words to be read and refreshed in parallel
- Fast access for data in Row Buffer
- SDRAM (Synchronous DRAM)
- Allows for consecutive accesses in bursts without needing to send each address
- Uses a synchronous clock to control the memory's I/O signals (matched to the CPU clock)
- Improves bandwidth
- DRAM banking
- Allows simultaneous access to multiple DRAMs
- Data is interleaved (multiplexed) across banks
- Improves bandwidth

#### Flash Storage
- Faster than disk, but slower than DRAM
- Smaller and lower power consumption
- Often used in mobile devices
##### Flash Types
- NOR flash: bit cell like a NOR gate
- Random read/write access
- Used for instruction memory in embedded systems
- NAND flash: bit cell like a NAND gate
- Denser (bits/area), but block-at-a-time access
- Cheaper per GB
- Used for USB sticks, media storage, etc
- Flash bits wear out after 1000's of accesses
- Not suitable for direct RAM or disk replacement
- Wear leveling: remap data to less used blocks
#### Disk Storage
##### Disk Sector
- Each Sector records:
- Sector ID
- Data (512/ 4096 bytes)
- Error Correcting code (ECC)
- Time of accessing a sector involves:
- Queue delay (if other accesses are pending)
- Seek
- Rotational latency
- Data Transfer
- Controller Overhead
:::info
**Example: Disk Access**
Given
- `512`B sector
- `15000`rpm
- `4` ms average seek time (i.e. to position the head over the proper track)
- `100`MB/s transfer rate
- `0.2`ms controller overhead, idle disk
Average Read Time will be:
- Seek time = `4` ms
- Rotational Latency = `0.5 rotation / (15000/60 rotations per second)` = `2` ms
- Transfer Time = `512 (B) / 100 (MB/s)` ≈ `0.005` ms
- Controller Overhead = `0.2 ms`
- TOTAL = $4 + 2 + 0.005 + 0.2 = 6.205$ ms
:::
##### Disk Performance Issue
- From the calculation above, most of the time is spent on the seek
- Practical improvements: locality, OS scheduling
- A smart controller allocates sectors on the disk cleverly
    - e.g. SCSI, ATA, SATA
- Caching
## Cache Memory (5-3)
- The level of the memory hierarchy closest to the CPU
- Much smaller than main memory, but much faster
- Has no addressing of its own (not byte-addressable by programs)
### Direct-Mapped Cache
- Main-memory block addresses are mapped directly into the cache with a modulo
- `Cache Block addr = Block addr % Cache Block Count`


- How do we know whether a given main-memory block is currently in the cache?
- Each cache entry keeps a `Tag` storing the upper bits of the main-memory address
:::info
**Example: Direct-Mapped Cache**
Suppose we have an 8-block direct-mapped cache, each block holding 1 word
The table is (data column omitted):
| (Index) | V | Tag |
|:-------:|:-------:|:---:|
| **000** | `false` | |
| **001** | `false` | |
| **010** | `false` | |
| **011** | `false` | |
| **100** | `false` | |
| **101** | `false` | |
| **110** | `false` | |
| **111** | `false` | |
*Accessing word address `22`*
1. Convert to binary: `10 110`
2. Go to index `110`
3. Compare the tag: `10 != -`, no match
4. Fetch the data, write it into the cache, and record a Miss
| 110 | `true` | 10 |
| --- | ------ | --- |
*Accessing word address `26`*
1. Convert to binary: `11 010`
2. Go to index `010`
3. Compare the tag: `11 != -`, no match
4. Fetch the data, write it into the cache, and record a Miss
| 010 | `true` | 11 |
| --- | ------ | --- |
*Accessing word address `22`*
1. Convert to binary: `10 110`
2. Go to index `110`
3. Compare the tag: `10 == 10`, match
4. Return the data directly and record a Hit
*Accessing word address `18`*
1. Convert to binary: `10 010`
2. Go to index `010`
3. Compare the tag: `10 != 11`, no match
4. Fetch the data, overwrite the cache entry, and record a Miss
| 010 | `true` | 10 |
| --- | ------ | --- |
:::
#### Block Size Considerations
- Increasing the block size
    - reduces the miss rate (captures more spatial locality)
    - but increases the miss penalty (more data transferred per miss)
- If the total cache size stays the same, however
    - fewer blocks => repeated overwriting (competition) => miss rate rises
    - unneeded data gets brought in (pollution)

#### Cache Miss
- Stalls the CPU pipeline
- The block must be fetched from the next level of the memory hierarchy
- For an instruction-cache miss, the instruction fetch must be restarted
### Write Policy
- On a write hit, the data in the cache is updated; the modified data must also be propagated to main memory
- On a write miss, the block is usually fetched into the cache first (write allocate)
    - write-around: skip the fetch and write directly to memory
#### Write-Through
- Memory is updated at the same time as the cache
- This makes writes slow
    - e.g., if base CPI = 1, 10% of instructions are stores, and writing to memory takes 100 cycles
    - Effective CPI = 1 + 0.1×100 = 11
- Solution: Write Buffer
    - Data is drained to memory block by block in the background, instead of the CPU writing continuously
    - The CPU keeps running; it only stalls when the buffer is full
#### Write-Back
- Memory is written only when the block is replaced (evicted)
- Needs an extra field recording whether the block has been modified (dirty bit)
### Cache Size
- The total number of bits in a direct-mapped cache
- $2^n \times (tag\ size+valid\ field\ size+block\ size)$


:::info
**Example: Cache size**
Given
- Direct-mapped Cache
- `16` KB data
- Block size: `4` word
- Address: `32-bit`
What is the total size of this cache?
- Block count: 16 KB = `16384` B = `4096` words = `1024` blocks
- Index = $\log_2 1024$ = `10` bits
- Tag = `32 - 10 - 2 - 2` = `18` bits (2-bit block offset + 2-bit byte offset)
- TOTAL = $Block\ Entries \times (tag+valid\ field+data) = 2^{10}\times(18+1+128) = 147$ Kbits
:::
## Memory Performance
- Bandwidth
- Bus width, Bus cycle time
- Number of channels
- Latency
- DRAM access time
- Interconnect latency
### Increasing memory performance
- Increase Memory Bandwidth

- Main Memory Supporting Cache
