# Computer Organization Ch.5 Large and Fast: Exploiting Memory Hierarchy (B)
###### tags: `計算機組織`, `109-2`
[TOC]
## Measuring Cache Performance (5-4)
\begin{align}
\text{Memory stall cycles} &= \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty} \\
&= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}
\end{align}
:::info
**Example: Cache Performance**
Given
- I-cache Miss Rate = `2%`
- D-cache Miss Rate = `4%`
- Miss Penalty = `100` Cycles
- Base CPI (ideal cache) = 2
- Load/store = `36%` of instructions
Then,
- Miss cycles per instruction
    - I-cache: $1 \times 0.02 \times 100 = 2$ (instruction fetch)
    - D-cache: $0.36 \times 0.04 \times 100 = 1.44$ (load/store operations)
- Actual CPI = $2+2+1.44 = 5.44$
- An ideal CPU (no miss penalty) would be $5.44/2 = 2.72$ times faster
:::
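The arithmetic in the example above can be reproduced with a short Python sketch (variable names are illustrative):

```python
# Stall cycles per instruction = (accesses/instruction) x miss rate x miss penalty.
base_cpi = 2
miss_penalty = 100         # cycles
icache_miss_rate = 0.02    # every instruction is fetched once
dcache_miss_rate = 0.04
ls_fraction = 0.36         # fraction of instructions that are loads/stores

icache_stall = 1 * icache_miss_rate * miss_penalty            # 2 cycles/instruction
dcache_stall = ls_fraction * dcache_miss_rate * miss_penalty  # 1.44 cycles/instruction
actual_cpi = base_cpi + icache_stall + dcache_stall           # 5.44
slowdown = actual_cpi / base_cpi                              # 2.72
```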
### Average Memory Access Time (AMAT)
- A common metric for cache/memory performance
\begin{align}
AMAT = HitTime + MissRate \times MissPenalty
\end{align}
:::info
**Example: Average Memory Access Time**
Given
- Clock Time = `1` ns
with cache
- Hit Time = `1` cycle
- Miss Penalty = `20` cycle
- Miss Rate = `5%`
Then,
- AMAT = $1+5\%\times20 = 2$ cycles = `2` ns
:::
## Associative Caches
- Purpose: Reduce Miss Rate
#### Fully associative
- Allow a given block to go in any cache entry
- Requires all entries to be searched at once
- Comparator per entry (expensive)
#### n-way set associative
- Each set contains n entries
- Block number determines which set
    - Set index = `Block Number % Sets in Cache`
- Search all entries in a given set at once
- n comparators (less expensive)


:::success
An analogy with parking spaces and parking permits:
**Direct Mapping**
- The last few digits of the permit number determine the single space you may park in
**N-way Associative**
- The spaces are divided into groups
- The last few digits of the permit number determine which group you may park in
**Fully associative**
- No permits at all: park in any space you like!
:::
:::info
**Example: Associative Cache**
Given
- 4-block cache
Compare the following mapping methods:
- Direct mapped
- 2-way set associative
- Fully associative
Block Address: `0, 8, 0, 6, 8`
--
**Direct Mapped (2 index bits -> last 2 bits)**
| Address | Index | Hit? | Data ([0, 1, 2, 3]) |
|:-------------:|:-----:|:----:|:-------------------:|
| `0` (`00000`) | `00` | miss | `[0,-,-,-]` |
| `8` (`01000`) | `00` | miss | `[8,-,-,-]` |
| `0` (`00000`) | `00` | miss | `[0,-,-,-]` |
| `6` (`00110`) | `10` | miss | `[0,-,6,-]` |
| `8` (`01000`) | `00` | miss | `[8,-,6,-]` |
--
**2-way set associative (LRU replacement)**
| Address | Index | Hit? | Data ([0, 1, 2, 3]) |
|:-------------:|:-----:|:----:|:-------------------:|
| `0` (`00000`) | `0` | miss | `[0,-,-,-]` |
| `8` (`01000`) | `0` | miss | `[0,8,-,-]` |
| `0` (`00000`) | `0` | hit | `[0,8,-,-]` |
| `6` (`00110`) | `0` | miss | `[0,6,-,-]` |
| `8` (`01000`) | `0` | miss | `[8,6,-,-]` |
--
**Fully associative**
| Address | Index | Hit? | Data ([0, 1, 2, 3]) |
|:-------------:|:-----:|:----:|:-------------------:|
| `0` (`00000`) | - | miss | `[0,-,-,-]` |
| `8` (`01000`) | - | miss | `[0,8,-,-]` |
| `0` (`00000`) | - | hit | `[0,8,-,-]` |
| `6` (`00110`) | - | miss | `[0,8,6,-]` |
| `8` (`01000`) | - | hit | `[0,8,6,-]` |
:::
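The three schemes in the example can be reproduced with a small simulator sketch (LRU replacement; direct mapped is modeled as 1-way, fully associative as a single set; names are illustrative):

```python
def simulate(addresses, num_blocks, ways):
    """Simulate a cache of `num_blocks` blocks organized as `ways`-way
    sets with LRU replacement; return the hit/miss trace."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each list: LRU first, MRU last
    trace = []
    for addr in addresses:
        s = sets[addr % num_sets]          # set index = block number % sets
        if addr in s:
            s.remove(addr)                 # hit: move block to MRU position
            s.append(addr)
            trace.append("hit")
        else:
            if len(s) == ways:             # set full: evict the LRU block
                s.pop(0)
            s.append(addr)
            trace.append("miss")
    return trace

addrs = [0, 8, 0, 6, 8]
print(simulate(addrs, 4, ways=1))  # direct mapped: all 5 miss
print(simulate(addrs, 4, ways=2))  # 2-way set associative: 1 hit
print(simulate(addrs, 4, ways=4))  # fully associative: 2 hits
```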
:::info
**Example: Associative Cache Size**
Given a cache of
- 4096 blocks
- 4-word block size
- 32-bit address
Find the total number of tag bits for each of:
- Direct mapped
- 2-way set-associative cache
--
**2-way set-associative**
Sets: $4096/2 = 2048$, so index bits: $\lg 2048 = 11$
Byte offset: 4 words = 16 bytes => $\lg16 = 4$ bits
Tag bits: $32-11-4 = 17$
Total number of tag bits: $17\times2\times2048 = 69632$ bits (68 Kbit)
**Direct mapped**
Index bits: $\lg 4096 = 12$; tag bits: $32-12-4 = 16$
Total number of tag bits: $16\times4096 = 65536$ bits (64 Kbit)
:::
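The same counting can be expressed as a small sketch (4-byte words assumed, as in the example; the function name is illustrative):

```python
from math import log2

def total_tag_bits(num_blocks, words_per_block, ways, addr_bits=32):
    offset_bits = int(log2(words_per_block * 4))  # 4 bytes per word
    index_bits = int(log2(num_blocks // ways))    # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits * num_blocks                  # one tag per block

print(total_tag_bits(4096, 4, ways=1))  # direct mapped: 16 * 4096 = 65536
print(total_tag_bits(4096, 4, ways=2))  # 2-way: 17 * 4096 = 69632
```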
### Associativity vs. Miss Rate
- Higher associativity reduces the miss rate
- But the improvement diminishes as associativity grows
:::success
**Review: Basic Cache Optimizations**
- Larger block size
    - Reduces compulsory misses
    - Increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
    - Increases hit time, increases power consumption
- Higher associativity
    - Reduces conflict misses
    - Increases hit time, increases power consumption
- Higher number of cache levels
    - Reduces miss penalty and overall memory access time
- Giving priority to read misses over writes
    - Reduces miss penalty
- Avoiding address translation in cache indexing
    - Reduces hit time
:::
### Replacement Policy
- Direct mapped: no choice (each block maps to exactly one entry)
- Set-associative
    - Prefer an invalid (empty) entry
    - Otherwise, choose among the entries in the set
        - Least-Recently Used (LRU)
        - Random

## Multilevel Caches
Purpose: Reduce Miss Penalty (L1 Miss Penalty)
- Level-1 cache (primary cache, L1)
    - Attached to the CPU
    - Small, but fast
- Level-2 cache (L2)
    - Larger and slower
    - Still faster than main memory
- Some high-end systems include an L3 cache
### Inclusion
- Inclusive cache: L2 contains L1; L1 is a subset of L2
- Exclusive cache: the two levels are independent (no duplication)
- Without inclusion, coherence checks must also look up L1 on an L2 miss
### Access Time
- Average access time = L1 Hit Time + L1 Miss Rate x L1 Miss Penalty
- L1 Miss Penalty = L2 Hit Time + L2 Miss Rate x L2 Miss Penalty
:::info
**Example: Multilevel Cache**
Given
- CPU base CPI = `1`
- CPU clock rate = `4` GHz => each cycle = `0.25` ns
- Miss rate per instruction = `2%`
- Main memory access time = `100`ns
With an L1 cache only
- Miss Penalty = $100/0.25$ = 400 Cycles
- Effective CPI = $1+2\%\times400$ = 9
Then we add L2 cache
- Access time = `5`ns
- Global Miss Rate (to Main memory) = `0.5%`
Primary miss with L2 hit (L1 miss, L2 hit)
- Penalty = $5/0.25$ = 20 Cycles
Primary miss with L2 miss (L1, L2 all miss)
- Extra Penalty = 400 Cycles
We get
- CPI = $1+2\%\times 20 + 0.5\%\times400$ = 3.4
Thus, Performance Ratio = $9/3.4$ = 2.6
:::
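A sketch of the multilevel-cache arithmetic above (variable names are illustrative):

```python
clock_ns = 1 / 4              # 4 GHz -> 0.25 ns per cycle
l1_miss_rate = 0.02           # misses per instruction
global_miss_rate = 0.005      # L1 and L2 both miss
mem_penalty = 100 / clock_ns  # 400 cycles to main memory
l2_penalty = 5 / clock_ns     # 20 cycles to L2

cpi_l1_only = 1 + l1_miss_rate * mem_penalty  # 9.0
cpi_with_l2 = 1 + l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty  # 3.4
speedup = cpi_l1_only / cpi_with_l2           # ~2.6
```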
## Software Optimization
### Interactions with Software
- How software accesses memory determines the number of misses
    - Algorithm behavior
    - Compiler optimization
### Loop Blocking
- Goal: access each piece of data as many times as possible before it is evicted from the cache
#### Example: DGEMM (Double-Precision Matrix Multiply)
```python=
for i in range(0, n):
    for j in range(0, n):
        cij = C[i + j * n]  # C, A, B are flat arrays in column-major order
        for k in range(0, n):
            cij += A[i + k * n] * B[k + j * n]
        C[i + j * n] = cij
```
- Access pattern: $C = A \times B$ (column-major layout: B is read with stride 1, A with stride $n$)

#### DGEMM with Cache Blocking

- Access Pattern

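The notes do not reproduce the blocked code, so here is a sketch; `BLOCKSIZE` and the helper `do_block` are assumed names, using the same column-major flat-array layout as the loop above. Each tile of C is updated from matching tiles of A and B so the working set stays cache-resident:

```python
BLOCKSIZE = 32  # assumed tile size: three tiles must fit in the cache together

def do_block(n, si, sj, sk, A, B, C):
    # Update one BLOCKSIZE x BLOCKSIZE tile of C using matching tiles of A and B.
    for i in range(si, min(si + BLOCKSIZE, n)):
        for j in range(sj, min(sj + BLOCKSIZE, n)):
            cij = C[i + j * n]
            for k in range(sk, min(sk + BLOCKSIZE, n)):
                cij += A[i + k * n] * B[k + j * n]
            C[i + j * n] = cij

def dgemm_blocked(n, A, B, C):
    # Iterate over tiles; each tile is reused many times while it is
    # still resident in the cache, reducing capacity misses.
    for sj in range(0, n, BLOCKSIZE):
        for si in range(0, n, BLOCKSIZE):
            for sk in range(0, n, BLOCKSIZE):
                do_block(n, si, sj, sk, A, B, C)
```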
## Dependable Memory Hierarchy (5-5)
### Dependability Measures (6-?)
- **Reliability**: Mean Time To Failure (MTTF)
- Fault Avoidance
- Fault Tolerance
- Fault Forecasting
- **Service Interruption**: Mean Time To Repair (MTTR)
- Tools
- Diagnosis
- Repair
- Mean Time Between Failures: MTBF = MTTF + MTTR
- **Availability** = MTTF / MTBF = MTTF / (MTTF+MTTR)
:::info
**Example: Dependability**
Some disks today are quoted to have
- `1,000,000` hour MTTF (about 114 years)
Suppose we have a warehouse scale system which has
- `50,000` servers
- each server has `2` disks
Thus, a disk's Annual Failure Rate (AFR) is
- $1/1000000\times24\times365 = 0.876\%$
So, with $50{,}000 \times 2 = 100{,}000$ disks, we would expect
- $876$ disks to fail per year
- or about $2.4$ disk failures per day
:::
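The failure-rate arithmetic, as a quick sketch:

```python
mttf_hours = 1_000_000
hours_per_year = 24 * 365
afr = hours_per_year / mttf_hours           # annual failure rate ~ 0.876%
disks = 50_000 * 2                          # 2 disks per server

failures_per_year = disks * afr             # ~876
failures_per_day = failures_per_year / 365  # ~2.4
```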
### The Hamming SEC Code (ECC)
#### Terms
- Fault detection/correction for **RAM**
- **Parity bit:** detects whether a single bit is in error
    - Even/odd parity: whether the sum of the bits is even or odd (i.e., the XOR $\oplus$ of the bits)
- Hamming distance: the number of bit positions in which two bit patterns differ
    - Minimum distance = 2: single-bit error detection
    - Minimum distance = 3: single error correction (SEC) / 2-bit error detection
#### Hamming SEC Code (ECC)
- Parity bit 1 (P1): covers positions whose (binary) index has the rightmost bit set (1, 3, 5, 7, 9, 11, ...)
- Parity bit 2 (P2): covers positions whose index has the second-rightmost bit set (2, 3, 6, 7, 10, 11, ...)
- Parity bit 4 (P4): covers positions whose index has the third-rightmost bit set (4-7, 12-15, ...)
- Parity bit 8 (P8): covers positions whose index has the fourth-rightmost bit set (8-15, 24-31, ...)

:::info
**Example: ECC**
Byte data `1001 1010` in binary
The Hamming ECC code:
*Method 1*
- P1: `10X1 1X1X`, sum is 4 (even) => 0
- P2: `1X01 X01X`, sum is 3 (odd) => 1
- P4: `X001 XXX0`, sum is 1 (odd) => 1
- P8: `XXXX 1010`, sum is 2 (even) => 0
Fill these into the encoded string `--1- 001- 1010` to get the result `0111 0010 1010`
*Method 2*
- First write the data bits into the encoded string
    - `--1- 001- 1010`
- Convert every position holding a 1 (counted 1-based from the left) into binary:
    - `3, 7, 9, 11` => `0011, 0111, 1001, 1011`
- XOR these together (this takes the parity of each bit position)
    - $0011 \oplus 0111 \oplus 1001 \oplus 1011 = 0110$
- Write the result **from right to left** into the parity positions of the encoded string to get
    - `0111 0010 1010`
--
If some bit flips, e.g. the data becomes `0111 0010 1110`:
Check the ECC code: the data byte is now `1001 1110`, which should match P1/P2/P4/P8 = `0110`
(Method 1)
- P1: `10X1 1X1X`, sum is 4 (even) => 0, correct
- P2: `1X01 X11X`, sum is 4 (even) => 0, **wrong!**
- P4: `X001 XXX0`, sum is 1 (odd) => 1, correct
- P8: `XXXX 1110`, sum is 3 (odd) => 1, **wrong!**
Failing parity positions: $2+8 = 10$, so bit 10 must be the one in error!
:::
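A sketch of Hamming SEC encoding and syndrome checking (function names are illustrative); it reproduces the codeword and the error position from the example above:

```python
def hamming_encode(data_bits):
    """Encode data bits (MSB-first list of 0/1) into a Hamming SEC codeword.
    Parity bits sit at the power-of-two positions (1, 2, 4, 8, ...)."""
    n, code = len(data_bits), {}
    pos, i = 1, 0
    while i < n:
        if pos & (pos - 1):          # not a power of two => data position
            code[pos] = data_bits[i]
            i += 1
        pos += 1
    total = pos - 1
    p = 1
    while p <= total:
        # Parity bit p covers every position whose binary index has bit p set.
        code[p] = sum(v for q, v in code.items() if (q & p) and q != p) % 2
        p <<= 1
    return [code[q] for q in range(1, total + 1)]

def hamming_syndrome(codeword):
    """XOR of the (1-based) positions holding a 1: 0 means no single-bit
    error, otherwise it is the position of the flipped bit."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s
```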
#### Bits Required
Let
- p = the number of parity (check) bits
- d = the number of data bits
Then
- $2^p \ge p+d+1$
- or equivalently $p \ge \lg(p+d+1)$
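The inequality can be solved numerically; a tiny sketch (the function name is illustrative):

```python
def parity_bits_needed(d):
    """Smallest p with 2^p >= p + d + 1."""
    p = 1
    while 2 ** p < p + d + 1:
        p += 1
    return p

print(parity_bits_needed(8))  # 4 parity bits for one data byte (12-bit codeword)
```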
## Virtual Machine (5-6)
### Virtual Machine
- Host computer emulates guest operating system and machine resources
- Improved isolation of multiple guests
- Avoids security and reliability problems
- Aids sharing of resources
- Virtualization has some performance impact
- Feasible with modern high-performance computers
- E.g.
- Xen – Public VMM
- IBM VM/370 *(1970s technology!)*
- VMWare
- Microsoft Virtual PC
### Virtual Machine Monitor
- Maps virtual resources to physical resources
    - Memory, I/O, CPU, ...
- Guest code runs in user mode on the host machine
    - Privileged instructions trap, so the VM cannot execute them directly
- The guest OS can differ from the host OS