# Computer Organization Ch.5 Large and Fast: Exploiting Memory Hierarchy (B)

###### tags: `計算機組織`, `109-2`

[TOC]

## Measuring Cache Performance (5-4)

\begin{align}
\text{Memory Stall Cycles} &= \frac{\text{Memory Accesses}}{\text{Program}} \times \text{Miss Rate} \times \text{Miss Penalty} \\
&= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss Penalty}
\end{align}

:::info
**Example: Cache Performance**

Given
- I-cache Miss Rate = `2%`
- D-cache Miss Rate = `4%`
- Miss Penalty = `100` cycles
- Base CPI (ideal cache) = 2
- Load/Store = `36%` of Instructions

Then,
- Miss cycles per instruction
    - I-cache: $1 \times 0.02 \times 100 = 2$ (Instruction Fetch)
    - D-cache: $0.36 \times 0.04 \times 100 = 1.44$ (Load/Store operations)
- Actual CPI = $2 + 2 + 1.44 = 5.44$
- An ideal-cache CPU (no miss penalty) would be $5.44/2 = 2.72$ times faster
:::

### Average Memory Access Time (AMAT)
- A metric for cache/memory performance

\begin{align}
AMAT = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}
\end{align}

:::info
**Example: Average Memory Access Time**

Given a CPU with
- Clock cycle time = `1` ns
- Cache Hit Time = `1` cycle
- Miss Penalty = `20` cycles
- Miss Rate = `5%`

Then,
- AMAT = $1 + 5\% \times 20 = 2$ cycles = 2 ns
:::

## Associative Caches
- Purpose: Reduce Miss Rate

#### Fully associative
- Allows a given block to go in any cache entry
- Requires all entries to be searched at once
- Comparator per entry (expensive)

#### n-way set associative
- Each set contains n entries
- The block number determines the set
    - Set index = `Block Number % (Number of Sets in Cache)`
- Search all entries in the given set at once
- n comparators (less expensive)

![](https://i.imgur.com/OSS0nRJ.png)

![](https://i.imgur.com/2eE0aLC.png)

:::success
Parking spaces and parking permits as an analogy:

**Direct Mapped**
- The last few digits of your permit decide the one spot you may park in

**N-way Associative**
- The spots are divided into groups
- The last few digits of your permit decide which group of spots you may park in

**Fully Associative**
- No permits at all, JOJO! Park in any spot you like
:::

:::info
**Example: Associative Cache**

Given
- a 4-block cache

Compare the following mapping methods:
- Direct mapped
- 2-way set associative
- Fully associative

Block address sequence: `0, 8, 0, 6, 8`

--

**Direct Mapped (2 index bits = last 2 bits of the block address)**

| Address | Index | Hit? | Data ([0, 1, 2, 3]) |
|:-------------:|:-----:|:----:|:-------------------:|
| `0` (`00000`) | `00` | miss | `[0,-,-,-]` |
| `8` (`01000`) | `00` | miss | `[8,-,-,-]` |
| `0` (`00000`) | `00` | miss | `[0,-,-,-]` |
| `6` (`00110`) | `10` | miss | `[0,-,6,-]` |
| `8` (`01000`) | `00` | miss | `[8,-,6,-]` |

--

**2-way set associative (LRU replacement)**

| Address | Index | Hit? | Data ([0, 1, 2, 3]) |
|:-------------:|:-----:|:----:|:-------------------:|
| `0` (`00000`) | `0` | miss | `[0,-,-,-]` |
| `8` (`01000`) | `0` | miss | `[0,8,-,-]` |
| `0` (`00000`) | `0` | hit | `[0,8,-,-]` |
| `6` (`00110`) | `0` | miss | `[0,6,-,-]` |
| `8` (`01000`) | `0` | miss | `[8,6,-,-]` |

--

**Fully associative**

| Address | Index | Hit? | Data ([0, 1, 2, 3]) |
|:-------------:|:-----:|:----:|:-------------------:|
| `0` (`00000`) | - | miss | `[0,-,-,-]` |
| `8` (`01000`) | - | miss | `[0,8,-,-]` |
| `0` (`00000`) | - | hit | `[0,8,-,-]` |
| `6` (`00110`) | - | miss | `[0,8,6,-]` |
| `8` (`01000`) | - | hit | `[0,8,6,-]` |
:::

:::info
**Example: Associative Cache Size**

Given a cache of
- 4096 blocks
- 4-word block size
- 32-bit address

Find the total number of tag bits for
- a direct-mapped cache
- a 2-way set-associative cache

--

**Direct mapped**

Index bits: $\lg 4096 = 12$
Byte offset: 4 words = 16 bytes => $\lg 16 = 4$ bits
Tag bits: $32 - 12 - 4 = 16$
Total number of tag bits: $16 \times 4096 = 65536$ bits (64 Kbit)

**2-way set-associative**

Index bits (sets): $\lg(4096/2) = 11$
Byte offset: 4 words = 16 bytes => $\lg 16 = 4$ bits
Tag bits: $32 - 11 - 4 = 17$
Total number of tag bits: $17 \times 2 \times 2048 = 69632$ bits (68 Kbit)
:::
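The tag-storage arithmetic above generalizes to any associativity. Below is a minimal Python sketch (not from the textbook; the function name `tag_storage_bits` and its parameters are chosen here for illustration) that reproduces both results:

```python=
import math

def tag_storage_bits(num_blocks, block_words, assoc, addr_bits=32, word_bytes=4):
    """Total tag-array storage (bits) for an `assoc`-way set-associative cache."""
    offset_bits = int(math.log2(block_words * word_bytes))  # byte offset within a block
    sets = num_blocks // assoc
    index_bits = int(math.log2(sets))                       # bits that select a set
    tag_bits = addr_bits - index_bits - offset_bits         # tag stored with every block
    return tag_bits * num_blocks                             # one tag per block

# The example above: 4096 blocks, 4-word blocks, 32-bit addresses
print(tag_storage_bits(4096, 4, assoc=1))   # direct mapped -> 65536
print(tag_storage_bits(4096, 4, assoc=2))   # 2-way         -> 69632
```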
### Associativity vs. Miss Rate
- Higher associativity lowers the miss rate
- But the improvement shrinks as associativity keeps increasing (diminishing returns)

:::success
**Review: Basic Cache Optimizations**
- Larger block size
    - Reduces compulsory misses
    - Increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
    - Increases hit time, increases power consumption
- Higher associativity
    - Reduces conflict misses
    - Increases hit time, increases power consumption
- Higher number of cache levels
    - Reduces miss penalty and overall memory access time
- Giving priority to read misses over writes
    - Reduces miss penalty
- Avoiding address translation in cache indexing
    - Reduces hit time
:::

### Replacement Policy
- Direct mapped: no choice (each block maps to exactly one entry)
- Set-associative
    - Prefer a non-valid (empty) entry if there is one
    - Otherwise, choose among the entries in the set
        - Least-Recently Used (LRU)
        - Random

![](https://i.imgur.com/XCTqm79.png)

## Multilevel Caches
Purpose: Reduce Miss Penalty (the L1 miss penalty)

- Level-1 cache (primary cache, L1)
    - Attached to the CPU
    - Small, but fast
- Level-2 cache (L2)
    - Larger and slower
    - Still faster than main memory
- Some high-end systems include an L3 cache

### Inclusion
- Inclusive cache: L2 contains L1; L1 is a subset of L2
- Exclusive cache: the two levels hold disjoint contents
- Without inclusion, coherence checks must also look up L1 on an L2 miss

### Access Time
- Average access time = L1 Hit Time + L1 Miss Rate x L1 Miss Penalty
- L1 Miss Penalty = L2 Hit Time + L2 Miss Rate x L2 Miss Penalty

:::info
**Example: Multilevel Cache**

Given
- CPU base CPI = `1`
- CPU clock rate = `4` GHz => each clock cycle = 0.25 ns
- Misses per instruction (to main memory) = `2%`
- Main memory access time = `100` ns

With only the L1 cache
- Miss Penalty = $100/0.25$ = 400 cycles
- Effective CPI = $1 + 2\% \times 400$ = 9

Then we add an L2 cache
- Access time = `5` ns
- Global miss rate (to main memory) = `0.5%`

Primary miss with L2 hit (L1 miss, L2 hit)
- Penalty = $5/0.25$ = 20 cycles

Primary miss with L2 miss (both L1 and L2 miss)
- Extra penalty = 400 cycles

We get
- CPI = $1 + 2\% \times 20 + 0.5\% \times 400$ = 3.4

Thus, performance ratio = $9/3.4$ = 2.6
:::

:::info
**Example:**
![](https://i.imgur.com/8sINz3M.png)
![](https://i.imgur.com/new0oee.png)
:::

## Software Optimization

### Interactions with Software
- The way software accesses memory determines the number of misses
    - Algorithm behavior
    - Compiler optimization

### Loop Blocking
- Goal: access a piece of data as many times as possible before it gets replaced in the cache (maximize reuse of cached data); see the blocked sketch at the end of this section

#### Example: DGEMM (Double-Precision Matrix Multiply)
```python=
# Naive DGEMM: matrices stored as flat, column-major arrays of size n*n
for i in range(0, n):
    for j in range(0, n):
        cij = C[i + j * n]
        for k in range(0, n):
            cij += A[i + k * n] * B[k + j * n]
        C[i + j * n] = cij
```

- Access pattern: C = A x B

![](https://i.imgur.com/8fbm9q4.png)

#### DGEMM with Cache Blocking
![](https://i.imgur.com/nwePb7H.png)

- Access pattern

![](https://i.imgur.com/PG7Rzn9.png)
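The blocked version in the figure can be written in the same Python style as the naive loop above. This is a minimal sketch, not the textbook's exact code: `BLOCKSIZE`, `do_block`, and `dgemm_blocked` are names chosen here, and the tile size is an assumption to be tuned so that the working tiles fit in the cache.

```python=
BLOCKSIZE = 32  # assumed tile size: pick it so the tiles of A, B, and C fit in the cache together

def do_block(n, si, sj, sk, A, B, C):
    """Update one BLOCKSIZE x BLOCKSIZE tile of C using one tile of A and one tile of B."""
    for i in range(si, min(si + BLOCKSIZE, n)):
        for j in range(sj, min(sj + BLOCKSIZE, n)):
            cij = C[i + j * n]
            for k in range(sk, min(sk + BLOCKSIZE, n)):
                cij += A[i + k * n] * B[k + j * n]
            C[i + j * n] = cij

def dgemm_blocked(n, A, B, C):
    # Walk the matrices tile by tile, so each tile is reused many times
    # while it is still resident in the cache
    for sj in range(0, n, BLOCKSIZE):
        for si in range(0, n, BLOCKSIZE):
            for sk in range(0, n, BLOCKSIZE):
                do_block(n, si, sj, sk, A, B, C)
```

With `A`, `B`, and `C` kept as flat column-major lists (as in the naive version), `dgemm_blocked(n, A, B, C)` computes the same result; only the order of memory accesses changes, which is what improves cache reuse.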
## Dependable Memory Hierarchy (5-5)

### Dependability Measures (6-?)
- **Reliability**: Mean Time To Failure (MTTF)
    - Fault Avoidance
    - Fault Tolerance
    - Fault Forecasting
- **Service Interruption**: Mean Time To Repair (MTTR)
    - Tools
    - Diagnosis
    - Repair
- Mean Time Between Failures: MTBF = MTTF + MTTR
- **Availability** = MTTF / MTBF = MTTF / (MTTF + MTTR)

:::info
**Example: Dependability**

Some disks today are quoted to have
- a `1,000,000`-hour MTTF (about 114 years)

Suppose we have a warehouse-scale system with
- `50,000` servers
- `2` disks per server

Then a disk's Annual Failure Rate (AFR) is
- $1/1000000 \times 24 \times 365 = 0.876\%$

So, with 100,000 disks, we would expect
- $876$ disks to fail per year
- i.e., on average more than $2$ disk failures per day
:::

### The Hamming SEC Code (ECC)

#### Terms
- Fault detection / correction for **RAM**
- **Parity Bit**: detects whether a single bit is in error
    - Even/odd parity: whether the number of 1 bits is even or odd (i.e., the XOR of all the bits, $\oplus$)
- Hamming distance: the number of bit positions in which two bit patterns differ
    - Minimum distance = 2: single-bit error detection
    - Minimum distance = 3: single error correction (SEC), or 2-bit error detection

#### Hamming EC Code (ECC)
- Bit 1 (P1): covers the positions whose binary index has the rightmost bit set (1, 3, 5, 7, 9, 11, ...)
- Bit 2 (P2): covers the positions whose binary index has the second-rightmost bit set (2, 3, 6, 7, 10, 11, ...)
- Bit 4 (P4): covers the positions whose binary index has the third-rightmost bit set (4, 5, 6, 7, 12, 13, 14, 15, ...)
- Bit 8 (P8): covers the positions whose binary index has the fourth-rightmost bit set (8-15, 24-31, ...)

![](https://i.imgur.com/J9HoA6b.png)

:::info
**Example: ECC**

Byte data `1001 1010` in binary. Compute the Hamming ECC code (a small Python sketch at the end of these notes replays this example):

*Method 1*
- P1: `10X1 1X1X`, the covered bits sum to 4, which is even => 0
- P2: `1X01 X01X`, the covered bits sum to 3, which is odd => 1
- P4: `X001 XXX0`, the covered bits sum to 1, which is odd => 1
- P8: `XXXX 1010`, the covered bits sum to 2, which is even => 0

Place the data bits into the encoded string `--1- 001- 1010`; filling in P1/P2/P4/P8 gives the result `0111 0010 1010`.

*Method 2*
- First place the data bits into the encoded string
    - `--1- 001- 1010`
- Convert every position (counting from the left, starting at 1) that holds a 1 into binary:
    - `3, 7, 9, 11` => `0011, 0111, 1001, 1011`
- XOR them all together (this checks the parity of each position bit):
    - $0011 \oplus 0111 \oplus 1001 \oplus 1011 = 0110$
- Fill these bits into the parity positions **from right to left** (LSB => P1, ..., MSB => P8), which gives
    - `0111 0010 1010`

If one bit is flipped, e.g., the stored word becomes `0111 0010 1110`, we check the ECC code again. The data bits are now `1001 1110`, and they should match P1/P2/P4/P8 = `0110` (method 1):
- P1: `10X1 1X1X`, sum is 4, even => 0, correct
- P2: `1X01 X11X`, sum is 4, even => 0, **wrong!**
- P4: `X001 XXX0`, sum is 1, odd => 1, correct
- P8: `XXXX 1110`, sum is 3, odd => 1, **wrong!**

The failing parity bits point to position $2 + 8 = 10$: bit 10 must be the one in error!
:::

#### Bits Required
Let
- p = the number of parity (check) bits
- d = the number of data bits

Then
- $2^p \ge p + d + 1$
- or equivalently $p \ge \log_2{(p+d+1)}$

## Virtual Machine (5-6)

### Virtual Machine
- Host computer emulates guest operating system and machine resources
    - Improved isolation of multiple guests
    - Avoids security and reliability problems
    - Aids sharing of resources
- Virtualization has some performance impact
    - Feasible with modern high-performance computers
- E.g.
    - Xen – Public VMM
    - IBM VM/370 *(1970s technology!)*
    - VMWare
    - Microsoft Virtual PC

### Virtual Machine Monitor
- Maps virtual resources to physical resources
    - Memory, I/O, CPU, ...
- Guest code runs on the host machine in user mode
    - This prevents the VM from executing privileged instructions directly
- The guest OS can be different from the host OS
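Below is the Python sketch referenced in the ECC example above. It is a minimal illustration, not production ECC code; the function names `hamming_encode` and `hamming_syndrome` are made up here. It encodes the byte `1001 1010`, flips bit 10, and recovers the error position from the failing parity checks.

```python=
def hamming_encode(data_bits):
    """Place even-parity check bits at positions 1, 2, 4, 8, ... (positions are 1-based)."""
    d = len(data_bits)
    p = 0
    while (1 << p) < d + p + 1:        # smallest p with 2^p >= p + d + 1
        p += 1
    total = d + p
    code = [0] * (total + 1)           # index 0 unused; code[1..total] are the encoded bits
    bits = iter(data_bits)
    for pos in range(1, total + 1):
        if pos & (pos - 1):            # not a power of two -> this position holds a data bit
            code[pos] = next(bits)
    for i in range(p):                 # parity bit 2^i covers every position with that bit set
        mask = 1 << i
        for pos in range(1, total + 1):
            if (pos & mask) and pos != mask:
                code[mask] ^= code[pos]
    return code[1:]

def hamming_syndrome(code):
    """Return 0 if all parity checks pass, else the 1-based position of the flipped bit."""
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos            # XOR the positions of all 1 bits (method 2 above)
    return syndrome

data = [1, 0, 0, 1, 1, 0, 1, 0]        # the byte 1001 1010 from the example
code = hamming_encode(data)
print(code)                            # 0111 0010 1010
bad = code[:]
bad[9] ^= 1                            # flip bit 10 (list index 9)
print(hamming_syndrome(bad))           # 10 -> bit 10 is in error
```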