# 計算機組織 Computer Organization (下) (CSIE-2)

###### tags: `courses meow` `CSIE-2` `Computer Organization` `2021 Spring`

:::success
:::spoiler click to open TOC
[TOC]
:::
:::

:::success
### Instructor
- Weeks 10 - 18
- Instructor
    - 陳增益 (tychen@g.ncu.edu.tw)
    - Office hour: after class or by appointment
- TAs
    - 狄尚弘 (rrrr1012@gmail.com)
    - 彭懷德 (storybear5306@gmail.com)

### Material
- Textbook
    - *Computer Organization and Design: The Hardware/Software Interface* by David A. Patterson and John L. Hennessy
    - Both authors are Turing Award laureates
- Slides
    - Download from the new ee-class
:::

:::info
### References
- [CPU Cache 原理探討](https://hackmd.io/@drwQtdGASN2n-vt_4poKnw/H1U6NgK3Z?type=view)
:::

---

## Chapter 5

### Principle of Locality
- **Temporal Locality**
    - If a byte of memory was used recently, it is likely to be used again soon.
- **Spatial Locality**
    - If a byte of memory was used recently, the bytes near it are also likely to be used soon.
- Both kinds of locality apply to data and to instructions.

### Memory Hierarchy Levels
- Terminology
    - **Block**: unit of copying
    - (Cache) **Hit**: access satisfied by the upper level
        - $\text{Hit ratio} = \frac{\text{hits}}{\text{accesses}}$
    - (Cache) **Miss**: block copied from the lower level
        - $\text{Miss ratio} = 1 - \text{hit ratio}$

![](https://i.imgur.com/kPrARha.png)

### Memory Technology

| Type | Static RAM (SRAM) | Dynamic RAM (DRAM) | Magnetic Disk |
| -------------- |:-----------------:|:------------------:|:-------------:|
| Access Time    | 0.5ns - 2.5ns     | 50ns - 70ns        | 5ms - 20ms    |
| Price (per GB) | $2000 - $5000     | $20 - $75          | $0.2 - $2     |

### DRAM
- Used as ***main memory***
- Data stored as a charge in a **capacitor**
- A single **transistor** is used to access the charge
- Must be periodically refreshed
> In a DRAM chip, each bit of memory data is stored as the presence or absence of an electric charge on a small capacitor on the chip. As time passes, the charges in the memory cells leak away, so **without being refreshed the stored data would eventually be lost**. [name=Wikipedia]
- Bits in DRAM are organized as a rectangular array
    - DRAM accesses an **entire row**
- **Burst mode**
    - Besides the memory at the requested address, the DRAM also reads the neighboring 8 - 16 columns (depending on the bandwidth) into an internal **row buffer**, reducing the addressing time
- **SDR / DDR / QDR DRAM**
    - Single / Double / Quad data rate (1 / 2 / 4 signals per cycle)
- Performance factors
    - Row buffer
        - Typically SRAM-like
        - Fast access
    - Synchronous DRAM (SDRAM)
        - Allows consecutive accesses in bursts without needing to send each address

### Flash Storage
- 100x - 1000x faster than disk
- Smaller, lower power, more robust
- **NOR flash**
    - Random read/write access
- **NAND flash**
    - Denser, block-at-a-time access
    - Cheaper
    - Used for USB keys, media storage, ...
- Not suitable as a direct RAM or disk replacement

### Cache Memory
> How do we know if the data is present in the cache? Cache is not byte-addressable like main memory. [name=ppt]

### Direct Mapped Cache
- Location determined by address mapping
- Formula: $(\text{block address})\mod (\text{\# of blocks in cache})$

#### Memory Addressing
![](https://i.imgur.com/N0JRDrU.png)

#### Tags and Valid Bits
- Store the block address as well as the data
    - We only need to store the high-order bits, called the **tag**
    - Compare tags to determine whether we have a cache hit
- Valid bit = 1 if the data is present, otherwise 0

#### Sheet
![](https://i.imgur.com/a8odgFW.png)

### Four Basic Questions
- Consider access of levels in a memory hierarchy:
    - For memory: byte addressable
    - For caches: use a ***block*** (also called a ***line***) as the unit of data transfer, satisfying the **Principle of Locality**; but we then need a way to locate blocks, since a cache is not byte addressable
- Cache design
    - Block Placement
    - Block Identification
    - Block Replacement
    - Write Strategy
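To make Block Identification concrete, here is a minimal sketch (mine, not from the slides) of how a direct-mapped cache splits a 32-bit byte address into tag / index / offset fields. The function name and the parameters $n$ (index bits) and $m$ (word-offset bits) are my own; the widths follow the tag-size formula $32-(n+m+2)$ used in the Overhead Analysis section below.

```python
# Sketch: field decomposition of a 32-bit address in a direct-mapped cache.
# Assumes 2^n blocks of 2^m 32-bit words (so 2 byte-offset bits), as in the notes.

def split_address(addr: int, n: int, m: int) -> dict:
    byte_offset = addr & 0b11                          # byte within a 32-bit word
    word_offset = (addr >> 2) & ((1 << m) - 1)         # word within the block
    index       = (addr >> (2 + m)) & ((1 << n) - 1)   # which cache entry
    tag         = addr >> (2 + m + n)                  # remaining 32-(n+m+2) bits
    return {"tag": tag, "index": index,
            "word_offset": word_offset, "byte_offset": byte_offset}

# e.g. a cache with 2^10 entries and 4-word blocks (n = 10, m = 2)
print(split_address(0x12345678, n=10, m=2))
```

A hit then means the entry at `index` is valid and its stored tag equals the `tag` field.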
### Block Size Considerations
> Large cache blocks fit the principle of spatial locality: programs often work on contiguous structures such as arrays, so neighboring data is likely to be needed soon.
> But the total cache size is fixed, so larger blocks mean fewer blocks in the cache, causing more competition and a higher miss rate. [name=Professor]
- Pollution: accesses to large data force the cache to move a lot of data around, evicting blocks that are still useful

> Caches at different levels are designed with different trade-offs in mind.

### Cache Miss
- On a cache hit, the CPU proceeds normally
- On a cache miss
    - Stall the CPU pipeline
    - Fetch the block from the next level of the hierarchy
- I-cache & D-cache
> Our program lives in SDRAM. Without a cache, the CPU would have to access SDRAM for every single instruction it executes, which is slow. With an I-cache, when the CPU accesses SDRAM it also brings a small block of nearby instructions into the I-cache. On the next fetch it first checks the I-cache: on a hit it executes the instruction straight from the I-cache, and only on a miss does it go to SDRAM.

### Write Policies

#### Write-Through
- When updating the cache, also update memory
- Makes writes take longer
    - If base CPI = $1$, $10\%$ of instructions are stores, and a write to memory takes $100$ cycles
    - Effective CPI = $1+0.1 \times 100 = 11$
- Solution: ***write buffer***
    - Holds data waiting to be written to memory
    - CPU continues immediately
    - Only stalls on a write if the write buffer is already full

#### Write-Back
- Alternative: on a data-write hit, only update the block in the cache
    - Keep track of whether each block is dirty
- When a dirty block is replaced, write it back to memory
    - Can use a ***write buffer*** to allow the replacing block to be read first

![](https://i.imgur.com/snxnoG5.png)

#### Write Allocate
- On a write miss, load the block from memory into the cache, modify it there, and mark it dirty; memory is updated later, when the dirty block is evicted (e.g. by a read miss)

#### Write Around (Write No Allocate)
- On a write miss, write the updated data to memory without bringing the block into the cache
- Relies on read misses to refresh the data inside the cache

Useful Reference: https://www.jyt0532.com/2018/09/23/cache-mechanism/

### Overhead Analysis
- The total number of bits in a direct-mapped cache:
$$
2^n\times (\text{block size} + \text{tag size} + \text{valid field size})
$$
- The size of the tag field
    - $32 - (n+m+2)$
        - $n$ bits are used for the index
        - $m$ bits are used for the word offset within the block
- Formula
    - $\text{Total cache size} = 2^n\times [(2^m\times 32) + (32-n-m-2) + 1]$
    - $\text{Overhead}= \frac{\text{Total cache size}}{\text{Actual cache data}}=\frac{2^n\times [(2^m\times 32) + (32-n-m-2) + 1]}{2^n\times (2^m\times 32)}$

![](https://i.imgur.com/U9P7tNT.png)
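The formulas above are easy to sanity-check numerically. Here is a small sketch of my own that evaluates them; the $n=10$, $m=2$ configuration happens to match the worked example that follows.

```python
# Sketch: total bits and overhead of a direct-mapped cache, 32-bit addresses.
# n = index bits (2^n blocks), m = word-offset bits (2^m 32-bit words per block).

def cache_bits(n: int, m: int):
    data_bits  = 2**m * 32                 # block size in bits
    tag_bits   = 32 - n - m - 2            # 2 bits are the byte offset in a word
    total_bits = 2**n * (data_bits + tag_bits + 1)   # +1 valid bit per entry
    return total_bits, total_bits / (2**n * data_bits)

total, overhead = cache_bits(n=10, m=2)
print(total // 1024, overhead)   # 147 (Kib), 1.1484375 (= 147/128)
```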
#### Example Problem
> How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks, assuming a 32-bit address?

- 4-word blocks ($\implies m=2$)
- Words to be cached:
    - 16 KiB = 4096 words
- Total number of blocks in the cache:
    - 4096 / 4 = 1024 blocks ($\implies n=10$)
    - 10 bits for the entry index
- Tag size = 32 - 10 - 2 - 2 = 18 bits
- Total cache size = $2^{10}\times (4 \times 32 + 18 + 1)=147$ Kib
- Overhead = $\frac{147}{128} \approx 1.15$

### Memory Performance
- Bandwidth
    - Bus width, bus cycle time
    - Number of channels
- Latency
    - DRAM access latency
    - Interconnect latency

### Associative Caches
![](https://i.imgur.com/i4P77lH.png)

#### Fully associative
- Allow a given block to go in any cache entry
> There is no index anymore, so more tag bits must be stored,
> and we have to compare tags sequentially against every entry (expensive).

#### (n-way) Set associative
- Divide the cache blocks into a number of sets
    - $\text{Set Index} = (\text{block address}) \mod (\text{number of sets in cache})$
- At most n comparisons per lookup (less expensive)
- **LRU replacement policy**
    - Keep the recently used data and evict what has not been used

#### For a fixed cache size, increasing the associativity decreases the number of sets
![](https://i.imgur.com/GvEWWP4.png =500x)

### How Much Associativity
- Increasing associativity **decreases miss rate**, but with **diminishing returns**
:::spoiler **image of diminishing returns**
![diminishing returns](https://i.imgur.com/E2SPrLQ.png =300x)
:::
- e.g. Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
    - 1-way: 10.3%
    - 2-way: 8.6%
    - 4-way: 8.3%
    - 8-way: 8.1%

### Cache Size: Set-associative
> 4096 blocks, 4-word block size, 32-bit addressing
> Find the total number of tag bits for caches that are **direct-mapped** and **two-way set associative**.

- Direct mapped
    - The number of entries in the cache = $2^{12} = 4096$
    - 4 words = 16 bytes $\implies$ 4 offset bits
    - Tag size = 32 - 12 - 4 = 16
    - Total number of tag bits = $16\times 2^{12}=64$ Kib
- 2-way set associative
    - The number of sets in the cache = $2^{11} = 2048$
    - 4 words = 16 bytes $\implies$ 4 offset bits
    - Tag size = 32 - 11 - 4 = 17
    - Total number of tag bits = $2\times 17\times 2^{11}=68$ Kib
> One fewer index bit, so every tag grows by one bit.
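The same arithmetic generalizes to any associativity. A quick sketch (function and argument names are mine) that reproduces both totals above:

```python
import math

# Sketch: total tag bits for a cache of `blocks` blocks of `words` 32-bit words,
# organized as `ways`-way set associative (ways=1 is direct-mapped).
def total_tag_bits(blocks: int, words: int, ways: int, addr_bits: int = 32) -> int:
    sets        = blocks // ways
    offset_bits = int(math.log2(words * 4))   # byte offset within a block
    index_bits  = int(math.log2(sets))
    tag_bits    = addr_bits - index_bits - offset_bits
    return blocks * tag_bits

print(total_tag_bits(4096, 4, ways=1) / 1024)   # 64.0 Kib (direct-mapped)
print(total_tag_bits(4096, 4, ways=2) / 1024)   # 68.0 Kib (2-way)
```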
:::spoiler Condensed notes by wasabi-neko
:::info
### Condensed version
> Reference: https://hackmd.io/@drwQtdGASN2n-vt_4poKnw/H1U6NgK3Z?type=view

```graphviz
digraph {
    rankdir=LR;
    cpu [label="CPU"]
    m [label="main memory"]
    cpu -> m [label="addr"]
    m -> cpu [label="data"]
}
```

```graphviz
digraph {
    rankdir=LR;
    cpu [label="CPU"]
    c [label="cache"]
    m [label="main memory"]
    cpu -> c [label="addr"]
    c -> cpu [label="data"]
    c -> m [label="addr"]
    m -> c [label="data"]
}
```

From the CPU's and main memory's point of view, what they each do does not change at all.
We divide main memory into many **blocks**, and cache one **block** at a time.

```graphviz
digraph {
    node [shape="plaintext"];
    rankdir=LR;
    cache [label=<
        <table>
            <tr> <td colspan="3" border="0">Cache</td> </tr>
            <tr> <td>valid</td> <td>Tag</td> <td>Block data</td> </tr>
            <tr><td PORT="line0">1</td><td>01010011</td><td>1010101001</td></tr>
            <tr><td PORT="line1">0</td><td>01010011</td><td>1010101001</td></tr>
            <tr><td>0</td><td>01010110</td><td>1010101001</td></tr>
            <tr><td>:</td><td>:</td><td>:</td></tr>
        </table>
    >];
    l0 [label="line: 0"]
    l0 -> cache:line0
    l1 [label="line: 1"]
    l1 -> cache:line1
}
```

**Cache line**: a line contains a valid bit, a tag, and the block data
**Block data**: the data copied down from main memory
**Tag**: the key used for lookups later on

### Cache mapping
When the CPU requests an address, we need some scheme to check whether the data at that address is among the currently cached data.

#### **Fully associative**
A main-memory address can be read as: the y-th byte within the x-th block.
Treat x as the tag; a lookup can then simply loop over every line and compare tags:

```python=
CPU_get_byte(addr):
    for line in cache:
        if line.tag == addr.tag and line.valid == True:
            return line.block_data[addr.block_offset]
    return Fetch_data_from_memory(addr)
```

##### example:
```graphviz
digraph {
    rankdir=LR;
    label="Assume:\n32-bit addressing arch,\nblock_size=32B"
    m [shape="record" label="main memory|block 0|block 1|<b2>block 2|...|..."]
    addr [shape="plaintext" label=<
        <table border="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="10">
            <tr><td colspan="2">address: 32-bit</td></tr>
            <tr>
                <td>000000000000000000000000010</td>
                <td PORT="tail">00011</td>
            </tr>
            <tr>
                <td>tag: 27 bit</td>
                <td>block_offset: 5 bit</td>
            </tr>
        </table>
    >]
    addr:tail->m:b2
    cache [shape="plaintext" label=<
        <table>
            <tr> <td colspan="3" border="0">Cache</td> </tr>
            <tr> <td>valid</td> <td>Tag</td> <td>Block data</td> </tr>
            <tr>
                <td PORT="b2">1</td>
                <td>000000000000000000000000010</td>
                <td>block 2</td>
            </tr>
            <tr> <td>:</td><td>:</td><td>:</td> </tr>
        </table>
    >];
    addr:tail->cache:b2
}
```
Drawback: lookups are slow

#### **Direct mapped**
Map each block of main memory to one specific line.

```python=
CPU_get_byte(addr):
    line = cache[addr.index]
    if line.tag == addr.tag and line.valid == True:
        return line.block_data[addr.block_offset]
    else:
        return Fetch_data_from_memory(addr)
```

##### example:
Suppose there are 512 lines in total. Then:
block 0, 512, 1024, ... map to line 0
block 1, 513, 1025, ... map to line 1

```graphviz
digraph {
    rankdir=LR;
    label="Assume:\n32-bit addressing arch,\nblock_size = 32B,\nnumber of lines = 512"
    m [shape="record" label="main memory|block 0|...|<b514>block 514|...|..."]
    addr [shape="plaintext" label=<
        <table border="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="10">
            <tr><td colspan="3">address: 32-bit</td></tr>
            <tr>
                <td>000000000000000001</td>
                <td>000000010</td>
                <td PORT="tail">00011</td>
            </tr>
            <tr>
                <td>tag: 18 bit</td>
                <td>index: 9 bit</td>
                <td>block_offset: 5 bit</td>
            </tr>
        </table>
    >]
    cache [shape="plaintext" label=<
        <table>
            <tr> <td colspan="3" border="0">Cache</td> </tr>
            <tr> <td>valid</td> <td>Tag</td> <td>Block data</td> </tr>
            <tr> <td>1</td><td>some tag</td><td PORT="b0_tail">some block data</td> </tr>
            <tr> <td>1</td><td>some tag</td><td PORT="b1_tail">some block data</td> </tr>
            <tr>
                <td PORT="b2">1</td>
                <td>000000000000000001</td>
                <td PORT="b2_tail">block 514</td>
            </tr>
            <tr> <td>:</td><td>:</td><td>:</td> </tr>
        </table>
    >];
    addr:tail->m:b514
    addr:tail->cache:b2
    cache:b0_tail -> index0 [dir="back"]
    cache:b1_tail -> index1 [dir="back"]
    cache:b2_tail -> index2 [dir="back"]
}
```
Drawback: several blocks share one line, so they cannot be in the cache at the same time.

#### **X-byte N-way Set Associative**
> Combines the advantages of the two schemes above

Map each block of main memory to one specific **set**.
**Set**: contains N lines.
A lookup therefore only has to iterate once over a single set to find the tag.
> A bit like a hash table

```python=
CPU_get_byte(addr):
    target_set = cache.sets[addr.index]
    for line in target_set:
        if line.tag == addr.tag and line.valid == True:
            return line.block_data[addr.block_offset]
    return Fetch_data_from_memory(addr)
```
:::

### Six basic cache optimizations
#### Larger block size
- Reduces compulsory misses
- Increases capacity and conflict misses, increases miss penalty
#### Larger total cache capacity to reduce miss rate
- Increases hit time, increases power consumption (more circuitry)
#### Higher associativity
- Reduces conflict misses
- Increases hit time, increases power consumption (more comparators)
#### Higher number of cache levels
- Reduces miss penalty and overall memory access time (L3 = last-level cache)
#### Giving priority to read misses over writes
- Reduces miss penalty (serve the read first, e.g. from the write buffer if the data is there, instead of completing the write-back and then reading)
#### Avoiding address translation in cache indexing
- Reduces hit time
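The pseudocode in the condensed notes can be turned into a small runnable model. Below is my own sketch (not course code) of an N-way set-associative cache with the LRU replacement policy mentioned earlier; `fetch_from_memory` is a stand-in for the next level of the hierarchy.

```python
# Sketch: N-way set-associative cache with LRU replacement.
# All names here are illustrative, not from the course material.

class SetAssociativeCache:
    def __init__(self, num_sets: int, ways: int, block_bytes: int):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        # Each set is a list of (tag, data); index 0 = least recently used.
        self.sets = [[] for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr: int):
        block_addr = addr // self.block_bytes
        index, tag = block_addr % self.num_sets, block_addr // self.num_sets
        lines = self.sets[index]
        for i, (t, data) in enumerate(lines):
            if t == tag:                      # hit: move line to MRU position
                self.hits += 1
                lines.append(lines.pop(i))
                return data
        self.misses += 1                      # miss: evict LRU entry if full
        if len(lines) == self.ways:
            lines.pop(0)
        data = fetch_from_memory(block_addr)  # stand-in for the lower level
        lines.append((tag, data))
        return data

def fetch_from_memory(block_addr: int):
    return f"block {block_addr}"

cache = SetAssociativeCache(num_sets=4, ways=2, block_bytes=16)
for a in [0, 64, 0, 64]:   # 0 and 64 map to the same set; 2 ways keep both
    cache.access(a)
print(cache.hits, cache.misses)  # 2 hits, 2 misses
```

With `ways=1` this degenerates to a direct-mapped cache, and with `num_sets=1` to a fully associative one.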
### Causes of misses (4 Cs)
- **Compulsory**
    - First reference to a block
    - The miss caused when data is brought into the cache for the very first time (the cache starts out empty)
- **Capacity**
    - Blocks discarded and later retrieved
    - e.g. in a fully associative cache: space runs out, so one block's data must be evicted, and that data is needed again right afterwards
- **Conflict**
    - The program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
    - e.g. several addresses map to the same block index, so they keep overwriting each other's cached data, forcing repeated reads and writes
- **Coherence miss**
    - The cache coherence problem covered in OS
    - Suppose CPU A and CPU B share a piece of data D, and D is modified by CPU A. The copy of D in CPU B's cache must then be marked invalid, and the miss CPU B takes when it later tries to access D in its cache is a coherence miss

### Replacement Policy
- **Direct-mapped**
    - Replace directly; there is no other choice
- **Set associative** (and fully associative)
    - If there is a non-valid entry, simply use it
    - If not, we need a policy to choose which entry to replace (the two below):
    - **Least-recently used (LRU)**
        - Choose the one unused for the longest time
        - Simple for 2-way, manageable for 4-way (4-way: $4!=24$ orderings, minimum 5 bits)
        - Beyond 4-way it is too hard
    - **Random**
        - At high associativity, performance is about the same as LRU

### Implement LRU Replacement
#### Pure LRU
- 4-way $\to$ use **6 bits** in practice (minimum 5 bits to cover $4!=24$ orderings)
#### Partitioned LRU (Pseudo LRU)
- Use a binary tree to maintain only (n-1) bits for n-way set associativity

![](https://i.imgur.com/JOQquR5.png)

### Multilevel Caches
- **Primary Cache (L1)**
    - 0.5 - 1 ns
    - Attached to the CPU
    - Small but fast
- **Level-2 Cache (L2)**
    - 5 - 10 ns
    - Services misses from the primary cache
    - Larger, slower, but still faster than main memory
- **Main Memory**
    - 25 - 75 ns
    - Services L2 cache misses
    - Huge miss penalty
- Some high-end systems include an L3 cache

### Multilevel Cache Terminology
- **Local miss rate**
    - The miss rate of one hierarchy level by itself
    - $\text{Local miss rate} = \frac{\text{\# of misses at that level}}{\text{\# of accesses at that level}}$
    - e.g. Miss rate(L1), Miss rate(L2)
- **Global miss rate**
    - The miss rate of a whole group of hierarchy levels
    - e.g. $\text{Global Miss rate(L2)} = \text{Miss rate(L1)} \times \text{Local Miss rate(L2)}$

### Multilevel Cache Example
> CPU base CPI = 1, clock rate = 4 GHz
> Miss rate per instruction = 2%
> Main memory access time = 100 ns

:::spoiler In case you, like me, forgot what CPU Time, CPI, and clock rate mean
:::info
#### CPU Time = $I \times CPI \times T$
- $I$ = number of instructions in the program
- $CPI$ = average cycles per instruction
- $T$ = clock cycle time
- clock rate = $\dfrac{1}{T}$
[name=蛋蕉]
:::

- **With just primary cache (L1)**
    - Miss penalty = $\frac{100 \times 10^{-9}}{0.25 \times 10^{-9}}=400$ cycles
    - Effective CPI = $1 + 0.02 \times 400 = 9$
- **Add L2 cache**
> Access time = 5 ns
> Global miss rate to main memory = 0.5%
- L1 miss, L2 hit
    - Penalty = $\frac{5}{0.25}=20$ cycles
- L1 miss, L2 miss
    - Extra penalty = $400$ cycles
- Effective CPI = $1 + 0.02 \times 20 + 0.005 \times 400 = 3.4$
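The effective-CPI arithmetic in this example reduces to one formula. A tiny sketch of my own with the example's numbers plugged in:

```python
# Sketch: effective CPI with and without an L2 cache (numbers from the example).
clock_ns = 0.25                          # 4 GHz => 0.25 ns per cycle

def effective_cpi(base_cpi, penalties):  # penalties: list of (miss rate, cycles)
    return base_cpi + sum(rate * cycles for rate, cycles in penalties)

mem_cycles = 100 / clock_ns              # 400-cycle miss penalty to main memory
l2_cycles  = 5 / clock_ns                # 20-cycle penalty for an L2 hit

print(round(effective_cpi(1, [(0.02, mem_cycles)]), 2))                      # 9.0
print(round(effective_cpi(1, [(0.02, l2_cycles), (0.005, mem_cycles)]), 2))  # 3.4
```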
### Effect of 2-level Caching
- L2 is usually much bigger than L1
    - Provides a reasonable hit rate
- Lowers the miss penalty of the L1 cache
    - But the overall miss penalty (when both levels miss) goes up
- Inclusion property
    - Inclusive cache
        - L1 is a subset of L2: whatever is in L1 is also in L2
        - Simplifies the mechanism for keeping data consistent
    - Exclusive cache
        - What appears in L1 does not necessarily appear in L2
        - On an L2 miss, L1 must also be checked for a hit

### Dependability Measures
- **Reliability:** mean time to failure (MTTF)
    - On average, how long before a failure occurs (e.g. memory cannot be read / returns wrong data)
- **Service interruption**: mean time to repair (MTTR)
    - How long it takes to recover after a failure occurs
- Mean time between failures (MTBF)
    - **MTBF = MTTF + MTTR**
- $\text{Availability} = \frac{\text{uptime}}{\text{total time}} = \frac{\text{MTTF}}{\text{MTBF}}$
- Improving Availability
    - **Increase MTTF**: fault avoidance, fault tolerance (e.g. **RAID**), fault forecasting
    - **Reduce MTTR**: improved tools and processes for diagnosis and repair

### Examples
> Some disks today are quoted to have a 1,000,000-hour MTTF (114 years).
> Suppose a warehouse-scale system has 50,000 servers, each server with 2 disks.
> A disk's annual failure rate (AFR) = 8760 / 1,000,000 = 0.876%.
> With 100,000 disks, we would expect 876 disks to fail per year, or **on average more than 2 disk failures per day**!

**In short**: a single disk looks like it only fails once every 114 years on average, but with a very large number of disks (as in a data center or data warehouse), on average about two disks fail every day!

### Hamming SEC code
- A kind of **Error Correction Code** (ECC)
- Fault detection / correction for **RAM**
- **Parity bit:** single error detection
    - Parity check: `XOR` all the bits of the data together to produce one extra bit, which can detect whether the data contains an error (but only an **odd number** of bit errors)

#### Hamming Distance
- The number of bits that differ between two bit patterns
- If the minimum distance is 2, adding a parity check provides **Single Error Detection**
- If the minimum distance is 3, the code provides **Single Error Correction (SEC)**; with minimum distance 4 it also provides **Double Error Detection (DED)**

![](https://i.imgur.com/KzbpefJ.png)

#### Hamming Code
==An eye-opening video==: [Hamming codes and error correction](https://www.youtube.com/watch?v=X8jsijhllIA)

$P_1$ covers all positions whose index in binary has a $1$ in the lowest bit: e.g. 1, 11, 101, 111, ... (binary)
$P_2$ covers all positions whose index in binary has a $1$ in the second-lowest bit: e.g. 10, 11, 110, 111, ... (binary)
And so on.

A faster way to obtain the check bits:
take the codeword positions of all data bits that are $1$, and `XOR` them all together. See the example below. (*)

If you take the positions of all $1$ bits of the *encoded* data and `XOR` them, you get an all-zero bit pattern.
If the result is not all $0$, the data is corrupted (or the Hamming code was computed incorrectly).

#### Example
> Assume a one-byte data value $(10011010)_2$. The generated encoded string:

| Bit 1 ($P_1$) | Bit 2 ($P_2$) | Bit 3 | Bit 4 ($P_4$) | Bit 5 | Bit 6 | Bit 7 | Bit 8 ($P_8$) | Bit 9 | Bit 10 | Bit 11 | Bit 12 |
|:-------------:|:-------------:|:-----:|:-------------:|:-----:|:-----:|:-----:|:-------------:|:-----:|:------:|:------:|:------:|
| 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |

Hamming code for $(10011010)_2$ = $(0110)_2$
(*) 3 `XOR` 7 `XOR` 9 `XOR` 11 = $(0110)_2$

> Suppose the data is corrupted: we invert Bit 10.
> The data becomes $(011100101110)_2$.

Hamming code for $(011100101110)_2$ = $(1010)_2$
$\implies$ $P_8$ and $P_2$ are incorrect.
$\implies$ Bit 10 ($8 + 2$) must be incorrect.

#### How many bits are needed?
Suppose the original data is $d$ bits long in binary, and we need $p$ bits in total as parity bits.
There are $p + d$ possible positions for a single-bit error:
$\implies 2^p \geq p + d + 1$ (the extra $1$ encodes the "no error" case)
$\implies p \geq \log_2(p + d + 1)$

e.g. the data is 8 bits $\implies d=8$
Solving the inequality $\implies p=4$
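To tie the Hamming example together, here is a sketch of my own implementing the XOR-of-positions trick described above; it reproduces the codeword, the all-zero check, and the bit-10 diagnosis from the worked example.

```python
# Sketch: Hamming SEC via the XOR-of-1-bit-positions trick from the notes.

def syndrome(bits):
    """XOR the (1-based) positions of all set bits; 0 means 'looks clean'."""
    s = 0
    for pos, b in enumerate(bits, start=1):
        if b:
            s ^= pos
    return s

def encode(data):
    """data: list of bits, MSB first. Returns the codeword as a bit list."""
    p = 1
    while 2**p < p + len(data) + 1:        # 2^p >= p + d + 1
        p += 1
    code = [0] * (p + len(data))
    it = iter(data)
    for pos in range(1, len(code) + 1):
        if pos & (pos - 1):                # not a power of two -> data slot
            code[pos - 1] = next(it)
    s = syndrome(code)                     # set P1, P2, P4, ... so syndrome = 0
    for i in range(p):
        code[(1 << i) - 1] = (s >> i) & 1
    return code

code = encode([1, 0, 0, 1, 1, 0, 1, 0])    # (10011010)2 from the example
print(code)            # [0,1,1,1,0,0,1,0,1,0,1,0], matches the table
print(syndrome(code))  # 0: clean
code[10 - 1] ^= 1      # corrupt bit 10
print(syndrome(code))  # 10 = P8 + P2 flag, so bit 10 is the corrupted one
```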
### Virtual Machines
Simulating a computer (operating system) on top of a computer.

### Virtual Memory
- Use main memory as a cache for secondary (disk) storage
- Each process can only translate virtual addresses into real memory addresses through its own page table
- The CPU and OS translate virtual addresses to physical addresses

### Page Table
- Array of page table entries (PTEs), indexed by virtual page number
- Memory is divided into fixed-size **pages** (e.g. 4 KB)
- Each process has its own page table
- The page offset of a virtual address is always the same size as the offset of the physical address (this is what lets a process believe it really owns that memory)
    - With 4 KB pages, the page offset is 12 bits
- **If the page is in memory** (valid bit = 1), the PTE stores the physical page number (plus other data such as the referenced bit, dirty bit, ...)
- If it is not (valid bit = 0), the PTE stores the location of that data on disk
- **Page table register**: a register that points to the start of the page table

![](https://i.imgur.com/fxl7CDH.png =500x)

==Excellent video, explains it very clearly==: [Virtual Memory: Page Tables](https://www.youtube.com/watch?v=KNUJhZCQZ9c)

### Page Fault Penalty
- On a page fault, the page must be fetched from disk
    - Takes millions of clock cycles
    - Handled by the OS
- Try to minimize the page fault rate
    - Fully associative placement (because the miss penalty is so high)
    - Smart replacement algorithms

### Mapping Pages to Storage
- If the valid bit is 1, the page is in memory
- If it is 0, the page is on disk (page fault)

![](https://i.imgur.com/rmnCqBK.png)

### Replacement and Writes
- To reduce the page fault rate, prefer least-recently used (LRU) replacement
    - The **reference bit** (a.k.a. use bit) in the PTE is set to 1 on any access to the page
    - Periodically cleared to 0 by the OS
    - A page with reference bit = 0 $\implies$ has not been used recently
- Writing to non-volatile storage (such as disk) is too expensive
    - **`Write-through`** is impractical; use **`Write-back`** instead (see the Write Policies section earlier)
    - The **dirty bit** in the PTE is set when the page is written

### Fast Translation Using a TLB (Translation Lookaside Buffer)
- An ordinary translation needs two memory accesses
    1. Read the PTE to get the physical address
    2. Use the physical address to get the data actually needed
- But page-table accesses themselves should have good locality to exploit:
    - Temporal locality: a frequently accessed page is likely to be accessed again soon
    - Spatial locality: pages near a frequently accessed page are also likely to be accessed
- So, to speed up translation, the **TLB** was invented: a cache that stores address translations
    - Usually a full lookup table (like a fully associative cache)
    - Typical size: 16 - 512 PTEs
    - Hit time: 0.5 - 1 cycle
    - Miss penalty: 10 - 100 cycles
    - Miss rate: 0.01% - 1%

![](https://i.imgur.com/mWCx1uy.png)

- Valid bit = 1, Reference bit = 1 $\implies$ **TLB Hit!**

### TLB Misses
- If the page is in memory
    - Look up the physical address in the page table and write it back into the TLB
    - Retry the access; the translation now comes from the TLB
- If the page is not in memory (page fault)
    - Handled by the OS: bring the page back from disk and update the physical address in the page table
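A minimal sketch of the TLB-then-page-table flow just described, assuming 4 KB pages; the dictionaries standing in for the TLB and the page table (and their toy contents) are my own illustration.

```python
# Sketch: virtual-to-physical translation with a TLB in front of the page table.
# 4 KB pages => 12-bit page offset. PAGE_TABLE / TLB here are plain dicts.

PAGE_OFFSET_BITS = 12
PAGE_TABLE = {0x00001: 0x0AB, 0x00002: 0x3CD}   # VPN -> PPN (toy contents)
TLB = {}                                        # VPN -> PPN cache

def translate(vaddr: int) -> int:
    vpn    = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in TLB:                    # TLB hit: no page-table access needed
        ppn = TLB[vpn]
    elif vpn in PAGE_TABLE:           # TLB miss, page in memory: refill TLB
        ppn = TLB[vpn] = PAGE_TABLE[vpn]
    else:                             # page fault: the OS takes over
        raise RuntimeError("page fault: OS fetches the page from disk")
    return (ppn << PAGE_OFFSET_BITS) | offset

print(hex(translate(0x1234)))   # 0xab234 (TLB miss, then page-table hit)
print(hex(translate(0x1FFF)))   # 0xabfff (TLB hit this time)
```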
### Address Translation Overview
- In VM, memory acts like a cache for disk
- The TLB acts like a cache for the page table

![](https://i.imgur.com/ogC3MVT.png)

### TLB and Cache Interaction
The diagram looks complicated, but it boils down to: the VA first goes through the TLB to become a PA, and the PA is then used to fetch the data from the cache.

![](https://i.imgur.com/zZZGqJw.png)
![](https://i.imgur.com/dF7nTwr.png)

### **Page Table valid bit** and **TLB valid bit**
- Page Table valid bit
    - 0: data is on disk
    - 1: data is in memory
- TLB valid bit
    - When a new value is loaded into the **page table register**, the TLB valid bits are set to 0
    - In short: when we switch to a different page table, the entries in the TLB are no longer correct, so valid bit = 0

### Cache Design Trade-offs

| Design change | Effect on miss rate | Negative performance effect |
|:---------------------- |:---------------------------- |:---------------------------- |
| Increase cache size | Decrease **capacity misses** | May increase **access time** |
| Increase associativity | Decrease **conflict misses** | May increase **access time** |
| Increase block size | Decrease **compulsory misses**, may also ==decrease miss rate== due to **spatial locality** | Increase **miss penalty**; for large block sizes, may ==increase miss rate== due to **pollution** |

### Cache Coherence Protocols
- Solve the cache coherence problem
- Snooping protocols
    - Each cache monitors bus reads/writes
- Directory-based protocols
    - Caches and memory record the sharing status of blocks in a directory

#### Snooping Solution
- Send all requests for data to all processors
- Requires **broadcast**, since caching information lives in the processors
- Dominates for small-scale machines (few cores)
- Not feasible with **clustering** (processors not nearby, communicating over a network)
    - **broadcast** may have high latency

#### Directory-based Schemes
- Keep track of what is being shared in one centralized place
- Distributed memory (for scalability)
- Send point-to-point requests to processors via the **network**

### Basic Snoopy Protocols
#### Write Invalidate Protocol
- Whenever data is updated, mark the copies of that data in all other caches invalid
- Multiple readers, single writer
- On a read miss:
    - Write-through: memory is always up to date
    - Write-back: the snoop must locate the most recent copy (the only valid one)

#### Write Broadcast (Update) Protocol
- Whenever data is updated, broadcast it on the bus; each processor updates the corresponding data copy in its own cache
- On a read miss, the data is always up to date (because every write is broadcast)
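As a closing illustration, the write-invalidate idea can be modeled with a shared bus that every cache snoops. This is purely my own sketch (write-through, so memory is always current on a read miss, as noted above); all class and method names are invented.

```python
# Sketch: write-invalidate snooping on a shared bus (write-through for simplicity).

class SnoopyCache:
    def __init__(self, name, bus):
        self.name, self.data = name, {}     # addr -> value (valid entries only)
        self.bus = bus
        bus.caches.append(self)

    def write(self, addr, value):
        self.data[addr] = value
        self.bus.memory[addr] = value       # write-through to memory
        self.bus.broadcast_invalidate(addr, writer=self)

    def read(self, addr):
        if addr not in self.data:           # miss: fetch from memory
            self.data[addr] = self.bus.memory.get(addr)
        return self.data[addr]

class Bus:
    def __init__(self):
        self.caches, self.memory = [], {}

    def broadcast_invalidate(self, addr, writer):
        for cache in self.caches:           # every cache snoops the bus
            if cache is not writer:
                cache.data.pop(addr, None)  # invalidate other copies

bus = Bus()
a, b = SnoopyCache("A", bus), SnoopyCache("B", bus)
a.write(0x10, 1)
print(b.read(0x10))   # 1 (read miss, fetched from memory)
a.write(0x10, 2)      # invalidates B's copy
print(b.read(0x10))   # 2 (coherence miss, refetched)
```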