# 計算機組織 Computer Organization (Part 2) (CSIE-2)
###### tags: `courses meow` `CSIE-2` `Computer Organization` `2021 Spring`
:::success
:::spoiler click to open TOC
[TOC]
:::
:::success
### Instructors
- Week_10 ~ Week_18
- Instructor
- 陳增益(tychen@g.ncu.edu.tw)
- Office hour: after class or by appointment
- TAs
- 狄尚弘(rrrr1012@gmail.com)
- 彭懷德(storybear5306@gmail.com)
### Material
- Textbook
- "Computer Organization and Design : The Hardware/Software Interface" by David A. Patterson and John L Hennessy
- They are turing award laureates
- Slides
- Download from new ee-class
:::
:::info
### References
- [CPU Cache 原理探討](https://hackmd.io/@drwQtdGASN2n-vt_4poKnw/H1U6NgK3Z?type=view)
:::
---
## Chapter 5
### Principle of Locality
- **Temporal Locality**
    - If a byte in memory was accessed recently, that byte is likely to be accessed again in the near future.
- **Spatial Locality**
    - If a byte in memory was accessed recently, the bytes near it are also likely to be accessed in the near future.
- Both kinds of locality apply to data and to instructions.
### Memory Hierarchy Levels
- Terminology
- **Block**: Unit of copying
- (Cache) **Hit**: Access satisfied by upper level
- $\text{Hit ratio} = \frac{\text{hits}}{\text{accesses}}$
- (Cache) **Miss**: Block copied from lower level
- $\text{Miss ratio} = 1 - \text{hit ratio}$

### Memory Technology
| Type | Static RAM (SRAM) | Dynamic RAM (DRAM) | Magnetic Disk |
| -------------- |:-----------------:|:------------------:|:-------------:|
| Access Time    | 0.5ns - 2.5ns     | 50ns - 70ns        | 5ms - 20ms    |
| Price (per GB) | $2000 - $5000     | $20 - $75          | $0.2 - $2     |
### DRAM
- Used as ***main memory***
- Data stored as a charge in a **capacitor**
- Single **transistor** used to access the charge
- Must be periodically refreshed
> In a DRAM chip, each bit of memory data is stored as the presence or absence of an electric charge on a small capacitor on the chip. As time passes, the charges in the memory cells leak away, so **without being refreshed the stored data would eventually be lost**. [name=Wikipedia]
- Bits in DRAM are organized as a rectangular array
- DRAM accesses an **entire row**
- **Burst mode**
    - Besides the word at the requested address, the surrounding 8-16 columns of the row (depending on the bandwidth) are also read out and kept in an internal **row buffer**, reducing addressing time
- **SDR / DDR / QDR DRAM**
- Single / Double / Quad data rate (1 / 2 / 4 signals per cycle)
- Performance factors
- Row buffer
- Typically SRAM-like
- Fast access
- Synchronous DRAM (SDRAM)
- Allows for consecutive accesses in bursts without needing to send each address
### Flash Storage
- 100x - 1000x faster than disk
- Smaller, lower power, more robust
- **NOR flash**
- Random read/write access
- **NAND flash**
- Denser, block-at-a-time access
- Cheaper
- Used for USB keys, media storage, ...
- Not suitable for direct RAM or disk replacement
### Cache Memory
> How do we know if the data is present in the cache? Cache is not byte-addressable like main memory. [name=ppt]
### Direct Mapped Cache
- Location determined by address mapping
- Formula: $(\text{block address})\mod (\text{# of blocks in cache})$
#### Memory Addressing

#### Tags and Valid Bits
- Store block address as well as the data
- We only need to store the high-order bits, called **tag**
- Matching tag to determine cache hit
- Valid bit = 1 if the data is present, otherwise 0
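
As a sketch of the addressing math, here is how a direct-mapped cache splits a 32-bit byte address into tag / index / block offset. The field widths are chosen for illustration (they match the 16 KiB example later in this chapter); the helper is written for these notes:
```python=
OFFSET_BITS = 4    # assume 16-byte (4-word) blocks
INDEX_BITS = 10    # assume 1024 blocks in the cache

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Two addresses exactly one cache size (16 KiB) apart share an index
# but differ in tag, so they compete for the same cache entry:
print(split_address(0x1234))               # (0, 291, 4)
print(split_address(0x1234 + (1 << 14)))   # (1, 291, 4)
```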
#### Sheet

### Four Basic Questions
- Consider access of levels in a memory hierarchy:
- For memory: byte addressable
- For caches: use a ***block*** (also called a ***line***) as the unit of data transfer; this satisfies the **Principle of Locality**. But blocks must then be located (the cache is not byte-addressable)
- Cache design
- Block Placement
- Block Identification
- Block Replacement
- Write Strategy
### Block Size Considerations
> A large cache block suits the principle of spatial locality: we often use contiguous data structures such as arrays, so we are likely to need the neighboring data as well.
> But the cache size is fixed, so large blocks mean fewer entries in the cache, causing contention and increasing the miss rate. [name=Professor]
- Pollution: accesses to large data force the cache to move large amounts of data around
> The design philosophy also differs between cache levels.
### Cache Miss
- On cache hit, CPU proceeds normally
- On cache miss
- Stall the CPU pipeline
- Fetch block from next level of hierarchy
- I-cache & D-cache
> Our program lives in SDRAM. Without a cache, the CPU would have to access SDRAM for every single instruction it executes, which is slow. With an I-cache, when the CPU accesses SDRAM it also pulls a small block of nearby instructions into the I-cache. On subsequent executions it first checks whether those instructions are already in the I-cache: if so, it executes them straight from there; if not, it fetches them from SDRAM.
### Write Policies
#### Write-Through
- When updating cache, also update memory
- Makes writes take longer
- If base CPI = $1$, $10 \%$ of instructions are stores, write to memory takes $100$ cycles
- Effective CPI = $1+0.1 \times 100 = 11$
- Solution: ***write buffer***
- Holds data waiting to be written to memory
- CPU continues immediately
- Only stalls on write if write buffer is already full
#### Write-Back
- Alternative: on data-write hit, only update the block in cache
- Keep track of whether each block is dirty
- When a dirty block is replaced, write it back to memory
- Can use a ***write buffer*** to allow replacing block to be read first

#### Write Allocate
- On a write miss, fetch the block from memory into the cache, modify the cached copy and mark it dirty; memory is updated later, e.g. when the dirty block is evicted by a subsequent miss
#### Write Around (Write No Allocate)
- On a write miss, write the updated data to memory without bringing the block into the cache
- Relies on read misses to bring data into the cache
Useful Reference: https://www.jyt0532.com/2018/09/23/cache-mechanism/
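
The two write-miss policies are easy to contrast in a runnable toy model. Everything below (the dict-based cache/memory, the helper names) is invented for these notes rather than taken from the slides:
```python=
BLOCK = 32             # bytes per block
memory = {}            # block_address -> bytes (backing store)
cache = {}             # block_address -> [bytearray, dirty_bit]

def fetch(baddr):
    return bytearray(memory.get(baddr, bytes(BLOCK)))

def write_allocate(addr, value):
    baddr = addr - addr % BLOCK
    if baddr not in cache:                 # write miss: bring the block in
        cache[baddr] = [fetch(baddr), False]
    cache[baddr][0][addr % BLOCK] = value
    cache[baddr][1] = True                 # dirty; memory updated on eviction

def write_around(addr, value):
    baddr = addr - addr % BLOCK
    if baddr not in cache:                 # write miss: bypass the cache
        data = fetch(baddr)
        data[addr % BLOCK] = value
        memory[baddr] = bytes(data)        # update memory only
        return                             # a later read miss caches it
    cache[baddr][0][addr % BLOCK] = value  # write hit: update the cache copy
    cache[baddr][1] = True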
### Overhead Analysis
- The total number of bits in a direct-mapped cache:
$$
2^n\times (\text{block size} + \text{tag size} + \text{valid field size})
$$
- The size of the tag field
- $32 - (n+m+2)$
- $n$ bits are used for the index
- $m$ bits are used for the word offset within the block
- Formula
- $\text{Total cache size} = 2^n\times [(2^m\times 32) + (32-n-m-2) + 1]$
- $\text{Overhead}= \frac{\text{Total cache size}}{\text{Actual cache data}}=\frac{2^n\times [(2^m\times 32) + (32-n-m-2) + 1]}{2^n\times (2^m\times 32)}$

#### Example Problem
> How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks, assuming a 32-bit address?
- 4-word blocks ($\implies m=2$)
- Words to be cached:
- 16 KiB = 4096 words
- Total amount of blocks in the cache:
- 4096 / 4 = 1024 blocks ($\implies n=10$)
- 10 bits for entry index
- Tag size = 32 - 10 - 2 - 2 = 18 bits
- Total cache size = $2^{10}\times (4 \times 32 + 18 + 1)=147$ (Kib)
- Overhead = $\frac{147}{128}$
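
A quick numeric check of the example (the helper is written for these notes):
```python=
def cache_bits(n, m, addr_bits=32):
    """Total bits of a direct-mapped cache with 2**n blocks of 2**m words."""
    data_bits = 2 ** m * 32                # data per block
    tag_bits = addr_bits - n - m - 2       # 2 bits of byte offset
    return 2 ** n * (data_bits + tag_bits + 1)   # +1 for the valid bit

total = cache_bits(n=10, m=2)
print(total / 1024)                        # 147.0 Kibit in total
print(total / (2 ** 10 * 2 ** 2 * 32))    # overhead 147/128 ~= 1.148
```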
### Memory Performance
- Bandwidth
- Bus width, bus cycle time
- Number of channels
- Latency
- DRAM access latency
- Interconnect latency
### Associative Caches

#### Fully associative
- Allow a given block to go in any cache entry
> There is no index, so more tag bits must be stored.
> We also have to do sequential comparisons over the entries (expensive).
#### (n-way) Set associative
- Divide the cache into several sets; each block maps to exactly one set
- $\text{Set Index} = (\text{block address}) \mod (\text{number of sets in cache})$
- At most n comparisons (less expensive)
- **LRU replacement policy**
    - Keep the recently used data and evict what has not been used
#### For a fixed cache size, increasing the associativity decreases the number of sets

### How Much Associativity
- Increasing associativity **decreases miss rate**, but with **diminishing returns**
:::spoiler **image of diminishing returns**

:::
- e.g. Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
- 1-way: 10.3%
- 2-way: 8.6%
- 4-way: 8.3%
- 8-way: 8.1%
### Cache Size: Set-associative
> 4096 blocks, 4-word block size, 32-bit addressing
> Find the total number of tag bits for caches that are **direct-mapped** and **two-way set associative**.
- Direct mapped
- The number of entries in the cache = $2^{12} = 4096$
- 4 word = 16 bytes
- Tag size = 32 - 12 - 4 = 16
- Total number of tag bits = $16\times 2^{12}=64$ (Kib)
- 2-way set associative
- The number of entries in the cache = $2^{11} = 2048$
- 4 word = 16 bytes
- Tag size = 32 - 11 - 4 = 17
- Total number of tag bits = $2\times 17\times 2^{11}=68$ (Kib)
> One index bit fewer, hence one more tag bit per entry
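
The same derivation as a quick script (written for these notes):
```python=
def total_tag_bits(ways, blocks=4096, offset_bits=4, addr_bits=32):
    sets = blocks // ways
    index_bits = sets.bit_length() - 1     # log2 of the number of sets
    tag_bits = addr_bits - index_bits - offset_bits
    return blocks * tag_bits               # every block stores one tag

print(total_tag_bits(1) / 1024)   # direct mapped: 64.0 Kib
print(total_tag_bits(2) / 1024)   # 2-way set associative: 68.0 Kib
```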
:::spoiler Condensed notes by wasabi-neko
:::info
### Condensed notes
> Reference: https://hackmd.io/@drwQtdGASN2n-vt_4poKnw/H1U6NgK3Z?type=view
```graphviz
digraph {
rankdir=LR;
cpu [label="CPU"]
m [label="main memory"]
cpu -> m [label="addr"]
m -> cpu [label="data"]
rank=same; cpu; m;
}
```
```graphviz
digraph {
rankdir=LR;
cpu [label="CPU"]
c [label="cache"]
m [label="main memory"]
cpu -> c [label="addr"]
c -> cpu [label="data"]
c -> m [label="addr"]
m -> c [label="data"]
rank=same; cpu; m;
}
```
From the point of view of the CPU and main memory, what they do is completely unchanged.
We divide main memory into many **blocks** and cache one **block** at a time.
```graphviz
digraph {
node [shape="plaintext"];
rankdir=LR;
cache [label=<
<table>
<tr>
<td colspan="3" border="0">Cache</td>
</tr>
<tr>
<td>valid</td>
<td>Tag</td>
<td>Block data</td>
</tr>
<tr><td PORT="line0">1</td><td>01010011</td><td>1010101001</td></tr>
<tr><td PORT="line1">0</td><td>01010011</td><td>1010101001</td></tr>
<tr><td>0</td><td>01010110</td><td>1010101001</td></tr>
<tr><td>:</td><td>:</td><td>:</td></tr>
</table>
>];
l0 [label="line: 0" shape="plaintext"]
l0 -> cache:line0
l1 [label="line: 1" shape="plaintext"]
l1 -> cache:line1
}
```
**Cache line**: a line consists of a valid bit, a tag, and block data
**Block data**: the data copied down from main memory
**Tag**: the key used for lookups later on
### Cache mapping
When the CPU requests an address, we need some scheme to check whether the data at that address is among the data currently cached.
#### **Fully associative**
A main-memory address can be read as "byte y inside block x".
Use x as the tag; a lookup can then simply compare the tag of every line in a for loop.
```python=
def CPU_get_byte(addr):
    for line in cache:
        if line.valid and line.tag == addr.tag:
            return line.block_data[addr.block_offset]
    return Fetch_data_from_memory(addr)
```
##### example:
```graphviz
digraph {
rankdir=LR;
label="Assume:\n32-bit addressing arch,\nblock_size=32B"
m [shape="record" label="main memory|block 0|block 1|<b2>block 2|...|..."]
addr [shape="plaintext" label=<
<table border="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="10">
<tr><td colspan="2">address: 32-bit</td></tr>
<tr>
<td>000000000000000000000000010</td>
<td PORT="tail">00011</td>
</tr>
<tr>
<td>tag: 27 bit</td>
<td>block_offset: 5 bit</td>
</tr>
</table>
>]
addr:tail->m:b2
cache [shape="plaintext" label=<
<table>
<tr>
<td colspan="3" border="0">Cache</td>
</tr>
<tr>
<td>valid</td>
<td>Tag</td>
<td>Block data</td>
</tr>
<tr>
<td PORT="b2">1</td>
<td>000000000000000000000000010 </td>
<td>block 2</td>
</tr>
<tr>
<td>:</td><td>:</td><td>:</td>
</tr>
</table>
>];
addr:tail->cache:b2
}
```
Drawback: lookups are slow
#### **Direct mapped**
Map each block of main memory to one specific line.
```python=
def CPU_get_byte(addr):
    target_line = cache[addr.index]
    if target_line.valid and target_line.tag == addr.tag:
        return target_line.block_data[addr.block_offset]
    else:
        return Fetch_from_memory(addr)
```
##### example:
Suppose there are 512 lines in total. Then:
blocks 0, 512, 1024, ... map to line 0
blocks 1, 513, 1025, ... map to line 1
```graphviz
digraph {
rankdir=LR;
label="Assume:\
32-bit addressing arch,\
block_size = 32B\
Number of line/block = 512\n"
m [shape="record" label="main memory|block 0|...|<b514>block 514|...|..."]
addr [shape="plaintext" label=<
<table border="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="10">
<tr><td colspan="3">address: 32-bit</td></tr>
<tr>
<td>000000000000000001</td>
<td>000000010</td>
<td PORT="tail">00011</td>
</tr>
<tr>
<td>tag: 18 bit</td>
<td>index: 9 bit</td>
<td>block_offset: 5 bit</td>
</tr>
</table>
>]
cache [shape="plaintext" label=<
<table>
<tr>
<td colspan="3" border="0">Cache</td>
</tr>
<tr>
<td>valid</td>
<td>Tag</td>
<td>Block data</td>
</tr>
<tr>
<td>1</td><td>some tag</td><td PORT="b0_tail">some block data</td>
</tr>
<tr>
<td>1</td><td>some tag</td><td PORT="b1_tail">some block data</td>
</tr>
<tr>
<td PORT="b2">1</td>
<td>000000000000000001 </td>
<td PORT="b2_tail">block 514</td>
</tr>
<tr>
<td>:</td><td>:</td><td>:</td>
</tr>
</table>
>];
addr:tail->m:b514
addr:tail->cache:b2
cache:b0_tail -> index0 [dir="back"]
cache:b1_tail -> index1 [dir="back"]
cache:b2_tail -> index2 [dir="back"]
}
```
Drawback: several blocks share the same line, so they cannot reside in the cache at the same time.
#### **X byte N-way Set Associative**
> Combines the advantages of the two schemes above
Map each block of main memory to one specific set.
**Set**: a group of N lines.
A lookup therefore only has to iterate once over a single set to find the tag.
> A bit like a hash table
```python=
def CPU_get_byte(addr):
    target_set = cache[addr.index]
    for line in target_set:
        if line.valid and line.tag == addr.tag:
            return line.block_data[addr.block_offset]
    return Fetch_from_memory(addr)
```
:::
### Six basic cache optimizations
#### Larger block size
- Reduces compulsory misses
- Increases capacity and conflict misses; increases miss penalty
#### Larger total cache capacity to reduce miss rate
- Increases hit time, increases power consumption (more circuitry)
#### Higher associativity
- Reduces conflict misses
- Increases hit time, increases power consumption (more comparators)
#### Higher number of cache levels
- Reduces miss penalty and overall memory access time
(L3 - Last-level cache)
#### Giving priority to read misses over writes
- Reduce miss penalty
(Serve the read directly from the cache if the data is still available there, instead of writing back first and then reading)
#### Avoiding address translation in cache indexing
- Reduces hit time
### Causes of misses (4 Cs)
- **Compulsory**
- First reference to a block
    - The miss taken the first time a piece of data enters the cache (the cache starts out empty)
- **Capacity**
- Blocks discarded and later retrieved
    - e.g. in a fully associative cache: when space runs out, some block's data must be evicted, and that very data is needed again right afterwards
- **Conflict**
- Program makes repeated reference to multiple addresses from different blocks that map to the same location in the cache
    - e.g. several addresses map to the same block index, so they keep overwriting data that was already cached, forcing repeated writes and reads
- **Coherence miss**
    - The cache-coherence problem covered in OS courses
    - Suppose CPU A and CPU B share a datum D and CPU A modifies it; the copy of D in CPU B's cache must then be marked invalid, and the miss CPU B takes when it later tries to access D in its cache is a coherence miss
### Replacement Policy
- **Direct-mapped**
    - Replace the single candidate entry; there is no other choice
- **Set associative** (and fully associative)
    - If there is a non-valid entry, simply use it
    - Otherwise, a policy is needed to choose which entry to replace (the two below):
- **Least-recently used (LRU)**
- Choose the one unused for the longest time
- Simple for 2-way, manageable for 4-way (4-way: $4!=24$, need 5 bits)
- Beyond 4-way is too hard
- **Random**
    - With high associativity its performance is about the same as LRU
### Implement LRU Replacement
#### Pure LRU
- 4-way $\to$ use **6 bits** in practice (e.g. one bit per pair of ways); the minimum is 5 bits, enough to encode $4!=24$ orderings
#### Partitioned LRU (Pseudo LRU)
- Use a binary tree to maintain only (n-1) bits for n-way set associativity

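A minimal sketch of one possible tree pseudo-LRU encoding for a single 4-way set, using the $n-1=3$ bits mentioned above (written for these notes; real implementations vary):
```python=
class PLRU4:
    """Tree pseudo-LRU for one 4-way set: 3 bits form a binary tree.
    Each bit points toward the half holding the pseudo-LRU victim
    (0 = left, 1 = right)."""

    def __init__(self):
        self.root = 0    # which half (ways 0-1 vs 2-3) holds the victim
        self.left = 0    # which way within ways (0, 1)
        self.right = 0   # which way within ways (2, 3)

    def access(self, way):
        # Point every bit on the path *away* from the way just used.
        if way < 2:
            self.root = 1
            self.left = 1 - (way & 1)
        else:
            self.root = 0
            self.right = 1 - (way & 1)

    def victim(self):
        # Follow the bits down the tree to the way to replace.
        return self.left if self.root == 0 else 2 + self.right

plru = PLRU4()
for way in (0, 1, 2):
    plru.access(way)
print(plru.victim())   # 0, not 3: PLRU only approximates true LRU
```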
### Multilevel Caches
- **Primary Cache (L1)**
- 0.5-1 ns
- Attached to CPU
- Small but fast
- **Level-2 Cache (L2)**
- 5-10 ns
- Services misses from primary cache
- Larger, slower, but still faster than main memory
- **Main Memory**
- 25-75 ns
- Services L2 cache misses
- Huge miss penalty
- Some high-end systems include L3 Cache
### Multilevel Cache Terminology
- **Local miss rate**
- The miss rate of one hierarchy level by itself
- $\text{Local miss rate} = \frac{\text{# of misses at that level}}{\text{# of accesses at that level}}$
- e.g. Miss rate(L1), Miss rate(L2)
- **Global miss rate**
- The miss rate of a whole group of hierarchy levels
- e.g. $\text{Global Miss rate(L2)} = \text{Miss rate(L1)} \times \text{Local Miss rate(L2)}$
### Multilevel Cache Example
> CPU base CPI = 1, clock rate = 4GHz
> Miss rate per instruction = 2%
> Main memory access time = 100ns
:::spoiler In case you, like me, have forgotten what CPU Time, CPI, and clock rate are
:::info
#### CPU Time = $I \times CPI \times T$
- ${I}$ = number of instructions in program
- ${CPI}$ = average cycles per instruction
- ${T}$ = clock cycle time
- clock rate = $\dfrac{1}{T}$
[name=蛋蕉]
:::
- **With just primary cache (L1)**
- Miss penalty = $\frac{100 \times 10^{-9}}{0.25 \times 10^{-9}}=400$ (cycles)
- Effective CPI = $1 + 0.02 \times 400 = 9$
- **Add L2 cache**
> Access time = 5ns
> Global miss rate to main memory = 0.5%
- L1 miss, L2 hit
- Penalty = $\frac{5}{0.25}=20$ (cycles)
- L1 miss, L2 miss
- Extra penalty = $400$ (cycles)
- Effective CPI = $1 + 0.02 \times 20 + 0.005 \times 400 = 3.4$
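
Spelling the arithmetic out (plain Python, written for these notes):
```python=
cycle_ns = 1 / 4                     # 4 GHz clock -> 0.25 ns per cycle
mem_penalty = 100 / cycle_ns         # 400 cycles to main memory
l2_penalty = 5 / cycle_ns            # 20 cycles to L2

cpi_l1_only = 1 + 0.02 * mem_penalty
cpi_with_l2 = 1 + 0.02 * l2_penalty + 0.005 * mem_penalty
print(cpi_l1_only, cpi_with_l2)      # 9.0 3.4
print(cpi_l1_only / cpi_with_l2)     # ~2.6x speedup from adding L2
```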
### Effect of 2-level Caching
- L2 size usually much bigger than L1
    - Provides a reasonable hit rate
    - Lowers the miss penalty of the L1 cache
    - But the worst-case overall miss penalty (missing in both levels) goes up
- Inclusion property
    - Inclusive cache
        - L1 is a subset of L2: any data in L1 is also in L2
        - Simplifies the mechanism that keeps the data consistent
    - Exclusive cache
        - A block in L1 is not necessarily in L2
        - On an L2 miss, L1 must also be checked for a hit
### Dependability Measures
- **Reliability:** mean time to failure (MTTF)
    - How long, on average, until a failure occurs (e.g. memory returns no data / returns wrong data)
- **Service interruption**: mean time to repair (MTTR)
    - How long it takes to repair after a failure occurs
- Mean time between failures (MTBF)
- **MTBF = MTTF + MTTR**
- $\text{Availability} = \frac{\text{uptime}}{\text{total time}} = \frac{\text{MTTF}}{\text{MTBF}}$
- Improving Availability
- **Increase MTTF**: fault avoidance, fault tolerance (e.g. **RAID**), fault forecasting
- **Reduce MTTR**: improved tools and processes for diagnosis and repair
### Examples
> Some disks today are quoted to have a 1,000,000-hour MTTF (114 years)
> Suppose that a warehouse-scale system might have 50,000 servers.
> Each server has 2 disks.
> A disk's annual failure rate (AFR) = 8760 / 1,000,000 = 0.876%
> With 100,000 disks, we would expect 876 disks to fail per year, or **on average more than 2 disk failures per day**!
**In short**: a single disk looks like it fails only once every 114 years on average, but when the number of disks is very large (as in a data center or data warehouse), on average about two disks fail every day!
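
The arithmetic, spelled out:
```python=
mttf_hours = 1_000_000
afr = 8760 / mttf_hours              # annual failure rate = 0.876%
disks = 50_000 * 2                   # 50,000 servers x 2 disks each
print(disks * afr)                   # 876.0 failures per year
print(disks * afr / 365)             # ~2.4 failures per day
```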
### Hamming SEC code
- A kind of **Error Correction Code** (ECC)
- Fault detection / correction for **RAM**
- **Parity bit:** Single error detection
    - Parity check: `XOR` all bits of the data to produce one extra bit, which can detect whether the data contains an error (but only an **odd number** of bit errors)
#### Hamming Distance
- Number of bits that are different between two bit patterns
- If the minimum distance is 2, adding a parity check provides **Single Error Detection**
- If the minimum distance is 3, the code provides **Single Error Correction (SEC)**; one more parity bit (minimum distance 4) also gives **Double Error Detection (DED)**

#### Hamming Code
==An enlightening video==: [Hamming codes and error correction](https://www.youtube.com/watch?v=X8jsijhllIA)
$P_1$ covers the data bits whose position numbers have a 1 in the lowest bit of their binary representation: positions 1, 11, 101, 111, ... (in binary)
$P_2$ covers the data bits whose position numbers have a 1 in the second-lowest bit: positions 10, 11, 110, 111, ... (in binary)
And so on.
A faster way to obtain the check bits:
take the position (in the encoded string) of every data bit equal to $1$ and `XOR` all of those positions together.
There is an example below. (*)
If you take the positions of every $1$-bit in the *encoded* string and `XOR` them, you get an all-zero bit pattern.
If the result is not all zeros, the data is corrupted (or the Hamming code was computed incorrectly).
#### Example
> Assume one byte data value is $(10011010)_2$.
Generated encoded string:
| Bit 1 ($P_1$) | Bit 2 ($P_2$) | Bit 3 | Bit 4 ($P_4$) | Bit 5 | Bit 6 | Bit 7 | Bit 8 ($P_8$) | Bit 9 | Bit 10 | Bit 11 | Bit 12 |
|:-------------:|:-------------:|:-----:|:-------------:|:-----:|:-----:|:-----:|:-------------:|:-----:|:------:|:------:|:------:|
| 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
Hamming code for $(10011010)_2$ = $(0110)_2$
(*) 3 `XOR` 7 `XOR` 9 `XOR` 11 = $(0110)_2$
> Suppose the data is corrupted, we invert Bit 10.
> The data became: $(011100101110)_2$
Hamming code for $(011100101110)_2$ = $(1010)_2$
$\implies$ $P_8$ and $P_2$ are incorrect.
$\implies$ Bit 10 must be incorrect.
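The XOR-of-positions check is easy to reproduce in code (helper written for these notes):
```python=
from functools import reduce

def syndrome(bits):
    """XOR the 1-based positions of all 1-bits in the encoded string."""
    return reduce(lambda a, b: a ^ b,
                  (i for i, b in enumerate(bits, start=1) if b == '1'), 0)

encoded = '011100101010'        # the encoded string from the table
print(syndrome(encoded))        # 0 -> no error detected

corrupted = '011100101110'      # bit 10 flipped
print(syndrome(corrupted))      # 10 (0b1010) -> bit 10 is the bad one
```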
#### How many bits are needed?
The original data is $d$ bits long in binary, and we need $p$ bits in total to serve as parity bits.
A single-bit error can occur in any of the $p + d$ positions:
$\implies$ $2^p \geq p + d + 1$
(the extra $+1$ accounts for the "no error" case)
$\implies$ $p \geq \log_2(p + d + 1)$
e.g. The data is 8 bits $\implies d=8$
Solving the inequality $\implies p=4$
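Solving the inequality by brute force (helper written for these notes):
```python=
def parity_bits_needed(d):
    """Smallest p with 2**p >= p + d + 1."""
    p = 1
    while 2 ** p < p + d + 1:
        p += 1
    return p

print(parity_bits_needed(8))    # 4, matching the example above
```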
### Virtual Machines
Simulate a computer (and its operating system) on top of another computer.
### Virtual Memory
- Use main memory as a cache for secondary disk storage
- Each process can translate virtual addresses into real memory addresses only through its own page table
- CPU and OS translate virtual addresses to physical addresses
### Page Table
- Array of page table entries, indexed by virtual page number
- Memory is split into fixed-size **pages** (e.g. 4KB)
- Each process has its own page table
- The page offset of a virtual address is always the same size as the offset of the physical address (this is what lets processes believe they really own that memory)
    - With 4KB pages, for example, the page offset is 12 bits
- **If the page is in memory** (valid bit = 1), the PTE (Page Table Entry) stores the physical page number (plus some other metadata: referenced bit, dirty bit, ...)
- If it is not (valid bit = 0), the PTE stores the location of that data on disk
- **Page table register**: a register that points to the start of the page table

==A great video that explains this very clearly==: [Virtual Memory: Page Tables](https://www.youtube.com/watch?v=KNUJhZCQZ9c)
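
A minimal sketch of the lookup itself, assuming 4 KiB pages as above (all names invented for these notes):
```python=
from dataclasses import dataclass

class PageFault(Exception):
    pass

@dataclass
class PTE:
    valid: bool
    ppn: int            # physical page number (or disk location if invalid)

PAGE_OFFSET_BITS = 12   # 4 KiB pages -> the low 12 bits are the offset

def translate(vaddr, page_table):
    """Split the VA, look up the PTE, splice the offset back on."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    pte = page_table[vpn]
    if not pte.valid:
        raise PageFault(vpn)        # page is on disk; the OS brings it in
    return (pte.ppn << PAGE_OFFSET_BITS) | offset

page_table = {1: PTE(valid=True, ppn=5)}   # virtual page 1 -> physical page 5
print(hex(translate(0x1ABC, page_table)))  # 0x5abc
```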
### Page Fault Penalty
- On page fault, the page must be fetched from disk
- Takes millions of clock cycles
- Handled by OS
- Try to minimize page fault rate
- Fully associative placement (because the miss penalty is so high)
- Smart replacement algorithms
### Mapping Pages to Storage
- If the valid bit is 1, the page is present in memory
- If it is 0, the page resides on disk (page fault)

### Replacement and Writes
- To reduce page fault rate, prefer least-recently used (LRU) replacement
- **Reference bit** (a.k.a use bit) in PTE set to 1 on access to page
- Periodically cleared to 0 by OS
- A page with reference bit = 0 $\implies$ not been used recently
- Writing to non-volatile storage (such as disk) is too costly
    - **`Write-through`** is impractical; use **`Write-back`** instead (see the earlier Write Policies section)
- **Dirty bit** in PTE will be set when page is written
### Fast Translation Using a TLB (Translation Lookaside Buffer)
- A naive translation requires two memory accesses
    - 1. Use the PTE to obtain the physical address
    - 2. Then use the physical address to fetch the data actually needed
- But accesses to the page table should themselves have good locality to exploit:
    - Temporal locality: a frequently accessed page is likely to be accessed again soon
    - Spatial locality: the pages around a frequently accessed page are also likely to be accessed
- Hence the **TLB** was invented to speed up translation: a cache for address translations
    - Usually a full lookup table (like a fully associative cache)
    - Typical size: 16 - 512 PTEs
    - Hit time: 0.5 - 1 cycle
    - Miss penalty: 10 - 100 cycles
    - Miss rate: 0.01% - 1%

- Valid bit = 1, Reference bit = 1 $\implies$ **TLB Hit!**
### TLB Misses
- If the page is in memory
    - Fetch the physical address from the page table and write the translation back into the TLB
    - Retry the access; obtaining the physical address now hits in the TLB
- If the page is not in memory (page fault)
    - Handled by the OS: bring the page back from disk and update the physical address in the page table
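
The same flow with a TLB in front, as a sketch (reusing the hypothetical PTE / PageFault / PAGE_OFFSET_BITS from the page-table sketch earlier):
```python=
def translate_with_tlb(vaddr, tlb, page_table):
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in tlb:                      # TLB hit: 0.5 - 1 cycle
        ppn = tlb[vpn]
    else:                               # TLB miss: 10 - 100 cycles
        pte = page_table.get(vpn)
        if pte is None or not pte.valid:
            raise PageFault(vpn)        # page fault: handled by the OS
        ppn = pte.ppn
        tlb[vpn] = ppn                  # write the translation back to the TLB
    return (ppn << PAGE_OFFSET_BITS) | offset
```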
### Address Translation Overview
- In VM, memory acts like a cache for disk
- TLB acts like a cache for page table

### TLB and Cache Interaction
The figure looks complicated, but it simply shows the VA being translated into a PA through the TLB, and the PA then being used to fetch the data from the cache.


### **Page Table valid bit** and **TLB valid bit**
- Page Table valid bit
    - 0: data on disk
    - 1: data in memory
- TLB valid bit
    - When a new value is loaded into the **page table register** $\to$ the TLB valid bits are set to 0
    - Simply put: once we switch to a different page table, the entries in the TLB are no longer correct, so valid bit = 0
### Cache Design Trade-offs
| Design change | Effect on miss rate | Negative performance effect |
|:---------------------- |:---------------------------- |:---------------------------- |
| Increase cache size | Decrease **capacity misses** | May increase **access time** |
| Increase associativity | Decrease **conflict misses** | May increase **access time** |
| Increase block size | Decrease **compulsory misses**, may also ==decrease miss rate== due to **spatial locality** | Increase **miss penalty**, for large block size, may ==increase miss rate== due to **pollution** |
### Cache Coherence Protocols
- Designed to solve the cache-coherence problem
- Snooping protocols
- Each cache monitors bus reads/writes
- Directory-based protocols
- Caches and memory record sharing status of blocks in a directory
#### Snooping Solution
- Send all requests for data to all processors
- Requires **broadcast**, since caching information is in processors
- Dominates for small-scale machines (with few cores)
- Not feasible with **clustering** (processors are not nearby; they communicate over a network)
    - **Broadcast** may incur high latency
#### Directory-based Schemes
- Keep track of what is being shared in 1 centralized place
- Distributed memory (for scalability)
- Send point-to-point requests to processors via **network**
### Basic Snoopy Protocols
#### Write Invalidate Protocol
- As soon as a datum is updated, mark the copies of it in every other cache as invalid
- Multiple readers, single writer
- On a read miss:
    - Write-through: memory always holds the up-to-date data
    - Write-back: the snoop must locate the most recent copy (the only valid one)
#### Write Broadcast (Update) Protocol
- As soon as a datum is updated, it is broadcast on the bus, and each processor updates the corresponding copy in its own cache
- On a read miss the data is guaranteed to be up to date (because every write is broadcast)