# 中央資工 - Computer Organization (Part 2)
:::info
If you're one of the unlucky juniors stuck taking this course, a three-second moment of silence for you.
These notes were made by a group of people studying Computer Organization to the breaking point; an exam this hard shouldn't be suffered alone.
(Some parts are incomplete, sorry about that.)
> Signing off after scoring 85 [name=BendTail]
:::
https://hackmd.io/@U86qbAM7Qp6EYJchUWA5SA/rk-_yUyXC/edit
I'm too lazy to merge the two.


# Ch.5 Large and Fast: Exploiting Memory Hierarchy
## Locality
### Temporal Locality
If an item was accessed recently, it is likely to be accessed again soon
:::info
Ex: the instructions inside a loop, a loop's induction variable
:::
### Spatial Locality
If an item was accessed recently, the items near it are likely to be accessed soon
:::info
Ex: array data, sequentially executed instructions
:::
:::success
Ex:
```
for (int i = 0; i < 8; i++) {
    for (int j = 0; j < 8000; j++) {
        A[i][j] = A[j][i] + B[i][0];
    }
}
```
Temporal: i, j, B[i][0]
Spatial: A[i][j] (with row-major storage, only A[i][j] steps through adjacent elements)
:::
### Dinner(?
* Pros: big portions that fill you up, cheap, durable
* Cons: low performance, power-hungry
## Memory Hierarchy
* Copy recently accessed data (and the data around it) from disk to DRAM
* Copy even more recently accessed data (and the data around it) from DRAM to SRAM (e.g. the L1 cache on the CPU)
### Block
The basic unit of copying (possibly several words) **(spatial locality)**
The unit of transfer from memory to cache and between cache levels
### Hit
The requested data is present in the upper level and can be accessed directly
:::info
$\text{Hit ratio} = \frac{\text{hits}}{\text{accesses}}$
:::
### Miss
The requested data is not in the upper level
It must be copied up from the level below, then accessed
#### Miss Penalty
The time it takes to bring a block from the lower level into the upper level
:::info
$\text{Miss ratio} = 1 -\text{Hit ratio}$
:::
## Memory Technology
* Static RAM(SRAM)
* Dynamic RAM(DRAM)
* Magnetic disk

## DRAM
* Used as main memory
* The charge state of a capacitor represents the 0 or 1
* A single transistor is used per cell
* Must be refreshed at a fixed interval (64 ms), or the charge in the capacitor leaks away
    * Refresh = read the contents and write them back
    * Done one row at a time
### Advanced DRAM Organization
* Bits in a DRAM are organized as a rectangular array
* First activate the row: the DRAM accesses an entire row, then finds the target column
* Accesses (reads/writes) happen one row at a time; every row has its own address, and a row is typically 32 or 64 bits
* Moving a row's data takes (data size / row size) repetitions → optimized with burst mode
* Burst Mode
    * Besides the addressed location, it also accesses the neighboring 8-16 entries (depending on the bandwidth) and keeps them in an internal row buffer, cutting down addressing time
    * Reduces latency
#### Double Data Rate(DDR) DRAM
Transfers data on both the rising and falling clock edges
#### Quad Data Rate(QDR) DRAM
Separate DDR inputs and outputs

### DRAM Performance Factors
#### Increase Row Buffer size
* Acts as a cache inside the DRAM: look for the target here first; only on a miss activate a row
* Several words can be read and refreshed at the same time (in parallel)
* Faster data access
#### Synchronous DRAM(SDRAM)
* Allows for consecutive accesses in bursts without needing to send each address(Burst mode)
* Improves bandwidth
#### DRAM Banking
* Banks can be accessed simultaneously (each bank operates independently)
* Improves bandwidth
### Burst Mode Support
:::info
Ex:

> Burst mode activates all the columns up front, saving the time of activating them one by one [name=Professor]
> Reduced access overhead: multiple data words are transferred in a single burst.
:::
## Increasing Memory Bandwidth
bandwidth = data transferred / time taken


:::success
Widening the bus is very costly; banks raise bandwidth at much lower cost
:::
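A worked instance with assumed numbers (the classic textbook setup: 1 bus cycle to send the address, 15 bus cycles per DRAM access, 1 bus cycle per word returned, 4-word blocks):
* one-word-wide bus, single bank: miss penalty = 1 + 4 × 15 + 4 × 1 = 65 bus cycles
* four interleaved banks on the same one-word bus: 1 + 15 + 4 × 1 = 20 bus cycles, since the four DRAM accesses overlap and only the word transfers stay serialized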
## Flash Storage
* Nonvolatile semiconductor storage
* 100 to 1000 times faster than disk
* Smaller, lower power, more robust
* But the cost per GB is higher than disk's (though lower than DRAM's)
### NOR flash
* Random read/write access
* Pro: fast random-access reads
* Con: bit-alterable, but writes are slow
:::info
Ex: instruction memory in embedded systems
:::
### NAND flash
* Block(page)-at-a-time access
* Lower cost
* Con: every write has to rewrite a whole block
:::info
Ex: USB, media storage
:::
:::warning
A flash bit wears out after on the order of 1000 writes (so data must be remapped to less-used blocks → wear leveling)
> If you have a 1 TB SSD, it may be dead after roughly 1000 TB of writes
:::
## Disk
* Nonvolatile, rotating magnetic storage
* Move the head to the right track
* Rotate the platter to the right sector
### Sector
* The basic unit of a disk
* Addressed by sector ID (the sector ID doesn't consume sector data space)
>Think of the disk as a one-dimensional array: you need an index to access the data [name=Professor]
* 4096 bytes is now the mainstream sector size
#### Error correcting code(ECC)
XOR two copies of the data; any 1 in the result signals an error (bits that differ XOR to 1).
Split the data to be stored into two parts and XOR them; that result is the old parity (ECC). On every read, recompute a new parity and compare it (XOR) with the old one: a resulting 1 (old ≠ new) marks the error bit, which the ECC can then recover.
#### Latency
1. Time the job spends waiting in the queue
2. Average seek time: the latency of moving the head (inward or outward)
3. Rotational latency: the latency of spinning the platter (on average, half the time of one full rotation)
4. Data transfer
5. Controller overhead (fixed per operation)

### Disk drives include caches
A single access has very long latency, so a cache is needed to improve performance
## Cache Memory
First check whether the wanted data is in the cache; if not, a cache miss event is triggered and the data is fetched from the memory level below
:::danger
How do we know whether the data is in the cache?
* See whether the memory address is present in the cache
Where do we look?
* Go to the corresponding index and compare the tag
How do we read it?
> I don't know, stop testing me on it [name=the one who wants to drop out]
:::
### Direct Mapped Cache
* A memory address maps to a cache entry
* Mapping: $(\text{Block address})\ \ mod\ \ (\#\text{ of Blocks in cache})$
* Looking at that one cache entry tells you whether the memory address is present
* $\#\text{ of Blocks}$ is a power of 2
:::danger
Problem: memory is byte-addressable, but the cache is block-addressable
:::
:::success
$\lfloor\frac{\text{address}}{\text{unit size}}\rfloor$
Ex: converting a 32-bit byte address to a word address (unit size $2^2$)
→ $\lfloor\frac{\text{address}}{2^2}\rfloor$
:::
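Putting the byte→block conversion and the mod mapping together, a minimal C sketch; the geometry (32-bit addresses, 16-byte blocks, 64 blocks) is an assumption for illustration:
```c
#include <stdint.h>
#include <stdio.h>

/* Split a byte address into (tag, index) for a direct-mapped cache.
 * Assumed geometry: 16-byte blocks (2^4) and 64 blocks (2^6). */
int main(void) {
    const uint32_t addr        = 0x12345678;            /* any byte address */
    const uint32_t offset_bits = 4;                     /* 16 B per block   */
    const uint32_t index_bits  = 6;                     /* 64 blocks        */

    uint32_t block = addr >> offset_bits;               /* byte -> block address */
    uint32_t index = block & ((1u << index_bits) - 1);  /* block mod (#blocks)   */
    uint32_t tag   = block >> index_bits;               /* remaining higher bits */

    printf("index = %u, tag = 0x%x\n", index, tag);
    return 0;
}
```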
### Tags
The lower bits select the entry; then the higher bits are checked for a match
So an entry only needs to store the higher bits → the **tag**
### Valid bits
The usability state of the data
Valid bit = 1 if the data is present, otherwise 0
If the valid bit is 0, the data must be loaded regardless of whether the tag matches
## Four Basic Questions
- Consider access of levels in a memory hierarchy:
- For memory: byte addressable
- For caches: use a block (also called a line) as the unit of data transfer, satisfying the Principle of Locality. But blocks must then be located (not byte addressable)
- Cache design
- Block Placement
- Block Identification
- Block Replacement
- Write Strategy
## Larger Block Size

So how is it computed? First of all, the problem says 16 bytes/block; since $16 = 2^4$, we need 4 bits for the offset. Next, there are 64 blocks, and $64 = 2^6$, so 6 bits are needed to represent all possible block numbers (0 to 63). Lastly, we assume the total address size is 32 bits (~~We just know.~~). To get the tag size, subtract the index bits and the offset bits: 32 - 6 - 4 = 22.
## Block Size Considerations
- Thanks to spatial locality, a larger block size should lower the miss rate
- But a larger block size means fewer blocks, causing more collisions and raising the miss rate
- A larger block size also causes pollution → unneeded data gets dragged in (each fetch grabs a big chunk)
- Larger miss penalty (more data to transfer per miss)
## Cache Misses
- On a hit, the CPU runs normally
- On a miss
    1. stall the CPU pipeline
    2. fetch the block from the next level of the hierarchy
- Block size affects the miss rate:
- Below 64-byte blocks, spatial locality pulls the miss rate down; past 64 bytes, cache pollution pushes it back up
> I'm broken [name=海熊]

* Pushes it down: spatial locality
* Pushes it up: cache pollution, fewer entries
### Instruction Miss
restart instruction fetch
### Data Cache Miss
complete data access
## Write Policy
<!-- I'm going to type up what I heard in class uwuouo -->
<!--👍👍👍-->
### Write through
On a data-write hit, we not only update the block in the cache but also write the data through to memory.
<!-- how should this be laid out? I'll handle it uwuowo-->
### Write back
- Uses a dirty bit
    - dirty bit = 1 means the data has been modified
- When a dirty block is replaced
    - Write it back to memory
    - A write buffer can be used so the replacing block can be read in first
> I'm not clean anymore
> You little dirty boi
<!-- sorry for the Chinglish, I'm very sorry-->
#### Write Buffer
Since writing through takes extra time, a write buffer can reduce the cost (data headed back to memory is parked in the buffer, and the CPU moves on to other work).
### Write Allocation, Write Around
On a write miss (the data we want to store is not in the cache). A toy sketch of these combinations follows the list below.
* **Write Allocation:** when a write miss occurs in the cache, a cache line is allocated and the block is fetched into the cache. The write is then performed on the cache.
* **Write Around:** when a write miss occurs, the data is written directly to main memory, bypassing the cache. The cache is neither updated nor allocated.
- With write-through
    - allocate on miss: fetch the block (write after it lands in the cache)
    - write around: don't fetch the block (write directly in memory)
- With write-back
    - usually fetch the block (write allocate)
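A toy sketch in C of the four combinations, with one cache line, word-sized blocks, and invented names; an illustration of the policies, not a real cache controller:
```c
#include <stdio.h>

enum { WRITE_THROUGH, WRITE_BACK };
enum { WRITE_ALLOCATE, WRITE_AROUND };

int memory[16];                                        /* tiny "main memory" */
struct Line { int valid, dirty, addr, data; } line;    /* one cache line     */

void cache_write(int policy, int alloc, int addr, int v) {
    if (line.valid && line.addr == addr) {             /* write hit */
        line.data = v;
        if (policy == WRITE_THROUGH) memory[addr] = v; /* also update memory */
        else                         line.dirty = 1;   /* write-back: defer  */
    } else if (alloc == WRITE_ALLOCATE) {              /* miss: fetch block  */
        if (line.valid && line.dirty)                  /* evict dirty victim */
            memory[line.addr] = line.data;
        line.valid = 1; line.dirty = 0;
        line.addr = addr; line.data = memory[addr];
        cache_write(policy, alloc, addr, v);           /* retry: now a hit   */
    } else {                                           /* write around       */
        memory[addr] = v;                              /* bypass the cache   */
    }
}

int main(void) {
    cache_write(WRITE_BACK, WRITE_ALLOCATE, 3, 42);
    printf("cache=%d mem=%d dirty=%d\n", line.data, memory[3], line.dirty);
    /* -> cache=42 mem=0 dirty=1: the new value waits in the cache */
    return 0;
}
```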
:::info
Here's a brutal homework problem (HW5, problem 6) (the numbers differ from the assignment, but the concept is the same)


> My own understanding; corrections welcome
>
> (a) Read bandwidth = instruction bandwidth + data read bandwidth
> Instruction bandwidth is intuitive: work out the instructions per cycle; each instruction misses with probability 0.3%, and a miss fetches one block, so it's block size × miss rate × IPC = 64 × 0.003 × 0.5
> Data read bandwidth is the tricky part, because it counts not only reads but writes too (surprise!): a write miss also has to read the block from memory,
so it's block size × data-cache miss rate × IPC × (fraction of instructions that write + fraction that read)
Write bandwidth: since the cache is write-through, every write goes to memory whether or not it misses, so there is no miss-rate factor: just word size × fraction of instructions that write × IPC
>
> (b) Read bandwidth works as in (a). Write bandwidth: with write-back, data is written to memory only on a cache miss, so it's block size × miss rate × IPC × (fraction of writes + fraction of reads) × probability the block is dirty [name=BendTail ]
> 
> Because write-through writes anything entering the cache straight back to memory, and the smallest unit is one word, every word written into the cache is immediately written to memory as well [name=icy]
>> Who the hell would know that, it makes zero sense [name=BendTail ]
>>
Stop reading this
:::
### Cache Size
Assume there are $2^n$ blocks in the cache: n bits are used for the index.
Assume there are $2^m$ words in a block: m bits select the word, and 2 additional bits are used as the byte offset.
>Where that 2 comes from: the CPU works in words while memory works in bytes, so two extra bits are needed to convert the block address into bytes (the offset within the word) [name=Professor]
>Who on earth counts it this way; the block-to-byte unit conversion already accounts for it = =
### Overhead Analysis
- Number of bits in a direct-mapped cache:
$2^n \times (\text{block size} + \text{tag size} + \text{valid field size})$
- The size of the tag field (tag size):
$32 - (n + m + 2)$
    - n bits are used for the index
    - m bits select the word within the block
m bits被用在表示block中的word
:::success
$\text{Overhead}= \frac{\text{Total cache size}}{\text{Actual cache data}}$
:::
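A worked instance with assumed numbers: $n = 10$ (1024 blocks), $m = 2$ (4-word = 128-bit blocks), 32-bit addresses. Tag size $= 32 - (10 + 2 + 2) = 18$ bits, so the total is $2^{10} \times (128 + 18 + 1) = 147$ Kibits for $2^{10} \times 128 = 128$ Kibits of actual data: overhead $= 147/128 \approx 1.15$.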

>My note-taking partner is down; 魏伯儒, wake up [name=BendTail]
>I'm up... wait, why is class over [name=海熊]
## Memory Performance
### Bandwidth
Affected by
- Bus width
- Bus cycle time
- Number of channels (each memory slot can operate independently)
- bandwidth = total data size / number of bus cycles (the miss penalty)
### Latency
- DRAM access latency
- Interconnect latency (banks)
## Main Memory Supporting Cache
DRAM is the mainstream choice today
- Connected by a fixed-width clocked bus
- The memory bus clock is usually slower than the CPU clock

## Measuring Cache Performance
:::info
**A quick review**
$\text{CPU Time}=\frac{\text{Instructions}}{\text{Program}}\times\frac{\text{Clock Cycles}}{\text{Instructions}}\times\frac{\text{Seconds}}{\text{Clock Cycle}}$
$=\text{IC}\times\text{CPI}\times\text{Cycle Time}$
:::
IC=Instruction Count
>Sneaky update; good luck with the layout [name=海熊]
>Love it ❤️ [name=BendTail]
CPU time is made up of
1. the program's execution cycles (includes cache hit time)
2. memory stall cycles (due to cache misses)
$\text{Memory Stall Cycles}=\frac{\text{Memory accesses}}{\text{Program}}\times\text{Miss rate}\times\text{Miss penalty}$
$=\frac{\text{Instructions}}{\text{Program}}\times\frac{\text{Miss rate}}{\text{Instruction}}\times\text{Miss penalty}$

> Why put up another identical figure? That LaTeX took me ages :( [name=broken 海熊]
Example:
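A worked instance with assumed numbers: I-cache miss rate 2%, D-cache miss rate 4%, miss penalty 100 cycles, loads/stores 36% of instructions, base CPI 2. Memory stall cycles per instruction $= 0.02 \times 100 + 0.36 \times 0.04 \times 100 = 3.44$, so the actual CPI $= 2 + 3.44 = 5.44$.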

## Average Memory Access Time (AMAT)
$\text{AMAT}=\text{Hit time} + \text{Miss rate} \times\text{Miss penalty}$
(on average, how many cycles one memory access takes)
i.e. how much time fetching data costs
>So CPU performance is measured with CPU time rather than AMAT?
That is, we can skip AMAT and just use CPU time? [name=someone from the rescue group]
> AMAT only tells you how long fetching data takes on average;
CPU time is what measures the CPU performance for a process [name=李昱成]
> Sage... so well said... (hands over 作業七.pdf-*) [name=Eason Hsu]
<!-- I don't know how to type a plus sign --><!-- just type + -->
Example:
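With assumed numbers: hit time = 1 cycle, miss rate = 5%, miss penalty = 20 cycles → $\text{AMAT} = 1 + 0.05 \times 20 = 2$ cycles per access.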

//I don't know what the rest of this is about
## How to Lower the Miss Rate? - Associative Caches
### Fully Associative cache
* Any entry may hold the target data
* Every entry has to be searched and compared
:::warning
May increase the hit time
:::
:::info
> It's really just the n-way set associative below with only one set [name=BendTail]
> one-way set associative = direct mapped [name=eric]
:::
### n-way set associative
* Split the cache into sets, each holding n entries
* There are n possible locations to search for a given block
* $(\text{Block number})\ \ mod\ \ (\#\text{sets in cache})$
* LRU (Least Recently Used) policy: when an entry must be evicted, evict the one untouched the longest, i.e. keep recently accessed data
:::info
> Plain direct mapped gives each index a single slot; n-way gives each index n slots. Find the corresponding set first, then search that set for a matching tag (this fixes the case where several frequently accessed addresses share the same remainder) [name=BendTail]
:::

// searching one location vs. n locations vs. every location

Fewer sets → fewer index bits needed
:::info
Example:

**Direct Mapped:**
We have 4096 blocks → 4096 sets.
index = log_2(4096) = 12 bits
offset = 4 bits (4-word = 16-byte blocks)
Tag: 32 - index - offset = 32 - 12 - 4 = 16 bits
Total tag storage = 4096 × 16 bits = 8192 bytes
**Two-way:**
We have 4096 blocks → 4096/2 = 2048 sets.
index = log_2(2048) = 11 bits
offset = 4 bits
Tag: 32 - index - offset = 32 - 11 - 4 = 17 bits
Total tag storage = 2048 × 2 × 17 bits = 8704 bytes
**Four-way:**
We have 4096 blocks → 4096/4 = 1024 sets.
index = log_2(1024) = 10 bits
offset = 4 bits
Tag: 32 - index - offset = 32 - 10 - 4 = 18 bits
Total tag storage = 1024 × 4 × 18 bits = 9216 bytes
**Fully-associative:**
We have 4096 blocks → 1 set, because it's fully associative. Surprise (though you really should have seen it coming).
index = 0 bits (there is only one set)
offset = 4 bits
Tag: 32 - index - offset = 32 - 0 - 4 = 28 bits
Total tag storage = 4096 × 28 bits = 14336 bytes
:::
------
:::spoiler Chat room(?
> Honestly I understand none of it [name=David-Teng]
> Same as above [name=v0]
> And 羅宣祐, feel free to keep grinding on your own instead of helping with the hackmd [name=fake mrwei (v0)]
> First of all, I fell asleep [name=海熊]
> I get it now; filled it in above [name=BendTail]
:::
------


## Memory Hierarchy Basics
### Six basic cache optimizations
#### Larger block size
* Brings more data into the cache per transfer
* Reduces compulsory misses
* Increases capacity and conflict misses, and the miss penalty
#### Larger total cache capacity
* Reduces the miss rate
* Increases hit time and power consumption
#### Higher associativity
* Reduces conflict misses
* Increases hit time and power consumption
#### Higher number of cache levels
* Reduces the miss penalty and the overall memory access time
#### Giving priority to read misses over writes
* Reduces the miss penalty
#### Avoiding address translation in cache indexing
* Reduces hit time

## Types of Misses
### Compulsory Miss
Caused by the first access to a block (improved with larger blocks)
### Capacity Miss
Previously cached data was discarded and then requested again (the cache's usable size is too small; improved with a larger cache)
### Conflict Miss
Several addresses map to the same block index, so they keep overwriting each other's cached data and force repeated refetches (the n of the n-way cache is too small)

> Pictures from here on because I find them easier to understand, ~~definitely not because I can't keep up anymore~~, and I want to sleep ZZZ [name=v0]
>Why the hell did 鄧博元 ditch us to go eat lunch
>Sorry [name=david]
>Gentlemen, good morning. I'll come back and fill this in later; jumping in mid-way I understand nothing (yay, it's filled in now) [name=BendTail]
## Replacement policy
### Direct mapped
* no choice
### Set associative
* First check whether a non-valid entry exists
* Otherwise use another policy such as FIFO, random, or LRU
#### Least Recently Used (LRU)
* Pick the entry that has gone unused the longest
#### Random
* Saves bits
* ~~Bogo replacement~~
## LRU

### Pure LRU
* One bit per pair of entries: taking the first bit as an example, it is TRUE (1) if entry 1 comes before entry 2 in the recency order, otherwise FALSE (0)
* The full ordering of the entries can be decoded from the bits
* At least 5 bits are needed because $2^5 = 32 > 4! = 24$
* It actually uses 6 bits because there is one bit per pair of entries, $\binom{4}{2} = 6$ (and because it's a jerk, probably)
### Partition LRU (Pseudo-LRU)
* A 0 sends the search left, a 1 sends it right
* A bit flips after it has been traversed (see the sketch below)
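A minimal C sketch of the tree pseudo-LRU for one 4-way set, following the note's convention (0 = go left, 1 = go right, and a traversed bit flips); the bit layout and function names are our own assumptions:
```c
#include <stdio.h>

/* 3 tree bits for 4 ways: b0 = root, b1 covers ways 0/1, b2 covers ways 2/3. */
static int b0, b1, b2;

/* Pick the victim by walking the tree, flipping each bit we pass. */
int plru_pick_victim(void) {
    int way;
    if (b0 == 0) { way = (b1 == 0) ? 0 : 1; b1 ^= 1; }  /* go left  */
    else         { way = (b2 == 0) ? 2 : 3; b2 ^= 1; }  /* go right */
    b0 ^= 1;
    return way;
}

/* On an access, point every bit on the path away from the touched way,
 * so it will not be the next victim. */
void plru_touch(int way) {
    b0 = (way < 2) ? 1 : 0;
    if (way < 2) b1 = (way == 0) ? 1 : 0;
    else         b2 = (way == 2) ? 1 : 0;
}

int main(void) {
    plru_touch(0); plru_touch(2); plru_touch(1);
    printf("victim = %d\n", plru_pick_victim());  /* prints 3, the true LRU here */
    return 0;
}
```
Note that pseudo-LRU only approximates LRU: for some access sequences the 3 bits evict a way that true LRU (which needs 5-6 bits) would have kept.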

> Prove that Partition LRU actually works [name=孫敏德]
<!--Try not to use DC, I'm afraid the images will die; just copy the image and paste it directly here by Jasper-->
<!--He just said the tablet can't paste directly by V0-->
<!--Then copy it in DC and paste it over here, that works too by Jasper-->

## Multi-level Cache Terminology
A multi-level cache is designed to reduce the miss penalty: when L1 misses, L2 almost always contains the data, so L1's miss penalty is set by the access time of L2 (SRAM), which is hundreds of times faster than reading main memory (DRAM).
L1 and L2 are designed around different priorities.
L1's hit time determines the CPU clock cycle time, so it should be kept small.
1. Use split caches (separate instruction and data), doubling the bandwidth, to lower the hit time.
2. L1 may use either write-back or write-through, since L1 and L2 are both SRAM and both fast.
L2 focuses on the miss rate: it has to be large enough to cover most of the data that will be needed.
1. Use a combined (unified) cache, which increases locality of reference, to lower the miss rate.
2. Directly below L2 is main memory, and SRAM is vastly faster than DRAM, so L2 uses write-back.
:::spoiler sorry
>Thanks, 羅 the carry
>>Keep grinding
>>>That's you
>>>>Speaking of him
>>>>>So many layers of quotes
>>>>>>yee
>>>>>>>for sure
>>>>>>>>Can't finish the material, I wish I were a giant
:::
### Local Miss Rate
* Each level's miss rate computed over its own accesses (Miss rate(L1), Miss rate(L2))
* L2's local miss rate is usually higher than L1's, since L2 only sees the accesses that already missed in L1
>ChatGPT says L2's is higher [name=海熊]
>Well, your mom says L1's is higher [name=your mom]
### Global miss rate
:::info
$\frac{\text{accesses that continue on to the lower level}}{\text{total accesses}}$
:::
* Usually the product of each level's local miss rate
* global L2 miss rate = L1 miss rate × L2 local miss rate
* The lower it is, the less often main memory must be accessed
* If there's an L3: Miss Penalty(L2) = Hit time(L3) + Miss rate(L3) × Miss Penalty(L3). Otherwise, the penalty is the memory access time

> CPU base CPI = 1, clock rate = 4GHz
> Miss rate per instruction = 2%
> Main memory access time = 100ns
:::spoiler In case you, like me, forgot what CPU time, CPI, and clock rate are
#### CPU Time = $I \times CPI \times T$
- $I$ = number of instructions in program
- $CPI$ = average cycles per instruction
- $T$ = clock cycle time
- clock rate = $\dfrac{1}{T}$

[name=蛋蕉] who even is that
:::
- **With just primary cache (L1)**
- T $= \frac{1}{4\text{ GHz}} = 0.25 \times 10^{-9}$ seconds per cycle
- Miss penalty = $\frac{100 \times 10^{-9}}{0.25 \times 10^{-9}}=400$ (cycles)
- Effective CPI = $1 + 0.02 \times 400 = 9$
- **Add L2 cache**
> Access time = 5ns
> Global miss rate to main memory = 0.5%
- L1 miss, L2 hit
- Penalty = $\frac{5}{0.25}=20$ (cycles)
- L1 miss, L2 miss
- Extra penalty = $400$ (cycles)
- Effective CPI = $1 + 0.02 \times 20 + 0.005 \times 400 = 3.4$
### Effect of 2-level Caching
- L2 size is usually much bigger than L1's
    - Provides a reasonable hit rate
    - Lowers the L1 miss penalty
    - Though the total penalty of a miss that goes all the way to memory is higher
- Inclusion property
    - Inclusive cache
        - L1 is a subset of L2: whatever L1 holds, L2 holds too
        - Simplifies the machinery for keeping the data consistent
        - When L2 replaces a block, the corresponding data in L1 must be invalidated
    - Exclusive cache
        - What appears in L1 is not necessarily in L2
        - On an L2 miss, L1 must also be checked for a hit
- 2023 Final:
:::info
What is the difference between inclusive and exclusive cache policies in a multi-level cache hierarchy? List one advantage and one disadvantage of each approach. (7 points)
In an inclusive cache policy, the data present in a higher-level cache is also present in all lower-level caches.
On the other hand, in an exclusive cache policy, the data present in a higher-level cache is not present in any lower-level caches.
**Advantage:**
* Inclusive cache: improves cache coherence handling and reduces the complexity of maintaining consistency between cache levels.
* Exclusive cache: maximizes effective cache capacity.
**Disadvantage:**
* Inclusive cache: requires more cache capacity because of the duplicated data.
* Exclusive cache: increases the complexity of maintaining cache coherence.
:::
### Multiple-level Cache **inclusion property**
1. Inclusive cache: whatever L1 holds, L2 also holds; effective cache size = L2.
    * L2 is inclusive of L1:
      whatever is in L1 must also appear in L2;
      removing a block from L1 does not remove it from L2;
      removing a block from L2 also removes it from L1 if present
    * Enforcing the inclusive property: when data in L2 is replaced, the copy in L1 must be invalidated

2. Exclusive cache: whatever L1 holds, L2 does not; when the data in an L1 cache line is replaced, it moves down into L2; effective cache size = L1 + L2.
    * L2 is exclusive of L1:
      a block lives in only one of L1 and L2

## Interactions with Advanced CPUs
//todo

An out-of-order CPU can keep executing instructions during a cache miss
* The pending store stays in the load/store unit
* Dependent instructions wait in the reservation stations
* Independent instructions continue executing
* The effect of a miss depends on the program's data flow
    * Much harder to analyze
    * Use system simulation instead
## Caching Example (no idea why this is stuck here)


> 175 in 2-way is supposed to be CF [name=嘟嘟]
## DGEMM (Double-precision GEneral Matrix Multiply) Access Pattern
Keep data that is accessed together inside the same cache block, so execution doesn't bounce back and forth between blocks; a blocked version is sketched below
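A hedged sketch of the blocked (tiled) DGEMM idea; the tile size `BLOCK`, the row-major layout, and the assumption that n is a multiple of `BLOCK` are our own choices:
```c
#define BLOCK 32   /* assumed tile size, chosen so three tiles fit in the cache */

/* C = C + A * B on n x n row-major matrices, n a multiple of BLOCK. */
void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int si = 0; si < n; si += BLOCK)
        for (int sj = 0; sj < n; sj += BLOCK)
            for (int sk = 0; sk < n; sk += BLOCK)
                /* the three inner loops touch only three BLOCK x BLOCK tiles,
                   which stay resident in the cache while they are reused */
                for (int i = si; i < si + BLOCK; i++)
                    for (int j = sj; j < sj + BLOCK; j++) {
                        double cij = C[i * n + j];
                        for (int k = sk; k < sk + BLOCK; k++)
                            cij += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = cij;
                    }
}
```
Each tile is reused BLOCK times before eviction instead of once, which is exactly the locality the plain triple loop loses on large matrices.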

//I can't read this figure, someone kind please fill it in; 羅宣又, do some work
//羅宣佑 stop slacking, do something

As the data size grows, the performance gap widens
## Dependability


- **Reliability:** mean time to failure (MTTF)
    - On average, how long until a failure happens (e.g. memory returns no data / returns wrong data); the larger the better
- **Service interruption**: mean time to repair (MTTR)
    - Average time needed to repair after a failure; the smaller the better
- Mean time between failures (MTBF)
- **MTBF = MTTF + MTTR**
- $\text{Availability} = \frac{\text{uptime}}{\text{total time}} = \frac{\text{MTTF}}{\text{MTBF}}$
- Improving Availability
- **Increase MTTF**: fault avoidance, fault tolerance (e.g. **RAID**), fault forecasting
- **Reduce MTTR**: improved tools and processes for diagnosis and repair
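With assumed numbers: MTTF = 1000 hours and MTTR = 1 hour give MTBF = 1001 hours, so $\text{Availability} = \frac{1000}{1001} \approx 99.9\%$.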
## The Hamming SEC Code(ECC)
Suppose we read a string of bits: how do we know whether any bit in it has flipped?
Fault detection / correction for **RAM**
* Parity bit: detects a single-bit error
    * Check whether the sum of all the bits is odd or even
### Hamming distance (Single Error Correction, Double Error Detection)
The number of bits in which two bit patterns differ
### Error Correction Code(ECC)
:::info

:::
> Group together all the positions whose binary representation has bit n set (bit 1 covers 1, 3, 5, 7, 9...; bit 2 covers 2, 3, 6, 7, 10...) [name=BendTail ]
#### First, insert the check bits (parity bits)
The check bits go at the $2^n$ positions
:::info

:::
> Leave the $2^n$ positions empty and fill the remaining positions with the data bits in order. For each group from the rule above, XOR the covered bits; the parity bit is chosen so the XOR comes out 0. [name=BendTail ]
#### Error Detection
Suppose bit 10 is flipped
:::info

:::
> XOR each group's bits per the rule above; adding up the group numbers whose result is 1 gives the position of the flipped bit. Treat it as a single-bit error first and try to correct it; if an error remains afterward, there were two or more errors, which can only be detected, not corrected. [name=BendTail ]
<!--Rough notes for now, will fill in later (filled in!)-->
### Bits needed
p = total number of parity bits
d = number of data bits in the p + d coded bits
The p parity bits must pinpoint which of the p + d positions is wrong, plus one more case to indicate that no error exists, so we need $2^p \ge p + d + 1$.
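A hedged C sketch of the scheme above, for 8 data bits with parity at positions 1, 2, 4, 8 (the 8-bit width and the helper names are our choices):
```c
#include <stdio.h>

/* Scatter 8 data bits into the non-power-of-two positions 3,5,6,7,9,10,11,12
 * and fill positions 1,2,4,8 so that each parity group XORs to 0. */
unsigned hamming_encode(unsigned data) {
    unsigned code = 0;
    int pos, d = 0;
    for (pos = 1; pos <= 12; pos++)
        if (pos & (pos - 1))                      /* pos is not a power of two */
            code |= ((data >> d++) & 1u) << (pos - 1);
    for (int p = 1; p <= 8; p <<= 1) {
        unsigned parity = 0;
        for (pos = 1; pos <= 12; pos++)
            if ((pos & p) && pos != p)            /* group: index has bit p set */
                parity ^= (code >> (pos - 1)) & 1u;
        code |= parity << (p - 1);
    }
    return code;
}

/* XOR each group; summing the group numbers that come out 1 gives the
 * position of the flipped bit (0 means no error). */
int hamming_syndrome(unsigned code) {
    int syn = 0;
    for (int p = 1; p <= 8; p <<= 1) {
        unsigned parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & p)
                parity ^= (code >> (pos - 1)) & 1u;
        if (parity) syn += p;
    }
    return syn;
}

int main(void) {
    unsigned c = hamming_encode(0xB4);            /* data = 1011 0100 */
    c ^= 1u << 9;                                 /* flip the bit at position 10 */
    printf("error at position %d\n", hamming_syndrome(c));  /* prints 10 */
    return 0;
}
```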

> 為什麼不用$\LaTeX$[name=BendTail]
> I don't know how [name=spoiled me]
:::danger
> ==Heads up: everything from here until the Cache Coherence Problem is barely finished, but I trust you've already covered it in OS, and I can't be bothered. Alas!== [name=BendTail ]
>> My OS is terrible = =
:::
:::info
~~Let's admire the professor's scribbles together~~

:::
## Virtual Memory(important!)
* Use main memory as a cache for secondary storage (the disk)
* Every program gets its own private virtual address space that other programs cannot access (otherwise they'd fight)
* Managed jointly by the OS and the CPU hardware, which translate virtual addresses (VA) into physical addresses (PA)
> Covered in OS: this needs the MMU [name=victim of this course]
* Data moves in units of pages (4 KB); a miss is called a page fault
* A page is simply VM's version of a block
## Address Translation
Find the corresponding physical address through the page table
Page offset: with 4 KB = $2^{12}$ pages, the offset is the low 12 bits
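With an assumed address: for 32-bit VAs and 4 KB pages, VA 0x12345678 splits into virtual page number 0x12345 (upper 20 bits) and page offset 0x678 (lower 12 bits); translation swaps the VPN for a physical page number and keeps the offset unchanged.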

## Page Fault
### Page Fault Penalty
* When one happens, the data has to be fetched from disk
* Gigantic latency (on the order of $1000 \times 1000$ cycles), ~~just like me, the slow kid :(~~
* A heavy burden on the CPU; handled by OS code
* So the page fault rate has to be kept as low as possible
    * Fully associative placement
    * Fancy algorithms become affordable (pure LRU, pseudo-LRU)
### Page Fault Handler
1. Use the VA to find the page's location on disk from the PTE
2. Choose a page to replace (if that page is dirty, write it back to disk first)
3. Bring the page into memory, then update the page table
4. Restart from faulting instruction
## Page Tables
A one-dimensional array: VA in, PA out, ~~Kaohsiung strikes it rich~~
An array of page table entries, indexed by virtual page number

* Page Table Register
    * A register in the CPU that points to the page table's location in physical memory
* If page is present in memory
* page table entry(PTE) stores the physical page number
* Plus other status bits(referenced, dirty...)

* If page is not present in memory
* page table entry can refer to location in swap space on disk

### Replacement and Writes
- To reduce page fault rate, prefer least-recently used (LRU) replacement
- **Reference bit** (a.k.a use bit) in PTE set to 1 on access to page
- Periodically cleared to 0 by OS
- A page with reference bit = 0 $\implies$ not been used recently
- Writing data to nonvolatile storage (such as disk) costs far too much per write
    - **`Write-through`** is impractical (too slow); use **`Write-back`** instead (see the earlier Write Policy section)
- **Dirty bit** in PTE will be set when page is written
## Translation Lookaside Buffer(TLB)
With only a page table, every reference would need two memory accesses:
1. one to access the PTE (page table entry)
2. one to access the translated memory location
The TLB is essentially a cache for the page table: if the translation is in the TLB, only one memory access is needed


### TLB Misses
Two cases are possible:
1. the page is in the page table but not in the TLB
2. the page is not in the page table at all
On a TLB miss, a handler (hardware or software) copies the PTE into the TLB and then restarts the instruction. If the page isn't in the page table, a page fault occurs; after the page is loaded, the instruction restarts.
On a TLB miss, the page table is consulted; it tells you whether the data is in a physical page or on disk.
### TLB and Cache Interaction

## Memory Protection
Different tasks may share the same virtual addresses
The OS has to provide protection so illegal accesses can't succeed
Some hardware assists the OS with protection:
* Kernel mode
* Privileged instructions
* Page tables can only be accessed in kernel mode
* System call
## Cache Design Trade-offs
Lists pros and cons; feels very examinable

## Cache Controller FSM
Basically a flow chart; worth a look
This cache is direct-mapped, write-back, write-allocate

## Linear Page Table
>I have no idea what he's writing [name=海熊]
>Let me try to rescue it [name=BendTail ]
Usually there is one page table per process.
The number of entries follows from the page size and the number of address bits;
the page table size is the number of entries times the entry size.
:::info
Assume page size = 4 KB = $2^{12}$ bytes
address = 32 bits
number of entries = $2^{32-12}$ = $2^{20}$
:::

The page offset width is determined by the page size.
Number of page table entries (PTEs) = number of pages in the virtual address space.
Address of a PTE = Base + PTE size × VPN
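A small C sketch of this arithmetic, assuming 32-bit virtual addresses, 4 KB pages, and 4-byte PTEs (the PTE size is our assumption, not the note's):
```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t page_bits = 12;                          /* 4 KB = 2^12    */
    const uint64_t entries   = 1ull << (32 - page_bits);    /* 2^20 entries   */
    const uint64_t pte_size  = 4;                           /* bytes, assumed */
    printf("entries = %llu, table size = %llu bytes\n",     /* 4 MB table     */
           (unsigned long long)entries,
           (unsigned long long)(entries * pte_size));

    /* Address of the PTE for one VA: Base + PTE size * VPN */
    uint32_t base = 0x100000, va = 0x12345678;
    uint32_t vpn  = va >> page_bits;                        /* VPN = 0x12345  */
    printf("PTE address = 0x%x\n", base + (uint32_t)pte_size * vpn);
    return 0;
}
```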
## Multi-level Page Tables: Page Directory
Page Directory
* The page directory contains one entry per page of the page table.
* It consists of a number of page directory entries (PDEs).
* A PDE has a valid bit and a page frame number (PFN).
* The PFN points to the next-level page table.

I can't make sense of this at all
----------
## Cache Coherence Problem

If every processor has its own cache:
following the steps in the figure above, by step 3 the value u has been modified; but under write-back it is written to memory only when the block is replaced, so until then the other processors' caches keep reading the old value
### Remedies
* Migration of data to local caches
    * Reduces bandwidth demand on the shared memory
* Replication of read-shared data
    * Reduces contention for access
### Snooping Protocols
When one shared memory location is recorded in the cache lines of several different caches, the bus is used to broadcast the data's state so every cache can snoop, update its own state and data, and all caches agree on a single authoritative value when reading or writing it.
#### Write Invalidate
* Multiple readers, one writer
* When a processor modifies shared data, the copies in the other caches are marked invalid (their next access becomes a coherence miss)
* On such a miss, with write-through the data can be fetched straight from memory; with write-back, every cache must be snooped to find the most recent copy (a toy demo follows below)
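A toy write-invalidate demo under heavy simplifications (two processors, one-word caches, write-through so memory always holds the latest value; all names invented):
```c
#include <stdio.h>

enum { INVALID, VALID };
struct { int state, value; } cache[2];   /* one cache line per processor */
int memory = 5;                          /* the shared variable u */

int read_u(int cpu) {
    if (cache[cpu].state == INVALID) {   /* read miss: fetch from memory */
        cache[cpu].value = memory;
        cache[cpu].state = VALID;
    }
    return cache[cpu].value;
}

void write_u(int cpu, int v) {
    cache[cpu].value = v;
    cache[cpu].state = VALID;
    memory = v;                          /* write-through keeps memory current */
    for (int other = 0; other < 2; other++)
        if (other != cpu)
            cache[other].state = INVALID;  /* the invalidate broadcast on the bus */
}

int main(void) {
    printf("P0 reads %d\n", read_u(0));  /* 5 */
    printf("P1 reads %d\n", read_u(1));  /* 5 */
    write_u(0, 7);                       /* invalidates P1's copy */
    printf("P1 reads %d\n", read_u(1));  /* 7: a coherence miss refetched it */
    return 0;
}
```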
#### Write Broadcast
* Usually paired with write-through
* When a processor modifies the shared data, the updated value is broadcast to the other caches so they can update their copies.
#### Write Serialization
(no idea)
#### Write through

#### Write back
* Uses read misses to trigger write-backs

### Directory Based Schemes
Directory-based schemes do not keep nodes in sync through the system bus; instead, a directory tracks each cache block's situation, including the block's current state and which nodes are currently sharing it. That way, signals need not be broadcast over the system bus to every node but go only to the nodes that care, lowering the memory bandwidth cost.
# Ch.6 Parallel Processors From Client to Cloud
## Hardware and Software
### Hardware
* Serial (cannot run in parallel)
* Parallel (can run in parallel)
### Software
* Sequential (cannot run in parallel)
* Concurrent (can run in parallel)
**Sequential** software can run on **serial** hardware.
**Concurrent** software can run on **parallel** hardware.
## Parallel Programming
The problem lies in the parallel software
### Difficulties
* Partitioning
* Coordination
* Communication overheads
## Amdahl's Law (we meet again)
The part that cannot be parallelized limits the speedup
:::info

:::
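$\text{Speedup} = \frac{1}{(1 - F) + F/N}$, where $F$ is the parallelizable fraction and $N$ the number of processors. With assumed numbers: $F = 0.9$ and $N = 10$ give $\frac{1}{0.1 + 0.09} \approx 5.3$, far below the ideal $10\times$.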
----------------
# Exam Format
True/false
Short answer (e.g. the effects of a larger cache block)
Calculation
Cache management (direct-mapped cache, associative cache) (snooping)
(snooping and directory may both show up in true/false and short answer)