# NCU CSIE - Computer Organization (II)
:::info
If you are an unlucky junior stuck taking this course, we'll observe three seconds of silence for you.
These are the notes of a group of people who studied computer organization until they broke down. An exam this brutal shouldn't be suffered alone.
(Some parts are incomplete; please bear with us.)
> Scored 85 after the exam [name=BendTail]
:::
https://hackmd.io/@U86qbAM7Qp6EYJchUWA5SA/rk-_yUyXC/edit I'm too lazy to merge this in.

![](https://memeprod.sgp1.digitaloceanspaces.com/user-wtf/1592837439480.jpg =50%x)
![](https://megapx-assets.dcard.tw/images/c5fea2ed-747b-440a-a370-b76550d3450a/1280.webp =50%x)

# Ch.5 Large and Fast: Exploiting Memory Hierarchy

## Locality

### Temporal Locality
If an item was accessed recently, it is likely to be accessed again soon.
:::info
Ex: the instructions inside a loop, a loop's induction variable
:::

### Spatial Locality
If an item was accessed recently, the items near it are likely to be accessed soon.
:::info
Ex: array data, consecutive instructions
:::

:::success
Ex:
```
for(int i = 0; i < 8; i++){
    for(int j = 0; j < 8000; j++){
        A[i][j] = A[j][i] + B[i][0];
    }
}
```
Temporal: i, j, B[i][0]
Spatial: A[i][j] (with row-major layout, only A[i][j] steps through adjacent elements)
:::
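A quick way to feel spatial locality: traversing a row-major 2D array column-by-column touches a new cache block on almost every access, while row-by-row traversal reuses each fetched block. A minimal sketch (the array size and the timing harness are our own choices, not from the course; compile without heavy optimization, e.g. `-O0`, to see the gap):

```c
#include <stdio.h>
#include <time.h>

#define N 4096
static double A[N][N];

int main(void) {
    double sum = 0;
    clock_t t0, t1;

    // Row-by-row: stride-1 accesses, good spatial locality
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += A[i][j];
    t1 = clock();
    printf("row-by-row: %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    // Column-by-column: stride of N doubles, so almost every access
    // lands in a different cache block
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += A[i][j];
    t1 = clock();
    printf("col-by-col: %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return (int)sum; // keep the compiler from deleting the loops
}
```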
### Dinner(?
* Pros: big and filling, cheap, durable
* Cons: low performance, high power consumption

## Memory Hierarchy
* Copy recently accessed data (and the data around it) from disk into DRAM
* Copy even more recently accessed data (and the data around it) from DRAM into SRAM (e.g., the L1 cache on the CPU)

### Block
The basic unit of copying (possibly several words) **(spatial locality)** -- the unit of transfer between memory and cache, and between cache levels.

### Hit
The requested data is in the upper level; access it directly.
:::info
$\text{Hit ratio} = \frac{\text{hits}}{\text{accesses}}$
:::

### Miss
The requested data is not in the upper level; copy it up from the lower level, then access it.

#### Miss Penalty
The time it takes to bring data from the lower level into the upper level.
:::info
$\text{Miss ratio} = 1 -\text{Hit ratio}$
:::

## Memory Technology
* Static RAM (SRAM)
* Dynamic RAM (DRAM)
* Magnetic disk

![image](https://hackmd.io/_uploads/BJGjaK-SA.png)

## DRAM
* Used as main memory
* A capacitor's charge state represents 0 or 1 (charged is 1, discharged is 0)
* Single transistor used per bit
* Must be refreshed on a fixed schedule (every 64 ms), or the charge leaks out of the capacitors
* Refresh = read the contents and write them back
* Done one row at a time

### Advanced DRAM Organization
* Bits in a DRAM are organized as a rectangular array
* First activate the row; DRAM accesses an entire row, then selects the target column
* Access/read/write happens a row at a time; every row has an address (typically a row is 32 or 64 bits)
* Moving a row's worth of data must be repeated (data size / row size) times -> optimized by Burst Mode
* Burst Mode
    * Besides the addressed memory location, also accesses the surrounding 8~16 columns (depending on bandwidth) and keeps them in the internal row buffer, cutting down addressing time
    * Reduces latency

#### Double Data Rate (DDR) DRAM
Transfers on both the rising and falling clock edges

#### Quad Data Rate (QDR) DRAM
Separate DDR inputs and outputs

![](https://hackmd.io/_uploads/rkUaoFDbA.png)

### DRAM Performance Factors
#### Increased Row Buffer size
* Acts as the DRAM's cache: look for the target here first, and only activate a row on a miss
* Several words can be read and refreshed in parallel
* Faster data access

#### Synchronous DRAM (SDRAM)
* Allows consecutive accesses in bursts without needing to send each address (burst mode)
* Improves bandwidth

#### DRAM Banking
* Banks can be accessed simultaneously (each bank operates independently)
* Improves bandwidth

### Burst Mode Support
:::info
Ex:
![image](https://hackmd.io/_uploads/ByPEjpfB0.png)
> Burst mode activates all the columns up front, saving the time of activating them one by one [name=professor]
> Reduced access overhead: multiple data words are transferred in a single burst.
:::

## Increasing Memory Bandwidth
bandwidth = data / latency
![](https://hackmd.io/_uploads/ByssAFP-A.png)
![](https://hackmd.io/_uploads/HyCoRtP-C.png)
:::success
Widening the bus is expensive; banking raises bandwidth at low cost.
:::

## Flash Storage
* Nonvolatile semiconductor storage
* 100x~1000x faster than disk
* Smaller, lower power, more robust
* But costs more per GB than disk (though less than DRAM)

### NOR flash
* Random read/write access
* Pros: fast random-access reads
* Cons: bit-alterable, but slow writes
:::info
Ex: instruction memory in embedded systems
:::

### NAND flash
* Block (page)-at-a-time access
* Lower cost
* Cons: every write must rewrite a whole block
:::info
Ex: USB drives, media storage
:::

:::warning
A flash bit wears out after on the order of 1000 write/erase cycles (so data must be remapped to less-used blocks -> wear leveling)
> If you have a 1TB SSD, it may stop being usable after roughly 1000TB of writes
:::

## Disk
* Nonvolatile, rotating magnetic storage
* Move the head to the track
* Rotate the platter to the target sector

### Sector
* The basic unit of a disk
* Addressed by sector ID (the sector ID is not stored in the sector's data space)
> Think of the disk as a one-dimensional array: you need an index to access the data [name=professor]
* 4096 bytes is the mainstream size today

#### Error correcting code (ECC)
XOR two copies of the data to compare them; any 1 bit means an error (differing values XOR to 1).
Split the data being stored into two parts and XOR them to form the old parity (ECC). On every read, recompute the new parity and compare it (XOR) with the old parity; a 1 (old and new differ) marks an error bit, which can be recovered through the ECC.

#### Latency
1. Time the job waits in the queue
2. Average seek time: latency of moving the head (inward or outward)
3. Rotational latency: latency of spinning the platter (on average, half the time of one full rotation)
4. Data transfer
5. Controller overhead (fixed per operation)

![image](https://hackmd.io/_uploads/H14QB9brC.png)

### Disk drives include caches
Every access has a long latency, so a cache is needed to improve performance.
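Putting the latency components together: average access time = seek + half a rotation + transfer + controller overhead. A small sketch with made-up parameters (the 4 ms seek, 15000 RPM, 100 MB/s transfer rate, and 0.2 ms controller overhead are illustrative assumptions, not numbers from the course):

```c
#include <stdio.h>

int main(void) {
    // Illustrative parameters -- not from the course slides
    double seek_ms       = 4.0;     // average seek time
    double rpm           = 15000.0; // spindle speed
    double transfer_MBps = 100.0;   // sustained transfer rate
    double ctrl_ms       = 0.2;     // controller overhead
    double sector_bytes  = 4096.0;  // one 4 KB sector

    // Average rotational latency = half of one full rotation
    double rot_ms = 0.5 * (60000.0 / rpm);

    // Time to transfer one sector
    double xfer_ms = sector_bytes / (transfer_MBps * 1e6) * 1000.0;

    double total = seek_ms + rot_ms + xfer_ms + ctrl_ms;
    printf("rotational: %.3f ms, transfer: %.4f ms, total: %.3f ms\n",
           rot_ms, xfer_ms, total);  // 2.000 ms, 0.0410 ms, 6.241 ms
    return 0;
}
```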
## Cache Memory
First check whether the requested data is in the cache; if not, trigger a cache miss event and fetch it from the lower-level memory.
:::danger
How do we know whether the data is in the cache?
* Check whether the memory address maps into the cache
Where do we look?
* Find the corresponding index, then compare the tag
How do we read it?
> I don't know, stop testing me on it [name=the one who wants to drop out]
:::

### Direct Mapped Cache
* Each memory address maps to exactly one cache entry
* Mapping: $\text{(Block address)}\ \ mod\ \ \text{(# of Blocks in cache)}$
* Looking at that one cache entry tells you whether the memory address is present
* $\text{(# of Blocks)}$ is a power of 2, so the mapping is just the low-order bits of the block address
:::danger
Problem: memory is byte-addressable but the cache is block-addressable
:::
:::success
$\lfloor\frac{\text{address}}{\text{unit size}}\rfloor$
Ex: converting a 32-bit byte address to a word address (unit size $2^2$) -> $\lfloor\frac{\text{address}}{2^2}\rfloor$
:::

### Tags
The lower bits select the entry; the higher bits are then compared for a match.
So each entry only needs to store the higher bits -> the **tag**.

### Valid bits
The usability state of the data.
Valid bit = 1 if the data is present, otherwise 0.
If the valid bit is 0, the data must be loaded regardless of whether the tag comparison matches.
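How a direct-mapped cache carves a 32-bit byte address into offset, index, and tag, as a minimal sketch. The geometry matches the worked example in the Larger Block Size section below (16-byte blocks -> 4 offset bits, 64 blocks -> 6 index bits, 22 tag bits); the function names are ours:

```c
#include <stdint.h>
#include <stdio.h>

// 16 bytes/block -> 4 offset bits; 64 blocks -> 6 index bits;
// tag = 32 - 6 - 4 = 22 bits.
#define OFFSET_BITS 4
#define INDEX_BITS  6

static uint32_t get_offset(uint32_t a) { return a & ((1u << OFFSET_BITS) - 1); }
static uint32_t get_index(uint32_t a)  { return (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t get_tag(uint32_t a)    { return a >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    uint32_t addr = 0x12345678;
    printf("addr=0x%08x tag=0x%x index=%u offset=%u\n",
           addr, get_tag(addr), get_index(addr), get_offset(addr));
    return 0;
}
```

A lookup then reads the entry at `index`, compares its stored tag against `get_tag(addr)`, and checks the valid bit; all three must agree for a hit.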
## Four Basic Questions
- Consider access at each level of a memory hierarchy:
    - Memory: byte addressable
    - Caches: use the block (also called a line) as the unit of data transfer, to satisfy the Principle of Locality. But blocks must be located (caches are not byte addressable)
- Cache design
    - Block Placement
    - Block Identification
    - Block Replacement
    - Write Strategy

## Larger Block Size
![](https://hackmd.io/_uploads/rJZwQOgzC.png)
How is this computed? First, it is 16 bytes/block; since 16 = $2^4$, we need 4 bits for the offset. Next, there are 64 blocks, and 64 = $2^6$, so we need 6 bits to represent all possible block numbers (0 to 63). Lastly, we assume a total address size of 32 bits (~~we just know~~). To get the tag size, we subtract the index bits and the offset bits: 32 - 6 - 4 = 22.

## Block Size Considerations
- Because of spatial locality, a larger block size should lower the miss rate
- But a larger block size means fewer blocks, causing more collisions and raising the miss rate
- A larger block size causes pollution -> unneeded data gets brought in (each fetch grabs a large chunk)
- Larger miss penalty (more data to move per miss)

## Cache Misses
- On a hit, the CPU proceeds normally
- On a miss:
    1. stall the CPU pipeline
    2. fetch the data from the next level of the hierarchy
- Block size affects the miss rate.
- Up to 64 bytes, the miss rate drops thanks to spatial locality; beyond 64 bytes, cache pollution drives it back up
> Broken [name=海熊]

![](https://hackmd.io/_uploads/HyY6DuefA.png)
* Lowers it: spatial locality
* Raises it: cache pollution, fewer entries

### Instruction Miss
Restart the instruction fetch

### Data Cache Miss
Complete the data access

## Write Policy

### Write through
After a data-write hit, we not only update the block in the cache, we also write through to memory.

### Write back
- Uses a dirty bit
- Dirty bit = 1 means the data has been modified
- When a dirty block is replaced
    - write it back to memory
    - a write buffer can let the replacing block be read first
> I'm not clean anymore
> You little dirty boi

#### Write Buffer
Since writing through takes extra time, a write buffer reduces the cost. (Dump the data headed for memory into the buffer, and the CPU can move on to other work.)

### Write Allocation, Write Around
On a write miss (the data being stored is not in the cache):
* **Write Allocation:** When a write miss occurs, a cache line is allocated and the block is fetched into the cache. The write is then performed on the cache.
* **Write Around:** When a write miss occurs, the data is written directly to main memory, bypassing the cache. The cache is not updated or allocated.
- With write-through:
    - allocate on miss: fetch the block (update the cache, then write)
    - write around: don't fetch the block (write directly in memory)
- With write-back:
    - usually fetch the block (write allocate)

:::info
Here's a brutally hard homework problem (Homework 5, Q6) (the numbers differ from the actual homework but the idea is the same)
![image](https://hackmd.io/_uploads/SJ_X6L8m0.png)
![image](https://hackmd.io/_uploads/SkdA3UIQA.png)
> My own understanding; fix it if it's wrong
>
> (a) Read bandwidth = instruction bandwidth + data read bandwidth
> Instruction bandwidth is intuitive: work out the instructions per cycle, then each instruction misses with probability 0.3%, and a miss fetches one block: block size * miss rate * IPC = 64 * 0.003 * 0.5
> Data read bandwidth is the tricky part, because it counts not just reads but also writes (surprise!): a write miss also has to read the block from memory. So it's block size * data cache miss rate * IPC * (fraction of instructions that are writes + fraction that are reads). Write bandwidth: since this is write-through, every write goes to memory whether or not it misses, so there's no miss rate factor: just word size * fraction of instructions that are writes * IPC.
>
> (b) Read bandwidth works like (a). Write bandwidth: since this is write-back, data goes to memory only when a miss evicts a dirty block: block size * miss rate * IPC * (fraction of writes + fraction of reads) * probability the evicted block is dirty [name=BendTail ]
> ![image](https://hackmd.io/_uploads/B1nb8sPm0.png)
> Because with write-through, anything written into the cache is immediately written to memory too, and the smallest unit is one word: every word written into the cache is also immediately written back to memory [name=icy]
>> Who the hell would know that, it makes no sense [name=BendTail ]
>> Stop reading this
:::

### Cache Size
Assume the cache has $2^n$ blocks, so n bits are used for the index. Assume a block holds $2^m$ words, so m bits select the word, and an additional 2 bits are used as the byte offset.
> Where does that 2 come from? The CPU works in words while memory is byte-addressed, so two extra bits convert a word to bytes (byte offset within a word) [name=professor]
> Who actually computes it this way? The block-to-byte unit conversion already covers this = =

### Overhead Analysis
- Number of bits in a direct-mapped cache:
$2^n \times (\text{block size} + \text{tag size} + \text{valid field size})$
- The size of the tag field:
$32 - (n + m + 2)$
n bits represent the index
m bits select the word within a block
:::success
$\text{Overhead}= \frac{\text{Total cache size}}{\text{Actual cache data}}$
:::
![](https://hackmd.io/_uploads/r1BWltgM0.png)
> My note-taking partner is down. 魏伯儒, wake up [name=BendTail]
> I'm awake. Wait, class is over? [name=海熊]

## Memory Performance
### Bandwidth
Affected by:
- Bus width
- Bus cycle time
- Number of channels (each memory slot can operate independently)
- bandwidth = total data size / number of bus cycles (miss penalty)

### Latency
- DRAM access latency
- Interconnect latency (banks)

## Main Memory Supporting Cache
DRAM is the mainstream choice today
- Connected by a fixed-width clocked bus
- The memory bus clock is usually slower than the CPU clock

![image](https://hackmd.io/_uploads/rJLd5tlfA.png)

## Measuring Cache Performance
:::info
**Quick review**
$\text{CPU Time}=\frac{\text{Instructions}}{\text{Program}}\times\frac{\text{Clock Cycles}}{\text{Instruction}}\times\frac{\text{Seconds}}{\text{Clock Cycle}}$
$=\text{IC}\times\text{CPI}\times\text{Cycle Time}$
:::
IC = Instruction Count
> Sneaky edit; work on the layout [name=海熊]
> Love it ❤️ [name=BendTail]

CPU time components:
1. the program's execution cycles (including cache hit time)
2. memory stall cycles (due to cache misses)

$\text{Memory Stall Cycles}=\frac{\text{Memory accesses}}{\text{Program}}\times\text{Miss rate}\times\text{Miss penalty}$
$=\frac{\text{Instructions}}{\text{Program}}\times\frac{\text{Misses}}{\text{Instruction}}\times\text{Miss penalty}$

![image](https://hackmd.io/_uploads/HyUnpiWSR.png)
> Why post the same thing again as an image? That LaTeX took me forever :( [name=海熊, broken]

Example
![image](https://hackmd.io/_uploads/rywGC7PGA.png)

## Average Memory Access Time (AMAT)
$\text{AMAT}=\text{Hit time} + \text{Miss rate} \times\text{Miss penalty}$
How long a data access takes on average.
> So CPU performance is measured with CPU time rather than AMAT? In other words, we only need CPU time and don't have to consider AMAT? [name=someone from the rescue group]
> AMAT only tells you how long fetching data takes on average; CPU time is what measures CPU performance for a whole process [name=李昱成]
> Expert... well said... (hands over Homework 7.pdf) [name=Eason Hsu]

Example
![image](https://hackmd.io/_uploads/S12-yVDMA.png)
// I don't know what the rest of this was about
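The CPI-with-stalls math above in runnable form. This minimal sketch reuses the numbers from the multi-level cache example further down (base CPI 1, 4 GHz clock, 2% misses per instruction, 100 ns memory); the variable names are ours:

```c
#include <stdio.h>

// Effective CPI = base CPI + misses/instruction * miss penalty (in cycles)
int main(void) {
    double base_cpi       = 1.0;
    double clock_hz       = 4e9;   // 4 GHz
    double miss_per_instr = 0.02;  // 2% misses per instruction
    double mem_ns         = 100.0; // main memory access time

    double cycle_ns = 1e9 / clock_hz;          // 0.25 ns per cycle
    double penalty  = mem_ns / cycle_ns;       // 400 cycles
    double eff_cpi  = base_cpi + miss_per_instr * penalty;

    printf("miss penalty = %.0f cycles, effective CPI = %.1f\n",
           penalty, eff_cpi);  // 400 cycles, CPI = 9.0
    return 0;
}
```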
## How do we lower the Miss Rate?
- Associative caches

### Fully Associative cache
* Any entry may hold the target data
* Every entry must be searched and compared
:::warning
May increase hit time
:::
:::info
> It's just the n-way set associative below, with a single set [name=BendTail]
> one-way set = direct mapped [name=eric]
:::

### n-way set associative
* The cache is split into sets, each holding n entries
* There are n places to look for a given piece of data
* $\text{(Block number)}\ \ mod\ \ \text{(\# sets in cache)}$
* LRU (Least Recently Used) policy: when an entry must be evicted, remove the one unused for the longest time, i.e., keep the recently accessed data
:::info
> In direct mapped, each index can hold only one block; n-way lets one set hold n blocks. First locate the set, then search that set for a matching tag. (This solves the case where several hot addresses share the same remainder.) [name=BendTail]
:::

![image](https://hackmd.io/_uploads/Sy-IostMC.jpg)
// check one location vs. check N locations vs. check every location
![image](https://hackmd.io/_uploads/rk4ynsFMA.jpg)

Fewer sets -> fewer index bits needed

:::info
Example:
![image](https://hackmd.io/_uploads/Hygh1xCBHA.png)
**Direct Mapped:**
We have 4096 blocks -> 4096 sets.
index = log2(4096) = 12
offset = 4 (16-byte blocks)
Tag: 32 - index - offset = 32 - 12 - 4 = 16 bits
Total tag storage = 4096 * 16 = 65536 bits = 8192 bytes

**Two-way:**
We have 4096 blocks -> 4096/2 = 2048 sets.
index = log2(2048) = 11
offset = 4
Tag: 32 - index - offset = 32 - 11 - 4 = 17 bits
Total tag storage = 2048 * 2 * 17 = 69632 bits = 8704 bytes

**Four-way:**
We have 4096 blocks -> 4096/4 = 1024 sets.
index = log2(1024) = 10
offset = 4
Tag: 32 - index - offset = 32 - 10 - 4 = 18 bits
Total tag storage = 1024 * 4 * 18 = 73728 bits = 9216 bytes

**Fully associative:**
We have 4096 blocks -> 1 set.
Because it's fully associative -- surprise (though you really should have seen it coming)
index = 0 (there's only one set)
offset = 4
Tag: 32 - index - offset = 32 - 0 - 4 = 28 bits
Total tag storage = 4096 * 28 = 114688 bits = 14336 bytes
:::

------
:::spoiler Chatroom(?
> I honestly don't understand any of this [name=David-Teng]
> Same here [name=v0]
> And 羅宣祐, go ahead and keep grinding on your own instead of helping with the hackmd, it's fine [name=fake mrwei(v0)]
> First of all, I fell asleep [name=海熊]
> I get it now; added it above [name=BendTail]
:::
------

![IMG_0499](https://hackmd.io/_uploads/ByGp_b-m0.jpg)
![IMG_0500](https://hackmd.io/_uploads/ryy0_bbQA.jpg)

## Memory Hierarchy Basics
### Six basic cache optimizations
#### Larger block size
* Brings more data into the cache per fetch
* Reduces compulsory misses
* Increases capacity and conflict misses, and the miss penalty

#### Larger total cache capacity
* Reduces the miss rate
* Increases hit time and power consumption

#### Higher associativity
* Reduces conflict misses
* Increases hit time and power consumption

#### Higher number of cache levels
* Reduces the miss penalty and overall memory access time

#### Giving priority to read misses over writes
* Reduces the miss penalty

#### Avoiding address translation in cache indexing
* Reduces hit time

![image](https://hackmd.io/_uploads/SytjKhYfA.jpg)

## Types of Misses
### Compulsory Miss
Caused by the first access to a block (improved with larger blocks)

### Capacity Miss
Previously cached data was discarded and is requested again (the cache's usable size is too small; improved with a larger cache)

### Conflict Miss
Multiple addresses map to the same block index, so already-cached data keeps getting overwritten and re-fetched (the n in n-way is too small)

![15](https://hackmd.io/_uploads/HkYyJTKf0.jpg)
> Posting pictures from here on because I think they're easier to understand, ~~definitely not because I can't keep up anymore~~, and I want to sleep ZZZ [name=v0]
> Why did 鄧博元 bail for lunch, damn it
> Sorry [name=david]
> Gentlemen, good morning. I'll come back and fill this in later; joining mid-lecture is incomprehensible (yay, it's filled in now) [name=BendTail]

## Replacement Policy
### Direct mapped
* No choice

### Set associative
* First look for a non-valid entry
* Otherwise use a policy such as FIFO, random, or LRU

#### Least Recently Used (LRU)
* Evict the entry unused for the longest time

#### Random
* Saves bits
* ~~Bogo replacement~~

## LRU
![21](https://hackmd.io/_uploads/HyQZWpKf0.jpg)

### Pure LRU
* Take the first cell as an example: it is TRUE (1) if entry 1 was used more recently than entry 2, and FALSE (0) otherwise
* The full ordering of the entries can be reconstructed from the bits
* At least 5 bits are needed because $2^5 = 32 > 4! = 24$ possible orderings
* The slide uses 6 bits because it keeps one bit per pair of entries: $\binom{4}{2} = 6$

### Partition LRU (Pseudo LRU)
* A 0 means go left, a 1 means go right; following the bits finds the victim
* The bits along the path are flipped after an access, so they point away from the most recent use
![image](https://hackmd.io/_uploads/Sk17tbZQA.png)
> Prove that Partition LRU works [name=孫敏德]

![24](https://hackmd.io/_uploads/B15uNpFGR.jpg)
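A minimal sketch of the tree pseudo-LRU above for one 4-way set: 3 bits form a binary tree, `victim()` follows the bits to pick an entry, and every access points the bits on the path away from the touched way. The encoding below is our own; the slide's bit numbering may differ:

```c
#include <stdio.h>

// Tree pseudo-LRU for one 4-way set. bits[0] is the root (0 = go left,
// 1 = go right), bits[1] covers ways {0,1}, bits[2] covers ways {2,3}.
static int bits[3];

// Pick the victim by walking the tree: 0 -> left, 1 -> right.
static int victim(void) {
    if (bits[0] == 0) return bits[1] == 0 ? 0 : 1;
    else              return bits[2] == 0 ? 2 : 3;
}

// On an access, make every bit on the path point AWAY from way w,
// so the walk above lands on a not-recently-used entry.
static void touch(int w) {
    bits[0] = (w < 2) ? 1 : 0;              // used left half -> point right
    if (w < 2) bits[1] = (w == 0) ? 1 : 0;
    else       bits[2] = (w == 2) ? 1 : 0;
}

int main(void) {
    int trace[] = {0, 1, 2, 3, 0};
    for (unsigned i = 0; i < sizeof trace / sizeof *trace; i++) {
        touch(trace[i]);
        printf("after touching way %d, victim would be way %d\n",
               trace[i], victim());
    }
    return 0;
}
```

Note that pseudo-LRU only approximates the pure version: with 3 bits instead of the 5+ a full ordering needs, the victim is guaranteed not to be the most recently used way, but it is not always the true least recently used one.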
## Multi-level Cache Terminology
Multi-level caches exist to reduce the miss penalty: when L1 misses, L2 already holds most of the data, so L1's miss penalty is determined by the L2 (SRAM) access time, which is hundreds of times faster than reading main memory (DRAM).

L1 and L2 are designed with different emphases.
L1's hit time bounds the CPU clock cycle time, so it must be kept small.
1. Use split caches (separate instruction/data), doubling bandwidth, to lower hit time.
2. L1 can use either write-back or write-through, since L1 and L2 are both SRAM and both fast.

L2 emphasizes miss rate: it must be large enough to cover most of the needed data.
1. Use a combined (unified) cache, increasing locality of reference, to lower the miss rate.
2. Below L2 sits main memory, and SRAM and DRAM speeds differ enormously, so write-back is used.

:::spoiler Sorry
> Thanks, grind-god 羅
>> Keep grinding then
>>> That's you
>>>> Speaking of him
>>>>> So many layers
>>>>>> yee
>>>>>>> 包於
>>>>>>>> Can't finish the reading, I wish I were a giant
:::

### Local miss rate
* The miss rate of each level computed separately (Miss rate(L1), Miss rate(L2))
* L2's local miss rate is usually higher than L1's, since L1 has already filtered out the easy hits
> ChatGPT says L2's is higher [name=海熊]
> But your mom says L1's is higher [name=your mom]

### Global miss rate
:::info
$\frac{\text{accesses that reach the lower level}}{\text{total accesses}}$
:::
* Usually the product of each level's local miss rate
* global L2 miss rate = L1 miss rate x L2 local miss rate
* Lower means fewer accesses have to go all the way to memory
* If there is an L3: Miss Penalty(L2) = Hit time(L3) + Miss rate(L3) * Miss Penalty(L3). Otherwise, the penalty is the memory access time.

![29](https://hackmd.io/_uploads/rJ3EuaKMR.jpg)
> CPU base CPI = 1, clock rate = 4GHz
> Miss rate per instruction = 2%
> Main memory access time = 100ns

::::spoiler In case you, like me, forgot what CPU Time, CPI, and clock rate are
:::info
#### CPU Time = $I \times CPI \times T$
- $I$ = number of instructions in the program
- $CPI$ = average cycles per instruction
- $T$ = clock cycle time
- clock rate = $\dfrac{1}{T}$
[name=蛋蕉] who even is that
:::
::::

- **With just primary cache (L1)**
    - T $= \frac{1}{4\text{GHz}} = 0.25 \times 10^{-9}$ s/cycle
    - Miss penalty = $\frac{100 \times 10^{-9}}{0.25 \times 10^{-9}}=400$ cycles
    - Effective CPI = $1 + 0.02 \times 400 = 9$
- **Add L2 cache**
    > Access time = 5ns
    > Global miss rate to main memory = 0.5%
    - L1 miss, L2 hit
        - Penalty = $\frac{5}{0.25}=20$ cycles
    - L1 miss, L2 miss
        - Extra penalty = $400$ cycles
    - Effective CPI = $1 + 0.02 \times 20 + 0.005 \times 400 = 3.4$

### Effect of 2-level Caching
- L2 is usually much bigger than L1
- Provides a reasonable hit rate
- Lowers the L1 miss penalty
- But raises the overall miss penalty (a miss that goes all the way to memory now also pays for the L2 lookup)
- Inclusion property
    - Inclusive cache
        - L1 is a subset of L2: everything in L1 is also in L2
        - Simplifies the mechanism for keeping data consistent
        - When L2 replaces a block, the corresponding data in L1 must be invalidated
    - Exclusive cache
        - Data in L1 is not necessarily in L2
        - On an L2 miss, L1 must also be checked for a hit
- 2023 Final:
:::info
What is the difference between inclusive and exclusive cache policies in a multi-level cache hierarchy? List one advantage and one disadvantage of each approach. (7 points)

In an inclusive cache policy, the data present in a higher-level cache is also present in all lower-level caches. In an exclusive cache policy, the data present in a higher-level cache is not present in any lower-level cache.

**Advantage:**
* Inclusive cache: improves cache coherence handling and reduces the complexity of maintaining consistency between cache levels.
* Exclusive cache: maximizes effective cache capacity.

**Disadvantage:**
* Inclusive cache: requires more capacity because of duplicated data.
* Exclusive cache: increases the complexity of maintaining cache coherence.
:::

### Multiple-level Cache
**Inclusion property**
1. Inclusive cache: whatever is in L1 is also in L2; effective cache size = L2.
    * L2 is inclusive of L1: anything in L1 must appear in L2. Removing a block from L1 does not remove it from L2; removing it from L2 requires removing it from L1 too, if present.
    * Enforcing the inclusive property: when data in L2 is replaced, the copy in L1 must be invalidated.
![](https://hackmd.io/_uploads/S19DHBi4C.png)
2. Exclusive cache: whatever is in L1 is not in L2; when data in an L1 cache line is replaced, it moves down into L2; effective cache size = L1 + L2.
    * L2 is exclusive of L1: a block lives in only one of L1 and L2.
![](https://hackmd.io/_uploads/BkoorBo40.png =75%x)

## Interactions with Advanced CPUs
![image](https://hackmd.io/_uploads/HJobDJUrC.png =75%x)
An out-of-order CPU can keep executing instructions during a cache miss:
* pending store operations stay in the load/store unit
* dependent instructions wait in the reservation stations
* independent instructions continue executing
* the impact of a miss depends on the program's data flow
    * harder to analyze
    * use system simulation

## Caching Example
(not sure why this sits here)
![image](https://hackmd.io/_uploads/rJ3Ecy8BR.png =75%x)
![image](https://hackmd.io/_uploads/SJ_YokUHA.png =75%x)
> 175 in 2-way is supposed to be CF [name=嘟嘟]

## DGEMM (Double-Precision General Matrix Multiply) Access Pattern
Put frequently accessed data together in the same cache block, so you don't have to bounce between blocks.
![image](https://hackmd.io/_uploads/SJcX21mX0.png)
// I can't read this figure; someone kind please explain. 羅宣又, do something
// 羅宣佑, stop slacking. Do something
![image](https://hackmd.io/_uploads/SyUVnJQmA.png =75%x)
As the data size grows, the performance gap widens.
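The standard fix for the access pattern above is cache blocking (tiling): operate on BLOCK x BLOCK submatrices so the working set of the inner loops stays resident in the cache while it is reused. A minimal sketch, assuming row-major square matrices and a tile edge of 32 (both our own choices, not the course's):

```c
#include <stdio.h>

#define BLOCK 32 // tile edge; pick so three BLOCK*BLOCK tiles fit in cache

// C += A * B for n x n row-major matrices, processed tile by tile.
void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                // multiply one BLOCK x BLOCK tile pair
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int j = jj; j < jj + BLOCK && j < n; j++) {
                        double sum = C[i * n + j];
                        for (int k = kk; k < kk + BLOCK && k < n; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}

int main(void) {
    enum { N = 64 };
    static double A[N * N], B[N * N], C[N * N];
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    dgemm_blocked(N, A, B, C);
    printf("C[0] = %.0f\n", C[0]); // 64 * 1 * 2 = 128
    return 0;
}
```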
## Dependability
![image](https://hackmd.io/_uploads/ryxt3yQ7R.png =40%x)
![image](https://hackmd.io/_uploads/BySt3eUBC.png =40%x)

- **Reliability:** mean time to failure (MTTF)
    - Average time until a failure occurs (e.g. memory can't be read / returns wrong data); bigger is better
- **Service interruption:** mean time to repair (MTTR)
    - Average time needed to repair after a failure; smaller is better
- Mean time between failures (MTBF)
    - **MTBF = MTTF + MTTR**
- $\text{Availability} = \frac{\text{uptime}}{\text{total time}} = \frac{\text{MTTF}}{\text{MTBF}}$
- Improving availability
    - **Increase MTTF**: fault avoidance, fault tolerance (e.g. **RAID**), fault forecasting
    - **Reduce MTTR**: improved tools and processes for diagnosis and repair

![image](https://hackmd.io/_uploads/SkKY0yX7A.png =75%x)

## The Hamming SEC Code (ECC)
Given a stream of bits we read back, how do we know whether any bit has flipped?
Fault detection / correction for **RAM**
* Parity bit: detects a single-bit error
    * Check whether the sum of all bits is odd or even

### Hamming distance
The number of bit positions in which two bit patterns differ. (SEC/DED: Single Error Correction, Double Error Detection)

### Error Correction Code (ECC)
:::info
![image](https://hackmd.io/_uploads/S1xh6vim0.png)
:::
> Parity bit n covers every position whose binary representation has bit n set (parity bit 1 covers positions 1, 3, 5, 7, 9...; parity bit 2 covers 2, 3, 6, 7, 10...) [name=BendTail ]

#### First, insert the check bits (parity bits)
The check bits go at the positions that are powers of two ($2^n$)
:::info
![image](https://hackmd.io/_uploads/S1XIldsmR.png)
:::
> Leave the $2^n$ positions empty and fill the data bits into the remaining positions in order. XOR together the covered bits according to the rule above; each parity bit is chosen so the XOR comes out 0. [name=BendTail ]

#### Error Detection
Suppose bit 10 is flipped:
:::info
![image](https://hackmd.io/_uploads/SyYZXdjXR.png)
:::
> XOR the covered bits group by group as above; summing the group numbers whose result is 1 gives the position of the erroneous bit. Treat it as a single-bit error first and try to correct it; if an error remains afterwards, there were two or more errors, which can only be detected, not corrected. [name=BendTail ]

### Bits needed
p = total number of parity bits, d = number of data bits. The p parity patterns must distinguish an error in any of the p + d bit positions, plus one more pattern to indicate that no error exists:
$2^p \ge p + d + 1$
![image](https://hackmd.io/_uploads/B1IWql7mC.png)
> Why not use $\LaTeX$ [name=BendTail]
> I don't know how [name=pampered me]
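A minimal sketch of the scheme above for a single byte of data, i.e. Hamming(12,8): 8 data bits, 4 parity bits at positions 1, 2, 4, 8 (even parity, since $2^4 = 16 \ge 4 + 8 + 1$). `encode` scatters the data bits around the power-of-two positions, and `syndrome` recomputes the parity groups: a nonzero result is the 1-based position of a single flipped bit. The function names and bit ordering are ours:

```c
#include <stdint.h>
#include <stdio.h>

// Hamming(12,8): codeword positions 1..12; positions 1,2,4,8 hold parity.
// Parity bit p covers every position whose index has bit p set.

static uint16_t encode(uint8_t data) {
    uint16_t cw = 0;
    // Scatter the 8 data bits into the non-power-of-two positions
    int dpos[] = {3, 5, 6, 7, 9, 10, 11, 12};
    for (int i = 0; i < 8; i++)
        if (data & (1 << i)) cw |= 1u << dpos[i];
    // Choose each parity bit so the XOR over its covered positions is 0
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & p) && (cw & (1u << pos))) parity ^= 1;
        if (parity) cw |= 1u << p;
    }
    return cw;
}

// Returns 0 if no error, else the 1-based position of the flipped bit.
static int syndrome(uint16_t cw) {
    int s = 0;
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & p) && (cw & (1u << pos))) parity ^= 1;
        if (parity) s += p;
    }
    return s;
}

int main(void) {
    uint16_t cw = encode(0xA7);
    cw ^= 1u << 10;                           // flip the bit at position 10
    printf("syndrome = %d\n", syndrome(cw)); // prints 10
    return 0;
}
```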
:::danger
> ==Heads-up: the content from here until the Cache Coherence Problem is wildly incomplete, but I trust you've already covered it in OS, and I'm too lazy to finish it. Alas!== [name=BendTail ]
>> My OS is terrible = =
:::

:::info
~~Let's decipher the professor's scribbles together~~
![IMG_3606](https://hackmd.io/_uploads/SkY_jxmQ0.jpg =75%x)
:::

## Virtual Memory (important!)
* Uses main memory as a cache for secondary storage (disk)
* Each program gets its own private virtual address space, inaccessible to other programs (otherwise they'd fight)
* Managed jointly by the OS and CPU hardware, translating virtual addresses (VA) into physical addresses (PA)
> Covered in OS; this needs the MMU [name=a victim of that course]
* Data moves in units of pages (4 KB); a miss is called a page fault
* A page is the VM equivalent of a block

## Address Translation
Find the corresponding physical address through the page table; the page offset is 12 bits (4 KB = $2^{12}$)
![image](https://hackmd.io/_uploads/Byywy-77A.png =70%x)

## Page Fault
### Page Fault Penalty
* On a page fault, the data must be fetched from disk
* The latency is enormous (millions of cycles), ~~just like me, a slow child :(~~
* Heavy burden on the CPU; handled by OS code
* So the page fault rate must be kept as low as possible
    * Fully associative placement
    * Fancy algorithms can be used (pure LRU, pseudo LRU)

### Page Fault Handler
1. Use the VA to find the page's location on disk via the PTE
2. Choose a page to replace (if that page is dirty, write it back to disk)
3. Bring the page into memory and update the page table
4. Restart from the faulting instruction

## Page Tables
A one-dimensional array: VA goes in, PA comes out, ~~Kaohsiung strikes it rich~~
An array of page table entries, indexed by virtual page number
![IMG_3607](https://hackmd.io/_uploads/HyeVfZXQA.jpg =50%x)
* Page Table Register
    * A CPU register pointing to the page table's location in physical memory
* If the page is present in memory
    * the page table entry (PTE) stores the physical page number
    * plus other status bits (referenced, dirty, ...)
![image](https://hackmd.io/_uploads/H1PPsrsE0.png =90%x)
* If the page is not present in memory
    * the PTE can refer to a location in swap space on disk
![image](https://hackmd.io/_uploads/SkXCoriN0.png =90%x)

### Replacement and Writes
- To reduce the page fault rate, prefer least-recently used (LRU) replacement
    - The **reference bit** (a.k.a. use bit) in the PTE is set to 1 on access to the page
    - Periodically cleared to 0 by the OS
    - A page with reference bit = 0 $\implies$ not used recently
- Writing to nonvolatile storage (e.g. disk) is far too expensive
    - **`Write-through`** is impractical (too slow); use **`write-back`** instead (see the earlier Write Policy section)
    - The **dirty bit** in the PTE is set when the page is written

## Translation Lookaside Buffer (TLB)
With only a page table, every access would touch memory twice:
1. one access for the PTE (page table entry)
2. one access to the translated memory location

The TLB is essentially a cache for the page table; when the target is in the TLB, only one memory access is needed.
![image](https://hackmd.io/_uploads/BkJzDonVA.png)
![image](https://hackmd.io/_uploads/BJYGKi3VC.png)

### TLB Misses
Two possible cases:
1. The page is in the page table but not in the TLB
2. The page is not in the page table

On a TLB miss, the handler (hardware or software) copies the PTE into the TLB and restarts the instruction; if the page is not in the page table, a page fault occurs, and the instruction restarts after the page is loaded.
On a TLB miss, the page table is consulted; the page table tells you whether the data is in a physical page or on disk.

### TLB and Cache Interaction
![image](https://hackmd.io/_uploads/HyImpj240.png)
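A minimal sketch of the TLB-then-page-table lookup described above, for 32-bit VAs and 4 KB pages. The structures and the direct-mapped 16-entry TLB are our own simplifications (real TLBs are typically fully or set associative), and the page-fault path is stubbed out:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_BITS 12 // 4 KB pages -> 12-bit page offset
#define TLB_SIZE  16 // toy direct-mapped TLB (real ones are associative)

typedef struct { bool valid; uint32_t ppn; } PTE;
typedef struct { bool valid; uint32_t vpn, ppn; } TLBEntry;

static PTE      page_table[1 << 20]; // one entry per 20-bit VPN (sketch only)
static TLBEntry tlb[TLB_SIZE];

// Translate a virtual address: try the TLB first, then walk the page table.
// Returns false on a page fault (the OS handler would run here).
bool translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_BITS;
    uint32_t off = va & ((1u << PAGE_BITS) - 1);

    TLBEntry *e = &tlb[vpn % TLB_SIZE];
    if (e->valid && e->vpn == vpn) {        // TLB hit: one memory access
        *pa = (e->ppn << PAGE_BITS) | off;
        return true;
    }
    if (!page_table[vpn].valid)             // not in page table: page fault
        return false;
    e->valid = true;                        // TLB miss: copy the PTE in
    e->vpn = vpn;
    e->ppn = page_table[vpn].ppn;
    *pa = (e->ppn << PAGE_BITS) | off;
    return true;
}

int main(void) {
    page_table[5].valid = true;             // map VPN 5 -> PPN 42
    page_table[5].ppn = 42;
    uint32_t pa;
    if (translate((5u << PAGE_BITS) | 0x123, &pa))
        printf("pa = 0x%x\n", pa);          // 0x2a123
    return 0;
}
```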
## Memory Protection
Different tasks may share the same VA range.
The OS must protect it so illegal accesses cannot succeed.
Some hardware assists the OS with protection:
* Kernel mode
* Privileged instructions
* Page tables accessible only in kernel mode
* System calls

## Cache Design Trade-offs
Pros and cons; feels very examinable
![image](https://hackmd.io/_uploads/SkGznMUH0.png =75%x)

## Cache Controller FSM
Basically a flow chart; worth a look
This cache is direct-mapped, write-back, write-allocate
![image](https://hackmd.io/_uploads/rk5NaGUrA.png =75%x)

## Linear Page Table
> I have no idea what he wrote here [name=海熊]
> Let me try to salvage it [name=BendTail ]

Usually each process has one page table.
The number of entries follows from the page size and the number of address bits;
page table size = number of entries x entry size.
:::info
Assume page size = 4 KB = $2^{12}$ bytes and 32-bit addresses:
number of entries = $2^{32-12}$ = $2^{20}$
:::
![image](https://hackmd.io/_uploads/SyRnQUBVC.png)
Entry address = Base + PTE size * PTE number
![image](https://hackmd.io/_uploads/H1PQdUBEA.png)
The offset width comes from the page size.
PTE address = Base + PTE size * VPN

## Multi-level Page Tables: Page Directory
Page Directory
* The page directory contains one entry per page of the page table.
* It consists of a number of page directory entries (PDEs).
* A PDE has a valid bit and a page frame number (PFN).
* The PFN points to the next-level page table.
![image](https://hackmd.io/_uploads/SyqZtUHNA.png)
I can't make any sense of this

----------

## Cache Coherence Problem
![image](https://hackmd.io/_uploads/HkOcMlmHC.png =75%x)
If every processor has its own cache:
Follow the steps in the figure above. By step 3, u has been modified, but with write-back the new value only reaches memory when the block is replaced, so until that write-back the other processors' caches keep getting the stale value.

### Approaches
* Migration of data to local caches
    * Reduces bandwidth demand on shared memory
* Replication of read-shared data
    * Reduces contention for access

### Snooping Protocols
When a shared memory location is cached in multiple caches' cachelines at once, the bus exposes the state of the data so every cache can snoop and update its own state and copy, ensuring all caches read and write a single authoritative value.

#### Write Invalidate
* Multiple readers, one writer
* When one processor modifies shared data, the copies in other caches are marked invalid (accessing them again causes coherence misses)
* On such a miss: with write-through, the data can be read straight from memory; with write-back, every cache must be snooped to find the most recent copy

#### Write Broadcast
* Usually paired with write-through
* When one processor modifies shared data, the new value is broadcast to the other caches so they update their copies

#### Write Serialization
All processors see writes to the same location in the same order (the bus orders them).

#### Write through
![image](https://hackmd.io/_uploads/SkgRjgmHC.png)

#### Write back
* Uses a read miss to trigger the write-back
![image](https://hackmd.io/_uploads/S1nMneXr0.png)

### Directory Based Schemes
Directory-based protocols don't keep nodes in sync through the system bus; instead, a directory tracks the situation of each cache block, including its current state and which nodes are currently sharing it. This avoids broadcasting signals over the system bus to all nodes: messages go only to the nodes that care about the data, lowering the memory bandwidth cost.

# Ch.6 Parallel Processors From Client to Cloud
## Hardware and Software
### Hardware
* Serial (cannot run in parallel)
* Parallel (can run in parallel)

### Software
* Sequential (not parallelizable)
* Concurrent (parallelizable)

**Sequential** software can run on **serial** hardware.
**Concurrent** software can run on **parallel** hardware.

## Parallel Programming
The hard part is the parallel software

### Difficulties
* Partitioning
* Coordination
* Communication overheads

## Amdahl's Law (we meet again)
The part that cannot be parallelized limits the speedup
:::info
![image](https://hackmd.io/_uploads/S1nIkWXrA.png)
:::

----------------

# Exam Format
True/false
Short answer (e.g., the effects of enlarging the cache block)
Calculation
Cache management (direct-mapped cache, associative cache) (snooping)
(snooping and directory may both appear in true/false and short answer)