計算機組織 - HackMD

###### tags: `資工系課程` `計算機組織` `Computer Architecture` `CSnote` # 計算機組織 [可參考網站](https://chi_gitbook.gitbooks.io/personal-note/content/zu_he_yu_yan_yu_ji_suan_ji_zu_zhi.html) ## CH1 ### CPI ![](https://i.imgur.com/qjqKVuh.png) CPI：每個指令所需要平均的Clock cycle數（意思就是每條指令花費的時鐘周期數）總共有n類指令，總和下方的I應改為i CPI乘上cycle time就是執行時間 **CPU Time = CPU Clock Cycles * Clock Cycle Time = CPU Clock Cycles / Clock Rate** 說明: CPU time 可以視為是一個程式在CPU裡面執行了多少個 clock cycle 再乘上每一個clock cycle它實際花了多少的時間而在這邊clock cycle其實就是一個週期又因為我們知道週期跟頻率其實互為倒數的關係所以CPU time也可以定義為: **CPU的clock cycles 除以 clock rate (CPU的頻率)** 這裡的clock rate 其實就是CPU的頻率 clock frequency ![](https://i.imgur.com/Yh7aWeX.png) ![](https://i.imgur.com/usJzeuy.png) （上排的數據名稱跑掉了）計算思路：先計算原先CPI總和=2.2 第一題：0.5 + 2　* 0.2 + 0.3 + 0.4 = 1.6 ， 2.2/1.6=1.375，快了37.5% 第二題：0.5 + 1 + 0.3 + 0.2 = 2 ， 2.2/2=1.1，快了10% 第三題：0.25 + 1 + 0.3 + 0.4 = 1.95 ， 2.2/1.95=1.128，快了12.8% ![](https://i.imgur.com/DB2Flqy.png) 計算思路：第一個code sequence的total cycle數量為2 * 1+1 * 2+2 * 3=10 CPI為10/5(指令數)=2 第二個code sequence的total cycle數量為4 * 1+1 * 2+1 * 3=9 CPI為9/6(指令數)=1.5 而cpi代表每個指令平均的cycle數，因此數字越大越慢，可知第一個比第二個慢2/1.5=1.3333倍 ### MIPS ![](https://i.imgur.com/jNgHbwz.png) MIPS：每秒執行百萬條指令數（意思就是每秒cpu能算幾個百萬的指令） MIPS=指令數量/(執行時間*10^6)=指令數量/(clock cycle數 * 週期(也就是每次所需時間)*10^6)=指令數量/[(clock cycle數/f) * 10^6]=f/ (CPI * 10^6) 補充:CPI：每個指令所需要平均的Clock cycle數 [MIPS參考資料](https://www.bilibili.com/read/cv10959808) 計算思路：可先算CPI再算MIPS 第一個total cycle數量為5 * 1+1 * 2+1 * 3=10，CPI=10/7=1.428 MIPS=4 * 10^9 /[ (10/7) * 10^6) ]=2.8 * 10^3（慢）第二個total cycle數量為10 * 1+1 * 2+1 * 3=15，CPI=15/12=1.25 MIPS=4 * 10^9 /[ (15/12) * 10^6 ]= 16/5 * 10^3 = 3.2 * 10^3（快）執行時間：算法一：執行時間＝CPU Clock Cycles / Clock Rate（頻率）第一個：(5+2+3) * 10^6/ 4 * 10^9=10/4000=1/400 第二個：(10+2+3) * 10^6/ 4 * 10^9=15/4000=1.5/400 算法二：因MIPS表示指令數量/(執行時間*10^6)，因此(MIPS倒數/10^6)*指令數量就是執行時間第一個：(1/2800) * 7 * 10^6 /10^6=1/400(快) 第二個：(1/3200) * 12 * 10^6 / 10^6=3/800=1.5/400(慢) 兩種答案不同，表示MIPS雖然比較容易算，但可能與執行時間不同。也就是說MIPS不一定能代表執行時間的快或慢 ### Amdahl's Law ![](https://i.imgur.com/fXe7EBP.png) 題目解釋：假設跑一程式需要100秒，在乘法的部分占了80秒。要怎麼改善乘法的部分讓程式快四倍？計算思路：設speedu為x倍，100/4=80/x+20 => 5=80/x => x=16 其餘參考例題： [Amdahl's Law](https://chi_gitbook.gitbooks.io/personal-note/content/amdahls_law.html) ### Speedup Fraction（enhanced）升級比例：整個系統改善的部分，那個部分占了整體的多少？ - if 20 seconds is enhanced in a program that takes 60 seconds in total, the fraction is 20/60 Speedup（enhanced）升級加速比: 改善的部分能夠改善幾倍？ - If the enhanced mode takes 2 seconds for some portion of the program that can completely use the mode, while the original mode took 5 seconds for the same portion, the improvement is 5/2 ![](https://i.imgur.com/YTWOjcF.png) Speedup(overall): ![](https://i.imgur.com/qLyvZAT.png) ![](https://i.imgur.com/4CYhTYo.png) ![](https://i.imgur.com/93Lfh4r.png) - Example2-1： amount為5，Affected為(10/2=5)秒，unaffected為10-5=5秒，Execution Time After Improvement為5+5/5=6秒，speedup(overall)為10/6。 - Example2-2： speedup(overall)為3，原執行時間：100秒，求浮點指令必須佔多少執行時間？　新執行時間為100/3秒，設佔了x秒，100/3 = x/5 +(100-x) => 4/5x = 200/3 => x= 250/3 ![](https://i.imgur.com/YmqBT2e.png) MIPS：Ａ＝4 * 10^9 /(1 * 10^6)=4000(快) B =4 * 10^9 /(1.1 * 10^6)=4000/1.1(慢) 執行時間：Ａ＝10 * 10^9 /4 * 10^9 = 2.5(慢) Ｂ＝8 * 10^9/4 * 10^9=2(快) 電腦B比較快 ### SPEC [SPEC基準](https://www.easyatm.com.tw/wiki/SPEC%E5%9F%BA%E6%BA%96) ## CH2 ### Fetch（取指）從memory抓取出來 * 控制器將指令的地址送往存儲器 * 存儲器案給定的地址讀出指令內容送回控制器 ### Execute（執行） * 控制器從通用暫存器或是存儲器取出操作數據 * 控制器命令運算器對操作數進行指令規定的運算 ### ISAs指令集架構 #### Stack ![](https://i.imgur.com/R4lMdUh.png) 到堆疊的頂端去取出運算元，至運算結束時，再儲存到堆疊的頂端兩個input進入ALU進行運算後疊到最上方（first in last out），所以與memory無關 #### Accumulator ![](https://i.imgur.com/lbX4h6J.png) 一個input來自register，一個來自memeory，output會放回暫存器。有累積運算的感覺 #### Register-memory ![](https://i.imgur.com/ptdNtsA.png) input一個來自register file（暫存器堆），另一個來自memeory，output回去register file，但沒有特別指定哪一個，任一個都可以。會影響效率，但比較有彈性及沒有限制。 #### Register-register／load-store ![](https://i.imgur.com/C1MdL5s.png) 兩個input都來自register file，但沒有限制是哪一個，output存回register file，通常三個input及output不會是同一個，但還有是有這個機會。**而memory從頭到尾都沒用到。** 變數在運算前要先放在register，這動作稱為load，而計算完後放回register稱為store **performance最差是Ａ，再來是Ｂ＜Ｃ＜Ｄ** #### 四種不同運算指令 ![](https://i.imgur.com/Azlc3MY.png) 註：stack中的tos為top of stack，gernel的EA為memory address，load的$代表register的記號 ![](https://i.imgur.com/AKv4GEu.png) Load $r1,A，就是從memeory中把A放入r1暫存器 ### 硬體運算 ![](https://i.imgur.com/3xIDaUv.png) 長度都是32bits，語法都是一個運算子三個運算元 ![](https://i.imgur.com/hNTDOSB.png) ![](https://i.imgur.com/9kn3wuX.png) 註：陣列的初始地址+$t1暫存器就是Ａ[i] 的地址，記得＊４ lw為load the word at address into register(從memory讀到暫存器)把A[8]的值放到t0這個暫存器裡面 sw為store the word in register to address(從暫存器寫入到memeory)把t0這個暫存器裡面的值移動到記憶體A[12] ### Representing Instructions instruction是由32個bit所組成而每一個instruction word 我們將會把它分成好幾個不同的"fields"不同的欄位這些欄位會分別的告訴processor我們需要處理的事情在MIPS的指令集裡面我們依據切割不同欄位的方法將指令分為三個不同的format "R-format, I-format以及J-format" #### R-format ![](https://i.imgur.com/wmDDvUW.png) **opcode** 這個opcode它對應的這個指令叫"做的事情是什麼" > 注意 : 在MIPS的指令集裡面"所有的R-format指令它的opcode都為0" **function** > 在MIPS的指令集裡面由於R-format的opcode都為0 > 因此我們需要搭配了0的opcode再加上function > 才能夠決定這個R-format的指令要做的事情是什麼 **rs (source register)** 這個指令所要處理的第一個operand **rt (target register)** 這個指令所要處理的第二個operand **rd (destination register)** 這個指令當它處理完成之後結果所需要放入的register **shamt (shift amount)** 提供給shift的指令來使用的 > 我們都知道在MIPS裡面一個register總共也就只有32個bit 當我們要shift的數量超過31個bit，其實是沒有任何意義的，因為所有的資料都shift光了。這也就是為什麼我們總共只需要用5個bit，就可以明確的去描述我們所需要shift的個數 **opcode與function之所以不合併為12bits，是為了配合I-format以及J-fotamt可以對齊，如果三種格式opcode都是一樣，抓取比較方便。** #### I format ![](https://i.imgur.com/7rHiF75.png) **rs (source register)** 這個指令所要處理的operand **rt (target register)** 這個指令當它處理完成之後結果所需要放入的register **immediate** > 有負號記的做 2's complement > immediate這個field是由16個bit所組合而成的 > 也因此它可以表示2的16次方種不同的數值 > 這些常數在實際處理的時候將會進行sign-extention成為32個bit > 於是就可以跟register一起做運算例如 : addi (add immediate) > addi $29,$29,4 > 把29號暫存器裡面的值跟這個4 也就是常數相加 > 加完的結果再把它放回到29號暫存器裡面 ![](https://i.imgur.com/kDFGmwJ.png) ![](https://i.imgur.com/VUkarVh.png) **若是immediate過大超過16bits怎麼辦？** ![](https://i.imgur.com/ac6vuOc.png) 註：$at是用來處理這種狀況的register lui 載入到Address的上半部, 從31bit 到 16 bit。剩下後半是0 ori 就是做or，又因0xCDCD是16bits，跟$at做or就是跟他後半部全部是0的地方做or，所以維持原狀。如此一來，暫存器at 才擁有完整的 32-bit address **問號？？** ![](https://i.imgur.com/em0b0lj.png) 0x表示為16進位因此0xFFFF換成2進位為1111 1111 1111 1111 是16bits 而0x12345678則為32bits 因此0xFFFF無法碰觸到0x12345678中的前16bits。若是將0x12345678轉成2進位，會有些是0，而0xFFFF都是1，所以or之後一定是全部都是1，也就跟原先一樣。所以算出來就是0x1234FFFF ### shift instructions ![](https://i.imgur.com/U1u8KWk.png) sll：往左移，所以最高位元會被移出去，最右邊補0 #### sra例子 ![](https://i.imgur.com/8J7PLxW.png) 不一定會補0，要看最高位元是0或是1 ### Decision Instructions ![](https://i.imgur.com/ICHMbcO.png) ### Goto Instruction ![](https://i.imgur.com/3KLhYMx.png) ### Inequalities in MIPS ![](https://i.imgur.com/VUCQxrF.png) ### Immediate in Inequalities ![](https://i.imgur.com/9aA3Vrt.png) **問號** ![](https://i.imgur.com/nv7lGIB.png) ![](https://i.imgur.com/ZyZdzuH.png) ### J-Format Instructions ![](https://i.imgur.com/DrpiUyq.png) - 存儲的地址 last two bits are always 00 (in binary) 地址 = (addrFromLabelTable/4) 的低 26 位 - specify 28 bits of the 32-bit bit address **舉例：** jump j 10000 （go to 10000 26-bit+4-bit of PC） ### MIPS Jump, Branch, Compare ![](https://i.imgur.com/hg0Qsr8.png) 問號的是PC+4+100? ### Data Transfer: Memory to Register **例1：lw $ t0,12($ s0)** Take the pointer in $s0, add 12 bytes to it, and then load the value from the memory pointed to by this calculated sum into register $t0 **12為偏移量，$s0為base register** **例2：lw $ t0,32($s3) # $t0 gets A[8]** ### memory alignment [參考資料：關於記憶體對齊(Alignment)](http://opass.logdown.com/posts/743054-about-memory-alignment) [備用資料](https://codingnote.cc/zh-tw/p/306517/) ### endian [參考資料：位元組順序](https://zh.wikipedia.org/wiki/%E5%AD%97%E8%8A%82%E5%BA%8F) ### An Example of Passing Arguments ![](https://i.imgur.com/pcA5v75.png) a0對應i a1對應j，所以一開始會先做i+j n是第六個參數 sp+20，已經被load到t0 v0會把結果傳回去註:圈圈本來要圈到20($sp) ### Registers Conventions for MIPS 表格 ![](https://i.imgur.com/pbLj0iK.png) ### MIPS指令參考 ![](https://i.imgur.com/PYehpAP.png) ### object file 目的檔（英語：object file）即存放目的碼的電腦檔案，它常被稱作二進位檔案（binaries）。 Unix中主要有這六種檔案 - Object file header - Text segment - Static data segment - Relocation information Relocation information - Symbol table - Debugg g in information ### Linker 鏈結器的工作就是解析未定義的符號參照，將目的檔中的預留位置替換為符號的位址。鏈結器還要完成程式中各目的檔的位址空間的組織，這可能涉及重定位工作。 - 執行檔要用到的code跟data擺在一起並且linked成絕對位置的執行檔 - 決定所有data跟label的位置 - 內部的參數或是references也在裡面 ### Loader - 讀取執行檔的header以決定要分配多少的memory給text跟data - 製造text跟data位址的空間 - Copy the parameter to main program onto the stack - Initialize the machine registers and set the stack pointer to the first free location - Jump to the start Jump to the start-up routine ### DLL 動態連結函式庫（Dynamic-Link Library）不需要把執行檔所要的library或是.o檔都link在一起，並且放到memory中。而是要用到的時候再load進來使用。 ### C to MIPS ![](https://i.imgur.com/UyGPZ7D.png) ## CH3 ![](https://i.imgur.com/3Fbs3II.png) １０１０＝-２^3+2=-6 ### Functional Specification of ALU ![](https://i.imgur.com/xGxGgve.png) ### Multiply Algorithm（version 1） ![](https://i.imgur.com/6En2Ut0.png) ### Multiply Algorithm（version 2） ![](https://i.imgur.com/Zcolh7w.png) ### Booth’s Algorithm [Booth’s Multiplier Booth演算法](http://alex9ufoexploer.blogspot.com/2013/12/8-bit-booths-multiplier-booth.html) ![](https://i.imgur.com/5kxXmWH.png) ![](https://i.imgur.com/E0Z2cx7.png) ![](https://i.imgur.com/bRc8mWw.png) ### Divide Algorithm(version 1) [Division](https://chi_gitbook.gitbooks.io/personal-note/content/division.html) 嘗試著去把除數拿去減掉被除數當被除數拿來減的部分是大於除數的在這樣的情況下我們就會知道我們的商會大於 0 然後接下來我們就把商放上去把值減完了之後剩下來的這個部分的餘數會再跟被除數裡面比較低的位元重新結合在一起然後繼續的去跟除數做比較如果我們發現沒有辦法減下去也就是說當我們完成這個減法的時候發現減法得到的結果是負的值那這時候我們必須要把它的值還原回去然後把它的商填上0 依此類推直到最後一個位元 ![](https://i.imgur.com/XgGiQtE.png) ### Divide Algorithm(version 2) ![](https://i.imgur.com/fV9IFYD.png) ### 浮點數表示法 ![](https://i.imgur.com/JljQ6Wk.png) ### Floating point to Decimal ![](https://i.imgur.com/L73EQdX.png) ### Decimal to FP ![](https://i.imgur.com/9oNLlm9.png) ![](https://i.imgur.com/NgO0KZ9.png) ### Special Numbers ![](https://i.imgur.com/3EQfXIg.png) ## CH4 ### Data path for Branch Operation ![](https://i.imgur.com/frTePLm.png) 之所以會shift left 2 是因為下面的sign extend是word element，單位要一樣所以要*4 有點問號 ### Control Unit [參考](https://hackmd.io/@8bFA57f7SRG-K7AAT8s62g/ryv1NT3S#Control-Unit) 問號 ### Designing the Main Control ![](https://i.imgur.com/ZfI6vPP.png) ### Summary of Control bits ![](https://i.imgur.com/RBmFHE0.png) ### Implementation of ALU Control Block 真值表跟後面的ｏｒ跟ａｎｄ怎麼畫的問號 ![](https://i.imgur.com/aC0SPqy.png) ### Example: Fixed-Period Clock vs. Variable-Period Clock in a Single in a Single-Cycle Implementation ![](https://i.imgur.com/UspMgZV.png) 上方為CPU執行function的時間，下方是指令的百分比 (a)是計算single clock使用fixed-period clock的時間(b)是計算指令使用variable-period clock的時間 **解答：** ![](https://i.imgur.com/UaTCYqd.png) 數字單位為ns，是參考題目中上方給的時間，有使用才列出 ### Multicycle Approach: Design Principals ![](https://i.imgur.com/go5kB6s.png) **將指令分解為步驟** - 每個步驟佔一個clock cycle - **balance**：每個步驟/週期中要完成的工作量是大致相等 - **simple**：限制每個週期每個主要functional unit最多使用一次，以便不必複製這些單元 - **minimize the cost**：functional unit可以在一條指令內的不同周期之間共享 **cycle／steps之間** - 在一個週期結束時儲存數據（會使用暫存器）以供以後的周期使用同樣的指令 - 後面指令使用的數據儲存在工程師可見的狀態元素中：register file、PC、memory ### Partition Single-Cycle Datapath ![](https://i.imgur.com/q3kkSDH.png) 由紅色線分隔，分成五塊，由左到右分別為 1. Instruction fetch and PC increment (IF) 2. Instruction decode and register fetch (ID) register file的部分 3. Execution memory address computation or branch completion(EX) ALU的執行、memory address計算、branch判斷 4. Memory access or R-type instruction completion (MEM) 5. Memory read completion (WB) 寫回register file **not all instructions require all the steps** **each step takes one clock cycle** ![](https://i.imgur.com/3Ax8utA.png) ![](https://i.imgur.com/yhVWKNa.png) 會多這些暫存器（紅圈部分）當成暫時儲存data用，以便接力給下個cycle ### Example: CPI in a Multicycle CPU ![](https://i.imgur.com/fCTmwgi.png) 下方5或4或3的數字是看上面那五個步驟會需要幾個，而一個步驟又分別佔一個cycle 雖然single cycle的CPI只有1，但clock time長，所以multicycle還是有改善 ### Summary of Instruction Execution ![](https://i.imgur.com/dubbiHn.png) ### Single-, Multi-Cycle, vs. Pipeline ![](https://i.imgur.com/nX5ZlGK.png) ### Problems with Pipelining MIPS [Pipeline Hazard](https://king0980692.medium.com/computer-architecture-cheat-sheet-pipeline-hazard-ee27d0d66e89) * Structural hazards：像是碰到都需要加法器的指令，同時都進到CPU，就會有一個一定要等待硬體資源不夠多而導致同一時間內要執行的多個指令無法執行。 * Control hazards：會出現在branch指令，如果branch結果要跳到一個很遠的指令（不是branch下一行的指令），會變成這個branch下一行指令已經進入CPU了，可是要執行的不是他，就需要清除，再把branch結果要執行的指令fetch進CPU，這些浪費的cycle就是Control hazards * Data hazards：前面指令執行的結果要給後面的指令用，可是指令執行的結果要等到第五個stage才會寫回到register。還沒寫進去前，後面的指令就不能用，就會空轉。當一個instruction必須參考先前instruction的執行結果，但是先前的instruction卻還在pipeline中沒執行完，就會發生data hazard。 ### Control hazards解決方法 **解法一：Stall the pipeline 就讓它空轉（bubble）** ![](https://i.imgur.com/JgPuE86.png) 如果知道是一個branch instruction，則暫停直到正確的condition知道後才開始下一個instruction。 **解法二： Predict branch outcome 預測結果** ![](https://i.imgur.com/Gbqleir.png) 事先預測branch instruction不會發生，如果預測結果錯誤則放棄已經進入pipeline的指令而fetch branch target的instruction進入pipeline，如果預測成功就可以快速順利的運作下去。下面是預測不成功，就要清空 **解法三：Delayed branch** ![](https://i.imgur.com/Lnu0V8A.png) 在branch instruction之後先執行下個合適的指令，但是必須確保這個指令是安全指令，不會影響program的執行結果才行。 ### Data hazards解決方法 **解法一：Forward data if possible** ![](https://i.imgur.com/TSJa8qE.png) 不用等到前一個instruction執行到最後一個stage，在ALU算出result後就將它送入下個instruction中當作input。 excution之後資料已經產生了，所以照著紅線箭頭內部傳回去就來得及讓下一道指令用。不過有的時候forwarding來不及，因為中間隔的時間比較久，像是要等到執行完memory之後才行，所以就只能進行等待。 ![](https://i.imgur.com/gYgAiua.png) **解法二：Reordering Code to Avoid Pipeline Stall（Software Solution）** ![](https://i.imgur.com/XfkQAxi.png) ### Pipelined Datapath ![](https://i.imgur.com/yb0L4DM.png) 這是有誤的，(待補) ### Corrected Datapath ![](https://i.imgur.com/Yf7b1jk.png) #### 會考！ ![](https://i.imgur.com/3Cli57u.png) 這有Structural hazard嗎？教授說是紅圈的部分如果連續做四個ｌｗ就會有　wr跟reg/dec會有衝突 ![](https://i.imgur.com/C8vgd9I.png) ？這就可以 ![](https://i.imgur.com/PVFYlpB.png) 結論：教授唬爛 ### R-type如果只用4個stage 如果pipeline會與lw衝突，所以一定要多加memory這個stage ![](https://i.imgur.com/wusZAiT.png) ![](https://i.imgur.com/AEf5ZxA.png) ### Graphically Representing Pipelines 考題： ![](https://i.imgur.com/zJcuUMw.png) ### Pipeline Control Implementation Control signal會跟著移動 ![](https://i.imgur.com/bADTuQQ.png) ![](https://i.imgur.com/zszg5wd.png) ### Example ![](https://i.imgur.com/KFnsX7K.png) ![](https://i.imgur.com/JIWptKK.png) ![](https://i.imgur.com/7BAJ0oj.png) ![](https://i.imgur.com/nrXHLyZ.png) ![](https://i.imgur.com/ZY8qpVn.png) ![](https://i.imgur.com/6uZqYyx.png) ![](https://i.imgur.com/Ziz8Lf6.png) ![](https://i.imgur.com/4Mnq5B0.png) ![](https://i.imgur.com/uCgX3XI.png) ![](https://i.imgur.com/46JTWP5.png) ### Types of Data Hazards (inst. i1 followed by inst. i2) i1先，i2後 - RAW (read after write) 在i1寫入前i2就進行read - WAR (write after read) i1 read之前i2就進行寫入 - WAW (write after write) i1寫入之前i2就進行寫入只有RAW會真正造成Data Hazard，所有RAW的data hazard又稱為True data dependency，其餘稱為False data deapendency ![](https://i.imgur.com/MmaNTXe.png) #### Software Solution **插入nop** nop = “no operation” = 00…0 (32bits) = sll $0, $0, 0 右邊是插入足夠的nop指令使得有Data Hazard的指令間隔差距為2以上，但nop會使得performance變差。左邊是插入兩個不會影響到program的指令（且最終也會執行），使得程式不會浪費cycle，對performance有幫助。 ![](https://i.imgur.com/V5o19Dk.png) #### Hardware Solution: Forwarding 簡單而有效的方式，又稱為bypassing，是利用資料提前的方式，讓需要計算的資料提早送達EX或MA之前步驟： 1. Detect data hazard data hazard 2. Forward intermediate data to resolve hazard ##### Hazard Detection ![](https://i.imgur.com/f9DyqIz.png) 對照圖：上方的＝可以看成link，是說明如果用forwarding可以怎麼連接 ![](https://i.imgur.com/nG0ninj.png) forwarding後： ![](https://i.imgur.com/PdxV35q.png) 加上forwarding hardware： ![](https://i.imgur.com/1N6E3CZ.png) ![](https://i.imgur.com/brTlmxg.png) forwarding rules： ![](https://i.imgur.com/5pzwfhK.png) 紅字是說前面就可以forwarding不用等到後面才做 ![](https://i.imgur.com/IiroUlh.png) ### pipeline branch 這是在講control hazard中若預測錯誤的情況會浪費兩個cycle ![](https://i.imgur.com/EbfuiOX.png) ### branch prediction方式 #### Dynamic prediction: 1-bit Predictor 會利用Branch prediction buffer or branch history table（BHT）查表方式看過去碰到這個branch是否會成功，如果預測錯就invert ![](https://i.imgur.com/v4hfvFy.png) #### Dynamic prediction: 2-bit Predictor 連續預測錯兩次才會改，11和10表示這條分支會跳轉；01和00表示分支不會跳轉 ![](https://i.imgur.com/1NfgIjW.png) ### Correlating Predictors [Dynamic Branch Prediction](http://ece-research.unm.edu/jimp/611/slides/chap4_5.html) ![](https://i.imgur.com/90Lgbmy.png) 第三個branch如果與前面兩個有關，就可以透過參考前面兩個branch預測出第三個branch結果 ### Handling Exceptions ![](https://i.imgur.com/NrWGh4F.png) ![](https://i.imgur.com/xfvrOMK.png) 會有cause紀錄原因，except PC紀錄發生位置 ## CH5 [參考資料](https://blog.csdn.net/weixin_44437187/article/details/86922281) ### Levels of Memory Hierarchy ![](https://i.imgur.com/l654RkN.png) ### Hits and Misses **Write hits** - Write-through：CPU向cache寫入數據時，同時向memory也寫一份，使cache和memory的數據保持一致。優點是簡單，缺點是每次都要訪問memory，速度比較慢。 - Write-back：cpu更新cache時，只是用dirty bit把更新的cache區標記一下，並不同步更新memory。直到該cache區要被新進入的數據取代時，才更新memory。這樣做的原因是考慮到很多時候cache存入的是中間結果，沒有必要同步更新memory。優點是CPU執行的效率提高，缺點是實現起來技術比較復雜，且資料沒辦法每次保持最新。 **Write misses** - Write-allocated：將寫入位置讀入cache，然後採用write-back操作。也就是把要寫的地址所在的塊先從main memory調入cache中，然後寫cache。缺點是low miss rate, complex control - Write-non-allocate：並不將寫入位置讀入cache，而是直接將數據寫入memory。high miss rate, easy control, match，with write-through ### Avoid Waiting for Memory in Write Through ![](https://i.imgur.com/RAQbaQI.png) **Use a write buffer (WB)** - Processor: writes data into cache and WB - Memory controller: write WB data to memory **Write buffer is just a FIFO** - Typical number of entries: 4 ### Exploiting Spatial Locality [Spatial Locality](https://www.sciencedirect.com/topics/computer-science/spatial-locality) ### block size與miss rate的關係在同樣的cache size下，如果提昇block size，會降低miss rate，因為你提昇了spatial locality。但是如果你無限制的提高block size，反而會導致cache內的總block數太少（high competition）。另一個提高block size會造成的問題是miss penalty變大，因為一旦miss，你須要轉移更多的記憶體內容。 ![](https://i.imgur.com/SGLoZVy.png) <font color=red size=5px>**必考：Average access time= hit time x (1 - miss rate)+miss penalty x miss rate**</font> ### Example ![](https://i.imgur.com/xhBbHce.png) ![](https://i.imgur.com/3OHCps2.png) ![](https://i.imgur.com/2WnGj15.png) ### Improving Cache Performance by Increasing Bandwidth ![](https://i.imgur.com/JirCjg7.png) ### Performance CPU time = (execution cycles + memory stall cycles) × cycle time memory stall cycles = memory accesses × miss rate × miss penalty #### 兩種方式讓cahcce performance變好： - 減少miss rate - 減少miss penalty ### Example ![](https://i.imgur.com/eAL3tLj.png) ![](https://i.imgur.com/VddyEIC.png) 第二題自己算 ![](https://i.imgur.com/8esIS2S.png) ### Recap: Four Questions for Hierarchy Design <font color=red size=5px>必考</font> ![](https://i.imgur.com/1GaKJ62.png) ### different types of misses #### Capacity miss 因為容量（cache size）不足產生的miss Cache cannot contain all blocks by program - Solution：increase cache size #### Compulsory miss (cold start, process migration) 第一次access因為裡面是空的（code start）或是裡面雖然有資料，但是是上一個contest的？（process migration） - Solution: increase cache block size #### Conflict miss 在direct map中因為tag一樣，但是能放的data有限，所以會產生這問題 more than 1 memory blocks mapped to same location - Solution 1: increase cache size - Solution 2: increase associativity（一排有好幾個位置） ### Decreasing Miss Rates with Associative Block Placement [重要參考資料](https://hackmd.io/@sysprog/HkW3Dr1Rb?type=view#Cache-%E7%9A%84%E8%A8%AD%E8%A8%88%E6%A8%A1%E5%BC%8F) #### Implementing Fully Associate Cache 來不及補筆記（參閱ｐｐｔ) ### Replacement of cache block **Random** **LRU (Least Recently Used)** Hardware keeps track of the access history and replace the block that has not been used for the longest time（記錄誰最久沒被用到） An example of a pseudo LRU（記上次用到哪）： 1. use a pointer pointing at each block in turn 2. whenever an access to the block the pointer is pointing at, move the pointer to the next block 3. when need to replace, replace the block currently pointed at ### Example ![](https://i.imgur.com/f1jvKnu.png) ![](https://i.imgur.com/RDdZB4R.png) ![](https://i.imgur.com/gCi9tUP.png) ![](https://i.imgur.com/UNxFhBs.png) ### Reduce Miss Penalty with Multilevel Caches **Add a second level cache:** ![](https://i.imgur.com/ew4hm48.png) Average access time： = L1 hit time + L1 miss rate * L1 miss penalty L1 miss penalty： = L2 hit time + L2 miss rate * L2 miss penalty #### Example ![](https://i.imgur.com/P5XNm8C.png) ### Cache Design Trade-offs ![](https://i.imgur.com/PgN9M8s.png) ### Page Fault: What Happens When You Miss? when hardware detect page fault situation, hardware must trap to the operating system so that it can remedy the situation - Pick a page to discard (may write it to disk) - Load the requested page in from disk - Update the page table - Resume to program so HW will retry and succeed! ### Example ![](https://i.imgur.com/O2shwJM.png) ![](https://i.imgur.com/zRNDKPn.png) ## CH6 ![](https://i.imgur.com/JSqodUF.png) ![](https://i.imgur.com/9bWZ1kf.png) ![](https://i.imgur.com/yEs1GcE.png) ![](https://i.imgur.com/hCIySlm.png) ![](https://i.imgur.com/CSNqF9Y.png) 到p51