``` 112考資工研究所時寫的筆記,有錯歡迎留言指正 內文有引用他人筆記/教材/文章的地方 若有侵權十分抱歉,告知後將立刻撤除 ``` # CH1 Instruction :::info ### 縮寫區 - MIPS: Microprocessor without Interlocked Pipe Stages - RISC: reduced instruction set compute - CISC: complex instruction set compute - ARM: Acorn RISC Machine - ISA: instruction set architecture - PC: program counter - GPR: general purpose register - 電腦五大單元 1. 輸入單元(Input Unit)簡稱IU 2. 輸出單元(Output Unit)簡稱OU 3. 控制單元(Control Unit)簡稱CU 4. 記憶單元(Memory Unit)簡稱MU 5. 算術邏輯單元(Arithmetic Logic Unit)簡稱ALU - AM: arithmetic mean - GM: geometric mean - PLA: Programmable Logic Array - BHT: branch history table - BTB: branch target buffer - PTE: page table entry - TLB: translation lookaside buffer - MMU: Memory management unit - CLA: carry lookahead adder - MTTR: Mean time to repair - MTTF: Mean time to failure - MTBF: Mean time between failure (MTTF + MTTR) - SSD: Solid state drive - DRAM: Dynamic random access memory - SRAM: Static random access memory - DMA: Direct memory access - RAID: Redundant Array of Independent Disks - UMA: Uniform memory access multiprocesors - NUMA: nonuniform memory access multiprocesors - VLIW: very long instruction word - GPGPU: general purpose graphic process unit - SMP: Shared-memory Multiprocessor - MPP: Message passing Multiprocessor - SISD: single instruction single data - SIMD: single instruction multiple data - MISD: multiple instruction single data (not exist) - MIMD: multiple instruction multiple data ::: - word字組: CPU一次可以處裡的資料量(可從reg, ALU的設計看出) - RISC是32bit word - Endian - big endian: (address遞增) 由most significand開始擺 - MIPS - little endian: (address遞增) 由least significand開始擺 - RISC-V - JVM (Java virtual machine) - 執行時才把method連(links)至Java library - JIT (Just in time compiler) - 執行時才把bytecode翻成native machine code - register in MIPs - $fp: frame pointer - $sp: stack pointer - $gp: global pointer - 很重要的兩張圖 - ![](https://hackmd.io/_uploads/rk0taha43.png) - 上圖為process在virtual memory中的樣子 - 最初sp和fp都重疊在stack底部(以圖來看為最上方) - 隨著process開始運作,sp會被往low address的方向(以圖來看是往下)推進 - ![](https://hackmd.io/_uploads/H1L962TN2.png) - 上圖勘誤: SP應為stack pointer,非step pointer! - 上圖為stack區段被放大來看的樣子,procedure call仍在同個process內 - 高階語言轉譯步驟 - compile 編譯器: 說明略 - assemble 組譯器: 說明略 - linker 連結器: 連結模組和library,解決address reference問題 - loader 載入器: 把machine code載入適當記憶體位置 ## 指令 instruction - 右移 - shift right arithmetic: 左邊做signed-extension - shift right logical: 左邊補0 - 左移都一樣,右邊補0 - MIPS沒有not指令,與0 NOR即可達成 (或與自己NOR即可達成) - NOR $t0 $t0 0 (= NOR $t0 $t0 $t0 ) - arithmetic - add r1 r2 r3 - r1 $\leftarrow$ r2 + r3 - sub r1 r2 r3 - r1 $\leftarrow$ r2 - r3 - data transfer: - lw reg offset(base reg) - reg $\leftarrow$ Memory [ offset+base reg ] - sw reg offset(base reg) - reg $\rightarrow$ Memory [ offset+base reg ] - 其他補充 - signd會做sign extension, unsigned則不會做 - lh (load half word signed) - lhu (load half word unsigned) - lb (load byte signed) - lbu (load byte unsigned) - flow control - beq r1, r2, Label - if(r1==r2) PC $\leftarrow$ Label - bne r1 r2, Label - if(r1!=r2) PC $\leftarrow$ Label - slt r1 r2 r3 - r1 $\leftarrow$ 1 if(r2<r3) else 0 - jr r1 - PC $\leftarrow$ r1 - MIPS沒有blt, bgt, ble, bge - 用beq, bne, slt即可做出虛擬指令(pseudoinstruction) - logical - sll (shift left logically), srl (shift right logically) - 只寫範例 - sll r1 r2 5 - srl r1 r2 5 - and r1 r2 r3 - or r1 r2 r3 - nor r1 r2 r3 - 沒有not !!! nor即可取代 - ==A nor 0== = not(A or 0) = ==not(A)== - 常數(應該歸類在arithmetic?) - lui (load upper immediate): - 載入16bit的常數載入32bit reg的左半部,右半部填0 - 其他只寫範例 - addi r1 r2 4 - slti r1 r2 4 - andi r1 r2 4 - 補充: basic block (very important) - The code in a basic block has: - One entry point, meaning that no code within it is the destination of a jump instruction anywhere in the program. - One exit point, meaning that only the last instruction can cause the program to begin executing code in a different basic block. ## 指令格式 MIPS instruction format 以下可至[MIPS Green Card](https://inst.eecs.berkeley.edu/~cs61c/resources/MIPS_Green_Sheet.pdf)對照 - R type - add, sub, and, ==srl ,sll, jr, jalr (109交大計系)==... - 以add rd rs rt為例 | op | rs | rt | rd | shmat | funct | | --- | --- | --- | --- | ----- | ----- | | 6 | 5 | 5 | 5 | 5 | 6 | - I type - lw, sw, beq, bne, addi - 以lw rt 32(rs)為例 | op | rs | rt | addr/imm | | --- | --- | --- | -------- | | 6 | 5 | 5 | 16 | - addr以==branch下一個指令的位置當基準(0)==,往上為負,往下為正 - J type - jal, j - | op | addr | | --- | ---- | | 6 | 26 | - ==addr欄位26bits,以扣除最右邊2 bit(都是00),和最左邊4bits (同process一樣)== ## 程序呼叫 Procedure call - 步驟 1. 存參數 2. 跳去 3. 要空間,存前人data 4. 執行 5. 存回傳值 6. 還前人data,還空間 7. 跳回 - 12在caller做 - 34567在callee做 - 3,6順序成鏡向關係 - 這類題目爆炸難,自己去練課本p61, 62, 63 # CH2 Arithmetic Logic unit (ALU) - 考古題一定要寫13, 18, 49, 67 - 好難筆記,少寫很多東東,從課本p.150開始看就好 - overflow - MIPS只有在add, addi, sub等指令overflow時產生exception - addu, addiu, subu等unsign運算溢位也不管他 (programmer要負責) - 檢查是否overflow: 看最後兩個carryout,不同就是overflow - xor起來是1就是overflow ## 1 bit full adder - 以下可從truth table推出,細節見課本p150 - Cin: carry in - Cout: carry out - CarryOut = $ab + aCin + bCin$ 有2 gate delay - Sum = $a\overline{b} \overline{Cin}+\overline{a}b \overline{Cin}+\overline{a}\overline{b} Cin+abCin$ 有3 gate delay (多了not吧我猜) ## 64 bits ALU - [王廷基的影片講的很好](https://drive.google.com/drive/folders/11ofRhppqqWbPml2EGiomIeYkmq09Y2BM) - overflow detection: 把最後兩個1 bit ALU的carry拉出來xor - ![](https://hackmd.io/_uploads/BkMja2a4h.png) - set less than 的地方比較複雜,其他都可以自己看 - 1-62bit一樣,無腦輸出0就好(bit 63也output 0但有東西需計算) - 方法是讓A-B,此時sign bit (bit 63)經過些判斷就能當作slt餵回bit 0使用 - ![](https://hackmd.io/_uploads/Bkus6hpVn.png) - ![](https://hackmd.io/_uploads/Bk13anpN3.png) - ![](https://hackmd.io/_uploads/BkS3Th6V3.png) ## Ripple carry header - N個1 bit full adder(p.150)串在一起 - critical path delay想當然爾,超級長 - gate delay - N bit的adder中 - critical path delay (最後一個Cout算出來): 2N - sum delay (最後一個sum算出來): 2N+1 ## carry lookahead adder (CLA) - C~out~ = ab + (a+b)C~in~ - 上面改寫成C~i+1~ = a~i~b~i~+(a~i~+b~i~)C~i~ - 以上是推導,不用背 - 定義generate, propagate - generate: g~i~ = a~i~*b~i~ - propagate: p~i~ = a~i~+b~i~ - C~i+1~ = g~i~ + p~i~c~i~ ---> 背! ### 4 bits CLA - 課本p156-158 - 課本p158圖中的add block,細節如下 - ![](https://hackmd.io/_uploads/S1gs2pnpE2.png) - 但我猜真正的硬體是1 bit full adder(p151右下)跟這個的合體 - carry - c1=g0+p0c0 - c2=g1+(p1g0)+(p1p0c0) - c3=g2+(p2g1)+(p2p1g0)+(p2p1p0c0) - c3=g3+(p3g2)+(p3p2g1)+(p3p2p1g0)+(p3p2p1p0c0) - c2, c3, c4由Ci+1 = gi + piCi展開可得,不用背(但最好練習一下展開) ### 16 bits CLA - 課本p159 - Multiple Level Carry Lookahead (上課講義說的) - Propagate(此處Pi為大寫) - P0=p3p2p1p0 - P1=p7p6p5p4 - P2=p11p10p9p8 - P3=p15p14p13p12 - Generate (此處Gi為大寫) - G0=g3+(p3g2)+(p3p2g1)+(p3p2p1g0) - G1=g7+(p7g6)+(p7p6g5)+(p7p6p5g4) - G2=g11+(p11g10)+(p11p10g9)+(p11p10p9g8) - G3=g15+(p15g14)+(p15p14g13)+(p15p14p13g12) - Carry (此處Ci為大寫) - C1=G0+P0C0 - C2=G1+(P1G0)+(P1P0C0) - C3=G2+(P2G1)+(P2P1G0)+(P2P1P0C0) - C3=G3+(P3G2)+(P3P2G1)+(P3P2P1G0)+(P3P2P1P0C0) - p150-159的圖,看熟,CLA甚至要背!!! - gate delay的計算看下表 (ex: 2-level的16 bits adder,carry delay=5, sum delay=10) - ![](https://hackmd.io/_uploads/rky1R2T4h.png) - 所有筆記的東西我現在理解了才背,希望之後回來還記得... ### gate delay總結 - [幫助理解](https://www.ece.lsu.edu/ee3755/2013f/cla.pdf) - 小背一下 (參考清大105計系) - 16 bits ripple carry adder c delay: 2*16=32 - 16 bits 1 level cascade CLA c delay: 3+2+2+2=9 - 16 bits 2 level CLA c delay: 2+3=5 ## Multiplication && Division (共5個,超重要) - 乘法器除法器(不管有號無號)都是sequential circuit ### traditional multiplication (Unsigned) - 課本p165,務必練題目(畫運算表格) - ![](https://hackmd.io/_uploads/H1dJAnaVn.png) - ![](https://hackmd.io/_uploads/SybeC36V2.png) - ![](https://hackmd.io/_uploads/B1PlC2a4n.png) - 在32*32的乘法運算中 - multiplier: 右移,32bits - multiplicand: 左移,64bits - product: 固定,64bits ### hardware friendly (Robertson) multiplication (Unsigned) - 課本p167,務必練題目(畫運算表格) - ![](https://hackmd.io/_uploads/Sk6g0nTNh.png) - ![](https://hackmd.io/_uploads/S17-Rn6N3.png) - 在32*32的乘法運算中 - multiplier: 放入product右半邊,32bits - multiplicand: 固定,32bits - product: 右移,64bits ### Booth's algorithm (signed) - 直接從表格理解,說明就免看了 - 計算推導過程可以背下(p.172, 173) - 要注意product右移時要signed extend !!! ### Traditional Division (unsigned) - ![](https://hackmd.io/_uploads/SyabR3p43.png) - ![](https://hackmd.io/_uploads/BJNf0hTEh.png) - 稍微複雜點,要作過才會記得 - 餘數要sign extension ### Hardware-friendly division (unsigned) - ![](https://hackmd.io/_uploads/r1dGA3pE3.png) - ![](https://hackmd.io/_uploads/B1y70haV2.png) - 餘數要sign extension - remainder只有左半邊和divisor相減 !!!(p.182) - 乘法器與除法器共用硬體 ![](https://hackmd.io/_uploads/HkSQ02pNn.png) ## IEEE 754浮點數 ### single precision - range: $1.1...1 \times 2^{127}\ to\ 1.0 \times 2^{-126}$和負數同範圍 - 非正規化 單精度可達1.0*2^-149^ | sign | exponent | fraction(mantissa) | | ---- | -------- | ------------------ | | 1 | 8 | 23 | ### double precision | sign | exponent | fraction(mantissa) | | ---- | -------- | ------------------ | | 1 | 11 | 52 | ### 編碼規則 | exp | frac | representation | | ----- | -------- | --------------------- | | 1-154 | anything | +-floating number | | 0 | 0 | +-0 | | 0 | nonzero | +-Denormalized number | | 255 | 0 | +-inf | | 255 | nonzero | NaN | # CH3 Computer performance ### Exe Time = IC $\times$ CPI $\times$ CCT - IC=instruction count - CPI=cycle per instruction - CCT=clock cycle time ### MIPS=$\frac{IC}{Exe Time \times 10^6}=\frac{rate}{CPI \times 10^6}$ ### Amdahl's law - speedup=$\displaystyle \frac{1}{\frac{F}{S} + (1-F)}$ - F: 被改善的比例,S: 改善倍率 # CH4 CPU datapath & control ![](https://hackmd.io/_uploads/rknX026Vn.png) - 控制單元truth table中 - 只有三個信號可能出現don't care (上圖有接control訊號的Mux) - RegDst - ALUSrc - MemtoReg - j type會用到三個cycle,雖然target address讀到指令就知道了 (考古1) - IF - ID - compose target addr - 對應其他指令cycle,這邊是EXE - 把26bits的target addr接上左4bits、右2bits才是完整的32bits addr - longest path計算ALU前面的Mux(ALUSrc)的時機(p388練習) - Rtype: 需要算,因為要等reg file跑完才能過 - $I$type: 不用算,因為sign ext後可以先進mux,此時reg file還在跑 # CH5 Pipeline - ### Hazard - Structural Hazard - 硬體資源不足,以致於無法pipeline (ex. IF和ME都用同個memory) - Data Hazard - Data dependency (最常見的那種,ex. load use) - Control Hazard - MEM階段才得到branch結果的話,可能已經多做三個指令了 - ### Delay branch - From before: delay slot中插入不管有沒有branch都會執行的指令 - From target: 類似predict taken,插入branch後的指令 - From fall through: 類似predict not taken,插入不成立會用的指令 - 以上都需用A==B的條件說明 - 此技術已經很少被使用了Dynamic prediction比較好 - ### NOPs vs stall (p448練習,成大考古) - NOPs: 軟體hazard解法,由compiler插入,是assembly code的一部份 - stall: 硬體hazard解法,由硬體偵測並stall by processor,所有hazard都能用他解決 - ### Bubble(stall)的畫法 - 寫法應該像下面那樣,才能正確標示出bubble所在的cycle,要養成習慣 - ![](https://hackmd.io/_uploads/S17hy66N3.jpg) - 上面那種畫法是NOPs,塞入編譯好的空指令才能這樣表示 - ### Dynamic branch prediction - branch prediction buffer = branch history table (BHT) - 就是那些預測用的state - branch target buffer (BTB) - 如果branch成立,去這邊(cache)撈地址,省一個計算address的cycle # CH6 Memory ## 三種記憶體 (p.7) - SRAM: cache,比較快,儲存在反向閘之間,體積較大 - DRAM: main memory,比較慢,以電荷存在電容中,體積較小 - Magnetic disk: 容量大,最便宜,最慢 ## 快取寫入 (p.25) - Inconsistent - 資料在快取和主記憶體不一致 - ### write back - hit - 只寫到cache上,用dirty bit管理 - miss - write allocate: 去cache上要空間,寫上去 - ### write through - hit - 寫到cache和memory上,很浪費時間 - solution: write buffer - 處理器先寫到buffer,之後buffer再寫到memory - miss - Write around: 只更新資料在Memory中的值 - write allocate: 去cache上要空間,寫上去 - [ref](https://www.jyt0532.com/2018/09/23/cache-mechanism/) ## Cache - byte offset - 用來存取一個 word 裡面的某個 byte,所以所需的 bit 數會是可以代表一個 word 裡面可以存的 byte 數,通常 1 word = 4 byte 所以這個部分通常為 2 bit - block offset - 用來存取 cache line 中的某個 word,所以由 cache line size(cache block size) 和 byte per word 來決定,舉例 64 B 的 cache block size,4 byte per word,那 block offset 就是 4 bit (64/4 = 2^4^) - [referencet](https://ithelp.ithome.com.tw/articles/10270524) - ==offset = block offset + byte offset== - 如果題目提供Word address,代表該數字已經扣除byte offset (p44練習, p127考古34) - ex. address 134~10~=10000110~2~ - In 3-way set associative cache with 2-word blocks, total size=24 words - 他的tag=10000, idx=11, block offset=0 ## Virtual memory - ### 把主記憶體當作磁碟的Cache ::: info - ### 兩大目的 - 供多個程式安全的共享記憶體 - 消除主記憶體空間不足的限制 ::: - 是技術,不是memory - 用virtual memory cache可降低hit time,因為不需要轉換至phy-addr的多餘過程 - 每個process都有自己的page table register - 紀錄page table起點 - ### 如何限制過大的page table size (p67) - Limit register - page table大小可調整 - 兩張分頁表 - 分別位於stack(上往下)和heap(下往上)中,大小可變,MIPS採用之 - 多層page table - OS提過,缺點是cache miss penalty變大 - page table也被分頁 - 可能會無限page fault - 3C miss: compulsory miss, capacity miss, conflict miss - capacity miss是原本在cache中的block被擠出去後,再被存取卻找不到人的miss。如果該組tag/idx原本就不再cache中,算是complusory miss (p155,74題) - context switch需清除TLB,不然會抓到錯的page,因此context switch後page fault高 - 務必寫過: p 20, 21, 68, 69, 72, 73 - 有打星星的考古都很重要一定要寫到背起來 ### Cache vs Virtual memory | | 術語 | | mapping | write | replace | | -------- | --------- | ---------- | ---------------- | ------------------ | ----------- | | Cache | block/line | miss | fully,set,direct | write through/back | random/ LRU | | virtual mem | page | page fault | fully | write through | LRU | ### virtual address cache(p78) - 之前Cache都是用 phy idx / phy tag - 需要先用TLB轉換至phy addr才能去cache找資料,效率差 - 如果是 vir idx / vir tag 呢 - 效率明顯變好,critical path上移除TLB - 問題: 可能有aliasing(別名)問題,兩個vir addr可能對應到同個phy addr - 造成data inconsistent (同步問題) - 解法: vir idx / phy tag,平行處理TLB和Cache - 拿TLB得到的phy tag進cache找東西 ### TLB - [TLB miss的處理](https://unix.stackexchange.com/questions/386591/in-linux-is-the-kernel-who-handle-all-the-tlb-miss) - x86: Hardware only - MIPS: OS kernel mode # CH7 儲存裝置和其他周邊設備 - Disk Access Time - Seek time + Control time + Rotation time + Wait time + Data Transfer time - 口訣: SCRWT - Rotation time = $\displaystyle \frac{0.5}{RPS}$ - RPS: rotation per second - 小心單位轉換(s/ms, RPM/RPS) - ### MTTF, MTTR - 衡量availability的一種標準 - MTTR: Mean time to repair - MTTF: Mean time to failure - MTBF: Mean time between failure (MTTF + MTTR) - availability = $\displaystyle \frac{MTTF}{MTTF+MTTR}$ - [Mega大: CA ch7 Disk筆記](https://drive.google.com/drive/folders/1bfU-MfVsvKu_1QVsWoR_-XxJ4nRL6X5J) - RAID - RAID 0+1的本質是RAID 0 - RAID 1+0的本質是RAID 1 - RAID 3 不利於small data write - ### IO data transfer & processor - Programmed IO = polling - Interrupt - DMA - Direct memory access (DMA) is the mechanism that provides device controller with the ability to transfer data directly from or to memory without the involving the processor. - 會作[cycle stealing ( Interleaving )](https://www.dcard.tw/f/softwareengineer/p/239481598) - CPU utilization: - DMA > Interrupt > polling - Latency: - polling < DMA < Interrupt - Polling / Interrupt / DMA 比較 | | Polling | Interrupt | DMA | | ---- | ----------------- | ---------------------------- | ------------------------- | | 優點 | simple | processor不用一直看IO device | 用cycle stealing 更省時間 | | 缺點 | 浪費processor時間 | 比polling複雜 | 需要硬體支援 | - Blocking IO & non-blocking IO - 參考: 2022/11/19 20:09 TG群 - ![](https://hackmd.io/_uploads/ryMYkaaVh.jpg) - ![](https://hackmd.io/_uploads/B1v51apNn.png) - 簡單這樣記 - Blocking IO = Synchronous IO 會進waiting state - Non-blocking IO = polling 留在running state,process困在polling - Asynchronous IO 留在running state,可做IO外的事,等inturrupt - interrupt可能屬於blocking 或 non blocking,參考下文ref - [ref](https://medium.com/@clu1022/%E6%B7%BA%E8%AB%87i-o-model-32da09c619e6) # CH8 多重處理器 - 兩種多重處理器 - 記憶體共享(重點3): shared-memory multiprocessor (SMP) - 傳遞訊息(重點4): message passing multiprocessor (MPP) - 兩種多重處裡器,作加法時的演算法不同 (看課本p.311, 316) - code不好背,看圖記規則 - ### 重點5: 多處裡器的快取一致性 - snooping (p318) - cache coherence protocol,分以下兩種 - write-update - 又稱write-broadcast - 用匯流排把所有對應該data的cache都更新 - write-invalid - 先把別人快取copy變invalid,在把自己變invalid - ### 重點6: 單一晶片multithreading - find-grained multithreading - 隱藏short/long stalls - coarse-grained multithreading - 只能藏住long stalls - simultaneous multithreading (SMT) - with "Instruction level parallelism (ILP)"和"thread level parallelism (TLP)" - 用danamically scheduled processor達成 - ### 重點7: 平行電腦分類 - Flynn's taxonomy - SISD: single instruction single data (uniprocessor, scalar processor) - SIMD: single instruction multiple data - MISD: multiple instruction single data (not exist) - MIMD: multiple instruction multiple data - Processors - [vector processor](https://zh.wikipedia.org/wiki/%E5%90%91%E9%87%8F%E5%A4%84%E7%90%86%E5%99%A8) (array processor) - 能高效且有效地處理稱為向量的大型1d array (ex. GPU) - [scalar processor](https://en.wikipedia.org/wiki/Scalar_processor) - 一次處裡一個data,和vector processor相反 - [superscaler processor](https://en.wikipedia.org/wiki/Superscalar_processor) - 在一個processor中實現指令平行化 - 比較與差異 - ==SIMD== v.s. ==vector processor== - SIMD處裡的vector長度不可變 - vector processor處裡的vector長度可變 - ==superscaler processor== v.s. ==pipeline== - superscaler processor: 平行執行多個指令用多個執行unit - pipeline: 平行執行多個指令用一個執行unit - superscaler + pipeline: ![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Superscalarpipeline.svg/330px-Superscalarpipeline.svg.png) - parallelism - ILP - Instruction-level parallelism is a measure of how many of the instructions in a computer program can be executed simultaneously. - DLP - Data-level parallelism is the parallelism achieved by operating on independent data. - MLP - Memory-level parallelism refers to the ability to have pending multiple memory operations, in particular cache misses or translation lookaside buffer (TLB) misses, at the same time. - TLP - Thread-level parallelism is the parallelism inherent in an application that runs multiple threads at once. - 以上四種平行只是概念,實作可用硬體或軟體達成 (108台大計系) - ![](https://hackmd.io/_uploads/H1KayaTV2.png) ## 圖要背起來 ### 上冊 - 16 bit 2 level carry lookahead adder (p.159) - ![](https://hackmd.io/_uploads/rkJmla6Vh.jpg) ### 下冊 - MIPS datapath - ![](https://hackmd.io/_uploads/S1cmx6aNn.png) - Direct mapped with block offset (P.17) - Utilize spatial locality - ![](https://hackmd.io/_uploads/BkGVgTp4h.png) - 4-way set associative (P.40) - ![](https://hackmd.io/_uploads/HkYVgT6Vn.png) - Virtual address getting data (P.76) - 先去TLB找phy addr, 假設找到,再去cache挖資料 - ![](https://hackmd.io/_uploads/SkbLgTTE2.jpg)