# Computer Architecture Notes ###### tags: `sysprog` [黃婷婷教授的計算機組織課程錄影](https://www.youtube.com/playlist?list=PL67lj4m-vhchqUiwXXoz0PH5j18gvupW_) ## Outline - Computer Architecture = Instruction Set Architecture + Machine Organization ( 比喻 : 建築設計師 and 營造商 ) , 有個 trade-off - Instruction Set Architecture - 可看成是 Software 和 hardware 的 interface - MIPS - instruction 只有三種 format - 都是32 bits - 評估效能 - response time - CPU time low level => number of clock cycle * clock period(clock frequency 倒數) instruction level => CPI (clock per instruction) 一個 instruction 需要多少 cycle CPU time = instruction/Program * CPI * Seconds/Clock cycle(clock period) - throughput : 單位時間內可以做多少事情 - Instruction Set Architecture(A) - CPU time - CPU time = instruction/Program * CPI * Seconds/Clock cycle(clock period) - multi-core 如果一行程式都沒變的話就是增加 throughput, 沒增加 response time . - 如果想增加 response time, 則 multi programming - Benchmarks - speed of processor depends on code used to test it. Need industry standards so that different processors can be fairly compared -> benchmark programs - 不同的 work load 所需要的 benchmark不同 - power consumption - static power consumption : 漏電 (目前漏電比例仍大) - dynamic power consumption ## Instruction Set Architecture(B) - Computer Architecture = Instruction Set Architecture + Machine Organization - Operands - Register operands - Memory operands - Immediate operands - arithmetic/logic instructions : - add $0, $1, $2 -> operation destination , source1 , source2 - assembly don't use variables => registers - 使用 register 的好處 : - faster than memory - easier for a compiler to use - Registers can hold variables to reduce memory traffic and improve code density - ( register named with fewer bits than memory location, 用多少bit來描述哪個register or memory ) - 32 registers , each is 32 bits wide ## Pipeline ![圖片](https://hackpad-attachments.imgix.net/hackpad.com_W3x4n3Ht2oA_p.219987_1426006191102_undefined?fit=max&w=882) - pipeline 的五個stage : - instruction fetch -> register access (instruction decode) -> ALU -> memory access -> write back ( IF -> ID -> EX -> MEM -> WB ) - throughput 增加 , response time 不會 - 和 single cycle 比較 , 同樣是五個stage , 有加了 pipeline register - datapath : - 切的時候盡量 balance. clock cycle 會設為消耗時間最長的那個 stage, 才能讓全部的 state 往下一個前進 - control : - "control signal" decode 之後就繼續往下傳 , 存在 pipeline register - Data hazards - 假設 z instruction 在 stage 2 的時候要 R5 這個 register 資料, 但 y instruction 還在 stage 3, 要寫資料進 R5 需要在 stage 5, 所以這時候就產生 data hazards. - Control hazards - z instruction 在 stage 3 or 4 要 branch , stage 1 2 已經有 instruction 進來了 , 這時候就要把之前的 flush 掉. - Forwarding (R-type and R-type) - Stalls (Load and R-type) ### Pipeline Hazards / Forwarding - **Structure hazards** : 不同的 instruction 要競爭同一個 resources - 解決 : resource 都在同一個 stage 使用. 不論是哪種 instruction type , 都要分五種 stage. - **Data hazards** : attempt to use item before ready. (data dependency) - 如果是 internal forwarding 就可解決 - Type : - RAW (read after write) : I2 tries to read operand before I1 write it. - WAR (write after read) : I2 tries to write operand before I1 read it. (pipeline 不會發生) - WAW (write after write) : I2 tries to write operand before I1 write it. (pipeline 不會發生) - **Control hazards** : attempt to make decision before condition is evaluated. - **waiting** 永遠可以解決 hazards - 每一個 stage 要用的 resources 要不一樣 !!! - 解決 Data hazard : - internal forwarding : clock trigger 的時候就寫入 register(first half of clock) , read in second half ![](https://hackpad-attachments.imgix.net/hackpad.com_W3x4n3Ht2oA_p.219987_1426226769130_undefined?fit=max&w=882) - Compiler inserts NOP (software solution) - Forwarding (hardware solution) - Stall (hardware solution) - another way to fix Forwarding (Datapath with forwarding) : ![](https://hackpad-attachments.imgix.net/hackpad.com_W3x4n3Ht2oA_p.219987_1426226558749_undefined?fit=max&w=882) - DataPath : MUX 是一個 three to one, 一個是從正常的 register file, 另外兩個是 forwarding 回來的 data. - "Forwarding unit" 要發出 signal 來控制 MUX 的 input. - normal input -> Forwarding = 00 - EX hazard -> Forwarding = 01 - MEM hazard -> Forwarding = 10 - Don't forward if instruction does not write register => check if RegWrite is asserted. - Don't forward if destination register is $0 ### Stalling - Can't Always Forward !!!! - Load(`lw`) and R-type (Stalling) : 還沒從 memory 拿到資料 , 要如何 forward? - **Stall** pipeline by keeping instructions in same stage and inserting an NOP instead. (停一個 clock cycle) - hazard detection controls , PC , IF/ID , control signals ### 小結 : Data Hazard 分兩種 : 1. 第一個 instruction 和 第二個 instruction 的關係是 R-type and R-type : -> 永遠可以 forwarding 2. 第一個 instruction 和 第二個 instruction 的關係是 load and R-type : -> stall a clock cycle , can not forward. ### Handling Branch Hazard (Control hazard) - **Predict branch** always not taken - need to add hardware for flushing instruction , if wrong. - **flush** 掉前面的 data 就是把 control signal 都設為0 - Dynamic Branch Prediction - e.g. loop - 如果切越深的 pipeline, branch 越早決定越好, flush 掉的 instruction 越少 - Branch prediction buffer (i.e., branch history table) 紀錄上次跳躍的狀況, associate memory to implements. - 在第二個 stage 做 - Indexed by recent branch instruction addresses. - Store outcome. -> taken(branch) or not taken. - 1-Bit Predictor : Shortcoming - 2-Bit Predictor (Finite State Machine) - Branch target buffer - Cache of target addresses - **Delayed Branch** - instruction 再 decode 的時候就要決定/猜是要跳 (taken) 或不跳 (not taken -> pc=pc+4 or target address), 但其實只要差一個 cycle 就可以知道是哪一個了. 找到一個 instruction 是兩個都執行, 等到下一個 cycle 就知道是哪個了 - the following instruction is always executed. ### Exception & Interrupt - exception : Arises within the CPU - e.g., undefined opcode, overflow, syscall... - interrupt : from an external I/O controller - Handling Exceptions - Save PC of offending(or interrupted) instruction - In MIPS : 存在 Exception Program Counter (EPC) - Single address - Vectored Interrupts - 看發生的原因, 然後查表後 直接跳去相對應的 handler 處理 - Another form of control hazard (flush instruction) - Simple approach : 從最早的 instruction 開始處理 - 發生 exception 的指令 之前的指令還是要處理完, 後面的則 flush 掉 - In complex pipelines : - Out of order completion (data has no dependency) - imprecise Exceptions - 不知道誰先誰後, 則丟給 OS 來處理 ### Superscalar and dynamic pipelining - pipeline 就是一種 instruction level parallelism (ILP) - 基本是同時執行五個 instruction - To increase ILP - Deeper pipeline( 深度 ) => shorter clock cycle. 因為每個 stage 要做的事情變少了 - Multiple issue( 廣度 ) => Duplicate pipeline stages -> multiple pipelines - CPI < 1 , so use instruction per cycle (IPC) - Static multiple issue : - Compiler groups instruction to be issued together. - Packages them to "issue slot" - Compiler detects and avoids hazard - Speculation (Guess) : 因為要平行化, 所以要猜. (like branch) - speculate on branch outcome : - Roll back if path taken different - Speculate on load : - 正常的情況下, store 接 load 的 instruction, 通常 load 的 address 會和 store 的 address 不同. 所以 load 可以先作. load 的資料要準備作計算, 較急需. store 只要存資料 write back, 所以可以慢慢 store. - Roll back if instruction updated. - 小結 : - Static multiple issue : compiler approach - software 在 run time 之前就把可以平行處理的 instruction 給 pack在一起, 且當成是一個single instruction. => VLIW (Very long instruction word) - 兩個 instruction packets (一次抓 64 bits) : - one ALU / branch instruction - one load / store instruction - Scheduling - Loop Unrolling : - Replicate loop body to expose more parallelism - Use different registers per replication - Called "register renaming" -> anti-dependency - Dynamic multiple issue : hardware approach - Scheduling (superscalar): - Allow the CPU to execute instructions out of order to avoid stalls. - Speculation - Predict branch - Load speculation - Why Do Dynamic Scheduling ? 為何不要 software approach 就好? - Not all stalls are predictable - e.g., cache misses - Can not always schedule around branches - Different implementations of an ISA have different latencies and hazards. ## Memory - Random access : - Access time same for all locations, 和 magnetic disk不一樣 - SRAM : - use for caches - 如果電量一直維持的話, 內容會持續 - DRAM : - use for main memory - 需要 refreshed regularly - memory as a 2D matrix ( read row to activate and read column to select ) - access SRAM 比 DRAM 快幾乎100倍 - 想要快, 又想要容量大 -> solution : Memory Hierarchy ### Memory Hierarchy - Block : basic unit of information transfer. (再 virtual memory 叫 page) - Why memory hierarchy works ? - Principle of Locality : - **90/10 rule** : 10% of code executed 90% of time. ( like loop ) - Temporary Locality : item is referenced, 它還有可能會在被 referenced again (loop) - Spatial Locality : array , program(pc=pc+4) , sequential 的 - (upper level) registers -> caches -> memory -> disk (lower level) - Memory Hierarchy : Terminology (術語 專有名詞) - Hit : data appears in upper level. - Hit rate, Hit time - Miss : data need to be retrieved from a block in the lower level. - Miss rate, Miss Penalty - Questions for Hierarchy Design : - block placement, block finding, block replacement, write strategy ### Cache Memory - basics of cache : direct-mapped cache - Address mapping : modulo (取餘數) number of blocks - 以 binary 來看. ex: 10100 如果 cache 只有八個 blocks 的話, 那就看後面三個 bits 就知道了! (10100 -> 100) ![](https://hackpad-attachments.imgix.net/hackpad.com_W3x4n3Ht2oA_p.219987_1427248522536_undefined?fit=max&w=882) ### Tags and Valid Bits : Block Finding - 下面四個 block 都 map 到上面的 upper level, 那我們從 cache 中要怎麼知道是哪個 block map 過來的? - 除了存 data 外, 還要存部分地址 (high-order bits -> called tag) - What if there is no data in a location ? - valid bit 1 = present. - cache 架構會不一樣, block size 也會不一樣 , 所以 tag 是多少個 bits 也不一樣 -> 要計算 - Ex : 64 blocks, 16 bytes/block (offset) - What block number does address 1200 map? - block address => 1200/16 = 75 - block number => 75 modulo 64 = 11 - 1200 = 10010110000(2 進位) / 10000 => 1001011 - 1001011 取後面 6 bits 的 index => 001011 (block number, 表示在 cache 中的位置) | Tag | Index | Offset | | ------- | ------ | ------ | |31 ~ 10|9 ~ 4|3 ~ 0| | 22 bits | 6 bits | 4 bits | ### Block Size Considerations - Larger blocks should reduce miss rate - due to spatial locality - 但是因為 cache size 是固定的, 所以 block number 就會減少, 這樣表示更多人要來搶這個 block. -> tradeoff ### Cache Read/Write Miss/Hit : - Read cache hit : CPU 正常執行 - Read cache miss : - Stall the CPU pipeline -> penalty !! - 從下一個 level 去抓資料 (block) - Write cache hit : - 因為 data 在 cache 和 memory 都有, 有一致性的問題 - Write through : also update memory - slow -> solution : write buffer. CPU continuous immediately. - 優點 : data 在 cache 和在 memory 永遠維持一致性 - 缺點 : 浪費 memory 的 bandwidth (traffic) - Write back : 當 block 要被 replace 掉的時候, 再write back 回 memory - use dirty bit. 有被改寫過的 dirty bit 就設 1 - Write cache miss : - 是否要 allocate to cache 呢? - 第一個方法 : write back - fetch the block. 把cache 內的 replace掉 - 第二個方法 : write through - 直接寫到 memory 內 , 不用搬進 cache 內 ### Memory Design to support cache : - how to increase memory bandwidth to reduce miss penalty? - 3 structure ( 不同的 memory organization )9 - 1. 一次一個 word - Access memory 的時間可分為 : access time and cycle time. - Access of DRAM : - address 拆成兩個部分: 一部份為 row index, 一部份為 column index ### Measuring and improving cache performance - 要計算 program 的 execution cycles ( 不管 miss or hits ) - 如果是 Miss 的話, 還要計算 stall cycles - 算法是 (memory access/Program) * Miss rate * Miss penalty - Average memory access time (AMAT) - Hit time + Miss rate * Miss penalty - Improve cache performance : - improve hits time - decrease miss ratio - decrease miss penalty - Direct mapped - 缺點 : 一個號碼只有一張椅子 - 改進 : Associative Caches - Direct mapped 和 fully associate 是兩個極端值, fully associate 是表示 lower level 的 address 可以任意 map 到 upper level內的 cache. - 所以折衷辦法是 set associate. ( 1 set associate == fully associate ) - Fully associative : - Comparator per entry (貴) : entry 有幾個 block 就要比較幾次 - 就不需要 index 了, 那表示 tag 長度變大 - n-way set associative : - Each set contains entries(blocks). ( 一個號碼內有多個椅子可以坐 ) - ![Example](http://cs.nyu.edu/~gottlieb/courses/2000s/2007-08-fall/arch/lectures/diagrams/cache-set-assoc.png) - 增加 associativity 會減少 miss rates - **Data Placement Policy** : - Direct mapped cache : - Each memory block mapped to one location - N-way set associative cache : - Each memory block has choice of N locations - fully associative cache : - Each memory block can be placed in ANY cache - Cache Block Replacement 有一些機制 : - Direct mapped : 沒選擇 - Set associative or fully associative : - Random - Least Recently Used (LRU) 最近最少用到的 : - use a pointer pointing at each block in turn - 我們拿到的是 byte address, 要取得他的 block address 就要先把後面的 byte offset 先不看才可以得到 block address. 假設現在一個 block 有兩個 word, 也就等於 8 bytes , 表示byte address 後面三個 bits 是不用看的. 有 block address 後再去用 index and tag 去找他在哪個 block. - Multiple Caches. ### Virtual Memory - Memory <--> Disk (之前談的是 Cache <--> Memory), 概念其實和之前的一樣 - 很多 program 之間 只要搬部份的 program 進 memory 內, 由 OS 控制 - Program share main memory : - 每一個人有自己的 private virtual address holding its frequently used code and data. - Upper level (physical memory) <--> Lower level (virtual memory) 1. address 的 high order 表示 frame number , low order 表示 offset - 要 access data 會先拿到 virtual address, 透過 page table 做了 address translation 後, 即可找到 physical address, 即可 access 到 data. - "block" -> "page" , "miss" -> "page fault" - Basic Issues : - Size of data blocks ? - 如何放置 (placement) policy to the page ? - 如何替換 (replacement) policy to the page ? ### Block Size & Memory Placement - miss penalty 太大了 : a page fault 會造成 millions of cycles to process - 所以 page size 一定要大 - 用 fully associative placement !!! - 每個 page 可能會對應到任何的 frame, 所以需要 page table(in memory) 來做對應 - ( 在memory 叫 frame, 在 disk 叫page ) - Address Translation : - 也是要先找 page number : address / page size(e.g., 4K => 2的12次方的 bytes), => 12 bits 不看, 剩下的 bits 才是他的 page number. - Page Table : - 存放 placement 的資訊. 放在 main memory 內 - Problem : - page table 太大 - How many memory references for each address translation? - 2 次. 一次查 page table , 一次 access memory - Page fault : - 表示 page 不在 main memory內. - Hardware must detect situation but it cannot remedy the situation. - software 解決. - The faulting process can be context switch ### Page Replacement and Writes: - fully associative -> 如果都滿了的話, 誰要出去?誰要被替換? -> 需要policy - To reduce page fault rate -> prefer least-recently used (LRU) replacement - Reference bit (use bit), 過了一段時間會刷新, 把全部的 bit 都設為零 - 一定是 Write back. - 需要 dirty bits ( 有被寫過的設為1 , 之後要 write back 回 disk ) ### Handling Huge page table : page table is too big - 有多大的 page 就要有多大的 page table 來做對應 Ex: Page table occupies storage : 32-bit VA, 4KB page, 4bytes/entry => 2的20次方 PTE(page table entry), 4MB table Solutions : - 用 bound registers 來限制 table size; add more if exceed. - 讓 pages grow in both directions - 2 tables -> 2 registers, one for hash , one for stack - hashing (modular) - multiple levels of page tables ### Translation Lookaside Buffer : - 某個 entry 一直用到 (locality), 就要一直做 address translation(查 page table), 那不如不要放在 main memory, 就放在 CPU 旁邊 -> TLB - Fast Translation Using a TLB(Translation Look-aside Buffer) - Hardware 做的事 - Fast cache of PTEs within the CPU - TLB : 其實就是常常用的 page table entry , 然後放在 CPU 旁邊. 就不用去 main memory 去 translate 它 - TLB hits - TLB miss - translation miss (在 page table 內的 entry 沒擺上來) - page fault (這個 page 根本不在我的 main memory 內) ### TLB and Cache - Translation 在 TLB, 得到一個 physical address 後 那個東西已經在 cache內了 , 就不用去 main memory 拿. -> 不用任何的 memory access - 另一個方法: 如果 cache 內是 virtual address 的話, 那 TLB 就不需要去做 translation, 直接去 cache 內找有沒有 hit. 省去做 translation 的時間. (Cache is virtually indexed and virtually tagged) - Problem : - 相同的 virtual address 會對應到不同的 physical addresses(consistency 問題) -> tag + process id - Synonym/alias problem : two different virtual addresses map to same physical address (需要 hardware 支援) - 所以基本上 cache 還是用 physical address -> Virtually Indexed but Physically Tagged (Overlapped Access). ![](https://hackpad-attachments.imgix.net/hackpad.com_W3x4n3Ht2oA_p.219987_1427997551753_undefined?fit=max&w=882) ### Memory Protection - 不同的 task 會 share 他們部份的 virtual address. 但不是任何的 user 都可以去 access 別人的東西. -> Supervisor mode 才可以去 access, 需要 system call (CPU from user to kernel) . ### A common framework for memory hierarchy - 每一個 level 都是用 physical locality. (常常可以用的就放在附近) 會有四個問題 : Block placement, Finding a block , Replacement on a miss, Write policy - Block Placement : 1. Direct mapped, n-way set associative, fully associative 2. 越高的 associativity 會減少miss rate , 但會增加複雜度, cost , access time. - Finding a Block : 1. Tag 的比對 : 1, n, Search all entries(# of entries), fully lookup table(0 -> for virtual memory) - Replacement : 1. Least recently used (LRU) , Random - Write Policy : 1. Write through , write back Sources of Misses : 1. Compulsory misses (一定會 miss 的, 像是 cold start misses ) 2. Capacity misses : size 有限, 容量不夠 (Due to finite cache size) 3. Conflict misses : 在 non-fully associative cache 會發生 ### Using a Finite State Machine to Control a Simple Cache - Coherence Defined Reads return most recently written value - 可以 centralize 的管也可以分開管 - snooping protocols : Each cache monitors bus reads/writes (一直去窺探 一直去竊聽) - Broadcasts an invalidate message on the bus. ### Pitfall - Byte address vs. word address - 看一個 word 是幾個 byte , 還要看 block size/page size, 才能決定 block number/page number. - In multiprocessor with shared L2 or L3 cache - More cores -> need to increase associativity