Computer Architecture Notes
Notes from Prof. 黃婷婷's Computer Organization lecture videos
Outline
-
Computer Architecture = Instruction Set Architecture + Machine Organization
( analogy : the architect vs. the construction company ) ; there is a trade-off between the two
-
Instruction Set Architecture
- can be seen as the interface between software and hardware
- MIPS
- instructions come in only three formats
- all are 32 bits wide
-
Performance Evaluation
- response time
- CPU time
low level => number of clock cycles * clock period (the inverse of the clock frequency)
instruction level => CPI (clock cycles per instruction) : how many cycles one instruction takes
CPU time = (instructions/program) * CPI * (seconds/clock cycle, i.e. the clock period)
- throughput : how much work can be done per unit of time
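The CPU-time formula above can be checked with a quick sketch; all numbers below are hypothetical, chosen only to exercise the formula:

```python
# A sketch of the CPU-time equation; all numbers are hypothetical.
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions * CPI * clock period (1 / frequency)."""
    clock_period = 1.0 / clock_rate_hz
    return instruction_count * cpi * clock_period

# Hypothetical program: 10^9 instructions, CPI of 2, on a 2 GHz clock.
print(cpu_time(10**9, 2.0, 2e9))  # about 1.0 second
```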
-
Instruction Set Architecture(A)
-
CPU time
- CPU time = (instructions/program) * CPI * (seconds/clock cycle, i.e. the clock period)
-
multi-core : if the program itself is unchanged, it only increases throughput ; it does not improve response time.
- to improve response time, the program itself must be parallelized (multi-threaded programming)
-
Benchmarks
- the speed of a processor depends on the code used to test it ; industry standards are needed so that different processors can be compared fairly -> benchmark programs
- different workloads call for different benchmarks
-
power consumption
- static power consumption : leakage (leakage is still a large fraction of the total today)
- dynamic power consumption
Instruction Set Architecture(B)
-
Computer Architecture = Instruction Set Architecture + Machine Organization
-
Operands
- Register operands
- Memory operands
- Immediate operands
-
arithmetic/logic instructions :
- add $t0, $t1, $t2 -> operation destination , source1 , source2 ( $0 is hardwired to zero in MIPS, so it would not be a useful destination )
-
assembly doesn't use variables => it uses registers
- advantages of registers :
- faster than memory
- easier for a compiler to use
- registers can hold variables to reduce memory traffic and improve code density
- ( a register is named with fewer bits than a memory location, i.e. how many bits are needed to name a register vs. a memory address )
-
32 registers , each is 32 bits wide
Pipeline
-
the five pipeline stages :
- instruction fetch -> register access (instruction decode) -> ALU -> memory access -> write back
( IF -> ID -> EX -> MEM -> WB )
-
throughput increases ; response time does not
-
compared with the single-cycle design : same five stages, but pipeline registers are added between them
-
datapath :
- split the work into stages as evenly (balanced) as possible ; the clock cycle is set by the slowest stage, so that every stage can advance to the next one each clock
-
control :
- the "control signals" are decoded once and then passed down the pipeline, stored in the pipeline registers
-
Data hazards
- suppose instruction z needs the data in register R5 at its stage 2, but instruction y, which writes R5, is still in stage 3 and will not write R5 until stage 5 : that is a data hazard.
-
Control hazards
- instruction z branches in stage 3 or 4, but instructions have already entered stages 1 and 2 ; those earlier fetches must be flushed.
-
Forwarding (R-type and R-type)
-
Stalls (Load and R-type)
Pipeline Hazards / Forwarding
-
Structural hazards : different instructions compete for the same resource
- solution : each resource is used in only one (fixed) stage.
Every instruction type goes through all five stages.
-
Data hazards : attempt to use item before ready. (data dependency)
- internal forwarding can resolve some cases
- Type :
- RAW (read after write) : I2 tries to read an operand before I1 writes it.
- WAR (write after read) : I2 tries to write an operand before I1 reads it. (cannot occur in this in-order pipeline)
- WAW (write after write) : I2 tries to write an operand before I1 writes it. (cannot occur in this in-order pipeline)
-
Control hazards : attempt to make decision before condition is evaluated.
-
waiting (stalling) can always resolve a hazard
-
the resources used in each stage must be distinct !!!
-
Resolving data hazards :
-
internal forwarding : write the register file when the clock triggers (first half of the clock) , read in the second half
-
Compiler inserts NOP (software solution)
-
Forwarding (hardware solution)
-
Stall (hardware solution)
-
Datapath with forwarding :
-
Datapath : the MUX is three-to-one ; one input comes from the normal register file, the other two are forwarded data.
-
the "forwarding unit" issues the signals that control the MUX inputs.
- normal input -> Forwarding = 00
- EX hazard -> Forwarding = 01
- MEM hazard -> Forwarding = 10
-
Don't forward if instruction does not write register => check if RegWrite is asserted.
-
Don't forward if destination register is $0
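The conditions above can be sketched as follows. This is a minimal illustration, not the actual control logic; the pipeline-register field names (`reg_write`, `rd`) are assumptions, and the MUX encoding follows these notes (00 normal, 01 EX hazard, 10 MEM hazard):

```python
# Minimal sketch of the forwarding-unit conditions in these notes.
# Encoding per the notes: 00 = register file, 01 = forward from EX/MEM
# (EX hazard), 10 = forward from MEM/WB (MEM hazard). The field names
# (reg_write, rd) are illustrative assumptions.
def forward_a(ex_mem, mem_wb, id_ex_rs):
    """Choose the MUX control for one ALU source operand."""
    # EX hazard: the previous instruction writes the register we read,
    # RegWrite is asserted, and the destination is not $0.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == id_ex_rs:
        return 0b01
    # MEM hazard: the instruction two ahead writes it (and no EX hazard).
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == id_ex_rs:
        return 0b10
    return 0b00  # normal: read from the register file

# Instruction in EX/MEM writes $5; current instruction reads $5.
print(forward_a({"reg_write": True, "rd": 5},
                {"reg_write": False, "rd": 0}, 5))  # 1 (EX hazard)
```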
Stalling
- Can't always forward !!!
- Load (lw) followed by an R-type (stalling) : the data has not come back from memory yet, so how could it be forwarded?
- Stall the pipeline by keeping instructions in the same stage and inserting a NOP instead. (wait one clock cycle)
- the hazard detection unit controls the PC, the IF/ID register, and the control signals
Summary :
there are two kinds of data hazards :
- the first and second instructions are both R-type :
-> can always forward
- the first instruction is a load and the second is R-type :
-> must stall a clock cycle ; cannot forward.
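The load-use case in this summary can be sketched as a hazard-detection check (a minimal illustration; the field names follow the usual ID/EX and IF/ID pipeline registers and are assumptions):

```python
# Sketch of the load-use hazard check: a load (lw) in EX followed by an
# instruction in ID that reads the loaded register must stall one cycle,
# because the data is not available to forward yet. Field names are
# illustrative assumptions based on the usual ID/EX and IF/ID registers.
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in ID must stall behind a load in EX."""
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

# lw $5, 0($1) in EX; add $6, $5, $2 in ID -> must stall.
print(must_stall(True, 5, 5, 2))   # True
# R-type producer instead of a load -> forwarding suffices, no stall.
print(must_stall(False, 5, 5, 2))  # False
```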
Handling Branch Hazard (Control hazard)
Exception & Interrupt
- exception : Arises within the CPU
- e.g., undefined opcode, overflow, syscall…
- interrupt : from an external I/O controller
- Handling Exceptions
- Save PC of offending(or interrupted) instruction
- In MIPS : saved in the Exception Program Counter (EPC)
- Vectored Interrupts
- look up the cause in a table and jump directly to the corresponding handler
- Another form of control hazard (flush instruction)
- Simple approach : handle exceptions in order, starting from the earliest instruction
- instructions before the faulting one still complete ; the ones after it are flushed
- In complex pipelines :
- Out of order completion (data has no dependency)
Superscalar and dynamic pipelining
- pipelining is a form of instruction level parallelism (ILP)
- To increase ILP
- deeper pipeline (depth) => shorter clock cycle, because each stage has less work to do
- multiple issue (width) => duplicate pipeline stages -> multiple pipelines
-
CPI < 1 , so use instructions per cycle (IPC) instead
-
Static multiple issue :
- compiler groups instructions to be issued together
- packages them into "issue slots"
- compiler detects and avoids hazards
-
Speculation (guessing) : parallelization requires guessing (as with branches).
- speculate on branch outcome :
- roll back if the path actually taken is different
- speculate on loads :
- normally, when a load follows a store, the load address differs from the store address, so the load can go first. The loaded data is needed soon for computation, while the store only writes data back and can proceed slowly.
- roll back if the location was in fact updated (by the store).
-
Summary :
- Static multiple issue : compiler approach
- before run time, software packs instructions that can execute in parallel together and treats them as a single instruction => VLIW (Very Long Instruction Word)
- two-instruction packets (fetch 64 bits at a time) :
- one ALU / branch instruction
- one load / store instruction
- Scheduling
- Loop Unrolling :
- Replicate loop body to expose more parallelism
- Use different registers per replication
- called "register renaming" -> removes anti-dependences
- Dynamic multiple issue : hardware approach
- Scheduling (superscalar):
- Allow the CPU to execute instructions out of order to avoid stalls.
- Speculation
- Predict branch
- Load speculation
- Why do dynamic scheduling ? why not just use the software approach ?
- not all stalls are predictable
- cannot always schedule around branches
- Different implementations of an ISA have different latencies and hazards.
Memory
- Random access :
- access time is the same for all locations, unlike a magnetic disk
- SRAM :
- use for caches
- contents persist as long as power is maintained
- DRAM :
- use for main memory
- must be refreshed regularly
- memory as a 2D matrix ( read row to activate and read column to select )
- SRAM access is almost 100x faster than DRAM
- we want both speed and large capacity -> solution : Memory Hierarchy
Memory Hierarchy
- Block : basic unit of information transfer. (in virtual memory it is called a page)
- Why does a memory hierarchy work ?
- Principle of Locality :
- 90/10 rule : 10% of the code is executed 90% of the time. ( e.g., loops )
- Temporal locality : a referenced item is likely to be referenced again (loops)
- Spatial locality : arrays, sequential execution ( pc = pc + 4 )
- (upper level) registers -> caches -> memory -> disk (lower level)
- Memory Hierarchy : terminology
- Hit : data appears in upper level.
- Miss : data need to be retrieved from a block in the lower level.
- Questions for Hierarchy Design :
- block placement, block finding, block replacement, write strategy
Cache Memory
- basics of cache : direct-mapped cache
- Address mapping : modulo (the remainder after dividing by) the number of blocks
- viewed in binary : e.g. for address 10100 in a cache with only eight blocks, just look at the low three bits ! (10100 -> 100)

- several lower-level blocks map to the same upper-level slot ; how does the cache know which block is currently there ?
- besides the data, also store part of the address (the high-order bits, called the tag)
- What if there is no data in a location ? (use a valid bit)
- cache organizations and block sizes differ, so the tag width differs too -> it must be calculated
- Ex : 64 blocks, 16 bytes/block (offset)
- What block number does address 1200 map to ?
- block address => 1200 / 16 = 75
- block number => 75 modulo 64 = 11
- 1200 = 10010110000 (binary) ; divide by 10000 => 1001011
- take the low 6 bits of 1001011 as the index => 001011 (the block number : its position in the cache)
| Tag | Index | Offset |
| --- | --- | --- |
| bits 31 ~ 10 | bits 9 ~ 4 | bits 3 ~ 0 |
| 22 bits | 6 bits | 4 bits |
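The tag/index/offset split above can be reproduced with a short sketch (same hypothetical cache: 64 blocks, 16 bytes per block):

```python
# Sketch reproducing the worked example above: 64 blocks, 16 bytes per
# block, so 4 offset bits, 6 index bits, and the rest is the tag.
def split_address(addr, num_blocks=64, block_size=16):
    offset = addr % block_size           # low 4 bits
    block_address = addr // block_size   # drop the offset
    index = block_address % num_blocks   # low 6 bits of the block address
    tag = block_address // num_blocks    # remaining high-order bits
    return tag, index, offset

print(split_address(1200))  # (1, 11, 0): block address 75, 75 mod 64 = 11
```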
Block Size Considerations
- Larger blocks should reduce miss rate
- but the cache size is fixed, so the number of blocks decreases, which means more addresses compete for each block -> tradeoff
Cache Read/Write Miss/Hit :
- Read cache hit : the CPU executes normally
- Read cache miss :
- stall the CPU pipeline -> penalty !!
- fetch the block from the next level
- Write cache hit :
- the data exists in both the cache and memory, so there is a consistency problem
- Write through : also update memory
- slow -> solution : write buffer. The CPU continues immediately.
- advantage : cache and memory always stay consistent
- disadvantage : wastes memory bandwidth (traffic)
- Write back : write back to memory only when the block is being replaced
- use a dirty bit. Set it to 1 when the block has been modified
- Write cache miss :
- should the block be allocated in the cache ?
- option 1 : write back - fetch the block and replace the one in the cache
- option 2 : write through - write directly to memory ; no need to bring the block into the cache
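A minimal sketch of the write-back policy with a dirty bit, as described above (the one-block "cache" is purely illustrative):

```python
# Minimal write-back sketch: a write hit updates only the cache and sets
# the dirty bit; memory is updated when the dirty block is replaced.
# The single-block "cache" here is purely illustrative.
memory = {0: 10}
cache = {"addr": 0, "data": 10, "dirty": False}

def write_hit(addr, value):
    assert cache["addr"] == addr   # a hit: the block is in the cache
    cache["data"] = value
    cache["dirty"] = True          # mark the block as modified

def replace(new_addr, new_data):
    if cache["dirty"]:             # write the old block back first
        memory[cache["addr"]] = cache["data"]
    cache.update(addr=new_addr, data=new_data, dirty=False)

write_hit(0, 99)
print(memory[0])  # 10: memory not updated yet
replace(4, 7)
print(memory[0])  # 99: the dirty block was written back on replacement
```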
Memory Design to support cache :
- how to increase memory bandwidth to reduce miss penalty?
- 3 structures ( different memory organizations )
- one word at a time
- memory access time divides into : access time and cycle time.
- Access of DRAM :
- the address is split into two parts : a row index and a column index
- to compute a program's execution cycles ( regardless of misses or hits )
- on a miss, stall cycles must also be counted :
- stall cycles = (memory accesses/program) * miss rate * miss penalty
- Average memory access time (AMAT)
- Hit time + Miss rate * Miss penalty
- Improve cache performance :
- reduce hit time
- decrease miss rate
- decrease miss penalty
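Both formulas above can be exercised with hypothetical numbers (a sketch, not measured data):

```python
# Sketch of both formulas with hypothetical numbers.
def stall_cycles(mem_accesses, miss_rate, miss_penalty):
    """(memory accesses / program) * miss rate * miss penalty."""
    return mem_accesses * miss_rate * miss_penalty

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical: 1M accesses, 5% miss rate, 100-cycle miss penalty.
print(stall_cycles(1_000_000, 0.05, 100))  # about 5 million stall cycles
print(amat(1, 0.05, 100))                  # about 6 cycles per access
```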
- Direct mapped
- drawback : each number has only one seat (each block has exactly one possible location)
- improvement : Associative Caches
- direct mapped and fully associative are the two extremes ; fully associative means a lower-level address can map anywhere in the cache.
- the compromise is set associative. ( a set-associative cache with a single set is fully associative ; 1-way set associative is direct mapped )
- Fully associative :
- one comparator per entry (expensive) : as many comparisons as there are blocks
- no index is needed, which means the tag gets longer
- n-way set associative :
- each set contains n entries (blocks). ( each number has several seats to choose from )

- increasing associativity reduces miss rates
- Data Placement Policy :
- Direct mapped cache :
- Each memory block mapped to one location
- N-way set associative cache :
- Each memory block has choice of N locations
- fully associative cache :
- Each memory block can be placed in ANY cache location
- Cache block replacement policies :
- Direct mapped : no choice
- Set associative or fully associative :
- Random
- Least Recently Used (LRU) : the block that has gone unused the longest
- one implementation uses a pointer pointing at each block in turn
- we are given a byte address ; to get the block address, first drop the byte-offset bits. If a block holds two words, i.e. 8 bytes, the low three bits of the byte address are ignored. With the block address, use the index and tag to find which block it is in.
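LRU replacement for one fully associative set can be sketched with an ordered map (a simplified software model, not the pointer-based hardware scheme mentioned above):

```python
# LRU sketch for one fully associative set: on a miss when the set is
# full, evict the least recently used block. A simplified software
# model, not the pointer-based hardware scheme mentioned above.
from collections import OrderedDict

def access(lru_set, block_addr, capacity):
    """Return 'hit' or 'miss'; lru_set keeps blocks oldest -> newest."""
    if block_addr in lru_set:
        lru_set.move_to_end(block_addr)   # refresh recency on a hit
        return "hit"
    if len(lru_set) >= capacity:
        lru_set.popitem(last=False)       # evict the least recently used
    lru_set[block_addr] = True
    return "miss"

s = OrderedDict()
results = [access(s, b, 2) for b in [1, 2, 1, 3, 2]]
print(results)  # accessing 3 evicts 2, so the final access to 2 misses
```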
- Multiple Caches.
Virtual Memory
- Memory <–> Disk (previously we discussed Cache <–> Memory) ; the concept is the same
- among many programs, only part of each program needs to be brought into memory, under OS control
- Programs share main memory :
- each has its own private virtual address space holding its frequently used code and data
- upper level ( main / physical memory ) <–> lower level ( disk )
- the high-order bits of the address are the frame number ; the low-order bits are the offset
- to access data, first obtain the virtual address ; after the page table performs address translation, the physical address is found and the data can be accessed.
- "block" -> "page" , "miss" -> "page fault"
- Basic Issues :
- size of data blocks ?
- placement policy for a page ?
- replacement policy for a page ?
Block Size & Memory Placement
- the miss penalty is huge : a page fault takes millions of cycles to process
- so the page size must be large
- and placement is fully associative !!!
- any page may map to any frame, so a page table (in memory) records the mapping
- ( in memory it is called a frame ; on disk, a page )
- Address Translation :
- again, first find the page number : address / page size (e.g., 4K => 2^12 bytes) => ignore the low 12 bits ; the remaining bits are the page number.
- Page Table :
- holds the placement information. Resides in main memory
- Problem :
- the page table is too large
- How many memory references for each address translation?
- 2 : one to read the page table, one to access the data in memory
- Page fault :
- the page is not in main memory.
- hardware must detect the situation, but it cannot remedy it.
- resolved in software.
- the faulting process can be context-switched out
Page Replacement and Writes:
- fully associative -> when memory is full, which page leaves ? which gets replaced ? -> need a policy
- to reduce the page fault rate -> prefer least-recently used (LRU) replacement
- a reference bit (use bit) is set on access ; periodically, all the bits are cleared to zero
- writes are always write back.
- dirty bits are needed ( set to 1 when written ; the page must later be written back to disk )
Handling huge page tables : the page table is too big
- the larger the virtual address space, the larger the page table needed to map it
Ex: Page table occupies storage :
32-bit VA, 4KB page, 4bytes/entry
=> 2^20 PTEs (page table entries), a 4MB table
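The sizing arithmetic above, made explicit:

```python
# The sizing arithmetic above made explicit: 32-bit virtual addresses,
# 4 KB pages (12 offset bits), 4 bytes per page table entry.
va_bits = 32
page_size = 4 * 1024                        # 2**12 bytes
pte_bytes = 4

offset_bits = (page_size - 1).bit_length()  # 12
num_ptes = 2 ** (va_bits - offset_bits)     # 2**20 entries
table_bytes = num_ptes * pte_bytes
print(num_ptes, table_bytes)  # 1048576 entries, 4194304 bytes (4 MB)
```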
Solutions :
- use bound registers to limit the table size ; add more if exceeded.
- let pages grow in both directions
- 2 tables -> 2 registers, one for the heap , one for the stack
- hashing (modular)
- multiple levels of page tables
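The multi-level idea can be sketched with a two-level table; the 10 + 10 + 12 bit split below is a common choice but an assumption here:

```python
# Two-level sketch: split the virtual page number into two indices so
# second-level tables are allocated only when used. The 10 + 10 + 12 bit
# split is a common choice but an assumption here.
def translate(top_table, vaddr):
    vpn = vaddr >> 12                       # drop the 12-bit page offset
    top_idx, second_idx = vpn >> 10, vpn & 0x3FF
    second = top_table.get(top_idx)         # None -> table not allocated
    if second is None or second_idx not in second:
        return None                         # page fault
    frame = second[second_idx]
    return (frame << 12) | (vaddr & 0xFFF)  # physical address

top = {0: {3: 42}}                          # only VPN 3 is mapped
print(hex(translate(top, 0x3ABC)))          # 0x2aabc (frame 42 = 0x2a)
print(translate(top, 0x5000))               # None: page fault
```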
Translation Lookaside Buffer :
- if an entry keeps being used (locality), its address translation (page table lookup) keeps being repeated ; rather than keeping it only in main memory, keep it next to the CPU -> TLB
- Fast Translation Using a TLB (Translation Look-aside Buffer)
- done by hardware
- a fast cache of PTEs within the CPU
- TLB : the frequently used page table entries, kept next to the CPU, so there is no need to go to main memory to translate
- TLB hits
- TLB misses :
- translation miss (the entry is in the page table but has not been brought into the TLB)
- page fault (the page is not in main memory at all)
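The two TLB-miss cases above can be sketched as a lookup flow (dicts stand in for the TLB and page table; this is an illustration, not MIPS hardware behavior):

```python
# Sketch of the lookup flow: try the TLB, refill it from the page table
# on a translation miss, and report a page fault (handled in software)
# if the page table has no entry either. Dicts stand in for both.
def lookup(tlb, page_table, vpn):
    if vpn in tlb:
        return tlb[vpn], "tlb hit"
    if vpn in page_table:              # translation miss: refill the TLB
        tlb[vpn] = page_table[vpn]
        return tlb[vpn], "tlb miss"
    return None, "page fault"          # page not in main memory

tlb, pt = {}, {7: 123}
r1 = lookup(tlb, pt, 7)   # first use: TLB miss, refilled from page table
r2 = lookup(tlb, pt, 7)   # second use: TLB hit
r3 = lookup(tlb, pt, 9)   # no mapping at all: page fault
print(r1, r2, r3)
```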
TLB and Cache
- if the translation hits in the TLB and the resulting physical address is already in the cache, there is no need to go to main memory at all -> no memory access required
- alternative : if the cache holds virtual addresses, the TLB need not translate first ; look in the cache directly for a hit, saving the translation time. (the cache is virtually indexed and virtually tagged)
- Problems :
- the same virtual address can map to different physical addresses (a consistency problem) -> tag + process id
- Synonym/alias problem : two different virtual addresses map to the same physical address (needs hardware support)
- so caches basically still use physical addresses -> Virtually Indexed but Physically Tagged (Overlapped Access).

Memory Protection
- different tasks may share parts of their virtual address spaces, but not every user may access other users' data -> access requires supervisor mode, entered via a system call (CPU switches from user to kernel).
A common framework for memory hierarchy
- every level exploits locality. (keep frequently used items close by)
Four questions arise :
block placement, finding a block, replacement on a miss, write policy
- Block Placement :
- Direct mapped, n-way set associative, fully associative
- higher associativity reduces the miss rate, but increases complexity, cost, and access time.
- Finding a Block :
- tag comparisons : 1 (direct mapped), n (n-way set associative), all entries (# of entries, fully associative), or a full lookup table with 0 comparisons (-> for virtual memory)
- Replacement :
- Least recently used (LRU) , Random
- Write Policy :
- Write through , write back
Sources of Misses :
- Compulsory misses ( unavoidable, e.g. cold start misses )
- Capacity misses : due to finite cache size ; the capacity is insufficient
- Conflict misses : occur in non-fully-associative caches
Using a Finite State Machine to Control a Simple Cache
- Coherence Defined
Reads return most recently written value
- management can be centralized or distributed
- snooping protocols : each cache monitors bus reads/writes (constantly eavesdropping on the bus)
- broadcasts an invalidate message on the bus.
Pitfall
- Byte address vs. word address
- consider how many bytes are in a word, plus the block size / page size, to determine the block number / page number.
- In multiprocessor with shared L2 or L3 cache
- More cores -> need to increase associativity