CH1 Instruction
縮寫區
- MIPS: Microprocessor without Interlocked Pipe Stages
- RISC: reduced instruction set compute
- CISC: complex instruction set compute
- ARM: Acorn RISC Machine
- ISA: instruction set architecture
- PC: program counter
- GPR: general purpose register
- 電腦五大單元
- 輸入單元(Input Unit)簡稱IU
- 輸出單元(Output Unit)簡稱OU
- 控制單元(Control Unit)簡稱CU
- 記憶單元(Memory Unit)簡稱MU
- 算術邏輯單元(Arithmetic Logic Unit)簡稱ALU
- AM: arithmetic mean
- GM: geometric mean
- PLA: Programmable Logic Array
- BHT: branch history table
- BTB: branch target buffer
- PTE: page table entry
- TLB: translation lookaside buffer
- MMU: Memory management unit
- CLA: carry lookahead adder
- MTTR: Mean time to repair
- MTTF: Mean time to failure
- MTBF: Mean time between failure (MTTF + MTTR)
- SSD: Solid state drive
- DRAM: Dynamic random access memory
- SRAM: Static random access memory
- DMA: Direct memory access
- RAID: Redundant Array of Independent Disks
- UMA: Uniform memory access multiprocesors
- NUMA: nonuniform memory access multiprocesors
- VLIW: very long instruction word
- GPGPU: general purpose graphic process unit
- SMP: Shared-memory Multiprocessor
- MPP: Message passing Multiprocessor
- SISD: single instruction single data
- SIMD: single instruction multiple data
- MISD: multiple instruction single data (not exist)
- MIMD: multiple instruction multiple data
- word字組: CPU一次可以處裡的資料量(可從reg, ALU的設計看出)
- Endian
- big endian: (address遞增) 由most significand開始擺
- little endian: (address遞增) 由least significand開始擺
- JVM (Java virtual machine)
- 執行時才把method連(links)至Java library
- JIT (Just in time compiler)
- 執行時才把bytecode翻成native machine code
- register in MIPs
- $fp: frame pointer
- $sp: stack pointer
- $gp: global pointer
- 很重要的兩張圖
Image Not Showing
Possible Reasons
- The image was uploaded to a note which you don't have access to
- The note which the image was originally uploaded to has been deleted
Learn More →
- 上圖為process在virtual memory中的樣子
- 最初sp和fp都重疊在stack底部(以圖來看為最上方)
- 隨著process開始運作,sp會被往low address的方向(以圖來看是往下)推進
Image Not Showing
Possible Reasons
- The image was uploaded to a note which you don't have access to
- The note which the image was originally uploaded to has been deleted
Learn More →
- 上圖勘誤: SP應為stack pointer,非step pointer!
- 上圖為stack區段被放大來看的樣子,procedure call仍在同個process內
- 高階語言轉譯步驟
- compile 編譯器: 說明略
- assemble 組譯器: 說明略
- linker 連結器: 連結模組和library,解決address reference問題
- loader 載入器: 把machine code載入適當記憶體位置
指令 instruction
- 右移
- shift right arithmetic: 左邊做signed-extension
- shift right logical: 左邊補0
- 左移都一樣,右邊補0
- MIPS沒有not指令,與0 NOR即可達成 (或與自己NOR即可達成)
- NOR $t0 $t0 0 (= NOR $t0 $t0 $t0 )
- arithmetic
- add r1 r2 r3
- sub r1 r2 r3
- data transfer:
- lw reg offset(base reg)
- reg Memory [ offset+base reg ]
- sw reg offset(base reg)
- reg Memory [ offset+base reg ]
- 其他補充
- signd會做sign extension, unsigned則不會做
- lh (load half word signed)
- lhu (load half word unsigned)
- lb (load byte signed)
- lbu (load byte unsigned)
- flow control
- beq r1, r2, Label
- bne r1 r2, Label
- slt r1 r2 r3
- jr r1
- MIPS沒有blt, bgt, ble, bge
- 用beq, bne, slt即可做出虛擬指令(pseudoinstruction)
- logical
- sll (shift left logically), srl (shift right logically)
- 只寫範例
- sll r1 r2 5
- srl r1 r2 5
- and r1 r2 r3
- or r1 r2 r3
- nor r1 r2 r3
- 沒有not !!! nor即可取代
- A nor 0 = not(A or 0) = not(A)
- 常數(應該歸類在arithmetic?)
- lui (load upper immediate):
- 載入16bit的常數載入32bit reg的左半部,右半部填0
- 其他只寫範例
- addi r1 r2 4
- slti r1 r2 4
- andi r1 r2 4
- 補充: basic block (very important)
- The code in a basic block has:
- One entry point, meaning that no code within it is the destination of a jump instruction anywhere in the program.
- One exit point, meaning that only the last instruction can cause the program to begin executing code in a different basic block.
以下可至MIPS Green Card對照
- R type
- add, sub, and, srl ,sll, jr, jalr (109交大計系)…
- 以add rd rs rt為例
op |
rs |
rt |
rd |
shmat |
funct |
6 |
5 |
5 |
5 |
5 |
6 |
- I type
- J type
- jal, j
-
- addr欄位26bits,以扣除最右邊2 bit(都是00),和最左邊4bits (同process一樣)
程序呼叫 Procedure call
- 步驟
- 存參數
- 跳去
- 要空間,存前人data
- 執行
- 存回傳值
- 還前人data,還空間
- 跳回
- 12在caller做
- 34567在callee做
- 3,6順序成鏡向關係
- 這類題目爆炸難,自己去練課本p61, 62, 63
CH2 Arithmetic Logic unit (ALU)
- 考古題一定要寫13, 18, 49, 67
- 好難筆記,少寫很多東東,從課本p.150開始看就好
- overflow
- MIPS只有在add, addi, sub等指令overflow時產生exception
- addu, addiu, subu等unsign運算溢位也不管他 (programmer要負責)
- 檢查是否overflow: 看最後兩個carryout,不同就是overflow
1 bit full adder
- 以下可從truth table推出,細節見課本p150
- Cin: carry in
- Cout: carry out
- CarryOut = 有2 gate delay
- Sum = 有3 gate delay (多了not吧我猜)
64 bits ALU
- 王廷基的影片講的很好
- overflow detection: 把最後兩個1 bit ALU的carry拉出來xor
- set less than 的地方比較複雜,其他都可以自己看
- 1-62bit一樣,無腦輸出0就好(bit 63也output 0但有東西需計算)
- 方法是讓A-B,此時sign bit (bit 63)經過些判斷就能當作slt餵回bit 0使用



- N個1 bit full adder(p.150)串在一起
- critical path delay想當然爾,超級長
- gate delay
- N bit的adder中
- critical path delay (最後一個Cout算出來): 2N
- sum delay (最後一個sum算出來): 2N+1
carry lookahead adder (CLA)
- Cout = ab + (a+b)Cin
- 上面改寫成Ci+1 = aibi+(ai+bi)Ci
- 以上是推導,不用背
- 定義generate, propagate
- generate: gi = ai*bi
- propagate: pi = ai+bi
- Ci+1 = gi + pici –-> 背!
4 bits CLA
- 課本p156-158
- 課本p158圖中的add block,細節如下

- 但我猜真正的硬體是1 bit full adder(p151右下)跟這個的合體
- carry
- c1=g0+p0c0
- c2=g1+(p1g0)+(p1p0c0)
- c3=g2+(p2g1)+(p2p1g0)+(p2p1p0c0)
- c3=g3+(p3g2)+(p3p2g1)+(p3p2p1g0)+(p3p2p1p0c0)
- c2, c3, c4由Ci+1 = gi + piCi展開可得,不用背(但最好練習一下展開)
16 bits CLA
- 課本p159
- Multiple Level Carry Lookahead (上課講義說的)
- Propagate(此處Pi為大寫)
- P0=p3p2p1p0
- P1=p7p6p5p4
- P2=p11p10p9p8
- P3=p15p14p13p12
- Generate (此處Gi為大寫)
- G0=g3+(p3g2)+(p3p2g1)+(p3p2p1g0)
- G1=g7+(p7g6)+(p7p6g5)+(p7p6p5g4)
- G2=g11+(p11g10)+(p11p10g9)+(p11p10p9g8)
- G3=g15+(p15g14)+(p15p14g13)+(p15p14p13g12)
- Carry (此處Ci為大寫)
- C1=G0+P0C0
- C2=G1+(P1G0)+(P1P0C0)
- C3=G2+(P2G1)+(P2P1G0)+(P2P1P0C0)
- C3=G3+(P3G2)+(P3P2G1)+(P3P2P1G0)+(P3P2P1P0C0)
- p150-159的圖,看熟,CLA甚至要背!!!
- gate delay的計算看下表 (ex: 2-level的16 bits adder,carry delay=5, sum delay=10)

- 所有筆記的東西我現在理解了才背,希望之後回來還記得…
gate delay總結
- 幫助理解
- 小背一下 (參考清大105計系)
- 16 bits ripple carry adder c delay: 2*16=32
- 16 bits 1 level cascade CLA c delay: 3+2+2+2=9
- 16 bits 2 level CLA c delay: 2+3=5
Multiplication && Division (共5個,超重要)
- 乘法器除法器(不管有號無號)都是sequential circuit
traditional multiplication (Unsigned)
- 課本p165,務必練題目(畫運算表格)



- 在32*32的乘法運算中
- multiplier: 右移,32bits
- multiplicand: 左移,64bits
- product: 固定,64bits
hardware friendly (Robertson) multiplication (Unsigned)
- 課本p167,務必練題目(畫運算表格)


- 在32*32的乘法運算中
- multiplier: 放入product右半邊,32bits
- multiplicand: 固定,32bits
- product: 右移,64bits
Booth's algorithm (signed)
- 直接從表格理解,說明就免看了
- 計算推導過程可以背下(p.172, 173)
- 要注意product右移時要signed extend !!!
Traditional Division (unsigned)


- 稍微複雜點,要作過才會記得
- 餘數要sign extension
Hardware-friendly division (unsigned)
IEEE 754浮點數
single precision
sign |
exponent |
fraction(mantissa) |
1 |
8 |
23 |
double precision
sign |
exponent |
fraction(mantissa) |
1 |
11 |
52 |
編碼規則
exp |
frac |
representation |
1-154 |
anything |
±floating number |
0 |
0 |
±0 |
0 |
nonzero |
±Denormalized number |
255 |
0 |
±inf |
255 |
nonzero |
NaN |
Exe Time = IC CPI CCT
- IC=instruction count
- CPI=cycle per instruction
- CCT=clock cycle time
MIPS=
Amdahl's law
- speedup=
- F: 被改善的比例,S: 改善倍率
CH4 CPU datapath & control

- 控制單元truth table中
- 只有三個信號可能出現don't care (上圖有接control訊號的Mux)
- j type會用到三個cycle,雖然target address讀到指令就知道了 (考古1)
- IF
- ID
- compose target addr
- 對應其他指令cycle,這邊是EXE
- 把26bits的target addr接上左4bits、右2bits才是完整的32bits addr
- longest path計算ALU前面的Mux(ALUSrc)的時機(p388練習)
- Rtype: 需要算,因為要等reg file跑完才能過
- type: 不用算,因為sign ext後可以先進mux,此時reg file還在跑
CH5 Pipeline
-
Hazard
- Structural Hazard
- 硬體資源不足,以致於無法pipeline (ex. IF和ME都用同個memory)
- Data Hazard
- Data dependency (最常見的那種,ex. load use)
- Control Hazard
- MEM階段才得到branch結果的話,可能已經多做三個指令了
-
Delay branch
- From before: delay slot中插入不管有沒有branch都會執行的指令
- From target: 類似predict taken,插入branch後的指令
- From fall through: 類似predict not taken,插入不成立會用的指令
- 以上都需用A==B的條件說明
- 此技術已經很少被使用了Dynamic prediction比較好
-
NOPs vs stall (p448練習,成大考古)
- NOPs: 軟體hazard解法,由compiler插入,是assembly code的一部份
- stall: 硬體hazard解法,由硬體偵測並stall by processor,所有hazard都能用他解決
-
Bubble(stall)的畫法
- 寫法應該像下面那樣,才能正確標示出bubble所在的cycle,要養成習慣

- 上面那種畫法是NOPs,塞入編譯好的空指令才能這樣表示
-
Dynamic branch prediction
- branch prediction buffer = branch history table (BHT)
- branch target buffer (BTB)
- 如果branch成立,去這邊(cache)撈地址,省一個計算address的cycle
CH6 Memory
三種記憶體 (p.7)
- SRAM: cache,比較快,儲存在反向閘之間,體積較大
- DRAM: main memory,比較慢,以電荷存在電容中,體積較小
- Magnetic disk: 容量大,最便宜,最慢
快取寫入 (p.25)
- Inconsistent
-
write back
- hit
- miss
- write allocate: 去cache上要空間,寫上去
-
write through
- hit
- 寫到cache和memory上,很浪費時間
- solution: write buffer
- 處理器先寫到buffer,之後buffer再寫到memory
- miss
- Write around: 只更新資料在Memory中的值
- write allocate: 去cache上要空間,寫上去
- ref
Cache
- byte offset
- 用來存取一個 word 裡面的某個 byte,所以所需的 bit 數會是可以代表一個 word 裡面可以存的 byte 數,通常 1 word = 4 byte 所以這個部分通常為 2 bit
- block offset
- 用來存取 cache line 中的某個 word,所以由 cache line size(cache block size) 和 byte per word 來決定,舉例 64 B 的 cache block size,4 byte per word,那 block offset 就是 4 bit (64/4 = 24)
- referencet
- offset = block offset + byte offset
- 如果題目提供Word address,代表該數字已經扣除byte offset (p44練習, p127考古34)
- ex. address 13410=100001102
- In 3-way set associative cache with 2-word blocks, total size=24 words
- 他的tag=10000, idx=11, block offset=0
Virtual memory
-
兩大目的
- 供多個程式安全的共享記憶體
- 消除主記憶體空間不足的限制
- 是技術,不是memory
- 用virtual memory cache可降低hit time,因為不需要轉換至phy-addr的多餘過程
- 每個process都有自己的page table register
-
如何限制過大的page table size (p67)
- Limit register
- 兩張分頁表
- 分別位於stack(上往下)和heap(下往上)中,大小可變,MIPS採用之
- 多層page table
- OS提過,缺點是cache miss penalty變大
- page table也被分頁
- 3C miss: compulsory miss, capacity miss, conflict miss
- capacity miss是原本在cache中的block被擠出去後,再被存取卻找不到人的miss。如果該組tag/idx原本就不再cache中,算是complusory miss (p155,74題)
- context switch需清除TLB,不然會抓到錯的page,因此context switch後page fault高
- 務必寫過: p 20, 21, 68, 69, 72, 73
- 有打星星的考古都很重要一定要寫到背起來
Cache vs Virtual memory
|
術語 |
|
mapping |
write |
replace |
Cache |
block/line |
miss |
fully,set,direct |
write through/back |
random/ LRU |
virtual mem |
page |
page fault |
fully |
write through |
LRU |
virtual address cache(p78)
- 之前Cache都是用 phy idx / phy tag
- 需要先用TLB轉換至phy addr才能去cache找資料,效率差
- 如果是 vir idx / vir tag 呢
- 效率明顯變好,critical path上移除TLB
- 問題: 可能有aliasing(別名)問題,兩個vir addr可能對應到同個phy addr
- 造成data inconsistent (同步問題)
- 解法: vir idx / phy tag,平行處理TLB和Cache
TLB
CH7 儲存裝置和其他周邊設備
CH8 多重處理器
- 兩種多重處理器
- 記憶體共享(重點3): shared-memory multiprocessor (SMP)
- 傳遞訊息(重點4): message passing multiprocessor (MPP)
- 兩種多重處裡器,作加法時的演算法不同 (看課本p.311, 316)
- code不好背,看圖記規則
-
重點5: 多處裡器的快取一致性
- snooping (p318)
- cache coherence protocol,分以下兩種
- write-update
- 又稱write-broadcast
- 用匯流排把所有對應該data的cache都更新
- write-invalid
- 先把別人快取copy變invalid,在把自己變invalid
-
重點6: 單一晶片multithreading
- find-grained multithreading
- coarse-grained multithreading
- simultaneous multithreading (SMT)
- with "Instruction level parallelism (ILP)"和"thread level parallelism (TLP)"
- 用danamically scheduled processor達成
-
重點7: 平行電腦分類
- Flynn's taxonomy
- SISD: single instruction single data (uniprocessor, scalar processor)
- SIMD: single instruction multiple data
- MISD: multiple instruction single data (not exist)
- MIMD: multiple instruction multiple data
- Processors
- 比較與差異
- SIMD v.s. vector processor
- SIMD處裡的vector長度不可變
- vector processor處裡的vector長度可變
- superscaler processor v.s. pipeline
- superscaler processor: 平行執行多個指令用多個執行unit
- pipeline: 平行執行多個指令用一個執行unit
- superscaler + pipeline:

- parallelism
- ILP
- Instruction-level parallelism is a measure of how many of the instructions in a computer program can be executed simultaneously.
- DLP
- Data-level parallelism is the parallelism achieved by operating on independent data.
- MLP
- Memory-level parallelism refers to the ability to have pending multiple memory operations, in particular cache misses or translation lookaside buffer (TLB) misses, at the same time.
- TLP
- Thread-level parallelism is the parallelism inherent in an application that runs multiple threads at once.
- 以上四種平行只是概念,實作可用硬體或軟體達成 (108台大計系)

圖要背起來
上冊
- 16 bit 2 level carry lookahead adder (p.159)

下冊
- MIPS datapath
- Direct mapped with block offset (P.17)
- Utilize spatial locality

- 4-way set associative (P.40)
- Virtual address getting data (P.76)
- 先去TLB找phy addr, 假設找到,再去cache挖資料
