112考資工研究所時寫的筆記，有錯歡迎留言指正
內文有引用他人筆記/教材/文章的地方
若有侵權十分抱歉，告知後將立刻撤除

CH1 Instruction

縮寫區

MIPS: Microprocessor without Interlocked Pipe Stages
RISC: reduced instruction set compute
CISC: complex instruction set compute
ARM: Acorn RISC Machine
ISA: instruction set architecture
PC: program counter
GPR: general purpose register
電腦五大單元
1. 輸入單元(Input Unit)簡稱IU
2. 輸出單元(Output Unit)簡稱OU
3. 控制單元(Control Unit)簡稱CU
4. 記憶單元(Memory Unit)簡稱MU
5. 算術邏輯單元(Arithmetic Logic Unit)簡稱ALU
AM: arithmetic mean
GM: geometric mean
PLA: Programmable Logic Array
BHT: branch history table
BTB: branch target buffer
PTE: page table entry
TLB: translation lookaside buffer
MMU: Memory management unit
CLA: carry lookahead adder
MTTR: Mean time to repair
MTTF: Mean time to failure
MTBF: Mean time between failure (MTTF + MTTR)
SSD: Solid state drive
DRAM: Dynamic random access memory
SRAM: Static random access memory
DMA: Direct memory access
RAID: Redundant Array of Independent Disks
UMA: Uniform memory access multiprocesors
NUMA: nonuniform memory access multiprocesors
VLIW: very long instruction word
GPGPU: general purpose graphic process unit
SMP: Shared-memory Multiprocessor
MPP: Message passing Multiprocessor
SISD: single instruction single data
SIMD: single instruction multiple data
MISD: multiple instruction single data (not exist)
MIMD: multiple instruction multiple data

word字組: CPU一次可以處裡的資料量(可從reg, ALU的設計看出)
- RISC是32bit word
Endian
- big endian: (address遞增) 由most significand開始擺
  - MIPS
- little endian: (address遞增) 由least significand開始擺
  - RISC-V
JVM (Java virtual machine)
- 執行時才把method連(links)至Java library
JIT (Just in time compiler)
- 執行時才把bytecode翻成native machine code
register in MIPs
- $fp: frame pointer
- $sp: stack pointer
- $gp: global pointer
很重要的兩張圖
- Image Not Showing Possible Reasons
  The image was uploaded to a note which you don't have access to
  The note which the image was originally uploaded to has been deleted
  Learn More →
- 上圖為process在virtual memory中的樣子
- 最初sp和fp都重疊在stack底部(以圖來看為最上方)
- 隨著process開始運作，sp會被往low address的方向(以圖來看是往下)推進
- Image Not Showing Possible Reasons
  The image was uploaded to a note which you don't have access to
  The note which the image was originally uploaded to has been deleted
  Learn More →
- 上圖勘誤: SP應為stack pointer，非step pointer!
- 上圖為stack區段被放大來看的樣子，procedure call仍在同個process內
高階語言轉譯步驟
- compile 編譯器: 說明略
- assemble 組譯器: 說明略
- linker 連結器: 連結模組和library，解決address reference問題
- loader 載入器: 把machine code載入適當記憶體位置

指令 instruction

右移
- shift right arithmetic: 左邊做signed-extension
- shift right logical: 左邊補0
- 左移都一樣，右邊補0
MIPS沒有not指令，與0 NOR即可達成 (或與自己NOR即可達成)
- NOR $t0 $t0 0 (= NOR $t0 $t0 $t0 )
arithmetic
- add r1 r2 r3
  - r1
    $\leftarrow$ r2 + r3
- sub r1 r2 r3
  - r1
    $\leftarrow$ r2 - r3
data transfer:
- lw reg offset(base reg)
  - reg
    $\leftarrow$ Memory [ offset+base reg ]
- sw reg offset(base reg)
  - reg
    $\to$ Memory [ offset+base reg ]
- 其他補充
  - signd會做sign extension, unsigned則不會做
  - lh (load half word signed)
  - lhu (load half word unsigned)
  - lb (load byte signed)
  - lbu (load byte unsigned)
flow control
- beq r1, r2, Label
  - if(r1==r2) PC
    $\leftarrow$ Label
- bne r1 r2, Label
  - if(r1!=r2) PC
    $\leftarrow$ Label
- slt r1 r2 r3
  - r1
    $\leftarrow$ 1 if(r2<r3) else 0
- jr r1
  - PC
    $\leftarrow$ r1
- MIPS沒有blt, bgt, ble, bge
- 用beq, bne, slt即可做出虛擬指令(pseudoinstruction)
logical
- sll (shift left logically), srl (shift right logically)
- 只寫範例
- sll r1 r2 5
- srl r1 r2 5
- and r1 r2 r3
- or r1 r2 r3
- nor r1 r2 r3
- 沒有not !!! nor即可取代
  - A nor 0 = not(A or 0) = not(A)
常數(應該歸類在arithmetic?)
- lui (load upper immediate):
  - 載入16bit的常數載入32bit reg的左半部，右半部填0
- 其他只寫範例
- addi r1 r2 4
- slti r1 r2 4
- andi r1 r2 4
補充: basic block (very important)
- The code in a basic block has:
  - One entry point, meaning that no code within it is the destination of a jump instruction anywhere in the program.
  - One exit point, meaning that only the last instruction can cause the program to begin executing code in a different basic block.

指令格式 MIPS instruction format

以下可至MIPS Green Card對照

R type
- add, sub, and, srl ,sll, jr, jalr (109交大計系)…
- 以add rd rs rt為例
  
  op rs rt rd shmat funct
  
  6 5 5 5 5 6
I type
- lw, sw, beq, bne, addi
- 以lw rt 32(rs)為例
  
  op rs rt addr/imm
  
  6 5 5 16
- addr以branch下一個指令的位置當基準(0)，往上為負，往下為正
J type
- jal, j
- op addr
  
  6 26
- addr欄位26bits，以扣除最右邊2 bit(都是00)，和最左邊4bits (同process一樣)

op	rs	rt	rd	shmat	funct
6	5	5	5	5	6

op	rs	rt	addr/imm
6	5	5	16

op	addr
6	26

程序呼叫 Procedure call

步驟
1. 存參數
2. 跳去
3. 要空間，存前人data
4. 執行
5. 存回傳值
6. 還前人data，還空間
7. 跳回
- 12在caller做
- 34567在callee做
- 3,6順序成鏡向關係
這類題目爆炸難，自己去練課本p61, 62, 63

CH2 Arithmetic Logic unit (ALU)

考古題一定要寫13, 18, 49, 67
好難筆記，少寫很多東東，從課本p.150開始看就好
overflow
- MIPS只有在add, addi, sub等指令overflow時產生exception
- addu, addiu, subu等unsign運算溢位也不管他 (programmer要負責)
- 檢查是否overflow: 看最後兩個carryout，不同就是overflow
  - xor起來是1就是overflow

1 bit full adder

以下可從truth table推出，細節見課本p150
Cin: carry in
Cout: carry out
CarryOut =
$a b + a C i n + b C i n$ 有2 gate delay
Sum =
$a \overset{―}{b} \overset{―}{C i n} + \overset{―}{a} b \overset{―}{C i n} + \overset{―}{a} \overset{―}{b} C i n + a b C i n$ 有3 gate delay (多了not吧我猜)

64 bits ALU

王廷基的影片講的很好
overflow detection: 把最後兩個1 bit ALU的carry拉出來xor
set less than 的地方比較複雜，其他都可以自己看
- 1-62bit一樣，無腦輸出0就好(bit 63也output 0但有東西需計算)
- 方法是讓A-B，此時sign bit (bit 63)經過些判斷就能當作slt餵回bit 0使用

Ripple carry header

N個1 bit full adder(p.150)串在一起
critical path delay想當然爾，超級長
gate delay
- N bit的adder中
- critical path delay (最後一個Cout算出來): 2N
- sum delay (最後一個sum算出來): 2N+1

carry lookahead adder (CLA)

C_out = ab + (a+b)C_in
上面改寫成C_i+1 = a_ib_i+(a_i+b_i)C_i
以上是推導，不用背
定義generate, propagate
- generate: g_i = a_i*b_i
- propagate: p_i = a_i+b_i
C_i+1 = g_i + p_ic_i –-> 背!

4 bits CLA

課本p156-158
課本p158圖中的add block，細節如下
- 但我猜真正的硬體是1 bit full adder(p151右下)跟這個的合體
carry
- c1=g0+p0c0
- c2=g1+(p1g0)+(p1p0c0)
- c3=g2+(p2g1)+(p2p1g0)+(p2p1p0c0)
- c3=g3+(p3g2)+(p3p2g1)+(p3p2p1g0)+(p3p2p1p0c0)
- c2, c3, c4由Ci+1 = gi + piCi展開可得，不用背(但最好練習一下展開)

16 bits CLA

課本p159
Multiple Level Carry Lookahead (上課講義說的)
Propagate(此處Pi為大寫)
- P0=p3p2p1p0
- P1=p7p6p5p4
- P2=p11p10p9p8
- P3=p15p14p13p12
Generate (此處Gi為大寫)
- G0=g3+(p3g2)+(p3p2g1)+(p3p2p1g0)
- G1=g7+(p7g6)+(p7p6g5)+(p7p6p5g4)
- G2=g11+(p11g10)+(p11p10g9)+(p11p10p9g8)
- G3=g15+(p15g14)+(p15p14g13)+(p15p14p13g12)
Carry (此處Ci為大寫)
- C1=G0+P0C0
- C2=G1+(P1G0)+(P1P0C0)
- C3=G2+(P2G1)+(P2P1G0)+(P2P1P0C0)
- C3=G3+(P3G2)+(P3P2G1)+(P3P2P1G0)+(P3P2P1P0C0)
p150-159的圖，看熟，CLA甚至要背!!!
gate delay的計算看下表 (ex: 2-level的16 bits adder，carry delay=5, sum delay=10)
- 所有筆記的東西我現在理解了才背，希望之後回來還記得…

gate delay總結

幫助理解
小背一下 (參考清大105計系)
- 16 bits ripple carry adder c delay: 2*16=32
- 16 bits 1 level cascade CLA c delay: 3+2+2+2=9
- 16 bits 2 level CLA c delay: 2+3=5

Multiplication && Division (共5個，超重要)

乘法器除法器(不管有號無號)都是sequential circuit

traditional multiplication (Unsigned)

課本p165，務必練題目(畫運算表格)
在32*32的乘法運算中
- multiplier: 右移，32bits
- multiplicand: 左移，64bits
- product: 固定，64bits

hardware friendly (Robertson) multiplication (Unsigned)

課本p167，務必練題目(畫運算表格)
在32*32的乘法運算中
- multiplier: 放入product右半邊，32bits
- multiplicand: 固定，32bits
- product: 右移，64bits

Booth's algorithm (signed)

直接從表格理解，說明就免看了
計算推導過程可以背下(p.172, 173)
要注意product右移時要signed extend !!!

Traditional Division (unsigned)

稍微複雜點，要作過才會記得
餘數要sign extension

Hardware-friendly division (unsigned)

餘數要sign extension
remainder只有左半邊和divisor相減 !!!(p.182)
乘法器與除法器共用硬體

IEEE 754浮點數

single precision

range:
$1.1 . . .1 \times 2^{127} t o 1.0 \times 2^{- 126}$ 和負數同範圍
- 非正規化單精度可達1.0*2^-149

sign	exponent	fraction(mantissa)
1	8	23

double precision

sign	exponent	fraction(mantissa)
1	11	52

編碼規則

exp	frac	representation
1-154	anything	±floating number
0	0	±0
0	nonzero	±Denormalized number
255	0	±inf
255	nonzero	NaN

CH3 Computer performance

Exe Time = IC
$\times$ CPI
$\times$ CCT

IC=instruction count
CPI=cycle per instruction
CCT=clock cycle time

MIPS=
$\frac{I C}{E x e T i m e \times 10^{6}} = \frac{r a t e}{C P I \times 10^{6}}$

Amdahl's law

speedup=
$\frac{1}{\frac{F}{S} + (1 - F)}$
F: 被改善的比例，S: 改善倍率

CH4 CPU datapath & control

控制單元truth table中
- 只有三個信號可能出現don't care (上圖有接control訊號的Mux)
  - RegDst
  - ALUSrc
  - MemtoReg
j type會用到三個cycle，雖然target address讀到指令就知道了 (考古1)
- IF
- ID
- compose target addr
  - 對應其他指令cycle，這邊是EXE
  - 把26bits的target addr接上左4bits、右2bits才是完整的32bits addr
longest path計算ALU前面的Mux(ALUSrc)的時機(p388練習)
- Rtype: 需要算，因為要等reg file跑完才能過
- $I$ type: 不用算，因為sign ext後可以先進mux，此時reg file還在跑

CH5 Pipeline

Hazard
- Structural Hazard
  - 硬體資源不足，以致於無法pipeline (ex. IF和ME都用同個memory)
- Data Hazard
  - Data dependency (最常見的那種，ex. load use)
- Control Hazard
  - MEM階段才得到branch結果的話，可能已經多做三個指令了
Delay branch
- From before: delay slot中插入不管有沒有branch都會執行的指令
- From target: 類似predict taken，插入branch後的指令
- From fall through: 類似predict not taken，插入不成立會用的指令
- 以上都需用A==B的條件說明
- 此技術已經很少被使用了Dynamic prediction比較好
NOPs vs stall (p448練習，成大考古)
- NOPs: 軟體hazard解法，由compiler插入，是assembly code的一部份
- stall: 硬體hazard解法，由硬體偵測並stall by processor，所有hazard都能用他解決
Bubble(stall)的畫法
- 寫法應該像下面那樣，才能正確標示出bubble所在的cycle，要養成習慣
- 上面那種畫法是NOPs，塞入編譯好的空指令才能這樣表示
Dynamic branch prediction
- branch prediction buffer = branch history table (BHT)
  - 就是那些預測用的state
- branch target buffer (BTB)
  - 如果branch成立，去這邊(cache)撈地址，省一個計算address的cycle

CH6 Memory

三種記憶體 (p.7)

SRAM: cache，比較快，儲存在反向閘之間，體積較大
DRAM: main memory，比較慢，以電荷存在電容中，體積較小
Magnetic disk: 容量大，最便宜，最慢

快取寫入 (p.25)

Inconsistent
- 資料在快取和主記憶體不一致
write back
- hit
  - 只寫到cache上，用dirty bit管理
- miss
  - write allocate: 去cache上要空間，寫上去
write through
- hit
  - 寫到cache和memory上，很浪費時間
  - solution: write buffer
    - 處理器先寫到buffer，之後buffer再寫到memory
- miss
  - Write around: 只更新資料在Memory中的值
  - write allocate: 去cache上要空間，寫上去
ref

Cache

byte offset
- 用來存取一個 word 裡面的某個 byte，所以所需的 bit 數會是可以代表一個 word 裡面可以存的 byte 數，通常 1 word = 4 byte 所以這個部分通常為 2 bit
block offset
- 用來存取 cache line 中的某個 word，所以由 cache line size(cache block size) 和 byte per word 來決定，舉例 64 B 的 cache block size，4 byte per word，那 block offset 就是 4 bit (64/4 = 2⁴)
- referencet
offset = block offset + byte offset
如果題目提供Word address，代表該數字已經扣除byte offset (p44練習, p127考古34)
- ex. address 134₁₀=10000110₂
- In 3-way set associative cache with 2-word blocks, total size=24 words
- 他的tag=10000, idx=11, block offset=0

Virtual memory

把主記憶體當作磁碟的Cache

兩大目的
- 供多個程式安全的共享記憶體
- 消除主記憶體空間不足的限制

是技術，不是memory
用virtual memory cache可降低hit time，因為不需要轉換至phy-addr的多餘過程
每個process都有自己的page table register
- 紀錄page table起點
如何限制過大的page table size (p67)
- Limit register
  - page table大小可調整
- 兩張分頁表
  - 分別位於stack(上往下)和heap(下往上)中，大小可變，MIPS採用之
- 多層page table
  - OS提過，缺點是cache miss penalty變大
- page table也被分頁
  - 可能會無限page fault
3C miss: compulsory miss, capacity miss, conflict miss
- capacity miss是原本在cache中的block被擠出去後，再被存取卻找不到人的miss。如果該組tag/idx原本就不再cache中，算是complusory miss (p155,74題)
context switch需清除TLB，不然會抓到錯的page，因此context switch後page fault高
務必寫過: p 20, 21, 68, 69, 72, 73
有打星星的考古都很重要一定要寫到背起來

Cache vs Virtual memory

	術語		mapping	write	replace
Cache	block/line	miss	fully,set,direct	write through/back	random/ LRU
virtual mem	page	page fault	fully	write through	LRU

virtual address cache(p78)

之前Cache都是用 phy idx / phy tag
- 需要先用TLB轉換至phy addr才能去cache找資料，效率差
如果是 vir idx / vir tag 呢
- 效率明顯變好，critical path上移除TLB
- 問題: 可能有aliasing(別名)問題，兩個vir addr可能對應到同個phy addr
  - 造成data inconsistent (同步問題)
- 解法: vir idx / phy tag，平行處理TLB和Cache
  - 拿TLB得到的phy tag進cache找東西

TLB

TLB miss的處理
- x86: Hardware only
- MIPS: OS kernel mode

CH7 儲存裝置和其他周邊設備

Disk Access Time
- Seek time + Control time + Rotation time + Wait time + Data Transfer time
- 口訣: SCRWT
- Rotation time =
  $\frac{0.5}{R P S}$
  - RPS: rotation per second
  - 小心單位轉換(s/ms, RPM/RPS)
MTTF, MTTR
- 衡量availability的一種標準
- MTTR: Mean time to repair
- MTTF: Mean time to failure
- MTBF: Mean time between failure (MTTF + MTTR)
- availability =
  $\frac{M T T F}{M T T F + M T T R}$
Mega大: CA ch7 Disk筆記
RAID
- RAID 0+1的本質是RAID 0
- RAID 1+0的本質是RAID 1
- RAID 3 不利於small data write
IO data transfer & processor
- Programmed IO = polling
- Interrupt
- DMA
  - Direct memory access (DMA) is the mechanism that provides device controller with the ability to transfer data directly from or to memory without the involving the processor.
  - 會作cycle stealing ( Interleaving )
- CPU utilization:
  - DMA > Interrupt > polling
- Latency:
  - polling < DMA < Interrupt
Polling / Interrupt / DMA 比較

Polling Interrupt DMA

優點 simple processor不用一直看IO device 用cycle stealing 更省時間

缺點浪費processor時間比polling複雜需要硬體支援
Blocking IO & non-blocking IO
- 參考: 2022/11/19 20:09 TG群
- 簡單這樣記
  - Blocking IO = Synchronous IO 會進waiting state
  - Non-blocking IO = polling 留在running state，process困在polling
  - Asynchronous IO 留在running state，可做IO外的事，等inturrupt
  - interrupt可能屬於blocking 或 non blocking，參考下文ref
  - ref

	Polling	Interrupt	DMA
優點	simple	processor不用一直看IO device	用cycle stealing 更省時間
缺點	浪費processor時間	比polling複雜	需要硬體支援

CH8 多重處理器

兩種多重處理器
- 記憶體共享(重點3): shared-memory multiprocessor (SMP)
- 傳遞訊息(重點4): message passing multiprocessor (MPP)
- 兩種多重處裡器，作加法時的演算法不同 (看課本p.311, 316)
- code不好背，看圖記規則
重點5: 多處裡器的快取一致性
- snooping (p318)
- cache coherence protocol，分以下兩種
- write-update
  - 又稱write-broadcast
  - 用匯流排把所有對應該data的cache都更新
- write-invalid
  - 先把別人快取copy變invalid，在把自己變invalid
重點6: 單一晶片multithreading
- find-grained multithreading
  - 隱藏short/long stalls
- coarse-grained multithreading
  - 只能藏住long stalls
- simultaneous multithreading (SMT)
  - with "Instruction level parallelism (ILP)"和"thread level parallelism (TLP)"
  - 用danamically scheduled processor達成
重點7: 平行電腦分類
Flynn's taxonomy
- SISD: single instruction single data (uniprocessor, scalar processor)
- SIMD: single instruction multiple data
- MISD: multiple instruction single data (not exist)
- MIMD: multiple instruction multiple data
Processors
- vector processor (array processor)
  - 能高效且有效地處理稱為向量的大型1d array (ex. GPU)
- scalar processor
  - 一次處裡一個data，和vector processor相反
- superscaler processor
  - 在一個processor中實現指令平行化
比較與差異
- SIMD v.s. vector processor
  - SIMD處裡的vector長度不可變
  - vector processor處裡的vector長度可變
- superscaler processor v.s. pipeline
  - superscaler processor: 平行執行多個指令用多個執行unit
  - pipeline: 平行執行多個指令用一個執行unit
  - superscaler + pipeline:
parallelism
- ILP
  - Instruction-level parallelism is a measure of how many of the instructions in a computer program can be executed simultaneously.
- DLP
  - Data-level parallelism is the parallelism achieved by operating on independent data.
- MLP
  - Memory-level parallelism refers to the ability to have pending multiple memory operations, in particular cache misses or translation lookaside buffer (TLB) misses, at the same time.
- TLP
  - Thread-level parallelism is the parallelism inherent in an application that runs multiple threads at once.
- 以上四種平行只是概念，實作可用硬體或軟體達成 (108台大計系)

圖要背起來

上冊

16 bit 2 level carry lookahead adder (p.159)

下冊

MIPS datapath
Direct mapped with block offset (P.17)
- Utilize spatial locality
4-way set associative (P.40)
Virtual address getting data (P.76)
- 先去TLB找phy addr, 假設找到，再去cache挖資料

CH1 Instruction

縮寫區

指令 instruction

指令格式 MIPS instruction format

程序呼叫 Procedure call

CH2 Arithmetic Logic unit (ALU)

1 bit full adder

64 bits ALU

Ripple carry header

carry lookahead adder (CLA)

4 bits CLA

16 bits CLA

gate delay總結

Multiplication && Division (共5個，超重要)

traditional multiplication (Unsigned)

hardware friendly (Robertson) multiplication (Unsigned)

Booth's algorithm (signed)

Traditional Division (unsigned)

Hardware-friendly division (unsigned)

IEEE 754浮點數

single precision

double precision

編碼規則

CH3 Computer performance

Exe Time = IC × CPI × CCT

MIPS=ICExeTime×106=rateCPI×106

Amdahl's law

CH4 CPU datapath & control

CH5 Pipeline

Hazard

Delay branch

NOPs vs stall (p448練習，成大考古)

Bubble(stall)的畫法

Dynamic branch prediction

CH6 Memory

三種記憶體 (p.7)

快取寫入 (p.25)

write back

write through

Cache

Virtual memory

把主記憶體當作磁碟的Cache

兩大目的

如何限制過大的page table size (p67)

Cache vs Virtual memory

virtual address cache(p78)

TLB

CH7 儲存裝置和其他周邊設備

MTTF, MTTR

IO data transfer & processor

CH8 多重處理器

重點5: 多處裡器的快取一致性

重點6: 單一晶片multithreading

重點7: 平行電腦分類

圖要背起來

上冊

下冊

Read more

線代筆記

OS筆記

Algo筆記

資結筆記

Exe Time = IC
$\times$ CPI
$\times$ CCT

MIPS=
$\frac{I C}{E x e T i m e \times 10^{6}} = \frac{r a t e}{C P I \times 10^{6}}$