```
112考資工研究所時寫的筆記,有錯歡迎留言指正
內文有引用他人筆記/教材/文章的地方
若有侵權十分抱歉,告知後將立刻撤除
```
# CH1 Instruction
:::info
### 縮寫區
- MIPS: Microprocessor without Interlocked Pipe Stages
- RISC: reduced instruction set compute
- CISC: complex instruction set compute
- ARM: Acorn RISC Machine
- ISA: instruction set architecture
- PC: program counter
- GPR: general purpose register
- 電腦五大單元
1. 輸入單元(Input Unit)簡稱IU
2. 輸出單元(Output Unit)簡稱OU
3. 控制單元(Control Unit)簡稱CU
4. 記憶單元(Memory Unit)簡稱MU
5. 算術邏輯單元(Arithmetic Logic Unit)簡稱ALU
- AM: arithmetic mean
- GM: geometric mean
- PLA: Programmable Logic Array
- BHT: branch history table
- BTB: branch target buffer
- PTE: page table entry
- TLB: translation lookaside buffer
- MMU: Memory management unit
- CLA: carry lookahead adder
- MTTR: Mean time to repair
- MTTF: Mean time to failure
- MTBF: Mean time between failure (MTTF + MTTR)
- SSD: Solid state drive
- DRAM: Dynamic random access memory
- SRAM: Static random access memory
- DMA: Direct memory access
- RAID: Redundant Array of Independent Disks
- UMA: Uniform memory access multiprocesors
- NUMA: nonuniform memory access multiprocesors
- VLIW: very long instruction word
- GPGPU: general purpose graphic process unit
- SMP: Shared-memory Multiprocessor
- MPP: Message passing Multiprocessor
- SISD: single instruction single data
- SIMD: single instruction multiple data
- MISD: multiple instruction single data (not exist)
- MIMD: multiple instruction multiple data
:::
- word字組: CPU一次可以處裡的資料量(可從reg, ALU的設計看出)
- RISC是32bit word
- Endian
- big endian: (address遞增) 由most significand開始擺
- MIPS
- little endian: (address遞增) 由least significand開始擺
- RISC-V
- JVM (Java virtual machine)
- 執行時才把method連(links)至Java library
- JIT (Just in time compiler)
- 執行時才把bytecode翻成native machine code
- register in MIPs
- $fp: frame pointer
- $sp: stack pointer
- $gp: global pointer
- 很重要的兩張圖
- 
- 上圖為process在virtual memory中的樣子
- 最初sp和fp都重疊在stack底部(以圖來看為最上方)
- 隨著process開始運作,sp會被往low address的方向(以圖來看是往下)推進
- 
- 上圖勘誤: SP應為stack pointer,非step pointer!
- 上圖為stack區段被放大來看的樣子,procedure call仍在同個process內
- 高階語言轉譯步驟
- compile 編譯器: 說明略
- assemble 組譯器: 說明略
- linker 連結器: 連結模組和library,解決address reference問題
- loader 載入器: 把machine code載入適當記憶體位置
## 指令 instruction
- 右移
- shift right arithmetic: 左邊做signed-extension
- shift right logical: 左邊補0
- 左移都一樣,右邊補0
- MIPS沒有not指令,與0 NOR即可達成 (或與自己NOR即可達成)
- NOR $t0 $t0 0 (= NOR $t0 $t0 $t0 )
- arithmetic
- add r1 r2 r3
- r1 $\leftarrow$ r2 + r3
- sub r1 r2 r3
- r1 $\leftarrow$ r2 - r3
- data transfer:
- lw reg offset(base reg)
- reg $\leftarrow$ Memory [ offset+base reg ]
- sw reg offset(base reg)
- reg $\rightarrow$ Memory [ offset+base reg ]
- 其他補充
- signd會做sign extension, unsigned則不會做
- lh (load half word signed)
- lhu (load half word unsigned)
- lb (load byte signed)
- lbu (load byte unsigned)
- flow control
- beq r1, r2, Label
- if(r1==r2) PC $\leftarrow$ Label
- bne r1 r2, Label
- if(r1!=r2) PC $\leftarrow$ Label
- slt r1 r2 r3
- r1 $\leftarrow$ 1 if(r2<r3) else 0
- jr r1
- PC $\leftarrow$ r1
- MIPS沒有blt, bgt, ble, bge
- 用beq, bne, slt即可做出虛擬指令(pseudoinstruction)
- logical
- sll (shift left logically), srl (shift right logically)
- 只寫範例
- sll r1 r2 5
- srl r1 r2 5
- and r1 r2 r3
- or r1 r2 r3
- nor r1 r2 r3
- 沒有not !!! nor即可取代
- ==A nor 0== = not(A or 0) = ==not(A)==
- 常數(應該歸類在arithmetic?)
- lui (load upper immediate):
- 載入16bit的常數載入32bit reg的左半部,右半部填0
- 其他只寫範例
- addi r1 r2 4
- slti r1 r2 4
- andi r1 r2 4
- 補充: basic block (very important)
- The code in a basic block has:
- One entry point, meaning that no code within it is the destination of a jump instruction anywhere in the program.
- One exit point, meaning that only the last instruction can cause the program to begin executing code in a different basic block.
## 指令格式 MIPS instruction format
以下可至[MIPS Green Card](https://inst.eecs.berkeley.edu/~cs61c/resources/MIPS_Green_Sheet.pdf)對照
- R type
- add, sub, and, ==srl ,sll, jr, jalr (109交大計系)==...
- 以add rd rs rt為例
| op | rs | rt | rd | shmat | funct |
| --- | --- | --- | --- | ----- | ----- |
| 6 | 5 | 5 | 5 | 5 | 6 |
- I type
- lw, sw, beq, bne, addi
- 以lw rt 32(rs)為例
| op | rs | rt | addr/imm |
| --- | --- | --- | -------- |
| 6 | 5 | 5 | 16 |
- addr以==branch下一個指令的位置當基準(0)==,往上為負,往下為正
- J type
- jal, j
-
| op | addr |
| --- | ---- |
| 6 | 26 |
- ==addr欄位26bits,以扣除最右邊2 bit(都是00),和最左邊4bits (同process一樣)==
## 程序呼叫 Procedure call
- 步驟
1. 存參數
2. 跳去
3. 要空間,存前人data
4. 執行
5. 存回傳值
6. 還前人data,還空間
7. 跳回
- 12在caller做
- 34567在callee做
- 3,6順序成鏡向關係
- 這類題目爆炸難,自己去練課本p61, 62, 63
# CH2 Arithmetic Logic unit (ALU)
- 考古題一定要寫13, 18, 49, 67
- 好難筆記,少寫很多東東,從課本p.150開始看就好
- overflow
- MIPS只有在add, addi, sub等指令overflow時產生exception
- addu, addiu, subu等unsign運算溢位也不管他 (programmer要負責)
- 檢查是否overflow: 看最後兩個carryout,不同就是overflow
- xor起來是1就是overflow
## 1 bit full adder
- 以下可從truth table推出,細節見課本p150
- Cin: carry in
- Cout: carry out
- CarryOut = $ab + aCin + bCin$ 有2 gate delay
- Sum = $a\overline{b} \overline{Cin}+\overline{a}b \overline{Cin}+\overline{a}\overline{b} Cin+abCin$ 有3 gate delay (多了not吧我猜)
## 64 bits ALU
- [王廷基的影片講的很好](https://drive.google.com/drive/folders/11ofRhppqqWbPml2EGiomIeYkmq09Y2BM)
- overflow detection: 把最後兩個1 bit ALU的carry拉出來xor
- 
- set less than 的地方比較複雜,其他都可以自己看
- 1-62bit一樣,無腦輸出0就好(bit 63也output 0但有東西需計算)
- 方法是讓A-B,此時sign bit (bit 63)經過些判斷就能當作slt餵回bit 0使用
- 
- 
- 
## Ripple carry header
- N個1 bit full adder(p.150)串在一起
- critical path delay想當然爾,超級長
- gate delay
- N bit的adder中
- critical path delay (最後一個Cout算出來): 2N
- sum delay (最後一個sum算出來): 2N+1
## carry lookahead adder (CLA)
- C~out~ = ab + (a+b)C~in~
- 上面改寫成C~i+1~ = a~i~b~i~+(a~i~+b~i~)C~i~
- 以上是推導,不用背
- 定義generate, propagate
- generate: g~i~ = a~i~*b~i~
- propagate: p~i~ = a~i~+b~i~
- C~i+1~ = g~i~ + p~i~c~i~ ---> 背!
### 4 bits CLA
- 課本p156-158
- 課本p158圖中的add block,細節如下
- 
- 但我猜真正的硬體是1 bit full adder(p151右下)跟這個的合體
- carry
- c1=g0+p0c0
- c2=g1+(p1g0)+(p1p0c0)
- c3=g2+(p2g1)+(p2p1g0)+(p2p1p0c0)
- c3=g3+(p3g2)+(p3p2g1)+(p3p2p1g0)+(p3p2p1p0c0)
- c2, c3, c4由Ci+1 = gi + piCi展開可得,不用背(但最好練習一下展開)
### 16 bits CLA
- 課本p159
- Multiple Level Carry Lookahead (上課講義說的)
- Propagate(此處Pi為大寫)
- P0=p3p2p1p0
- P1=p7p6p5p4
- P2=p11p10p9p8
- P3=p15p14p13p12
- Generate (此處Gi為大寫)
- G0=g3+(p3g2)+(p3p2g1)+(p3p2p1g0)
- G1=g7+(p7g6)+(p7p6g5)+(p7p6p5g4)
- G2=g11+(p11g10)+(p11p10g9)+(p11p10p9g8)
- G3=g15+(p15g14)+(p15p14g13)+(p15p14p13g12)
- Carry (此處Ci為大寫)
- C1=G0+P0C0
- C2=G1+(P1G0)+(P1P0C0)
- C3=G2+(P2G1)+(P2P1G0)+(P2P1P0C0)
- C3=G3+(P3G2)+(P3P2G1)+(P3P2P1G0)+(P3P2P1P0C0)
- p150-159的圖,看熟,CLA甚至要背!!!
- gate delay的計算看下表 (ex: 2-level的16 bits adder,carry delay=5, sum delay=10)
- 
- 所有筆記的東西我現在理解了才背,希望之後回來還記得...
### gate delay總結
- [幫助理解](https://www.ece.lsu.edu/ee3755/2013f/cla.pdf)
- 小背一下 (參考清大105計系)
- 16 bits ripple carry adder c delay: 2*16=32
- 16 bits 1 level cascade CLA c delay: 3+2+2+2=9
- 16 bits 2 level CLA c delay: 2+3=5
## Multiplication && Division (共5個,超重要)
- 乘法器除法器(不管有號無號)都是sequential circuit
### traditional multiplication (Unsigned)
- 課本p165,務必練題目(畫運算表格)
- 
- 
- 
- 在32*32的乘法運算中
- multiplier: 右移,32bits
- multiplicand: 左移,64bits
- product: 固定,64bits
### hardware friendly (Robertson) multiplication (Unsigned)
- 課本p167,務必練題目(畫運算表格)
- 
- 
- 在32*32的乘法運算中
- multiplier: 放入product右半邊,32bits
- multiplicand: 固定,32bits
- product: 右移,64bits
### Booth's algorithm (signed)
- 直接從表格理解,說明就免看了
- 計算推導過程可以背下(p.172, 173)
- 要注意product右移時要signed extend !!!
### Traditional Division (unsigned)
- 
- 
- 稍微複雜點,要作過才會記得
- 餘數要sign extension
### Hardware-friendly division (unsigned)
- 
- 
- 餘數要sign extension
- remainder只有左半邊和divisor相減 !!!(p.182)
- 乘法器與除法器共用硬體

## IEEE 754浮點數
### single precision
- range: $1.1...1 \times 2^{127}\ to\ 1.0 \times 2^{-126}$和負數同範圍
- 非正規化 單精度可達1.0*2^-149^
| sign | exponent | fraction(mantissa) |
| ---- | -------- | ------------------ |
| 1 | 8 | 23 |
### double precision
| sign | exponent | fraction(mantissa) |
| ---- | -------- | ------------------ |
| 1 | 11 | 52 |
### 編碼規則
| exp | frac | representation |
| ----- | -------- | --------------------- |
| 1-154 | anything | +-floating number |
| 0 | 0 | +-0 |
| 0 | nonzero | +-Denormalized number |
| 255 | 0 | +-inf |
| 255 | nonzero | NaN |
# CH3 Computer performance
### Exe Time = IC $\times$ CPI $\times$ CCT
- IC=instruction count
- CPI=cycle per instruction
- CCT=clock cycle time
### MIPS=$\frac{IC}{Exe Time \times 10^6}=\frac{rate}{CPI \times 10^6}$
### Amdahl's law
- speedup=$\displaystyle \frac{1}{\frac{F}{S} + (1-F)}$
- F: 被改善的比例,S: 改善倍率
# CH4 CPU datapath & control

- 控制單元truth table中
- 只有三個信號可能出現don't care (上圖有接control訊號的Mux)
- RegDst
- ALUSrc
- MemtoReg
- j type會用到三個cycle,雖然target address讀到指令就知道了 (考古1)
- IF
- ID
- compose target addr
- 對應其他指令cycle,這邊是EXE
- 把26bits的target addr接上左4bits、右2bits才是完整的32bits addr
- longest path計算ALU前面的Mux(ALUSrc)的時機(p388練習)
- Rtype: 需要算,因為要等reg file跑完才能過
- $I$type: 不用算,因為sign ext後可以先進mux,此時reg file還在跑
# CH5 Pipeline
- ### Hazard
- Structural Hazard
- 硬體資源不足,以致於無法pipeline (ex. IF和ME都用同個memory)
- Data Hazard
- Data dependency (最常見的那種,ex. load use)
- Control Hazard
- MEM階段才得到branch結果的話,可能已經多做三個指令了
- ### Delay branch
- From before: delay slot中插入不管有沒有branch都會執行的指令
- From target: 類似predict taken,插入branch後的指令
- From fall through: 類似predict not taken,插入不成立會用的指令
- 以上都需用A==B的條件說明
- 此技術已經很少被使用了Dynamic prediction比較好
- ### NOPs vs stall (p448練習,成大考古)
- NOPs: 軟體hazard解法,由compiler插入,是assembly code的一部份
- stall: 硬體hazard解法,由硬體偵測並stall by processor,所有hazard都能用他解決
- ### Bubble(stall)的畫法
- 寫法應該像下面那樣,才能正確標示出bubble所在的cycle,要養成習慣
- 
- 上面那種畫法是NOPs,塞入編譯好的空指令才能這樣表示
- ### Dynamic branch prediction
- branch prediction buffer = branch history table (BHT)
- 就是那些預測用的state
- branch target buffer (BTB)
- 如果branch成立,去這邊(cache)撈地址,省一個計算address的cycle
# CH6 Memory
## 三種記憶體 (p.7)
- SRAM: cache,比較快,儲存在反向閘之間,體積較大
- DRAM: main memory,比較慢,以電荷存在電容中,體積較小
- Magnetic disk: 容量大,最便宜,最慢
## 快取寫入 (p.25)
- Inconsistent
- 資料在快取和主記憶體不一致
- ### write back
- hit
- 只寫到cache上,用dirty bit管理
- miss
- write allocate: 去cache上要空間,寫上去
- ### write through
- hit
- 寫到cache和memory上,很浪費時間
- solution: write buffer
- 處理器先寫到buffer,之後buffer再寫到memory
- miss
- Write around: 只更新資料在Memory中的值
- write allocate: 去cache上要空間,寫上去
- [ref](https://www.jyt0532.com/2018/09/23/cache-mechanism/)
## Cache
- byte offset
- 用來存取一個 word 裡面的某個 byte,所以所需的 bit 數會是可以代表一個 word 裡面可以存的 byte 數,通常 1 word = 4 byte 所以這個部分通常為 2 bit
- block offset
- 用來存取 cache line 中的某個 word,所以由 cache line size(cache block size) 和 byte per word 來決定,舉例 64 B 的 cache block size,4 byte per word,那 block offset 就是 4 bit (64/4 = 2^4^)
- [referencet](https://ithelp.ithome.com.tw/articles/10270524)
- ==offset = block offset + byte offset==
- 如果題目提供Word address,代表該數字已經扣除byte offset (p44練習, p127考古34)
- ex. address 134~10~=10000110~2~
- In 3-way set associative cache with 2-word blocks, total size=24 words
- 他的tag=10000, idx=11, block offset=0
## Virtual memory
- ### 把主記憶體當作磁碟的Cache
::: info
- ### 兩大目的
- 供多個程式安全的共享記憶體
- 消除主記憶體空間不足的限制
:::
- 是技術,不是memory
- 用virtual memory cache可降低hit time,因為不需要轉換至phy-addr的多餘過程
- 每個process都有自己的page table register
- 紀錄page table起點
- ### 如何限制過大的page table size (p67)
- Limit register
- page table大小可調整
- 兩張分頁表
- 分別位於stack(上往下)和heap(下往上)中,大小可變,MIPS採用之
- 多層page table
- OS提過,缺點是cache miss penalty變大
- page table也被分頁
- 可能會無限page fault
- 3C miss: compulsory miss, capacity miss, conflict miss
- capacity miss是原本在cache中的block被擠出去後,再被存取卻找不到人的miss。如果該組tag/idx原本就不再cache中,算是complusory miss (p155,74題)
- context switch需清除TLB,不然會抓到錯的page,因此context switch後page fault高
- 務必寫過: p 20, 21, 68, 69, 72, 73
- 有打星星的考古都很重要一定要寫到背起來
### Cache vs Virtual memory
| | 術語 | | mapping | write | replace |
| -------- | --------- | ---------- | ---------------- | ------------------ | ----------- |
| Cache | block/line | miss | fully,set,direct | write through/back | random/ LRU |
| virtual mem | page | page fault | fully | write through | LRU |
### virtual address cache(p78)
- 之前Cache都是用 phy idx / phy tag
- 需要先用TLB轉換至phy addr才能去cache找資料,效率差
- 如果是 vir idx / vir tag 呢
- 效率明顯變好,critical path上移除TLB
- 問題: 可能有aliasing(別名)問題,兩個vir addr可能對應到同個phy addr
- 造成data inconsistent (同步問題)
- 解法: vir idx / phy tag,平行處理TLB和Cache
- 拿TLB得到的phy tag進cache找東西
### TLB
- [TLB miss的處理](https://unix.stackexchange.com/questions/386591/in-linux-is-the-kernel-who-handle-all-the-tlb-miss)
- x86: Hardware only
- MIPS: OS kernel mode
# CH7 儲存裝置和其他周邊設備
- Disk Access Time
- Seek time + Control time + Rotation time + Wait time + Data Transfer time
- 口訣: SCRWT
- Rotation time = $\displaystyle \frac{0.5}{RPS}$
- RPS: rotation per second
- 小心單位轉換(s/ms, RPM/RPS)
- ### MTTF, MTTR
- 衡量availability的一種標準
- MTTR: Mean time to repair
- MTTF: Mean time to failure
- MTBF: Mean time between failure (MTTF + MTTR)
- availability = $\displaystyle \frac{MTTF}{MTTF+MTTR}$
- [Mega大: CA ch7 Disk筆記](https://drive.google.com/drive/folders/1bfU-MfVsvKu_1QVsWoR_-XxJ4nRL6X5J)
- RAID
- RAID 0+1的本質是RAID 0
- RAID 1+0的本質是RAID 1
- RAID 3 不利於small data write
- ### IO data transfer & processor
- Programmed IO = polling
- Interrupt
- DMA
- Direct memory access (DMA) is the mechanism that provides device controller with the ability to transfer data directly from or to memory without the involving the processor.
- 會作[cycle stealing ( Interleaving )](https://www.dcard.tw/f/softwareengineer/p/239481598)
- CPU utilization:
- DMA > Interrupt > polling
- Latency:
- polling < DMA < Interrupt
- Polling / Interrupt / DMA 比較
| | Polling | Interrupt | DMA |
| ---- | ----------------- | ---------------------------- | ------------------------- |
| 優點 | simple | processor不用一直看IO device | 用cycle stealing 更省時間 |
| 缺點 | 浪費processor時間 | 比polling複雜 | 需要硬體支援 |
- Blocking IO & non-blocking IO
- 參考: 2022/11/19 20:09 TG群
- 
- 
- 簡單這樣記
- Blocking IO = Synchronous IO 會進waiting state
- Non-blocking IO = polling 留在running state,process困在polling
- Asynchronous IO 留在running state,可做IO外的事,等inturrupt
- interrupt可能屬於blocking 或 non blocking,參考下文ref
- [ref](https://medium.com/@clu1022/%E6%B7%BA%E8%AB%87i-o-model-32da09c619e6)
# CH8 多重處理器
- 兩種多重處理器
- 記憶體共享(重點3): shared-memory multiprocessor (SMP)
- 傳遞訊息(重點4): message passing multiprocessor (MPP)
- 兩種多重處裡器,作加法時的演算法不同 (看課本p.311, 316)
- code不好背,看圖記規則
- ### 重點5: 多處裡器的快取一致性
- snooping (p318)
- cache coherence protocol,分以下兩種
- write-update
- 又稱write-broadcast
- 用匯流排把所有對應該data的cache都更新
- write-invalid
- 先把別人快取copy變invalid,在把自己變invalid
- ### 重點6: 單一晶片multithreading
- find-grained multithreading
- 隱藏short/long stalls
- coarse-grained multithreading
- 只能藏住long stalls
- simultaneous multithreading (SMT)
- with "Instruction level parallelism (ILP)"和"thread level parallelism (TLP)"
- 用danamically scheduled processor達成
- ### 重點7: 平行電腦分類
- Flynn's taxonomy
- SISD: single instruction single data (uniprocessor, scalar processor)
- SIMD: single instruction multiple data
- MISD: multiple instruction single data (not exist)
- MIMD: multiple instruction multiple data
- Processors
- [vector processor](https://zh.wikipedia.org/wiki/%E5%90%91%E9%87%8F%E5%A4%84%E7%90%86%E5%99%A8) (array processor)
- 能高效且有效地處理稱為向量的大型1d array (ex. GPU)
- [scalar processor](https://en.wikipedia.org/wiki/Scalar_processor)
- 一次處裡一個data,和vector processor相反
- [superscaler processor](https://en.wikipedia.org/wiki/Superscalar_processor)
- 在一個processor中實現指令平行化
- 比較與差異
- ==SIMD== v.s. ==vector processor==
- SIMD處裡的vector長度不可變
- vector processor處裡的vector長度可變
- ==superscaler processor== v.s. ==pipeline==
- superscaler processor: 平行執行多個指令用多個執行unit
- pipeline: 平行執行多個指令用一個執行unit
- superscaler + pipeline: 
- parallelism
- ILP
- Instruction-level parallelism is a measure of how many of the instructions in a computer program can be executed simultaneously.
- DLP
- Data-level parallelism is the parallelism achieved by operating on independent data.
- MLP
- Memory-level parallelism refers to the ability to have pending multiple memory operations, in particular cache misses or translation lookaside buffer (TLB) misses, at the same time.
- TLP
- Thread-level parallelism is the parallelism inherent in an application that runs multiple threads at once.
- 以上四種平行只是概念,實作可用硬體或軟體達成 (108台大計系)
- 
## 圖要背起來
### 上冊
- 16 bit 2 level carry lookahead adder (p.159)
- 
### 下冊
- MIPS datapath
- 
- Direct mapped with block offset (P.17)
- Utilize spatial locality
- 
- 4-way set associative (P.40)
- 
- Virtual address getting data (P.76)
- 先去TLB找phy addr, 假設找到,再去cache挖資料
- 