112考資工研究所時寫的筆記,有錯歡迎留言指正
內文有引用他人筆記/教材/文章的地方
若有侵權十分抱歉,告知後將立刻撤除

CH1 Instruction

縮寫區

  • MIPS: Microprocessor without Interlocked Pipe Stages
  • RISC: reduced instruction set compute
  • CISC: complex instruction set compute
  • ARM: Acorn RISC Machine
  • ISA: instruction set architecture
  • PC: program counter
  • GPR: general purpose register
  • 電腦五大單元
    1. 輸入單元(Input Unit)簡稱IU
    2. 輸出單元(Output Unit)簡稱OU
    3. 控制單元(Control Unit)簡稱CU
    4. 記憶單元(Memory Unit)簡稱MU
    5. 算術邏輯單元(Arithmetic Logic Unit)簡稱ALU
  • AM: arithmetic mean
  • GM: geometric mean
  • PLA: Programmable Logic Array
  • BHT: branch history table
  • BTB: branch target buffer
  • PTE: page table entry
  • TLB: translation lookaside buffer
  • MMU: Memory management unit
  • CLA: carry lookahead adder
  • MTTR: Mean time to repair
  • MTTF: Mean time to failure
  • MTBF: Mean time between failure (MTTF + MTTR)
  • SSD: Solid state drive
  • DRAM: Dynamic random access memory
  • SRAM: Static random access memory
  • DMA: Direct memory access
  • RAID: Redundant Array of Independent Disks
  • UMA: Uniform memory access multiprocesors
  • NUMA: nonuniform memory access multiprocesors
  • VLIW: very long instruction word
  • GPGPU: general purpose graphic process unit
  • SMP: Shared-memory Multiprocessor
  • MPP: Message passing Multiprocessor
  • SISD: single instruction single data
  • SIMD: single instruction multiple data
  • MISD: multiple instruction single data (not exist)
  • MIMD: multiple instruction multiple data
  • word字組: CPU一次可以處裡的資料量(可從reg, ALU的設計看出)
    • RISC是32bit word
  • Endian
    • big endian: (address遞增) 由most significand開始擺
      • MIPS
    • little endian: (address遞增) 由least significand開始擺
      • RISC-V
  • JVM (Java virtual machine)
    • 執行時才把method連(links)至Java library
  • JIT (Just in time compiler)
    • 執行時才把bytecode翻成native machine code
  • register in MIPs
    • $fp: frame pointer
    • $sp: stack pointer
    • $gp: global pointer
  • 很重要的兩張圖
    • Image Not Showing Possible Reasons
      • The image was uploaded to a note which you don't have access to
      • The note which the image was originally uploaded to has been deleted
      Learn More →
    • 上圖為process在virtual memory中的樣子
    • 最初sp和fp都重疊在stack底部(以圖來看為最上方)
    • 隨著process開始運作,sp會被往low address的方向(以圖來看是往下)推進
    • Image Not Showing Possible Reasons
      • The image was uploaded to a note which you don't have access to
      • The note which the image was originally uploaded to has been deleted
      Learn More →
    • 上圖勘誤: SP應為stack pointer,非step pointer!
    • 上圖為stack區段被放大來看的樣子,procedure call仍在同個process內
  • 高階語言轉譯步驟
    • compile 編譯器: 說明略
    • assemble 組譯器: 說明略
    • linker 連結器: 連結模組和library,解決address reference問題
    • loader 載入器: 把machine code載入適當記憶體位置

指令 instruction

  • 右移
    • shift right arithmetic: 左邊做signed-extension
    • shift right logical: 左邊補0
    • 左移都一樣,右邊補0
  • MIPS沒有not指令,與0 NOR即可達成 (或與自己NOR即可達成)
    • NOR $t0 $t0 0 (= NOR $t0 $t0 $t0 )
  • arithmetic
    • add r1 r2 r3
      • r1
        r2 + r3
    • sub r1 r2 r3
      • r1
        r2 - r3
  • data transfer:
    • lw reg offset(base reg)
      • reg
        Memory [ offset+base reg ]
    • sw reg offset(base reg)
      • reg
        Memory [ offset+base reg ]
    • 其他補充
      • signd會做sign extension, unsigned則不會做
      • lh (load half word signed)
      • lhu (load half word unsigned)
      • lb (load byte signed)
      • lbu (load byte unsigned)
  • flow control
    • beq r1, r2, Label
      • if(r1==r2) PC
        Label
    • bne r1 r2, Label
      • if(r1!=r2) PC
        Label
    • slt r1 r2 r3
      • r1
        1 if(r2<r3) else 0
    • jr r1
      • PC
        r1
    • MIPS沒有blt, bgt, ble, bge
    • 用beq, bne, slt即可做出虛擬指令(pseudoinstruction)
  • logical
    • sll (shift left logically), srl (shift right logically)
    • 只寫範例
    • sll r1 r2 5
    • srl r1 r2 5
    • and r1 r2 r3
    • or r1 r2 r3
    • nor r1 r2 r3
    • 沒有not !!! nor即可取代
      • A nor 0 = not(A or 0) = not(A)
  • 常數(應該歸類在arithmetic?)
    • lui (load upper immediate):
      • 載入16bit的常數載入32bit reg的左半部,右半部填0
    • 其他只寫範例
    • addi r1 r2 4
    • slti r1 r2 4
    • andi r1 r2 4
  • 補充: basic block (very important)
    • The code in a basic block has:
      • One entry point, meaning that no code within it is the destination of a jump instruction anywhere in the program.
      • One exit point, meaning that only the last instruction can cause the program to begin executing code in a different basic block.

指令格式 MIPS instruction format

以下可至MIPS Green Card對照

  • R type
    • add, sub, and, srl ,sll, jr, jalr (109交大計系)
    • 以add rd rs rt為例
      op rs rt rd shmat funct
      6 5 5 5 5 6
  • I type
    • lw, sw, beq, bne, addi

    • 以lw rt 32(rs)為例

      op rs rt addr/imm
      6 5 5 16
    • addr以branch下一個指令的位置當基準(0),往上為負,往下為正

  • J type
    • jal, j
    • op addr
      6 26
    • addr欄位26bits,以扣除最右邊2 bit(都是00),和最左邊4bits (同process一樣)

程序呼叫 Procedure call

  • 步驟
    1. 存參數
    2. 跳去
    3. 要空間,存前人data
    4. 執行
    5. 存回傳值
    6. 還前人data,還空間
    7. 跳回
    • 12在caller做
    • 34567在callee做
    • 3,6順序成鏡向關係
  • 這類題目爆炸難,自己去練課本p61, 62, 63

CH2 Arithmetic Logic unit (ALU)

  • 考古題一定要寫13, 18, 49, 67
  • 好難筆記,少寫很多東東,從課本p.150開始看就好
  • overflow
    • MIPS只有在add, addi, sub等指令overflow時產生exception
    • addu, addiu, subu等unsign運算溢位也不管他 (programmer要負責)
    • 檢查是否overflow: 看最後兩個carryout,不同就是overflow
      • xor起來是1就是overflow

1 bit full adder

  • 以下可從truth table推出,細節見課本p150
  • Cin: carry in
  • Cout: carry out
  • CarryOut =
    ab+aCin+bCin
    有2 gate delay
  • Sum =
    abCin+abCin+abCin+abCin
    有3 gate delay (多了not吧我猜)

64 bits ALU

  • 王廷基的影片講的很好
  • overflow detection: 把最後兩個1 bit ALU的carry拉出來xor
  • set less than 的地方比較複雜,其他都可以自己看
    • 1-62bit一樣,無腦輸出0就好(bit 63也output 0但有東西需計算)
    • 方法是讓A-B,此時sign bit (bit 63)經過些判斷就能當作slt餵回bit 0使用

Ripple carry header

  • N個1 bit full adder(p.150)串在一起
  • critical path delay想當然爾,超級長
  • gate delay
    • N bit的adder中
    • critical path delay (最後一個Cout算出來): 2N
    • sum delay (最後一個sum算出來): 2N+1

carry lookahead adder (CLA)

  • Cout = ab + (a+b)Cin
  • 上面改寫成Ci+1 = aibi+(ai+bi)Ci
  • 以上是推導,不用背
  • 定義generate, propagate
    • generate: gi = ai*bi
    • propagate: pi = ai+bi
  • Ci+1 = gi + pici -> 背!

4 bits CLA

  • 課本p156-158
  • 課本p158圖中的add block,細節如下
    • 但我猜真正的硬體是1 bit full adder(p151右下)跟這個的合體
  • carry
    • c1=g0+p0c0
    • c2=g1+(p1g0)+(p1p0c0)
    • c3=g2+(p2g1)+(p2p1g0)+(p2p1p0c0)
    • c3=g3+(p3g2)+(p3p2g1)+(p3p2p1g0)+(p3p2p1p0c0)
    • c2, c3, c4由Ci+1 = gi + piCi展開可得,不用背(但最好練習一下展開)

16 bits CLA

  • 課本p159
  • Multiple Level Carry Lookahead (上課講義說的)
  • Propagate(此處Pi為大寫)
    • P0=p3p2p1p0
    • P1=p7p6p5p4
    • P2=p11p10p9p8
    • P3=p15p14p13p12
  • Generate (此處Gi為大寫)
    • G0=g3+(p3g2)+(p3p2g1)+(p3p2p1g0)
    • G1=g7+(p7g6)+(p7p6g5)+(p7p6p5g4)
    • G2=g11+(p11g10)+(p11p10g9)+(p11p10p9g8)
    • G3=g15+(p15g14)+(p15p14g13)+(p15p14p13g12)
  • Carry (此處Ci為大寫)
    • C1=G0+P0C0
    • C2=G1+(P1G0)+(P1P0C0)
    • C3=G2+(P2G1)+(P2P1G0)+(P2P1P0C0)
    • C3=G3+(P3G2)+(P3P2G1)+(P3P2P1G0)+(P3P2P1P0C0)
  • p150-159的圖,看熟,CLA甚至要背!!!
  • gate delay的計算看下表 (ex: 2-level的16 bits adder,carry delay=5, sum delay=10)
    • 所有筆記的東西我現在理解了才背,希望之後回來還記得

gate delay總結

  • 幫助理解
  • 小背一下 (參考清大105計系)
    • 16 bits ripple carry adder c delay: 2*16=32
    • 16 bits 1 level cascade CLA c delay: 3+2+2+2=9
    • 16 bits 2 level CLA c delay: 2+3=5

Multiplication && Division (共5個,超重要)

  • 乘法器除法器(不管有號無號)都是sequential circuit

traditional multiplication (Unsigned)

  • 課本p165,務必練題目(畫運算表格)
  • 在32*32的乘法運算中
    • multiplier: 右移,32bits
    • multiplicand: 左移,64bits
    • product: 固定,64bits

hardware friendly (Robertson) multiplication (Unsigned)

  • 課本p167,務必練題目(畫運算表格)
  • 在32*32的乘法運算中
    • multiplier: 放入product右半邊,32bits
    • multiplicand: 固定,32bits
    • product: 右移,64bits

Booth's algorithm (signed)

  • 直接從表格理解,說明就免看了
  • 計算推導過程可以背下(p.172, 173)
  • 要注意product右移時要signed extend !!!

Traditional Division (unsigned)

  • 稍微複雜點,要作過才會記得
  • 餘數要sign extension

Hardware-friendly division (unsigned)

  • 餘數要sign extension

  • remainder只有左半邊和divisor相減 !!!(p.182)

  • 乘法器與除法器共用硬體

IEEE 754浮點數

single precision

  • range:
    1.1...1×2127 to 1.0×2126
    和負數同範圍
    • 非正規化 單精度可達1.0*2-149
sign exponent fraction(mantissa)
1 8 23

double precision

sign exponent fraction(mantissa)
1 11 52

編碼規則

exp frac representation
1-154 anything ±floating number
0 0 ±0
0 nonzero ±Denormalized number
255 0 ±inf
255 nonzero NaN

CH3 Computer performance

Exe Time = IC
×
CPI
×
CCT

  • IC=instruction count
  • CPI=cycle per instruction
  • CCT=clock cycle time

MIPS=
ICExeTime×106=rateCPI×106

Amdahl's law

  • speedup=
    1FS+(1F)
  • F: 被改善的比例,S: 改善倍率

CH4 CPU datapath & control

  • 控制單元truth table中
    • 只有三個信號可能出現don't care (上圖有接control訊號的Mux)
      • RegDst
      • ALUSrc
      • MemtoReg
  • j type會用到三個cycle,雖然target address讀到指令就知道了 (考古1)
    • IF
    • ID
    • compose target addr
      • 對應其他指令cycle,這邊是EXE
      • 把26bits的target addr接上左4bits、右2bits才是完整的32bits addr
  • longest path計算ALU前面的Mux(ALUSrc)的時機(p388練習)
    • Rtype: 需要算,因為要等reg file跑完才能過
    • I
      type: 不用算,因為sign ext後可以先進mux,此時reg file還在跑

CH5 Pipeline

  • Hazard

    • Structural Hazard
      • 硬體資源不足,以致於無法pipeline (ex. IF和ME都用同個memory)
    • Data Hazard
      • Data dependency (最常見的那種,ex. load use)
    • Control Hazard
      • MEM階段才得到branch結果的話,可能已經多做三個指令了
  • Delay branch

    • From before: delay slot中插入不管有沒有branch都會執行的指令
    • From target: 類似predict taken,插入branch後的指令
    • From fall through: 類似predict not taken,插入不成立會用的指令
    • 以上都需用A==B的條件說明
    • 此技術已經很少被使用了Dynamic prediction比較好
  • NOPs vs stall (p448練習,成大考古)

    • NOPs: 軟體hazard解法,由compiler插入,是assembly code的一部份
    • stall: 硬體hazard解法,由硬體偵測並stall by processor,所有hazard都能用他解決
  • Bubble(stall)的畫法

    • 寫法應該像下面那樣,才能正確標示出bubble所在的cycle,要養成習慣
    • 上面那種畫法是NOPs,塞入編譯好的空指令才能這樣表示
  • Dynamic branch prediction

    • branch prediction buffer = branch history table (BHT)
      • 就是那些預測用的state
    • branch target buffer (BTB)
      • 如果branch成立,去這邊(cache)撈地址,省一個計算address的cycle

CH6 Memory

三種記憶體 (p.7)

  • SRAM: cache,比較快,儲存在反向閘之間,體積較大
  • DRAM: main memory,比較慢,以電荷存在電容中,體積較小
  • Magnetic disk: 容量大,最便宜,最慢

快取寫入 (p.25)

  • Inconsistent
    • 資料在快取和主記憶體不一致
  • write back

    • hit
      • 只寫到cache上,用dirty bit管理
    • miss
      • write allocate: 去cache上要空間,寫上去
  • write through

    • hit
      • 寫到cache和memory上,很浪費時間
      • solution: write buffer
        • 處理器先寫到buffer,之後buffer再寫到memory
    • miss
      • Write around: 只更新資料在Memory中的值
      • write allocate: 去cache上要空間,寫上去
  • ref

Cache

  • byte offset
    • 用來存取一個 word 裡面的某個 byte,所以所需的 bit 數會是可以代表一個 word 裡面可以存的 byte 數,通常 1 word = 4 byte 所以這個部分通常為 2 bit
  • block offset
    • 用來存取 cache line 中的某個 word,所以由 cache line size(cache block size) 和 byte per word 來決定,舉例 64 B 的 cache block size,4 byte per word,那 block offset 就是 4 bit (64/4 = 24)
    • referencet
  • offset = block offset + byte offset
  • 如果題目提供Word address,代表該數字已經扣除byte offset (p44練習, p127考古34)
    • ex. address 13410=100001102
    • In 3-way set associative cache with 2-word blocks, total size=24 words
    • 他的tag=10000, idx=11, block offset=0

Virtual memory

  • 把主記憶體當作磁碟的Cache

  • 兩大目的

    • 供多個程式安全的共享記憶體
    • 消除主記憶體空間不足的限制
  • 是技術,不是memory
  • 用virtual memory cache可降低hit time,因為不需要轉換至phy-addr的多餘過程
  • 每個process都有自己的page table register
    • 紀錄page table起點
  • 如何限制過大的page table size (p67)

    • Limit register
      • page table大小可調整
    • 兩張分頁表
      • 分別位於stack(上往下)和heap(下往上)中,大小可變,MIPS採用之
    • 多層page table
      • OS提過,缺點是cache miss penalty變大
    • page table也被分頁
      • 可能會無限page fault
  • 3C miss: compulsory miss, capacity miss, conflict miss
    • capacity miss是原本在cache中的block被擠出去後,再被存取卻找不到人的miss。如果該組tag/idx原本就不再cache中,算是complusory miss (p155,74題)
  • context switch需清除TLB,不然會抓到錯的page,因此context switch後page fault高
  • 務必寫過: p 20, 21, 68, 69, 72, 73
  • 有打星星的考古都很重要一定要寫到背起來

Cache vs Virtual memory

術語 mapping write replace
Cache block/line miss fully,set,direct write through/back random/ LRU
virtual mem page page fault fully write through LRU

virtual address cache(p78)

  • 之前Cache都是用 phy idx / phy tag
    • 需要先用TLB轉換至phy addr才能去cache找資料,效率差
  • 如果是 vir idx / vir tag 呢
    • 效率明顯變好,critical path上移除TLB
    • 問題: 可能有aliasing(別名)問題,兩個vir addr可能對應到同個phy addr
      • 造成data inconsistent (同步問題)
    • 解法: vir idx / phy tag,平行處理TLB和Cache
      • 拿TLB得到的phy tag進cache找東西

TLB

CH7 儲存裝置和其他周邊設備

  • Disk Access Time

    • Seek time + Control time + Rotation time + Wait time + Data Transfer time
    • 口訣: SCRWT
    • Rotation time =
      0.5RPS
      • RPS: rotation per second
      • 小心單位轉換(s/ms, RPM/RPS)
  • MTTF, MTTR

    • 衡量availability的一種標準
    • MTTR: Mean time to repair
    • MTTF: Mean time to failure
    • MTBF: Mean time between failure (MTTF + MTTR)
    • availability =
      MTTFMTTF+MTTR
  • Mega大: CA ch7 Disk筆記

  • RAID

    • RAID 0+1的本質是RAID 0
    • RAID 1+0的本質是RAID 1
    • RAID 3 不利於small data write
  • IO data transfer & processor

    • Programmed IO = polling
    • Interrupt
    • DMA
      • Direct memory access (DMA) is the mechanism that provides device controller with the ability to transfer data directly from or to memory without the involving the processor.
      • 會作cycle stealing ( Interleaving )
    • CPU utilization:
      • DMA > Interrupt > polling
    • Latency:
      • polling < DMA < Interrupt
  • Polling / Interrupt / DMA 比較

    Polling Interrupt DMA
    優點 simple processor不用一直看IO device 用cycle stealing 更省時間
    缺點 浪費processor時間 比polling複雜 需要硬體支援
  • Blocking IO & non-blocking IO

    • 參考: 2022/11/19 20:09 TG群
    • 簡單這樣記
      • Blocking IO = Synchronous IO 會進waiting state
      • Non-blocking IO = polling 留在running state,process困在polling
      • Asynchronous IO 留在running state,可做IO外的事,等inturrupt
      • interrupt可能屬於blocking 或 non blocking,參考下文ref
      • ref

CH8 多重處理器

  • 兩種多重處理器
    • 記憶體共享(重點3): shared-memory multiprocessor (SMP)
    • 傳遞訊息(重點4): message passing multiprocessor (MPP)
    • 兩種多重處裡器,作加法時的演算法不同 (看課本p.311, 316)
    • code不好背,看圖記規則
  • 重點5: 多處裡器的快取一致性

    • snooping (p318)
    • cache coherence protocol,分以下兩種
    • write-update
      • 又稱write-broadcast
      • 用匯流排把所有對應該data的cache都更新
    • write-invalid
      • 先把別人快取copy變invalid,在把自己變invalid
  • 重點6: 單一晶片multithreading

    • find-grained multithreading
      • 隱藏short/long stalls
    • coarse-grained multithreading
      • 只能藏住long stalls
    • simultaneous multithreading (SMT)
      • with "Instruction level parallelism (ILP)"和"thread level parallelism (TLP)"
      • 用danamically scheduled processor達成
  • 重點7: 平行電腦分類

  • Flynn's taxonomy
    • SISD: single instruction single data (uniprocessor, scalar processor)
    • SIMD: single instruction multiple data
    • MISD: multiple instruction single data (not exist)
    • MIMD: multiple instruction multiple data
  • Processors
  • 比較與差異
    • SIMD v.s. vector processor
      • SIMD處裡的vector長度不可變
      • vector processor處裡的vector長度可變
    • superscaler processor v.s. pipeline
      • superscaler processor: 平行執行多個指令用多個執行unit
      • pipeline: 平行執行多個指令用一個執行unit
      • superscaler + pipeline:
  • parallelism
    • ILP
      • Instruction-level parallelism is a measure of how many of the instructions in a computer program can be executed simultaneously.
    • DLP
      • Data-level parallelism is the parallelism achieved by operating on independent data.
    • MLP
      • Memory-level parallelism refers to the ability to have pending multiple memory operations, in particular cache misses or translation lookaside buffer (TLB) misses, at the same time.
    • TLP
      • Thread-level parallelism is the parallelism inherent in an application that runs multiple threads at once.
    • 以上四種平行只是概念,實作可用硬體或軟體達成 (108台大計系)

圖要背起來

上冊

  • 16 bit 2 level carry lookahead adder (p.159)

下冊

  • MIPS datapath
  • Direct mapped with block offset (P.17)
    • Utilize spatial locality
  • 4-way set associative (P.40)
  • Virtual address getting data (P.76)
    • 先去TLB找phy addr, 假設找到,再去cache挖資料