計算機結構筆記

--- tags: [2025_Fall] --- # 計算機結構筆記  ## 電腦種類 * Desktop Computer: General Purpose，需要 Cost-effective solution * Server: **Dependability** & **Scalibility**，需要高 throughput。 * Embedded Computer: 在其他裝置(車子、手機、冰箱)內的電腦，需要 Minimize **power** & cost，**Dependability** 也很重要。用法較侷限，比如手機不會跑科學運算。 ## PostPC Era ### Personal Mobile Device 用電池、連網、便宜。 ARM 從新創成為 PMD 的王者。因為是第一個 low power, 32-bit RISC 架構。而且不像 Intel 全部自己來，是授權 IP。 ### Cloud Computing Software as a Service (SaaS)。 * 40s~60s: 第一台 supercomputer, 3 megaflops。 * 75~90s: Cray era * Early 90s: Fujitsu，日本加入 * Mid 90s: Cluster era，把一堆電腦連起來 * 00s: Petaflops($10^{15}$)。開始注重 power。 * 10s~: Exaflops($10^{18}$)。中國天河-1、Fujitsu、Cray。採用 CPU + GPU 的異質(Heterogeneous)架構。 AMD 大膽採用新技術，發展很好。 ## Intel Processor 歷史 * 4004(1971): 第一個微處理器，主要是商用。4-bit, 10 $\mu$m, 2,300 transistors * 8008(1972): Dumb Terminal。8-bit, 10 $\mu$m, 3,500 transistors * 8080(1974): 第一個 Personal Computer。8-bit, 6 $\mu$m, 6,000 transistors * 8086/8088(1978~9): IBM PC，第一個使用 **ISA**(x86)，不用打卡。16-bit, 3 $\mu$m, 29k transistors * 80286(1982): 引入 **Virtual Memory**。16-bit, 1.5 $\mu$m, 134k transistors * 80386(1985): 現代的 x86 ISA(逐漸擴增)。32-bit, 1.5 ~1$\mu$m, 275k transistors, 16~33 MHz * 80486(1989): 第一次有 **Pipeline**, **8KB cache**, FPU(Floating point unit)。32-bit, 1 $\mu$m, 1.2M transistors, 25~100 MHz * Pentium(1993): **Superscalar** (2 op/cycle), Instruction, Data Cache 分離，各 8KB。32-bit, 0.8~0.35 $\mu$m, 3.2M transistors, 60~300 MHz * Pentium Pro/II/III(1995~9): **Out-of-order execution**, 3 op/cycle, Multimedia instructions, 16~32KB cache。32-bit * Pentium 4(2001): **Deep Pipeline**, 頻率非常高。1.4~3.4 GHz(大概到頂，很耗電), 180~90 nm(奈米年代) * Pentium D(2005): **Dual-core**，2.8~3.6 GHz, 90 nm ### Multicore Era * Core 2 Duo(2006): 2 cores, 65 nm * Core 2 Quad(2007): 4 cores, 45 nm * Core i-Series(2008~): i3/5/7/9。i9 可以到 18 cores, 14 nm ## GPU/TPU 很多核心，Tensor Core 作矩陣運算。 TPU: Google, 專門做 AI 運算。  ## Computer Architecture ### Application * 90s 開始有影像、3D 需求，不只是簡單的數值運算、Data 操作。 * 00s RMS(Recognition, Mining(資料), Synthesis(Ray Tracing)): 運算需求進一步增加。 * 10s ML: 需要 Domain-specific architecture(e.g. for Deep Neural Network(DNN))。 ### Technologies 年份, 科技, 相對 Performance/cost * 1951 真空管(Vacuum Tube), 1 * 1965 電晶體(Transistor), 35 * 1975 積體電路(IC), 900。變成矽 * 1995 Very large scale IC(VLSI), 2.4M * 2013 Ultralarge scale IC, 250B ### Engineering Methodology 需要先定義能量化、一致標準的 factor/metric，比如 Cost, Performance, Power。Design 與 Analysis 大約各佔一半。 ## Benchmark Performance 的 metrics 主要是 Response time(Latency), Throughput。定義 Performance = 1/Execution Time。 Execution Time 分成：Elapsed time(整個 response time, 包含 I/O), CPU time(純 CPU 運算時間)。 CPU time = Cycles × Cycle time = Cycles / Clock rate Cycles 又能拆成 Instruction count × Cycles per instruction (CPI)。不同 Instruction 的 CPI 可能不同。因此比較時不能只看 Instuction count 或 CPI，要兩個一起看。使用多核的原因：Power 大約與 Capacitive load $\times$ Voltage$^2\times$ Frequency，而 Voltage 驅動 Frequency，以非常粗略估計大致呈正比。所以 Energy 大約 $t$(時間) $\times f^3$ 呈正比，因此同樣工作分給兩個，則 $t\times f^3$ 變成 $2\times t\times (0.5f)^3 = 0.25 t\times f^3$，省很多。不過平行程式比較難寫，所以早期不重視。 ### SPEC Benchmark 衡量 **Performance** 的標準，首先選了一些代表性的運算任務(GP 則要各種任務，Embedded 則針對特定應用)。對每個分別計算 SPECRatio = Execution time on reference computer / Execution time on evaluated computer。再使用幾何平均得到整體的數值。衡量 **Power** 的標準，Performance (ssj_ops/sec) per Watt。 ### Pitfalls and Fallacies * Amdahl's Law: 如果改進的部分只佔一點點、不太重要，那就沒什麼用。 * Low power at idle: 低 load 時有些機器會很耗電，但通常很多時間都是低 load，比如 data center。所以要注重 idle 時的 power。 * MIPS: Million Instructions Per Second。 $$ \begin{aligned} &\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6}\\ &= \frac{\text{Instruction count}}{\text{Instruction count}\times \text{CPI}/\text{Clock rate}\times 10^6}\\ &=\frac{\text{Clock rate}}{\text{CPI} \times 10^6} \end{aligned} $$ 不能只看 MIPS，因為不同機器的 Instruction set 不同，CPI 也不同。 ## Design Principles * Design for **Moore's Law**: 盡量跟上速度 * Use **abstraction** to simplify design: 利用分層架構更了解系統 * Make the **common case fast**: 讓程式更好平行化 * Performance via **parallelism**: Instruction, Thread, Processor level... * Performance via **pipelining** * Performance via **prediction**: 預測未來要做什麼 * **Hierarchy** of memories: L1, L2, L3 Cache * **Dependability** via redundancy: 備用。  ## Cost of Manufacturing IC $\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Die per wafer} \times \text{Yield}}$ Yield: wafer 上能運作的 die 比例。 $\text{Die per wafer} \approx \frac{\text{wafer area}=\pi r^2}{\text{Die area}}$ $\text{Yield} = \frac{1}{(1+\text{Defects per area} \times \text{Die area}/2)^2}$ Cost per die 與 Area、Defect rate 呈非線性關係。 ## ISA Design Principles 讓 programmer 能用到 hardware 的所有功能，且要是簡單的。 ### Interface Design Principles * 能支援底層的不同實作 * 能夠有多元的用法 * 提供方便的高階功能 * 同時提供高性能的低階功能 ### Instruction 運作方式 1. Instruction Fetch(要在 Memory) 2. Instruction Decode 3. Operand Fetch 4. Execute 5. Result Store 6. Next Instruction 比較 RISC(Reduced Instruction Set Computer) 與 CISC(Complex...) 如何做乘法。 * RISC: * Load A, 2:3 * Load B, 4:5 * Multi A, B * Store 2:3, A * CISC: * Multi 2:3, 4:5 這門課主要講 RISC-V，開源的 RISC 架構(ARM 等架構閉源、要錢)。 ## RISC-V Instructions Principles: Simplicity favors regularity，因此 CPI 大約只要 1。 * Operation 盡量恰好有三個 operands。 * Instruction format 盡量少 * 預設 operands 都在 registers 裡，用到 memory 的話要先 load 出來 (例外：immediate, memory operations) ### Arithmetic Operations 比如 `add a, b, c`: a=b+c。 ### Registers RISC-V 有 32 個 64-bit(double word) Registers x0~x31。 | Register | ABI Name | 敘述 | Saver | |----------|----------|------|-------| | x0 | zero | hard-wired 0 | - | | x1 | ra | return address | caller | | x2 | sp | stack pointer | callee | | x3 | gp | global pointer | - | | x4 | tp | thread pointer | - | | x5~x7 | t0~t2 | temporaries | caller| | x8 | s0/fp | saved register/frame pointer | callee | | x9 | s1 | saved register | callee | | x10~x11 | a0~a1 | function arguments/return values | caller | | x12~x17 | a2~a7 | function arguments | caller | | x18~x27 | s2~s11 | saved registers | callee | | x28~x31 | t3~t6 | temporaries | caller | ### Memory Operations load from, store to memory，存在 imm 的 offset 與存在 register 的 base 都以 bytes 為單位。 * `lw x9, 8(x22)`: x9=mem[8+reg[x22]] (Load Word) * `sw x9, 8(x22)`: mem[8+reg[x22]]=x9 (Store Word) 所以 A[12] = h + A[8] (base address of A in x22) 要先 `lw x9, 64(x22)` (8*8=64)。其中 offset 是 constant(immediate)。若要搞清楚 register, memory，記得 register 是用名字或編號存取應該就行了。 * Byte order: RISC-V 是 Little Endian，把一個整數拆成多個 bytes 後，低位元組存在低位址。 * Alignment: 必須放在 word size 的整數倍位址上，也就是四的倍數。而 RISC-V 不要求。 ### Constant or Immediate Operations 常常出現 A=A+5 之類的常數，如果用一般操作，得要把常數先 load 到 register 再計算，但因為太常見了還是簡化比較好。 `addi x22, x22, 5`，允許 Operand 是常數(Immediate)。 ### Logical Operations * `<<, slli`: Shift Left Logical Immediate * `>>, srli`: Shift Right Logical Immediate，但其實為了考慮 signed，比較常用 Arithmetic 的 `sra`, `srai` * `&, and, andi`: AND, AND Immediate * `|, or, ori`: OR, OR Immediate * `^, xor, xori`: XOR, XOR Immediate * `~`(用 `xori x9, x9, -1` 實現 not，-1 2-s complement 全 1) ### R-format Instructions 一個 instruction 以一個 32-bit word 表示，分成多段稱為 field 的欄位。 `funct7(7) | rs2(5) | rs1(5) | funct3(3) | rd(5) | opcode(7)` * opcode: operation 的大分類 * rs1, rs2: source registers * rd: destination register * funct3, funct7: operation 的小分類，以 bytes 數命名。 ### I-format Instructions 如果 load, store 的 offset 只能塞在 5-bit 裡，範圍太小不夠用。反正 offset 以外只有兩個 operands，可以刪掉一個 operand。 `immediate(12) | rs1(5) | funct3(3) | rd(5) | opcode(7)` 可以發現是把 R-format 的 rs2 拿掉，與 funct7 都被 immediate 替代。最 simple 的設計當然是只有一種 instruction，但不得已得要有多種，讓不同種長得盡量像會比較好做 pipeline。 ### S-format Instructions 另一種 Immediate format，能夠有三個 operands。 `imm[11:5](7) | rs2(5) | rs1(5) | funct3(3) | imm[4:0](5) | opcode(7)`  ## Branching instructions 改變 program 執行的 control flow，也就是不照順序 (sequentially) 執行，而是跳到其他位置。在 high-level language 如 C 中，if-then-else、while loop、function call 等都屬於這類。 > `jalr rd, imm (rs1)`: 跳到 reg[rs1]+imm 的位置，並把 PC+4 存在 rd 中。 > `jal rd, Label`: 跳到 Label，並把 PC+4 存在 rd 中。 ### Program Counter (PC) **重要的 register**：存放目前要 fetch 的 instruction 的 address。 * Fetch instruction: 從 memory[PC] 讀取 * 一般情況：PC = PC + 4 (下一個 instruction，因為每個 instruction 4 bytes) * Branch 情況：PC = target address (跳到指定位置) ### Conditional Branch Instructions 只有在設定的條件符合時才會 branch，否則繼續 sequentially 執行。 * `beq rs1, rs2, L1`: if (reg[rs1]==reg[rs2]) PC=L1 (Branch if EQual) * `bne rs1, rs2, L1`: if (reg[rs1]!=reg[rs2]) PC=L1 (Branch if Not Equal) ### basic block 一段連續的 instructions，滿足以下條件： * **只有最後一個 instruction 是 branch instruction** * **只有第一個 instruction 可能是其他 branch 的 target** * 中間沒有任何 branch target，也沒有 branch instruction **重要性**： * Basic block 內的 instructions **一定會一起執行** * Compiler 可以在 basic block 內做 **optimization**，例如： * Instruction scheduling (重新排序) * Register allocation * 無法跨越 basic block 做 optimization，因為 branch 的結果要執行時才知道 (taken 或 not taken) ### More branching instructions **Signed 比較**： * `blt rs1, rs2, L1`: if (reg[rs1]<reg[rs2]) PC=L1 (Branch if Less Than) * `bge rs1, rs2, L1`: if (reg[rs1]>=reg[rs2]) PC=L1 (Branch if Greater or Equal) **Unsigned 比較**： * `bltu rs1, rs2, L1`: unsigned 版本的 blt * `bgeu rs1, rs2, L1`: unsigned 版本的 bge **Set 指令**（用於比較後設定值）： * `slt rd, rs1, rs2`: if (reg[rs1]<reg[rs2]) rd=1 else rd=0 (Set if Less Than) * `slti rd, rs1, imm`: if (reg[rs1]<imm) rd=1 else rd=0 (Set if Less Than Immediate) * `sltu rd, rs1, rs2`: unsigned 版本的 slt ### Pseudo instruction Assembler 提供的簡化指令，會被展開成實際的 instructions： * `la x28, JumpTable`: load address of JumpTable into reg[x28] (Load Address) * 展開成 `auipc` + `addi` * `li rd, imm`: 把 rd 設為 imm * 展開成 `addi rd, x0, imm` * `mv x1, x2`: 把 x2 的值複製到 x1 * 展開成 `addi x2, x1, 0` * `jr x7`: PC=reg[x7] (Jump Register) * 展開成 `jalr x0, 0(x7)` * `j label`: 無條件跳到 label (Jump) * 展開成 `jal x0, label` * `ret`: 從 function 返回 (Return) * 展開成 `jr x1` 或 `jalr x0, 0(x1)` ### Implement if-else ```c if(i==j){ f=g+h; }else{ f=g-h; } ``` f,g,h in x19~x21 i,j in x22, x23 ```asm bne x22, x23, ELSE # if i != j, jump to ELSE add x19, x20, x21 # f = g + h (執行 then 部分) beq x0, x0, END # 無條件跳到 END (x0 always 0) ELSE: sub x19, x20, x21 # f = g - h (執行 else 部分) END: ``` ### Implement a while loop ```c while(save[i]==k){ i+=1; } ``` i in x22, k in x24, base address of save in x25 每個 element 是 8 bytes (double word) ```asm LOOP: slli x10, x22, 3 # x10 = i * 8 (計算 byte offset，shift left 3 = 乘以 8) add x10, x10, x25 # x10 = base + offset = &save[i] # 因為 ld 的 offset 必須是 immediate，所以要先算好完整位址 ld x9, 0(x10) # x9 = save[i] (load 該 element) bne x9, x24, END # if save[i] != k, 跳出迴圈 addi x22, x22, 1 # i += 1 beq x0, x0, LOOP # 無條件跳回 LOOP (x0 永遠是 0) END: ``` **重點**： * Array indexing: 位址 = base + (index × element_size) * Load instruction 的 offset 只能是 immediate (constant)，所以要先計算完整位址 ### Implement a switch ```c switch(i){ case 0: f=g+h; break; case 1: f=g-h; break; } ``` f,g,h in x19~x21 i in x22 x5~x7: temporaries ```asm .data JumpTable: .word CASE0, CASE1 # Jump table 存放各 case 的位址 .text # 檢查 i 是否在有效範圍 [0, 1] slt x5, x22, x0 # if (i < 0) x5 = 1 else x5 = 0 bne x5, x0, END # if (i < 0) 跳出 slti x5, x22, 2 # if (i < 2) x5 = 1 else x5 = 0 beq x5, x0, END # if (i >= 2) 跳出 # 使用 Jump Table 跳到對應的 case la x6, JumpTable # x6 = JumpTable 的位址 slli x7, x22, 2 # x7 = i * 4 (假設每個位址 4 bytes) add x5, x6, x7 # x5 = &JumpTable[i] ld x7, 0(x5) # x7 = JumpTable[i] (讀取目標位址) jr x7 # 跳到該位址 CASE0: add x19, x20, x21 # f = g + h beq x0, x0, END # break (跳到 END) CASE1: sub x19, x20, x21 # f = g - h END: ``` **Jump Table 技巧**： * 適用於 case 值連續且範圍小的情況 * 先檢查 index 是否在有效範圍 * 用 array 存放各 case 的 target address * 透過 index 直接跳到對應位置，比一連串的 if-else 快 ### Implement a function 呼叫的稱為 caller，被呼叫的稱為 callee。呼叫步驟: 1. Place parameters in registers x10~x17 2. Transfer control to the callee, placing the return address in x1 (`jal`) 3. Acquire stack space for local variables for the callee 4. Perform the callee function 5. Place the return value in register x10/x11 6. Return to the caller (in x1) (`jalr`) **第 2,6 步：** * `jal x1, function`: Save next instruction address in x1, jump to function (Jump And Link) * `jalr x0, 0(x1)`: Jump to address in x1, save next instruction address in x0 (x0 is never changed, so discarded) (Jump And Link Register) **第 3 步：** > 畫圖時高位址在上，低位址在下，而 stack 從上往下長，因此 stack pointer 往下移。把目前的 SP 減少需要的空間大小(以 bytes 為單位)。 **Save and Restore registers：** registers 分成兩類： * temporary x5~x7, x28~x31: callee 不用還原，換句話說 caller 要確保不變的話要自己存在記憶體 * saved x8~x9, x18~x27: callee 要在一開始存到記憶體，return 時還原 **leaf function(不用再 jump) 範例：** ```c define int long long int leaf(int g, int h, int i){ int f; f = (g + h) + i; return f; } ``` g,h,i in x10,x11,x12 f in x20 x8: saved temporaries ```asm addi sp, sp, -16 # allocate stack space for f, x8 sd x8, 8(sp) # save x8 sd x20, 0(sp) # save f add x8, x10, x11 # x8 = g + h add x20, x8, x12 # f = x8 + i addi x10, x20, 0 # save return value f in x10 ld x20, 0(sp) # restore f ld x8, 8(sp) # restore x8 addi sp, sp, 16 # deallocate stack space jalr x0, 0(x1) # return ``` **Non-leaf function 範例：** ```c define int long long int fact(int n){ if(n<1) return 1; else return n*fact(n-1); } ``` x in x10, return value in x10 ```asm fact: addi sp, sp, -16 # allocate stack space for ra(x1), n(x10) sd x1, 8(sp) # save ra sd x10, 0(sp) # save n addi x5, x10, -1 # x5 = n - 1 bge x5, x0, RECURSE # if (n-1 >= 0) jump to RECURSE addi x10, x0, 1 # set x10 = 1 addi sp, sp, 16 # deallocate stack space jalr x0, 0(x1) # return, no need to restore ra and n, since not changed RECURSE: addi x10, x10, -1 # x10 = n - 1 jal x1, fact # jump to fact, save return address in x1 addi x6, x10, 0 # save return value (fact(n-1)) in x6 ld x10, 0(sp) # restore n ld x1, 8(sp) # restore ra addi sp, sp, 16 # deallocate stack space mul x10, x10, x6 # x10 = n * fact(n-1) jalr x0, 0(x1) # return ``` ### Memory layout 一些相關的 register: * PC 指到 text segment 的下一個 instruction * SP 指到 stack segment 的頂端(最低位址) * FP 指到 stack segment 的底端(最高位址) * GP(global pointer, x3) access static data 用的 ### Character Data * Byte-encoded: Stored in a byte * ASCII: 7 bits, 128 characters * Latin-1: 8 bits, 256 characters, more graphic characters * Unicode: 32-bit * Used in Java, C++ wide characters * UTF-8, UTF-16: variable-length encoding of Unicode ### Load & Store instructions 在 l 或 s 後面會有 b,h,w，分別代表 byte(8-bit), halfword(16-bit), word(32-bit)，一般是 signed，會以讀取的值的 sign 填滿 rd 中的的高位元(到 64 bit)。在 l[bhw] 後面有 u 代表 unsigned，一律以 0 填滿 rd 中的高位元。 #### String copy example ```c void strcpy(char *x, char *y){ unsigned int i=0; while((x[i]=y[i])!='\0'){ i++; } } ``` s,t in x11,x10 i in x19 ```asm strcpy: addi sp, sp, -8 # allocate 1 byte of stack space for x19 sd x19, 0(sp) # save x19 addi x19, x0, 0 # i = 0 LOOP: add x5, x19, x10 # x5 = &y[i] lbu x6, 0(x5) # x6 = y[i] add x7, x19, x11 # x7 = &x[i] sb x6, 0(x7) # x[i] = y[i] beq x6, x0, END # if (y[i] == '\0') jump to END addi x19, x19, 1 # i++ jal x0, LOOP # jump to LOOP, next iteration END: ld x19, 0(sp) # restore x19 addi sp, sp, 8 # deallocate stack space jalr x0, 0(x1) # return ``` ### 32-bit Immediate 原本 instruction 中 Immediate 只有 12 bits，因此用 lui(Load Upper Immediate) 先設定 [31:12]。 `lui rd, const`: const 為 20 bits (Load Upper Immediate) * rd[63:32] = 以 const[19] 填滿 * rd[31:12] = const[19:0] * rd[11:0] = 以 0 填滿 ### U-format Instructions `lui` 與 `auipc` `imm[31:12](20) | rd(5) | opcode(7)` ### UJ-format Instructions Unconditional Jump: jal (Upper-Jump format) `imm[20](1) | imm[10:1](10) | imm[11](1) | imm[19:12](8) | rd(5) | opcode(7)` imm 是要跳過去的位址，也是 PC-relative address。 > `jalr` 是 I-format。 ### SB-format Instructions branches: bne, beq (Store-Branch format) `imm[12](1) | imm[10:5](6) | rs2(5) | rs1(5) | funct3(3) | imm[4:1](4) | imm[11](1) | opcode(7)` 要比較的 registers 編號放在 rs1, rs2，要跳過去的位置放在 immediate(切成很多段)。實際上跳過去的是 PC-relative address，也就是 PC + immediate × 2 byte。事實上可以乘以 4 來節省空間，因為 instruction 本來就對齊在 4 的倍數位址，但 RISC-V 保留 2-byte instruction 的可能性。 ### How to jump far away SB-format 的 offset 只能 12 bits × 2(共 $2\times2^{12}$ bytes 範圍，具體範圍要看 2 補數的範圍)，雖然通常都夠用，但要跳更遠可以利用 UJ-format 的 jal 指令。 ```asm beq x10, x0, L1 # L2 code L1: ``` 改成 ```asm bne x10, x0, L2 jal x0, L1 # = j L1 L2: # L2 code L1: ``` #### long jump 如果 20 bits × 2 還不夠，可以用 lui + jalr，這樣是 32 bits × 2。 ```asm lui x8, address[31:12] jalr x0, address[11:0](x8) ``` ### Addressing Summary * Immediate: 只用 immediate，12 bits 或 32 bits * Register: 把位址存在 register 裡，64 bits * Base: register + 存在 immediate 裡的 offset * PC-relative: PC + immediate × 2 ### Synchronization 最基本的就是用 Lock，而 Lock 可以用 atomic swap 實作。如果簡單用 lw, sw 來實作 lock，會有 race condition。因此用： * `lr.w rd, (rs1)`: Load Reserved Word，讀取 memory[rs1] 到 reg[rd]，並且在 reg[rs1] 的 address 做一個 reservation（透過硬體實作） * `sc.w rd, rs2, (rs1)`: Store Conditional Word，把 reg[rs2] 寫到 memory[rs1]，如果 reg[rs1] 的 reservation 還在的話，就把 rd 設為 0，否則設為 non-zero(表示失敗) #### Implement atomic swap This will try to swap reg[x21] and memory[x20] ```asm again: lr.w x10,(x20) sc.w x11,x21,(x20) bne x11,x0,again addi x21,x10,0 ``` #### Implement lock/unlock Get the lock if it's initially 0 and successfully swap 1 into it. x12 is initially 1. ```asm lock again: lr.w x10,(x20) bne x10,x0,again sc.w x11,x12,(x20) bne x11,x0,again unlock sw x0,0(x20) ```  ## From C to Execution 程式從高階語言到執行的完整流程： 1. **C program**: 高階語言程式碼 2. **Compiler**: 將 C 編譯成 assembly program 3. **Assembler**: 將 assembly 轉成 object file (包含機器碼與 symbol table) 4. **Linker**: 將多個 object files 與 libraries 連結成 executable 5. **Loader**: 將 executable 載入 memory，產生 memory image 6. **CPU**: 執行程式每個階段都有其特定功能，逐步將人類可讀的程式碼轉換成機器可執行的形式。 ### Assembler **功能**：將 assembly code 轉換成 object file。 **主要任務**： 1. **建立 Symbol Table**: 記錄所有 labels 與變數的位址資訊 2. **產生機器碼**: 將 assembly instructions 轉成對應的二進位碼 3. **處理 Pseudo Instructions**: 展開成實際的 instructions 4. **產生 Relocation Information**: 記錄需要 linker 處理的 references **Assembly Code 範例**： ```asm ld x10, X # load variable X jal x1, B # call function B ``` **產生的 Object File 內容**： * **Header**: 檔案名稱、各 segment 大小 * **Text Segment**: 機器碼 * `ld x10, 0(x3)` (x3 是 global pointer，指向 data segment) * `jal x1, 0` (offset 暫時為 0，等 linker 填入) * **Data Segment**: 全域變數的初始值 * 變數 `X` 的實際資料 * **Relocation Information**: 供 linker 使用 * `ld` 指令依賴變數 `X` 的位址 * `jal` 指令依賴函數 `B` 的位址 * **Symbol Table**: 記錄 symbols 的位置 * `X` 在 data segment 的 offset * `B` 在 text segment 的 offset (若定義在此檔案) ### Linker 又稱 Link Editor，因為會把 object file 內的值設為實際的位址。步驟： 1. 把 code 與 data 放入 memory，決定各個 segment 的實際位址 2. 根據 relocation information 修改 code 與 data，把 symbol 換成實際的位址 3. 處理 external references，把多個 object file 連結起來，解決 symbol 的 reference 產生的 executable 包含： * text segment: 實際的機器碼 * data segment: 初始化的 global variables * symbol table: 剩下的 unresolved references ### Loader 載入 executable 到 memory 並執行。步驟： 1. 讀取 executable header，確定 text, data segment 大小 2. 建立 virtual address space 3. 把 text, data segment 複製到 memory 4. 把 main 的參數複製到 stack 5. 初始化 registers (包括 SP, FP) 6. 跳到 start routine (初始化用的 code)，會呼叫 main，main return 時會呼叫 exit system call ### Dynamic Linking 傳統的 Static Linking 會把所有用到的 library 都複製到 executable，造成檔案很大，而且如果 library 更新了，executable 也要重新 compile。 Dynamic Linking 只在 executable 內放 stub，實際執行時才把 library load 進來。會使用 **Lazy Linkage**，也就是第一次呼叫時才會載入。呼叫的地方會先跳到 stub，stub 中再跳到 library 真正位置。之後就會改呼叫地方的位址，改為直接跳到 library，因此第一次會比較慢。 ## Datapath CPU 內部執行 instruction 的硬體路徑。包含： * ALU: Arithmetic Logic Unit，做運算 * Register File: 存放 registers * Memory: 存放 data, instruction * PC: Program Counter * Control Unit: 控制其他元件的運作 ### Instruction Execution Steps 1. **Instruction Fetch (IF)**: 從 memory 讀取 instruction * PC → Instruction Memory → IR (Instruction Register) * PC += 4 2. **Instruction Decode (ID)**: 解讀 instruction，讀取 registers * 從 Register File 讀取 rs1, rs2 * 產生 Control signals 3. **Execute (EX)**: 執行運算 * ALU 計算結果 * 計算 branch target address 4. **Memory Access (MEM)**: 存取 memory (若需要) * Load: 從 memory 讀取 * Store: 寫入 memory 5. **Write Back (WB)**: 寫回 register * 把結果寫入 rd 以下以一個支援 add, sub, ori, load(register+imm offset), store, beq 的處理器為例。 ### R-type Instruction Datapath 以 `add x5, x6, x7` 為例（第一步都是 Instruction Fetch 所以就跳過）： 1. ID: 讀取 reg[x6], reg[x7] 2. EX: ALU 計算 reg[x6] + reg[x7] 3. WB: 把結果寫入 reg[x5] 需要有 ID 連到 ALU，再連到 WB(ID 的 busW)，並且有輸入：RegWr(是否要 WB)、ALUctr(ALU 執行的 Op) ### I-type (Load) Instruction Datapath 以 `ld x5, 8(x6)` 為例： 1. ID: 讀取 reg[x6]，取出 immediate(8) 2. EX: ALU 計算 reg[x6] + 8 3. MEM: 從 memory[reg[x6]+8] 讀取 4. WB: 把讀取的值寫入 reg[x5] 至此，WB 的來源有 ALU 與 Data Memory，所以需要一個 MUX 來控制 busWr 從哪個來源來，使用 MemtoReg 控制，若 1 就是從 Data Memory。另外，ALU 的第二個輸入可能來自 imm，因此也要一個 MUX 控制，使用 ALUSrc 控制，1 則從 imm 來。 ### S-type (Store) Instruction Datapath 以 `sd x5, 8(x6)` 為例： 1. ID: 讀取 reg[x5], reg[x6]，取出 immediate(8) 2. EX: ALU 計算 reg[x6] + 8 3. MEM: 把 reg[x5] 寫入 memory[reg[x6]+8] 因為要儲存，Data Memory 的輸入(WrEn) 需要從 ID 的輸出(busB)直接連過去，並且用 MemWr 控制是否要寫入 Memory。 ### SB-type (Branch) Instruction Datapath 以 `beq x5, x6, L1` 為例： 1. ID: 讀取 reg[x5], reg[x6]，取出 immediate 2. EX: * ALU 計算 reg[x5] - reg[x6] (稱為 Zero)，判斷是否為 0 * 計算 branch target: PC + immediate × 2 * 如果相等，PC = branch target，否則 PC += 4 因為 PC 來自 PC+4 或 ALU 算出的 Branch 位址，所以用 MUX PCSrc 控制，1 則從 ALU 來。PCSrc 可以設為 branch_inst & Zero。 ### Single-Cycle Implementation 所有 instruction 都在 1 個 cycle 完成，因此 cycle time 取決於最慢的 instruction、CPI=1，但會浪費時間。  ## Pipeline 比如 laundry，需要洗衣服、烘衣服、摺衣服、收衣服，如果每袋衣服都完全分開等上一袋完全做完才做下一袋，會很浪費時間。Pipeline 可以讓每個階段同時進行不同袋的衣服，這樣就能提高 throughput。 ## Pipeline in RISC-V * IF: Instruction Fetch * ID: Instruction Decode and Register Fetch(因為 instruction 固定長度，而且 source register 都在 instruction 編碼的同個位置，可以很輕鬆的 decode 也能同時 Register Fetch) * EX: Execution, or Branch Completion * MEM: Memory Access * WB: Write-back time to drain the pipeline: 把最後幾個工作最後幾個階段做完的時間。理想的狀況是每個 stage 都花一樣的時間，speedup 就是 stage 數量（time to drain 在 instruction 很多時可以忽略） ## Pipeline Hazards ### Structural hazards 同時用同個資源。解決方法：用多一點資源。比如前面有 load instruction，之後的 instruction 又要 IF。解法：可以暫停(stall)衝突的 instruction。把資料與指令存在不同的 memory。 ### Data hazards 有相依性且時間有衝突，上面修改後下面要讀取得要等到修改生效。解法：可以透過 Forwarding(Bypassing) 解決，因為其實 Stage 3 從 ALU 出來就可以用，不需要 Stage 5 Write Back 完再從 Register 讀。如果是 load-use hazard，load 完馬上使用，則 Stage 4 Memory Access 後才能用，所以 Forwarding 也沒辦法完全避免 stall。但在不改變結果的情況 compiler 可以調整 instruction 順序(instruction scheduling，也是要 basic block 的原因)，避免 load-use 等。 ### Control hazards 在相依的結果算出來前就要做決定。例如 branching 後一般只能等 ALU 算完才 IF。可以透過 Branch Prediction 解決，猜 branch 會不會 taken(跳走)。可以簡單假設全部都不 taken(static)，或是根據歷史動態預測等(dynamic)。 Predict 在錯誤時需要能還原，或是根本就不改變狀態。 ## Pipeline Datapath 每個 stage 需要有各自的 register，比如 control 很多 unit 會用但同時會執行多個 instruction。命名為前後兩個 stage 名稱，例如 IF/ID。後面接具體內容，例如 IF/ID.IR。以下以 load 舉例。 * IF: * IF/ID.IR=mem[PC] 是 Instruction Register，IF 讀出來後 ID 需要用到 * IF/ID.PC=PC 要存，因為如果要 branch 改變的話會用到，且下一個 instruction 也會需要增加 PC * PC=PC+4 * ID * ID/EX.A=Reg[IF/ID.IR[19-15]] * ID/EX.B=Reg[IF/ID.IR[11-7]] * ID/EX.Immediate=(sign-ext(IF/ID.IR[31-20])) * ID/EX.PC=IF/ID.PC * EX * EX/MEM.ALUout=ID/EX.A+ID/EX.immediate * MEM * MEM/WB.MDR=mem[EX/MEM.ALUout] * WB * Reg[Rd]=MEM/WB.MDR，但 Rd 會被洗掉，需要像 PC 一樣一路從 ID/EX EX/MEM 傳到 MEM/WB（實際上上面需要改）。 beq： * EX * EX/MEM.PC=ID/EX.PC+(ID/EX.immediate<<1): 先算好跳過去的位置 * EX/MEM.Zero=(ID/EX.A-ID/EX.B)?0:1 * MEM * If (EX/MEM.Zero) PC=EX/MEM.PC 雖然有些 instruction (R type, Arithmetic)不需要 MEM 等，但還是所有 instruction 都 5 個 stage 做完比較簡單，避免 structural hazards 也不用判斷類型。 ## Pipeline Control Control unit 出來的信號也需要在不同 stage 間的 register 儲存，每走一個 stage 消耗掉該 stage 需要的 control signals，其他的繼續傳下去。實際的 Pipeline 中，forwarding, branch prediction 更複雜：（以下是 AI 生成，還沒改） ### Data Hazard Detection & Forwarding 在 EX stage 檢查是否有 hazard： * EX Hazard: `EX/MEM.RegWrite and (EX/MEM.RegisterRd $\ne$ 0) and (EX/MEM.RegisterRd == ID/EX.RegisterRs1 or rs2)` * MEM Hazard: `MEM/WB.RegWrite and (MEM/WB.RegisterRd $\ne$ 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRs1 or rs2)` Forwarding Unit 產生 MUX control signals (ForwardA, ForwardB)： * 00: 從 Register File 讀 (ID/EX) * 10: Forward 前一個 instruction 的 ALU 結果 (EX/MEM.ALUOutput) * 01: Forward 前兩個 instruction 的結果 (MEM/WB.ALUOutput 或 LMD) 優先順序：同時有 EX 與 MEM Hazard 時，優先 Forward 最新的 (EX/MEM)。 ### Load-Use Data Hazard `ld` 後立刻使用結果，Forwarding 不夠快 (data 在 MEM stage 才 ready，但 EX stage 就需要)。偵測：在 ID stage，`ID/EX.MemRead and ((ID/EX.RegisterRd == IF/ID.RegisterRs1) or (ID/EX.RegisterRd == IF/ID.RegisterRs2))` 解決：Stall 1 cycle。Hazard Detection Unit 執行： 1. Freeze PC & IF/ID: `PCWrite=0`, `IF/IDWrite=0` 2. Insert Bubble: 將 ID/EX control signals 清零，EX stage 空轉 ### Control Hazard Branch 結果在 EX 或 MEM stage 才確定，但 IF stage 需要知道 PC。解法： 1. **Stall**: 等結果確定再 fetch，MEM stage 確定則 penalty 3 cycles 2. **Reduce Delay**: 把 branch 判斷移到 ID stage，需要額外 adder 與 comparator，penalty 降為 1 cycle。可能產生新的 Data Hazard 3. **Branch Prediction**: 預測 branch outcome * Static: 假設全部 not taken，只有 taken 時才有 penalty (1 cycle) * Dynamic: Branch History Table (BHT) 記錄過去行為 * 1-bit Predictor: 記錄上次 T/NT，loop 在 first iteration 與 exit 會 mispredict * 2-bit Predictor: 4 個 state (Strongly/Weakly Taken/Not Taken)，連錯兩次才改預測，loop 表現較好 * Branch Target Buffer (BTB): cache 預測的 target address Misprediction 處理：Flush IF/ID (或 ID/EX) 變成 NOP，restart fetch from correct address。 ### Exceptions & Interrupts * Exception: 內部 (overflow, illegal opcode) * Interrupt: 外部 (I/O) 處理流程： 1. 停止 offending instruction 與後續 instructions (Flush IF, ID, EX) 2. 讓之前的 instructions 完成 3. 把 PC 存到 SEPC (Supervisor Exception PC) 4. 把 cause 存到 SCAUSE 5. 跳到 handler address Precise Exceptions: 確保 offending instruction 之前的都完成，之後的都沒執行。 ### Instruction-Level Parallelism (ILP) 讓多個 instructions 平行執行，目標是 CPI < 1 或 IPC > 1。方法： * Deeper Pipeline: 更短的 stage，更高的 clock rate * Multiple Issue: 每個 cycle fetch, decode, issue 多個 instructions **Static Multiple Issue (VLIW)**: Compiler 把獨立的 instructions 組成固定大小的 issue packets，負責 hazard detection 與 scheduling，必要時插入 NOP。Hardware 較簡單。例如每 cycle issue 1 ALU/branch + 1 Load/Store。 **Dynamic Multiple Issue (Superscalar)**: Hardware 檢查 instruction stream，在資源與相依性允許時 issue 多個 instructions。 * Out-of-Order Execution: Instructions 在 operands ready 時執行，不照程式順序。需要 Reservation Stations、多個 functional units、result broadcast * Register Renaming: 把 architectural registers 對應到更多 physical registers，消除 false dependencies (WAR, WAW) * In-Order Commit: 用 Reorder Buffer (ROB) 照原順序寫回 registers/memory，維持 precise exceptions **Loop Unrolling**: Compiler 複製 loop body，增加可排程的 instructions，減少 loop overhead。需要 Register Renaming 來打破 iterations 間的相依性。限制：真相依性、branch prediction 準確度、cache miss、硬體複雜度、功耗。  ## Memory Hierarchy 主要在講 cache 的作用與設計。越 low level 的空間越大但越慢。 Coherency：保持不同層級記憶體的資料一致性。 ### Locality 來自一般程式的特性： * Temporal Locality: 最近用過的資料可能很快會再用到。因此把最近用過的資料存在較高層級的記憶體。 * Spatial Locality: 附近的資料可能很快會用到。因此一次把附近的資料一起 load 進較高層級的記憶體。 ### Terms * Upper/Lower Level: 較快/較慢的記憶體層級 * Hit: Read/Write 時在 cache 找到 * Miss: Read/Write 時沒在 cache 找到 ### Direct-mapped cache 最簡單的 cache 結構。 Cache 以 block 為單位，每個 block 稱為 set。假設 cache 有 8 sets、一個 block 為 4 words=4*4 bytes=16 bytes。把 memory 也分成 block，第 N 個 block 會存在 cache 的 (N mod 8) 位置。通常 set 數量是 2 的次方，這樣一個 byte addressing 的 memory address 可以切成三個部分： **Word-addressing Address 分解**： * Cache Tag: 剩餘高位元 * Cache Index(Set Index): $\log_2(\text{set 數量})$ bits，作為索引決定存在 cache 中的哪個 set * Block Offset: $\log_2(\text{block size})$ bits，決定在 block 中的哪個位址。前兩部分也稱為 Block Address。而 $2^{\text{後兩者的長度}}$ 是 Cache 的大小(bytes)。若是 byte-addressing，Block Offset 會多 2 bits(1 word=4 bytes=$2^2$ bytes)。每個 cache set 需要儲存： * Valid bit: 該位置是否有有效資料 * Tag: 這個 set 會儲存的多個 block 中的哪一個 * Data: block 內的資料在存取資料時：先用 Cache Index 找到對應的 set，再比對 Tag 是否相同且 Valid bit 為 1，若是則為 Hit，否則為 Miss。 ### Block Size Block size 大一點的話： * 優點： * 利用 Spatial Locality * 缺點： * Set 數量會減少，導致較多衝突 Miss * 每次 Miss 要從 Memory 搬更多資料。這可以透過 Early restart(搬完要的部分 CPU 就繼續執行) 或 Critical word first(先搬要的部分) 改善。 ### Handle Read Miss 去 Memory 找，複製到 Cache，設好 Tag 與 Valid bit，再回傳資料給 CPU。 ### Handle Write * Write through: 同時寫到 Cache 與 Memory，較簡單但 Memory 負擔較重。 * Write back: 只寫到 Cache，等同個 set 有其他 miss 時被 replace 時再寫回 Memory，減少 Memory 存取次數。但較複雜，需要 Dirty bit 記錄 block 是否被修改過。 #### Write Buffer 要寫回 Memory 時（write through 寫的時候、write back 被取代時），先寫到 write buffer(FIFO)，再由 write buffer 由硬體慢慢寫回 Memory，減少 write 或 cache miss 時 CPU 等待時間。 write back 通常也會有。但會產生 Coherency 問題，如果要讀的資料還在 write buffer 就要從 buffer 讀才對。另外 write frequency 高於 DRAM write rate 時，會導致 buffer saturation。 ### Handle Write Miss Write 的 block 不在 cache 中。 * write allocate (fetch on write): 會把 block 從 memory load 到 cache * no write allocate (write around): 直接寫到 memory，不 load 到 cache 與 write through/back 會產生四種組合。  ### Memory & Cache 假設 * 要從 DRAM 讀取 1 個 block * 1 memory bus cycle(longer than CPU cycle) to send the address * 15 memory bus cycles for DRAM access * 1 memory bus cycle to send each word back to cache/CPU * 4-word block cache miss panelty(現在的 DRAM 更複雜許多，這只是簡化版): * **One-word wide memory organization**: 傳一次只能讀一個 word 1 (address) + 4 × 15 (access) + 4 × 1 (data) = 65 memory bus cycles * **Wide memory organization**: 傳一次可以讀多個 words（假設 2） 1 (address) + 2 × 15 (access) + 2 × 1 (data) = 18 memory bus cycles * **Interleaved memory organization**: 傳一次可以讀多個 words，但是每次只回傳一個 word，會先存在 banks 裡面再回傳給 cache/CPU 1 (address) + 1 × 15 (access) + 4 × 1 (data) = 20 memory bus cycles 一次從 DRAM 讀取一個 block，因此 block size 與 miss penalty 有關。現在會使用 Interleaved memory organization。 ### Cache Performance **Hit Time**: 存取 cache 並判斷 hit/miss 的時間。 **Miss Penalty**: Miss 時從下一層記憶體取得資料的時間(以 cycle 數計算)。 **Memory-stall cycles per program**: $$ = \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss Rate} \times \text{Miss Penalty} $$ **Average Memory Access Time (AMAT)**: $$ \text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty} $$ ### Memory Wall **考慮 Memory Stall 的 CPU Time**: $I:\text{ \# of instructions}$. $$ \begin{aligned} \text{I-cache stall cycles} &= I \times \text{I-cache miss rate} \times \text{Miss penalty}\\ \text{D-cache stall cycles} &= I \times \text{Frequency of memory accesses per instruction} \times \text{D-cache miss rate} \times \text{Miss penalty}\\ \text{CPU Time} &= (I\times \text{CPI} + \text{I-cache stall cycles} + \text{D-cache stall cycles})\times \text{Cycle time}\\ \text{CPU Time with perfect caches} &= I\times \text{CPI} \times \text{Cycle time} \end{aligned} $$ 假設 clock rate 翻倍，但 Memory 不變，則 Miss Penalty 的 cycle 數也會翻倍(時間不變)，所以 CPU Time 進步不到兩倍。這就是 Memory Wall。 ## Set-Associative Cache ### Associativity **Direct-Mapped Cache** 的問題：如果程式常常存取會映射到同個 set 的不同 blocks，會一直發生 **conflict miss**。 **Fully Associative Cache**: 相當於只有一個 set，包含所有 block。任何 block 可以放在任何位置，沒有 conflict miss，但需要比較所有 entries 的 tag，tag 也較長，硬體成本高且速度慢。 > 比如 Virtual Memory 用的 Page Table 就是 fully associative，Page 放哪都可以。 **N-way Set-Associative Cache**: 折衷方案。 * Cache 分成多個 sets，每個 set 有 N 個 ways (blocks) * 一個 memory block 會映射到特定的 set，但可以放在該 set 的任何一個 way * 需要並行比較該 set 內所有 N 個 tags Address 分解與一般 direct-mapped 一樣。 #### Hardware Design of Associative Cache 在每個 set 中，以每個 tag 與輸入的 tag 比較，並以 MUX 選出正確的 block data。因此 N 增加會增加比較器與 MUX 的數量。 ### Replacement Policy 當 set 已滿需要替換時，選擇哪個 block 替換。 **LRU (Least Recently Used)**: * 替換最久沒用過的 block * 實作：需要記錄每個 block 的使用順序 * 2-way: 1 bit 即可（哪個是最近用的） * 4-way: 需要較複雜的硬體（可能用 counter 或 matrix） * N 很大時硬體成本太高 **Random**: * 隨機選一個替換 * 實作簡單，效能通常不會差太多 * 適合 associativity 高的情況 **FIFO (First In First Out)**: * 替換最早進來的 block * 實作簡單但效能通常不如 LRU **實際使用**： * 小的 N (2-4 way): 常用 LRU 或 pseudo-LRU * 大的 N: 常用 Random ### Multi-level Cache 現代處理器通常有多層 cache： **L1 Cache**: * 最靠近 CPU，速度最快 * 容量較小（幾十 KB） * 通常分成 I-cache (instruction) 與 D-cache (data) * 目標：因為每個 Memory access (包含 Instruction) 都會先去 Cache 看看，要追求 Hit time 短。因此大小不能太大，而且也不能有太高的 associativity。 **L2 Cache**: * 較大（幾百 KB 到幾 MB） * Unified cache（instruction 和 data 共用） * 目標：降低 miss rate **L3 Cache** (部分處理器)： * 更大（幾 MB 到幾十 MB） * 可能在多個 cores 間共享 * 目標：進一步降低 miss rate **Multi-level AMAT**: $$ \begin{aligned} \text{AMAT} &= \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}\\ \text{Miss Penalty}_{L1} &= \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2} \end{aligned} $$ * **Local Miss Rate**: 該層 cache 的 miss 佔該層存取的比例。 * **Global Miss Rate**: 該層 miss 佔總 memory 存取的比例 = $\text{本層 local miss rate}\times\prod_{\text{上層}} (1-\text{local miss rate})$。上方都是指 Local，但看起來很高的話會誤導，也可以看 Global。 ### Caches vs. Performance 比如 radix sort 理論上比 quicksort 在 item 多的時候每 item 時間較少，但因為 memory access pattern 不佳，cache miss rate 較高，實際上反而較慢。或是矩陣乘法，為了避免 capacity miss，可以每次只算一部分，compiler 通常也會幫忙優化。 ## Virtual Memory 利用 Memory 作為 Disk 的 Cache，主要講如何用硬體實作。 ### Motivation * 每個 Process 有獨立的 Address Space (Safe sharing) * Program 可以大於 Physical Memory * 簡化 Loading (Relocation) ### 系統架構分成三層： 1. **TLB** (CPU): 作為 Page Table 的 cache 2. **Page Table** (Memory): 儲存 VA → PA 的映射 3. **Cache** (CPU): 儲存常用的資料 **階層關係**： * TLB Hit → Page Table 也一定有資料 (TLB 是 PT 的 subset) * Cache Hit → Page Table 也一定有資料 (Data 在 Cache 代表一定在 Memory) ### Address Translation * **Mapping**: MMU，Virtual Address (VA) $\rightarrow$ Physical Address (PA) * **Page Offset**: 不變 (VA 與 PA 的最低位元相同，由 Page Size 決定，e.g., 4KB page $\rightarrow$ 12 bits offset) * **Translation**: VPN (Virtual Page Number) $\rightarrow$ PPN (Physical Page Number) ### Page Table 存在 Memory 中，由 Page Table Register 指向。 **結構設計**: * **Linear (線性)**: Array of PTEs * 問題：Size 太大 (e.g. 32-bit addr, 4KB page, needs $2^{20}$ entries) * **Inverted (反向)**: Size 取決於 Physical Memory 大小 * 問題：需用 Hash 搜尋，較慢 * **Multi-level (多層)**: 像 Tree 一樣 (e.g., RISC-V 用 4-level) * 優點：省空間 (Sparse address space 不用 allocate 全部) * RISC-V 是 48-bit virtual address, 12-bit page offset，剩下 36 bits 分成 4-level，每層 9 bits index (512 entries)。 **Page Fault 處理**: * **觸發條件**: Valid bit = 0 * **Miss Penalty**: 極大 (Disk access) * **設計選擇** (因為 miss penalty 太大)： * **Fully Associative**: 軟體處理 placement，任何 Page 可以放在任何 Physical Frame * **LRU** replacement: 值得花硬體/運算資源維護 Reference bit * **Write-back**: 減少寫入 Disk 次數，用 Dirty bit 記錄 * **處理方式**: 由 OS (Software) 處理 Fault * 因為硬碟很慢，沒必要用硬體處理 * 但若硬碟真的很快也未必要用軟體 ### TLB (Translation Lookaside Buffer) **Page Table 的 Cache** (存 VPN $\rightarrow$ PPN)，避免每次存取資料都要 access 兩次 memory。 **設計特性**: * **Associativity**: 通常是 **Fully Associative** 或高 associativity (Small size, high hit rate) * **Fields**: * PPN (Physical Page Number) * Protection bits (R/W/X) * Dirty bit * **ASN** (Address Space Number): 類似 PID，避免 Context Switch 時要 Flush TLB **事件關係**: * TLB Hit $\rightarrow$ Page Table Hit (TLB 是 PT 的 subset) * Cache Hit $\rightarrow$ Page Table Hit (Data 在 Cache 代表一定在 Memory) * TLB Miss $\rightarrow$ Hardware 或 OS 查 Page Table reload ### Integrating TLB and Cache 三種設計方案： #### 1. Physically Addressed Cache * 流程：CPU $\rightarrow$ TLB $\rightarrow$ Cache * 缺點：慢，TLB 在 Critical path #### 2. Virtually Addressed Cache * 流程：CPU $\rightarrow$ Cache (直接用 VA 當 Key) * 優點：快 (不用翻譯) * 問題： * **Aliasing** (不同 VA 對應同 PA) * Context Switch 需 Flush (或加 PID tag) * Shared memory 無法運作 * 要分辨不同 process 的資料要存 PID，overhead 過大 #### 3. Virtually Indexed, Physically Tagged (VIPT) 最佳解 * **策略**: Overlap TLB access 與 Cache access 的時間 * **做法**: * 同時送 VPN 去 TLB 查 PPN (Tag) * 用 VA 的 Index 去 Cache 讀 Tag/Data * Cache 會先用 Cache Index 去找，第二步才需要 Cache Tag * **關鍵條件**: **Cache Index bits 必須完全落在 Page Offset 內** * 因為 Page offset 不會因為轉換而改變 (VA = PA) * 如果能保證 PA VA 的 Cache Index 區段相同，就能直接用 VA 先去找 Cache Set * **限制**: $\text{Cache Size} \le \text{Page Size} \times \text{Associativity}$ * 例如：4KB Page, Direct-mapped cache 最大只能 4KB * 若要更大的 cache，需要增加 associativity ## Advanced cache techniques ### Non-blocking caches 或 lockup-free cache。允許在 cache miss 時，CPU 繼續執行不 stall，但還是有 miss 數量上限，因為需要存正在排隊的 cache miss。 ### Prefetch from memory to cache * static: compiler 塞一些 load * dynamic: 硬體先去找 memory access 但可能會造成 cache pollution (預取了不需要的資料) ### L3 cache 數 MB 的 cache。 ## Parallel Processor 平行程式設計的簡介 * Data Parallelism: processors 對同一塊資料的不同部分做同樣的操作 * Task Parallelism: processors 對同一塊資料做不同的操作 * Pipelining: 像 pipelined processor 一樣，但是在不同 processor 進行當 N 個 processor 產生 M 倍 speedup 時，會說用了 $M/N$ 的 potential。 * Strong scaling: 無論 problem size 多大，用更多 processor 總是能得到好的 speedup * Weak scaling: problem size 太小時增加 processor 數量 speedup 未必好。大多數問題都是這個分類 * Shared Memory Model: 用在 UMA，共用一塊 memory，主要是要處理不同 processor 各自的 cache 的 coherency 問題。 * Message Passing Model: 可以用在 NUMA，比如 MPI，不同 processor 間使用 send, recv 溝通。 ### Snooping solution(snoopy bus) Maintain Coherency in Shared Memory Model * Send all requests for data to all processors * Processors snoop to see if they have a copy and respond accordingly * Requires broadcast, since caching information is at processors * Works well with bus (natural broadcast medium) * Dominates for small scale machines (most of the market) #### write protocol 分類 * Write **Invalidate** Protocol * 只有一個 processor 能寫，此時其他 processor 的這個 block 會變為 invalid * 可以多個 processors 一起讀 shared data * Read Miss 又分成 * Write through: Memory 一直是最新的，直接讀 Memory * Write back: Memory 可能不是最新，這樣的話需要從別的 processor 的 cache 裡面找 * Write **Update**(boardcast) Protocol * 寫的時候 Update 所有 processor 的 cache 以及 Memory * Read Miss 時只要從 Memory 讀就好不同 processors 在同個 clock write 到同個 block 的話，bus 會自己決定哪個先。 #### 優劣勢比較 * Write Invalidate Protocol: 更好運用 Temporal Locality，只有一個 processor 一直寫的話就不用一直 boardcast。 * Write Update(boardcast) Protocol: read 與 write latency 比較差不多。在 small scale 時通常 bus 是 bottleneck，所以會偏好 Write Invalidate。 #### Snoopy example write invalidation protocol, write-back cache 因此每個 memory block 可能是以下狀態： * Clean in all caches, up-to-date in memory * Dirty in exactly one cache * Not in any cache 在某 processor 的每個 cache block 可能是以下狀態： * Shared: 是最新的，可以讀 * Exclusive: 這個 processor 可以寫，並且是 dirty (還沒 write back 到 memory) * Invalid: 要讀要寫都要從別的地方來 ##### State machine 某 processor 的 cache block 來自 CPU(processor 本身) 的 request 導致的變化： * Invalid * Read miss -> Shared: place read miss on bus * Write miss -> Exclusive: place write miss on bus * Shared * Read hit -> Shared: None * Read miss -> Shared: place read miss on bus * Write miss -> Exclusive: write 一律是 miss，因為不是 exclusive。place write miss on bus * Exclusive * Read hit -> Exclusive: None * Read miss -> Shared: Write back, place read miss on bus * Write hit -> Exclusive: None * Write miss -> Exclusive: Write back, place write miss on bus 某 processor 的 cache block 來自 Bus(別的 processor) 的 request 導致的變化： * Invalid * 不用管，與這個 block 無關 * Shared * Write miss -> Invalid: 別的 processor 變成 Exclusive * Read miss -> Shared: 與本身無關，只是多了一個 processor 一起 share * Exclusive * Write miss -> Invalid: write back，別的 processor 取代本身成為 Exclusive * Read miss -> Shared: write back，別的 processor 要讀，變成一起 share 以上只是範例，不同設定應該都可以簡單推理出來。 #### Coherency Cache Miss 除了 Compulsory(Cold miss), Capacity, Conflict(交替使用對應到同個 set 的 block)，還有 Coherency miss，因為不同 processor 交替寫到同個 block。其中又能分成 * True: 不同 processors 確實是用到 block 內的同一塊資料 * False: 不同 processors 用到同 block 的不同塊資料，如果 block 小一點可能會改善 ## Interconnection Network Parallel architecture 的 processors 之間怎麼連接。 ### Topology 種類 * **Bus**: 最簡單，大家掛在同一條 bus 上。問題是一次只能一個 transaction，bandwidth 低。 * **Ring**: 每個 processor 跟相鄰的連。允許多對 processor 同時 communication (用不同 link segment)。Total network bandwidth = per link bandwidth × link number。Bisection bandwidth (切一半後兩 part 之間的 bandwidth) 是 bus 的兩倍。 * **Fully Connected**: 每兩個 node 之間都有 dedicated link。Bandwidth 最高但 cost 也最高。 * **2D Torus**: 環狀網格結構。 * **Cube**: 立方體連接。 * **Multi-stage Networks**: Processor 透過多個 switch 連接，不是一個 node 配一個 switch。允許較少 node 搭配較多 switch，從一個 processor 到另一個可能經過多個 switch。 * Crossbar: 用 crossbar 結構的 switch * Omega Network: 用 2-in-2-out 的 switch box 組成 ### 評估 Interconnection Performance * **Latency**: 傳一個 message 從一個 node 到另一個 node 要花的時間 * **Bandwidth**: Link bandwidth, Bisection bandwidth, Total network bandwidth * **Congestion delay**: Switch 中的 queue 累積訊息等待送出的時間，取決於 traffic 重不重 * **Cost & Power** * **Routability in silicon**: IC 裡能 afford 多少 wire ## Multicore 以前是一個 chip 一個 core，透過 off-chip bus 連接。現在是同一個 chip 內多個 core，透過 on-chip bus communication。 ### Cache 結構 * **Private cache**: 每個 core 有自己專用的 cache，不會 interference，但空間利用率較差 * **Shared cache**: 最後一層 (Last Level Cache, LLC) 是 shared，空間利用率好 * 通常會是 L1/L2 private，L3 shared ### Scalable Architecture (Tile-based) Core 數多 (10~100 個) 時用 tile-based 結構，每個 tile 包含： * **Core**: 運算核心 * **Router**: 負責 packet switching，像 network on chip * **Cache**: L1 private，L2 是 physically distributed 但 logically shared ### Big-Little (大小核) 同一 ISA 但不同 power/performance 的 core 組合 (如 ARM Cortex-A15 大核 + A7 小核)。 **用途**: Energy efficiency。不需要強大運算的 workload 用小核，可以省一半 energy 達到同樣 performance。 ## Multi-Threading Machine **Hardware thread**: 每個 thread 有自己的 registers、PC 等 states physically replicated，不需 context save/restore，thread switch 很快。 ### 類型 * **Fine-grained**: 每個 cycle 換 thread，消除 vertical waste (整排空白的 stall) * **Coarse-grained**: Run 到 long stall (如 cache miss) 才換 thread * **Simultaneous Multi-Threading (SMT)**: 同一 cycle 可以從多個 threads schedule instructions。Intel 叫 Hyper-Threading (HT)。增加 horizontal 彈性，更能填滿 superscalar 的 issue slots。 ## SIMD (Single Instruction Multiple Data) 一個 instruction 同時處理多個 data elements。例如一個 register 放兩個 2-byte data，一個乘法 instruction 就同時做兩組相乘。 **優點**: * 比 MIMD 更 energy efficient (只 fetch 一次 instruction) * Programmer 還是 think sequentially，比寫 parallel program 簡單 * 適合 multimedia (RGB 等 narrow data type)、matrix operations **演進**: MMX → SSE → AVX，register 越來越長，一次處理更多 data。要求 data 是 consecutive and aligned。 ## Vector Architecture 更 general 的 SIMD。Vector register 很長 (如 RISC-V 有 64 個 64-bit elements = 512 bytes)，一個 instruction 操作整個 vector。 **操作流程**: 1. **Gather**: 把 data elements 讀進 vector register 2. **Pipeline operations**: 對 vector 做運算 3. **Scatter**: 結果寫回 memory **範例**: DAXPY (Y = a×X + Y) * 傳統 RISC-V: 要 loop 64 次，每次 load/multiply/add/store * Vector RISC-V: 幾個 vector instructions 就完成 **優點**: * 大幅減少 instruction fetch bandwidth * 避免 loop control overhead 和 control hazards * 利用 interleaved/burst memory * 使用 multiple-laned vector units (多個 functional units pipeline 處理不同 elements) ## GPU (Graphics Processing Unit) ### 歷史演進 * 早期: VGA display controller (fixed function) * 1999~2001: 開始 programmable (vertex, pixel shader) * 2005: Unified shader (vertex/pixel/compute 用同一種 processor) * 2017 Volta: 加入 Tensor Core (專門加速 machine learning) ### 3D Graphics Pipeline (基本概念) Vertex (座標) → Transform (轉成螢幕座標) → Rasterization (填 pixel) → Pixel Shader (顏色、lighting、texture) ### GPGPU (General Purpose Computing on GPU) 除了 graphics 也能跑 general applications (medical imaging, molecular dynamics 等)。程式語言: CUDA (NVIDIA), Stream SDK (AMD), OpenCL (標準)。 ### CPU vs GPU Architecture **CPU**: * Coarse-grained heavyweight thread * 大 cache + 複雜 control (branch prediction, OoO) * 少數大而強的 core **GPU**: * Fine-grained multi-threading machine * 小 cache，大部分 area 是 computing elements * 很多小 core * 用 latency hiding 解決 memory latency (thread 多到可以一直換) ### GPU 結構 * 很多 SM (streaming multiprocessor)，每個是 SIMD lane * On-chip local memory (64KB) + off-chip global memory (DRAM) * Shared L2 cache ### SIMT (Single Instruction Multiple Threads) 類似 SIMD + threading 的結合。**Warp** (32 個 threads) 同步執行同一 instruction 但處理不同 data。Fine-grained threading 在 warp 之間切換。 **Thread 組織**: Thread → Thread Block (share local memory) → Grid ### Latency Hiding GPU 不需大 cache，因為 thread 夠多。一個 warp stall 時馬上換另一個 warp 執行，stall 時間都被隱藏。 ### Tensor Core (Volta) 專門做 matrix multiply-add (machine learning 最常用的運算)。 ### 系統整合演進 1. GPU 掛 PCI bus，有自己 address space (driver 管理) 2. GPU 整進 CPU die，共用 DRAM 但切分 GPU/system memory (data copy via memory bus) 3. Unified memory: CPU/GPU 共用 address space，不需 explicit data transfer (如 CUDA Unified Memory)，但當然還是得傳輸。 ## Performance Evaluation ### Benchmark * **Linpack**: Matrix linear algebra (scientific computing) * **SPEC rate**: 多個獨立 programs 同時 run (job-level parallelism) * **SPLASH**: Stanford Parallel Applications for Shared Memory * **NAS**: NASA computational fluid dynamics kernels * **PARSEC**: Princeton multithreaded applications (Pthread, OpenMP) * **MLPerf**: Machine learning 標準 benchmark ### Roofline Model 快速分析系統是 compute bound 還是 memory bound。 **三個參數**: * Hardware peak performance (GFLOPS) * Hardware peak memory bandwidth (GB/s) * Application arithmetic intensity (operations per byte) **圖的畫法**: * X 軸: Arithmetic intensity * Y 軸: Achievable GFLOPS = min(Peak GFLOPS, Arithmetic intensity × Peak bandwidth) * 形成一條折線 (roof) **用途**: * Intensity 低 → memory bound，升級 CPU 沒用，要改善 memory 或增加 data reuse * Intensity 高 → compute bound，升級 CPU 有幫助 * 如果 working set 全在 cache 就不受 memory bandwidth 限制