# Computer Organization Notes
Spring semester, academic year 111 / Computer Organization / 涂嘉恆
課本: Computer Organization and Design RISC-V Edition: The Hardware Software Interface, 2/e
(X) marks sections that are less important
---
# chapter1: Computer Abstractions and Technology
# 1.1 Introduction:(X)
**Classes of computers:**
* Personal computers: general purpose, wide variety of software, good cost/performance
* Embedded computers: built into appliances and devices, strict power and performance requirements
* Servers: accessed over a network, high capacity and performance, reliable, come in many sizes
* Supercomputers: highest performance, used for research
* Personal Mobile Devices: battery powered, network connected, inexpensive
* Cloud computing
**What affects a computer's performance:**
* Algorithm
* Compiler system
* Processor and memory system
* I/O system
# 1.2 Seven great ideas:
* Use abstraction to simplify design
* Make the common case fast
* Performance via parallelism
* Performance via pipelining (like an assembly line in a factory)
* Performance via prediction (guess ahead of branches to avoid waiting on ifs; the prediction must not be too inaccurate)
* Hierarchy of memories (e.g. cache, RAM)
* Dependability via redundancy (keep extra copies/backups in case memory fails)
# 1.3 Below your program:
1. **Application software:** high-level language
2. **System software:** Compiler、machine code、operating System、IO、memory
3. **Hardware:** Processor、memory、I/O controllers
* Instruction set architecture (ISA): The hardware/software interface (the bridge between hardware and software)
* Application Binary Interface (ABI): The ISA plus system software interface
# 1.4 Level of code:
1. High-level language
2. Assembly language (textual representation)
3. Machine language (binary)
# 1.5 Components of computers:
1. Input
2. Output
3. Memory
4. Control
5. Datapath
6. Control + datapath (4 + 5) = processor

# 1.6 Place of data:
* Volatile main memory: data is lost when power is turned off
* Non-volatile secondary memory: data is retained without power
**Memory hierarchy:**
1. Cache memory (SRAM)
2. Primary memory (DRAM)
3. Secondary memory (Flash memory)
# 1.7 Semiconductor technology:(X)

# 1.8 Performance(time):
* Performance = 1/Execution Time
* “X is n times faster than Y” means Performance_X / Performance_Y = n
**Execution Time:**
* Elapsed time: total response time, including I/O, OS overhead, and time spent on other tasks
* CPU (execution) time: time spent executing this program on the CPU
**CPU time:**
* CPU has a constant-rate clock, a full clock cycle = a clock period
* CPU Time = CPU Clock Cycles * Clock Cycle Time = CPU Clock Cycles / Clock Rate
**So, to make a program run faster, we can:**
1. reduce the number of clock cycles required by the program (IC × CPI)
2. increase the clock rate
**Cycles per instruction (CPI):**
* CPU Clock Cycles = Instruction Count(IC) * Cycles per Instruction(CPI)
(Different instructions take different numbers of cycles)
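A quick worked example with numbers made up for illustration:
```latex
% assumed numbers, for illustration only
\text{IC} = 10^9,\quad \text{CPI} = 2,\quad \text{Clock Rate} = 2\,\text{GHz}
\Rightarrow\ \text{CPU Time} = \frac{\text{IC} \times \text{CPI}}{\text{Clock Rate}}
            = \frac{10^9 \times 2}{2 \times 10^9} = 1\ \text{second}
```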
**Performance depends on:**
* Algorithm: affects IC, possibly CPI
* Programming language: affects IC, CPI
* Compiler: affects IC, CPI
* Instruction set architecture: affects IC, CPI, clock rate
# 1.9 Performance development:
At first, performance was improved by raising the clock rate (and with it the power). After hitting the Power Wall, designers switched from a single fast core to multiple cores running in parallel to improve performance, so many programs also had to be rewritten to use multiple processors
**Power Wall:**
Power = Capacitive load * Voltage^2 * Frequency
once power gets too high, it is no longer practical to build cooling strong enough to dissipate the heat
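A rough worked example (my own numbers): if both voltage and frequency are reduced by 15% while the capacitive load stays the same, power drops to about 61% of the original:
```latex
\frac{P_{\text{new}}}{P_{\text{old}}}
  = \frac{C \times (0.85V)^2 \times (0.85F)}{C \times V^2 \times F}
  = 0.85^3 \approx 0.61
```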
---
# chapter2: Instructions, Language of the Computer
> Instructions are the words of a computer’s language; they command the computer’s hardware, and every computer has its own instruction set.
# 2.1 RISC-V Instruction set:
* 32 registers, x0 ~ x31, x0 = 0, faster to access than memory
* byte-addressed memory of 2^32 bytes; word addresses differ by 4 (Mem[0], Mem[4], ..., Mem[4,294,967,292])
* six group instructions: arithmetic, data transfer, logical, shift, conditional branch, unconditional branch

**Registers:**
faster to access than memory because memory data requires loads and stores
| Name | ABI | Meaning | Preserved across calls |
| ------- | ------ | ---------------------- | ---------------------- |
| x0 | zero | Zero | |
| x1 | ra | Return address | No |
| x2 | sp | Stack pointer | Yes |
| x3 | gp | Global pointer | |
| x4 | tp | Thread pointer | |
| x5-x7 | t0-t2 | Temporary registers | No |
| x8-x9 | s0-s1 | Callee-saved registers | Yes |
| x10-x17 | a0-a7 | Argument registers | No |
| x18-x27 | s2-s11 | Callee-saved registers | Yes |
| x28-x31 | t3-t6 | Temporary registers | No |
# 2.2 Arithmetic operations:
operators: add, sub (subtract), addi
```
add x5, x20, x21
addi x6, x22, 6
sub x19, x5, x6
```
> x20, x21, x5 hold variables
means
```
x5 = x20 + x21
x6 = x22 + 6
x19 = x5 - x6
```
addi: used to add a constant (immediate)
# 2.3 Memory Operand:
lw and sw move data between memory and registers
if we have an array A whose base address is x22
```
lw x9, 32(x22)
add x20, x21, x9
sw x20, 48(x22)
```
> x22 holds an address (a pointer); x9, x20, x21 hold variables. lw is load word: it loads the data at address x22+32 into x9
means
```
x9 = A[8](4*8=32)
x20 = x21 + x9
A[12](4*12=48) = x20
```
# 2.4 bit numbers(X)
**1.unsigned number**
max: (2^32)-1
**2.signed number**
**use 2's complement, the left bit is sign bit**
max = (2^31)-1 = 0111...1111
min = -2^31 = 1000...0000
ex:
5 = (0101)
-5 = inverse(5)+1 = (1010)+1 = (1011)
p.s. 1's complement just inverts the bits
**extend to 8-bit:**
5 = (0000 0101)
-5 = (1111 1011)
**load byte (not a whole word):**
lb: load byte with sign extension
lbu: load byte with unsigned/zero extension
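A small sketch (registers and the memory value 0xFB are my own choices) showing the difference:
```
// assume the byte at Mem[x10] is 0xFB (-5 as a signed byte)
lb  x5, 0(x10)   // x5 = 0xFFFFFFFB (sign-extended, value -5)
lbu x6, 0(x10)   // x6 = 0x000000FB (zero-extended, value 251)
```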
# 2.5 instruction formats:
* ways to store program
* binary Machine language
* use opcode to distinguish format and instructions
* Instructions are represented in binary, just like data, and both are stored in memory
**R-format**(add x9, x20, x21):
| funct7 | rs2 | rs1 | funct3 | rd | opcode |
| ------ | --- | --- | ------ | --- | ------ |
| 0 | 21 | 20 | 0 | 9 | 51 |
(funct7 and funct3 are additional opcode fields)
**I-format**(lw x9, 32(x22)):
| immediate | rs1 | funct3 | rd | opcode |
| --------- | --- | ------ | --- | ------ |
| 32 | 22 | 2 | 9 | 3 |
the offset (32 here) may be large, so the immediate field is 12 bits, the widest field in the format
**S-format**(sw x20, 48(x22)):
| imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
| --------- | --- | --- | ------ | -------- | ------ |
| 1 | 20 | 22 | 2 | 16 | 35 |
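Filling in the bits for this example (my own worked encoding, worth double-checking against the reference card): 48 = 0b0000001_10000, so imm[11:5] = 1 and imm[4:0] = 16.
```
imm[11:5] rs2   rs1   funct3 imm[4:0] opcode
0000001   10100 10110 010    10000    0100011   // = 0x034B2823
```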

# 2.6 Logical operations:
* slli: shift left logical (fill with 0)
* srli: shift right logical (fill with 0)
`slli x11, x19, 4 `
means
`x11 = reg x19 << 4 bits`
and
```
and x9,x10,x11
or x9,x10,x11
xor x9,x10,x11
```
means
`x9 = x10 & x11` (and similarly `|` for or, `^` for xor)
p.s.
`xor x9,x10,x12; (x12 = 1111...1111)`
means
`x9 = not(x10)`
# 2.7 Conditional operations:
beq:
`beq rs1, rs2, L1`
means
`if rs1 == rs2, go to label L1`
bne:
`bne rs1, rs2, L1`
means
`if rs1 != rs2, go to label L1`
**if statement:**
```
bne x22, x23, Else
add x19, x20, x21
beq x0, x0, Exit
Else:
sub x19, x20, x21
Exit:
…
```
means
```
if (i==j)
f = g+h;
else
f = g-h;
```
**while statement:**
```
Loop:
slli x10, x22, 2
add x10, x10, x25
lw x9, 0(x10)
bne x9, x24, Exit
addi x22, x22, 1
beq x0, x0, Loop
Exit:
…
```
means
```
while (save[i] == k)
i += 1;
```
basic block: a sequence of code with a single entry point and a single exit point (no branches out or labels in the middle)
# 2.8 Procedure(or function):
steps:
1. Place parameters in registers x10 to x17
2. Transfer control to procedure (by jump)
3. Acquire storage for procedure (store register to memory)
4. Perform procedure’s operations
5. Place result in register for caller(x10 to x17) and restore registers from memory
6. Return to place of call (return address in x1 and jump back)
instructions:
* jal x1, ProcedureLabel: store PC+4 in x1 (so execution can resume there after returning) and jump to ProcedureLabel
* jal x0, Label: just jump to Label (x0 cannot be changed); can jump further than a conditional branch
* jal rd, immed: has a 20-bit immediate for the jump target
* jalr x0, 0(x1): jump back to the address in x1 (used by the callee to return)
* PC: a register to hold the address of the current instruction being executed
```
sum: ble x10, x0, sum_exit   // if n <= 0, go to sum_exit
add x11, x11, x10            // acc = acc + n
addi x10, x10, -1            // n = n - 1
jal x0, sum                  // loop back (tail call)
sum_exit:
addi x12, x11, 0             // copy the result (acc) into x12
jalr x0, 0(x1)               // return to the caller
```
means:
```
int sum (int n, int acc) {
if (n > 0)
return sum(n − 1, acc + n);
else
return acc;
}
```
register saving conventions:
* x5−x7 and x28−x31: temporary registers, not preserved across calls (the caller decides whether to save them beforehand)
* x8−x9 and x18−x27: saved registers, preserved across calls (the callee is responsible for saving them before using them)
p.s.
* Leaf procedures: procedures that do not call others
* Non-leaf procedures: procedures that do call others
When entering a new procedure, the old register contents must be spilled from registers onto the stack, and restored from the stack before leaving, so that procedures do not overwrite each other's data

* sp: stack pointer(x2), denoting the most recently allocated address in a stack
recursive example: enter the procedure, do some computation, save variables on the stack (relative to sp), move sp, re-enter the procedure (jump with jal)... until the termination condition is met; then restore variables (relative to sp), compute, move sp back, and leave the procedure (a sketch follows below)
* fp: frame pointer(x8), points to the first word of the frame of a procedure
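A minimal sketch of that pattern for a recursive factorial (register choices and the use of `mul`, which needs the M extension, are my own; this is not the exact example from the lecture):
```
fact:
    addi sp, sp, -8      // make room for two words on the stack
    sw   x1, 4(sp)       // save the return address
    sw   x10, 0(sp)      // save the argument n
    addi x5, x10, -1     // x5 = n - 1
    bge  x5, x0, L1      // if n >= 1, go do the recursive case
    addi x10, x0, 1      // base case: result = 1
    addi sp, sp, 8       // pop the stack (saved values not needed)
    jalr x0, 0(x1)       // return
L1: addi x10, x10, -1    // argument = n - 1
    jal  x1, fact        // recursive call; result comes back in x10
    addi x6, x10, 0      // x6 = fact(n - 1)
    lw   x10, 0(sp)      // restore the original n
    lw   x1, 4(sp)       // restore the return address
    addi sp, sp, 8       // pop the stack
    mul  x10, x10, x6    // result = n * fact(n - 1)  (mul: M extension)
    jalr x0, 0(x1)       // return
```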
# 2.9 Memory Layout(RISC-V):

* Text segment: The RISC-V program (machine) code
* Static data segment: for static variables, pointed by gp (global pointer)(x3)
* Stack: compare with the procedure figure in section 2.8 above
# 2.10 Load/Store data operations:
* lb/lh/lw rd, offset(rs1): Load byte/halfword/word and sign extend to 64 bits in rd
* lbu/lhu/lwu rd, offset(rs1): Load byte/halfword/word and zero-extend to 64 bits in rd
* sb/sh/sw rs2, offset(rs1): Store byte/halfword/word, i.e. store the rightmost 8/16/32 bits of rs2
```
void strcpy (char x[], char y[]) {
size_t i;
i = 0;
while ((x[i]=y[i])!='\0')
i += 1;
}
```
becomes
```
strcpy:
addi sp,sp,-8 // adjust stack for 1 doubleword
sd x19,0(sp) // push x19
add x19,x0,x0 // i=0+0
L1: add x5,x19,x10 // x5 = addr of y[i]
lbu x6,0(x5) // x6 = y[i]
add x7,x19,x11 // x7 = addr of x[i]
sb x6,0(x7) // x[i] = y[i]
beq x6,x0,L2 // if y[i] == 0 then exit
addi x19,x19, 1 // i = i + 1 (advance to next byte)
jal x0,L1 // next iteration of loop
L2: ld x19,0(sp) // restore saved x19
addi sp,sp,8 // pop 1 doubleword from stack
jalr x0,0(x1) // and return
```
**lui (load upper immediate):** `lui rd, constant` places a 20-bit constant into bits 31:12 of rd and clears the lower 12 bits; combined with addi it can build arbitrary 32-bit constants

# 2.11 Addressing Modes:(X)
* Immediate addressing:
the operand is a constant within the instruction itself
* Register addressing:
the operand is a register
* Base/displacement addressing:
the operand is at the memory whose address is the sum of a register and a constant in the instruction
* PC-relative addressing:
the branch target address = PC + offset (the immediate stored in the instruction counts in units of 2 bytes, so target = PC + immediate × 2)
For branch instructions such as bne, an absolute target counted from address 0 might not fit in the available bits. Instead the offset is counted from the PC (the address of the branch instruction itself); since branches usually jump nearby, this needs far fewer bits than an absolute address
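A tiny worked example with made-up addresses: suppose a beq at address 0x2000 branches to a label at address 0x2010.
```latex
\text{offset} = \mathtt{0x2010} - \mathtt{0x2000} = 16\ \text{bytes},\qquad
\text{stored immediate} = 16/2 = 8,\qquad
\text{target} = \text{PC} + 8 \times 2 = \mathtt{0x2010}
```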
# 2.12 Synchronization:(X)
> Synchronization is used to know when a task is finished writing so that it is safe for another to read
use lr.w and sc.w: if the memory location specified by the lr.w is changed before the sc.w, the sc.w fails and does not write its value
```
lr.w rd,(rs1)
sc.w rd,(rs1),rs2
(Store the value of rs2 to the memory address in rs1; if successful, rd = 0)
```
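A sketch of how this is typically used (keeping the operand order written above; register choices are mine): an atomic exchange retries until the store-conditional succeeds.
```
// registers chosen arbitrarily: x20 = address, x23 = new value
again: lr.w x10, (x20)        // load-reserve the word at address x20
       sc.w x11, (x20), x23   // try to store x23 there; x11 = 0 on success
       bne  x11, x0, again    // another write slipped in between -> retry
       addi x23, x10, 0       // x23 now holds the old value (atomic swap done)
```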

# 2.13 C program execution flow

1. **compiler:** Compiler translates high-level language into machine instructions
p.s. Java uses a just-in-time compiler to speed up execution
2. **assembler:** turns the assembly language program into an object file, a combination of machine code (binary), data, and the information needed to place it in memory; it contains 6 parts
* Header: describes the contents of the object module
* Text segment: translated instructions
* Static data segment: data allocated for the life of the program
* Relocation info: for contents that depend on absolute location of loaded program
* Symbol table: global definitions and external references
* Debug info: for associating with source code
3. **linker:** takes all the independently assembled machine language programs and “stitches” them together; there are 2 ways to link external object files
* Static linking: libraries are linked at link time; bad for upgrades and for code size
* Dynamic linking: libraries are linked during execution; requires procedure code to be relocatable
4. **loader:** system software that places an object program in main memory so that it is ready to execute
# 2.14: Miscellaneous:
1. For the same C code, an array version and a pointer version compile to different assembly; the pointer version needs fewer instructions
2. the RISC-V ISA is divided into a base ISA (I) plus several extensions (the five standard extensions M, A, F, D, C)
3. Powerful instructions: higher performance, fewer instructions required
# 2.15: Design principles
1. Simplicity favors regularity
2. Smaller is faster
3. Good design demands good compromises
# chapter3: Arithmetic for Computers
# 3.1: addition & subtraction
**1.addition:**
a+b overflow if (a,b>0 and result sign = 1) or (a,b<0 and result sign = 0)
**2.subtraction:**
use 2's complement
a-b overflow if (a>0,b<0 and result sign = 1) or (a<0,b>0 and result sign = 0)
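A 4-bit worked example of the first case (numbers chosen by me):
```latex
0110_2 + 0101_2 = 1011_2
\qquad (6 + 5 = 11 > 7,\ \text{result sign bit} = 1 \Rightarrow \text{overflow; the result reads as } -5)
```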
**3.one bit ALU:**
contains 1 AND gate, 1 OR gate, 1 adder, and 1 multiplexer

**multiplexer:** a selector; based on the operation signal it chooses which result to output

**adder:**
$S_i = A_i \oplus B_i \oplus C_{i-1}$
$C_i = A_i B_i + (A_i \oplus B_i) C_{i-1}$

# 3.2: multiplication:

(0010)x(0011)
(a × b), in each step:
1. c = a × (b's LSB), i.e. c = a if the LSB of b is 1, otherwise 0
2. ans += c
3. shift a left by 1 and b right by 1
(uses a 64-bit ALU for the 32-bit operands)
**optimization 1:**
(use a 32-bit ALU)
(1000) × (1011)
(a × b), in each step:
the product register starts as 0 (upper half) / b (lower half); "add" means adding a to the upper half (a/0000)
1. initial: 0000 (just 0) / 1011 (b)
2. repeat 4 times: if the LSB is 1, add a to the upper half, then shift the whole register right
3. 1000/1011, 0100/0101 (add, then shift)
4. 1100/0101, 0110/0010
5. 0110/0010, 0011/0001 (LSB was 0, so no add)
6. 1011/0001, 0101/1000
7. 0101/1000 is the answer
**optimization 2: use multiple adders (arranged in parallel)**
# 3.3: divide:
**approach:**
1. If the (shifted) divisor ≤ the current remainder: put a 1 in the quotient and subtract
2. otherwise: put a 0 in the quotient and bring down the next dividend bit
3. use unsigned values, and fix up the signs after the calculation
ex: 7/2 (dividend a = 0111, divisor b = 0010)
1. shift the divisor left 4 bits: b = 0010/0000
2. a < b: shift b right, b = 0001/0000, quotient = 0000
3. a < b: shift b right, b = 0000/1000, quotient = 0000
4. a < b: shift b right, b = 0000/0100, quotient = 0000
5. a ≥ b: a = a − b = 0011, quotient = 0001, then shift b right, b = 0000/0010
6. a ≥ b: a = a − b = 0001, quotient = 0011; b has been shifted right 4 times, so stop
7. quotient = 0011, remainder = a = 0001
**optimization:** use a lookup table to predict several quotient bits at a time
# 3.4 floating point numbers

**Smallest value:**
* Exponent: 00000001
* actual exponent = 1 − 127 = −126 (127 is the fixed bias for 32-bit single precision)
* Fraction: 000…00
* significand = 1 + Fraction = 1 + 0 = 1.0
* ans = ±1.0 × 2^−126
**Largest value:**
* Exponent: 11111110
* actual exponent = 254 − 127 = +127
* Fraction: 111…11
* significand = 1 + Fraction = 1.111…1 ≈ 2
* ans = ±2.0 × 2^+127
**special value:**
* Exponent = 111...1, Fraction = 000...0: ±Infinity
* Exponent = 111...1, and Fraction ≠ 000...0: Not-a-Number(NaN)
IEEE754: single: 1+8+23 bits(bias=127), double: 1+11+52 bits(bias=1023)
ex:
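A representative worked encoding in single precision (my own example, −0.75):
```latex
-0.75 = -1.1_2 \times 2^{-1}
\Rightarrow \text{sign} = 1,\quad
\text{exponent} = -1 + 127 = 126 = 01111110_2,\quad
\text{fraction} = 1000\ldots0_2
```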

**addition**
1. Make the exponents of the two numbers the same (by adjusting the number with the smaller exponent)
2. Add the two significands
3. Normalize the result and check for overflow or underflow
4. Round the number
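A short worked example in binary (my own numbers): 0.5 + (−0.4375) = 0.0625.
```latex
1.000_2 \times 2^{-1} + (-1.110_2 \times 2^{-2})
= 1.000_2 \times 2^{-1} - 0.111_2 \times 2^{-1}   % step 1: align exponents
= 0.001_2 \times 2^{-1}                           % step 2: add significands
= 1.000_2 \times 2^{-4} = 0.0625                  % step 3: normalize (no overflow/underflow)
```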
**multiplication**
1. Calculate the exponent of the product by adding the two exponents (subtract one bias if they are biased)
2. Multiply the significands
3. Normalize the product and check for overflow and underflow
4. Round the number
5. Set the proper sign of the product
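Multiplying the same two numbers (again my own example): 0.5 × (−0.4375) = −0.21875.
```latex
(1.000_2 \times 2^{-1}) \times (1.110_2 \times 2^{-2}):\quad
\text{exponent} = -1 + (-2) = -3,\quad
1.000_2 \times 1.110_2 = 1.110_2
\Rightarrow -1.110_2 \times 2^{-3} = -0.21875
```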
# 3.5 FP(X)
> The RISC-V has separate floating-point registers and specific fp instructions
> f0~f31, fadd.s, feq, flw ..., FP instructions operate only on FP registers
**four rounding modes:** round up, round down, truncate, round to nearest even
**accuracy:** measured in “units in the last place” (ulp), i.e. the error in the least significant bits of the fraction
---
# chapter 4: The Processor
# 4.1 instruction execution
1. Read from PC(Program Counter)
2. Read from Reg.
3. Use ALU
4. Actions for each class
* (Mem.) Load/Store data
* (Arith.) Write ALU result to reg
* (Cond. branch) PC = target address, or PC + 4 if the branch is not taken

# 4.2 logic design(X)
**1.two types:**
1. Elements that operate on data values(no memory)
2. Elements that contain state(memory)
**2.clock:**
Edge-triggered clocking, data updated only on a clock edge

Register with write control: it only updates on a clock edge when the Write control input is 1

# 4.3 datapath
**1.datapath element:**
1. register file: holds the 32 registers, which can be read or written by specifying a register number

2. a unit to sign-extend the 12-bit offset field to a 32-bit signed value

3. a memory unit

**2.data flow for load:**

**3.data flow for branch:**
the ALU computes the comparison that decides whether the branch is taken

# 4.4 control

**setting:**


**1.ALU control:**

used for
* Load/Store inst. by addition
* Branch inst. by subtraction
* R-type inst.: the operation depends on the funct7/funct3 fields
has two inputs
* ALUOp: a 2-bit control field
* inst. ops (funct7 and funct3 fields)
has an output
* the 4-bit ALU control input

**2.execution flow:**
**R-format:**

**I-format:**

**branch**

# 4.5 pipeline
* **single-cycle model:** every instruction takes exactly one clock cycle, the clock cycle must be stretched to the slowest instruction(load)
* **pipelined model:** split instruction execution into steps (stages); the clock cycle only needs to cover one stage


* If all stages take the same time, Time_pipelined = Time_nonpipelined / Number of stages
* If not balanced, speedup is less
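A worked example with assumed stage times: 5 balanced stages of 200 ps each.
```latex
\text{Time}_{\text{nonpipelined}} = 5 \times 200\,\text{ps} = 1000\,\text{ps},\qquad
\text{Time}_{\text{pipelined}} \approx \frac{1000\,\text{ps}}{5} = 200\,\text{ps per instruction (ideal)}
```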
**RISC-V ISA designed for pipelining:**
* Fixed length of RISC-V instructions
* Few and regular instruction formats
* Addressing design of Load/Store insts.
# 4.6 stages
**non_pipeline:**

**pipeline:**

there is no pipeline register at the end of the write-back stage, because such a register would be redundant with the state that is updated there (the register file, memory, and PC)
**control of stages:**
Execution/address calculation(EX): ALUOp and ALUSrc
Memory access(MEM): Branch, MemRead, and MemWrite
Write-back(WB): MemtoReg, RegWrite

**a pitfall of the pipeline:**
Q:
by the time a load instruction is ready to write back, the write-register field held in IF/ID has already been overwritten by a later instruction, so the original destination register number is no longer there
A:
pass the destination register number down the pipeline along with the instruction, so that it arrives back at the register file together with the data to be written

# 4.7 pipeline hazards
> situations (events) that prevent starting the next instruction in the next cycle
**1.Structure hazards:**
* The hardware cannot support the combination of instructions we want to execute in the same cycle
* i.e. a required resource is busy
* so pipelined datapaths require separate instruction/data memories
**2.Data hazard:**
* Need to wait for previous instruction to complete its data read/write
* a hazard can be detected when an instruction is in the EX stage and the prior instruction is in the MEM stage; e.g. the condition EX/MEM.RegisterRd = ID/EX.RegisterRs1 signals a hazard on the first source operand
* sol 1: Forwarding (e.g. for a = a+1; a = a+1; a = a+1 back to back; see the sketch after this list)
how: when a hazard is detected, use the operand forwarded from a pipeline register instead of the stale value read from the register file



* sol 2: stall a cycle (needed for a load-use hazard, e.g. lw x2, ... immediately followed by add x1, x1, x2)
how: prevent the PC and the IF/ID register from changing, and set the seven control signals in ID/EX to 0 (a bubble)
(stall + forwarding)


* sol 3: Code scheduling: reorder the code so that an instruction does not use the result produced by the instruction right before it (especially a load)
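To make sol 1 and sol 2 concrete, a small sketch (registers chosen arbitrarily):
```
sub x2, x1, x3     // x2 is produced here (ready after EX)
and x12, x2, x5    // needs x2 immediately -> forward from the EX/MEM register
or  x13, x6, x2    // needs x2 one cycle later -> forward from the MEM/WB register
lw  x7, 0(x4)      // a load's result is only ready after MEM...
add x8, x7, x9     // ...so even with forwarding, one stall cycle is needed here
```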
**3.Control hazard:**
* Deciding on a control action depends on previous instruction
* sol 1: stall on branch(too slow)
* sol 2: predict
how: predict not taken; if wrong, flush the 3 instructions already in the pipeline (in the IF, ID, EX stages), because the branch is decided in MEM
* sol 3: reduce branch delay
how: move the branch target adder from the EX stage to the ID stage and do the branch comparison in ID, so only 1 instruction needs to be flushed
* sol 4: Dynamic branch prediction: predict based on history (a small memory, the branch prediction buffer, records it)
(Static branch prediction: sufficient for a simple pipeline design, based on typical behavior of loops and if statements)

# 4.8 exception(X)
> any unexpected change in control flow
> Arises within the CPU
> undefined opcode, syscall, overflow...
**when an exception occurs:**
1. The processor saves the address of the inst. in 64-bit SEPC
2. Transfers control to OS to look up SCAUSE for the reason of the exception
3. OS jumps to the specific event handler
4. After exception handling, OS may continue the interrupted program
SEPC: Supervisor Exception Program Counter
SCAUSE: Supervisor Exception Cause Register
**Exception is treated as another form of control hazard:**
* flush the instructions that follow the interrupted instruction
* fetch instructions from the new address
# 4.9 instruction-level parallelism(ILP)
> by executing multiple instructions in parallel
**two methods of ILP:**
1. **Deeper pipeline:** Increase the depth of the pipeline
* Less work per stage
* shorter clock cycle
* Hard to be done now(physical limits)
2. **Multiple issue:** Replicate internal components to launch multiple inst. in pipeline
* Requires extra work to keep all the hardware units busy
* and to hand the work along to the next pipeline stage
* CPI < 1, so IPC (= 1/CPI) is used instead

**two ways of Multiple issue:**
1. Static multiple issue (compiler-based)
* Compiler groups instructions into “issue packets”
* Compiler detects and avoids hazards(ex: code scheduling)
* Require extra ports in register file and a separate adder

example:

2. Dynamic multiple issue (hardware-based)
* Have a dynamic pipeline scheduling processor
* CPU chooses instructions to issue each cycle
* Compiler can help by reordering instructions
* CPU resolves hazards using advanced techniques at runtime
* instructions are not necessarily fetched/executed in program order; each functional unit has a buffer (reservation station), and the hardware decides which instruction's step to execute depending on operand availability, but the WB (commit) stage is in order
* requires more power
---
# chapter 5: Large and Fast: Exploiting Memory Hierarchy
# 5.1 principle of locality
1. Temporal locality
Items accessed recently are likely to be accessed again soon
2. Spatial locality
Items near those accessed recently are likely to be accessed soon
1. Hit: access satisfied by upper level
Hit ratio: #hits/#accesses
2. Miss: block copied from lower level
Miss ratio: #misses/#accesses = 1 – hit ratio
Miss time >> Hit time, so on a miss the CPU pipeline is stalled while it waits for the data


# 5.2 Cache design
1. **Direct mapped**
* each word can go in exactly one place in cache
* The simplest way
* use the low-order address bits to select the location (the index)
* store the upper address bits as the tag, which is compared to check whether the desired data is in that location

**organization:**
* cache block address = Tag + Index
* a valid bit marks whether the entry holds valid data (at the beginning, all entries are invalid)
ex:
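A small worked mapping, assuming (my numbers) a direct-mapped cache with 8 one-word blocks:
```latex
\text{index} = (\text{block address}) \bmod 8,\qquad
\text{e.g. block address } 13 = 1101_2 \Rightarrow \text{index} = 101_2 = 5,\ \text{tag} = 1
```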



**cache block size:**
if the block is too large (relative to the cache size):
1. the miss rate goes back up (too few blocks are left, so they compete and get evicted)
2. the miss penalty increases (more data to transfer on each miss)

**write policy:**
1. Write-through scheme
* writes always update both the cache and the lower level memory
* writes take longer
* a write buffer can reduce the memory write latency
(it temporarily holds the data waiting to be written to memory, so the CPU can continue immediately)
(if writes arrive faster than memory can absorb them, the buffer fills up and the CPU has to stall anyway)
2. Write-back scheme
* on a write, update only the cache block and mark it dirty (using a dirty bit)
* when a dirty block is replaced, its old data must first be written back to memory; the newly loaded block then starts out clean
* can use a write buffer
**on a write miss (the data to be written is not in the cache):**
1. Write allocate
bring the block into the cache and update it (both the cache and, per the write policy, memory get the new data)
2. No write allocate
update only memory (the data will not necessarily be used again soon)
* either write-through or write-back can be combined with (no) write allocate, but in general:
* a write-through cache adopts the no-write-allocate policy
* a write-back cache uses the write-allocate policy
**performance:**
1. reduce the miss rate
by reducing the chance that two distinct memory blocks contend for the same cache location (e.g. with associativity)
2. reduce the miss penalty (the time spent stalled waiting for memory)
by adding an additional level to the hierarchy
ex:

ex:
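A worked example with assumed numbers: base CPI = 1, cache miss rate = 2% of instructions, miss penalty = 100 cycles.
```latex
\text{CPI}_{\text{with stalls}} = 1 + 0.02 \times 100 = 3
\qquad\Rightarrow\qquad \text{3 times slower than with a perfect cache}
```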


2. **Fully Associative**
* a block can be placed in any cache entry; no index (modulo) computation is needed
* it requires all entries to be searched at once, done in parallel
* increases the hardware cost
* practical only for caches with a small number of blocks
3. **Set Associative**
Each set contains N entries, ex: N = 2

comparing the three modes:
* direct mapped: highest miss rate, simplest hardware
* set associative: in between
* fully associative: lowest miss rate but highest hardware cost


# 5.3 Multilevel cache

* Level-1 cache:
Focus on minimal hit time
* Level-2 cache
Focus on low miss rate to avoid main memory access
* Results
The L1 cache is usually smaller than a standalone single-level cache would be
ex:
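A worked example with assumed numbers: base CPI = 1, L1 miss rate = 2% per instruction, main-memory miss penalty = 400 cycles; adding an L2 with a 20-cycle access time cuts the misses that go all the way to main memory down to 0.5%.
```latex
\text{CPI}_{\text{L1 only}} = 1 + 0.02 \times 400 = 9,\qquad
\text{CPI}_{\text{L1+L2}} = 1 + 0.02 \times 20 + 0.005 \times 400 = 3.4,\qquad
\text{speedup} = 9 / 3.4 \approx 2.6
```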

# 5.4 The key design decisions
1. Block placement
* Direct mapped (1-way associative)
* N-way set associative
* Fully associative
2. Finding a block

* Hardware caches: Set-associative placement is often used
* Virtual memory: Full associativity is beneficial, since misses are very expensive
3. Replacement on a miss
* Least recently used (LRU)
* Random
4. Write policy
* Write-through
* Write-back
# 5.5 three kinds of cache misses
1. Compulsory misses
caused by the first access to a block that has never been in the cache.
2. Capacity misses
These are cache misses caused when the cache cannot contain all the blocks needed during execution of a program. Capacity misses occur when blocks are replaced and then later retrieved
3. Conflict misses
occur when multiple blocks compete for the same set
