###### tags: `祭祖`
祭祖
===
## Chapter emphasis
- Chapter 1: Performance & Power Evaluation
- Chapter 2: Instruction Set Architecture
- Chapter 3 & 4: Arithmetic & Datapath/Control & Pipelining (Processor)
- Chapter 5: Memory Hierarchy
- Chapter 6: Storage & I/O system
## chapter list
[TOC]
## 10/8

### Notions of performance
- Latency (the time it takes to run one program)
- Throughput (the number of programs that can be processed per second)
### Performance Equation
`Performance is in units of things-per-second`
#### metrics for performance evaluation
- Program execution time
- elapsed time
1. I/O time (time spent transferring data)
2. disk access (disk response time)
- seconds for a program (the total time the program takes to run)
- CPU execution time
- ignores I/O time and disk access time; counts only the time the process itself spends executing.
- can be split into user time and system time.
#### performance evaluation
::: success
$CPU\ time = \dfrac{Seconds}{Program} = \dfrac{Instructions}{Program} \times CPI\left(\dfrac{Cycles}{Instruction}\right) \times \dfrac{Seconds}{Cycle}$
:::
- instructions per program
- CPI (cycles per instruction)
- $CPI = \dfrac{Cycles}{Instruction}$
- CPI is affected by many factors: the CPU, the compiler, the programming language, the algorithm, and so on
- the cycle count is not necessarily equal to the instruction count
- e.g., a multiplication takes more cycles than an addition; the same goes for accessing RAM versus accessing a register
- clock cycle time (seconds per cycle)
- $Clock\ cycle\ time = \dfrac{1}{CPU\ clock\ rate}$
- execution time is usually measured in cycles rather than in seconds (when comparing programs' CPU time on the same CPU)
- i.e., the clock cycle time is the reciprocal of the clock rate
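A quick worked example (all numbers assumed, for illustration only): a program that executes $10^9$ instructions with an average CPI of 2 on a 2 GHz CPU takes
$$CPU\ time = 10^9 \times 2 \times \dfrac{1}{2\times10^{9}\,Hz} = 1\ second$$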
### MIPS
- Formula
- $MIPS = \dfrac{Instruction\ count}{Execution\ time \times 10^6} \\= \dfrac{Instruction\ count_a + Instruction\ count_b}{(Instruction\ count_a \times CPI_a + Instruction\ count_b \times CPI_b) \times Cycle\ time \times 10^6}$
- a larger value means more (millions of) instructions completed per second
- but a higher MIPS does not necessarily mean a more powerful machine
- it still depends on the instruction count, CPI, and execution time, so it can be misleading when comparing machines with different instruction sets
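Worked example (all numbers assumed): a program has $4\times10^9$ class-a instructions with $CPI_a = 1$ and $1\times10^9$ class-b instructions with $CPI_b = 2$, running with a 0.5 ns cycle time (2 GHz):
$$Execution\ time = (4\times10^9 \times 1 + 1\times10^9 \times 2)\times 0.5\,ns = 3\,s$$
$$MIPS = \dfrac{5\times10^9}{3\,s \times 10^6} \approx 1667$$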
### what affects the 3 metrics
| |instruction count|CPI|Clock Rate|
|---|-----------------|---|----------|
|Algorithm|**YES**|**YES**| |
|Programming Language|**YES**|**YES**| |
|Compiler|**YES**|**YES**| |
|ISA|**YES**|**YES**|**YES**|
::: success
**So improving overall CPU performance can mean improving all three metrics at the same time**
:::
### SPEC benchmarks
- CPU
- integer operations
- floating-point operations
- High Performance Computing, OpenMP, MPI
- parallel computing
- Power
- Web server
> more info at http://www.spec.org/
### Normalization
- how scores are adjusted
- performance is expressed relative to a reference machine, as a percentage of the reference machine's performance
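For example (numbers assumed): if the reference machine runs a benchmark in 100 s and the machine under test runs it in 50 s, then
$$Normalized\ performance = \dfrac{Execution\ time_{ref}}{Execution\ time_{test}} = \dfrac{100\,s}{50\,s} = 2 = 200\%$$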

### Amdahl's Law
```
1 : the total (original) execution time
F : the fraction of the total time taken by the part that can be improved
S : the speedup of that part; after the change, that part takes F/S, i.e. 1/S of its original time F
```
$Execute\ time(w/E) = \left((1-F) + \dfrac{F}{S}\right) \times Execute\ time(w/oE)$
$Speedup(E) = \dfrac{1}{(1-F) + \dfrac{F}{S}}$
- What does it tell us???
- you cannot keep strengthening only one single part
- pay attention to how large a fraction of the total execution time a piece of code takes before optimizing it
- the more you optimize the same piece, the smaller its share of total execution time becomes, so each further improvement buys less and less
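Worked example (numbers assumed): if the improvable part takes 80% of the execution time ($F = 0.8$) and we make it 4× faster ($S = 4$), then
$$Speedup = \dfrac{1}{(1-0.8) + \dfrac{0.8}{4}} = \dfrac{1}{0.2 + 0.2} = 2.5$$
and even if that part were made infinitely fast, the overall speedup could never exceed $\dfrac{1}{1-F} = 5$.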
## 10/15
### power
#### Basics of Power Consumption

- switching: power consumed while the chip's transistors are switching during operation
- short-circuit: power lost to brief short-circuit current during switching
- leakage: current that leaks away even when nothing is switching
:::success
high-k: Intel's ultimate material that pushed past the limits of Moore's law
:::
#### the benefits of multi-core


- Benefits???
- increases throughput (more programs finished per unit of time)
- saves power
- Drawbacks?
- it does not reduce the latency of a single program (latency performance is sacrificed for throughput)
### assembly code
#### five categories of assembly code
- storage: register usage conventions
|Name|Register number|Usage|
|---|---|---|
|\$zero|0|the constant value 0|
|\$v0-\$v1|2-3|values for results and expression evaluation|
|\$a0-\$a3|4-7|arguments|
|\$t0-\$t7|8-15|temporaries|
|\$s0-\$s7|16-23|saved|
|\$t8-\$t9|24-25|more temporaries|
|\$gp|28|global pointer|
|\$sp|29|stack pointer|
|\$fp|30|frame pointer|
|\$ra|31|return address|
- Eight Great Ideas for designing instructions (in lesson 1)
1. Design for Moore’s Law
1. Use abstraction to simplify design
1. Make the common case fast
1. Performance via parallelism
1. Performance via pipelining
1. Performance via prediction
1. Hierarchy of memories
1. Dependability via Redundancy
- 5 major categories of instructions:
- Arithmetic: add, sub, …
- add
- format: ```add des src src```
- addi
- format: ```addi $s3, $s3, 4```
- Data transfer: load, store, …
- lw
- format: ```lw des offset(base)``` // offset must be a constant, ==offset is in bytes==
- sw (store word)
- format: ```sw src offset(base)``` // offset must be a constant, ==offset is in bytes==
- Logical: and, or, …
- or, ori, and, andi
- format
- ```or rd rs rt```
- ```ori rd rs constant```
- Conditional branch: branch on equal,
- format
- ```beq reg1 reg2 L1``` // if reg1 == reg2, jump to statement L1
- Unconditional jump: jump,
- ```J L1``` jump to L1
::: success
- rd: register destination
- rs: register source
- rt: register target(or it just rs->rt->ru->rw..., maybe~)
:::
### Big Endian and Little Endian

#### what are the LSB (least significant bit) and MSB (most significant bit)

### How about larger constants? (memory addresses are also 32 bits wide)


## 3 different kinds of instructions
- R-type
- add
- sub
- slt
- sll
- and
- or
- nor
- jr
- I-type
- addi
- andi
- ori
- lw / sw
- beq
- J-type
- j
- jal
- the jump target field only has 26 bits, but an address has 32 bits (see pseudodirect addressing below)

## 10/22
## 10/29
### Addressing Mode
- Immediate addressing
- Example: ```addi $2, $3, 4```
- Register addressing
- Example: ```add $r1, $r2, $r3```
- Base addressing
- Example: ```lw $2, 100($3)```
- PC-relative addressing
- Example: ```beq $2, $3, 100```
- Pseudodirect addressing (how branch and jump targets are formed is sketched right after this list)
- Example: ```j 100```
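A small sketch of how the hardware forms these two kinds of targets (Python, with made-up example values; the offsets here are illustrative, not taken from a real encoding):
```python
# Sketch: how branch / jump target addresses are formed (example values assumed).
PC = 0x00400000            # address of the current instruction

# PC-relative addressing (beq/bne):
# target = (PC + 4) + (sign-extended 16-bit offset << 2)
branch_offset = 100        # word offset taken from the instruction's 16-bit field
branch_target = (PC + 4) + (branch_offset << 2)

# Pseudodirect addressing (j/jal):
# target = upper 4 bits of (PC + 4) concatenated with the 26-bit field << 2
jump_field = 0x0100000     # 26-bit target field taken from the instruction
jump_target = ((PC + 4) & 0xF0000000) | ((jump_field << 2) & 0x0FFFFFFC)

print(hex(branch_target))  # 0x400194
print(hex(jump_target))    # 0x400000
```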
- li: load immediate
- v0, v1 are for return values (probably?)
- fp: frame pointer
- sp: stack pointer
- ra: return address
## 11/5
- Set Less Than
- `slt $s1, $s2, $s3` # if ($s2 < $s3) $s1 = 1; else $s1 = 0
## 11/26
### Five execution steps
- Instruction fetch
- Instruction decode and operand fetch
- Execution (ALU, memory address computation, Branch completion)
- Data memory access
- Write-back
---
## 12/10
### recall
#### in the datapath
1. instruction fetch
- pre-calculate PC + 4 (program counter)
- get the instruction from memory
2. instruction decode and register fetch (ID)
- operand fetch
- pre-calculate the branch target address (==for beq, bne==)
:::success
why can't we put the arithmetic unit (which calculates the target address) in the IF stage?
- it would take more time and more hardware cost.
- the calculation can be done sequentially in the ID stage instead.
:::
3. execution
- ALU execution
- in this step you find out whether the branch is taken and where it will jump (==for beq, bne==)
- do the arithmetic operation (for other instructions)
4. memory access
- only load (lw) and store (sw) actually use this stage (in the single-cycle datapath; in the pipelined datapath every instruction still goes through all 5 stages).
5. memory read completion or write back
- just write the result back to the register file
### pipeline troubles
We will run into 3 kinds of pipeline hazards
- structural hazard
- happens when two instructions need the same resource at the same time; in the example below the memory access unit is used twice in the same cycle

- Solutions
1. just wait: stall until the conflicting use of that resource has passed
2. add another memory access unit for the instruction fetch stage (its own instruction memory), so it no longer fights with the memory access unit used later in the pipeline.

- data hazard
- in some cases data gets used before it has been written; in the example below r1 is read before it has been written back.

- Solutions
1. just wait: stall until r1 has actually been written
2. data forwarding
- it is called data forwarding, but the behaviour feels more like "data backwarding": in the figure above r1 is read before it is written back, so we add hardware at the ALU that can pass r1's freshly computed result straight to the following instruction that needs it.
:::danger
### One case cannot be fixed by data forwarding (bypassing) alone (the gap is more than two clocks).
- load-use data hazard
because the value of r1 only becomes available after the memory access stage, bypassing alone is not enough here :cry:

- Solutions
1. stall (i.e., pause for one clock cycle)

2. a smarter way?
fill the slot that would have been a stall with an instruction that is not needed in the next clock (say lw t2, 4(r2)), which hides the effect of the stall
:::
- control hazard
- beq and bne only know whether to take the branch after the ALU finishes (the execution stage), so there is a problem

- Solutions
1. stall:
- wait 3 clocks until the branch outcome is computed, as in the figure above; the result is shown below, a 3-clock delay.
2. compute the target address and the taken/not-taken decision earlier in the pipeline; if you don't remember, look back [here](https://hackmd.io/4is1d9JdTOuKd2SUJcaVqw?both#in-data-path); with the detailed flow below, the branch destination is known after only a 2-clock delay.

3. predict:
- since the target address is already computed early, we can predict where the branch will go; if the guess is wrong we pay the same 2-clock delay as method 2, and if it is right we win big.
- two ways to guess (a sketch of the 2-bit scheme follows this list)
1. dynamic scheme (90% accuracy)
- a simple 1-bit scheme mispredicts twice around a loop (0→1 on entering, 1→0 on leaving)
- a smarter method: the 2-bit scheme
2. always guess taken or always not-taken (50% accuracy)
- how do we flush on a wrong guess??? (next chapter)
4. a solution very similar to one of the load-use data hazard fixes above: same concept but not very practical. Insert an instruction whose reordering does not change the result; since method 2 still only resolves the branch one stage later, putting such an order-independent instruction in between hides the possible extra 1-clock delay.

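A minimal sketch of the 2-bit scheme mentioned above (Python, illustrative only; real hardware keeps a small table of such counters indexed by PC bits rather than a single counter):
```python
# 2-bit saturating counter: states 0,1 predict "not taken", states 2,3 predict "taken".
# It takes two mispredictions in a row to flip the prediction.
class TwoBitPredictor:
    def __init__(self):
        self.state = 0                  # start in "strongly not taken" (assumed)

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        # move toward "taken" on taken branches, toward "not taken" otherwise
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

# Example: a loop branch that is taken 9 times and then falls through once.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    if p.predict() == taken:
        correct += 1
    p.update(taken)
print(f"{correct}/{len(outcomes)} predictions correct")   # 7/10 with this start state
```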
::: success
By now our datapath should look like this

Remaining problems???
- implement data forwarding
- detect when to forward data and when to stall the pipeline
- how to resolve the branch target address in the decode stage.
- how to flush the pipeline
:::
### detect (data forwarding, data dependency)
- in the execution stage
- in the memory stage
- how to implement it
- use the pipeline registers between stages
#### when to detect
Hazard conditions (the slides have way too many figures to screenshot them all; a small code sketch of these checks follows this list):
- 1a: EX/MEM.RegisterRd = ID/EX.RegisterRs (sub & and)
- 1b: EX/MEM.RegisterRd = ID/EX.RegisterRt
- 2a: MEM/WB.RegisterRd = ID/EX.RegisterRs (sub & or)
- 2b: MEM/WB.RegisterRd = ID/EX.RegisterRt
- RegWrite signal of WB Control field
- EX/MEM.RegWrite, MEM/WB.RegWrite
- EX/MEM.RegisterRd <> $0
- MEM/WB.RegisterRd <> $0
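A rough sketch of how those conditions could be checked in code (Python; the dictionary-style pipeline registers and field names are assumptions that mirror the slide notation, not a real implementation):
```python
# Decide where ALU input A should come from, using the hazard conditions above.
# Checking the EX/MEM register first gives the most recent value priority,
# which is why the 2a check only forwards when 1a did not fire.
def forward_a(id_ex, ex_mem, mem_wb):
    rs = id_ex["RegisterRs"]
    # 1a: EX hazard - result of the instruction one ahead, still in EX/MEM
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 and ex_mem["RegisterRd"] == rs:
        return "EX/MEM"
    # 2a: MEM hazard - result of the instruction two ahead, now in MEM/WB
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 and mem_wb["RegisterRd"] == rs:
        return "MEM/WB"
    return "ID/EX"    # no hazard: use the value read from the register file

# Conditions 1b / 2b are the same checks with ID/EX.RegisterRt for ALU input B.

# Tiny example: sub $2,$1,$3 followed by and $12,$2,$5  ->  forward from EX/MEM
id_ex  = {"RegisterRs": 2}
ex_mem = {"RegWrite": True, "RegisterRd": 2}
mem_wb = {"RegWrite": False, "RegisterRd": 0}
print(forward_a(id_ex, ex_mem, mem_wb))   # EX/MEM
```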
### Exception handling
#### steps to handle an exception
1. flush the instructions currently in the IF, ID, and EX stages
2. let the instructions before the one that raised the exception finish executing
3. record the EPC (the PC following the instruction that raised the exception)
4. invoke the OS exception handler
- make PC = 0x80000180
5. return to EPC
- PC = EPC - 4
### Advanced Pipeline
Ways to increase ILP (instruction-level parallelism):
- more pipeline stages (a deeper pipeline)
- more pipelines working in parallel (multiple issue)
### memory hierarchy
#### two different types of locality
- Temporal locality (recently used data and instructions are likely to be used again soon, so keep them in the cache)
- Spatial locality (data next to recently used data, such as neighbouring array elements, is likely to be used soon, so load the whole block into the cache) — see the small example after this list
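A tiny illustration of the difference (Python; the array size and the row-major layout are assumptions for the sake of the example — think of the flat list as a C-style array stored contiguously in memory):
```python
# Stride-1 traversal touches neighbouring addresses, so each cache block that is
# brought in gets fully used (good spatial locality). Stride-N traversal jumps
# N elements between accesses, so it tends to hit a different block every time.
N = 1024
a = [0] * (N * N)          # imagine a row-major N x N array in contiguous memory

def sum_stride_1(a):
    s = 0
    for i in range(N):
        for j in range(N):
            s += a[i * N + j]      # consecutive j -> consecutive addresses
    return s                        # reusing s every iteration is temporal locality

def sum_stride_N(a):
    s = 0
    for j in range(N):
        for i in range(N):
            s += a[i * N + j]      # consecutive i -> addresses N elements apart
    return s
```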
#### two cache types in the memory hierarchy
- fully-associative cache
- direct-mapped cache
#### what does a cache block look like??

::::: info
**Ex. How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?**
- \# of sets = 16 KB / 4 B = 4K words; 4K words / 4 words per block = 1K sets
- \# of data bits per set = 4 x 32 = 128
- \# of tag bits per set = 32 - 10 (index bits, since 1K sets) - 2 (block offset) - 2 (byte offset) = 18
- valid bit per set = 1
- total cache bits = \# of sets x (valid bit (1-bit) + tag bits + data bits) = 1K x (1 + 18 + 128) = 147 Kbits ≈ 18.4 KB
:::::
#### data lookup

- use the cache index to find the slot where this data should live (slot 1 in the figure)
- check whether the cache tag at that slot matches (if it matches, it is a hit; otherwise it is a miss)
- on a miss, refill that slot with the data from memory (a small lookup sketch follows)
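A minimal lookup sketch (Python; the parameters are assumed to match the 1K-set, 4-word-block example above, and the miss path mirrors the refill steps in the next subsection):
```python
# Direct-mapped cache lookup: split the address into tag / index / offset,
# check valid + tag for a hit, and refill the entry on a miss.
NUM_SETS   = 1024          # 10 index bits
BLOCK_SIZE = 16            # 4 words = 16 bytes -> 4 offset bits

valid = [False] * NUM_SETS
tags  = [0] * NUM_SETS
data  = [None] * NUM_SETS  # each entry would hold one 4-word block

def access(address):
    index = (address // BLOCK_SIZE) % NUM_SETS     # which set/slot to look in
    tag   = address // (BLOCK_SIZE * NUM_SETS)     # the remaining upper bits
    if valid[index] and tags[index] == tag:
        return "hit"
    # miss: fetch the block from main memory, then fill tag / valid / data
    valid[index] = True
    tags[index]  = tag
    data[index]  = f"block starting at {address & ~(BLOCK_SIZE - 1):#x}"
    return "miss"

print(access(0x1234))                          # miss (cold cache)
print(access(0x1238))                          # hit  (same 16-byte block)
print(access(0x1234 + NUM_SETS * BLOCK_SIZE))  # miss (same index, different tag)
```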
#### handling a cache miss
1. save the PC of the instruction that caused the miss (the value from before the miss) somewhere
2. ask main memory to get the data ready for the retried access (how many cycles we wait depends on how many levels of the memory hierarchy we missed in)
3. write the block from memory into the cache entry (fill the tag field with the upper address bits and set the valid bit to 1)
4. restart the program from the PC recorded in step 1; the instruction is re-executed, and the time it takes depends on how long you spent in memory
::::: danger
This flow is very important.
:::::