CH5

# Overview
* Pipelining is an implementation technique that **overlaps** the execution of multiple instructions to make the whole process faster
* It divides the execution of a **job** into steps and cuts the hardware into corresponding **stages**. Each stage executes one step of a job
* Pipelining can **increase the throughput of the system** but **can't improve the latency** (execution time) of a single instruction
* The **potential speedup** of a pipeline equals the **number of stages** the hardware is divided into
# Characteristics of Pipeline in MIPS
* MIPS divides the hardware into **5 stages**
> ![](https://hackmd.io/_uploads/BJpc24yh2.png)
* The instructions are divided as shown in the following table
>![](https://hackmd.io/_uploads/r1IFpBy23.png)
* MIPS is also optimized for pipelining
>![](https://hackmd.io/_uploads/BJQyJ8yh2.png)
# Performance in Pipeline
> ![](https://hackmd.io/_uploads/Hk6X1Uyhh.png)
* In the CPU clock count, the **(S-1)** term is the time used to **fill the pipeline**; time is also needed to drain the pipe
* Ideally, the **number of instructions** in a pipelined machine should **approach infinity**, i.e. $N \to \infty$
> ![](https://hackmd.io/_uploads/rJ8klLJ33.png)
> In this case, by the time the 4th clock is done the pipeline is full, and after that every clock finishes one instruction until the pipe is drained.
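
The fill-then-one-per-clock behaviour described above can be sketched in Python. This is only a minimal sketch, assuming an ideal pipeline with no hazards; the function names `pipeline_cycles` and `speedup` and the sample numbers (5 stages, 10 ns single-cycle time, 2 ns pipelined cycle time) are made up for illustration and are not taken from the figures.

```python
def pipeline_cycles(stages: int, n_instructions: int) -> int:
    """Clocks for an ideal pipeline: (S - 1) to fill, then one instruction per clock."""
    return (stages - 1) + n_instructions


def speedup(single_cycle_ns: float, pipe_cycle_ns: float,
            stages: int, n_instructions: int) -> float:
    """Execution time of a single-cycle machine divided by that of the pipeline."""
    single_time = n_instructions * single_cycle_ns
    pipe_time = pipeline_cycles(stages, n_instructions) * pipe_cycle_ns
    return single_time / pipe_time


if __name__ == "__main__":
    # Hypothetical machine: 5 stages, 10 ns single-cycle clock, 2 ns pipelined clock.
    for n in (1, 10, 1000, 10**6):
        print(n, pipeline_cycles(5, n), round(speedup(10.0, 2.0, 5, n), 4))
```

With these assumed numbers the speedup starts at 1 for a single instruction and approaches the cycle-time ratio (10 / 2 = 5) as N grows, which is why N should be large for the pipeline to pay off.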
> To calculate the speedup versus a single-cycle machine, see the following graph
> ![](https://hackmd.io/_uploads/HJ3eWIk3h.png)
>
>Example:![](https://hackmd.io/_uploads/H1hemUyhn.png)
> Cycle time = 1/500M = 2ns => ExeTime = ((5-1) + N) * 2ns = 100ns => N = 46 => For 2N, ExeTime = (4 + 92) * 2ns = 192ns
>
> Example:![](https://hackmd.io/_uploads/SyDTVLy2n.png)
> $ExeTime_{Single} = N * 10ns = 10N$
> For the pipeline the cycle time is 10/4 + 1 = 3.5ns => $ExeTime_{Pipe} = (3+N)*3.5 = 10.5 + 3.5N$ => $SpeedUp = \frac{10}{3.5} = 2.86\ when\ N \to \infty$
>
* **The CPI of the single-cycle machine and of the pipeline remains the same**
# Pipeline Datapath
* To build the datapath, we first divide the single-cycle datapath into 5 stages and add the pipeline registers
> ![](https://hackmd.io/_uploads/SyX06Pl2n.png)
* The register file is written only during the WB stage. To avoid writing the data into the wrong register while another instruction is in the ID stage, we **carry the destination register number along the pipeline all the way to WB**
> ![](https://hackmd.io/_uploads/Bks7Q_l33.png)
* But this makes two instructions access the register file simultaneously. To avoid a conflict, we **split the clock cycle into two halves, used for write and read respectively**. Therefore, besides the PC and the pipeline registers, the register file also needs to be connected to the clock.
## Control Unit
* Generally, the CU for the pipeline is the same as that of the single-cycle machine, but we identify which control lines are used in each stage and group them by stage
> ![](https://hackmd.io/_uploads/r1MlTul3h.png)
> * We moved the RegDst MUX to the MEM stage for better performance
>
>![](https://hackmd.io/_uploads/r1FOnux32.png)
>![](https://hackmd.io/_uploads/ryAU6Ox3h.png)
> * Pass the control lines along through the pipeline registers and drop each one after it has been used
## Pipeline Diagram
* There are two ways to draw it

1. Standard
    * Shading the left half means reading, while shading the right half means writing
    > ![](https://hackmd.io/_uploads/S1qRpOgnn.png)
2. 
**Traditional (most frequently used)**
> ![](https://hackmd.io/_uploads/SJkbJYe22.png)
>
> Example:![](https://hackmd.io/_uploads/H1KfkYxh2.png)
> 1). In pipeline => 500 ps, In single => 300+400+350+500+100 = 1650 ps
> 2). In pipeline => $5*500 = 2500ps$, In single => 1650ps
> 3). Choose the longest stage to split in order to decrease the cycle time => MEM => new cycle time = 400ps
>
> Example:![](https://hackmd.io/_uploads/ryR4V2-nh.png) ![](https://hackmd.io/_uploads/ByLrN3Zn3.png)
> 1). False
> 2). True
> 3). False, an ALU instruction can't take fewer than 5 cycles
> 4). False
> 5). True
>
> Example:![](https://hackmd.io/_uploads/ryXES2b22.png)
> 1). EX : ALUSrc = 0, ALUOp = 10, RegDst = 1
> 　 MEM : Branch = 0, MemWrite = 0, MemRead = 0
> 　 WB : MemtoReg = 0, RegWrite = 1
> 2). 1 cycle, same as the single-cycle machine
> 3). PCSrc = 0. The EX stage contains many components, so it may be the longest stage; adding an extra AND gate there would increase the cycle time.
> 　If facing a control hazard, putting the AND gate in the EX stage allows us to stall only one clock
# Hazard in Pipeline
## Types of Hazard
* There are 3 types of hazard

1. **Structure Hazard**
    * Caused by **insufficient hardware** (the distance between them must be **smaller than 2**)
    > Example: If there's only one memory
    > ![](https://hackmd.io/_uploads/HkSMo3W32.png)
    > At the 4th clock, we need to fetch an instruction and read data from memory at the same time
2. **Data Hazard**
    * Caused by **data dependencies between instructions**
    > Example:![](https://hackmd.io/_uploads/B1iCi3-h2.png)
    > There is a dependency between all the instructions.
    > The correct result of s0 will be written back to the register file in the 5th clock.
    > beq fetches the data in the 2nd clock, but the data isn't ready yet, which causes the hazard
3. 
**Control Hazard**
* Caused by **fetching the wrong instruction**
* Usually happens with **branch** instructions
> Example:![](https://hackmd.io/_uploads/BJVk1pbh2.png)
> We don't know whether the next instruction should be sub or add until the beq instruction is done
## Resolving Hazard
> ![](https://hackmd.io/_uploads/ByUuEbm3n.png)
### Structure Hazard
1. Stall
    * The easiest way to solve the hazard, but it is inefficient
    > Example:![](https://hackmd.io/_uploads/SJyTBbQhh.png)
    > Clocks needed = (5-1)+5+3 = 12
2. Add Hardware
### Data Hazard
#### Software Based Solution
1. NOP
    * Essentially the same as a stall.
    * Since the hazard only happens when there's a **data dependency between two instructions and their distance is less than 3**, adding NOP instructions (which do nothing) to increase the distance beyond 2 prevents the hazard from occurring
    > Example:
    > ![](https://hackmd.io/_uploads/Bkm2uWQnn.png)
2. Instruction Reordering
    * Reorder the instructions to increase the distance between two dependent instructions
    > Example:
    > ![](https://hackmd.io/_uploads/rkeB5-X33.png)
    > Before reordering we must check the following restrictions; if they match, the reordering is invalid
    > ![](https://hackmd.io/_uploads/BJGq9ZXnn.png)
    > In this example, reordering both lw/add is valid
    > ![](https://hackmd.io/_uploads/H1rApZmh3.png)
#### Hardware Based Solution
1. 
Forwarding
* The hazard happens because an instruction accesses data before it is ready (written back in WB). We therefore simply **take the correct data from the pipeline register** once its computation is done (EX), which avoids the hazard
* There are two types of data hazard, depending on where the correct data is located
> ![](https://hackmd.io/_uploads/rkHJIGmnn.png)
> For the and instruction: when it is in the ID stage and about to fetch its data, s2 is still being calculated in the EX stage => EX hazard
##### Constructing Forwarding Unit/Path
> ![](https://hackmd.io/_uploads/H15O9fm2n.jpg)
* Forward the correct data from the pipeline registers to the ALU using two 3-to-1 MUXes controlled by the Forwarding Unit.
* For an **EX hazard the correct data is connected to input 1; for a MEM hazard it is connected to input 2**
>![](https://hackmd.io/_uploads/SJcNvQmn3.jpg)
* To determine the control lines of the Forwarding Unit, we **compare the source registers (in the ID/EX register, see the red line above) with the destination register** in the EX/MEM or MEM/WB register.
* In addition, the **rd register can't be `$0`** and the control line **RegWrite must be 1**

---
> ![](https://hackmd.io/_uploads/HkHAVGB2h.png)
* In the scenario above, the 3rd add has both an EX and a MEM hazard, but effectively only the EX hazard counts, since `$1` is rewritten by the 2nd add.
* To forward correctly, we need to **check that there is no EX hazard before forwarding for a MEM hazard**
> ![](https://hackmd.io/_uploads/S1j8IGBh2.png)
> Same for the Rt register
>
> Example:![](https://hackmd.io/_uploads/B10IDzSn2.png)
> There's an EX hazard between add & or
> => `$4` is the rt register
> => (2).
>
> Example: ![](https://hackmd.io/_uploads/HyorOGr3n.png)
> ![](https://hackmd.io/_uploads/SynUuzH23.png)
> 1). EX: 1&2, 2&3, 3&4, 4&5, 6&7 MEM: 4&6 Control: 7
> **If there's a branch instruction then a control hazard will occur**
> 　 Add 2 NOPs before 2, 3, 4, 5, 7 and **1 NOP after bne**

2. 
Stall and Forwarding
* Mainly targets the **load-use hazard** (a type of EX hazard)
> ![](https://hackmd.io/_uploads/HkhY9MSh2.png)
> * Unlike a normal EX hazard (data ready when EX is done), the **data will not be ready until the MEM stage**
> * Thus we wait one clock for the data and then forward it (like a MEM hazard).
* The following is the datapath with the **hazard detection unit**
> ![](https://hackmd.io/_uploads/H1WL3Gr3h.png)
* The additional **PCWrite & IF/IDWrite lines prevent the next instruction from entering the datapath** and **make the 3rd instruction stay in the IF stage** (the or instruction in the example above)
* The new **stall line turns the instruction into a NOP** by **resetting all control lines to 0**
* Unlike the 3rd instruction, the 2nd instruction (sub) would keep moving along the datapath and be copied into the EX stage, which is not what we want. Therefore we turn it into a NOP by resetting all of its control lines.
* The detection unit is **placed in the ID stage**
> Example:![](https://hackmd.io/_uploads/BJSXJ7rnn.png)
> 1). There's a load-use hazard between every lw & add pair (stall one clock) and an EX hazard between every add & lw pair (solved by forwarding)
> =>$ANS = \frac{(5-1) + 1000 + 500}{1000} \approx 1.5$
> 2). Without forwarding, every hazard needs 2 NOPs to resolve it (including the 1 clock needed for the load-use hazard)
> =>$ANS = \frac{(5-1) + 1000 + 999 * 2}{1000} \approx 3$
>
> Example:![](https://hackmd.io/_uploads/rkl5DgP33.png)
> ![](https://hackmd.io/_uploads/SJJsvlv2h.png)
> a). f is the fraction of load-use hazards => $T = [(5-1)+N+N * f * 1]* 10$
> b). $T_{1} = [(5-1)+N+N * f * 1]* 12$
> 　 If split => cycle time = 10ns and a load-use hazard now stalls 2 clocks => $T_{2} = [(6-1)+N+N * f * 2]* 10$
> 　 => for $f < \frac{1}{4}$, when $N \to \infty$, option 2 is faster than option 1
> c). 
A faster load can read its data in the MEM1 stage => it only needs to stall 1 clock; on the other hand, a slower load needs 2 clocks
> =>$T = [(6-1)+N+N * f * 0.4 * 1+N * f * 0.6 * 2]* 10$

---
#### Different Data Hazard
> ![](https://hackmd.io/_uploads/BynI0lP33.png)
* In MIPS, an instruction never writes a register before an earlier instruction reads or writes it (by design of the pipeline stages). Therefore **WAR/WAW are false data dependencies**
### Control Hazard
> ![](https://hackmd.io/_uploads/ByrgzZwh2.png)
#### 1.Stall
* Add **3 stalls after the branch instruction**
* We can **resolve the branch earlier to use fewer stalls**; the minimum **number of stalls is 1**, when the branch is resolved in the **ID stage**
> ![](https://hackmd.io/_uploads/BysvxWD22.png)
* To do so, we need to move all the components that perform the branch into the ID stage and **add a new component for comparing the operands**
> ![](https://hackmd.io/_uploads/Sk_JZWPnn.png)
> The corresponding datapath is shown in the following graph (**IMPORTANT**)
> ![](https://hackmd.io/_uploads/BkKHb-w22.png)
* Note that we **can't move it any further, into the IF stage**, since there is no data to compare there
#### 2.Static Prediction
* We **always assume that the branch is not taken**; even when it is taken, we only need to stall 1 clock to compensate
* For the datapath to work, another forwarding path from the **EX stage to the ID stage** must be added
> ![](https://hackmd.io/_uploads/HyJbSbP2h.png)
> * There's a new control line, **IF.Flush**, to clear the IF/ID register (turning the instruction into a NOP)
>
> Example:![](https://hackmd.io/_uploads/HktmObw3h.png)
> It assumes that the branch is taken => only lw, beq, slt are executed
> clock = $(5-1)+3+\mathbf{2}+1 = 10$
> * Unlike the usual load-use case, the branch has been moved to the ID stage, so we need to wait one more clock, i.e. **stall for 2 clocks**
> ![](https://hackmd.io/_uploads/Hy5Sh-D2h.png)
> Different Forwarding Paths
> * To EX
> ![](https://hackmd.io/_uploads/HJF13-P2h.png)
> * To ID
> ![](https://hackmd.io/_uploads/SkEW3-w32.png)
> * To MEM (see 台聯101)
> 
![](https://hackmd.io/_uploads/SkV42bv33.png)
#### 3.Dynamic Branch Prediction
* We use a **BHT (Branch History Table)**, also called a Branch Prediction Buffer, to store the past behavior of each branch instruction
* It is used along with the **BTB (Branch Target Buffer)**, which stores the branch target of each branch instruction and so avoids the additional clock needed to calculate the branch address
> ![](https://hackmd.io/_uploads/S1BjWUd2n.png)

1. **1-bit Prediction**
    > ![](https://hackmd.io/_uploads/rJm-QLunh.png)
    * If the prediction is wrong **once**, the prediction bit is inverted
2. **2-bit Prediction**
    > ![](https://hackmd.io/_uploads/H1cL4Id3n.png)
    * If the prediction is wrong **twice**, the prediction is changed

* These schemes are **saturating counters**, so the prediction process can be seen as addition/subtraction (with 2 bits: 2/3 => Jump, 0/1 => N-Jump)
> Example:![](https://hackmd.io/_uploads/SJmbPUdn3.png)
>
>| 1Bit | T | T | T | N | N | T | T | T |
>| --- |:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
>| State | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
>| Correct? | $\times$ | $\checkmark$ | $\checkmark$ | $\times$ | $\checkmark$ | $\times$ | $\checkmark$ | $\checkmark$ |
>
> => 3 mispredictions
>
>| 2Bit | T | T | T | N | N | T | T | T |
>| --- |:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
>| State | 0 | 1 | 2 | 3 | 2 | 1 | 2 | 3 |
>| Correct? | $\times$ | $\times$ | $\checkmark$ | $\times$ | $\times$ | $\times$ | $\checkmark$ | $\checkmark$ |
>
> => 5 mispredictions
>
> Example:![](https://hackmd.io/_uploads/SJHVFt_n3.png)
> 1). Always taken: $\frac{3}{5} =60\%$ Always not taken: $\frac{2}{5}=40\%$
> 2).
>
>| 2Bit | T | T | T | N |
>| --- |:-:|:-:|:-:|:-:|
>| State | 0 | 1 | 2 | 3 |
>| Correct? | $\times$ | $\times$ | $\checkmark$ | $\times$ |
>
> => $\frac{1}{4} = 25\%$
> 3). 
>
>| 2Bit | T | T | T | N | N |
>| --- |:-:|:-:|:-:|:-:|:-:|
>| R1 | 0 | 1 | 2 | 3 | 2 |
>| R2 | 1 | 2 | 3 | 3 | 2 |
>| R3 | 1 | 2 | 3 | 3 | 2 |
>
> => $\frac{2}{5} = 40\%$
>
> Example:![](https://hackmd.io/_uploads/By9MTKu33.png)
>
>| | Always taken | Not Taken | Dynamic |
>|-|:---:|:-:|:-:|
>|$f=5\%$| 5% | 95% | 90% |
>|$f=95\%$| 95% | 5% | 90% |
>|$f=70\%$| 70% | 30% | 90% |
>
>1). Not Taken
>2). Always Taken
>3). Dynamic
#### **4.Other Prediction Methods**
* *Correlating Predictor* : combines global and local information to predict branches
* *Tournament Predictor* : uses several different predictors together to predict branches
### Delayed Branch
* Works by **replacing the NOP with a safe instruction** (an instruction that is not affected by whether the branch is taken or not)
* The selection of safe instructions is performed by the **compiler**
* We need to confirm that the registers/data of the safe instruction are not the same as those used by the branch instruction
* There are 3 strategies

1. *From Before*
    * Choose an **instruction before the branch**
    >![](https://hackmd.io/_uploads/S1_9ejt2h.png)=>![](https://hackmd.io/_uploads/HyF2eoK23.png)
2. *From Target*
    * Choose instructions **at the branch target**
    * **Adopted when the branch is likely to be taken**
    > ![](https://hackmd.io/_uploads/H1bPZstn2.png)=>![](https://hackmd.io/_uploads/HJX3ZjYnh.png)
    * The **label needs to move down**, since the originally labeled instruction has already been executed
3. 
*From Fall Through*
* Choose instructions **from the fall-through path (right after the branch)**
* **Adopted when the branch is unlikely to be taken**
> ![](https://hackmd.io/_uploads/HkKBGjth2.png)=>![](https://hackmd.io/_uploads/B1EUMoF32.png)
* For **best performance we prefer From Before**; the choice between the other two depends on the probability that the beq is taken
* If none of the 3 strategies can find a safe instruction, the only option is to insert a NOP
> Example:![](https://hackmd.io/_uploads/SJUCC1on2.png)
> ![](https://hackmd.io/_uploads/B1Ouwls3n.jpg)
>
> Example:![](https://hackmd.io/_uploads/H1scvlo23.png)
>![](https://hackmd.io/_uploads/S11TDeih2.png)
> ![](https://hackmd.io/_uploads/HJCNGWo2h.jpg)
>
> Example:![](https://hackmd.io/_uploads/H1RPf-j3n.png)
>note. An unconditional branch is a JUMP instruction (it always has a hazard)
>note2. No instruction is fully decoded until the end of the ID2 stage
> If the branch prediction is wrong, we need to turn IF1, IF2, ID1, ID2, EX1 into NOPs => penalty is 5![](https://hackmd.io/_uploads/BkMDvbsn2.jpg)
# Comparison of Performance
* The **base CPI of a pipelined machine is 1** (ideally); to get the real CPI we need to take things like hazards into account.
> ![](https://hackmd.io/_uploads/BJpn1Mjh2.png)
>
> Example:![](https://hackmd.io/_uploads/r11GlGo3h.png)
> For single:
> 　The longest cycle time => lw => PIRADMR => 200+50+100+200+50 = 600ps
> For multiple:
> 　Cycle time = 200ps, $CPI_{Avg} = 0.25*5+0.1*4+0.52*4+0.11*3+0.02*3 = 4.12$
> 　=> Execution time = 4.12 * 200 = 824ps
> For Pipeline:
> 　Cycle time = 200ps, $CPI_{Avg} = 1+(0.25*0.5*1+0.11*0.25*1+0.02*1) = 1.17$
> 　=> Execution time = 1.17 * 200 = 234ps
>
> Example:![](https://hackmd.io/_uploads/SJXJKfs33.png)
>![](https://hackmd.io/_uploads/HyxetGi2n.png)
>1). MEM stage => if mispredicted => 3 penalty => 0.15 * 0.6 * 3 = 0.27
>2). 0.15 * 0.4 * 3 = 0.18
>3). 0.15 * 0.2 * 3 = 0.09
>4). Before = 1.09, After = 1 + 0.075 * 0.2 * 3 = 1.045 => $Speedup = \frac{1.09}{1.045} = 1.043$
>5). 
In this case the second ALU instruction can be seen as a 1-clock delay
>　=> 1 + 0.075 * 0.2 * 3 + 0.075 * 1 = 1.12 => $Speedup = \frac{1.09}{1.12} = 0.97$
> 6). $0.8 + 0.2 * x = 0.8 \Rightarrow x = 0$
# Exception
* An **unscheduled event from inside the CPU is called an Exception**; on the other hand, an **event from outside the CPU is called an Interrupt**
* This definition of interrupt is different from the one in the OS course.
> ![](https://hackmd.io/_uploads/H1Wdlr3n3.png)
> *A System Call is an exception here*
## Solve the Exception
* There are 3 steps to handle an exception

1. Save the *address of the offending instruction* in the *Exception Program Counter (EPC)*
2. Save the cause of the exception in the *Cause Register*
3. Transfer control to the OS at a specific address
>![](https://hackmd.io/_uploads/H1ITQBn32.png)
> It will branch to a certain address according to the cause of the exception

* The OS will either terminate the program or continue
* To continue there are five steps (assume there's only overflow)

1. Flush the instructions in the IF, ID, and EX stages
2. Let the instructions before the excepting instruction complete
3. Store the memory address (PC+4)
    > note. Since the next instruction address (PC+4) is stored in the pipeline register, we can derive the current address as (PC+4)-4
4. Call the OS to handle it
5. Return to the user program
>![](https://hackmd.io/_uploads/SJR5gP23n.png)=>
>![](https://hackmd.io/_uploads/Syynev322.png)
> Datapath:![](https://hackmd.io/_uploads/Bkt0fw33n.png)
>
> Example:![](https://hackmd.io/_uploads/SyMWQDhhh.png)
> The sub/and/or instructions will complete their job
> It will store 50 into EPC
>
> In the 6th clock
>
>| IF | ID | EX | MEM | WB |
>| -- | -- | -- | :---: | -- |
>| lw | slt| add| or | and|
>
>In the 7th clock
>
>| IF | ID | EX | MEM | WB |
>| -- | -- | -- | :---: | -- |
>| sw | NOP| NOP| NOP | or |

* The more stages there are, the more difficult it is to detect precisely where the exception/interrupt happened
> note. 
*MIPS implements precise exceptions/interrupts*
# Advanced Pipeline
* Pipelining is a method of executing instructions in parallel, also known as **Instruction-Level Parallelism (ILP)**
* There are two ways to increase the amount of ILP

1. Increase the pipeline depth (superpipelining), i.e. add more stages to the pipeline
    >![](https://hackmd.io/_uploads/S1iEOwhn2.png)
2. Replicate the pipeline hardware (n-issue pipeline)
    > ![](https://hackmd.io/_uploads/Hy7LdPnh2.png)
    > ![](https://hackmd.io/_uploads/S1dudv332.png)

* The **compiler and the hardware can guess the properties of instructions**, which lets us change the order of execution. This process of guessing is known as **Speculation** (like guessing whether a branch is taken, as seen previously)
* To recover the previous state after a wrong guess:
    1. Software: use instructions to check the result and a fix-up routine to recover
    2. Hardware: use an additional buffer to store the speculative results; if the guess was correct, use them directly, otherwise flush them
## Multiple Issue
### Static Multiple Issue
#### MIPS64
* Packs two instructions together, so one additional set of hardware is needed
> ![](https://hackmd.io/_uploads/HynAK-0h2.png)
* In some cases **unrolling the loop can improve efficiency** (it avoids control hazards)
> Example:![](https://hackmd.io/_uploads/By74YICh3.png)(Assume that the loop runs 16 times)
> ![](https://hackmd.io/_uploads/BkudiU0n2.jpg)
> => 4 clocks for 1 iteration => 4 * 16 = 64 clocks
> If we unroll all 16 iterations and process 4 elements together =>
> ![](https://hackmd.io/_uploads/ryh1x_0h2.png)=>![](https://hackmd.io/_uploads/B1dfvOAnh.png)
> The ***addi s1 -16*** sets the reference position for the next 4 elements, and it is more efficient to place the addi at the front => change the offset of each element
> ![](https://hackmd.io/_uploads/rkF_d_Ahh.png)
* The sw and lw of different elements seem to have a data dependency on t0, but since no data flows between different elements, there is no hazard
in this case
> ![](https://hackmd.io/_uploads/rkF_d_Ahh.png)
* The scenario above is known as **Antidependence/Name Dependency**. We can solve it by simply **changing the name of the register (Register Renaming)**. This is done by the compiler
> ![](https://hackmd.io/_uploads/HyEi9u02h.png)
> ![](https://hackmd.io/_uploads/SyW-DKR2n.jpg)
> => 8 * 4 = 32 clocks => CPI = 8/14 = 0.57
#### Very-Long Instruction Word(VLIW)
* The compiler chooses instructions and **packs them together into one instruction**
>![](https://hackmd.io/_uploads/S1OL2bAhh.png)
* **Each instruction accesses only one functional unit** (i.e. there are multiple functional units)
> ![](https://hackmd.io/_uploads/B1KVhWC23.png)
* With this kind of design, we can add more hardware and make use of it
> ![](https://hackmd.io/_uploads/rklbab03h.png)
##### IA64
* It's a **RISC**-style ISA, also known as **Explicitly Parallel Instruction Computer (EPIC)**, and is inspired by VLIW
* Has 128 integer/floating-point registers, 8 special-purpose (SP) registers, and 64 condition registers
* Instructions are encoded into a **bundle** of **128 bits**, and the process is done by the **hardware**. Each bundle has 5 bits to specify the operations, and every instruction inside the bundle has 41 bits
* For **software there is the instruction group**, which consists of instructions without data dependencies
* Has special **predication** instructions to help unroll loops
> ![](https://hackmd.io/_uploads/Bk3YOi162.png)
### Dynamic Multiple Issue
#### Superscalar
* A technique that lets the pipeline run more than one instruction per clock by selecting instructions at runtime (i.e. it reduces the stalls in the pipeline)
* Mainly done by the hardware, but the compiler still plays a role
> ![](https://hackmd.io/_uploads/r17dYjkT3.png)
> * Can be seen as **3 stages**
> * The reservation station can be seen as a queue waiting to use the FU; it holds all the operands
> * The Commit Unit writes the results back
* It allows a later instruction to finish execution before the former
instruction (i.e. **WAW/WAR will happen**). This situation is also known as **Out-of-Order Execution**
* To avoid writing data incorrectly, the commit unit has a **Reorder Buffer** that writes the data back in the correct order (**In-order Commit**)
* Since the result of each instruction is stored in the commit unit, it can be used for forwarding, and **Register Renaming** (see [here](#out-of-order-execution)) is accomplished that way.
* For the pros, see [THIS](https://drive.google.com/file/d/1-PDIC7Lpc-H_jEa-nGvfIRLD-CZrL5hG/view?usp=drive_link) p. 325
##### Out-of-Order Execution
* Causes WAW (output dependency) and WAR (antidependency)
* Both are caused by **storage conflicts** (false dependencies), which means they **can be solved by register renaming**
> ![](https://hackmd.io/_uploads/rJgoJLlan.png)
> Renaming solves it while keeping the true dependency: R3a = R3 * R5
> 　　　　　　　　　　　　　　　R4 = R3a + 1
>
> Example:![](https://hackmd.io/_uploads/B1OPlLla3.png)
> (1). Both (2). Both (3). Soft (4). Hard (5). Hard
> (6). Hard (7). Both (8). Both (9). Hard (10). Both (in superscalar)
> (11). Soft
>
> Example:![](https://hackmd.io/_uploads/BkgFNLla2.png) ![](https://hackmd.io/_uploads/HJqYNLe62.png)
> ![](https://hackmd.io/_uploads/rkebGDlpn.png)
> 1). 8 ports
> 2). $\frac{((5-1)+N)*T}{((10-1)+N/4)*T/2} \simeq 8$
> 3). Since consecutive instructions have hazards => one pack can hold only 1 instruction 　　=>$\frac{((5-1)+N)*T}{((10-1)+N)*T/2} \simeq 2$
> 4). Probability of a misprediction = 0.15 => one misprediction occurs every $\frac{1}{0.15}(frequency) = 66.7$ instructions => 66.7/4 = 16.7 cycles
> => penalty = $6+ \frac{\frac{3}{4}+\frac{2}{4}+\frac{1}{4}+\frac{0}{4}}{4}$ (if the instruction sits in the last issue slot, all the other instructions in the same stage must be flushed as well, i.e. 3 of them)
> => Stall percent = $\frac{6}{16.7+6.4} = 26\%$ (the 6 is 6.4 rounded down)
> 5). loss 0.1 => (6.4 * 4) / 0.1 = 256 (need 256 instructions between mispredictions) => 1 / (256 * 0.3) $\approx$ 1.3%
> 6). loss 0.5 => (6.4 * 4) / 0.5 = 51.2 => 1 / (51.2 * 0.3) $\approx$ 6.5%
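
The register renaming idea from the Out-of-Order Execution notes above can be sketched in Python. This is only an illustrative sketch, not how real hardware or a real compiler implements it: the `rename` function, the `(dest, sources)` instruction format, the `_1`/`_2` suffix scheme, and the three-instruction `program` are all assumptions made up for this example. Every write gets a fresh name, so WAW and WAR conflicts on the same architectural register disappear while true (RAW) dependencies are preserved.

```python
def rename(instrs):
    """Rename destination registers so each one is written exactly once.

    instrs: list of (dest, [source registers]) in program order (hypothetical format).
    Returns a new list in the same format with renamed registers.
    """
    latest = {}   # architectural register -> most recent renamed version
    version = {}  # architectural register -> number of writes seen so far
    renamed = []
    for dest, sources in instrs:
        # Sources read the newest version written so far (keeps true dependencies).
        new_sources = [latest.get(src, src) for src in sources]
        # Each write allocates a fresh name (removes WAW/WAR conflicts).
        version[dest] = version.get(dest, 0) + 1
        new_dest = f"{dest}_{version[dest]}"
        latest[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


# Hypothetical sequence: R3 = R3 * R5 ; R4 = R3 + 1 ; R3 = R5 + 1
# (immediates are omitted, only register operands are listed)
program = [("R3", ["R3", "R5"]), ("R4", ["R3"]), ("R3", ["R5"])]

if __name__ == "__main__":
    for dest, sources in rename(program):
        print(dest, "<-", sources)
```

In this assumed sequence the third instruction no longer conflicts with the first two, because it writes `R3_2` instead of the register that the second instruction still has to read.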