# Instruction-Level Parallelism and Its Exploitation
###### tags: `Computer Architecture`
## Basic Compiler Techniques for Exposing ILP (Unrolling)
To avoid a pipeline stall, the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
Assumption: all functional units are fully pipelined or replicated, so there are no structural hazards.
We will rely on the following code segment:


We can see that this loop is parallel by noticing that the body of each iteration is independent.
1. Straightforward RISC-V code without any scheduling:
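The listing is absent from the notes; a sketch of the usual unscheduled RISC-V loop (the register assignments — x1 as the array pointer, x2 as the loop bound, f2 holding the scalar — are assumptions following textbook convention):

```asm
Loop: fld    f0,0(x1)     # f0 = x[i]
      fadd.d f4,f0,f2     # f4 = x[i] + s
      fsd    f4,0(x1)     # x[i] = f4
      addi   x1,x1,-8     # advance pointer by one double (8 bytes)
      bne    x1,x2,Loop   # repeat until x1 == x2
```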

2. Stalls needed for the straightforward RISC-V code:
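A sketch of where the stalls fall, assuming the usual textbook latencies (one stall between an FP load and a dependent FP operation, two stalls between an FP operation and a dependent store):

```asm
Loop: fld    f0,0(x1)
      stall               # fld -> fadd.d latency: 1 cycle
      fadd.d f4,f0,f2
      stall               # fadd.d -> fsd latency: 2 cycles
      stall
      fsd    f4,0(x1)
      addi   x1,x1,-8
      bne    x1,x2,Loop   # 8 clock cycles per element
```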

3. Schedule the loop to obtain only two stalls:
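A sketch of the scheduled loop, assuming the same latencies as above:

```asm
Loop: fld    f0,0(x1)
      addi   x1,x1,-8     # moved up to fill the slot after fld
      fadd.d f4,f0,f2
      stall
      stall
      fsd    f4,8(x1)     # offset adjusted because addi already ran
      bne    x1,x2,Loop   # 7 clock cycles per element
```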

**The two stalls after fadd.d satisfy the latency required by fsd, and repositioning the addi fills the slot after the fld, removing that stall.**
**Definition:**
**Loop unrolling**: a simple scheme for increasing the number of instructions relative to the branch and overhead instructions. Unrolling replicates the loop body multiple times and adjusts the loop termination code.
*Why does scheduling work better on unrolled code?*
Ans: Unrolling eliminates the intermediate branches, allowing instructions from different iterations to be scheduled together.
Unrolling the previous example:
Assuming x1 - x2 is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4.
Ans: First, merge the addi instructions and drop the unnecessary bne operations that are duplicated during unrolling.
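A sketch of the loop unrolled four times (the extra register names f6–f16 are assumptions, chosen to avoid reuse so scheduling later has freedom):

```asm
Loop: fld    f0,0(x1)
      fadd.d f4,f0,f2
      fsd    f4,0(x1)      # addi and bne dropped here
      fld    f6,-8(x1)
      fadd.d f8,f6,f2
      fsd    f8,-8(x1)     # addi and bne dropped here
      fld    f10,-16(x1)
      fadd.d f12,f10,f2
      fsd    f12,-16(x1)   # addi and bne dropped here
      fld    f14,-24(x1)
      fadd.d f16,f14,f2
      fsd    f16,-24(x1)
      addi   x1,x1,-32     # one pointer update for four elements
      bne    x1,x2,Loop    # 14 instructions
```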

Without scheduling, every FP load or operation in the unrolled loop is followed by a dependent operation and thus causes a stall. This unrolled loop runs in 26 clock cycles: each fld has 1 stall, each fadd.d has 2 stalls, plus 14 instruction issue cycles (4 × 1 + 4 × 2 + 14 = 26), an average of 6.5 cycles per element.
Schedule the previous unrolled example:
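A sketch of the scheduled unrolled loop: the loads are grouped first, then the adds, then the stores, so no dependent pair issues back to back (register names continue the assumptions above):

```asm
Loop: fld    f0,0(x1)
      fld    f6,-8(x1)
      fld    f10,-16(x1)
      fld    f14,-24(x1)
      fadd.d f4,f0,f2
      fadd.d f8,f6,f2
      fadd.d f12,f10,f2
      fadd.d f16,f14,f2
      fsd    f4,0(x1)
      fsd    f8,-8(x1)
      fsd    f12,-16(x1)
      addi   x1,x1,-32
      fsd    f16,8(x1)     # 8 = 32 - 24: addi already moved the pointer
      bne    x1,x2,Loop    # 14 cycles, no stalls
```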

This code needs no stalls. The execution time of the unrolled loop has dropped to a total of 14 clock cycles, an average of 3.5 cycles per element.
**The gain from scheduling on the unrolled loop is even larger than on the original loop (the explanation is given above).**
## Overcoming Data Hazards With Dynamic Scheduling
### Dynamic scheduling: a technique by which the hardware reorders instruction execution to reduce stalls while maintaining data flow and exception behavior
> advantages:
> 1. it allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline
> 2. it enables handling some cases in which dependences are unknown at compile time
> 3. it allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve.
Although a dynamically scheduled processor cannot change the data flow, it
tries to avoid stalling when dependences are present. In contrast, static pipeline
scheduling by the compiler (covered in Section 3.2) tries to minimize stalls by separating dependent instructions so that they will not lead to hazards.
### Idea of dynamic scheduling
Consider this code:
> fdiv.d f0,f2,f4
> fadd.d f10,f0,f8
> fsub.d f12,f8,f14
The fsub.d instruction cannot execute because the dependence of fadd.d on fdiv.d causes the pipeline to stall. Yet fsub.d is not data dependent on anything in the pipeline. This hazard creates a performance limitation that can be eliminated by not requiring instructions to execute in program order.