# Quiz 6 Problem X - Oral Test
:::info
**AI Tools Usage**
Use Gemini for grammar correction
:::
## Question 9
### Why is branch resolution often moved earlier in modern pipelines?
#### Answers
1. Branch Misprediction Penalty
In this slide from the lecture materials, the updated PC is only written back during the MEM stage.
* When a branch is taken or a jump occurs, the instructions that entered the pipeline right after it are incorrect.
* Under the default design, we lose 3 cycles to no-ops because, by the time the updated PC is known in the MEM stage, three unwanted instructions have already entered the pipeline.
* To fix this, the processor must convert these incorrect instructions into no-ops.
* Moving resolution earlier, e.g., to the ID stage, reduces the number of instructions that need to be flushed, thereby decreasing the penalty (wasted cycles) when a misprediction occurs.
2. Reduced Cost of Control Hazards
* The later a branch is resolved, the longer the pipeline must stall, or the more no-ops must be inserted, to ensure correct results.
* Every control hazard that requires a stall or a flush raises the average CPI above the ideal value of 1.
* Increasing pipeline depth, for example to 10 or 15 stages, increases the potential for all types of hazards. In such pipelines, resolving branches early is critical to prevent the CPI from rising significantly due to long control hazard stalls.
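The two points above can be put into rough numbers. The following is a minimal sketch, assuming a classic five-stage pipeline (IF, ID, EX, MEM, WB) that keeps fetching sequentially until a taken branch is resolved. The function names and the instruction mix (20% branches, 60% of them taken) are illustrative assumptions, not values from the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def flushed_instructions(resolve_stage):
    """Wrong-path instructions fetched before a taken branch is resolved."""
    return STAGES.index(resolve_stage) - STAGES.index("IF")

def effective_cpi(resolve_stage, branch_fraction=0.2, taken_fraction=0.6):
    """Ideal CPI of 1 plus the average flush penalty per instruction.

    branch_fraction and taken_fraction are assumed, hypothetical values.
    """
    penalty = flushed_instructions(resolve_stage)
    return 1.0 + branch_fraction * taken_fraction * penalty

for stage in ("MEM", "EX", "ID"):
    print(stage, flushed_instructions(stage), round(effective_cpi(stage), 2))
# MEM 3 1.36
# EX 2 1.24
# ID 1 1.12
```

Under these assumptions, each stage by which resolution moves earlier removes one flushed instruction per taken branch, which is why the effect grows with pipeline depth.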
#### Discussion using [MyCPU (3-pipeline)](https://github.com/moonchin819/ca2025-mycpu/blob/main/3-pipeline/src/main/scala/riscv/core/fivestage_final/Control.scala)
`Control.scala` in `fivestage_final` from MyCPU provides evidence for this question.
It shows the hardware logic for moving branch resolution earlier, to the ID stage.
A traditional design resolves branches in the EX stage, which requires flushing 2 instructions when a branch is taken.
This design moves branch resolution to the ID stage, so only 1 instruction needs to be flushed.
#### Examples
I ran some experiments on Ripes and looked into its 5-Stage RISC-V Processor.
When a branch instruction executes in a pipelined CPU, the branch outcome is known quite late, but the pipeline has already started fetching the following instructions.
In a case like this:
```
beq x1, x2, target # branch instruction
add x3, x4, x5 # instruction that may be fetched incorrectly
target:
sub x6, x7, x8
```
The `add` instruction has already been fetched by the time the branch result becomes known.
I load immediates into registers x1 and x2 so that `beq` is taken.

Here, `beq` in the EX stage has decided to branch, so the next instruction, `add`, is flushed.

The PC is then updated to the address of `target`, and the `sub` instruction is fetched.
The total cycle count from `beq` to `sub` at `target` is 8 because of the flush.

(This is after subtracting the 2 cycles used to load the immediates into x1 and x2.)
Because of this penalty, branch resolution is moved earlier in modern pipelines, so that fewer wrongly fetched instructions have to be flushed when a branch is taken.
Based on:
* Exercise 19.
* Exercise 21. Q4
#### Lecture Materials
[Lectures 21-23: RISC-V 5-Stage Pipeline](https://docs.google.com/presentation/d/1v-Squx8lK-oOrflFOwBZh-ue94seVHZudqDOgJmFf5Q/edit?slide=id.g2fa69f884ef_0_2234#slide=id.g2fa69f884ef_0_2234)
Section. Control Hazards
P.71 - P.78
Section. Superscalar Processors
P.81 - P.84
## Question 10
### Why can full forwarding still fail to eliminate stalls?
#### Answer
The main reason is that **data cannot be forwarded backward in time**.
In other words, when the data required by an instruction has not been calculated or retrieved at the moment it must be used, no amount of forwarding paths can solve the problem.
1. Load-Use Hazard
* The data of `LW` is read from memory only at the end of the MEM stage.
* If the following instruction needs this data in its EX stage, `LW` is only in its MEM stage at that point, so the data has not been produced yet.
* As a result, even with forwarding, the data is produced later than it is needed, so the processor is forced to insert a 1-cycle stall.
2. Load-Branch Hazard
This hazard gets worse if branch resolution is moved earlier, into the ID stage.
* The branch instruction then needs to compare the register values in the ID stage.
* If the previous instruction is `LW`, its data is produced at the end of MEM, which is 2 stages after ID.
* This results in 1 or 2 stall cycles, because forwarding cannot deliver data from the future to an ID stage in the past.
3. Jump Register Dependency
* Assume a `jalr` needs the value of `x1` to jump to the target address.
* Suppose `x1` is produced by the previous instruction and is not ready yet.
* This can cause a stall, since the jump address must be calculated in the ID stage.
So, based on the three scenarios above, my conclusion is:
Although full forwarding resolves most dependencies, such as an `add` following a `sub`, it still cannot overcome the inherent latency of the `LW` instruction.
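The three scenarios can be checked with a small timing model. This is a minimal sketch, assuming a five-stage pipeline with full forwarding in which a value produced at the end of one stage can be forwarded at the start of the next cycle. The helper name `stalls_with_forwarding` and its parameters are hypothetical, introduced only for illustration.

```python
STAGES = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}

def stalls_with_forwarding(produced_at_end_of, needed_at_start_of, distance=1):
    """Stall cycles for a consumer issued `distance` instructions after the producer.

    Cycles are counted relative to the cycle in which the producer is fetched.
    """
    ready = STAGES[produced_at_end_of] + 1            # first cycle the value can be forwarded
    needed = distance + STAGES[needed_at_start_of]    # cycle the consumer wants the value
    return max(0, ready - needed)

print(stalls_with_forwarding("EX", "EX"))                # sub -> add: 0
print(stalls_with_forwarding("MEM", "EX"))               # lw -> add: 1
print(stalls_with_forwarding("MEM", "ID"))               # lw -> beq resolved in ID: 2
print(stalls_with_forwarding("MEM", "ID", distance=2))   # one instruction in between: 1
print(stalls_with_forwarding("EX", "ID"))                # add -> jalr resolved in ID: 1
```

In this model, a stall appears whenever the producing stage lies at or after the consuming stage once the instruction distance is accounted for, which matches the load-use, load-branch, and jump-register cases above.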
#### Discussion using MyCPU (3-pipeline)
This question can also be discussed using the program file mentioned earlier.
The example scenarios in the program file, including the examples answered in detail above, all emphasize that the `LW` instruction may be the major blind spot of full forwarding.
Based on:
* Exercise 19.
* Exercise 21. Q3
#### Examples
Here are some experiments, also done on Ripes, that illustrate the pipeline issues in Question 10.
**For Load-use hazard**
```
lw x1, 0(x2) # load data into x1
add x3, x1, x4 # the next instruction uses x1 immediately, causing a stall
```
When `lw` reaches the EX stage and `add` reaches the ID stage, everything still looks normal.

But `add`, now in the ID stage, needs the data loaded by `lw`, and that data becomes available only after `lw` passes the MEM stage.

So a nop is inserted between `add` and `lw`, namely a stall.
This results in 7 cycles in total.
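The cycle count can be cross-checked with the usual pipeline timing formula. This is a minimal sketch, assuming a five-stage pipeline that starts empty and counting only the two instructions above; `total_cycles` is a hypothetical helper name, and the result matches the Ripes count only if Ripes counts cycles the same way.

```python
def total_cycles(num_instructions, bubbles=0, depth=5):
    """Cycles until the last instruction leaves the final stage.

    The first instruction needs `depth` cycles; each later instruction and
    each inserted bubble adds one more cycle.
    """
    return depth + (num_instructions - 1) + bubbles

print(total_cycles(2))             # lw + add without a stall: 6
print(total_cycles(2, bubbles=1))  # lw + add with the load-use stall: 7
```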

**For Load-branch hazard**
Here's the state when the first instruction (`lw`) reaches the EX stage.

There is no stall so far, but `beq` needs the value loaded by `lw` to determine whether to branch.
```
lw x1, 0(x2) # load data
beq x1, x0, target # Branch depends on loaded value
target: # branch target label
li t0, 0 # the code at the target does not matter here
```
As in the load-use case, the value loaded by `lw` is available only after it passes the MEM stage.

Therefore, a stall is still needed to handle the hazard in this case.
#### Lecture Materials
[Lectures 21-23: RISC-V 5-Stage Pipeline](https://docs.google.com/presentation/d/1v-Squx8lK-oOrflFOwBZh-ue94seVHZudqDOgJmFf5Q/edit?slide=id.g2fa69f884ef_0_2234#slide=id.g2fa69f884ef_0_2234)
Section. Fixing Data Hazards: Forwarding
P.63 - P.65
[MIT 6.004 L16: Processor Pipelining](https://www.youtube.com/watch?v=TMpjvAvQCWA)
Section. Load-to-Use Stalls