# Annotate and explain Quiz4
contributed by < `Cheng Yu` >
###### tags: `Computer Architure`
## Question `A`
:::spoiler **Description**
For each of the questions below, create a valid **N-stage pipeline** of the given circuit. Each component in the circuit is annotated with its **propagation delay**. Give the latency and throughput of each design, assuming ideal registers (t~PD~ = 0, t~SETUP~ = 0). Remember that our convention is to place a pipeline register on each output.
> **Latency** : time to finish a fixed task.
> **Throughput** : number of tasks in fixed time
:::
#### (1) Show the maximum-throughput 1-stage pipeline.

Latency (ns): __ A01 __
Throughput (ns^-1^): __ A02 __
:::spoiler solution
> Since we can only use **1-stage** pipeline in this question, so the six units need to be in the **same stage**.
> Like the picture below :
> 
> So the **latency** (longest path) in this pipeline is **3->4->2->2 = 11(ns).**
> And the **maximum-throughput** is **1/11(ns)**.
:::
#### (2) Show the maximum-throughput 2-stage pipeline using a minimal number of registers.

Latency (ns): __ A03 __
Throughput (ns^-1^): __ A04 __
:::spoiler solution
> The **longest propagation delay** of **one unit** in this question is 4ns, but there is no way to narrow **critical path** to 4ns (either 5, 6) just by 2-stage pipeline.
> So we find the **minimum critical path** (7ns) like the picture below :
> 
> The **longest path** in a **single stage** is **3->4 = 7(ns)**, so the **latency** in this pipeline is **7 * 2 = 14(ns).**
> The **maximum-throughput** is **1/7(ns)**.
:::
#### (3) Show the maximum-throughput pipeline using a minimal number of registers.

Latency (ns): __ A05 __
Throughput (ns^-1^): __ A06 __
:::spoiler solution
> Since we can separate this circuit to **any stages pipeline** we want, the **maximum-throughput** means the **critical path** in a **single stage** needs to be the **smallest (4ns)**.
> So we separate the circuit like the picture below :
> 
> The **longest path** in a **single stage** is **4(ns)**, so the **latency** in this pipeline is **4 * 3 = 12(ns).**
> The **maximum-throughput** is **1/4(ns)**.
:::
#### (4) You manage to reimplement the slowest combinational component in the previous circuit (the one with a propagation delay of 4 ns) using two components with propagation delays of 2 ns, as shown below. Show the maximum-throughput pipeline using a minimal number of registers.

Latency (ns): __ A07 __
Throughput (ns^-1^): __ A08 __
:::spoiler solution
> Since the **longest propagation delay** of **one unit** in this question become 3ns, so our goal is making the **critical path** in a **single stage** to be 3ns.
> So we separate the circuit like the picture below :
> 
> The **longest path** in a **single stage** is **3(ns)**, so the **latency** in this pipeline is **3 * 4 = 12(ns).**
> The **maximum-throughput** is **1/3(ns)**.
:::
---
## Question `B`
:::spoiler **Description**
For each of the questions below, create a valid **N-stage pipeline** of the given circuit. Each component in the circuit is annotated with its **propagation delay** in nanoseconds. Give the latency and throughput of each design, assuming ideal registers (tPD = 0, tSETUP = 0). Remember that our convention is to place a pipeline register on each output.
> **Latency** : time to finish a fixed task.
> **Throughput** : number of tasks in fixed time
:::
#### (1) Show a maximum-throughput pipeline that uses the smallest possible number of pipeline stages.

Latency (ns): __ B01 __
Throughput (ns^-1^): __ B02 __
:::spoiler solution
> The **maximum-throughput** means the **critical path** in a **single stage** needs to be the **smallest (4ns)**.
> So we separate the circuit like the picture below :
> 
> The **longest path** in a **single stage** is **4(ns)**, so the **latency** in this pipeline is **4 * 3 = 12(ns).**
> The **maximum-throughput** is **1/4(ns)**.
:::
#### (2) You reimplement the 4 ns combinational component in the previous circuit using two faster components connected in series, shown in **grey** below. You can choose the propagation delays of these two components, as long as their delays add to 4 ns (e.g., they could be 3 ns and 1 ns, 2 ns and 2 ns, etc.) Choose the propagation delays of both components in a way that lets you pipeline the circuit for maximum throughput while minimizing the number of pipeline stages. Then, find the maximum-throughput pipeline. For full credit, your solution should use the minimum possible number of pipeline stages.

Latency (ns): __ B03 __
Throughput (ns^-1^): __ B04 __
:::spoiler solution
> Since we can reimplement the 4ns combinational component using two faster components, so the **longest propagation delay** of **one unit** in this question become 3ns.
> Our goal is making the **critical path** in a **single stage** to become 3ns.
> Then we separate the circuit like the picture below :
> 
> The **longest path** in a **single stage** is **3(ns)**, so the **latency** in this pipeline is **3 * 3 = 9(ns).**
> The **maximum-throughput** is **1/3(ns)**.
:::
---
## Question `C`
:::spoiler **Description**
For each of the questions below, please create a valid **N-stage pipeline** of the given circuit. Each component in the circuit is annotated with its **propagation delay** in nanoseconds. Give the latency and throughput of each design, assuming ideal registers (t~PD~ = 0, t~SETUP~ = 0). Remember that our convention is to place a pipeline register on each output.
> **Latency** : time to finish a fixed task.
> **Throughput** : number of tasks in fixed time
:::
#### (1) Show the maximum-throughput 2-stage pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit?

Latency (ns): __ C01 __
Throughput (ns^-1^): __ C02 __
:::spoiler solution
> The **longest propagation delay** of **one unit** in this question is 6ns, but there is no way to narrow **critical path** to 6ns (either 7, 8, 9) just by 2-stage pipeline.
> So we find the **minimum critical path** (10ns) like the picture below :
> 
> The **longest path** in a **single stage** is **6->4 = 10(ns)**, so the **latency** in this pipeline is **10 * 2 = 20(ns).**
> The **maximum-throughput** is **1/10(ns)**.
:::
#### (2) Show the maximum-throughput pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit?

Latency (ns): __ C03 __
Throughput (ns^-1^): __ C04 __
:::spoiler solution
> The **maximum-throughput** means the **critical path** in a **single stage** needs to be the **smallest (6ns)**.
> So we separate the circuit like the picture below :
> 
> The **longest path** in a **single stage** is **6(ns)**, so the **latency** in this pipeline is **6 * 4 = 24(ns).**
> The **maximum-throughput** is **1/6(ns)**.
:::
---
## Question `D`
Consider the below 5-Stage Pipeline processor.

(1) Even a simple, [in-order](https://en.wikipedia.org/wiki/Out-of-order_execution#In-order_processors) pipelined processor makes use of [speculative execution](https://en.wikipedia.org/wiki/Speculative_execution). For the 5-stage pipeline above, assume that there is no virtual memory, and that misaligned accesses are checked in the Execute stage. For the instruction sequence below, complete the execution diagram and mention the cycles in which the second add is being executed speculatively. For example, if one stage F is being executed speculatively, use the remark `F (S)`.
| Clock Cycle | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| ----------- | - | - | - | - | - | - | - | - | - | - |
| `add x1, x2, x0` | F | D | X | M | W | - | - | - | - | - |
| `lw x3, 0(x2)` | - | D01 | D02 | D03 | D04 | D05 | - | - | - | - |
| `add x3, x4, x5` | - | - | D06 | D07 | D08 | D09 | D10 | - | - | - |
> D01 = ??
> D02 = ??
> D03 = ??
> D04 = ??
> D05 = ??
> D06 = ??
> D07 = ??
> D08 = ??
> D09 = ??
> D10 = ??
:::spoiler solution
* Because `add x1, x2, x0` and `lw x3, 0(x2)` don't need speculative execution. Then **D01~D05** are **{F,D,X,M,W}**.
* Therefore, if `lw x3, 0(x2)` misaligned address, we need to do `add x3, x4, x5` successively. But misaligned accesses are checked in the execute stage, so F & D in `add x3, x4, x5` need speculative execution. Then **D06~D10** are **{F(s),D(s),X,M,W}**.
:::
#### (2) Given the 5-stage pipeline above, how long is the load-use delay? Answer in terms of how many bubbles must be added between a load and a dependent register-register instruction that is fetched right after the load.
How many bubble(s) must be added? __ D11 __
:::spoiler solution
* Because the CPU above has **bypass** in **MEM** stage, we only need to add **one** bubble.
:::
---
## Question `E`
Consider the 5-Stage pipeline RISC-V processor.

#### (1) How many cycles does it take to run each iteration of the following loop on a standard 5-stage pipelined RISC-V processor (P1)?
```c
loop: lw x10, 0x100(x0)
beqz x10, loop
add x12, x10, x11
sub x13, x12, x1
```
Number of cycles per loop iteration: __ E01 __
:::spoiler solution
> Because the CPU above have **no bypass** in **MEM** stage, so we need to add **2 bubbles** after **lw** instruction. Then the result of **lw** can **write back** to **ID** stage on time.
> 
> In addition, the **branch taken** is determined in **EXE stage**, so we need to flush **IF & ID** after branch taken.
> 
> The **stage table** of the loop is like the picture below.
> 
:::
#### (2) Consider a modified processor, P2, which has extra hardware for the special case of checking if a register is equal to zero or not in the decode stage. What would be the number of cycles per loop iteration in this case?
Number of cycles per loop iteration on processor P2: __ E02 __
:::spoiler solution
> Because we will check if a register is equal to zero or not in the **decode stage** in **P2**, we only need to flush **IF** stage now. Then there will become **only one** **nop** between **beqz** and **lw**.
> The **stage table** of the loop is now like the picture below.
> 
:::
#### (3) Now consider a third processor, P3, whose instruction and data memories are pipelined and take 2 clock cycles to respond. Assume that P3 also has the extra hardware for checking if a register is equal to zero or not in the decode stage. What would be the number of cycles per loop iteration using P3?
NOP Number of cycles per loop iteration on processor P3: __ E03 __
:::spoiler solution
> **P3** has almost the same architecture as **P2**, but **IF & MEM** stage take two cycles now. Fortunately, **brach taken** still make **two nops**. But **lw** need to take **3 cycles** until it can write back on time now.
> The **stage table** of the loop is now like the picture below.
> 
:::
---
## Question `F`
You have been given a 5-stage pipelined RISC-V processor. Unfortunately, the processor you have been given is defective: it has no bypass paths, annulment of instructions in branch delay slots, or pipeline stalls.
```c
nop
nop
nop
nop
Loop: lw x10, 0x0(x10)
AAA:
sll x14, x10, x11
BBB:
bnez x10, loop
CCC:
add x13, x10, x13
nop
nop
nop
nop
```
You undertake to convert some existing code, designed to run on an unpipelined RISC-V, to run on your defective pipelined processor. The scrap of code on above is a sample of the program to be converted. It doesn’t make much sense to you, but you are to add the minimum number of NOP instructions at the various tagged points in this code to make it give the same results on your defective pipelined RISC-V as it gives on a normal, unpipelined RISC-V.
Note that the code scrap begins and ends with sequences of NOPs; thus, you do not need to worry about pipeline hazards involving interactions with instructions outside of the region shown.
#### (1) Specify the minimal number of NOP instructions (defined as `add x0, x0, x0`) to be added at each of the labeled points in the above program.
NOPs at Loop: __ F01 __
NOPs at AAA: __ F02 __
NOPs at BBB: __ F03 __
NOPs at CCC: __ F04 __
:::spoiler solution
> * No need to add any **nops** in **Loop**.
> * Because we don't have any bypass paths, annulment of instructions in branch delay slots, or pipeline stalls. So we need to stall **3** cycles (in **AAA**) until **lw** write back.
> > 這裡是因為R-file只能讀或寫才需要3個cycle? 不然在我的理解一般就算沒bypass也是只需要2個stall就夠了。
> * No need to add any **nops** in **BBB** because there is no hazard.
> * Because **branch taken** is determined in **EXE** stage, we need to wait **2 **cycles in **CCC**.
> The **stage table** of the loop is now like the picture below.
> 
:::
#### (2) On a fully functional 5-stage pipeline (with working bypass, annul, and stall logic), the above code will run fine with no added NOPs. How many clock cycles of execution time are required by the fully functional 5-stage pipelined RISC-V for each iteration through the loop?
Clocks per loop iteration: __ F05 __
:::spoiler solution
> * Now we have working bypass, annul, and stall logic, so we only need to stall **2** cycles between **lw** and **sll**. But **brach taken** still need to wait **2** cycles.
> * The **stage table** of the loop is now like the picture below.
> 
> > If **MEM** can bypass **data memory** to **ID** stage, we only need **one** stall.
:::
---
## Question `G`
The following programs are being executed on the 5-stage pipelined RISC-V processor with full bypassing. For each fragment, the pipeline diagram shows the state of the pipeline for cycle 1000 of execution. Please fill in the diagram for cycle 1001; use `?` if you cannot tell what opcode to write into a stage.
#### (1) Program
```c
...
sw x1, 0(x0)
lw x17, 0xC(x1)
addi x2, x2, -4
slli x11, x17, 2
sw x11, 0(x2)
jal ra, fact
...
```
| Cycle | 1000 | 1001 |
| ----- | ---- | ---- |
| IF | sw | G01 |
| ID | slli | G02 |
| IE | addi | G03 |
| MEM | lw | G04 |
| WB | sw | G05 |
:::spoiler solution
> Because **slli** need **x17** data which is loaded by **lw** two cycles ago, so we need to wait **one** cycle until **lw** get to **WB** stage.
> The table is like the picture below :
> 
> > If **MEM** can bypass **data memory** to **ID** stage, the stall **won't** be needed.
> > Like Ripes :
> > 
:::
#### (2) Program
```c
...
xor x11, x11, x12
slli x12, x12, 3
sub x13, x12, x11
and x12, x13, x11
add x13, x12, x13
sw x13, 0x100(x0)
...
```
| Cycle | 1000 | 1001 |
| ----- | ---- | ---- |
| IF | add | G06 |
| ID | and | G07 |
| IE | sub | G08 |
| MEM | slli | G09 |
| WB | xor | G10 |
:::spoiler solution
> Because we have full bypassing, so R-type data hazard can be solved without adding any nops.
> The table is like the picture below :
> 
> we can see the same result in Ripes :
> 
:::
---
:::info
最後想跟老師說我本身是工科系的大四生,在大三下時才決定要往數位IC這個領域發展,所以專題跟修課經驗非常不足,國內的電子所也都不好推,因此我決定用考試的,但台大卻有二階面試,於是這學期我同時在準備考研的時候,也修了3堂有關IC Design的課,分別是電機系的VLSI系統設計、資訊所的深度學習積體電路和老師的計算機結構,這三堂課的loading真的都不小,而我對於C語言很不熟,linux也是這堂課才開始接觸的,所以中間有些作業光理解就很吃力,而準備考試也讓我兩頭燒,導致最後不是做得很好(老師應該也發現了Q),所以cache那個作業我寫得很認真:sweat_smile:,不過我真的很喜歡這堂課,小考跟作業我覺得都很有意義,完全不後悔修了這堂,因為VLSI的期末就是用verilog實做出一顆pipeline cpu,我也靠著這堂課帶領組員設計了一顆簡單的RISC-V CPU,很希望自己可以有更多時間好好消化這門課,如果之後考試考完我應該會想把沒做完的作業完成(如果惰性沒發作的話),感謝老師這學期的指教和資源。
ps : 很推薦要走數位IC design的人來修這門課。
:::