---
tags: computer-arch
---

# Quiz4 of Computer Architecture (2023 Fall)

> Solutions

## Problem `A`

Take a look at the circuit diagram provided below, along with the associated delays of its components:

![circuit diagram](https://hackmd.io/_uploads/HyopeegVT.png)

| component | delays |
|:----------|:-------|
| t~AND~ | 20ps |
| t~NOT~ | 5ps |
| t~OR~ | 20ps |
| t~XOR~ | 25ps |
| t~multiplier~ | 50ps |
| t~subtract~ | 35ps |
| t~clk-q~ | 5ps |
| t~setup~ | 2ps |

Assume that this circuit operates exclusively on two's complement integers and that the subtractor component has a delay of 35ps. We want to replace the subtractor with adders, NOT gates, and constants while preserving both the behavior and the overall delay of the circuit. What is the maximum delay permissible for the adder component? It can be assumed that constants introduce no delay. As a refresher, the subtract component performs the following operation: output = top input - bottom input

> * A01 = ?
> ==35ps==
> Since we are dealing with two's complement numbers, subtracting x is equivalent to adding ∼x + 1, where ∼x flips all of the bits of x. As a result, we can chain together a NOT gate, an adder with a constant 1, and another adder (to add the output of the previous adder to the top input of the subtractor) to achieve the same behavior.
> The intent of the question is for students to realize that the NOT gate and the first adder do not actually add any delay, since the multiplier/NOT gate combination on the top input of the subtractor takes more time than the NOT gate/adder combination on the bottom input. Therefore, the adder can have a delay of 35ps (the same as the existing subtractor) and the circuit maintains the same timing behavior.
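The identity behind this answer is easy to check in software. Below is a small Scala sketch (an editorial aside, not part of the original quiz) that verifies a - b = a + ∼b + 1 on two's complement `Int` values, mirroring the NOT gate, the +1 adder, and the final adder described above.

```scala
// Check that a - b == a + ~b + 1 for two's complement integers, which is why
// the subtractor can be rebuilt from a NOT gate, an adder with constant 1,
// and a second adder.
object TwosComplementCheck extends App {
  val samples = Seq((7, 3), (-5, 12), (Int.MinValue, 1), (0, -1))
  for ((a, b) <- samples) {
    val viaAdders = a + ~b + 1   // NOT gate, +1 adder, final adder
    assert(a - b == viaAdders)
    println(s"$a - $b = ${a - b} = $a + ~$b + 1 = $viaAdders")
  }
}
```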
---

## Problem `B`

Examine the sequential circuit provided below, along with the associated timing specifications. Registers R1, R2, and R3 are synchronized by a shared clock signal, while A, B, and C represent combinational logic circuits.

![sequential circuit](https://hackmd.io/_uploads/BJoRxlgVT.png)

| - | t~pd~ | t~cd~ | t~setup~ | t~hold~ |
|---|:-----:|:-----:|:--------:|:-------:|
| Register (R1, R2, R3) | 5ns | 1ns | 7ns | 2ns |
| Circuit A | 3ns | ? | - | - |
| Circuit B | 4ns | 2ns | - | - |
| Circuit C | 6ns | 2.5ns | - | - |

1. What is t~pd~ of this sequential circuit? __ B01 __ ns

> * B01 = ?
> ==5==

2. What is t~cd~ of this sequential circuit? __ B02 __ ns

> * B02 = ?
> ==1==

3. What is the minimum value of t~cd~ for Circuit A that will ensure proper operation of this circuit? __ B03 __ ns

> * B03 = ?
> ==1==
> t~cd,R1~ + t~cd,A~ >= t~hold,R2~
> 1 + t~cd,A~ >= 2
> t~cd,A~ >= 1ns

4. What is the minimum clock period that can be employed to clock all the registers within the circuit? __ B04 __ ns

> * B04 = ?
> ==19==
> t~CLK~ >= t~pd,R1~ + t~pd,A~ + t~pd,B~ + t~setup,R3~
> t~CLK~ >= 5 + 3 + 4 + 7
> t~CLK~ >= 19ns

5. Available from the supplier are alternative circuits for A, B, and C with the following specifications.

| - | t~pd~ | t~cd~ | t~setup~ | t~hold~ |
|---|:-----:|:-----:|:--------:|:-------:|
| A-New | 1ns | 0.5ns | - | - |
| B-New | 2.5ns | 2ns | - | - |
| C-New | 2ns | 1ns | - | - |

Replacement is cost-sensitive, so only one combinational circuit may be substituted to minimize the clock period. Please specify which combinational circuit you intend to replace and the resulting minimum clock period.

* Combinational circuit replaced: __ B05 __
* t~CLK~ of circuit: __ B06 __ ns

> * B05 = ?
> ==B==
> * B06 = ?
> ==18==
> To minimize t~CLK~ >= t~pd,R1~ + t~pd,A~ + t~pd,B~ + t~setup,R3~, we want to replace A or B.
> We cannot replace A with A-New because the resulting contamination delay (t~cd,R1~ + t~cd,A-New~ = 1 + 0.5 = 1.5ns) would not satisfy the 2ns hold time of R2. So we replace B, the other component on the critical path. Replacing B reduces the R1-A-B-R3 path to 5 + 3 + 2.5 + 7 = 17.5ns. However, the critical path is now the R2-C-R1 path, which is 5 + 6 + 7 = 18ns.
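For readers who want to replay the arithmetic, here is a minimal Scala sketch (an editorial illustration using the timing tables above, not part of the quiz) that re-derives B03, B04, and B06 from the hold-time and setup-time constraints.

```scala
// Editorial sketch replaying the Problem B timing arithmetic: the hold-time
// constraint on the R1 -> A -> R2 path and the setup-time constraints on the
// two register-to-register paths.
object ProblemBTiming extends App {
  val (tPdReg, tCdReg, tSetup, tHold) = (5.0, 1.0, 7.0, 2.0)
  val (tPdA, tPdB, tPdC, tPdBNew) = (3.0, 4.0, 6.0, 2.5)

  // B03 -- hold: t_cd,R1 + t_cd,A >= t_hold,R2
  val minCdA = tHold - tCdReg                              // 1 ns

  // B04 -- setup: t_CLK >= the longest register-to-register path
  val pathR1AB = tPdReg + tPdA + tPdB + tSetup             // 19 ns
  val pathR2C  = tPdReg + tPdC + tSetup                    // 18 ns
  val minClk   = pathR1AB max pathR2C                      // 19 ns

  // B06 -- after swapping in B-New, the R2 -> C -> R1 path dominates
  val minClkBNew = (tPdReg + tPdA + tPdBNew + tSetup) max pathR2C  // 18 ns

  println(s"B03 = $minCdA ns, B04 = $minClk ns, B06 = $minClkBNew ns")
}
```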
---

## Problem `C`

Consider a module called 'F' with three inputs (X, Y, and Z) and one output (A).

![Module F](https://hackmd.io/_uploads/S1PMOll4p.png =50%x)

The circuit is known to function but has suboptimal throughput. To address this issue, you opt for pipelining the circuit. For each of the following questions, design a valid K-stage pipeline for the given circuit. Each component in the circuit is labeled with its propagation delay in nanoseconds. Provide the latency and throughput of each design, assuming ideal registers (i.e., t~PD~=0, t~SETUP~=0). Remember that, according to our convention, a pipeline register is placed at each output.

1. Consider a 2-stage pipeline optimized for maximum throughput while utilizing the fewest possible registers. Determine the latency and throughput of the resulting circuit, taking careful note of the direction of each arrow.

![pipeline](https://hackmd.io/_uploads/BJ6D_ggVp.png)

* Latency: __ C01 __ ns
* Throughput: __ C02 __ ns^-1^

> * C01 = ?
> ==30==
> * C02 = ?
> ==$\frac{1}{15}$==
> ![pipeline](https://hackmd.io/_uploads/BJEv9xgEp.png =70%x)

2. Consider a maximum-throughput 4-stage pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit?

* Latency: __ C03 __ ns
* Throughput: __ C04 __ ns^-1^

> * C03 = ?
> ==32==
> * C04 = ?
> ==$\frac{1}{8}$==
> ![pipeline](https://hackmd.io/_uploads/H1VUaggVa.png =70%x)

3. Consider a maximum-throughput pipeline that uses a minimum number of pipeline stages. What are the latency and throughput of the resulting circuit?

* Latency: __ C05 __ ns
* Throughput: __ C06 __ ns^-1^

> * C05 = ?
> ==42==
> * C06 = ?
> ==$\frac{1}{7}$==
> ![pipeline](https://hackmd.io/_uploads/HJJX6xgEa.png =70%x)
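The bookkeeping is the same in all three parts: with ideal registers, an N-stage pipeline clocked at the delay of its slowest stage has latency N x t~CLK~ and throughput 1/t~CLK~. The short Scala sketch below (an editorial check; stage counts and clock periods are read off the answers above) reproduces C01-C06.

```scala
// Editorial sketch: latency and throughput of an N-stage pipeline with ideal
// registers, clocked at the propagation delay of its slowest stage.
object PipelineMetrics extends App {
  def metrics(stages: Int, tClk: Double): (Double, Double) =
    (stages * tClk, 1.0 / tClk)

  println(metrics(2, 15.0))  // (30.0, 1/15) -> C01, C02
  println(metrics(4, 8.0))   // (32.0, 1/8)  -> C03, C04
  println(metrics(6, 7.0))   // (42.0, 1/7)  -> C05, C06 (six 7 ns stages)
}
```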
---

## Problem `D`

Examine the processor implementation designed for single-cycle operation.

![single-cycle CPU](https://hackmd.io/_uploads/HybXW-l4a.png)

Below, you can find the timing specifications for each of the components:

| Component | Propagation delay (t~PD~) |
|:----------|:-------------------------:|
| Register | 1ns |
| Decoder | 2ns |
| RegFile read | 3ns |
| MUX | 1ns |
| ALU | 4ns |
| Adder | 3ns |
| Memory read<br>(instruction or data) | 4ns |

Setup/hold times for clocked inputs (registers and writes to RegFile and data memory):

* Setup time (t~SETUP~): 2 ns
* Hold time (t~HOLD~): 0 ns

1. What is the shortest clock cycle duration achievable for this processor? Minimum t~CLK~: __ D01 __ ns

> * D01 = ?
> ==22==
> Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> DMem (read) -> WDSEL mux -> RF(write)
> => t~CLK~ >= 1 + 4 + 2 + 3 + 1 + 4 + 4 + 1 + 2 (RF setup) = 22ns

2. The current processor's performance is suboptimal. We intend to introduce an alternative data path (while keeping the control logic unchanged) in which the data memory's Adr input is sourced differently. What is the minimum clock cycle time of the new processor? Minimum t~CLK~: __ D02 __ ns

![revised CPU](https://hackmd.io/_uploads/ryahM-xV6.png)

> * D02 = ?
> ==18==
> Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> WDSEL mux -> RF(write)
> => t~CLK~ >= 1 + 4 + 2 + 3 + 1 + 4 + 1 + 2 (RF setup) = 18ns

3. The new processor now executes certain instructions incorrectly with respect to the RISC-V ISA; `lw a0, 8(a1)` is one such instruction. Write the RISC-V instruction that the new processor effectively executes in its place. __ D03 __

> * D03 = ?
> ==`lw a0, 0(a1)`==
> The same lw or sw instruction with the offset set to 0.

4. The program provided requires 90 instructions to complete execution on the original processor. Unfortunately, when run on the new processor, it yields incorrect results. Please make the necessary modifications to the program so that it operates correctly on the new processor.

- [ ] provided program running on the original processor

```c
# int x[16];
# for (int i = 0; i < 15; i++)
#     x[i+1] = x[i] + x[i+1];
#
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
loop:
    lw   a1, 0(a0)
    lw   a2, 4(a0)
    add  a3, a2, a1
    sw   a3, 4(a0)
    addi a0, a0, 4
    blt  a0, a7, loop
```

- [ ] Modified assembly that executes correctly on the new processor

```c
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
    lw   a1, 0(a0)
    addi a0, a0, 4
loop:
    __ D04 __
    add  a1, a1, a2
    __ D05 __
    addi a0, a0, 4
    __ D06 __
```

Additionally, how many instructions does your assembly code execute? __ D07 __

> * D04 = ?
> ==`lw a2, 0(a0)`==
> * D05 = ?
> ==`sw a1, 0(a0)`==
> * D06 = ?
> ==`ble a0, a7, loop`==
> * D07 = ?
> ==77==
> $15 \times 5 + 2 = 77$

5. What is the elapsed time for the execution of the program on both the original and new processors? (Please use the relevant program variant for each respective processor.)

* Execution time on the original processor: __ D08 __ ns
* Execution time on the new processor: __ D09 __ ns

> * D08 = ?
> ==1694==
> $77 \times 22 = 1694$
> * D09 = ?
> ==1386==
> $77 \times 18 = 1386$
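As a sanity check on the rewritten loop (D04-D06), the following Scala sketch (editorial; the array contents are arbitrary) mimics both the original C loop and the modified assembly and confirms they produce the same prefix sums over `x[0..15]`.

```scala
// Editorial sketch: the original loop updates x[i+1] += x[i]; the modified
// assembly keeps the running value in a register (a1) and stores it with a
// zero offset after advancing the pointer, which the new datapath can handle.
object ProblemDLoopCheck extends App {
  val x = Array(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3)

  val expected = x.clone                      // original program's behavior
  for (i <- 0 until 15) expected(i + 1) += expected(i)

  val got = x.clone                           // modified program's behavior
  var a1 = got(0)                             // lw a1, 0(a0)
  for (i <- 1 to 15) {                        // loop body, 15 iterations
    val a2 = got(i)                           // lw a2, 0(a0)
    a1 += a2                                  // add a1, a1, a2
    got(i) = a1                               // sw a1, 0(a0)
  }

  assert(expected.sameElements(got))
  println(got.mkString(" "))
}
```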
---

## Problem `E`

Take a look at the loop provided below, which calculates the sum of values in a linked list. We execute this code on a typical 5-stage RISC-V processor equipped with full bypassing. Assume that branches are consistently predicted as **not taken**, and that branch determinations are made during the EXE stage. Also, consider that the loop undergoes numerous iterations and is presently in the middle of its execution.

```c
loop:
    lw   a1, 0(a0)
    lw   a0, 4(a0)
    add  a2, a2, a1
    bnez a0, loop
    mv   a0, a2
    addi sp, sp, 8
    ret
```

1. Assume that at cycle 100 the `lw a1, 0(a0)` instruction is fetched. Fill in the following:

* Number of cycles per loop iteration: __ E01 __
* Number of cycles per loop iteration wasted due to stalls: __ E02 __
* Number of cycles per loop iteration wasted due to annulments: __ E03 __

> * E01 = ?
> ==7==
> * E02 = ?
> ==1==
> * E03 = ?
> ==2==
> ![pipeline](https://hackmd.io/_uploads/Sy5WiZl4p.png)

We aim to enhance the performance of this processor. Upon careful examination, we have identified an opportunity to save cycles by introducing additional bypass paths. Contrary to its label, "full bypassing" does not fully exploit the potential for bypassing: it only bypasses data to the DEC stage. Occasionally, instructions do not require their inputs in the EXE stage but rather at a later stage, and stalling at the DEC stage leads to unnecessary cycle wastage. To address this issue, we suggest a bypassing scheme named "extra bypassing."

In extra bypassing:

- There exists a bypass path from each stage to DEC and to all intermediate stages between DEC and the source of the bypass. For example, in a 5-stage pipeline, the paths include:
  - EXE -> DEC
  - MEM -> DEC
  - WB -> DEC
  - MEM -> EXE
  - WB -> EXE
  - WB -> MEM

  (Note that full bypassing only includes the first 3 of these paths.)
- As customary, each bypass path forwards data to the end of the receiving stage. For instance:
  - MEM -> DEC bypasses data to the end of DEC, right before the DEC-EXE pipeline register, after register reads.
  - MEM -> EXE bypasses data to the end of EXE, right before the EXE-MEM pipeline register, after the ALU operation.
- If an instruction relies on a value produced by an earlier instruction, and that value is still unavailable (either in the register file or through a bypass path), the instruction doesn't stall until the stage just before the value is required. For example:
  - If a value is needed in EXE and hasn't been generated yet, the instruction stalls in DEC (as in the case of full bypassing).
  - However, if the value is needed in MEM, the instruction proceeds from DEC to EXE and only stalls in EXE until the value becomes accessible through a bypass path.
- Instructions capture missing inputs from bypass paths as soon as they become available, even if they are currently stalled for other reasons. This ensures that if an instruction advances past DEC, it always acquires the correct input value before it needs it.

The diagram below shows the bypass paths in full vs. extra bypassing.

![full bypassing](https://hackmd.io/_uploads/By1S2-gEp.png =70%x)
![extra bypassing](https://hackmd.io/_uploads/B1Nv2ZeEp.png =70%x)

2. Consider the 2-instruction sequence below, which incurs a delay of two cycles with full bypassing but only one cycle with extra bypassing.

```c
lw a1, 0(a0)
sw a1, 0(a2)
```

Specify which bypass paths are exercised in each case.

* Bypass paths exercised with full bypassing: __ E04 __
* Bypass paths exercised with extra bypassing: __ E05 __

> * E04 = ?
> ==WB->DEC==
> * E05 = ?
> ==WB->EXE==

To harness the benefits of extra bypassing more effectively, we introduce a 5-stage pipeline structure as illustrated below, with the ALU positioned in the fourth stage:

- In this pipeline:
  - The first two stages, IF (Instruction Fetch) and DEC (Decode), function similarly to the standard 5-stage pipeline.
  - The third stage, AC (Address Calculation), handles address computation for load and store operations.
  - The fourth stage, EXM (Execute Memory), carries out ALU operations and manages data access.
  - The fifth stage, WB (Write Back), operates in line with the standard pipeline procedure.
- Branch predictions are made with an assumption of **not-taken** and are resolved in the EXM stage.
- This pipeline incorporates extra bypassing for improved performance.

![pipeline](https://hackmd.io/_uploads/ByAyAbxVp.png)

3. Consider again the code from part 1 and assume that at cycle 100 the `lw a1, 0(a0)` instruction is fetched. Fill in the following: (Remember that the Execution module resides within the EXM stage of the pipeline.)

* Number of cycles per loop iteration: __ E06 __
* Number of cycles per loop iteration wasted due to stalls: __ E07 __
* Number of cycles per loop iteration wasted due to annulments: __ E08 __

> * E06 = ?
> ==7==
> * E07 = ?
> ==0==
> * E08 = ?
> ==3==
> ![pipeline](https://hackmd.io/_uploads/Hkc30-e4T.png)
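In both parts the per-iteration cycle count decomposes into the four issued loop instructions plus wasted cycles; a worked check against the answers above:

$$
\underbrace{4}_{\text{loop instructions}} + \underbrace{1}_{\text{stall}} + \underbrace{2}_{\text{annulments}} = 7 \quad \text{(part 1)}, \qquad
4 + 0 + 3 = 7 \quad \text{(part 3)}
$$

Resolving the branch one stage later (EXM instead of EXE) removes the stall but annuls one more speculatively fetched instruction, so the two designs tie at 7 cycles per iteration.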
---

## Problem `F`

We are in the process of developing our custom RISC-V processor, which includes support for integer multiplication. This processor incorporates a `mul` instruction designed to multiply two source registers and generate a 32-bit result. To assess the functionality of our processor, we employ the following program for testing purposes. It is important to note that the branch instruction is presumed to be taken, and all instructions following the `ret` instruction are NOPs.

```c
loop:
    lw   x14, 4(x12)
    mul  x10, x14, x10
    andi x12, x14, 255
    bnez x12, loop
exit:
    sw   x12, 0(x11)
    ret
```

In the initial phase, we attempt to construct a single-cycle processor, which is depicted below.

![single-cycle processor](https://hackmd.io/_uploads/HJ5YnfgEp.png)

Consider that the processor is equipped with perfect registers (with no setup or hold times, and zero propagation and clock-to-q delays), and that all memory operations complete within a single cycle (combinational reads). Use the provided table to calculate the minimum clock cycle duration, the average cycles per instruction, and the overall processor performance in terms of average instructions executed per nanosecond while running the specified test code.

| Logic Block | Propagation Delay |
|:------------|:-----------------:|
| IF | 3ns |
| DEC | 2ns |
| EXE | 10ns |
| MEM | 3ns |
| WB | 2ns |

1. Fill in the following:

* Average cycles per instruction: __ F01 __
* Clock period: __ F02 __ ns
* Average number of instructions per nanosecond: __ F03 __

> * F01 = ?
> ==1==
> * F02 = ?
> ==20==
> * F03 = ?
> ==$\frac{1}{20}$==

It is evident that pipelining can be employed to enhance clock frequency. We have the option to enhance the initial design, as depicted in the subsequent diagram. In this scenario, we assume full bypassing is implemented, and branch decisions are rendered during the EXE stage. The IF stage speculatively fetches the instruction located at PC+4 and annuls incorrectly fetched instructions in the event of a branch misprediction.

![pipeline](https://hackmd.io/_uploads/r1hM0GeVa.png)

We employ the identical test program to evaluate the performance of the suggested processor.

2. Compute the clock cycle duration, average cycles per instruction, and average instruction throughput per nanosecond while executing the test code. It is assumed that bypassing does not impact the critical path.

* Clock period: __ F04 __ ns
* Average cycles per instruction: __ F05 __
* Average number of instructions per nanosecond: __ F06 __

> * F04 = ?
> ==10==
> * F05 = ?
> ==2==
> * F06 = ?
> ==$\frac{1}{20}$==
> ![pipeline](https://hackmd.io/_uploads/B1741Xe4p.png)
> 8 cycles per loop iteration (4 instructions), so average cycles per instruction = 2.
> 4 instr / 80 ns = 1/20 instr/ns.

We have a suspicion that the EXE stage might be the root cause of the limited performance improvement. Consequently, we divide the EXE stage into three separate pipeline stages. The bypassing and branch decision logic that was previously in the EXE stage has now been relocated to the EX3 stage.

![pipeline](https://hackmd.io/_uploads/SJBhkQgET.png)

| Logic Block | Propagation Delay |
|-------------|-------------------|
| EX1 | 4ns |
| EX2 | 3ns |
| EX3 | 3ns |

3. Calculate the clock period, average cycles per instruction, and average number of instructions per nanosecond for the test benchmark running on this processor.

* Clock period: __ F07 __ ns
* Average cycles per instruction: __ F08 __
* Average number of instructions per nanosecond: __ F09 __

> * F07 = ?
> ==4==
> * F08 = ?
> ==3.5==
> * F09 = ?
> ==$\frac{1}{14}$==
> ![pipeline](https://hackmd.io/_uploads/BJ8PgmeN6.png)
> The branch decision is made when bnez reaches EX3 in cycle 14, so lw is fetched again in cycle 15.
> 14 cycles per loop iteration (4 instructions), so average cycles per instruction = 14/4 = 3.5.
> 4 instr / 56 ns = 1/14 instr/ns.

Subsequently, we discover that the critical path in the EXE stage is primarily attributed to the multiplier employed for the `mul` instruction. In response, we separate the multiplier from the EXE stage and run it in parallel, so that non-`mul` instructions require only one cycle to execute. The branch decision logic, however, remains in the EX stage.

![pipeline](https://hackmd.io/_uploads/rJLgbXgV6.png)

The MUX preceding the MEM stage adheres to the following policy: it selects whichever of EX and MUL3 holds a valid instruction. If both hold valid instructions, it gives precedence to the one from MUL3 and consequently stalls EX and all preceding pipeline stages.

The timing parameters for the above new processor are shown below:

| Logic Block | Propagation Delay |
|-------------|-------------------|
| EX | 4ns |
| MUL1 | 4ns |
| MUL2 | 3ns |
| MUL3 | 3ns |

4. Determine the clock cycle duration, average cycles per instruction, and average instructions executed per nanosecond for the test benchmark when executed on this processor.

* Clock period (ns): __ F10 __
* Average cycles per instruction: __ F11 __
* Average number of instructions per nanosecond: __ F12 __

> * F10 = ?
> ==4==
> * F11 = ?
> ==2==
> * F12 = ?
> ==$\frac{1}{8}$== OR ==$\frac{1}{9}$==
> ![pipeline](https://hackmd.io/_uploads/rk_xMXgVa.png)
> 8 cycles per loop iteration (4 instructions), so average cycles per instruction = 8/4 = 2.
> 4 instr / 32 ns = 1/8 instr/ns.
> Also accepted: 9 cycles per loop iteration, if you assumed lw could not be fetched until bnez leaves the EX stage.
> In that case the CPI is 2.25 and the average number of instructions per ns is 1/9.
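The four designs in this problem can be compared with one formula: instructions per nanosecond = (instructions per loop iteration) / (cycles per iteration x t~CLK~). The Scala sketch below (editorial; the cycle counts and clock periods are taken from the solutions above) reproduces F03, F06, F09, and F12.

```scala
// Editorial sketch comparing the four Problem F designs on the test loop
// (4 instructions per iteration; cycle counts and clock periods from above).
object ProblemFThroughput extends App {
  def instrPerNs(instrs: Int, cyclesPerIter: Int, tClkNs: Double): Double =
    instrs / (cyclesPerIter * tClkNs)

  println(instrPerNs(4, 4, 20.0))   // single-cycle:            1/20
  println(instrPerNs(4, 8, 10.0))   // 5-stage pipeline:        1/20
  println(instrPerNs(4, 14, 4.0))   // EXE split into EX1-EX3:  1/14
  println(instrPerNs(4, 8, 4.0))    // parallel multiplier:     1/8
}
```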
---

## Problem `G`

[Chisel](https://www.chisel-lang.org/) is a domain-specific language (DSL) embedded in [Scala](https://www.scala-lang.org/) through its macro features. Therefore, all programming related to circuit logic must be expressed using the constructs provided by the [Chisel](https://www.chisel-lang.org/) library, rather than directly using Scala language keywords. Specifically, when you want to use a constant in a circuit, such as 12, and compare it with a circuit value, you cannot write `if (value == 12)` but instead must write `when(value === 12.U)`. Keeping this fundamental distinction in mind is essential to writing valid [Chisel](https://www.chisel-lang.org/) code.

> [Chisel Cheatsheet](https://github.com/freechipsproject/chisel-cheatsheet/releases/latest/download/chisel_cheatsheet.pdf)

1. Have you begun your work on [Homework3](https://hackmd.io/@sysprog/2023-arch-homework3)? __ G01 __

> * G01 = ?
> Any answer earns credit.

Chisel defines a constructor for n-way multiplexers:

```scala
MuxLookup(index, default, Array(key1 -> value1, key2 -> value2, ..., keyN -> valueN))
```

Explanation:

* The index-to-key match is implemented using the `===` operator.
* Therefore `MuxLookup` works for any type for which `===` is defined.
* `===` is defined on bundles and vecs, as well as the primitive Chisel types.
* Users can override `===` for their own bundles.

Take a look at the Scala code provided in the instruction decoding section of a RISC-V processor implementation.

```scala
class InstructionDecode extends Module {
  val io = IO(new Bundle {
    val instruction = Input(UInt(Parameters.InstructionWidth))

    val regs_reg1_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val regs_reg2_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val ex_immediate = Output(UInt(Parameters.DataWidth))
    val ex_aluop1_source = Output(UInt(1.W))
    val ex_aluop2_source = Output(UInt(1.W))
    val memory_read_enable = Output(Bool())
    val memory_write_enable = Output(Bool())
    val wb_reg_write_source = Output(UInt(2.W))
    val reg_write_enable = Output(Bool())
    val reg_write_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
  })

  val opcode = io.instruction(6, 0)
  val funct3 = io.instruction(14, 12)
  val funct7 = io.instruction(31, 25)
  val rd = io.instruction(11, 7)
  val rs1 = io.instruction(19, 15)
  val rs2 = io.instruction(24, 20)

  io.regs_reg1_read_address := Mux(opcode === Instructions.lui, 0.U(Parameters.PhysicalRegisterAddrWidth), rs1)
  io.regs_reg2_read_address := rs2

  val immediate = MuxLookup(
    opcode,
    Cat(Fill(20, io.instruction(31)), io.instruction(31, 20)),
    IndexedSeq(
      InstructionTypes.I -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
      InstructionTypes.L -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
      Instructions.jalr -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
      ...
      Instructions.lui -> G02 /* Your answer here! */,
      ...
    )
  )
```

2. Complete the provided code (i.e., `G02`) to ensure it functions as intended.

> * G02 = ?
> ==`Cat(io.instruction(31, 12), 0.U(12.W))`==
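To see the G02 answer in context, here is a small, self-contained Chisel sketch (the module name and widths are illustrative, not taken from the referenced implementation) that uses the same `MuxLookup`/`Cat` idiom to build the U-type (`lui`) immediate: the upper 20 instruction bits followed by 12 zero bits.

```scala
import chisel3._
import chisel3.util._

// Illustrative module: extract the lui (U-type) immediate from an instruction.
class LuiImmediate extends Module {
  val io = IO(new Bundle {
    val instruction = Input(UInt(32.W))
    val immediate   = Output(UInt(32.W))
  })

  val opcode    = io.instruction(6, 0)
  val luiOpcode = "b0110111".U(7.W)   // RV32I opcode for lui

  io.immediate := MuxLookup(
    opcode,
    0.U(32.W),                        // default: not a lui
    IndexedSeq(
      // Same construction as the G02 answer: upper 20 bits, then 12 zeros.
      luiOpcode -> Cat(io.instruction(31, 12), 0.U(12.W))
    )
  )
}
```

For a `lui` instruction this yields the immediate already shifted into bits 31..12, which is exactly what the decoder's `ex_immediate` output should carry.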