# Annotate and explain Quiz4/5 with additional challenge Problems

## Quiz4 of Computer Architecture (2023 Fall)

> Solutions

### Problem A

Take a look at the circuit diagram provided below, along with the associated delays of its components:

![](https://hackmd.io/_uploads/HyopeegVT.png)

| component | delays |
|:--------- | ------ |
| $t_{AND}$ | 20ps |
| $t_{NOT}$ | 20ps |
| $t_{OR}$ | 20ps |
| $t_{XOR}$ | 20ps |
| $t_{multiplier}$ | 20ps |
| $t_{subtract}$ | 20ps |
| $t_{clk-q}$ | 20ps |
| $t_{setup}$ | 20ps |

Assuming that this circuit exclusively operates on two's complement integers, and given that the subtractor component has a delay of 35ps, what is the maximum delay permissible for the adder component if we replace the subtractor with adders, NOT gates, and constants, while matching the subtractor's delay and behavior? It can be assumed that constants introduce no delay. As a refresher, the subtract component performs the following operation: output = top input - bottom input

> - A01 = ?
> ==35ps==

### 📝 Annotate

:::info
As we are working with two's complement numbers, subtracting x is essentially the same as adding ~x + 1, where ~x flips all the bits of x. Consequently, we can link a NOT gate, an adder with a constant value of 1, and another adder (to sum the output of the previous adder and the top input of the subtractor) to achieve the same functionality.

The purpose of the question is for students to recognize that the NOT gate and the first adder do not introduce any additional delay, since the multiplier/NOT gate combination at the top input of the subtractor takes longer than the NOT gate/adder combination on the bottom input. Therefore, the adder can have a delay of 35ps (equivalent to the existing subtractor) for the circuit to maintain consistent timing behavior.
:::

### Problem B

Examine the sequential circuit provided below, along with the associated timing specifications.
Registers R1, R2, and R3 are synchronized by a shared clock signal, while A, B, and C represent combinational logic circuits.

![](https://hackmd.io/_uploads/BJoRxlgVT.png)

| - | $t_{pd}$ | $t_{cd}$ | $t_{setup}$ | $t_{hold}$ |
|:--------- | -------- | -------- | ----------- | ---------- |
| Register (R1, R2, R3) | 5ns | 1ns | 7ns | 2ns |
| Circuit A | 3ns | ? | - | - |
| Circuit B | 4ns | 2ns | - | - |
| Circuit C | 6ns | 2.5ns | - | - |

1. What is $t_{pd}$ of this sequential circuit? __ B01 __ ns

> - B01 = ?
> ==5==

2. What is $t_{cd}$ of this sequential circuit? __ B02 __ ns

> - B02 = ?
> ==1==

3. What is the minimum value of $t_{cd}$ for Circuit A that will ensure proper operation of this circuit? __ B03 __ ns

> - B03 = ?
> ==1==

4. What is the minimum clock period that can be employed to clock all the registers within the circuit? __ B04 __ ns

> - B04 = ?
> ==19==

5. Available from the supplier are alternative circuits for A, B, and C, each with the following specifications.

| - | $t_{pd}$ | $t_{cd}$ | $t_{setup}$ | $t_{hold}$ |
|:--------- | -------- | -------- | ----------- | ---------- |
| A-New | 1ns | 0.5ns | - | - |
| B-New | 2.5ns | 2ns | - | - |
| C-New | 2ns | 1ns | - | - |

Replacement of these new circuits is cost-sensitive, allowing for the substitution of only one combinational circuit to achieve a minimized clock period. Please specify which combinational circuit you intend to replace and the resulting minimum clock period.

- Combinational circuit replaced: __ B05 __
- $t_{CLK}$ of circuit: __ B06 __ ns

### 📝 Annotate

:::info
B01, B02:
For a sequential circuit, $t_{pd}$ is the maximum propagation delay from the clock edge to the output, while $t_{cd}$ is the minimum (contamination) clock-to-output delay. Based on the provided information, $t_{pd}$ is 5ns and $t_{cd}$ is 1ns. The circuit's output is driven directly by a register, so $t_{pd}$ of the overall circuit is simply that register's propagation delay of 5ns.
On the other hand, $t_{cd}$ is determined by the shortest clock-to-output path, which here is likewise just the register's own contamination delay, resulting in $t_{cd}$ of 1ns.

B03:
The contamination delay is the minimum time required for a change at an input to produce a change at the output. The $t_{cd}$ of Circuit A must be large enough that R2's input cannot change before R2's hold time has elapsed after the clock edge. Comparing the contamination delays along the path from R1 through Circuit A against the hold time of R2 shows that the smallest value that allows correct operation is 1ns.

#### Formula :
$t_{cd,R1}+t_{cd,A}\geq t_{hold,R2}$
$1+t_{cd,A}\geq2$
$t_{cd,A}\geq1ns$

B04:
This involves finding the minimum clock period that satisfies the timing requirements of the entire circuit. By summing the propagation delays of the register and combinational circuits along the longest register-to-register path (the critical path) and adding the receiving register's setup time, we find a critical path delay of 19ns, which is the minimum clock period that can drive all of the registers while ensuring correct operation.

#### Formula :
$t_{CLK}\geq t_{pd,R1}+t_{pd,A}+t_{pd,B}+t_{setup,R3}$
$t_{CLK}\geq5+3+4+7$
$t_{CLK}\geq19ns$

B05, B06:
The goal is to identify which combinational circuit to replace and the resulting minimum clock period. Comparing the alternatives: replacing Circuit B with B-New shortens the path through A and B to $5+3+2.5+7=17.5$ns, at which point the path through Circuit C ($5+6+7=18$ns) becomes the new critical path. Replacing A or C instead leaves a longer critical path, so B is the best choice and the resulting minimum clock period is 18ns.
The explanation highlights the comparison of the critical paths and the impact of the replacement on the overall timing requirements, ultimately justifying the selection of the B-New circuit over the other alternatives.
:::

### Problem C

Consider a module called 'F' with three inputs (X, Y, and Z) and one output (A).

![](https://hackmd.io/_uploads/S1PMOll4p.png)

The circuit is known to function but has suboptimal throughput. To address this issue, you opt for pipelining the circuit. For each of the following questions, design a valid K-stage pipeline for the given circuit. Each component in the circuit is labeled with its propagation delay in nanoseconds. Provide the latency and throughput of each design, assuming ideal registers (i.e., $t_{PD}=0$, $t_{SETUP}=0$). Remember that according to our convention, a pipeline register is placed at each output.

1. Consider a 2-stage pipeline optimized for maximum throughput while utilizing the fewest possible registers. Determine the latency and throughput of the resulting circuit, taking careful note of the direction of each arrow.

![](https://hackmd.io/_uploads/BJ6D_ggVp.png)

- Latency: __ C01 __ ns
- Throughput: __ C02 __ $ns^{-1}$

> - C01 = ?
> ==30==
> - C02 = ?
> ==$\frac{1}{15}$==

2. Consider a maximum-throughput 4-stage pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit?

- Latency: __ C03 __ ns
- Throughput: __ C04 __ $ns^{-1}$

> - C03 = ?
> ==32==
> - C04 = ?
> ==$\frac{1}{8}$==

3. Consider a maximum-throughput pipeline that uses a minimum number of pipeline stages. What are the latency and throughput of the resulting circuit?

- Latency: __ C05 __ ns
- Throughput: __ C06 __ $ns^{-1}$

> - C05 = ?
> ==42==
> - C06 = ?
> ==$\frac{1}{7}$==

### 📝 Annotate

:::info
C01, C02:
First, partition the circuit into pipeline stages and place a pipeline register at each output. Pay close attention to the direction of each arrow.
Finally, calculate the latency and throughput of the resulting circuit. The values for latency and throughput can be obtained based on the given conditions.

![](https://hackmd.io/_uploads/BJEv9xgEp.png)

C03, C04:
Similar to part 1, partition the circuit into pipeline stages and place a pipeline register at each output. After that, calculate the latency and throughput of the resulting circuit based on the given conditions.

![](https://hackmd.io/_uploads/H1VUaggVa.png)

C05, C06:
Similar to parts 1 and 2, partition the circuit into pipeline stages and place a pipeline register at each output. Then, calculate the latency and throughput of the resulting circuit based on the given conditions.

![](https://hackmd.io/_uploads/HJJX6xgEa.png)
:::

### Problem D

Examine the processor implementation designed for single-cycle operation.

![](https://hackmd.io/_uploads/HybXW-l4a.png)

Below, you can find the timing specifications for each of the components:

| Component | Propagation delay ($t_{PD}$) |
|:--------- | -------- |
| Register | 1ns |
| Decoder | 2ns |
| RegFile read | 3ns |
| MUX | 1ns |
| ALU | 4ns |
| Adder | 3ns |
| Memory read (instruction or data) | 4ns |

Setup/hold times for clocked inputs (registers and writes to RegFile and data memory):

- Setup time ($t_{SETUP}$): 2 ns
- Hold time ($t_{HOLD}$): 0 ns

1. What is the shortest clock cycle duration achievable for this processor? Minimum $t_{CLK}$: __ D01 __ ns

> - D01 = ?
> ==22==

2. The current processor's performance is suboptimal. We intend to introduce an alternative data path (while keeping the control logic unchanged) in which the data memory's Adr input is sourced differently. What is the minimum clock cycle time of the new processor? Minimum $t_{CLK}$: __ D02 __ ns

![](https://hackmd.io/_uploads/ryahM-xV6.png)

> - D02 = ?
> ==18==

3. Now, the new processor executes certain instructions incorrectly as per the RISC-V ISA.
Provide an example of such an instruction, e.g., lw a0, 8(a1), and write the RISC-V instruction the new processor effectively executes in its place. __ D03 __

> - D03 = ?
> ==lw a0, 0(a1)==

4. The program provided requires 90 instructions to complete execution on the original processor. Unfortunately, when run on the new processor, it yields incorrect results. Please make the necessary modifications to the program to ensure it operates correctly on the new processor.

- [ ] provided program running on the original processor

```c
# int x[16];
# for (int i = 0; i < 15; i++)
#     x[i+1] = x[i] + x[i+1];
#
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
loop:
    lw a1, 0(a0)
    lw a2, 4(a0)
    add a3, a2, a1
    sw a3, 4(a0)
    addi a0, a0, 4
    blt a0, a7, loop
```

- [ ] Modified assembly that executes correctly on new processor

```c
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
    lw a1, 0(a0)
    addi a0, a0, 4
loop:
    __ D04 __
    add a1, a1, a2
    __ D05 __
    addi a0, a0, 4
    __ D06 __
```

Additionally, how many instructions does your assembly code execute? __ D07 __

> - D04 = ?
> ==lw a2, 0(a0)==
> - D05 = ?
> ==sw a1, 0(a0)==
> - D06 = ?
> ==ble a0, a7, loop==
> - D07 = ?
> ==77==

5. What is the elapsed time for the execution of the program on both the original and new processors? (Please utilize the relevant program variant for each respective processor.)

- Execution time on the original processor: __ D08 __ ns
- Execution time on the new processor: __ D09 __ ns

> - D08 = ?
> ==1694==
> - D09 = ?
> ==1386==

### 📝 Annotate

:::info
D01:
The solution involves identifying the critical path within the processor, then calculating the propagation delay of each component along the critical path and summing them to obtain the minimum clock cycle time. The critical path includes PC, instruction memory, decoder, register file read, BSEL multiplexer, ALU, data memory read, WDSEL multiplexer, and register file write. Based on the propagation delays of these components, the minimum clock cycle time is determined to be 22ns.
#### Formula :
Critical path is PC -> IMem -> Decoder -> RF (read) -> BSEL mux -> ALU -> DMem (read) -> WDSEL mux -> RF (write)
=> tCLK >= 1 + 4 + 2 + 3 + 1 + 4 + 4 + 1 + 2 (RF setup) = 22ns

D02:
The solution again involves identifying the critical path and summing the propagation delays along it. Because the data memory's Adr input no longer comes from the ALU output, the data memory read drops off the longest path, which now includes PC, instruction memory, decoder, register file read, BSEL multiplexer, ALU, WDSEL multiplexer, and register file write. Based on the propagation delays of these components, the minimum clock cycle time is calculated to be 18ns.

#### Formula :
Critical path is PC -> IMem -> Decoder -> RF (read) -> BSEL mux -> ALU -> WDSEL mux -> RF (write)
=> tCLK >= 1 + 4 + 2 + 3 + 1 + 4 + 1 + 2 (RF setup) = 18ns

D03:
The same lw or sw instruction with the offset set to 0: since the data memory address now bypasses the offset addition, any load or store with a nonzero offset behaves as if its offset were 0.

D04:
Since a0 has already been incremented by 4, directly using `lw a2, 0(a0)` retrieves the value of x[i+1].

D05:
The instruction `add a1, a1, a2` has already calculated the value of x[i] + x[i+1]. Now, it's just a matter of storing this value into x[i+1].

D06:
Since we need to update elements up through x[15], the final comparison checks whether a0 is still less than or equal to a7: if so, the loop continues for another iteration; otherwise, it exits.

D07:
The original program executes 6 instructions per iteration for 15 iterations, i.e., 90 instructions. The modified program hoists the first load out of the loop: a 2-instruction prologue (the initial lw and addi) is followed by a 5-instruction loop body executed 15 times. Therefore, the total number of executed instructions is 15*5 + 2 = 77.
D08:
Obtained by multiplying the total number of executed instructions (77) by the clock cycle time of the original processor (22 ns).

#### Formula :
$77*22=1694$

D09:
Obtained by multiplying the total number of executed instructions (77) by the clock cycle time of the new processor (18 ns).

#### Formula :
$77*18=1386$
:::

### Problem E

Take a look at the loop provided below, which calculates the sum of values in a linked list. We execute this code on a typical 5-stage RISC-V processor equipped with complete bypassing. Assume that branches are consistently predicted as **not taken**, and that branch determinations are made during the EXE stage. Also, consider that the loop undergoes numerous iterations and is presently in the middle of its execution.

```c
loop:
    lw a1, 0(a0)
    lw a0, 4(a0)
    add a2, a2, a1
    bnez a0, loop
    mv a0, a2
    addi sp, sp, 8
    ret
```

1. Assume that at cycle 100 the lw a1, 0(a0) instruction is fetched. Fill in the following:

- Number of cycles per loop iteration: __ E01 __
- Number of cycles per loop iteration wasted due to stalls: __ E02 __
- Number of cycles per loop iteration wasted due to annulments: __ E03 __

> - E01 = ?
> ==7==
> - E02 = ?
> ==1==
> - E03 = ?
> ==2==

We aim to enhance the performance of this processor. Upon careful examination, we have identified an opportunity to save cycles by introducing additional bypass paths. Contrary to its label, "full bypassing" does not fully exploit the potential for bypassing. It primarily bypasses data to the DEC stage alone. Occasionally, instructions do not require their inputs in the EXE stage but rather at a later stage, and stalling at the DEC stage leads to unnecessary cycle wastage. To address this issue, we suggest a bypassing scheme named "extra bypassing." In extra bypassing:

- There exists a bypass path from each stage to DEC and all intermediate stages between DEC and the source of the bypass. For example, in a 5-stage pipeline, the paths include:
  - EXE -> DEC
  - MEM -> DEC
  - WB -> DEC
  - MEM -> EXE
  - WB -> EXE
  - WB -> MEM

  (Note that full bypassing only includes the first 3 of these paths.)
- As customary, each bypass path forwards data to the end of the receiving stage.
For instance:

- MEM -> DEC bypasses data to the end of DEC, right before the DEC-EXE pipeline register, after register reads.
- MEM -> EXE bypasses data to the end of EXE, right before the EXE-MEM pipeline register, after the ALU operation.

If an instruction relies on a value produced by an earlier instruction, and that value is still unavailable (either in the register file or through a bypass path), the instruction doesn't stall until the stage just before the value is required. For example:

- If a value is needed in EXE and hasn't been generated yet, the instruction stalls in DEC (as in the case of full bypassing).
- However, if the value is needed in MEM, the instruction proceeds from DEC to EXE and only stalls in EXE until the value becomes accessible through a bypass path.

Instructions capture missing inputs from bypass paths as soon as they become available, even if they are currently stalled for other reasons. This ensures that if an instruction advances past DEC, it always acquires the correct input value before it needs it.

The diagram below shows the bypass paths in full vs. extra bypassing.

![](https://hackmd.io/_uploads/By1S2-gEp.png)
![](https://hackmd.io/_uploads/B1Nv2ZeEp.png)

2. Consider a 2-instruction sequence that results in a delay of two cycles when using full bypassing but only one cycle when employing extra bypassing.

```c
lw a1, 0(a0)
sw a1, 0(a2)
```

Specify which bypass paths are exercised in each case.

- Bypass paths exercised with full bypassing: __ E04 __
- Bypass paths exercised with extra bypassing: __ E05 __

> - E04 = ?
> ==WB->DEC==
> - E05 = ?
> ==WB->EXE==

To harness the benefits of extra bypassing more effectively, we introduce a 5-stage pipeline structure as illustrated below, with the ALU positioned in the fourth stage. In this pipeline:

- The first two stages, IF (Instruction Fetch) and DEC (Decode), function similarly to the standard 5-stage pipeline.
- The third stage, AC (Address Calculation), handles address computation for load and store operations.
- The fourth stage, EXM (Execute Memory), carries out ALU operations and manages data access.
- The fifth stage, WB (Write Back), operates in line with the standard pipeline procedure.
- Branch predictions are made with an assumption of not-taken and are resolved in the EXM stage.
- This pipeline incorporates extra bypassing for improved performance.

![](https://hackmd.io/_uploads/ByAyAbxVp.png)

3. Consider again the code from part 1 and assume that at cycle 100 the lw a1, 0(a0) instruction is fetched. Fill in the following: (Remember that the Execution module resides within the EXM stage of the pipeline.)

- Number of cycles per loop iteration: __ E06 __
- Number of cycles per loop iteration wasted due to stalls: __ E07 __
- Number of cycles per loop iteration wasted due to annulments: __ E08 __

> - E06 = ?
> ==7==
> - E07 = ?
> ==0==
> - E08 = ?
> ==3==

### 📝 Annotate

:::info
E01:
First, we need to complete a pipeline chart, displaying the instruction execution status from clock cycle 100 to clock cycle 109.

![](https://hackmd.io/_uploads/Sy5WiZl4p.png)

Only 4 instructions commit in each iteration of the loop: lw, lw, add, and bnez (mv and addi are fetched speculatively behind the branch but are annulled every time the branch is taken). From the pipeline chart, one iteration takes 7 cycles: 4 cycles for the committed instructions, plus 1 stall cycle, plus 2 cycles lost to the annulled instructions behind the taken branch.

E02:
Can be determined by analyzing the pipeline diagram and identifying the data dependencies that result in a stall.
In this specific case, `add a2, a2, a1` depends on the value loaded by `lw a1, 0(a0)`. Load data can only be bypassed once the load reaches WB, so add must wait one cycle in DEC before receiving a1 over the WB->DEC path. (Once add has stalled, bnez is delayed just enough that the result of `lw a0, 4(a0)` reaches it without any further stall.) Therefore, the number of cycles wasted due to stalls per loop iteration is 1.

E03:
Can be determined by analyzing the pipeline diagram and identifying any instructions that are annulled. In this specific case, the two instructions fetched after bnez (mv and addi) are annulled when the branch, predicted not-taken, resolves as taken in the EXE stage. This wastes 2 cycles per loop iteration, as the pipeline slots occupied by the annulled instructions are discarded.

E04:
The bypass path WB->DEC is exercised with full bypassing: sw needs a1, but the loaded value only becomes available for bypassing once lw reaches WB, so sw waits in DEC for two cycles and then receives the value over the WB->DEC path.

E05:
The bypass path WB->EXE is exercised with extra bypassing: sw does not need a1 until the MEM stage, so it is allowed to advance out of DEC and picks the value up at the end of EXE over the WB->EXE path, wasting only one cycle.

E06:
The pipeline diagram for cycles 100-109

![](https://hackmd.io/_uploads/Hkc30-e4T.png)

starting with the "lw a1, 0(a0)" instruction, involves the execution of subsequent instructions and the use of bypassing to forward data between pipeline stages. The number of cycles per loop iteration is 7, determined by the total number of cycles required for one iteration of the loop to complete.

E07:
By carefully analyzing the dependencies between instructions and the use of bypassing, we can determine that there were no stalls in the pipeline during the execution of the loop. Therefore, the number of cycles per loop iteration wasted due to stalls was 0.
E08:
By carefully analyzing the dependencies between instructions, we can determine which instructions are annulled. The branch is resolved in the EXM stage (the fourth stage), so by the time bnez resolves as taken, three younger instructions have already been fetched down the not-taken path and must be annulled. Therefore, the number of cycles per loop iteration wasted due to annulments is 3.
:::

### Problem F

We are in the process of developing our custom RISC-V processor, which includes support for integer multiplication. This processor incorporates a mul instruction designed to multiply two source registers and generate a 32-bit result. To assess the functionality of our processor, we employ the following program for testing purposes. It is important to note that the branch instruction is presumed to be taken, and all instructions following the 'ret' instruction are NOPs.

```c
loop:
    lw x14, 4(x12)
    mul x10, x14, x10
    andi x12, x14, 255
    bnez x12, loop
exit:
    sw x12, 0(x11)
    ret
```

In the initial phase, we attempt to construct a single-cycle processor, which is depicted below.

![](https://hackmd.io/_uploads/HJ5YnfgEp.png)

Consider that the processor is equipped with perfect registers (with no setup or hold times, and zero propagation and clock-to-q delays), and all memory operations can be completed within a single cycle (combinational reads). Utilize the provided table to calculate the minimum clock cycle duration, the average instruction cycle count, and assess the overall processor performance in terms of average instructions executed per nanosecond while running the specified test code.

| Logic Block | Propagation Delay |
|:--------- | -------- |
| IF | 3ns |
| DEC | 2ns |
| EXE | 10ns |
| MEM | 3ns |
| WB | 2ns |

1. Fill in the following:

- Average cycles per instruction: __ F01 __
- Clock period: __ F02 __ ns
- Average number of instructions per nanosecond: __ F03 __

> - F01 = ?
> ==1==
> - F02 = ?
> ==20==
> - F03 = ?
> ==$\frac{1}{20}$==

It is evident that pipelining can be employed to enhance clock frequency. We have the option to enhance the initial design, as depicted in the subsequent diagram. In this scenario, we assume full bypassing is implemented, and branch decisions are rendered during the EXE stage. The IF stage, in a speculative manner, retrieves the instruction located at PC+4 and invalidates incorrectly fetched instructions in the event of a branch misprediction.

![](https://hackmd.io/_uploads/r1hM0GeVa.png)

We employ the identical test program to evaluate the performance of the suggested processor.

2. Compute the clock cycle duration, average instruction cycles, and average instruction throughput per nanosecond while executing the test code. It is assumed that bypassing does not impact the critical path.

- Clock period: __ F04 __ ns
- Average cycles per instruction: __ F05 __
- Average number of instructions per nanosecond: __ F06 __

> - F04 = ?
> ==10==
> - F05 = ?
> ==2==
> - F06 = ?
> ==$\frac{1}{20}$==

We have a suspicion that the EXE stage might be the root cause of the limited performance improvement. Consequently, we divide the EXE stage into three separate pipeline stages. The bypassing and branch decision logic that was previously in the EXE stage has now been relocated to the EX3 stage.

![](https://hackmd.io/_uploads/SJBhkQgET.png)

| Logic Block | Propagation Delay |
|:--------- | -------- |
| EX1 | 4ns |
| EX2 | 3ns |
| EX3 | 3ns |

3. Calculate the clock period, average cycles per instruction, and average number of instructions per nanosecond for the test benchmark running on this processor.

- Clock period: __ F07 __ ns
- Average cycles per instruction: __ F08 __
- Average number of instructions per nanosecond: __ F09 __

> - F07 = ?
> ==4==
> - F08 = ?
> ==3.5==
> - F09 = ?
> ==$\frac{1}{14}$==

Subsequently, we discover that the critical path in the EXE stage is primarily attributed to the multiplier employed for the mul instruction.
In response, we separate the multiplier from the EXE stage and run it in parallel, ensuring that non-mul instructions require only one cycle to complete. The branch decision logic, however, remains in the EX stage.

![](https://hackmd.io/_uploads/rJLgbXgV6.png)

The MUX preceding the MEM stage adheres to the following policy: it selects whichever of EX and MUL3 holds a valid instruction. However, if the instructions from both EX and MUL3 are valid, it gives precedence to the one from MUL3 and consequently stalls EX and all preceding pipeline stages.

The timing parameters for the above new processor are shown below:

| Logic Block | Propagation Delay |
|:--------- | -------- |
| EX | 4ns |
| MUL1 | 4ns |
| MUL2 | 3ns |
| MUL3 | 3ns |

4. Determine the clock cycle duration, average instruction cycles, and average instructions executed per nanosecond for the test benchmark when executed on this processor.

- Clock period: __ F10 __ ns
- Average cycles per instruction: __ F11 __
- Average number of instructions per nanosecond: __ F12 __

> - F10 = ?
> ==4==
> - F11 = ?
> ==2==
> - F12 = ?
> ==$\frac{1}{8}$== OR ==$\frac{1}{9}$==

### 📝 Annotate

:::info
F01:
The processor is single-cycle, so every instruction completes in exactly one clock cycle: the 4 loop instructions plus the sw and ret each take 1 cycle, i.e., 6 cycles for 6 instructions. Therefore, the average number of cycles per instruction is 6/6 = 1.

F02:
In a single-cycle processor the clock period must cover the entire combinational path through all logic blocks: IF + DEC + EXE + MEM + WB = 3 + 2 + 10 + 3 + 2 = 20ns.

F03:
With 1 cycle per instruction and a 20ns clock period, the processor completes one instruction every 20ns, i.e., 1/20 instructions per nanosecond.

F04:

![](https://hackmd.io/_uploads/SJ28FOq_T.png)

Once pipelined, the clock only needs to cover the slowest stage. The longest stage is EXE, with a propagation delay of 10ns. Therefore, the clock period is 10ns.
F05:
With the loop taking 8 cycles per iteration of 4 instructions, the average cycles per instruction is 8/4 = 2.

F06:
Since there are 4 instructions in each loop iteration and the clock period is 10ns, each iteration takes 80ns (10ns × 8 clock cycles) to complete. On average, that is 4 instructions / 80ns = 1/20 instructions per ns.

F07:

![](https://hackmd.io/_uploads/S1HV3O5OT.png)

After splitting EXE, the longest stage is EX1, with a propagation delay of 4ns. Therefore, the clock period is 4ns.

F08:
There are 4 instructions in the loop, and it takes 14 cycles to execute one iteration. Therefore, the average cycles per instruction is 14/4 = 3.5.

F09:
The branch decision is finalized when bnez reaches the EX3 stage in cycle 14, leading to the refetching of the lw instruction in cycle 15. With 14 cycles per loop iteration involving 4 instructions, one iteration takes 14 × 4ns = 56ns, so the rate is 4 instructions / 56ns = 1/14 instructions per ns.

F10:

![](https://hackmd.io/_uploads/rk_xMXgVa.png)

The longest stage is EX (tied with MUL1), with a time of 4ns. Thus, the clock period is 4ns.

F11:
There are 4 instructions in the loop, and it takes 8 cycles to execute one iteration. Therefore, the average cycles per instruction is 8/4 = 2.

F12:
Since there are 4 instructions in each loop iteration and the clock period is 4ns, each iteration takes 32ns (4ns × 8 clock cycles). On average, that is 4 instructions / 32ns = 1/8 instructions per ns. Also accepted: 9 cycles per loop iteration, if you assumed that lw could not be fetched until bnez leaves the EX stage. In this case the CPI is 2.25 and the average number of instructions per ns is 1/9.
:::

### Problem G

Chisel is a domain specific language (DSL) implemented using Scala's macro features.
Therefore, all programming related to circuit logic must be implemented using the macro definitions provided by the Chisel library, rather than directly using Scala language keywords. Specifically, when you want to use a constant in a circuit, such as 12, and compare it with a circuit value, you cannot write `if (value == 12)` but instead must write `when(value === 12.U)`. Keeping this fundamental rule in mind is essential to writing valid Chisel code.

> [Chisel Cheatsheet](https://github.com/freechipsproject/chisel-cheatsheet/releases/latest/download/chisel_cheatsheet.pdf)

1. Have you begun your work on Homework3? __ G01 __

> - G01 = ?
> Any answer receives full credit.

Chisel defines a constructor for n-way multiplexors:

```
MuxLookup(index, default, Array(key1->value1, key2->value2, ..., keyN->valueN))
```

Explain:

- The index-to-key match is implemented using the `===` operator.
- Therefore `MuxLookup` would work for any type for which `===` is defined.
- `===` is defined on bundles and vecs, as well as the primitive Chisel types.
- Users can override `===` for their own bundles.

Take a look at the Scala code provided in the instruction decoding section of a RISC-V processor implementation.
```scala
class InstructionDecode extends Module {
  val io = IO(new Bundle {
    val instruction = Input(UInt(Parameters.InstructionWidth))
    val regs_reg1_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val regs_reg2_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val ex_immediate = Output(UInt(Parameters.DataWidth))
    val ex_aluop1_source = Output(UInt(1.W))
    val ex_aluop2_source = Output(UInt(1.W))
    val memory_read_enable = Output(Bool())
    val memory_write_enable = Output(Bool())
    val wb_reg_write_source = Output(UInt(2.W))
    val reg_write_enable = Output(Bool())
    val reg_write_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
  })
  val opcode = io.instruction(6, 0)
  val funct3 = io.instruction(14, 12)
  val funct7 = io.instruction(31, 25)
  val rd = io.instruction(11, 7)
  val rs1 = io.instruction(19, 15)
  val rs2 = io.instruction(24, 20)

  io.regs_reg1_read_address := Mux(opcode === Instructions.lui, 0.U(Parameters.PhysicalRegisterAddrWidth), rs1)
  io.regs_reg2_read_address := rs2
  val immediate = MuxLookup(
    opcode,
    Cat(Fill(20, io.instruction(31)), io.instruction(31, 20)),
    IndexedSeq(
      InstructionTypes.I -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
      InstructionTypes.L -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
      Instructions.jalr -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
      ...
      Instructions.lui -> G02 /* Your answer here! */,
      ...
    )
  )
```

2. Complete the provided code (i.e., `G02`) to ensure it functions as intended.

> - G02 = ?
> ==Cat(io.instruction(31, 12), 0.U(12.W))==

### 📝 Annotate

:::info
io.instruction(31, 12): extracts bits 31 down to 12 from the input instruction, i.e., the 20-bit immediate field of the LUI instruction.

0.U(12.W): a Chisel UInt constant of value 0 with a width of 12 bits, used to zero-fill the lower 12 bits of the immediate.

Cat(...): concatenates the two values above, forming the complete 32-bit immediate value for the LUI instruction.
:::

## Quiz5 of Computer Architecture (2023 Fall)

### Problem A
NASA’s team bid a poignant farewell to the [Opportunity rover](https://en.wikipedia.org/wiki/Opportunity_(rover)), marking the end of its 15-year Martian exploration journey. Opportunity had been a treasure trove of data for Earth’s scientists but began transmitting concerning signals months ago, indicating low battery levels and diminishing light as a massive dust storm engulfed Mars. NASA persistently tried to reestablish contact, hoping to revive the rover. However, when these efforts proved futile, they officially concluded the mission and announced Opportunity’s end. Amidst these events, expressions of homage emerged, including one from a Twitter user, dutchess-becky, who paid tribute in a language befitting the rover: binary code.

![](https://hackmd.io/_uploads/Skeojdcra.jpg)

The message, “01000111 01101111 01100100 01110011 01110000 01100101 01100101 01100100 00100000 01001111 01110000 01110000 01111001,” translates to “Godspeed Oppy” in English. This reflects a sentiment echoed by many in the space community, where “Oppy” is the nickname of the Opportunity rover.

>Reference: [‘Godspeed, Oppy’ - emotional tributes as Mars Opportunity rover pronounced dead](https://www.walesonline.co.uk/news/uk-news/godspeed-oppy-emotional-tributes-mars-15833299)

Let’s create a C program designed to translate messages from binary to plain, human-readable text. The program is stored in a file named ```conv.c```, and its contents are as follows.
```c
#include <stdio.h>
#define BUF_SIZE 64

int main() {
    int stack = 0, buf_idx = 0;
    int c;
    char buf[BUF_SIZE] = {0};
    while ((c = getchar()) != EOF) {
        if ((c & ~1) == '0') {
            A01 /* Fill your code here */;
            if (stack & 0x100) {
                buf[buf_idx++] = stack >> 1;
                A02 /* Fill your code here */;
                buf[buf_idx] = 0;
            }
        } else {
            if (stack) {
                buf[buf_idx++] = stack;
                stack = 0;
            }
            buf[buf_idx++] = c;
            buf[buf_idx] = 0;
        }
        if (buf_idx >= BUF_SIZE - 1) {
            printf("%s", buf);
            buf_idx = 0;
            buf[buf_idx] = 0;
        }
    }
    printf("%s", buf);
    return 0;
}
```

Assuming conv is the built executable from conv.c, the message sent from Earth is:
```c
Dear Opportunity,
01000111 01101111 01100100 01110011 01110000 01100101 01100101 01100100 00100000 01001111 01110000 01110000 01111001
From,
A Martian Fan
```

The decoded message appears as follows.
```c
Dear Opportunity,
G o d s p e e d  O p p y
From,
A Martian Fan
```

To complete the code conv.c as a functional message decoder as described earlier, please insert the necessary statements. Your solutions should **exclusively** utilize the operations ```=```, ```&```, ```|```, ```^```, ```>>```, and ```<<```, along with parentheses ( and ). **Other operators are not permitted**. Additionally, ensure that both A01 and A02 consist of a single statement each, without including a semicolon ```;```. Your code should be written in the most concise manner possible.

>- A01 = ?
>==stack = (stack << 1) | (c & 1)==
>- A02 = ?
>==stack = c & 1==

### 📝 Annotate
:::info
A01 :
(c & 1): This part extracts the least significant bit (rightmost bit) of c.
(stack << 1): This part left-shifts the current value of stack by one position.
(stack << 1) | (c & 1): This part combines the left-shifted stack with the least significant bit of c, placing it at the rightmost position.
Finally, stack = (stack << 1) | (c & 1): This statement updates the stack with the newly constructed value.
A02 : (c & 1): This part still extracts the least significant bit of c stack = c & 1: This statement sets the value of stack to the least significant bit of c. This part handles the last bit of the binary number, adding it to the stack and constructing a complete byte. ::: ### Problem B Evaluate the cache performance (2-way set associative, 8 sets, block size of 2 words) using the given RISC-V assembly code. This program cycles through arrays ```A``` (starting at ```0x130```) and ```B``` (starting at ```0x200```), performing the operation ```A[i] - B[i]``` and storing the result back in ```A[i]```. ```c # Initial register settings: # x1 = 0 (loop index i) # x2 = 4 (size of arrays A and B) . = 0x0 # Code begins at address 0x0 loop: slli x3, x1, 2 # Convert index to byte offset lw x4, 0x130(x3) # Load A[i] lw x5, 0x200(x3) # Load B[i] sub x4, x4, x5 # Calculate A[i] - B[i] sw x4, 0x130(x3) # Store result in A[i] addi x1, x1, 1 # Increment index blt x1, x2, loop # Repeat 4 times unimp # End of this program ``` 1. Begin with an empty cache (all valid bits set to 0) before running the code. This cache employs a least-recently used (LRU) replacement strategy, with all cache lines in Way 0 being the least recently used. After one iteration of the loop (post the first execution of the blt instruction), update the LRU bit, the dirty bit (D), the valid bit (V), the tag, and the data words. Use the opcode for instruction words (like blt) and the array element for data words (like B[0]). It is assumed that if V = 0, then D = 0 as well. | Index | LRU after one loop iteration | |:--------- | ------ | | 0 | __ B01 __ | | 1 | __ B02 __ | | 2 | __ B03 __ | | 3 | __ B04 __ | | 4 | __ B05 __ | | 5 | __ B06 __ | | 6 | __ B07 __ | | 7 | __ B08 __ | >- B01 = ? >==0== >- B02 = ? >==1== >- B03 = ? >==1== >- B04 = ? >==1== >- B05 = ? >==0== >- B06 = ? >==0== >- B07 = ? >==1== >- B08 = ? >==0== 2. 
Complete the following table by noting the counts of instruction hits, instruction misses, data hits, and data misses across each of the four iterations of the loop. I$ = instruction cache; D$ = data cache;

| # Loop iteration | I$ Hits | I$ Misses | D$ Hits | D$ Misses |
|:--------- | ------ |------ |------ |------ |
| First | __ B09 __ |__ B10 __ |__ B11 __ |__ B12 __ |
| Second | __ B13 __ |__ B14 __ |__ B15 __ |__ B16 __ |
| Third | __ B17 __ |__ B18 __ |__ B19 __ |__ B20 __ |
| Fourth | __ B21 __ |__ B22 __ |__ B23 __ |__ B24 __ |

>- B09 = ?
>==3==
>- B10 = ?
>==4==
>- B11 = ?
>==1==
>- B12 = ?
>==2==
>- B13 = ?
>==7==
>- B14 = ?
>==0==
>- B15 = ?
>==3==
>- B16 = ?
>==0==
>- B17 = ?
>==7==
>- B18 = ?
>==0==
>- B19 = ?
>==1==
>- B20 = ?
>==2==
>- B21 = ?
>==7==
>- B22 = ?
>==0==
>- B23 = ?
>==3==
>- B24 = ?
>==0==

3. Calculate the hit ratio for the four iterations of the loop, excluding the execution of the unimp instruction. Your answer can be presented in fractional form.

>- B25 = ?
>==$\frac{32}{40}$==

### 📝 Annotate
:::info
B01 ~ B08 :
With 2-word (8-byte) blocks and 8 sets, an address splits into a 3-bit offset, a 3-bit index (bits [5:3]), and a 26-bit tag. One loop iteration fetches the seven instructions at 0x00–0x18, touching the instruction blocks at 0x00, 0x08, 0x10, and 0x18 (indices 0–3), and performs the data accesses 0x130 (index 6), 0x200 (index 0), and 0x130 again for the store. A first access to a set fills Way 0, which makes Way 1 the least-recently-used way, so indices 1, 2, 3, and 6 end up with LRU = 1. Index 0 is filled twice — the instruction block at 0x00 into Way 0, then the data block at 0x200 into Way 1 — so its LRU bit points back to Way 0 (LRU = 0). Indices 4, 5, and 7 are never accessed and keep LRU = 0. This yields the pattern 0 1 1 1 0 0 1 0 shown in the table.

B09 ~ B24 :
The cache behavior can be determined by analyzing the memory accesses made by the assembly program. Each load (lw) and store (sw) instruction results in a memory access, and the cache is checked for the presence of the required data. If the required data is present in the cache, it is a hit; otherwise, it is a miss.
For each iteration of the loop, we track the memory accesses made by the load and store instructions and determine whether the accessed data is present in the cache. Based on this analysis, we can calculate the number of instruction hits, instruction misses, data hits, and data misses for each iteration.

B25 :
From the previous calculations, we have the following totals:
Total Instruction Hits = 24
Total Instruction Misses = 4
Total Data Hits = 8
Total Data Misses = 4

Using these totals, we can calculate the hit ratio as follows:
Hit Ratio = (Total Instruction Hits + Total Data Hits) / (Total Instruction Hits + Total Instruction Misses + Total Data Hits + Total Data Misses) = (24 + 8) / (24 + 4 + 8 + 4) = 32 / 40

**Ripes simulation**
Index : 8 sets => $\log_2 8 = 3$ bits
Offset : 2 words = 8 bytes => $\log_2 8 = 3$ bits (block offset: 1 bit, byte offset: 2 bits)
Tag : $32 - 3 - 3 = 26$ bits

After one loop iteration :
Data cache :
![image](https://hackmd.io/_uploads/HJVSzmTua.png)
Instruction cache :
![image](https://hackmd.io/_uploads/Bkiwzm6_T.png)

The resulting hit and miss rates are:
Data cache :
![image](https://hackmd.io/_uploads/HJhGsXpup.png)
Instruction cache :
![image](https://hackmd.io/_uploads/H1Jcsmpup.png)

Due to a data dependency, a NOP instruction was inserted, resulting in an additional hit.

After completing the entire loop,
Data cache :
![image](https://hackmd.io/_uploads/SJHf1Ea_6.png)
Instruction cache :
![image](https://hackmd.io/_uploads/BJz7yNTda.png)

The data cache performs as expected, but the instruction cache differs slightly from the expected outcome because of data dependencies and the additional NOP instructions inserted around the branch back to the loop.
:::

### Problem C
In the given problems, you will work with a simplified [vector instruction set architecture](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc). Key terms include ```VL``` (vector length) and ```VE``` (vector element size).
Registers in this ISA are denoted by prefixes: ```v``` for vector, ```r``` for scalar, and ```f``` for mask registers (examples: ```vd```, ```rs1```, ```fs1```). Here are the operations:
- ```vld vd, (rs1)```: Load ```VL``` consecutive values of size ```VE``` from the memory address in register ```R[rs1]``` into vector register ```vd```. Essentially, ```vd``` gets values from ```Mem[R[rs1]]```.
- ```vld vd, vs1```: Load ```VL``` values from memory addresses specified in each element of vector register ```vs1``` into vector register ```vd```. So, ```vd = Mem[vs1]```.
- ```vst vd, (rs1)```: Store ```VL``` consecutive values of size ```VE``` from vector register ```vd``` into the memory address in ```R[rs1]```. Thus, ```Mem[R[rs1]]``` receives values from ```vd```.
- ```vst vd, vs1```: Store values from vector register ```vd``` into memory addresses specified in each element of vector register ```vs1```. Therefore, ```Mem[vs1]``` gets values from ```vd```.
- ```vst vd, (rs1), imm```: Store the first imm consecutive values of size ```VE``` from vector register vd into the memory address in ```R[rs1]```, which means ```Mem[R[rs1]]``` equals ```vd```.
- ```vadd vd, vs2, vs1```: Perform element-wise addition of ```vs1``` to ```vs2```, storing the result in ```vd```. So, ```vd = vs1 + vs2```.
- ```vsub vd, vs2, vs1```: Subtract ```vs2``` from ```vs1``` on an element-wise basis and store in ```vd```, hence ```vd = vs1 - vs2```.
- ```vmul vd, vs2, vs1```: Multiply ```vs1``` by ```vs2``` element-wise, with the result in ```vd```, so ```vd = vs1 * vs2```.
- ```vdiv vd, vs2, vs1```: Divide ```vs1``` by ```vs2``` element-wise, placing the result in ```vd```. Therefore, ```vd = vs1 / vs2```.
- ```vredsum vd, vs1```: Reduce the sum of elements in ```vs1``` into the first element of ```vd```. That is, $vd[0]=\sum_{i=0}^{VL-1}vs1[i]$
- ```vgt vs1, fd```: This operation examines each element in vector register ```vs1```.
If an element is greater than 0, the corresponding bit in the mask register fd is set to 1; otherwise, it is set to 0. In other words, for each index i from 0 to VL−1, ```fd[i]``` becomes 1 if ```vs1[i] > 0```, else it is 0. After executing this instruction, the destination vector register of each subsequent instruction is subject to this mask.

You have a vector code snippet for calculating the dot product of a vector of length VL. The code is as follows:

```c
vld v1, (x1)      # Load vector v1
vld v2, (x2)      # Load vector v2
vmul v3, v2, v1   # Multiply vectors v2 and v1
vredsum v4, v3    # Reduce sum of vector v3
vst v4, (x3), 0   # Store vector v4
```

When computing the number of cycles, consider these assumptions:
- Vector length (VL) is 32 elements.
- There are 8 vector lanes.
- Each vector register holds 32 elements.
- Each lane has one ALU with a 1-cycle latency.
- Each lane has one load/store unit with a 4-cycle latency, and it is fully pipelined.
- There is no dead time.
- The memory system is ideal, with no bank conflicts or cache misses.
- Reduction operations (like ```vredsum```) behave similarly to arithmetic vector instructions (such as ```vmul```).

Each vector instruction goes through fetch (F) and decode (D) phases, then stalls (-) until the necessary vector functional unit becomes available. Without chaining, a dependent vector instruction will stall until the preceding instruction has completed writing back all its elements. Vector instructions are executed in a pipelined manner across all lanes simultaneously. For each element, operands are read (R) from the vector register file, processed in the load/store unit (M) or the ALU (X), and then the result is written back (W) to the vector register file.

1. Determine the total number of cycles required to execute the sequence without vector chaining, excluding the cycle for the final Writeback (W) phase of the store instruction.

>- C01 = ?
>==36==

2. Calculate the total number of cycles required with vector chaining.
In this setup, vector loads and stores use the same load-store unit, while vector arithmetic operations have distinct execution units. Vector chaining operates via the register file, allowing an element to be read (R) in the same cycle as its writeback (W), or in any subsequent cycle, offering flexible chaining. Considering there is only one set of vector lanes (8 lanes) that can perform writeback in a single cycle, provide the cycle count excluding the final Writeback (W) phase of the store instruction.

>- C02 = ?
>==28==

### 📝 Annotate
:::info
C01 :
![](https://hackmd.io/_uploads/SkbJThoBT.png)
Each vector instruction is fetched (F) and decoded (D), and then stalls (-) until its required vector functional unit is available. Without vector chaining, a dependent vector instruction stalls until the previous instruction finishes writing back all of its elements. With 8 lanes, each 32-element instruction issues its elements in 32 / 8 = 4 groups, one group per cycle, so every instruction occupies its functional unit for 4 consecutive cycles. Summing the latencies along the dependence chain, as laid out in the pipeline diagram above, gives a total of 36 cycles.

C02 :
![](https://hackmd.io/_uploads/HyBjanora.png)
With vector chaining, an element can be read (R) in the same cycle in which it is written back (W), or in any later cycle, so a dependent instruction no longer has to wait for its producer to write back all 32 elements. Given that only one set of vector lanes (8 lanes) can write back in a single cycle, following the chained pipeline diagram gives a total of 28 cycles — 8 cycles fewer than without chaining.

**RV32 Vector extension**
Since the QEMU environment couldn't be properly configured, I only wrote a 'possible' rewriting method.
```c
vsetvli t0, t0, e32, m1   # Set the vector length to the maximum
vle32.v v1, (x1)          # Load vector v1
vle32.v v2, (x2)          # Load vector v2
vmul.vv v3, v2, v1        # Multiply vectors v2 and v1
vredsum.vs v4, v3         # Reduce sum of vector v3
vse32.v v4, (x3)          # Store vector v4
```

Environment :
![image](https://hackmd.io/_uploads/Hyy0-YRuT.png)
:::

### Problem D
Complete the table with the hexadecimal tags for each cache line, given these conditions:
- You are working with a 2-way set-associative L1 cache that is initially empty and uses a FIFO (First-In, First-Out) replacement policy.
- Way 0 is filled before Way 1.
- The address structure for the L1 cache is divided as follows: 3-bit tag, 1-bit index, and 4-bit offset.

Cache Access Table

|Address|Set 0, Way 0|Set 0, Way 1|Set 1, Way 0|Set 1, Way 1|Hit (Y/N)|
|:--------- | ------ |------ |------ |------ |------ |
| 0b1000 1000 | __ D01 __ | - | - | - | __ D02 __ |
| 0b0000 0000 | - | __ D03 __ | - | - | __ D04 __ |
| 0b1000 0010 | - | - | - | - | __ D05 __ |
| 0b0101 0010 | - | - | __ D06 __ | - | __ D07 __ |
| 0b0001 0000 | - | __ D08 __ | - | - | __ D09 __ |
| 0b0100 0000 | __ D10 __ | - | - | __ D11 __ | __ D12 __ |
| 0b1000 1101 | - | __ D13 __ | - | - | __ D14 __ |
| 0b0010 0000 | __ D15 __ | - | - | - | __ D16 __ |
| 0b0011 0000 | - | - | __ D17 __ | - | __ D18 __ |

1. Provide the hexadecimal tags, or specify whether each cell from D01 to D18 represents a hit (Y for Yes) or a miss (N for No), as shown in the Cache Access Table.

>- D01 = ?
>==0x4==
>- D02 = ?
>==N==
>- D03 = ?
>==0x0==
>- D04 = ?
>==N==
>- D05 = ?
>==Y==
>- D06 = ?
>==0x2==
>- D07 = ?
>==N==
>- D08 = ?
>==0x0==
>- D09 = ?
>==N==
>- D10 = ?
>==0x2==
>- D11 = ?
>==0x0==
>- D12 = ?
>==N==
>- D13 = ?
>==0x4==
>- D14 = ?
>==N==
>- D15 = ?
>==0x1==
>- D16 = ?
>==N==
>- D17 = ?
>==0x1==
>- D18 = ?
>==N==

2.
Calculate the average memory access time (AMAT) for a program with the following parameters: The L1 cache hit time is 4ns, the L2 cache access time is 20ns, and out of 10 memory accesses, 2 are hits in the L1 cache. All misses in L1 are resolved in the L2 cache. __ D19 __ ns

>- D19 = ?
>==20==

### 📝 Annotate
:::info
D01 ~ D18 :
This is a 2-way set-associative cache with a FIFO (first-in, first-out) replacement policy: Way 0 of a set is filled before Way 1, and on a miss in a full set the way that was filled earliest is evicted. Each 8-bit address splits into a 3-bit tag (bits [7:5]), a 1-bit index (bit [4]), and a 4-bit offset (bits [3:0]). Walking through the accesses:

- 0b1000 1000: tag 0x4, set 0. The set is empty, so this is a miss (D02 = N) and 0x4 fills Set 0, Way 0 (D01 = 0x4).
- 0b0000 0000: tag 0x0, set 0. Miss (D04 = N); 0x0 fills Set 0, Way 1 (D03 = 0x0).
- 0b1000 0010: tag 0x4, set 0. Tag 0x4 is present in Way 0, so this is a hit (D05 = Y).
- 0b0101 0010: tag 0x2, set 1. Miss (D07 = N); 0x2 fills Set 1, Way 0 (D06 = 0x2).
- 0b0001 0000: tag 0x0, set 1. Miss (D09 = N); 0x0 fills the remaining way of set 1 (D08 = 0x0; the same entry also appears as D11 = 0x0 in the Set 1, Way 1 column of the next row).
- 0b0100 0000: tag 0x2, set 0. Miss (D12 = N); set 0 is full, and FIFO evicts the oldest entry (the 0x4 in Way 0), so 0x2 fills Set 0, Way 0 (D10 = 0x2).
- 0b1000 1101: tag 0x4, set 0. Miss (D14 = N), since 0x4 was just evicted; FIFO now evicts Way 1 (0x0, the older entry), so 0x4 fills Set 0, Way 1 (D13 = 0x4).
- 0b0010 0000: tag 0x1, set 0. Miss (D16 = N); FIFO evicts Way 0 (its 0x2 is older than the 0x4 in Way 1), so 0x1 fills Set 0, Way 0 (D15 = 0x1).
- 0b0011 0000: tag 0x1, set 1. Miss (D18 = N); FIFO evicts Way 0 (0x2, the oldest entry in set 1), so 0x1 fills Set 1, Way 0 (D17 = 0x1).
In this way, the Cache Access Table is filled in from the address breakdown and the cache structure.

D19 :
AMAT = HitTime + MissRate × MissPenalty
AMAT = $4\,\text{ns} + \frac{8}{10} \times 20\,\text{ns} = 4 + 16 = 20\,\text{ns}$
:::

### Problem E
Take into account a computer system that uses 256-byte page sizes, 16-bit virtual addresses, and 16-bit Page Table Entries (PTEs), with 4 bits of these PTEs dedicated to validity and protection flags. This computer employs a two-level hierarchical page table system and is byte-addressable.

1. Calculate the total number of virtual pages that this computer can address.
>- E01 = ?
>==256==

2. Determine the largest possible size of physical memory that this computer, with its given specifications, can support. __ E02 __ KiB
>- E02 = ?
>==1024==

3. Estimate the minimum number of valid Page Table Entries (PTEs) and Page Table Pointers (PTPs) required in the page table(s) for a program that is currently utilizing 300 bytes of memory. __ E03 __ PTP(s)
>- E03 = ?
>==1==

### 📝 Annotate
:::info
E01 :
The number of virtual pages is the size of the virtual address space divided by the page size:
Number of Virtual Pages = $\frac{\text{Virtual Address Space}}{\text{Page Size}} = \frac{2^{16}}{2^8} = 2^8 = 256$

E02 :
The size of the physical page number (PPN) is $16 - 4 = 12$ bits. Each PPN addresses a single 256-byte page in memory. Thus, the maximum size of physical memory this machine can address is $2^{12} \times 256 = 2^{20}$ bytes $= 1$ MiB $= 1024$ KiB.

E03 :
First, we calculate the number of pages the program is using. Since the page size is 256 bytes, 300 bytes span $\lceil 300 / 256 \rceil = 2$ pages: 300 bytes do not fit in a single 256-byte page, so at least 2 pages are needed to accommodate the 300 bytes of memory.
This means that at least 2 PTEs must be valid in the page table. Additionally, since the computer uses a two-level hierarchical page table, at least 1 PTP must be valid to point to the second-level page table. Therefore, the smallest possible number of PTEs and PTPs that can be valid in the page table(s) of this program is:
- 2 PTEs
- 1 PTP
:::

### Problem F
Consider the scenario where we are performing vector addition using RISC-V assembly language. The code we have developed is as follows:

```c
    mv x3, x4
loop:
    lw x6, 0(x4)       # Load an element from vector 1
    lw x7, 0x300(x4)   # Load an element from vector 2
    add x7, x7, x6     # Perform addition of the two elements
    sw x7, 0(x4)       # Store the result back into vector 1
    addi x4, x4, 4     # Increment the vector offset
    blt x4, x8, loop   # Continue loop for each vector element
    mv x10, x3
    addi sp, sp, 8     # Here, sp is equivalent to x2
    ret
```

This code is executed on a conventional 5-stage RISC-V processor equipped with full bypassing and branch annulment capabilities. It operates under the assumption that branch predictions are consistently made as not taken, meaning we expect that the branch will not be taken, and branch decisions are finalized during the EXE stage. The loop is designed to run multiple times and is currently in the midst of its execution process.

1. Suppose the instruction lw x6, 0(x4) is fetched at cycle 100. Determine the number of cycles each iteration of the loop requires. Additionally, calculate the number of cycles lost per iteration due to stalls and the number of cycles wasted due to annulments.

- Number of cycles per loop iteration: __ F01 __
- Number of cycles per loop iteration wasted due to stalls: __ F02 __
- Number of cycles per loop iteration wasted due to annulments: __ F03 __

>- F01 = ?
>==10==
>- F02 = ?
>==2==
>- F03 = ?
>==2==

We are aiming to enhance the performance of the processor, particularly noting that the vectors involved in the operations are lengthy, leading to the loop’s branch being taken almost always. To address this, we introduce a branch predictor into the processor. This predictor initially learns the loop’s starting location during the first few iterations. Once it reaches a steady state, it consistently predicts the branch as taken, looping back to the start. During the EXE stage, the processor checks the accuracy of the branch prediction. If the prediction is incorrect, it flushes all instructions fetched since the loop’s beginning from the pipeline. The processor then resumes fetching instructions from the point immediately following the branch instruction. This approach aims to optimize the processor’s performance by reducing unnecessary instruction execution and improving the efficiency of the loop processing.

2. With the implementation of the improved branch prediction, suppose the instruction blt x4, x8, loop is fetched at cycle 200. Calculate the number of cycles saved in each loop iteration as a result of this enhanced branch prediction mechanism.

- Number of cycles saved per loop iteration in steady state: __ F04 __

>- F04 = ?
>==2==

We acknowledge that there are several methods to increase the speed of a computation. Previously, in Part 2, we focused on reducing the cycle count. Now, we shift our focus to making each cycle quicker. Recognizing that excessive bypassing can actually hinder performance by increasing the propagation delay in the circuit, we modify the processor described in Part 1 (which lacked branch prediction). Specifically, we eliminate the bypassing from the Writeback (WB) stage to the Decode (DEC) stage.

3. Determine how many cycles each loop iteration takes after the removal of the bypass from the Writeback (WB) stage to the Decode (DEC) stage.

>- F05 = ?
>==11==

### 📝 Annotate
:::info
F01 ~ F03 :
![](https://hackmd.io/_uploads/HyTi9psHa.png)
First, we calculate the number of cycles required for each loop iteration. Walking the code through the pipeline diagram instruction by instruction shows that each iteration takes 10 cycles. Next, we count the cycles lost to stalls and to annulled instructions: stalls arise from data hazards (the lw results are needed by the following add), while annulments arise because the taken branch is predicted not-taken. From the pipeline diagram, each loop iteration loses 2 cycles to stalls and 2 further cycles to annulments.

F04 :
![](https://hackmd.io/_uploads/HJhAj6oBa.png)
Once the branch predictor reaches steady state, the branch is almost always predicted taken, and the branch decision is still resolved in the EXE stage; on a misprediction, fetch restarts from the instruction after the branch. When the prediction is correct, the two instructions that would previously have been fetched down the not-taken path and then annulled are no longer wasted, so better branch prediction saves 2 cycles per loop iteration.

F05 :
When the bypass from WB to DEC is removed, some results must instead be read through the register file, introducing additional stalls. Each loop iteration then takes 11 cycles.
This is because, with the WB-to-DEC bypass removed, a value being written back in WB can no longer be forwarded to an instruction decoding in the same cycle, so an extra stall is needed and each loop iteration takes one more cycle to complete.

**RV32 Vector extension**
Since the QEMU environment couldn't be properly configured, I only wrote a 'possible' rewriting method.

```c
    mv x3, x4
    vsetvli t0, x8, e32, m1   # Set vector length (assumes x8 holds the element count here)
loop:
    vle32.v v6, (x4)          # Load elements from vector 1
    addi t1, x4, 0x300
    vle32.v v7, (t1)          # Load elements from vector 2
    vadd.vv v7, v7, v6        # Add the two vectors element-wise
    vse32.v v7, (x4)          # Store the result back into vector 1
    slli t2, t0, 2
    add x4, x4, t2            # Advance the pointer by vl elements
    blt x4, x8, loop          # Continue loop for each vector element
    mv x10, x3
    addi sp, sp, 8            # Here, sp is equivalent to x2
    ret
```

Environment :
![image](https://hackmd.io/_uploads/Hyy0-YRuT.png)
:::

### Problem G
Vectorize the given C function using the RISC-V Vector extension, adhering to these guidelines:
- Provide the appropriate parameters for the vsetvli instruction.
- The vector unit is set for SEW=32, accommodating 32-bit elements.
- With LMUL=1, any of the 32 standard vector registers are available for use. Temporary registers can be utilized as needed.
- Assume that the arrays ‘a’, ‘b’, and ‘idx’ do not have any overlapping memory locations (no aliasing).

In the activation function, the parameters are organized as follows: a0, a1, a2, a3 for the first four arguments, and a4, a5 for the remaining two. Here is the function’s definition:

```c
void activation(unsigned int n, int32_t *a, int32_t *b, uint32_t *idx,
                int32_t scale, uint32_t dilation) {
    for (unsigned int i = 0; i < n; i++) {
        if (a[dilation*i] < 0)
            b[idx[i]] = scale * a[dilation * i];
    }
}
```

This assembly code (incomplete yet; fill in G01 to G07) performs vectorized operations for the given function, efficiently utilizing RISC-V vector instructions. The comments are aimed to clarify the purpose and the action of each instruction in the context of vector processing.
```c
# Prologue (if required)
    slli a5, a5, G01          # Calculate stride in bytes
loop:
    vsetvli G02, G03, e32, m1, ta, mu   # Set vector length and type
    # Perform vector operations in this section
    vlse32.v v1, (a1), a5     # Load vector elements from a[] with stride dilation*i
    vmslt.vx v0, v1, x0       # Create a mask where a[dilation*i] is less than 0
    vmul.vx v1, v1, a4, v0.t  # Multiply a[dilation*i] by scale where mask is true
    G04 v2, (a3), v0.t        # Load idx[i] under the mask
    vsll.vi v2, v2, 2, v0.t   # Convert idx values to byte offsets
    G05 v1, (a2), v2, v0.t    # Store results to b[] at byte-offset positions
    # Increment pointers for the next loop iteration
    sub a0, a0, t0            # Decrease loop counter n by the number of processed elements
    mul t1, t0, a5            # Calculate byte offset for the next a[] elements
    slli t0, t0, G06          # Convert vector length to byte offset for idx[]
    add a3, a3, t0            # Move idx pointer forward by processed elements
    add a1, a1, t1            # Move a pointer forward accounting for dilation
    # Loop back if there are more elements to process
    G07
end:
    ret                       # End of function
```

>- G01 = ?
>==2==
>- G02 = ?
>==t0==
>- G03 = ?
>==a0==
>- G04 = ?
>==vle32.v==
>- G05 = ?
>==vsoxei32.v==
>- G06 = ?
>==2==
>- G07 = ?
>==bnez a0, loop==

### 📝 Annotate
:::info
G01 : the shift amount for the slli instruction, which converts the element stride dilation (in a5) into a byte stride. Each element is 4 bytes, so G01 is 2: shifting left by 2 multiplies a5 by 4.
G02 : the destination register of the vsetvli instruction, which receives the number of elements to be processed this iteration — here t0.
G03 : the source register of the vsetvli instruction, holding the requested vector length. Here it is a0, which holds the remaining element count n.
G04 : the masked unit-stride vector load used to fetch elements of idx[i] under the mask v0.t, i.e., vle32.v.
G05 : the indexed (scatter) vector store that writes the results to b[] at the byte offsets held in v2, i.e., vsoxei32.v.
G06 : the shift amount for the slli that converts the element count in t0 into a byte offset for the 32-bit idx[] array — again 2, i.e., multiply by 4.
G07 : the branch instruction bnez a0, loop (branch if not equal to zero), which loops back while elements remain to be processed.

**RV32 Vector extension**
Since the QEMU environment couldn't be properly configured, I only wrote a 'possible' rewriting method.

```c
# Prologue (if required)
    slli a5, a5, 2            # Calculate stride in bytes
loop:
    vsetvli t0, a0, e32, m1, ta, mu     # Set vector length and type
    # Perform vector operations in this section
    vlse32.v v1, (a1), a5     # Load vector elements from a[] with stride dilation*i
    vmslt.vx v0, v1, x0       # Create a mask where a[dilation*i] is less than 0
    vmul.vx v1, v1, a4, v0.t  # Multiply a[dilation*i] by scale where mask is true
    vle32.v v2, (a3), v0.t    # Load idx[i] under the mask
    vsll.vi v2, v2, 2, v0.t   # Convert idx values to byte offsets
    vsoxei32.v v1, (a2), v2, v0.t   # Store results to b[] at byte-offset positions
    # Increment pointers for the next loop iteration
    sub a0, a0, t0            # Decrease loop counter n by the number of processed elements
    mul t1, t0, a5            # Calculate byte offset for the next a[] elements
    slli t0, t0, 2            # Convert vector length to byte offset for idx[]
    add a3, a3, t0            # Move idx pointer forward by processed elements
    add a1, a1, t1            # Move a pointer forward accounting for dilation
    # Loop back if there are more elements to process
    bnez a0, loop
end:
    ret                       # End of function
```

Environment :
![image](https://hackmd.io/_uploads/Hyy0-YRuT.png)
:::

### Problem H
Consider the design of a basic 3-stage pipeline processor, which includes the following stages:
- Instruction fetch and decode;
- Execution;
- Memory access and write-back.

![](https://hackmd.io/_uploads/SJLc9Corp.png)

This pipeline lacks bypassing capabilities. Branches are uniformly predicted as not taken, with the option for branch annulment present. Branch decision-making occurs during the execution stage. Additionally, the processor is characterized by specific propagation delays for each segment of its combinational logic.
| stage | propagation delay |
|:--------- | ------ |
| IF | 3 ns |
| DEC | 4 ns |
| EXE | 6 ns |
| MEM | 5 ns |
| WB | 3 ns |

Our recent interest has been the Fibonacci sequence. To indulge it, we have crafted a loop in RISC-V assembly language that computes the first 1000 numbers of the Fibonacci sequence, as our patience does not extend beyond calculating 1000 numbers. These computed values are then stored in memory.

```c
    addi t0, zero, 1000  # Set counter t0 to 1000
    addi a1, zero, 1     # Initialize fib(1) with value 1
    sw   a1, 0(t1)       # Store fib(1) at the base memory address in t1
    addi a2, zero, 1     # Initialize fib(2) with value 1
    addi t1, t1, 4       # Increment memory address pointer t1
    sw   a2, 0(t1)       # Store fib(2) at the new memory address
fib:
    add  a3, a2, a1      # Calculate fib(n) as sum of fib(n-1) and fib(n-2)
    addi t1, t1, 4       # Increment memory address pointer t1
    sw   a3, 0(t1)       # Store fib(n) at the new memory address
    mv   a1, a2          # Update fib(n-2) with value of fib(n-1)
    mv   a2, a3          # Update fib(n-1) with value of fib(n)
    addi t0, t0, -1      # Decrement the counter t0
    bnez t0, fib         # If t0 is not zero, loop back to 'fib'
    xor  t2, t3, t4
    not  t5, t3
    and  t6, t3, t4
```

1. Consider that during cycle 100, the instruction `add a3, a2, a1` is fetched, with the registers a2 and a1 already containing current data. Calculate the total number of cycles required for each iteration of the loop, the number of cycles lost in each iteration due to stalls, and the number of cycles lost due to annulments. Then determine the shortest possible clock period for this processor and the time taken to complete one iteration of the loop.

- Number of cycles per loop iteration: __ H01 __
- Cycles lost to stalls per iteration: __ H02 __
- Cycles lost to annulments per iteration: __ H03 __
- Processor minimum clock period: __ H04 __ ns
- Time to execute 1 iteration of the loop: __ H05 __ ns

>- H01 = ?
>==12==
>- H02 = ?
>==4==
>- H03 = ?
>==1==
>- H04 = ?
>==8==
>- H05 = ?
>==96==

We then opt to enhance our pipeline by incorporating bypass and branch prediction features.
The pipeline now includes complete bypassing to IFDEC, meaning data can bypass directly from the EXE stage to IFDEC and from the MEMWB stage to IFDEC. Furthermore, branch instructions are now predicted as taken, the option for branch annulment is still available, and branch decision-making still occurs in the EXE stage. In addition, we integrate bypass and prediction logic, which introduces specific propagation delays as follows:

| stage | propagation delay |
|:--------- | ------ |
| IF | 3 ns |
| DEC | 4 ns |
| EXE | 6 ns |
| MEM | 5 ns |
| WB | 3 ns |
| PRD | 1 ns |
| BYP | 1 ns |

2. Consider that during the 100th cycle, the instruction `add a3, a2, a1` is fetched, with the registers a2 and a1 already containing the latest values. Determine the number of cycles required for each loop iteration. Taking into account the specified delays for the bypass and branch prediction mechanisms, calculate the shortest possible clock period for this processor and the duration of one loop iteration.

- Number of cycles per loop iteration: __ H06 __
- Processor minimum clock period: __ H07 __ ns
- Time to execute 1 iteration of the loop: __ H08 __ ns

>- H06 = ?
>==7==
>- H07 = ?
>==15==
>- H08 = ?
>==105==

### 📝 Annotate
:::info
H01 ~ H03 :
![](https://hackmd.io/_uploads/SJjQARiHT.png)
First, we calculate the number of cycles required for each loop iteration. Based on the code and pipeline diagram provided in the question, we can count the execution cycles in the pipeline instruction by instruction. According to the calculations, each loop iteration requires 12 cycles.

Second, we calculate the number of cycles wasted due to stalls and annulled instructions. In this problem, stalls and annulments arise from data hazards and branch instructions. By examining the pipeline diagram, we can determine the number of cycles wasted on each in every loop iteration.
According to the calculations, each loop iteration loses 4 cycles to stalls and 1 cycle to an annulled instruction.

H04 ~ H05 :
Because the 3-stage pipeline combines memory access and write-back into a single stage, that stage's delay is the sum of the MEM and WB delays: 5 ns + 3 ns = 8 ns. This is longer than the other stages (IF + DEC = 7 ns, EXE = 6 ns), so the minimum clock period of the processor is 8 ns.

Next, we compute the time required for one loop iteration. Each iteration requires 12 clock cycles, so the time per iteration is 12 × 8 ns = 96 ns.

H06 :
![](https://hackmd.io/_uploads/ry6xc2o_T.png)
Assume that at the 100th cycle, the `add a3, a2, a1` instruction has been fetched and the values of registers a2 and a1 have been updated. Based on the information provided, we can fill out the pipeline diagram cycle by cycle and mark all bypass paths where needed. According to the completed diagram, each loop iteration takes 7 cycles.

H07 :
The minimum clock period of the processor is now set by the critical path through the EXE, PRD, IF, DEC, and BYP logic. According to the information provided, the total delay of these stages is 6 + 1 + 3 + 4 + 1 = 15 ns.

H08 :
Each loop iteration requires 7 clock cycles, so the time per iteration is 7 × 15 ns = 105 ns.
:::

### Problem I

Examine the given program:

```c
int A[8][8];
int B[8][8];
int C[8][8][8];

for (int i = 0; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
        for (int k = 0; k < 8; k++)  /* Inner "k" loop */
            C[i][j][k] = A[i][j] + B[j][k];
    }
}
```

Now, consider a data cache setup that is three-way set associative with 4 sets, where each block contains 4 words. This cache exclusively stores data, not instructions.
The cache is divided such that accesses to array `A` are mapped to way 0, accesses to array `B` to way 1, and accesses to array `C` to way 2. Remember that arrays are stored in memory in row-major order. For array `C`, this implies that elements with varying `k` values for each `C[i][j]` are stored consecutively, followed by subsequent `j` values and their respective `k` elements.

1. During the execution of the innermost "k" loop, determine the quantity of distinct elements from each array that are being accessed.

- Elements of A: __ I01 __
- Elements of B: __ I02 __
- Elements of C: __ I03 __

>- I01 = ?
>==1==
>- I02 = ?
>==8==
>- I03 = ?
>==8==

2. Calculate the proportion of memory accesses to elements in array C that result in cache misses throughout the execution of the entire program. Express this as a fraction.

- Miss Rate: __ I04 __

>- I04 = ?
>==$\frac{1}{4}$==

3. Determine the ratio of cache misses to total memory accesses for elements in array A over the course of the entire program, and express this as a fraction.

- Miss Rate: __ I05 __

>- I05 = ?
>==$\frac{1}{32}$==

Think about rearranging the loop order by swapping the positions of the 'j' and 'k' loops, as shown below:

```c
int A[8][8];
int B[8][8];
int C[8][8][8];

for (int i = 0; i < 8; i++) {
-   for (int j = 0; j < 8; j++) {
-       for (int k = 0; k < 8; k++)  /* Inner "k" loop */
+   for (int k = 0; k < 8; k++) {
+       for (int j = 0; j < 8; j++)  /* Inner "j" loop */
            C[i][j][k] = A[i][j] + B[j][k];
    }
}
```

Proceed to address the subsequent questions using the same data cache configuration as previously mentioned.

4. Calculate the proportion of memory accesses to elements in array C that result in cache misses throughout the entire program, expressed as a fraction.

- Miss Rate: __ I06 __

>- I06 = ?
>==1==

5. What fraction of the accesses to elements of A are cache misses for the entire program?

- Miss Rate: __ I07 __

>- I07 = ?
>==$\frac{1}{32}$==

### 📝 Annotate
:::info
I01 : In the inner "k" loop, only one element of array A is accessed, because the same A[i][j] is read on every iteration of the "k" loop.

I02 : In the inner "k" loop, 8 distinct elements of array B are accessed, because B[j][k] is read with a different k on each of the 8 iterations.

I03 : In the inner "k" loop, 8 distinct elements of array C are accessed, because C[i][j][k] is written with a different k on each of the 8 iterations.

I04 : Each access to C touches a new element not seen before. C is traversed in row-major order, so the next element of C accessed is directly adjacent in memory. This means that when a block is brought in from memory on a miss, it also contains the next 3 elements C will access. Thus only one in four accesses is a miss, as each miss is followed by 3 hits.

Order of the C[i,j,k] array in its cache way for i = 0, j = 0 and j = 1:
C[0,0,0] … C[0,0,3]
C[0,0,4] … C[0,0,7]
C[0,1,0] … C[0,1,3]
C[0,1,4] … C[0,1,7]

I05 : A is traversed in row-major order. The same element of A is accessed 8 times in the inner k loop. It will not be evicted from the cache, since it is accessed in every calculation and so is never the LRU block. Thus, once a block of A is brought in, its other three elements stick around for the subsequent inner k loops. One initial miss on an element of A therefore covers four inner k loops, or 32 total accesses to A.

Order of the A[i][j] array in its cache way for i = 0 and i = 1:
A[0,0] … A[0,3]
A[0,4] … A[0,7]
A[1,0] … A[1,3]
A[1,4] … A[1,7]

I06 : C is now traversed in column order. When C[i,0,0] is brought into the cache, so are C[i,0,1–3]. However, the inner loop is now j, so the next access is C[i,1,0], which is not in the cache; it is brought into set 2.
Then C[i,2,0] is brought into set 0 and evicts C[i,0,0], so by the time you get to k = 1, C[i,0,1] has already been evicted from the cache. This means you never get to take advantage of the spatial locality, and every single access to C misses.

I07 : A continues to be accessed in row-major order. You first bring A[0,0–3] into set 0 and A[0,4–7] into set 1, so out of the 8 iterations of the inner loop j, you miss twice and hit 6 times. For the next 7 values of k you get all hits. This gives a miss rate of 2/64, or 1/32. Once you move to the next value of i, the process repeats itself.
:::

### Problem J

Imagine a Chisel module designed to calculate the Greatest Common Divisor (GCD) using the subtraction method. This approach involves repeatedly subtracting the smaller number from the larger one until the value in register y becomes zero. At that point, the value remaining in register x represents the GCD.

```c
package gcd

import chisel3._

class GCD extends Module {
  val io = IO(new Bundle {
    val value1        = Input(UInt(16.W))
    val value2        = Input(UInt(16.W))
    val loadingValues = Input(Bool())
    val outputGCD     = Output(UInt(16.W))
    val outputValid   = Output(Bool())
  })

  val x = Reg(UInt())
  val y = Reg(UInt())

  when(x > y) {
    J01 /* Fill in your answer here */
  } .otherwise {
    J02 /* Fill in your answer here */
  }

  when(io.loadingValues) {
    x := io.value1
    y := io.value2
  }

  io.outputGCD := x
  io.outputValid := y === 0.U
}
```

Finalize the provided code to ensure it performs as intended.

>- J01 = ?
>==x := x - y==
>- J02 = ?
>==y := y - x==

### 📝 Annotate
:::info
J01 : When x is greater than y, y is subtracted from x. This follows the subtraction method: the smaller value is repeatedly subtracted from the larger one.

J02 : When x is not greater than y, x is subtracted from y. Repeated subtraction shrinks y toward zero; once y reaches zero, outputValid is asserted and x holds the GCD.
:::