Solutions
A
Take a look at the circuit diagram provided below, along with the associated delays of its components:
component | delays |
---|---|
tAND | 20ps |
tNOT | 5ps |
tOR | 20ps |
tXOR | 25ps |
tmultiplier | 50ps |
tsubtract | 35ps |
tclk-q | 5ps |
tsetup | 2ps |
Assuming that this circuit exclusively operates on two's complement integers, and given that the subtractor component has a delay of 35ps, what is the maximum delay permissible for the adder component while allowing us to replace the subtractor component with adders, NOT gates, and constants to achieve the same delay as the subtractor while maintaining the same behavior? It can be assumed that constants introduce no delay.
As a refresher, the subtract component performs the following operation:
output = top input - bottom input
- A01 = ?
35ps
Since we are dealing with two’s complement numbers, subtracting by x is equivalent to adding ∼ x + 1, where ∼ x flips all of the bits of x. As a result, we can chain together a NOT gate, an adder with a constant 1, and another adder (to add the output of the previous adder and the top input of the subtractor) to achieve the same behavior.
The intent of the question is for students to realize that the NOT gate and the first adder does not actually add any additional delay, since the multiplier/NOT gate combo of the top input of the subtractor takes more time than the NOT gate/adder combo for the bottom input.
Therefore, the adder can have a delay of 35ps (the same as the existing subtractor) for the circuit to maintain the same timing behavior.
B
Examine the sequential circuit provided below, along with the associated timing specifications. Registers R1, R2, and R3 are synchronized by a shared clock signal, while A, B, and C represent combinational logic circuits.
- | tpd | tcd | tsetup | thold |
---|---|---|---|---|
Register (R1, R2, R3) | 5ns | 1ns | 7ns | 2ns |
Circuit A | 3ns | ? | - | - |
Circuit B | 4ns | 2ns | - | - |
Circuit C | 6ns | 2.5ns | - | - |
What is tpd of this sequential circuit? __ B01 __ ns
- B01 = ?
5
What is tcd of this sequential circuit? __ B02 __ ns
- B02 = ?
1
What is the minimum value of tcd for Circuit A that will ensure proper operation of this circuit? __ B03 __ ns
- B03 = ?
1
tcd,R1 + tcd,A >= thold,R2
1 + tcd,A >= 2
tcd,A >= 1ns
What is the minimum clock period that can be employed to clock all the registers within the circuit? __ B04 __ ns
- B04 = ?
19
tCLK >= tpd,R1 + tpd,A + tpd,B + tsetup,R3
tCLK >= 5 + 3 + 4 + 7
tCLK >= 19ns
Available from the supplier are alternative circuits for A, B, and C, each with the following specifications.
- | tpd | tcd | tsetup | thold |
---|---|---|---|---|
A-New | 1ns | 0.5ns | - | - |
B-New | 2.5ns | 2ns | - | - |
C-New | 2ns | 1ns | - | - |
Replacement of these new circuits is cost-sensitive, allowing for the substitution of only one combinational circuit to achieve a minimized clock period. Please specify which combinational circuit you intend to replace and the resulting minimum clock period.
- B05 = ?
B- B06 = ?
18
To minimize tCLK >= tpd,R1 + tpd,A + tpd,B + tsetup,R3, we want to replace A or B.
Can’t replace A with A-New because contamination delay of 0.5 would not satisfy the hold time for R2. So, we want to replace B since it is the other component in the critical path. Replacing B reduces the R1-R3 path to 5 + 3 + 2.5 + 7 = 17.5ns. However, the critical path is now the R2-R1 path which is 5 + 6 + 7 = 18.
C
Consider a module called 'F' with three inputs (X, Y, and Z) and one output (A).
The circuit is known to function but has suboptimal throughput. To address this issue, you opt for pipelining the circuit. For each of the following questions, design a valid K-stage pipeline for the given circuit. Each component in the circuit is labeled with its propagation delay in nanoseconds. Provide the latency and throughput of each design, assuming ideal registers (i.e., tPD=0, tSETUP=0). Remember that according to our convention, a pipeline register is placed at each output.
Consider a 2-stage pipeline optimized for maximum throughput while utilizing the fewest possible registers. Determine the latency and throughput of the resulting circuit, taking careful note of the direction of each arrow.
- C01 = ?
30- C02 = ?
Consider a maximum-throughput 4-stage pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit?
- C03 = ?
32- C04 = ?
Consider a maximum-throughput pipeline that uses a minimum number of pipeline stages. What are the latency and throughput of the resulting circuit?
- C05 = ?
42- C06 = ?
D
Examine the processor implementation designed for single-cycle operation.
Below, you can find the timing specifications for each of the components:
Component | Propagation delay (tPD) |
---|---|
Register | 1ns |
Decoder | 2ns |
RegFile read | 3ns |
MUX | 1ns |
ALU | 4ns |
Adder | 3ns |
Memory read (instruction or data) |
4ns |
Setup/hold times for clocked inputs (registers and writes to RegFile and data memory)
What is the shortest clock cycle duration achievable for this processor? Minimum tCLK: __ D01 __ ns
- D01 = ?
22
Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> DMem (read) -> WDSEL mux -> RF(write) => tCLK >= 1 + 4 + 2 + 3 + 1 + 4 + 4 + 1 + 2 (RF setup)
The current processor's performance is suboptimal. We intend to introduce an alternative data path (while keeping the control logic unchanged) in which the data memory's Adr input is sourced differently. What is the minimum clock cycle time of the new processor? Minimum tCLK: __ D02 __ ns
- D02 = ?
18
Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> WDSEL mux -> RF(write) => tCLK >= 1 + 4 + 2 + 3 + 1 + 4 + 1 + 2 (RF setup)
Now, the new processor executes certain instructions incorrectly as per the RISC-V ISA. Provide an example of such an instruction, lw a0, 8(a1)
, and write its equivalent RISC-V instruction. __ D03 __
- D03 = ?
lw a0, 0(a1)
The same lw or sw instruction with the offset set to 0.
The program provided requires 90 instructions to complete execution on the original processor. Unfortunately, when run on the new processor, it yields incorrect results. Please make the necessary modifications to the program to ensure it operates correctly on the new processor.
Additionally, how many instructions does your assembly code execute? __ D07 __
- D04 = ?
lw a2, 0(a0)
- D05 = ?
sw a1, 0(a0)
- D06 = ?
ble a0, a7, loop
- D07 = ?
77
- D08 = ?
1694
- D09 = ?
1386
E
Take a look at the loop provided below, which calculates the sum of values in a linked list. We execute this code on a typical 5-stage RISC-V processor equipped with complete bypassing. Assume that branches are consistently predicted as not taken, and that branch determinations are made during the EXE stage. Also, consider that the loop undergoes numerous iterations and is presently in the middle of its execution.
lw a1, 0(a0)
instruction is fetched. Fill in the following:
- E01 = ?
7- E02 = ?
1- E03 = ?
2
We aim to enhance the performance of this processor. Upon careful examination, we have identified an opportunity to save cycles by introducing additional bypass paths. Contrary to its label, "full bypassing" does not fully exploit the potential for bypassing. It primarily bypasses data to the DEC stage alone. Occasionally, instructions do not require their inputs in the EXE stage but rather at a later stage, and stalling at the DEC stage leads to unnecessary cycle wastage. To address this issue, we suggest a bypassing scheme named "extra bypassing." In extra bypassing:
The diagram below shows the bypass paths in full vs. extra bypassing.
- E04 = ?
WB->DEC- E05 = ?
WB->EXE
To harness the benefits of extra bypassing more effectively, we introduce a 5-stage pipeline structure as illustrated below, with the ALU positioned in the fourth stage:
lw a1, 0(a0)
instruction is fetched. Fill in the following: (Remember that the Execution module resides within the EXM stage of the pipeline.)
- E06 = ?
7- E07 = ?
0- E08 = ?
3
F
We are in the process of developing our custom RISC-V processor, which includes support for integer multiplication. This processor incorporates a mul
instruction designed to multiply two source registers and generate a 32-bit result. To assess the functionality of our processor, we employ the following program for testing purposes. It is important to note that the branch instruction is presumed to be taken, and all instructions following the 'ret' instruction are NOPs.
In the initial phase, we attempt to construct a single-cycle processor, which is depicted below.
Consider that the processor is equipped with perfect registers (with no setup or hold times, and zero propagation and clock-to-q delays), and all memory operations can be completed within a single cycle (combinational reads). Utilize the provided table to calculate the minimum clock cycle duration, the average instruction cycle count, and assess the overall processor performance in terms of average instructions executed per nanosecond while running the specified test code.
Logic Block | Propagation Delay |
---|---|
IF | 3ns |
DEC | 2ns |
EXE | 10ns |
MEM | 3ns |
WB | 2ns |
- F01 = ?
1- F02 = ?
20- F03 = ?
It is evident that pipelining can be employed to enhance clock frequency. We have the option to enhance the initial design, as depicted in the subsequent diagram. In this scenario, we assume full bypassing is implemented, and branch decisions are rendered during the EXE stage. The IF stage, in a speculative manner, retrieves the instruction located at PC+4 and invalidates incorrectly fetched instructions in the event of a branch misprediction.
We employ the identical test program to evaluate the performance of the suggested processor.
- F04 = ?
10- F05 = ?
2- F06 = ?
8 cycles per loop iteration (4 instructions), so average cycles per instruction = 2.
4 instr/80 ns = 1/20 instr/ns.
We have a suspicion that the EXE stage might be the root cause of the limited performance improvement. Consequently, we divide the EXE stage into three separate pipeline stages. The bypassing and branch decision logic that was previously in the EXE stage has now been relocated to the EX3 stage.
Logic Block | Propagation Delay |
---|---|
EX1 | 4ns |
EX2 | 3ns |
EX3 | 3ns |
- F07 = ?
4- F08 = ?
3.5- F09 = ?
Branch decision is made when bnez gets to EX3 in cycle 14, so lw is fetched again in cycle 15.
14 cycles per loop iteration (4 instructions), so average cycles per instruction = 14/4 = 3.5
4 instr/56 ns = 1/14 instr/ns.
Subsequently, we discover that the critical path in the EXE stage is primarily attributed to the multiplier employed for the mul
instruction. In response, we separate the multiplier from the EXE stage and run it in parallel, ensuring that non-mul
instructions require only one cycle to complete. The branch decision logic, however, remains in the EX stage.
The MUX preceding the MEM stage adheres to the following policy: it prioritizes selecting the valid instruction from EX over MUL3. However, if both instructions from EX and MUL3 are valid, it gives precedence to the one from MUL3 and consequently stalls EX and all preceding pipeline stages.
The timing parameters for the above new processor are shown below:
Logic Block | Propagation Delay |
---|---|
EX | 4ns |
MUL1 | 4ns |
MUL2 | 3ns |
MUL3 | 3ns |
- F10 = ?
4- F11 = ?
2- F12 = ?
OR
8 cycles per loop iteration (4 instructions), so average cycles per instruction = 8/4 = 2
4 instr/32 ns = 1/8 instr/ns.
Also accepted 9 cycles per loop iteration if you thought lw could not be fetched until bnez leaves EX stage.
In this case CPI is 2.25 and average number of instrucations per ns is 1/9.
G
Chisel is a domain specific language (DSL) implemented using Scala's macro features. Therefore, all programming related to circuit logic must be implemented using the macro definitions provided in the Chisel library, rather than directly using Scala language keywords. Specifically, when you want to use a constant in a circuit, such as 12, and compare it with a circuit value, you cannot write if (value == 12)
but instead must write when(value === 12.U)
. Memorizing this fundamental concept is essential to correctly write valid Chisel code.
- G01 = ?
只要作答,就給分
Chisel defines a constructor for n-way multiplexors
Explain
===
operator.MuxLookup
would work for any type for which ===
is defined.===
is defined on bundles and vecs, as well as the primitive Chisel types.===
for their own bundles.Take a look at the Scala code provided in the instruction decoding section of a RISC-V processor implementation.
G02
) to ensure it functions as intended.
- G02 = ?
Cat(io.instruction(31, 12), 0.U(12.W))