---
tags: computer-arch
---

# Quiz4 of Computer Architecture (2023 Fall)

> Solutions

## Problem `A`

Take a look at the circuit diagram provided below, along with the associated delays of its components:

![circuit diagram](https://hackmd.io/_uploads/HyopeegVT.png)

| component | delays |
|:----------|:-------|
| t~AND~ | 20ps |
| t~NOT~ | 5ps |
| t~OR~ | 20ps |
| t~XOR~ | 25ps |
| t~multiplier~ | 50ps |
| t~subtract~ | 35ps |
| t~clk-q~ | 5ps |
| t~setup~ | 2ps |

Assume that this circuit operates exclusively on two's complement integers and that the subtractor component has a delay of 35ps. We want to replace the subtractor with adders, NOT gates, and constants while preserving both the behavior and the overall delay of the circuit. What is the maximum delay permissible for the adder component? It can be assumed that constants introduce no delay. As a refresher, the subtract component performs the following operation: output = top input - bottom input

> * A01 = ?
> ==35ps==
> Since we are dealing with two's complement numbers, subtracting x is equivalent to adding ∼x + 1, where ∼x flips all of the bits of x. As a result, we can chain together a NOT gate, an adder with a constant 1, and another adder (to add the output of the previous adder to the top input of the subtractor) to achieve the same behavior.
> The intent of the question is for students to realize that the NOT gate and the first adder do not actually add any delay, since the multiplier/NOT gate combination on the top input of the subtractor takes more time than the NOT gate/adder combination on the bottom input. Therefore, the adder can have a delay of 35ps (the same as the existing subtractor) and the circuit maintains the same timing behavior.
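The identity behind this answer is easy to check in software. Below is a small Scala sketch (an editorial aside, not part of the original quiz) that verifies a - b = a + ∼b + 1 on two's complement `Int` values, mirroring the NOT gate, the +1 adder, and the final adder described above.

```scala
// Check that a - b == a + ~b + 1 for two's complement integers, which is why
// the subtractor can be rebuilt from a NOT gate, an adder with constant 1,
// and a second adder.
object TwosComplementCheck extends App {
  val samples = Seq((7, 3), (-5, 12), (Int.MinValue, 1), (0, -1))
  for ((a, b) <- samples) {
    val viaAdders = a + ~b + 1   // NOT gate, +1 adder, final adder
    assert(a - b == viaAdders)
    println(s"$a - $b = ${a - b} = $a + ~$b + 1 = $viaAdders")
  }
}
```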
---

## Problem `B`

Examine the sequential circuit provided below, along with the associated timing specifications. Registers R1, R2, and R3 are synchronized by a shared clock signal, while A, B, and C represent combinational logic circuits.

![sequential circuit](https://hackmd.io/_uploads/BJoRxlgVT.png)

| - | t~pd~ | t~cd~ | t~setup~ | t~hold~ |
|---|:-----:|:-----:|:--------:|:-------:|
| Register (R1, R2, R3) | 5ns | 1ns | 7ns | 2ns |
| Circuit A | 3ns | ? | - | - |
| Circuit B | 4ns | 2ns | - | - |
| Circuit C | 6ns | 2.5ns | - | - |

1. What is t~pd~ of this sequential circuit? __ B01 __ ns

> * B01 = ?
> ==5==

2. What is t~cd~ of this sequential circuit? __ B02 __ ns

> * B02 = ?
> ==1==

3. What is the minimum value of t~cd~ for Circuit A that will ensure proper operation of this circuit? __ B03 __ ns

> * B03 = ?
> ==1==
> t~cd,R1~ + t~cd,A~ >= t~hold,R2~
> 1 + t~cd,A~ >= 2
> t~cd,A~ >= 1ns

4. What is the minimum clock period that can be employed to clock all the registers within the circuit? __ B04 __ ns

> * B04 = ?
> ==19==
> t~CLK~ >= t~pd,R1~ + t~pd,A~ + t~pd,B~ + t~setup,R3~
> t~CLK~ >= 5 + 3 + 4 + 7
> t~CLK~ >= 19ns

5. Available from the supplier are alternative circuits for A, B, and C with the following specifications.

| - | t~pd~ | t~cd~ | t~setup~ | t~hold~ |
|---|:-----:|:-----:|:--------:|:-------:|
| A-New | 1ns | 0.5ns | - | - |
| B-New | 2.5ns | 2ns | - | - |
| C-New | 2ns | 1ns | - | - |

Replacement is cost-sensitive, so only one combinational circuit may be substituted to minimize the clock period. Please specify which combinational circuit you intend to replace and the resulting minimum clock period.

* Combinational circuit replaced: __ B05 __
* t~CLK~ of circuit: __ B06 __ ns

> * B05 = ?
> ==B==
> * B06 = ?
> ==18==
> To minimize t~CLK~ >= t~pd,R1~ + t~pd,A~ + t~pd,B~ + t~setup,R3~, we want to replace A or B.
> We cannot replace A with A-New because the resulting contamination delay (t~cd,R1~ + t~cd,A-New~ = 1 + 0.5 = 1.5ns) would not satisfy the 2ns hold time of R2. So we replace B, the other component on the critical path. Replacing B reduces the R1-A-B-R3 path to 5 + 3 + 2.5 + 7 = 17.5ns. However, the critical path is now the R2-C-R1 path, which is 5 + 6 + 7 = 18ns.
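For readers who want to replay the arithmetic, here is a minimal Scala sketch (an editorial illustration using the timing tables above, not part of the quiz) that re-derives B03, B04, and B06 from the hold-time and setup-time constraints.

```scala
// Editorial sketch replaying the Problem B timing arithmetic: the hold-time
// constraint on the R1 -> A -> R2 path and the setup-time constraints on the
// two register-to-register paths.
object ProblemBTiming extends App {
  val (tPdReg, tCdReg, tSetup, tHold) = (5.0, 1.0, 7.0, 2.0)
  val (tPdA, tPdB, tPdC, tPdBNew) = (3.0, 4.0, 6.0, 2.5)

  // B03 -- hold: t_cd,R1 + t_cd,A >= t_hold,R2
  val minCdA = tHold - tCdReg                              // 1 ns

  // B04 -- setup: t_CLK >= the longest register-to-register path
  val pathR1AB = tPdReg + tPdA + tPdB + tSetup             // 19 ns
  val pathR2C  = tPdReg + tPdC + tSetup                    // 18 ns
  val minClk   = pathR1AB max pathR2C                      // 19 ns

  // B06 -- after swapping in B-New, the R2 -> C -> R1 path dominates
  val minClkBNew = (tPdReg + tPdA + tPdBNew + tSetup) max pathR2C  // 18 ns

  println(s"B03 = $minCdA ns, B04 = $minClk ns, B06 = $minClkBNew ns")
}
```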
---

## Problem `C`

Consider a module called 'F' with three inputs (X, Y, and Z) and one output (A).

![Module F](https://hackmd.io/_uploads/S1PMOll4p.png =50%x)

The circuit is known to function but has suboptimal throughput. To address this issue, you opt for pipelining the circuit. For each of the following questions, design a valid K-stage pipeline for the given circuit. Each component in the circuit is labeled with its propagation delay in nanoseconds. Provide the latency and throughput of each design, assuming ideal registers (i.e., t~PD~=0, t~SETUP~=0). Remember that, according to our convention, a pipeline register is placed at each output.

1. Consider a 2-stage pipeline optimized for maximum throughput while utilizing the fewest possible registers. Determine the latency and throughput of the resulting circuit, taking careful note of the direction of each arrow.

![pipeline](https://hackmd.io/_uploads/BJ6D_ggVp.png)

* Latency: __ C01 __ ns
* Throughput: __ C02 __ ns^-1^

> * C01 = ?
> ==30==
> * C02 = ?
> ==$\frac{1}{15}$==
> ![pipeline](https://hackmd.io/_uploads/BJEv9xgEp.png =70%x)

2. Consider a maximum-throughput 4-stage pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit?

* Latency: __ C03 __ ns
* Throughput: __ C04 __ ns^-1^

> * C03 = ?
> ==32==
> * C04 = ?
> ==$\frac{1}{8}$==
> ![pipeline](https://hackmd.io/_uploads/H1VUaggVa.png =70%x)

3. Consider a maximum-throughput pipeline that uses a minimum number of pipeline stages. What are the latency and throughput of the resulting circuit?

* Latency: __ C05 __ ns
* Throughput: __ C06 __ ns^-1^

> * C05 = ?
> ==42==
> * C06 = ?
> ==$\frac{1}{7}$==
> ![pipeline](https://hackmd.io/_uploads/HJJX6xgEa.png =70%x)
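The bookkeeping is the same in all three parts: with ideal registers, an N-stage pipeline clocked at the delay of its slowest stage has latency N x t~CLK~ and throughput 1/t~CLK~. The short Scala sketch below (an editorial check; stage counts and clock periods are read off the answers above) reproduces C01-C06.

```scala
// Editorial sketch: latency and throughput of an N-stage pipeline with ideal
// registers, clocked at the propagation delay of its slowest stage.
object PipelineMetrics extends App {
  def metrics(stages: Int, tClk: Double): (Double, Double) =
    (stages * tClk, 1.0 / tClk)

  println(metrics(2, 15.0))  // (30.0, 1/15) -> C01, C02
  println(metrics(4, 8.0))   // (32.0, 1/8)  -> C03, C04
  println(metrics(6, 7.0))   // (42.0, 1/7)  -> C05, C06 (six 7 ns stages)
}
```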
---

## Problem `D`

Examine the processor implementation designed for single-cycle operation.

![single-cycle CPU](https://hackmd.io/_uploads/HybXW-l4a.png)

Below, you can find the timing specifications for each of the components:

| Component | Propagation delay (t~PD~) |
|:----------|:-------------------------:|
| Register | 1ns |
| Decoder | 2ns |
| RegFile read | 3ns |
| MUX | 1ns |
| ALU | 4ns |
| Adder | 3ns |
| Memory read<br>(instruction or data) | 4ns |

Setup/hold times for clocked inputs (registers and writes to RegFile and data memory):

* Setup time (t~SETUP~): 2 ns
* Hold time (t~HOLD~): 0 ns

1. What is the shortest clock cycle duration achievable for this processor? Minimum t~CLK~: __ D01 __ ns

> * D01 = ?
> ==22==
> Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> DMem (read) -> WDSEL mux -> RF(write)
> => t~CLK~ >= 1 + 4 + 2 + 3 + 1 + 4 + 4 + 1 + 2 (RF setup) = 22ns

2. The current processor's performance is suboptimal. We intend to introduce an alternative data path (while keeping the control logic unchanged) in which the data memory's Adr input is sourced differently. What is the minimum clock cycle time of the new processor? Minimum t~CLK~: __ D02 __ ns

![revised CPU](https://hackmd.io/_uploads/ryahM-xV6.png)

> * D02 = ?
> ==18==
> Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> WDSEL mux -> RF(write)
> => t~CLK~ >= 1 + 4 + 2 + 3 + 1 + 4 + 1 + 2 (RF setup) = 18ns

3. The new processor now executes certain instructions incorrectly with respect to the RISC-V ISA; `lw a0, 8(a1)` is one such instruction. Write the RISC-V instruction that the new processor effectively executes in its place. __ D03 __

> * D03 = ?
> ==`lw a0, 0(a1)`==
> The same lw or sw instruction with the offset set to 0.

4. The program provided requires 90 instructions to complete execution on the original processor. Unfortunately, when run on the new processor, it yields incorrect results. Please make the necessary modifications to the program so that it operates correctly on the new processor.

- [ ] provided program running on the original processor

```c
# int x[16];
# for (int i = 0; i < 15; i++)
#     x[i+1] = x[i] + x[i+1];
#
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
loop:
    lw   a1, 0(a0)
    lw   a2, 4(a0)
    add  a3, a2, a1
    sw   a3, 4(a0)
    addi a0, a0, 4
    blt  a0, a7, loop
```

- [ ] Modified assembly that executes correctly on the new processor

```c
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
    lw   a1, 0(a0)
    addi a0, a0, 4
loop:
    __ D04 __
    add  a1, a1, a2
    __ D05 __
    addi a0, a0, 4
    __ D06 __
```

Additionally, how many instructions does your assembly code execute? __ D07 __

> * D04 = ?
> ==`lw a2, 0(a0)`==
> * D05 = ?
> ==`sw a1, 0(a0)`==
> * D06 = ?
> ==`ble a0, a7, loop`==
> * D07 = ?
> ==77==
> $15 \times 5 + 2 = 77$

5. What is the elapsed time for the execution of the program on both the original and new processors? (Please use the relevant program variant for each respective processor.)

* Execution time on the original processor: __ D08 __ ns
* Execution time on the new processor: __ D09 __ ns

> * D08 = ?
> ==1694==
> $77 \times 22 = 1694$
> * D09 = ?
> ==1386==
> $77 \times 18 = 1386$
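As a sanity check on the rewritten loop (D04-D06), the following Scala sketch (editorial; the array contents are arbitrary) mimics both the original C loop and the modified assembly and confirms they produce the same prefix sums over `x[0..15]`.

```scala
// Editorial sketch: the original loop updates x[i+1] += x[i]; the modified
// assembly keeps the running value in a register (a1) and stores it with a
// zero offset after advancing the pointer, which the new datapath can handle.
object ProblemDLoopCheck extends App {
  val x = Array(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3)

  val expected = x.clone                      // original program's behavior
  for (i <- 0 until 15) expected(i + 1) += expected(i)

  val got = x.clone                           // modified program's behavior
  var a1 = got(0)                             // lw a1, 0(a0)
  for (i <- 1 to 15) {                        // loop body, 15 iterations
    val a2 = got(i)                           // lw a2, 0(a0)
    a1 += a2                                  // add a1, a1, a2
    got(i) = a1                               // sw a1, 0(a0)
  }

  assert(expected.sameElements(got))
  println(got.mkString(" "))
}
```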
---

## Problem `E`

Take a look at the loop provided below, which calculates the sum of values in a linked list. We execute this code on a typical 5-stage RISC-V processor equipped with full bypassing. Assume that branches are consistently predicted as **not taken**, and that branch determinations are made during the EXE stage. Also, consider that the loop undergoes numerous iterations and is presently in the middle of its execution.

```c
loop:
    lw   a1, 0(a0)
    lw   a0, 4(a0)
    add  a2, a2, a1
    bnez a0, loop
    mv   a0, a2
    addi sp, sp, 8
    ret
```

1. Assume that at cycle 100 the `lw a1, 0(a0)` instruction is fetched. Fill in the following:

* Number of cycles per loop iteration: __ E01 __
* Number of cycles per loop iteration wasted due to stalls: __ E02 __
* Number of cycles per loop iteration wasted due to annulments: __ E03 __

> * E01 = ?
> ==7==
> * E02 = ?
> ==1==
> * E03 = ?
> ==2==
> ![pipeline](https://hackmd.io/_uploads/Sy5WiZl4p.png)

We aim to enhance the performance of this processor. Upon careful examination, we have identified an opportunity to save cycles by introducing additional bypass paths. Contrary to its label, "full bypassing" does not fully exploit the potential for bypassing: it only bypasses data to the DEC stage. Occasionally, instructions do not require their inputs in the EXE stage but rather at a later stage, and stalling at the DEC stage leads to unnecessary cycle wastage. To address this issue, we suggest a bypassing scheme named "extra bypassing."

In extra bypassing:

- There exists a bypass path from each stage to DEC and to all intermediate stages between DEC and the source of the bypass. For example, in a 5-stage pipeline, the paths include:
  - EXE -> DEC
  - MEM -> DEC
  - WB -> DEC
  - MEM -> EXE
  - WB -> EXE
  - WB -> MEM

  (Note that full bypassing only includes the first 3 of these paths.)
- As customary, each bypass path forwards data to the end of the receiving stage. For instance:
  - MEM -> DEC bypasses data to the end of DEC, right before the DEC-EXE pipeline register, after register reads.
  - MEM -> EXE bypasses data to the end of EXE, right before the EXE-MEM pipeline register, after the ALU operation.
- If an instruction relies on a value produced by an earlier instruction, and that value is still unavailable (either in the register file or through a bypass path), the instruction doesn't stall until the stage just before the value is required. For example:
  - If a value is needed in EXE and hasn't been generated yet, the instruction stalls in DEC (as in the case of full bypassing).
  - However, if the value is needed in MEM, the instruction proceeds from DEC to EXE and only stalls in EXE until the value becomes accessible through a bypass path.
- Instructions capture missing inputs from bypass paths as soon as they become available, even if they are currently stalled for other reasons. This ensures that if an instruction advances past DEC, it always acquires the correct input value before it needs it.

The diagram below shows the bypass paths in full vs. extra bypassing.

![full bypassing](https://hackmd.io/_uploads/By1S2-gEp.png =70%x)
![extra bypassing](https://hackmd.io/_uploads/B1Nv2ZeEp.png =70%x)

2. Consider the 2-instruction sequence below, which incurs a delay of two cycles with full bypassing but only one cycle with extra bypassing.

```c
lw a1, 0(a0)
sw a1, 0(a2)
```

Specify which bypass paths are exercised in each case.

* Bypass paths exercised with full bypassing: __ E04 __
* Bypass paths exercised with extra bypassing: __ E05 __

> * E04 = ?
> ==WB->DEC==
> * E05 = ?
> ==WB->EXE==

To harness the benefits of extra bypassing more effectively, we introduce a 5-stage pipeline structure as illustrated below, with the ALU positioned in the fourth stage:

- In this pipeline:
  - The first two stages, IF (Instruction Fetch) and DEC (Decode), function similarly to the standard 5-stage pipeline.
  - The third stage, AC (Address Calculation), handles address computation for load and store operations.
  - The fourth stage, EXM (Execute Memory), carries out ALU operations and manages data access.
  - The fifth stage, WB (Write Back), operates in line with the standard pipeline procedure.
- Branch predictions are made with an assumption of **not-taken** and are resolved in the EXM stage.
- This pipeline incorporates extra bypassing for improved performance.

![pipeline](https://hackmd.io/_uploads/ByAyAbxVp.png)

3. Consider again the code from part 1 and assume that at cycle 100 the `lw a1, 0(a0)` instruction is fetched. Fill in the following: (Remember that the Execution module resides within the EXM stage of the pipeline.)

* Number of cycles per loop iteration: __ E06 __
* Number of cycles per loop iteration wasted due to stalls: __ E07 __
* Number of cycles per loop iteration wasted due to annulments: __ E08 __

> * E06 = ?
> ==7==
> * E07 = ?
> ==0==
> * E08 = ?
> ==3==
> ![pipeline](https://hackmd.io/_uploads/Hkc30-e4T.png)
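In both parts the per-iteration cycle count decomposes into the four issued loop instructions plus wasted cycles; a worked check against the answers above:

$$
\underbrace{4}_{\text{loop instructions}} + \underbrace{1}_{\text{stall}} + \underbrace{2}_{\text{annulments}} = 7 \quad \text{(part 1)}, \qquad
4 + 0 + 3 = 7 \quad \text{(part 3)}
$$

Resolving the branch one stage later (EXM instead of EXE) removes the stall but annuls one more speculatively fetched instruction, so the two designs tie at 7 cycles per iteration.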
---

## Problem `F`

We are in the process of developing our custom RISC-V processor, which includes support for integer multiplication. This processor incorporates a `mul` instruction designed to multiply two source registers and generate a 32-bit result. To assess the functionality of our processor, we employ the following program for testing purposes. It is important to note that the branch instruction is presumed to be taken, and all instructions following the `ret` instruction are NOPs.

```c
loop:
    lw   x14, 4(x12)
    mul  x10, x14, x10
    andi x12, x14, 255
    bnez x12, loop
exit:
    sw   x12, 0(x11)
    ret
```

In the initial phase, we attempt to construct a single-cycle processor, which is depicted below.

![single-cycle processor](https://hackmd.io/_uploads/HJ5YnfgEp.png)

Consider that the processor is equipped with perfect registers (with no setup or hold times, and zero propagation and clock-to-q delays), and that all memory operations complete within a single cycle (combinational reads). Use the provided table to calculate the minimum clock cycle duration, the average cycles per instruction, and the overall processor performance in terms of average instructions executed per nanosecond while running the specified test code.

| Logic Block | Propagation Delay |
|:------------|:-----------------:|
| IF | 3ns |
| DEC | 2ns |
| EXE | 10ns |
| MEM | 3ns |
| WB | 2ns |

1. Fill in the following:

* Average cycles per instruction: __ F01 __
* Clock period: __ F02 __ ns
* Average number of instructions per nanosecond: __ F03 __

> * F01 = ?
> ==1==
> * F02 = ?
> ==20==
> * F03 = ?
> ==$\frac{1}{20}$==

It is evident that pipelining can be employed to enhance clock frequency. We have the option to enhance the initial design, as depicted in the subsequent diagram. In this scenario, we assume full bypassing is implemented, and branch decisions are rendered during the EXE stage. The IF stage speculatively fetches the instruction located at PC+4 and annuls incorrectly fetched instructions in the event of a branch misprediction.

![pipeline](https://hackmd.io/_uploads/r1hM0GeVa.png)

We employ the identical test program to evaluate the performance of the suggested processor.

2. Compute the clock cycle duration, average cycles per instruction, and average instruction throughput per nanosecond while executing the test code. It is assumed that bypassing does not impact the critical path.

* Clock period: __ F04 __ ns
* Average cycles per instruction: __ F05 __
* Average number of instructions per nanosecond: __ F06 __

> * F04 = ?
> ==10==
> * F05 = ?
> ==2==
> * F06 = ?
> ==$\frac{1}{20}$==
> ![pipeline](https://hackmd.io/_uploads/B1741Xe4p.png)
> 8 cycles per loop iteration (4 instructions), so average cycles per instruction = 2.
> 4 instr / 80 ns = 1/20 instr/ns.

We have a suspicion that the EXE stage might be the root cause of the limited performance improvement. Consequently, we divide the EXE stage into three separate pipeline stages. The bypassing and branch decision logic that was previously in the EXE stage has now been relocated to the EX3 stage.

![pipeline](https://hackmd.io/_uploads/SJBhkQgET.png)

| Logic Block | Propagation Delay |
|-------------|-------------------|
| EX1 | 4ns |
| EX2 | 3ns |
| EX3 | 3ns |

3. Calculate the clock period, average cycles per instruction, and average number of instructions per nanosecond for the test benchmark running on this processor.

* Clock period: __ F07 __ ns
* Average cycles per instruction: __ F08 __
* Average number of instructions per nanosecond: __ F09 __

> * F07 = ?
> ==4==
> * F08 = ?
> ==3.5==
> * F09 = ?
> ==$\frac{1}{14}$==
> ![pipeline](https://hackmd.io/_uploads/BJ8PgmeN6.png)
> The branch decision is made when bnez reaches EX3 in cycle 14, so lw is fetched again in cycle 15.
> 14 cycles per loop iteration (4 instructions), so average cycles per instruction = 14/4 = 3.5.
> 4 instr / 56 ns = 1/14 instr/ns.

Subsequently, we discover that the critical path in the EXE stage is primarily attributed to the multiplier employed for the `mul` instruction. In response, we separate the multiplier from the EXE stage and run it in parallel, so that non-`mul` instructions require only one cycle to execute. The branch decision logic, however, remains in the EX stage.

![pipeline](https://hackmd.io/_uploads/rJLgbXgV6.png)

The MUX preceding the MEM stage adheres to the following policy: it selects whichever of EX and MUL3 holds a valid instruction. If both hold valid instructions, it gives precedence to the one from MUL3 and consequently stalls EX and all preceding pipeline stages.

The timing parameters for the above new processor are shown below:

| Logic Block | Propagation Delay |
|-------------|-------------------|
| EX | 4ns |
| MUL1 | 4ns |
| MUL2 | 3ns |
| MUL3 | 3ns |

4. Determine the clock cycle duration, average cycles per instruction, and average instructions executed per nanosecond for the test benchmark when executed on this processor.

* Clock period (ns): __ F10 __
* Average cycles per instruction: __ F11 __
* Average number of instructions per nanosecond: __ F12 __

> * F10 = ?
> ==4==
> * F11 = ?
> ==2==
> * F12 = ?
> ==$\frac{1}{8}$== OR ==$\frac{1}{9}$==
> ![pipeline](https://hackmd.io/_uploads/rk_xMXgVa.png)
> 8 cycles per loop iteration (4 instructions), so average cycles per instruction = 8/4 = 2.
> 4 instr / 32 ns = 1/8 instr/ns.
> Also accepted: 9 cycles per loop iteration, if you assumed lw could not be fetched until bnez leaves the EX stage.
> In that case the CPI is 2.25 and the average number of instructions per ns is 1/9.
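The four designs in this problem can be compared with one formula: instructions per nanosecond = (instructions per loop iteration) / (cycles per iteration x t~CLK~). The Scala sketch below (editorial; the cycle counts and clock periods are taken from the solutions above) reproduces F03, F06, F09, and F12.

```scala
// Editorial sketch comparing the four Problem F designs on the test loop
// (4 instructions per iteration; cycle counts and clock periods from above).
object ProblemFThroughput extends App {
  def instrPerNs(instrs: Int, cyclesPerIter: Int, tClkNs: Double): Double =
    instrs / (cyclesPerIter * tClkNs)

  println(instrPerNs(4, 4, 20.0))   // single-cycle:            1/20
  println(instrPerNs(4, 8, 10.0))   // 5-stage pipeline:        1/20
  println(instrPerNs(4, 14, 4.0))   // EXE split into EX1-EX3:  1/14
  println(instrPerNs(4, 8, 4.0))    // parallel multiplier:     1/8
}
```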
---

## Problem `G`

[Chisel](https://www.chisel-lang.org/) is a domain-specific language (DSL) embedded in [Scala](https://www.scala-lang.org/) through its macro features. Therefore, all programming related to circuit logic must be expressed using the constructs provided by the [Chisel](https://www.chisel-lang.org/) library, rather than directly using Scala language keywords. Specifically, when you want to use a constant in a circuit, such as 12, and compare it with a circuit value, you cannot write `if (value == 12)` but instead must write `when(value === 12.U)`. Keeping this fundamental distinction in mind is essential to writing valid [Chisel](https://www.chisel-lang.org/) code.

> [Chisel Cheatsheet](https://github.com/freechipsproject/chisel-cheatsheet/releases/latest/download/chisel_cheatsheet.pdf)

1. Have you begun your work on [Homework3](https://hackmd.io/@sysprog/2023-arch-homework3)? __ G01 __

> * G01 = ?
> Any answer earns credit.

Chisel defines a constructor for n-way multiplexers:

```scala
MuxLookup(index, default, Array(key1 -> value1, key2 -> value2, ..., keyN -> valueN))
```

Explanation:

* The index-to-key match is implemented using the `===` operator.
* Therefore `MuxLookup` works for any type for which `===` is defined.
* `===` is defined on bundles and vecs, as well as the primitive Chisel types.
* Users can override `===` for their own bundles.

Take a look at the Scala code provided in the instruction decoding section of a RISC-V processor implementation.

```scala
class InstructionDecode extends Module {
  val io = IO(new Bundle {
    val instruction = Input(UInt(Parameters.InstructionWidth))

    val regs_reg1_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val regs_reg2_read_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
    val ex_immediate = Output(UInt(Parameters.DataWidth))
    val ex_aluop1_source = Output(UInt(1.W))
    val ex_aluop2_source = Output(UInt(1.W))
    val memory_read_enable = Output(Bool())
    val memory_write_enable = Output(Bool())
    val wb_reg_write_source = Output(UInt(2.W))
    val reg_write_enable = Output(Bool())
    val reg_write_address = Output(UInt(Parameters.PhysicalRegisterAddrWidth))
  })

  val opcode = io.instruction(6, 0)
  val funct3 = io.instruction(14, 12)
  val funct7 = io.instruction(31, 25)
  val rd = io.instruction(11, 7)
  val rs1 = io.instruction(19, 15)
  val rs2 = io.instruction(24, 20)

  io.regs_reg1_read_address := Mux(opcode === Instructions.lui, 0.U(Parameters.PhysicalRegisterAddrWidth), rs1)
  io.regs_reg2_read_address := rs2

  val immediate = MuxLookup(
    opcode,
    Cat(Fill(20, io.instruction(31)), io.instruction(31, 20)),
    IndexedSeq(
      InstructionTypes.I -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
      InstructionTypes.L -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
      Instructions.jalr -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
      ...
      Instructions.lui -> G02 /* Your answer here! */,
      ...
    )
  )
```

2. Complete the provided code (i.e., `G02`) to ensure it functions as intended.

> * G02 = ?
> ==`Cat(io.instruction(31, 12), 0.U(12.W))`==
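To see the G02 answer in context, here is a small, self-contained Chisel sketch (the module name and widths are illustrative, not taken from the referenced implementation) that uses the same `MuxLookup`/`Cat` idiom to build the U-type (`lui`) immediate: the upper 20 instruction bits followed by 12 zero bits.

```scala
import chisel3._
import chisel3.util._

// Illustrative module: extract the lui (U-type) immediate from an instruction.
class LuiImmediate extends Module {
  val io = IO(new Bundle {
    val instruction = Input(UInt(32.W))
    val immediate   = Output(UInt(32.W))
  })

  val opcode    = io.instruction(6, 0)
  val luiOpcode = "b0110111".U(7.W)   // RV32I opcode for lui

  io.immediate := MuxLookup(
    opcode,
    0.U(32.W),                        // default: not a lui
    IndexedSeq(
      // Same construction as the G02 answer: upper 20 bits, then 12 zeros.
      luiOpcode -> Cat(io.instruction(31, 12), 0.U(12.W))
    )
  )
}
```

For a `lui` instruction this yields the immediate already shifted into bits 31..12, which is exactly what the decoder's `ex_immediate` output should carry.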