Quiz4 of Computer Architecture (2022 Fall)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

General Information

You are allowed to read lecture materials.
- That is, an open book exam.
We are using the honor system during this quiz, and would like you to accept the following:
1. You will not share the quiz with anyone.
2. You will not talk to anyone about the information on the quiz until the answers are made public.
Each answer has 4 points.
You must answer in valid numeric/C99/assembly representation and/or English alphabet except your formal name.
Message the instructor via Facebook if you found something wrong.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
09:10 ~ 10:15AM on Nov 15, 2022

Problem `A`

Assume that we are building a JIT compiler, which generates RISC-V instructions at runtime. Because a JIT must render and execute a native binary image at runtime, true machine-code JITs necessitate platforms that allow for data to be executed at runtime – code is data. Here is a simplified implementation based on C and RV32I: (filename: jit.c)

















#include <stdint.h>
#include <stdio.h>

typedef int func_t(void);
int main(int argc, char *argv[])
{
    uint32_t instructions[] = {
        [0] = A01,  /* addi a0, zero, ? */
        [1] = 0x00008067,  /* A02 */
    };

    /* Reinterpret the array address as a function */
    func_t *jit = (func_t *) instructions;
    printf("%d\n", jit());

    return 0;
}

Compile via the command below:

$ riscv-none-elf-gcc -O0 -march=rv32i -mabi=ilp32 -o jit.elf jit.c

Run it with rv32emu.

$ rv32emu jit.elf

We get the corresponding output.

56
inferior exit code 0

Reference: Porting a JIT compiler to RISC-V: Challenges and Opportunities

Given instruction addi a0, zero, ? as shown in Line 8, what is the HEX value of A01?
- A01 = ?
What is the dissassmbly of 0x00008067 (shown in Line 9) ? __ A02 __
- A02 = ?
Explain why the above program works. __ A03 __
- A03 = ?

Problem `B`

The Python code for the recursive function count8, which counts the number of 8's (base 16) in the input, is shown below. count8(0x988) = 2 and count8(0x81) = 1, for instance. A RISC-V assembly-based implementation of the function is shown below.

Python code

def count8(in):
    if in < 8:
        return 0
    end = in & 0xF
    current_8 = 0
    if end == 8:
        current_8 = 1
    return current_8 + count8(in >> 4)

The corresponding RISC-V assembly

count8:
    li a1, 8
    bgt a1, a0, base
    addi sp, sp, -12
    sw ra, 0(sp)
    sw s1, 4(sp)
    andi a2, a0, 0xF
    li s1, 0
L1:
    bne __ B01 __
    addi s1, s1, 1
continue:
    srli a0, a0, 4
    sw a0, 8(sp)
    call count8
    add a0, a0, s1
L2:
    lw ra, 0(sp)
    lw s1, 4(sp)
    addi sp, sp, 12
    j done
base:
    li a0, 0
done:
    ret

What should be B01 to make the assembly implementation match the Python code?
- B01 = ?
How many words will be written to the stack before the program makes each recursive call to the function count8? __ B02 __
- B02 = ?

The instruction call count8 makes the first call to the function count8 in the program outside of the function definition. The program is interrupted right before the execution of lw ra, 0(sp) at label L2 during a recursive call to count8. The layout of a memory area is shown in the diagram below. The values of all addresses and data are shown in hex. The sp register's current value, 0xEAC, points to the location indicated in the diagram.

-	Address	Data
	0xE94	0x9E4
	0xE98	0x0
	0xE9C	0x0
	0xEA0	__ B03 __
	0xEA4	__ B04 __
	0xEA8	__ B05 __
sp→	0xEAC	0x9E4
	0xEB0	0x1
	0xEB4	0x82
	0xEB8	0xC8
	0xEBC	0x73
	0xEC0	0x828
	0xEC4	0xC14
	0xEC8	0x82

Fill B03, B04, and B05 for the stack associated with 0xEA0, 0xEA4, and 0xEA8 (answer in HEX)
- B03 = ?
- B04 = ?
- B05 = ?
What is the initial input in at the initial call to count8? __ B06 __ (answer in HEX)
- B06 = ?
What are the values in the following registers right when the execution of count8 is interrupted? (answer in HEX)
- ra = B07
- a2 = B08
- s1 = B09
- B07 = ?
- B08 = ?
- B09 = ?
What is the value in register a0 right when the execution of count8 is interrupted? (answer in HEX) __ B10 __
- B10 = ?
What is the hex address of the call f instruction that made the initial call to count8? __ B11 __
- B11 = ?
What is the hex address of the continue label? __ B12 __
- B12 = ?

Problem `C`

Take a look at the logic diagram below, which computes two outputs {x, y} from four inputs {a, b, c, d}. Calculate the circuit's t_PD using the t_PD details for the gate components listed in the table below.

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Gate	t_PD
XNOR2	5.5ns
AND2	3.5ns
OR2	2.5ns
INV	1.0ns

t_PD = __ C01 __ ns

C01 = ?

Problem `D`

Assume that you are creating G circuit, as seen below. Everything has been running smoothly until you discover that the G circuit's throughput is too low one day. Your biggest client needs a throughput at least twice as high as what you can currently offer in order to include the G circuit into their new product. You know precisely what to do: pipeline it as the G circuit is now only combinational.

You choose to start small with a two-stage pipeline for your initial iteration. Calculate your circuit's overall throughput and latency.
- Latency: __ D01 __ ns
- Throughput: __ D02 __ ns^-1
- D01 = ?
- D02 = ?
You decide to add an additional level after completing the first iteration because you are beginning to feel quite at ease with the G circuit. Let's implement the maximal throughput pipeline. Kindly explain a pipeline using the following design that maximizes throughput while utilizing the fewest possible registers. Calculate your circuit's throughput and latency. NOTE: Use the fewest possible pipeline stages to maximize throughput in your pipelined circuit.
- Latency: __ D03 __ ns
- Throughput: __ D04 __ ns^-1
- D03 = ?
- D04 = ?
Assume that we want to integrate another pipelined circuit, Circuit H, shown below. However, this circuit has multiple different timing specifications! In the below table, you have timing specifications for two different implementations of H: Version A, and Version B. Using your maximal-throughput G circuit from part C, select the version of H that minimizes your latency of the combined G-H circuit. Additionally, calculate the latency and throughput of the combined G-H circuit.

- Version A Version B

clock period 7 ns 8 ns

stages 2 1
- Which one can minimize your latency of the combined G-H circuit? (Answer in Version A or Version B) __ D05 __
- Overall latency = __ D06 __ ns
- Overall throughput = __ D07 __ ns^-1
- D05 = ?
- D06 = ?
- D07 = ?

-	Version A	Version B
clock period	7 ns	8 ns
stages	2	1

Problem `E`

We are considering the following RISC-V processor design. For this processor P, the code and the starting state of the necessary registers in the register file are shown below. Keep in mind that while x6 is in hexadecimal, registers x1 through x5 are given their values in decimal.

start:
    lw x1, 0(x6)
    addi x2, x0, 5
    blt x1, x2, end
    addi x3, x2, 11
    sub x4, x3, x1
    xori x5, x6, 0x1
end:
    sub x4, x3, x2
    addi x3, x4, 7
    addi x3, x1, 3
. = 0x400
.word 0x1

Register	Value
x1	5
x2	11
x3	30
x4	19
x5	20
x6	0x400

Processor P: 5-stage pipelined RISC-V processor, which is fully bypassed and has branch annulment. Branch decisions are made in the EXE stage and branches are always predicted not taken. What are the values of registers x3 and x4 at the start of cycle 12 (in decimal):

x3: __ E01 __
x4: __ E02 __
- E01 = ?
- E02 = ?

Problem `F`

The P₁ and the P₂ are two RISC-V based designs whose performance you are expected to evaluate. You have made the choice to compare the performance of the two CPUs using the next construction step.

L1:
    slli x11, x10, 2
    lw x12, 0x80(x11)
    add x13, x13, x12
    addi x10, x10, 1
    blt x10, x14, L1
    sub x14, x14, x15
    xori x15, x14, 0x1

The P₁ is a 4-stage pipelined processor with full bypassing. It attempts to improve performance by decreasing the number of cycles it takes to execute each instruction. The EXE and MEM stages are combined. Load requests are sent in the EXE stage and received in the WB stage one cycle after. The IF stage speculatively fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a branch misprediction. Branches are resolved in the EXE stage.

The P₂ is a 6-stage pipelined processor with full bypassing. It attempts to improve performance by raising the clock speed. The EXE stage is split into two stages to break up the critical path through the ALU. The IF stage always fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a taken branch. ALU and Branch ALU results are available only in EXE2.

How many cycles does it take the P₁ to execute one iteration of the loop? __ F01 __
- F01 = ?
How many cycles does it take the P₂ to execute one iteration of the loop? __ F02 __
- F02 = ?
Assume that we can build the P₁ with a clock period of 6ns. What is the clock period that the P₂ must achieve to attain the same performance on this assembly sequence? __ F03 __ ns
- F03 = ?

Problem `G`

Assume that while our data memory utilizes a little more realistic model that includes a DRAM, our instruction memory continues to respond in one cycle. A memory address, a read/write signal, and optional write data are all inputs to the memory. The CPU first uses the cache to fulfill read or write requests. The data can be quickly sent back to the pipeline if the access succeeds in the cache. It will take significantly longer for the CPU to reach the DRAM, and then retrieve the needed data from it.

Assume that each of the 2 processors has extra hardware that enables them to foresee when branches will be taken and to get the branch's destination instruction immediately following the branch instruction (without requiring any stalls).

Consider the following 2 RISC-V processor designs.

P₁: Basic 5-stage pipeline with bypass paths from EXE to DEC and from WB to DEC
P₂: Since the memory access can take a long time in the worst case, we use pipelining to improve the processor performance. The MEM stage is further divided into three pipeline stages.

Assume the 2 processors have ideal registers (t_SETUP = t_HOLD = t_PD = t_CD = 0ns) between different pipeline stages. Please derive the minimum clock cycle given the following propagation delays for the logic blocks.

Logic Block Propagation Delay

IF 2 ns

DEC 2 ns

EXE 3 ns

MEM 10 ns

WB 2 ns

bypass 1.5 ns

MEM1 4 ns

MEM2 3 ns

MEM3 3 ns
- Clock period for P₁: __ G01 __ ns
- Clock period for P₂: __ G02 __ ns
- G01 = ?
- G02 = ?
The code segment below copies elements from array A to array B, where Array A starts at address 0x100, and Array B starts at address 0x500. Assume that this loop has been running for a long time, and that the processor always predicts that the branch will be taken.
```
    add x1, x0, x0
    add x2, x0, x0
Next:
    lw x3, 0x100(x1)
    addi x1, x1, 4
    sw x3, 0x500(x2)
    addi x2, x2, 4
    blt x1, x4, Next // x4=1000
```
How many cycles does this loop take to execute in P₁? __ G03 __
- G03 = ?
Assume the memory can only serve one read or write operation at a time. In P₂, since MEM1, MEM2 and MEM3 stages all use the memory, we now have structural hazard. Structural hazard happens if multiple instructions try to use the same hardware resource. To resolve the structural hazard, we can only allow one memory access instruction exist in the MEM1, MEM2, and MEM3 stages. Note that it is ok to have multiple other types of instructions, e.g., arithmetic instructions, as long as they do not need to use the memory. Specifically, we need some mechanism to stall memory access instructions.
The detailed stall mechanism is explained below. We will inject NOPs to the EXE stage when the IF and DEC stages are stalled. When the DEC has a memory access instruction, we stall for the following 2 cases:
- If the EXE stage also has a memory access instruction, stall IF and DEC for 2 cycles
- If the MEM1 stage also has a memory access instruction, stall IF and DEC for 1 cycle
These stalls are in addition to any stalls required to deal with data hazards. How many cycles does this loop take to execute in P₂? __ G04 __
- G04 = ?

Logic Block	Propagation Delay
IF	2 ns
DEC	2 ns
EXE	3 ns
MEM	10 ns
WB	2 ns
bypass	1.5 ns
MEM1	4 ns
MEM2	3 ns
MEM3	3 ns

Quiz4 of Computer Architecture (2022 Fall)

Problem A

Problem B

Problem C

Problem D

Problem E

Problem F

Problem G