Quiz6 of Computer Architecture (2022 Fall)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

General Information

You are allowed to read lecture materials.
- That is, an open book exam.
- Even ChatGPT is available for consultation, however keep in mind the restrictions below.
  $\to$ Please let the instructor know if you do receive the right answers through ChatGPT.
We are using the honor system during this quiz, and would like you to accept the following:
1. You will not share the quiz with anyone.
2. You will not talk to anyone about the information on the quiz until the answers are made public.
Each answer has 4 points.
You must answer in valid numeric/C99/assembly representation and/or English alphabet except your formal name.
Message the instructor via Facebook if you found something wrong.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
09:10 ~ 10:20AM on Dec 20, 2022

Problem `A`

You are asked to create a memory hierarchy with an 8-word (32 byte) cache line and a 32-bit virtual and physical address for this problem. Think about the following two loops:

Loop A

int sum = 0;
for (int i = 0; i < (1 << 8); i++)
    for (int j = 0; j < (1 << 6); j++)
        sum += X[i][j] * Y[i][j];

Loop B

int sum = 0;
for (int j = 0; j < (1 << 6); j++)
    for (int i = 0; i < (1 << 8); i++)
        sum += X[i][j] * Y[i][j];

Assume X[0][0] is stored at address 0x0. Assume further that matrices X and Y are stored contiguously and that they are both kept in row-major order, i.e. X[i][j] is next to X[i][j + 1] in memory and X[256][64] is next to Y[0][0]. sum is not stored in the cache. Also assume that caches are initially empty. All elements of the matrices are 32-bit integers.

Suppose we have a processor with a 32 KiB 4-way set-associative D$ with FIFO replacement policy. Determine how many compulsory, conflict, and capacity misses will occur each time Loop A and Loop B are executed. Complete the following table.

- Compulsory Conflict and Capacity Total Misses

Loop A __ A01 __ __ A02 __ __ A03 __

Loop B __ A04 __ __ A05 __ __ A06 __
- A01 = ?
- A02 = ?
- A03 = ?
- A04 = ?
- A05 = ?
- A06 = ?
A 256 KiB 8-way set-associative L2$ with a FIFO replacement policy is what we would assume you implement. Assume the L2$ includes the L1$ and the L1 D$ is still a 32 KiB 4-way set-associative. How many cache misses at the L2$ occur when Loops A and B are run? For Loop A: __ A07 __; For Loop B: __ A08 __
- A07 = ?
- A08 = ?
You determine that the L1 hit time is 4 cycles, the L2 hit time is 12 cycles, and the main memory (DRAM) access time is 50 cycles based on thorough profiling and performance study. What is the average memory access time (AMAT) while executing Loop A and B? For Loop A: __ A09 __ cycles; For Loop B: __ A10 __ cycles
- A09 = ?
- A10 = ?
Consider using a Virtually-Indexed Physically-Tagged (VIPT) cache as your implementation. Is it possible to use virtual address aliasing with this cache, assuming a 4 KiB page size? what is the minimum number of set-associativity that is required to avoid aliasing given that the cache size is fixed at 32 KiB? __ A11 __ (Answer in Yes or No), associtvity has to be at least __ A12 __ (Answer in number or N/A if it is not possible.)
- A11 = ?
- A12 = ?

-	Compulsory	Conflict and Capacity	Total Misses
Loop A	__ A01 __	__ A02 __	__ A03 __
Loop B	__ A04 __	__ A05 __	__ A06 __

Problem `B`

Consider the loop shown below, which adds two vectors in a saturated manner; X[i], Y[i], and MAX are all int32_t-type variables.

for (int i = 0; i < N; i++) {
    Y[i] += X[i];
    if (Y[i] > MAX)
        Y[i] = MAX;
}

In RISC-V assembly, the loop is as follow:

# x1 is a pointer to the beginning of "X"
# x2 is a pointer to the beginning of "Y"
# x3 is "MAX"
# x4 is "i"
loop:
    ld x5, 0(x1) # x5 has "X[i]"
    ld x6, 0(x2) # x6 has "Y[i]"
    addi x1, x1, 4 # increment X pointer
    addi x2, x2, 4 # increment Y pointer
    addi x4, x4, -1 # decrement "i"
    add x7, x5, x6
    blt x7, x3, skip_sat
    mv x7, x3
skip_sat:
    sw x7, -4(x2)
    bnez x4, loop
end:

Consider that this loop is executed on a multi-threaded, 5-stage classic RISC processor where:

Loads and stores have a latency of 20 cycles.
Taken branches have a latency of 2 cycles.
All other instructions (including non-taken branches) have a latency of 1 cycle.
Perfect branch prediction is present.

Do not reorder the loop's instructions.

How many threads are required to ensure that fixed round-robin scheduling never causes us to stall? __ B01 __
- B01 = ?
Let's say we employ a coarse-grained thread scheduling policy that changes between threads anytime a stall might result from a RAW hazard or a taken branch. Stalls are not caused by WAR or WAW hazards. How many threads are required to ensure that this scheduling policy will never cause a stall? __ B02 __ Just consider about the steady-state execution.
- B02 = ?

Problem `C`

The following figure shows a classical fully-bypass 5-stage pipeline that has been enhanced by the addition of an unpipelined divider running parallel to the ALU. The graphic does not show any bypass routes. Up until it provides a complete 32-bit result, this iterative division outputs 2 bits every cycle.

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

What is the latency in cycles of a divide operation? __ C01 __ cycles
- C01 = ?
What is the occupancy of a divide operation in cycles? __ C02 __ cycles
- C02 = ?
Note that the div instruction in RISC-V cannot raise a data-dependent exception. To avoid pipeline stalls while a multi-cycle divide operation is in progress, the pipeline control logic allows subsequent instructions that do not depend on the divide result to be issued and completed before the divide has completed. What additional hazards might be caused by div instructions, aside from the structural hazard on the divider itself? __ C03 __
- C03 = ?

Problem `D`

Consider SAXPY operation on 32-bit integer values:

for (i = 0; i < len; i++) {
    y[i] = a * x[i]+ y[i]
}

We are interested in both the CPI and how many cycles-per-element (CPE) the above takes to execute.

Consider the following RV32IM assembly implementation of this operation:

// x1 holds pointer to x
// x2 holds pointer to y
// x3 holds a
// x4 holds len
    add x5, x0, x0
LOOP:
    bge x5, x4, DONE
    lw x6, 0(x1)
    mul x6, x6, x3
    lw x7, 0(x2)
    add x6, x6, x7
    sw x6, 0(x2)
    addi x1, x1, #4
    addi x2, x2, #4
    addi x5, x5, #1
    j LOOP
DONE:

Consider a classic 5-stage in-order pipeline with full-bypassing, for the first 12 dynamic instructions. Assume no cache misses, and no branch prediction, and len > 2. Note that unconditional branches are resolved in D.

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

How many instruction bytes are fetched per loop iteration? __ D01 __ bytes
- D01 = ?
How many data bytes are loaded per loop iteration? __ D02 __ bytes
- D02 = ?
What does CPI converge to as the number of completed iterations increases? __ D03 __
- D03 = ?
You are allowed to reorder the assembly that achieves a lower CPE, without adding new instructions. What is the CPI and CPE of the reordering? CPI: __ D04 __, CPE: __ D05 __
- D04 = ?
- D05 = ?

Problem `E`

The case with a variable number producer threads (as questions specify) and a single consumer will be consumed. The same code is executed by producer threads to produce two elements of 8 bytes each for the consumer. Thread C acts as the consumer. An example with two producers:

A FIFO manages the communication for each thread. For all, there is only one FIFO. The additional components were added by the producers at the tail-pointed position (in adjacent locations). The consumer reads one element at the head-directed position before incrementing it.

The code is as follows (you can not change or reorder intructions):

Assume that we maintain quiescent consistency, but now we will have two producer threads and one consumer. Further assume that the architecture provides an atomic instruction:
Image Not Showing Possible Reasons
- The image was uploaded to a note which you don't have access to
- The note which the image was originally uploaded to has been deleted
Learn More →
Rewrite the producer's code to ensure correctness.
Spin:
Load R_tail, 0(tail)
Addi R_tail2, R_tail, 2
Swap(tail), E01
if E02 goto Spin
Store 0(R_tail), x
Store 1(R_tail), y
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Fill out E01 (an identifier) and E02 (an expression)
- E01 = ?
- E02 = ?
In previous part, how many memory (or cache) accesses does each failed attempt to enter the critical section produce? __ E03 __
- E03 = ?
Pick one of the following alternative atomic operations and describe how you would use it to reduce the number of memory or cache accesses. __ E04 __ (Answer in one of Fetch-and-Add, Test-and-Set, and Compare-and-Swap)
Image Not Showing Possible Reasons
- The image was uploaded to a note which you don't have access to
- The note which the image was originally uploaded to has been deleted
Learn More →
- E04 = ?

Problem `F`

Consider the following two RISC-V programs executing processes that have virtual addresses assigned to them. It should be noted that this code converts each pseudoinstruction into a single RISC-V instruction.

Assume the operating system (OS) schedules Process B first. For the following questions, if you can not tell a value based on the information given, write N/A.

Scenario A : A timer interrupt occurs just prior to the execution of li t1, 10 in Process B. Process A runs for some time, then another timer interrupt occurs and control is returned to Process B.

What are the values in the following registers immediately after returning to Process B?
- t0 : __ F01 __
- t1 : __ F02 __
- pc : __ F03 __
- F01 = ?
- F02 = ?
- F03 = ?
What are the values in the following registers just after the first ecall in Process A completes?
- t0 : __ F04 __
- t1 : __ F05 __
- pc : __ F06 __
- F04 = ?
- F05 = ?
- F06 = ?

Problem `G`

Let's implement the lock for mutual exclusion using RISC-V 64-bit AMO instructions (RV64A).

Both threads execute:

    li xone, 1
spin: // Acquire Lock
    G01 xlock, xone, (xlockp)
    bnez xlock, spin
    // Critical Section
    ld xdata, (xdatap)
    add xdata, 1
    sd xdata, (xdatap)
    G02 // Release Lock

Fill out G01 and G02 following the RV64A instructions.

G01 = ?

G02 = ?

Quiz6 of Computer Architecture (2022 Fall)

Problem A

Problem B

Problem C

Problem D

Problem E

Problem F

Problem G

Problem `A`

Problem `B`

Problem `C`

Problem `D`

Problem `E`

Problem `F`

Problem `G`