---
tags: computer-arch
---

# Quiz6 of Computer Architecture (2022 Fall)

> Solutions

## Problem `A`

In this problem you are asked to design a memory hierarchy with an 8-word (32-byte) cache line and 32-bit virtual and physical addresses. Consider the following two loops:

- [ ] Loop A

```c
int sum = 0;
for (int i = 0; i < (1 << 8); i++)
    for (int j = 0; j < (1 << 6); j++)
        sum += X[i][j] * Y[i][j];
```

- [ ] Loop B

```c
int sum = 0;
for (int j = 0; j < (1 << 6); j++)
    for (int i = 0; i < (1 << 8); i++)
        sum += X[i][j] * Y[i][j];
```

Assume `X[0][0]` is stored at address `0x0`. Assume further that matrices X and Y are stored contiguously and that both are kept in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order), i.e. `X[i][j]` is next to `X[i][j + 1]` in memory, and `Y[0][0]` immediately follows the last element of X. `sum` is not stored in the cache. Also assume that the caches are initially empty. All elements of the matrices are 32-bit integers.

1. Suppose we have a processor with a 32 KiB 4-way set-associative [D\$](https://inst.eecs.berkeley.edu/~cs61c/sp20/pdfs/lectures/lec16.pdf) with a FIFO replacement policy. Determine how many compulsory, conflict, and capacity misses occur when Loop A and Loop B are executed. Complete the following table.

| - | Compulsory | Conflict and Capacity | Total Misses |
|:--:|:----------:|:---------------------:|:------------:|
| Loop A | __ A01 __ | __ A02 __ | __ A03 __ |
| Loop B | __ A04 __ | __ A05 __ | __ A06 __ |

> * A01 = ?
> ==4096==
> * A02 = ?
> ==0==
> * A03 = ?
> ==4096==
> * A04 = ?
> ==4096==
> * A05 = ?
> ==28672==
> * A06 = ?
> ==32768==
>
> Loop A accesses the two matrices in row-major order (linear access). The only misses are compulsory misses, which occur on the first access to each line; this happens every 8 accesses, since cache lines are 8 words long. Therefore there are $256 \times 64 \times \frac{2}{8} = 2^{12} = 4096$ compulsory (and total) misses.
> Loop B accesses each matrix with a stride of 64 words (256 bytes, i.e. 8 lines). The cache has $\frac{2^{15}}{2^{5} \times 2^{2}} = 256$ sets, so the first iterations of the inner loop only touch sets 0, 8, 16, ..., 248, i.e. 1/8 of the available sets; note that the X and Y lines map to the same sets. Each of these 32 sets receives a new line from X and a new line from Y every 32 iterations of the inner loop, so after 64 iterations all 4 ways of these sets are full. The accesses during the remainder of the inner loop must therefore evict earlier lines, which prevents any reuse of cached lines. As a result, every access is a cache miss: the same 4096 compulsory misses as before, with the remaining 28672 misses being capacity or conflict misses.

2. Now assume we add a 256 KiB 8-way set-associative `L2$` with a FIFO replacement policy. Assume the `L2$` is inclusive of the `L1$`, and the `L1 D$` is still a 32 KiB 4-way set-associative cache. How many cache misses occur at the `L2$` when Loops A and B are run?

For Loop A: __ A07 __; For Loop B: __ A08 __

> * A07 = ?
> ==4096==
> * A08 = ?
> ==4096==
>
> Loop A still experiences the same number of compulsory misses as in the L1, since these misses are unavoidable: 4096 misses in total.
> For Loop B, the L2 is large enough, with $\frac{2^{18}}{2^{5} \times 2^{3}} = 2^{10}$ sets, that a single pass of the inner loop, which touches 256 rows at a 64-word stride in each matrix, fits entirely in the 1024 sets of 8 ways. On the next pass of the inner loop we then access the next word of each cached line, resulting in hits. Therefore the only misses are the 4096 compulsory misses.

3. Based on careful profiling and performance analysis, you determine that the L1 hit time is 4 cycles, the L2 hit time is 12 cycles, and the main-memory (DRAM) access time is 50 cycles. What is the average memory access time (AMAT) while executing Loops A and B?

For Loop A: __ A09 __ cycles; For Loop B: __ A10 __ cycles

> * A09 = ?
> ==11.75==
> * A10 = ?
> ==22.25==
>
> AMAT = t~L1~ + LMR~L1~ ⋅ (t~L2~ + LMR~L2~ ⋅ t~DRAM~)
> where t~L1~, t~L2~, and t~DRAM~ are the L1, L2, and DRAM hit times, and the LMRs are local miss rates.
> Loop A: AMAT = 4 + 1/8 ⋅ (12 + 1 ⋅ 50) = 11.75 cycles.
> Loop B: AMAT = 4 + 1 ⋅ (12 + 1/8 ⋅ 50) = 22.25 cycles.
> Note that the local miss rate of the L2 cache is 1 for Loop A: every L1 miss also misses in the L2.

4. Consider using a Virtually-Indexed Physically-Tagged (VIPT) cache as your implementation. Can virtual-address aliasing occur with this cache, assuming a 4 KiB page size? What is the minimum associativity required to avoid aliasing, given that the cache size is fixed at 32 KiB?

__ A11 __ (Answer `Yes` or `No`); the associativity has to be at least __ A12 __ (Answer with a number, or `N/A` if it is not possible.)

> * A11 = ?
> ==Yes==
> * A12 = ?
> ==8==
> Aliasing can occur. To avoid virtual-address aliasing in a VIPT cache, the index and line-offset bits must fit within the page offset: (Num Sets) × (Line Size) ≤ (Page Size), i.e. (Cache Size) / (Num Ways) ≤ (Page Size).
> With a 32 KiB cache and 4 KiB pages, the associativity therefore has to be at least 32 KiB / 4 KiB = 8.

---

## Problem `B`

Consider the loop shown below, which adds two vectors in a [saturated manner](https://en.wikipedia.org/wiki/Saturation_arithmetic); `X[i]`, `Y[i]`, and `MAX` are all `int32_t`-type variables.
```c
for (int i = 0; i < N; i++) {
    Y[i] += X[i];
    if (Y[i] > MAX)
        Y[i] = MAX;
}
```

In RISC-V assembly, the loop is as follows:

```c
# x1 is a pointer to the beginning of "X"
# x2 is a pointer to the beginning of "Y"
# x3 is "MAX"
# x4 is "i"
loop:
    lw x5, 0(x1)     # x5 has "X[i]"
    lw x6, 0(x2)     # x6 has "Y[i]"
    addi x1, x1, 4   # increment X pointer
    addi x2, x2, 4   # increment Y pointer
    addi x4, x4, -1  # decrement "i"
    add x7, x5, x6
    blt x7, x3, skip_sat
    mv x7, x3
skip_sat:
    sw x7, -4(x2)
    bnez x4, loop
end:
```

Consider that this loop is executed on a multi-threaded, 5-stage classic RISC processor where:

* Loads and stores have a latency of 20 cycles
* Taken branches have a latency of 2 cycles
* All other instructions (including non-taken branches) have a latency of 1 cycle
* Perfect branch prediction is present

Do not reorder the loop's instructions.

1. How many threads are required to ensure that fixed round-robin scheduling never causes a stall? __ B01 __

> * B01 = ?
> ==5==
> The longest stall is caused by the dependency between `lw x6` and `add x7`. With N threads scheduled round-robin, `add x7` issues 4N cycles after `lw x6`, so we need:
> 5N − N ≥ 20
> 4N ≥ 20
> N ≥ 5
> At least 5 threads are required. This is also sufficient to hide the latency of the taken branches.
> ```
> [lw x6] [lw x6] [lw x6] [lw x6] [lw x6]
> [addi x1] [addi x1] [addi x1] [addi x1] [addi x1]
> [addi x2] [addi x2] [addi x2] [addi x2] [addi x2]
> [addi x4] [addi x4] [addi x4] [addi x4] [addi x4]
> [add x7]
> ```

2. Suppose we instead employ a coarse-grained thread-scheduling policy that switches threads whenever a RAW hazard or a taken branch could cause a stall. WAR and WAW hazards do not cause stalls. How many threads are required to ensure that this scheduling policy never causes a stall? __ B02 __ Consider only steady-state execution.

> * B02 = ?
> ==5==
> In the worst case, we must assume that both branches are always taken, each causing a stall.
> Therefore, the longest sequence of instructions that we can execute without any stalls is:
> ```
> lw x5
> lw x6
> addi x1
> addi x2
> addi x4
> ```
> There is still a 16-cycle stall caused by the `lw x6` to `add x7` dependency (`add x7` issues 4 cycles after `lw x6`). Each other thread covers 5 cycles of this stall, so to hide the latency in the steady state we need:
> ⌈16/5⌉ + 1 = 4 + 1 = 5 threads

---

## Problem `C`

The following figure shows a classic fully-bypassed 5-stage pipeline that has been enhanced with an unpipelined divider running in parallel with the ALU. The figure does not show any bypass paths. This iterative divider produces 2 bits of the 32-bit result per cycle.

![](https://hackmd.io/_uploads/Hkih7IAuj.png)

1. What is the latency in cycles of a divide operation? __ C01 __ cycles

> * C01 = ?
> ==16==
> 32 / 2 = 16 cycles

2. What is the occupancy of a divide operation in cycles? __ C02 __ cycles

> * C02 = ?
> ==16==
> 16 cycles, since the divider is unpipelined.

3. Note that the `div` instruction in RISC-V cannot raise a data-dependent exception. To avoid pipeline stalls while a multi-cycle divide operation is in progress, the pipeline control logic allows subsequent instructions that do not depend on the divide result to be issued and completed before the divide has completed. Aside from the structural hazard on the divider itself, what additional hazards might `div` instructions cause? __ C03 __

> * C03 = ?
> ==WAW hazard, structural hazard==
> WAW hazards can arise from the out-of-order completion of `div` instructions. For any subsequent instruction that writes to the same destination register as an in-flight `div`, one approach is to stall at the D stage (or alternatively in X or M, at the potential cost of higher-fanout stall signals). This can be implemented with a comparator that checks the destination register of the `div` against the destination of the instruction in decode, or with a scoreboard that keeps a bit-vector of pending writes.
> A potential structural hazard also exists on the write port of the register file. When a `div` and another instruction that writes a register are both in the W stage, one of them must stall.
> Although not a true control hazard, when a branch following a not-yet-completed `div` is taken, the in-flight divide operation should not be killed by the partial pipeline flush.

---

## Problem `D`

Consider the [SAXPY operation](https://developer.nvidia.com/blog/six-ways-saxpy/) on 32-bit integer values:

```c
for (i = 0; i < len; i++) {
    y[i] = a * x[i] + y[i];
}
```

We are interested in both the CPI and how many [cycles-per-element (CPE)](https://www.cs.cmu.edu/afs/cs/academic/class/15213-s10/www/lectures/11-opt.pdf) the loop above takes to execute. Consider the following RV32IM assembly implementation of this operation:

```c
// x1 holds pointer to x
// x2 holds pointer to y
// x3 holds a
// x4 holds len
    add x5, x0, x0
LOOP:
    bge x5, x4, DONE
    lw x6, 0(x1)
    mul x6, x6, x3
    lw x7, 0(x2)
    add x6, x6, x7
    sw x6, 0(x2)
    addi x1, x1, 4
    addi x2, x2, 4
    addi x5, x5, 1
    j LOOP
DONE:
```

The pipeline diagram below shows the first 12 dynamic instructions on a classic 5-stage in-order pipeline with full bypassing. Assume no cache misses, no branch prediction, and `len > 2`. Note that unconditional branches are resolved in D.

![](https://hackmd.io/_uploads/Byf8u80di.png)

1. How many instruction bytes are fetched per loop iteration? __ D01 __ bytes

> * D01 = ?
> ==40==
> (10 instructions / iteration) × (4 B / instruction) = 40 bytes

2. How many data bytes are loaded per loop iteration? __ D02 __ bytes

> * D02 = ?
> ==8==
> 2 `lw` instructions = 8 bytes loaded

3. What does the CPI converge to as the number of completed iterations increases? __ D03 __

> * D03 = ?
> ==1.3==
> CPI converges to 1.3, which is equivalent to CPE = 13, as there are 10 instructions per iteration.

4. You may reorder the assembly to achieve a lower CPE, without adding new instructions. What are the CPI and CPE after reordering?
CPI: __ D04 __, CPE: __ D05 __

> * D04 = ?
> ==1.1==
> * D05 = ?
> ==11==
> Reorder so that `mul` no longer immediately follows the load it depends on:
> ```
>     add x5, x0, x0
> LOOP:
>     bge x5, x4, DONE
>     lw x6, 0(x1)
>     lw x7, 0(x2)
>     mul x6, x6, x3
>     add x6, x6, x7
>     sw x6, 0(x2)
>     addi x1, x1, 4
>     addi x2, x2, 4
>     addi x5, x5, 1
>     j LOOP
> DONE:
> ```

---

## Problem `E`

Consider a variable number of producer threads (the exact number is specified by each question) and a single consumer. All producer threads execute the same code, each producing two 8-byte elements for the consumer; thread C acts as the consumer. An example with two producers:

![](https://hackmd.io/_uploads/SyjrcLRdi.png)

A single FIFO, shared by all threads, manages the communication. Producers append new elements at the position indicated by the tail pointer (in adjacent locations), and the consumer reads one element at the position indicated by the head pointer before incrementing it.

![](https://hackmd.io/_uploads/HkOqcIC_o.png)

The code is as follows (you cannot change or reorder instructions):

![](https://hackmd.io/_uploads/rkiyjI0ds.png)

1. Assume that we maintain [quiescent consistency](https://en.wikipedia.org/wiki/Concurrent_data_structure#Basic_principles), but now with two producer threads and one consumer. Further assume that the architecture provides an atomic instruction:

![](https://hackmd.io/_uploads/ByRV68Ruj.png)

Rewrite the producer's code to ensure correctness.

Spin: Load R~tail~, 0(tail)
Addi R~tail2~, R~tail~, 2
Swap(tail), E01
if E02 goto Spin
Store 0(R~tail~), x
Store 1(R~tail~), y

:arrow_right: Fill out E01 (an identifier) and E02 (an expression)

> * E01 = ?
> ==R~tail2~==
> * E02 = ?
> ==R~tail2~ != R~tail~==

2. In the previous part, how many memory (or cache) accesses does each failed attempt to enter the critical section generate? __ E03 __

> * E03 = ?
> ==2==
> Each failed attempt causes one memory access for the load and one for the swap, so 2 in total.

3.
Pick one of the following alternative atomic operations and describe how you would use it to reduce the number of memory or cache accesses. __ E04 __ (Answer with one of Fetch-and-Add, Test-and-Set, or Compare-and-Swap)

![](https://hackmd.io/_uploads/H1WtkD0us.png)

> * E04 = ?
> ==Fetch-and-Add==
> A Fetch-and-Add can replace the original load of R~tail~ together with the two instructions that follow it (instructions 1, 2, and 3), so the code performs no extra loads for synchronization.

---

## Problem `F`

Consider the following two RISC-V programs, executed as processes that have virtual addresses assigned to them. Note that each pseudoinstruction in this code assembles to a single RISC-V instruction.

![](https://hackmd.io/_uploads/ryZE-PA_s.png)

Assume the operating system (OS) schedules Process B first. For the following questions, write `N/A` if the value cannot be determined from the information given.

Scenario A: A timer interrupt occurs just prior to the execution of `li t1, 10` in Process B. Process A runs for some time, then another timer interrupt occurs and control is returned to Process B.

1. What are the values in the following registers immediately after returning to Process B?

* `t0` : __ F01 __
* `t1` : __ F02 __
* `pc` : __ F03 __

> * F01 = ?
> ==50==
> * F02 = ?
> ==N/A== OR ==0==
> * F03 = ?
> ==0x454==

2. What are the values in the following registers just after the first `ecall` in Process A completes?

* `t0` : __ F04 __
* `t1` : __ F05 __
* `pc` : __ F06 __

> * F04 = ?
> ==5000==
> * F05 = ?
> ==0==
> * F06 = ?
> ==0x214==

---

## Problem `G`

Let's implement a mutual-exclusion lock using the RISC-V 64-bit [AMO instructions](https://msyksphinz-self.github.io/riscv-isadoc/html/rva.html) (RV64A).
![](https://hackmd.io/_uploads/rk_lLP0di.png)

Both threads execute:

```c
    li xone, 1
spin:                          // Acquire lock
    G01 xlock, xone, (xlockp)
    bnez xlock, spin
                               // Critical section
    ld xdata, (xdatap)
    addi xdata, xdata, 1
    sd xdata, (xdatap)
    G02                        // Release lock
```

Fill out G01 and G02 with instructions prefixed with `amo`.

> * G01 = ?
> ==`amoswap.w.aq`==
> * G02 = ?
> ==`amoswap.w.rl x0, x0, (xlockp)`==