A
You are asked to create a memory hierarchy with an 8-word (32 byte) cache line and a 32-bit virtual and physical address for this problem. Think about the following two loops:
Assume X[0][0]
is stored at address 0x0
. Assume further that matrices X and Y are stored contiguously and that they are both kept in row-major order, i.e. X[i][j]
is next to X[i][j + 1]
in memory and X[256][64]
is next to Y[0][0]
. sum
is not stored in the cache. Also assume that caches are initially empty. All elements of the matrices are 32-bit integers.
Suppose we have a processor with a 32 KiB 4-way set-associative D$ with FIFO replacement policy. Determine how many compulsory, conflict, and capacity misses will occur each time Loop A and Loop B are executed. Complete the following table.
- | Compulsory | Conflict and Capacity | Total Misses |
---|---|---|---|
Loop A | __ A01 __ | __ A02 __ | __ A03 __ |
Loop B | __ A04 __ | __ A05 __ | __ A06 __ |
- A01 = ?
- A02 = ?
- A03 = ?
- A04 = ?
- A05 = ?
- A06 = ?
A 256 KiB 8-way set-associative L2$
with a FIFO replacement policy is what we would assume you implement. Assume the L2$
includes the L1$
and the L1 D$
is still a 32 KiB 4-way set-associative. How many cache misses at the L2$
occur when Loops A and B are run? For Loop A: __ A07 __; For Loop B: __ A08 __
- A07 = ?
- A08 = ?
You determine that the L1 hit time is 4 cycles, the L2 hit time is 12 cycles, and the main memory (DRAM) access time is 50 cycles based on thorough profiling and performance study. What is the average memory access time (AMAT) while executing Loop A and B? For Loop A: __ A09 __ cycles; For Loop B: __ A10 __ cycles
- A09 = ?
- A10 = ?
Consider using a Virtually-Indexed Physically-Tagged (VIPT) cache as your implementation. Is it possible to use virtual address aliasing with this cache, assuming a 4 KiB page size? what is the minimum number of set-associativity that is required to avoid aliasing given that the cache size is fixed at 32 KiB? __ A11 __ (Answer in Yes
or No
), associtvity has to be at least __ A12 __ (Answer in number or N/A
if it is not possible.)
- A11 = ?
- A12 = ?
B
Consider the loop shown below, which adds two vectors in a saturated manner; X[i]
, Y[i]
, and MAX
are all int32_t
-type variables.
In RISC-V assembly, the loop is as follow:
Consider that this loop is executed on a multi-threaded, 5-stage classic RISC processor where:
Do not reorder the loop's instructions.
How many threads are required to ensure that fixed round-robin scheduling never causes us to stall? __ B01 __
- B01 = ?
Let's say we employ a coarse-grained thread scheduling policy that changes between threads anytime a stall might result from a RAW hazard or a taken branch. Stalls are not caused by WAR or WAW hazards. How many threads are required to ensure that this scheduling policy will never cause a stall? __ B02 __ Just consider about the steady-state execution.
- B02 = ?
C
The following figure shows a classical fully-bypass 5-stage pipeline that has been enhanced by the addition of an unpipelined divider running parallel to the ALU. The graphic does not show any bypass routes. Up until it provides a complete 32-bit result, this iterative division outputs 2 bits every cycle.
What is the latency in cycles of a divide operation? __ C01 __ cycles
- C01 = ?
What is the occupancy of a divide operation in cycles? __ C02 __ cycles
- C02 = ?
Note that the div
instruction in RISC-V cannot raise a data-dependent exception. To avoid pipeline stalls while a multi-cycle divide operation is in progress, the pipeline control logic allows subsequent instructions that do not depend on the divide result to be issued and completed before the divide has completed. What additional hazards might be caused by div
instructions, aside from the structural hazard on the divider itself? __ C03 __
- C03 = ?
D
Consider SAXPY operation on 32-bit integer values:
We are interested in both the CPI and how many cycles-per-element (CPE) the above takes to execute.
Consider the following RV32IM assembly implementation of this operation:
Consider a classic 5-stage in-order pipeline with full-bypassing, for the first 12 dynamic instructions. Assume no cache misses, and no branch prediction, and len > 2
. Note that unconditional branches are resolved in D.
How many instruction bytes are fetched per loop iteration? __ D01 __ bytes
- D01 = ?
How many data bytes are loaded per loop iteration? __ D02 __ bytes
- D02 = ?
What does CPI converge to as the number of completed iterations increases? __ D03 __
- D03 = ?
You are allowed to reorder the assembly that achieves a lower CPE, without adding new instructions. What is the CPI and CPE of the reordering? CPI: __ D04 __, CPE: __ D05 __
- D04 = ?
- D05 = ?
E
The case with a variable number producer threads (as questions specify) and a single consumer will be consumed. The same code is executed by producer threads to produce two elements of 8 bytes each for the consumer. Thread C acts as the consumer. An example with two producers:
A FIFO manages the communication for each thread. For all, there is only one FIFO. The additional components were added by the producers at the tail-pointed position (in adjacent locations). The consumer reads one element at the head-directed position before incrementing it.
The code is as follows (you can not change or reorder intructions):
Assume that we maintain quiescent consistency, but now we will have two producer threads and one consumer. Further assume that the architecture provides an atomic instruction:
Rewrite the producer's code to ensure correctness.
Spin:
Load Rtail, 0(tail)
Addi Rtail2, Rtail, 2
Swap(tail), E01
if E02 goto Spin
Store 0(Rtail), x
Store 1(Rtail), y
- E01 = ?
- E02 = ?
In previous part, how many memory (or cache) accesses does each failed attempt to enter the critical section produce? __ E03 __
- E03 = ?
Pick one of the following alternative atomic operations and describe how you would use it to reduce the number of memory or cache accesses. __ E04 __ (Answer in one of Fetch-and-Add, Test-and-Set, and Compare-and-Swap)
- E04 = ?
F
Consider the following two RISC-V programs executing processes that have virtual addresses assigned to them. It should be noted that this code converts each pseudoinstruction into a single RISC-V instruction.
Assume the operating system (OS) schedules Process B first. For the following questions, if you can not tell a value based on the information given, write N/A
.
Scenario A : A timer interrupt occurs just prior to the execution of li t1, 10
in Process B. Process A runs for some time, then another timer interrupt occurs and control is returned to Process B.
What are the values in the following registers immediately after returning to Process B?
t0
: __ F01 __t1
: __ F02 __pc
: __ F03 __
- F01 = ?
- F02 = ?
- F03 = ?
What are the values in the following registers just after the first ecall
in Process A completes?
t0
: __ F04 __t1
: __ F05 __pc
: __ F06 __
- F04 = ?
- F05 = ?
- F06 = ?
G
Let's implement the lock for mutual exclusion using RISC-V 64-bit AMO instructions (RV64A).
Both threads execute:
Fill out G01 and G02 following the RV64A instructions.
- G01 = ?
- G02 = ?