---
tags: computer-arch
---

# Quiz6 of Computer Architecture (2022 Fall)

> Solutions

## Problem `A`

In this problem you are asked to design a memory hierarchy with an 8-word (32-byte) cache line and 32-bit virtual and physical addresses. Consider the following two loops:

- [ ] Loop A

```c
int sum = 0;
for (int i = 0; i < (1 << 8); i++)
    for (int j = 0; j < (1 << 6); j++)
        sum += X[i][j] * Y[i][j];
```

- [ ] Loop B

```c
int sum = 0;
for (int j = 0; j < (1 << 6); j++)
    for (int i = 0; i < (1 << 8); i++)
        sum += X[i][j] * Y[i][j];
```

Assume `X[0][0]` is stored at address `0x0`. Assume further that matrices X and Y are stored contiguously and that both are kept in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order), i.e. `X[i][j]` is next to `X[i][j + 1]` in memory, and `Y[0][0]` immediately follows the last element of X. `sum` is not stored in the cache. Also assume that the caches are initially empty. All elements of the matrices are 32-bit integers.

1. Suppose we have a processor with a 32 KiB 4-way set-associative [D\$](https://inst.eecs.berkeley.edu/~cs61c/sp20/pdfs/lectures/lec16.pdf) with a FIFO replacement policy. Determine how many compulsory, conflict, and capacity misses occur when Loop A and Loop B are executed. Complete the following table.

| - | Compulsory | Conflict and Capacity | Total Misses |
|:--:|:----------:|:---------------------:|:------------:|
| Loop A | __ A01 __ | __ A02 __ | __ A03 __ |
| Loop B | __ A04 __ | __ A05 __ | __ A06 __ |

> * A01 = ?
> ==4096==
> * A02 = ?
> ==0==
> * A03 = ?
> ==4096==
> * A04 = ?
> ==4096==
> * A05 = ?
> ==28672==
> * A06 = ?
> ==32768==
>
> Loop A accesses the two matrices in row-major order (linear access). The only misses are compulsory misses, which occur on the first access to each line; this happens every 8 accesses, since cache lines are 8 words long. Therefore there are $256 \times 64 \times \frac{2}{8} = 2^{12} = 4096$ compulsory (and total) misses.
> Loop B accesses each matrix with a stride of 64 words (256 bytes, i.e. 8 lines). The cache has $\frac{2^{15}}{2^{5} \times 2^{2}} = 256$ sets, so the first iterations of the inner loop only touch sets 0, 8, 16, ..., 248, i.e. 1/8 of the available sets; note that the X and Y lines map to the same sets. Each of these 32 sets receives a new line from X and a new line from Y every 32 iterations of the inner loop, so after 64 iterations all 4 ways of these sets are full. The accesses during the remainder of the inner loop must therefore evict earlier lines, which prevents any reuse of cached lines. As a result, every access is a cache miss: the same 4096 compulsory misses as before, with the remaining 28672 misses being capacity or conflict misses.

2. Now assume we add a 256 KiB 8-way set-associative `L2$` with a FIFO replacement policy. Assume the `L2$` is inclusive of the `L1$`, and the `L1 D$` is still a 32 KiB 4-way set-associative cache. How many cache misses occur at the `L2$` when Loops A and B are run?

For Loop A: __ A07 __; For Loop B: __ A08 __

> * A07 = ?
> ==4096==
> * A08 = ?
> ==4096==
>
> Loop A still experiences the same number of compulsory misses as in the L1, since these misses are unavoidable: 4096 misses in total.
> For Loop B, the L2 is large enough, with $\frac{2^{18}}{2^{5} \times 2^{3}} = 2^{10}$ sets, that a single pass of the inner loop, which touches 256 rows at a 64-word stride in each matrix, fits entirely in the 1024 sets of 8 ways. On the next pass of the inner loop we then access the next word of each cached line, resulting in hits. Therefore the only misses are the 4096 compulsory misses.

3. Based on careful profiling and performance analysis, you determine that the L1 hit time is 4 cycles, the L2 hit time is 12 cycles, and the main-memory (DRAM) access time is 50 cycles. What is the average memory access time (AMAT) while executing Loops A and B?

For Loop A: __ A09 __ cycles; For Loop B: __ A10 __ cycles

> * A09 = ?
> ==11.75==
> * A10 = ?
> ==22.25==
>
> AMAT = t~L1~ + LMR~L1~ ⋅ (t~L2~ + LMR~L2~ ⋅ t~DRAM~)
> where t~L1~, t~L2~, and t~DRAM~ are the L1, L2, and DRAM hit times, and the LMRs are local miss rates.
> Loop A: AMAT = 4 + 1/8 ⋅ (12 + 1 ⋅ 50) = 11.75 cycles.
> Loop B: AMAT = 4 + 1 ⋅ (12 + 1/8 ⋅ 50) = 22.25 cycles.
> Note that the local miss rate of the L2 cache is 1 for Loop A: every L1 miss also misses in the L2.

4. Consider using a Virtually-Indexed Physically-Tagged (VIPT) cache as your implementation. Can virtual-address aliasing occur with this cache, assuming a 4 KiB page size? What is the minimum associativity required to avoid aliasing, given that the cache size is fixed at 32 KiB?

__ A11 __ (Answer `Yes` or `No`); the associativity has to be at least __ A12 __ (Answer with a number, or `N/A` if it is not possible.)

> * A11 = ?
> ==Yes==
> * A12 = ?
> ==8==
> Aliasing can occur. To avoid virtual-address aliasing in a VIPT cache, the index and line-offset bits must fit within the page offset: (Num Sets) × (Line Size) ≤ (Page Size), i.e. (Cache Size) / (Num Ways) ≤ (Page Size).
> With a 32 KiB cache and 4 KiB pages, the associativity therefore has to be at least 32 KiB / 4 KiB = 8.

---

## Problem `B`

Consider the loop shown below, which adds two vectors in a [saturated manner](https://en.wikipedia.org/wiki/Saturation_arithmetic); `X[i]`, `Y[i]`, and `MAX` are all `int32_t`-type variables.
```c
for (int i = 0; i < N; i++) {
    Y[i] += X[i];
    if (Y[i] > MAX)
        Y[i] = MAX;
}
```

In RISC-V assembly, the loop is as follows:

```c
# x1 is a pointer to the beginning of "X"
# x2 is a pointer to the beginning of "Y"
# x3 is "MAX"
# x4 is "i"
loop:
    lw x5, 0(x1)     # x5 has "X[i]"
    lw x6, 0(x2)     # x6 has "Y[i]"
    addi x1, x1, 4   # increment X pointer
    addi x2, x2, 4   # increment Y pointer
    addi x4, x4, -1  # decrement "i"
    add x7, x5, x6
    blt x7, x3, skip_sat
    mv x7, x3
skip_sat:
    sw x7, -4(x2)
    bnez x4, loop
end:
```

Consider that this loop is executed on a multi-threaded, 5-stage classic RISC processor where:

* Loads and stores have a latency of 20 cycles
* Taken branches have a latency of 2 cycles
* All other instructions (including non-taken branches) have a latency of 1 cycle
* Perfect branch prediction is present

Do not reorder the loop's instructions.

1. How many threads are required to ensure that fixed round-robin scheduling never causes a stall? __ B01 __

> * B01 = ?
> ==5==
> The longest stall is caused by the dependency between `lw x6` and `add x7`. With N threads scheduled round-robin, `add x7` issues 4N cycles after `lw x6`, so we need:
> 5N − N ≥ 20
> 4N ≥ 20
> N ≥ 5
> At least 5 threads are required. This is also sufficient to hide the latency of the taken branches.
> ```
> [lw x6] [lw x6] [lw x6] [lw x6] [lw x6]
> [addi x1] [addi x1] [addi x1] [addi x1] [addi x1]
> [addi x2] [addi x2] [addi x2] [addi x2] [addi x2]
> [addi x4] [addi x4] [addi x4] [addi x4] [addi x4]
> [add x7]
> ```

2. Suppose we instead employ a coarse-grained thread-scheduling policy that switches threads whenever a RAW hazard or a taken branch could cause a stall. WAR and WAW hazards do not cause stalls. How many threads are required to ensure that this scheduling policy never causes a stall? __ B02 __ Consider only steady-state execution.

> * B02 = ?
> ==5==
> In the worst case, we must assume that both branches are always taken, each causing a stall.
> Therefore, the longest sequence of instructions that we can execute without any stalls is:
> ```
> lw x5
> lw x6
> addi x1
> addi x2
> addi x4
> ```
> There is still a 16-cycle stall caused by the `lw x6` to `add x7` dependency (`add x7` issues 4 cycles after `lw x6`). Each other thread covers 5 cycles of this stall, so to hide the latency in the steady state we need:
> ⌈16/5⌉ + 1 = 4 + 1 = 5 threads

---

## Problem `C`

The following figure shows a classic fully-bypassed 5-stage pipeline that has been enhanced with an unpipelined divider running in parallel with the ALU. The figure does not show any bypass paths. This iterative divider produces 2 bits of the 32-bit result per cycle.

![](https://hackmd.io/_uploads/Hkih7IAuj.png)

1. What is the latency in cycles of a divide operation? __ C01 __ cycles

> * C01 = ?
> ==16==
> 32 / 2 = 16 cycles

2. What is the occupancy of a divide operation in cycles? __ C02 __ cycles

> * C02 = ?
> ==16==
> 16 cycles, since the divider is unpipelined.

3. Note that the `div` instruction in RISC-V cannot raise a data-dependent exception. To avoid pipeline stalls while a multi-cycle divide operation is in progress, the pipeline control logic allows subsequent instructions that do not depend on the divide result to be issued and completed before the divide has completed. Aside from the structural hazard on the divider itself, what additional hazards might `div` instructions cause? __ C03 __

> * C03 = ?
> ==WAW hazard, structural hazard==
> WAW hazards can arise from the out-of-order completion of `div` instructions. For any subsequent instruction that writes to the same destination register as an in-flight `div`, one approach is to stall at the D stage (or alternatively in X or M, at the potential cost of higher-fanout stall signals). This can be implemented with a comparator that checks the destination register of the `div` against the destination of the instruction in decode, or with a scoreboard that keeps a bit-vector of pending writes.
> A potential structural hazard also exists on the write port of the register file. When a `div` and another instruction that writes a register are both in the W stage, one of them must stall.
> Although not a true control hazard, when a branch following a not-yet-completed `div` is taken, the in-flight divide operation should not be killed by the partial pipeline flush.

---

## Problem `D`

Consider the [SAXPY operation](https://developer.nvidia.com/blog/six-ways-saxpy/) on 32-bit integer values:

```c
for (i = 0; i < len; i++) {
    y[i] = a * x[i] + y[i];
}
```

We are interested in both the CPI and how many [cycles-per-element (CPE)](https://www.cs.cmu.edu/afs/cs/academic/class/15213-s10/www/lectures/11-opt.pdf) the loop above takes to execute. Consider the following RV32IM assembly implementation of this operation:

```c
// x1 holds pointer to x
// x2 holds pointer to y
// x3 holds a
// x4 holds len
    add x5, x0, x0
LOOP:
    bge x5, x4, DONE
    lw x6, 0(x1)
    mul x6, x6, x3
    lw x7, 0(x2)
    add x6, x6, x7
    sw x6, 0(x2)
    addi x1, x1, 4
    addi x2, x2, 4
    addi x5, x5, 1
    j LOOP
DONE:
```

The pipeline diagram below shows the first 12 dynamic instructions on a classic 5-stage in-order pipeline with full bypassing. Assume no cache misses, no branch prediction, and `len > 2`. Note that unconditional branches are resolved in D.

![](https://hackmd.io/_uploads/Byf8u80di.png)

1. How many instruction bytes are fetched per loop iteration? __ D01 __ bytes

> * D01 = ?
> ==40==
> (10 instructions / iteration) × (4 B / instruction) = 40 bytes

2. How many data bytes are loaded per loop iteration? __ D02 __ bytes

> * D02 = ?
> ==8==
> 2 `lw` instructions = 8 bytes loaded

3. What does the CPI converge to as the number of completed iterations increases? __ D03 __

> * D03 = ?
> ==1.3==
> CPI converges to 1.3, which is equivalent to CPE = 13, as there are 10 instructions per iteration.

4. You may reorder the assembly to achieve a lower CPE, without adding new instructions. What are the CPI and CPE after reordering?
CPI: __ D04 __, CPE: __ D05 __

> * D04 = ?
> ==1.1==
> * D05 = ?
> ==11==
> Reorder so that `mul` no longer immediately follows the load it depends on:
> ```
>     add x5, x0, x0
> LOOP:
>     bge x5, x4, DONE
>     lw x6, 0(x1)
>     lw x7, 0(x2)
>     mul x6, x6, x3
>     add x6, x6, x7
>     sw x6, 0(x2)
>     addi x1, x1, 4
>     addi x2, x2, 4
>     addi x5, x5, 1
>     j LOOP
> DONE:
> ```

---

## Problem `E`

Consider a variable number of producer threads (the exact number is specified by each question) and a single consumer. All producer threads execute the same code, each producing two 8-byte elements for the consumer; thread C acts as the consumer. An example with two producers:

![](https://hackmd.io/_uploads/SyjrcLRdi.png)

A single FIFO, shared by all threads, manages the communication. Producers append new elements at the position indicated by the tail pointer (in adjacent locations), and the consumer reads one element at the position indicated by the head pointer before incrementing it.

![](https://hackmd.io/_uploads/HkOqcIC_o.png)

The code is as follows (you cannot change or reorder instructions):

![](https://hackmd.io/_uploads/rkiyjI0ds.png)

1. Assume that we maintain [quiescent consistency](https://en.wikipedia.org/wiki/Concurrent_data_structure#Basic_principles), but now with two producer threads and one consumer. Further assume that the architecture provides an atomic instruction:

![](https://hackmd.io/_uploads/ByRV68Ruj.png)

Rewrite the producer's code to ensure correctness.

Spin: Load R~tail~, 0(tail)
Addi R~tail2~, R~tail~, 2
Swap(tail), E01
if E02 goto Spin
Store 0(R~tail~), x
Store 1(R~tail~), y

:arrow_right: Fill out E01 (an identifier) and E02 (an expression)

> * E01 = ?
> ==R~tail2~==
> * E02 = ?
> ==R~tail2~ != R~tail~==

2. In the previous part, how many memory (or cache) accesses does each failed attempt to enter the critical section generate? __ E03 __

> * E03 = ?
> ==2==
> Each failed attempt causes one memory access for the load and one for the swap, so 2 in total.

3.
Pick one of the following alternative atomic operations and describe how you would use it to reduce the number of memory or cache accesses. __ E04 __ (Answer with one of Fetch-and-Add, Test-and-Set, or Compare-and-Swap)

![](https://hackmd.io/_uploads/H1WtkD0us.png)

> * E04 = ?
> ==Fetch-and-Add==
> A Fetch-and-Add can replace the original load of R~tail~ together with the two instructions that follow it (instructions 1, 2, and 3), so the code performs no extra loads for synchronization.

---

## Problem `F`

Consider the following two RISC-V programs, executed as processes that have virtual addresses assigned to them. Note that each pseudoinstruction in this code assembles to a single RISC-V instruction.

![](https://hackmd.io/_uploads/ryZE-PA_s.png)

Assume the operating system (OS) schedules Process B first. For the following questions, write `N/A` if the value cannot be determined from the information given.

Scenario A: A timer interrupt occurs just prior to the execution of `li t1, 10` in Process B. Process A runs for some time, then another timer interrupt occurs and control is returned to Process B.

1. What are the values in the following registers immediately after returning to Process B?

* `t0` : __ F01 __
* `t1` : __ F02 __
* `pc` : __ F03 __

> * F01 = ?
> ==50==
> * F02 = ?
> ==N/A== OR ==0==
> * F03 = ?
> ==0x454==

2. What are the values in the following registers just after the first `ecall` in Process A completes?

* `t0` : __ F04 __
* `t1` : __ F05 __
* `pc` : __ F06 __

> * F04 = ?
> ==5000==
> * F05 = ?
> ==0==
> * F06 = ?
> ==0x214==

---

## Problem `G`

Let's implement a mutual-exclusion lock using the RISC-V 64-bit [AMO instructions](https://msyksphinz-self.github.io/riscv-isadoc/html/rva.html) (RV64A).
![](https://hackmd.io/_uploads/rk_lLP0di.png)

Both threads execute:

```c
    li xone, 1
spin:                          // Acquire lock
    G01 xlock, xone, (xlockp)
    bnez xlock, spin
                               // Critical section
    ld xdata, (xdatap)
    addi xdata, xdata, 1
    sd xdata, (xdatap)
    G02                        // Release lock
```

Fill out G01 and G02 with instructions prefixed with `amo`.

> * G01 = ?
> ==`amoswap.w.aq`==
> * G02 = ?
> ==`amoswap.w.rl x0, x0, (xlockp)`==