--- tags: computer-arch --- # Quiz6 of Computer Architecture (2022 Fall) :::info :information_source: General Information * You are allowed to read **[lecture materials](http://wiki.csie.ncku.edu.tw/arch/schedule)**. * That is, an open book exam. * Even [ChatGPT](https://openai.com/blog/chatgpt/) is available for consultation, however keep in mind the restrictions below. $\to$ Please let the instructor know if you do receive the right answers through ChatGPT. * We are using the honor system during this quiz, and would like you to accept the following: 1. You will not share the quiz with anyone. 2. You will not talk to anyone about the information on the quiz until the answers are made public. * Each answer has 4 points. * You must answer in valid numeric/C99/assembly representation and/or English alphabet except your formal name. * Message **[the instructor via Facebook](https://www.facebook.com/JservFans)** if you found something wrong. * :timer_clock: 09:10 ~ 10:20AM on Dec 20, 2022 ::: ## Problem A You are asked to create a memory hierarchy with an 8-word (32 byte) cache line and a 32-bit virtual and physical address for this problem. Think about the following two loops: - [ ] Loop A c int sum = 0; for (int i = 0; i < (1 << 8); i++) for (int j = 0; j < (1 << 6); j++) sum += X[i][j] * Y[i][j];  - [ ] Loop B c int sum = 0; for (int j = 0; j < (1 << 6); j++) for (int i = 0; i < (1 << 8); i++) sum += X[i][j] * Y[i][j];  Assume X[0][0] is stored at address 0x0. Assume further that matrices X and Y are stored contiguously and that they are both kept in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order), i.e. X[i][j] is next to X[i][j + 1] in memory and X[256][64] is next to Y[0][0]. sum is not stored in the cache. Also assume that caches are initially empty. All elements of the matrices are 32-bit integers. 1. Suppose we have a processor with a 32 KiB 4-way set-associative [D\$](https://inst.eecs.berkeley.edu/~cs61c/sp20/pdfs/lectures/lec16.pdf) with FIFO replacement policy. Determine how many compulsory, conflict, and capacity misses will occur each time Loop A and Loop B are executed. Complete the following table. | - | Compulsory | Conflict and Capacity | Total Misses | |:--:|:----------:|:---------------------:|:------------:| | Loop A | __ A01 __ | __ A02 __ | __ A03 __ | | Loop B | __ A04 __ | __ A05 __ | __ A06 __ | > * A01 = ? > * A02 = ? > * A03 = ? > * A04 = ? > * A05 = ? > * A06 = ? 2. A 256 KiB 8-way set-associative L2$ with a FIFO replacement policy is what we would assume you implement. Assume the L2$ includes the L1$ and the L1 D$ is still a 32 KiB 4-way set-associative. How many cache misses at the L2$ occur when Loops A and B are run? For Loop A: __ A07 __; For Loop B: __ A08 __ > * A07 = ? > * A08 = ? 3. You determine that the L1 hit time is 4 cycles, the L2 hit time is 12 cycles, and the main memory (DRAM) access time is 50 cycles based on thorough profiling and performance study. What is the average memory access time (AMAT) while executing Loop A and B? For Loop A: __ A09 __ cycles; For Loop B: __ A10 __ cycles > * A09 = ? > * A10 = ? 4. Consider using a [Virtually-Indexed Physically-Tagged](https://en.wikipedia.org/wiki/CPU_cache) (VIPT) cache as your implementation. Is it possible to use virtual address aliasing with this cache, assuming a 4 KiB page size? what is the minimum number of set-associativity that is required to avoid aliasing given that the cache size is fixed at 32 KiB? __ A11 __ (Answer in Yes or No), associtvity has to be at least __ A12 __ (Answer in number or N/A if it is not possible.) > * A11 = ? > * A12 = ? --- ## Problem B Consider the loop shown below, which adds two vectors in a [saturated manner](https://en.wikipedia.org/wiki/Saturation_arithmetic); X[i], Y[i], and MAX are all int32_t-type variables. c for (int i = 0; i < N; i++) { Y[i] += X[i]; if (Y[i] > MAX) Y[i] = MAX; }  In RISC-V assembly, the loop is as follow: c # x1 is a pointer to the beginning of "X" # x2 is a pointer to the beginning of "Y" # x3 is "MAX" # x4 is "i" loop: ld x5, 0(x1) # x5 has "X[i]" ld x6, 0(x2) # x6 has "Y[i]" addi x1, x1, 4 # increment X pointer addi x2, x2, 4 # increment Y pointer addi x4, x4, -1 # decrement "i" add x7, x5, x6 blt x7, x3, skip_sat mv x7, x3 skip_sat: sw x7, -4(x2) bnez x4, loop end:  Consider that this loop is executed on a multi-threaded, 5-stage classic RISC processor where: * Loads and stores have a latency of 20 cycles. * Taken branches have a latency of 2 cycles. * All other instructions (including non-taken branches) have a latency of 1 cycle. * Perfect branch prediction is present. Do not reorder the loop's instructions. 1. How many threads are required to ensure that fixed round-robin scheduling never causes us to stall? __ B01 __ > * B01 = ? 2. Let's say we employ a coarse-grained thread scheduling policy that changes between threads anytime a stall might result from a RAW hazard or a taken branch. Stalls are not caused by WAR or WAW hazards. How many threads are required to ensure that this scheduling policy will never cause a stall? __ B02 __ Just consider about the steady-state execution. > * B02 = ? --- ## Problem C The following figure shows a classical fully-bypass 5-stage pipeline that has been enhanced by the addition of an unpipelined divider running parallel to the ALU. The graphic does not show any bypass routes. Up until it provides a complete 32-bit result, this iterative division outputs 2 bits every cycle. ![](https://hackmd.io/_uploads/Hkih7IAuj.png) 1. What is the latency in cycles of a divide operation? __ C01 __ cycles > * C01 = ? 2. What is the occupancy of a divide operation in cycles? __ C02 __ cycles > * C02 = ? 3. Note that the div instruction in RISC-V cannot raise a data-dependent exception. To avoid pipeline stalls while a multi-cycle divide operation is in progress, the pipeline control logic allows subsequent instructions that do not depend on the divide result to be issued and completed before the divide has completed. What additional hazards might be caused by div instructions, aside from the structural hazard on the divider itself? __ C03 __ > * C03 = ? --- ## Problem D Consider [SAXPY operation](https://developer.nvidia.com/blog/six-ways-saxpy/) on 32-bit integer values: c for (i = 0; i < len; i++) { y[i] = a * x[i]+ y[i] }  We are interested in both the CPI and how many [cycles-per-element (CPE)](https://www.cs.cmu.edu/afs/cs/academic/class/15213-s10/www/lectures/11-opt.pdf) the above takes to execute. Consider the following RV32IM assembly implementation of this operation: c // x1 holds pointer to x // x2 holds pointer to y // x3 holds a // x4 holds len add x5, x0, x0 LOOP: bge x5, x4, DONE lw x6, 0(x1) mul x6, x6, x3 lw x7, 0(x2) add x6, x6, x7 sw x6, 0(x2) addi x1, x1, #4 addi x2, x2, #4 addi x5, x5, #1 j LOOP DONE:  Consider a classic 5-stage in-order pipeline with full-bypassing, for the first 12 dynamic instructions. Assume no cache misses, and no branch prediction, and len > 2. Note that unconditional branches are resolved in D. ![](https://hackmd.io/_uploads/Byf8u80di.png) 1. How many instruction bytes are fetched per loop iteration? __ D01 __ bytes > * D01 = ? 2. How many data bytes are loaded per loop iteration? __ D02 __ bytes > * D02 = ? 3. What does CPI converge to as the number of completed iterations increases? __ D03 __ > * D03 = ? 5. You are allowed to reorder the assembly that achieves a lower CPE, without adding new instructions. What is the CPI and CPE of the reordering? CPI: __ D04 __, CPE: __ D05 __ > * D04 = ? > * D05 = ? --- ## Problem E The case with a variable number producer threads (as questions specify) and a single consumer will be consumed. The same code is executed by producer threads to produce two elements of 8 bytes each for the consumer. Thread C acts as the consumer. An example with two producers: ![](https://hackmd.io/_uploads/SyjrcLRdi.png) A FIFO manages the communication for each thread. For all, there is only one FIFO. The additional components were added by the producers at the tail-pointed position (in adjacent locations). The consumer reads one element at the head-directed position before incrementing it. ![](https://hackmd.io/_uploads/HkOqcIC_o.png) The code is as follows (you can not change or reorder intructions): ![](https://hackmd.io/_uploads/rkiyjI0ds.png) 1. Assume that we maintain [quiescent consistency](https://en.wikipedia.org/wiki/Concurrent_data_structure#Basic_principles), but now we will have two producer threads and one consumer. Further assume that the architecture provides an atomic instruction: ![](https://hackmd.io/_uploads/ByRV68Ruj.png) Rewrite the producer's code to ensure correctness. Spin: Load R~tail~, 0(tail) Addi R~tail2~, R~tail~, 2 Swap(tail), E01 if E02 goto Spin Store 0(R~tail~), x Store 1(R~tail~), y :arrow_right: Fill out E01 (an identifier) and E02 (an expression) > * E01 = ? > * E02 = ? 2. In previous part, how many memory (or cache) accesses does each failed attempt to enter the critical section produce? __ E03 __ > * E03 = ? 3. Pick one of the following alternative atomic operations and describe how you would use it to reduce the number of memory or cache accesses. __ E04 __ (Answer in one of Fetch-and-Add, Test-and-Set, and Compare-and-Swap) ![](https://hackmd.io/_uploads/H1WtkD0us.png) > * E04 = ? --- ## Problem F Consider the following two RISC-V programs executing processes that have virtual addresses assigned to them. It should be noted that this code converts each pseudoinstruction into a single RISC-V instruction. ![](https://hackmd.io/_uploads/ryZE-PA_s.png) Assume the operating system (OS) schedules Process B first. For the following questions, if you can not tell a value based on the information given, write N/A. Scenario A : A timer interrupt occurs just prior to the execution of li t1, 10 in Process B. Process A runs for some time, then another timer interrupt occurs and control is returned to Process B. 1. What are the values in the following registers immediately after returning to Process B? * t0 : __ F01 __ * t1 : __ F02 __ * pc : __ F03 __ > * F01 = ? > * F02 = ? > * F03 = ? 2. What are the values in the following registers just after the first ecall in Process A completes? * t0 : __ F04 __ * t1 : __ F05 __ * pc : __ F06 __ > * F04 = ? > * F05 = ? > * F06 = ? --- ## Problem G Let's implement the lock for [mutual exclusion](https://en.wikipedia.org/wiki/Mutual_exclusion) using RISC-V 64-bit [AMO instructions](https://msyksphinz-self.github.io/riscv-isadoc/html/rva.html) (RV64A). ![](https://hackmd.io/_uploads/rk_lLP0di.png) Both threads execute: c li xone, 1 spin: // Acquire Lock G01 xlock, xone, (xlockp) bnez xlock, spin // Critical Section ld xdata, (xdatap) add xdata, 1 sd xdata, (xdatap) G02 // Release Lock  Fill out G01 and G02 following the RV64A instructions. > * G01 = ? > * G02 = ?