---
tags: computer-arch
---

# Quiz6 of Computer Architecture (2023 Fall)

> Solutions

## Problem `A`

Recall the tasks you completed in [Homework 3](https://hackmd.io/@sysprog/2023-arch-homework3), and answer the following questions based on that work.

1. Provide a summary of your work, including the accomplishments you achieved, in English.
> * A01 = ?
> (Any reasonable description receives credit; answer in English only.)

2. Describe the main challenge you encountered while working on this homework and explain how you managed to overcome it, in English.
> * A02 = ?
> (Any reasonable description receives credit; answer in English only.)

---

## Problem `B`

We aim to improve the efficiency of a program that computes the sum of all elements in a 64x64 matrix M. The matrix is laid out in memory in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). Assume the variable `sum` resides in a shared memory area. We are evaluating two code segments for this purpose:

- [ ] Program I
```c
int sum = 0;
for (int j = 0; j < 8 * 8; j++) {
    for (int i = 0; i < 8 * 8; i++) {
        sum += M[i][j];
    }
}
```

- [ ] Program II
```c
int sum = 0;
for (int i = 0; i < 8 * 8; i++) {
    for (int j = 0; j < 8 * 8; j++) {
        sum += M[i][j];
    }
}
```

Both programs run on a direct-mapped data cache with a capacity of 64 words and a block size of 8 words. The data cache is separate from the instruction cache. To reduce the total number of data cache misses, Program II is preferable: a cache line holds array elements that share the same value of `i` and have consecutive values of `j`. Because Program II walks across a row of M with incrementing `j`, it incurs only one cache miss per 8 accesses, since a new line is brought in after every eighth access.

1. How many total data cache misses occur during the execution of Program II?
> * B01 = ?
> ==512==
> $64 \times 64 / 8 = 512$

2. Consider running Program II on a multi-threaded processor equipped with 8 threads. What resources are shared among these threads?
> * B02 = ?
> ==sum== (this is the variable name)
> The variable `sum` is shared by all the threads.

3. (Continuing from the previous question) Do these shared resources lead to any issues? If so, show the proposed solutions.
> * B03 = ?
> ==Any answer mentioning a semaphore or mutex receives credit.==
> Yes. Updating `sum` is a read-modify-write sequence, so concurrent updates from different threads can interleave and lose increments. Unless a semaphore (or mutex) is used to serialize updates to `sum`, the final value may be incorrect.

---

## Problem `C`

Assume that we are going to build a minimal operating system kernel on RISC-V with the "A" (atomic) extension. Critical defines and routines are listed below.

```c
typedef int atomic_t;
typedef int atomic_val_t;

struct mutex {
    uint32_t lock;
    atomic_t waiters;
};

/* find the most significant bit. Not defined for n == 0. */
#define __fls(n) (31 - __builtin_clz(n))

static inline atomic_val_t atomic_or(atomic_t *addr, atomic_val_t bits)
{
    return __atomic_fetch_or(addr, bits, __ATOMIC_SEQ_CST);
}

static inline atomic_val_t atomic_clear_bits(atomic_t *addr, atomic_val_t bits)
{
    return __atomic_fetch_and(addr, ~bits, __ATOMIC_SEQ_CST);
}
```

Both `__atomic_fetch_or` and `__atomic_fetch_and` are provided by the [GCC atomics extension](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). `__builtin_clz` is a GCC builtin as well. See [Other Built-in Functions Provided by GCC](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html).
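As a side note, on RISC-V with the "A" extension these builtins are typically lowered to single AMO instructions (`amoor.w` / `amoand.w`), so no LR/SC retry loop is needed for simple bit updates. The following stand-alone check of the bit-set/bit-clear behavior is an illustrative sketch (the waiter bitmap and task ID are made up; it uses the raw builtins so it also compiles on a host):

```c
#include <assert.h>

int main(void)
{
    int waiters = 0;      /* plays the role of mtx->waiters */
    int id = 1 << 3;      /* hypothetical task ID 3 */

    /* atomic_or(): typically a single amoor.w on RV32IMA; returns old value */
    int old = __atomic_fetch_or(&waiters, id, __ATOMIC_SEQ_CST);
    assert(old == 0 && waiters == (1 << 3));

    /* atomic_clear_bits(): typically a single amoand.w with the inverted mask */
    __atomic_fetch_and(&waiters, ~id, __ATOMIC_SEQ_CST);
    assert(waiters == 0);
    return 0;
}
```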
1. Let's consider the following code snippet that attempts to implement the mutex lock routine. Fill in C1 at Line 13 to make it function correctly.

```c=
void mutex_lock(struct mutex *mtx)
{
    uint32_t locked;
    uint32_t id = 1 << task_get_current();

    atomic_or(&mtx->waiters, id);

    while (1) {
        asm volatile(
            /* set lock value */
            "li %0, 2\n\t"
            /* attempt to acquire lock */
            "__C1__ %0, %0, %1\n\t"
            : "=&r"(locked), "+A"(mtx->lock));
        /* we got it ! */
        if (!locked)
            break;
        /* Contention on the mutex */
        /* Sleep waiting for our turn */
        task_wait_event_mask(TASK_EVENT_MUTEX, 0);
    }

    atomic_clear_bits(&mtx->waiters, id);
}
```

> * C01 = ?
> ==`amoswap.w.aq`== (Any answer containing `amoswap` with the `aq` suffix receives credit.)

2. Let's consider the following code snippet that attempts to implement the mutex unlock routine. Fill in C2 at Line 7 to make it function correctly.

```c=
void mutex_unlock(struct mutex *mtx)
{
    uint32_t waiters;
    task_ *tsk = current_task;

    /* give back the lock */
    asm volatile("__C2__ zero, zero, %0\n\t"
                 : "+A"(mtx->lock));

    waiters = mtx->waiters;
    while (waiters) {
        task_id_t id = __fls(waiters);
        waiters &= ~BIT(id);
        /* Somebody is waiting on the mutex */
        task_set_event(id, TASK_EVENT_MUTEX);
    }

    /* Ensure no event is remaining from mutex wake-up */
    atomic_clear_bits(&tsk->events, TASK_EVENT_MUTEX);
}
```

> * C2 = ?
> ==`amoswap.w.aqrl`== (Any answer containing `amoswap` with the `rl` suffix receives credit.)
> The code was taken from https://github.com/coreboot/chrome-ec/blob/main/core/riscv-rv32i/task.c

---

## Problem `D`

This set of problems centers on cache coherence. Crucial assumptions for these questions are:
* Processors A, B, and C each have their respective private caches.
* These private caches operate under the MOESI protocol and can snoop on other caches via a shared bus.
* The [MOESI protocol](https://en.wikipedia.org/wiki/MOESI_protocol), relevant to these questions, is depicted in the diagram below, showing the possible state transitions.

![MOESI](https://hackmd.io/_uploads/rkMCNQAUT.png)

For each subpart listed below, identify the initial and subsequent cache states that best represent the transition described in the provided scenario. The state options are: M (Modified), O (Owned), E (Exclusive), S (Shared), I (Invalid).

Note: Each subpart should be treated as a separate scenario. Assume that at the start of each subpart, location X is not initially in the cache of any of the processors. Your answers should be one of the characters from `MOESI`.
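Before working through the subparts, the following C sketch summarizes the MOESI transitions that these scenarios rely on. It is an illustrative encoding of the usual conventions (a line in M supplies data and moves to O on a remote read; any remote write invalidates the line), not code from the quiz, and it covers only the events used here:

```c
typedef enum { ST_M, ST_O, ST_E, ST_S, ST_I } moesi_t;

/* State of THIS cache when the local processor accesses the line. */
static moesi_t local_read(moesi_t cur, int other_copies_exist)
{
    if (cur == ST_I)                           /* read miss */
        return other_copies_exist ? ST_S : ST_E;
    return cur;                                /* M, O, E, S: hit, unchanged */
}

static moesi_t local_write(moesi_t cur)
{
    (void)cur;                                 /* I/S/O upgrade, E/M already writable */
    return ST_M;
}

/* State of THIS cache when it snoops another cache's transaction. */
static moesi_t snoop_read(moesi_t cur)
{
    if (cur == ST_M) return ST_O;              /* supply data, keep ownership */
    if (cur == ST_E) return ST_S;
    return cur;                                /* O, S, I unchanged */
}

static moesi_t snoop_write(moesi_t cur)
{
    (void)cur;
    return ST_I;                               /* a remote writer invalidates us */
}
```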
1. Processor B executes a write operation to location X, where initially, A's cache is in StateA1 and B's cache is in StateB1. Subsequently, when A performs a read operation from location X, A's cache changes to StateA2, while B's cache is then in StateB2.
    * This results in a transition for A's cache from StateA1 (D01) to StateA2 (D02).
    * The cache of processor B undergoes a transition from StateB1 (D03) to StateB2 (D04).
> * D01 = ?
> ==I==
> * D02 = ?
> ==S==
> * D03 = ?
> ==M==
> * D04 = ?
> ==O==

2. Processors A, B, and C each perform a read operation from location X. Following this, A executes a write operation to location X. Then, B reads from location X, after which C writes to it. Subsequently, A reads from location X again. At this juncture, the cache state of A is StateA1 (D05), B's cache is in StateB1 (D06), and C's cache is in StateC1 (D07).
> * D05 = ?
> ==S==
> * D06 = ?
> ==I==
> * D07 = ?
> ==O==

3. Processor B initiates a read from location X, whereupon A's cache is in StateA1 and B's cache is in StateB1. Following this, when A reads from location X, its cache transitions to StateA2 while B's cache changes to StateB2.
    * This marks a shift for A's cache from StateA1 (D08) to StateA2 (D09).
    * Concurrently, B's cache evolves from StateB1 (D10) to StateB2 (D11).
> * D08 = ?
> ==I==
> * D09 = ?
> ==S==
> * D10 = ?
> ==E==
> * D11 = ?
> ==S==

---

## Problem `E`

Examine these two RISC-V atomic instructions, which we are using to implement a locking mechanism for critical sections in our new interactive application. These instructions employ specific register(s) designated for holding the reservation flag, address, and the result of the store-conditional operation.

Consider the following RISC-V atomic instructions:
* `lr.w rd, rs1`
  - This instruction sets the value of register `R[rd]` to the memory content at `M[R[rs1]]`.
  - It also places a reservation on the memory location `M[R[rs1]]`.
* `sc.w rd, rs1, rs2`
  - If the memory location `M[R[rs1]]` holds a reservation, then `R[rd]` is set to 0, and the content of `M[R[rs1]]` is updated with the value from `R[rs2]`.
  - If there is no reservation, `R[rd]` is set to 1.

The first task is to create the `Exchange` function. This function uses the load-reserved and store-conditional synchronization primitives to atomically exchange the values of `Mem[a0]` and `a1`. Complete the routine by inserting the appropriate instruction into the first blank, labeled `E01`, and then fill the second blank, labeled `E02`, with the instruction required there.

```c
# Input Parameters:
#   a0: Address in memory where the atomic exchange is to occur
#   a1: The value to be atomically stored at Mem[a0]
#
# Output:
#   a0: The value originally held at Mem[a0] before the atomic operation
Exchange:
    lr.w t0, a0
    __ E01 __
    __ E02 __
    mv a0, t0
    ret
```

> * E01 = ?
> ==`sc.w t1, a0, a1`==
> * E02 = ?
> ==`bnez t1, Exchange`==

Then, consider a lock function for critical sections using the newly implemented atomic exchange synchronization function. Here is how it is structured:

```c
# Input Parameters:
#   a0: Address in memory where the lock is located
Lock:
    addi sp, sp, -8
    sw ra, 0(sp)
    sw s0, 4(sp)
    mv s0, a0
Spin:
    mv a0, s0
    li a1, 1
    jal ra, Exchange
    bnez a0, Spin
    lw ra, 0(sp)
    lw s0, 4(sp)
    addi sp, sp, 8
    ret
```

Upon observing performance degradation in parts of the code where numerous threads vie for the same lock, we realize that this might be linked to our system's MSI (Modified, Shared, Invalid) cache coherence protocol. Under the MSI protocol, every processor issues write requests while executing the `Exchange` function, leading to mutual invalidation in the spin loop. What is the impact?

> * E03 = ?
> ==It degrades performance.== (Any answer noting that overall system performance suffers receives credit.)
> This results in increased bus traffic and, consequently, a decline in overall performance.

To keep our new application performing well on platforms that use the MSI coherence protocol, we revise the `Lock` function to avoid the issue identified above. The new structure aims to reduce cache invalidations and bus traffic while trying to acquire the lock, enhancing performance on MSI coherence platforms. Here is the updated function, with placeholders for the necessary adjustments:

```c
# Arguments:
#   a0: The memory address of the lock
Lock:
    addi sp, sp, -8
    sw ra, 0(sp)
    sw s0, 4(sp)
    mv s0, a0
Spin1:
    mv a0, s0
    li a1, 1
Spin2:
    lw t0, 0(s0)
    __ E04 __
    __ E05 __
    bnez a0, Spin1
    lw ra, 0(sp)
    lw s0, 4(sp)
    addi sp, sp, 8
    ret
```

> * E04 = ?
> ==`bnez t0, Spin2`==
> * E05 = ?
> ==`jal ra, Exchange`==
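The revised routine is the classic test-and-test-and-set pattern: spin on ordinary loads, which keep the lock's cache line in the Shared state, and attempt the atomic exchange only once the lock appears free. A rough C equivalent using the GCC `__atomic` builtins, offered as an illustrative sketch rather than the quiz's reference code:

```c
#include <stdint.h>

/* Test-and-set: every acquisition attempt is an atomic write, so
 * contending cores keep invalidating each other's copy of the line
 * (this corresponds to the first Lock/Spin loop). */
static void lock_tas(volatile uint32_t *lock)
{
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
        ;
}

/* Test-and-test-and-set: spin with plain loads (read-only sharing) and
 * issue the atomic exchange only when the lock looks free
 * (this corresponds to the revised Spin1/Spin2 structure). */
static void lock_ttas(volatile uint32_t *lock)
{
    for (;;) {
        while (__atomic_load_n(lock, __ATOMIC_RELAXED))   /* Spin2 */
            ;
        if (!__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE)) /* Exchange */
            return;                                        /* acquired */
    }
}

static void unlock(volatile uint32_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```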
---

## Problem `F`

To optimize Load-Reserved/Store-Conditional (LR/SC) operations in scenarios with many readers and few writers of a shared variable, we suggest the following changes to the cache-based approach, within an MSI coherence framework:
* Initially, the LR operation brings the cache line into the shared state rather than the modified state.
* The reservation set by LR is canceled if the line is invalidated, or if the same address is modified by another store from the local processor before the SC executes.
* If the reservation remains intact, the SC operation tries to upgrade the line from shared to modified. The SC succeeds if it completes the coherence transaction without any intervening invalidation.

By contrast, in the original implementation the reservation is canceled whenever the line leaves the modified state, which can happen when another sharer performs a regular load on the line. The proposed modification lets readers leave the reservation intact until the SC executes, thereby increasing the likelihood that the SC succeeds.

1. In a scenario where the LR/SC instruction sequence is restricted such that there are no other load or store operations between the LR and the SC, and the shared variable occupies its own cache line without co-locating with other variables, does this modification guarantee prevention of [livelock](https://en.wikipedia.org/wiki/Deadlock#Livelock)? Answer Yes or No.
> * F01 = ?
> ==Yes==
> If a Store-Conditional (SC) operation fails because of invalidation, it implies that another SC or store operation has succeeded. The restrictions on the Load-Reserved/Store-Conditional (LR/SC) sequence and the exclusive cache line allocation for the variable rule out voluntary eviction and false sharing as causes of line invalidation and reservation clearance. Note that this question concerns the absence of livelock, not a guarantee that every individual thread will always succeed in its SC operation; contention may still delay some threads. What matters is that the system as a whole keeps making progress, even if some threads do not complete their SC operations successfully.

2. We aim to eliminate the additional coherence traffic required to upgrade the cache line from the shared state to the modified state, especially when no other readers access the line in the interval between the LR and SC operations. Is it feasible to implement a similar optimization in a system using the MESI coherence protocol? Answer Yes or No.
> * F02 = ?
> ==Yes==
> In this variant, the LR operation brings the cache line in the exclusive state. The reservation is not canceled when the line is downgraded from exclusive to shared. The SC succeeds under two conditions: either the line has remained continuously in the exclusive state, in which case it transitions silently to modified, or it upgrades from shared to modified without any intervening invalidation. While this approach reintroduces the potential for livelock, it generally improves forward progress compared to the standard MESI-based approach discussed in lectures. In the given context, with relatively few writers (using LR) and many readers (using regular loads), the impact is smaller because readers no longer clear the reservations.
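For context, the code that depends on these reservation rules is the usual LR/SC retry loop. In C it is typically written as a compare-and-swap loop, which GCC generally lowers to an `lr.w`/`sc.w` sequence on RISC-V when no single AMO instruction fits. The sketch below is illustrative only (the function name and the plain counter are not from the quiz):

```c
#include <stdint.h>

/* Atomically add 'delta' to *counter and return the old value.
 * The CAS below is normally compiled to an lr.w / sc.w retry loop,
 * which is exactly the pattern whose reservation behavior Problem F
 * modifies: nothing else is loaded or stored between the "LR" and "SC". */
static uint32_t fetch_add_casloop(uint32_t *counter, uint32_t delta)
{
    uint32_t old = __atomic_load_n(counter, __ATOMIC_RELAXED);
    uint32_t desired;

    do {
        desired = old + delta;
    } while (!__atomic_compare_exchange_n(counter, &old, desired,
                                          /* weak = */ 1,
                                          __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
    return old;
}
```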
---

## Problem `G`

Assume that we have a VIPT (Virtually-Indexed, Physically-Tagged) cache implementation, which is a 4-way set-associative cache. The cache input address is partitioned as follows:
* Tag bits: [31:11]
* Index bits: [10:6]
* Line offset: [5:0]

For the virtual address, the partitioning is as follows:
* VPN (Virtual Page Number) bits: [31:9]
* Page offset: [8:0]

Below is a description of a basic VIPT cache implementation along with its associated TLB, both of which are accessed simultaneously. Note that in a VIPT cache, the tag check is performed against the Physical Page Number (PPN), which is obtained either from the TLB or through a page table walk.

![4-way Fully Set-Associative TLB Implementation](https://hackmd.io/_uploads/r16XGICLT.png)
> $\uparrow$ 4-way Fully Set-Associative TLB Implementation

![4-way Set-Associative VIPT Cache Implementation](https://hackmd.io/_uploads/HJCPz8R8p.png)
> $\uparrow$ 4-way Set-Associative VIPT Cache Implementation

For simplicity, assume that the TLB always hits and there are no page faults, i.e., the Physical Page Number (PPN) is always valid in a Page Table Entry (PTE). As depicted in the diagrams, the TLB access and the cache lookup occur concurrently. The "TLB Hit" signal is then used to determine the validity of the cache entry. Using the block delay expressions for each component given below, and referring to the diagrams of the VIPT cache and TLB, answer the following questions. ceil(X) yields the smallest integer greater than or equal to X.

| Component | Delay equation (ps) |
|:----------|:-------------------:|
| AND Gate | 40 |
| OR Gate | 50 |
| Output-data/buffer driver | 180 |
| N-to-1 MUX | 50 $\times$ ceil($\log_2{N}$) + 100 |
| Comparator | 30 $\times$ (num tag bits) + 70 |
| Memory array | 30 $\times$ ceil($\log_2{number\_of\_sets}$) + 30 $\times$ ceil($\log_2{number\_bits\_per\_set}$) + 100 |
| Decoder | 30 $\times$ (number\_index\_bits) + 80 |

1. Among the three specified paths, identify the critical path. It is not necessary to consider other paths for determining the critical path in this context, even though the ones listed here do not encompass all possible options. The paths are: (select one)
    * `(a)` Cache index -> Output data
    * `(b)` VPN tag check -> Output data
    * `(c)` VPN tag check -> Cache hit

> * G01 = ?
> ==b==
> Longest path from the VPN tag check to the output data of the data cache: VPN -> TLB comparator (30 x 23 bits + 70 = 760) -> AND gate (40) -> buffer driver (180) -> cache tag comparator (30 x 21 bits + 70 = 700) -> AND gate (40) -> 2 x buffer driver (2 x 180) -> data output. Total: 2080 ps.
> Longest path from the input-address index to the output data of the data cache: index -> tag decoder (30 x 5 bits + 80 = 230) -> memory array (30 x 5 + 30 x 7 + 100 = 460) -> cache tag comparator (30 x 21 + 70 = 700) -> AND gate (40) -> 2 x buffer driver (2 x 180) -> data output. Total: 1790 ps.
> Longest path from the VPN tag check to the cache-hit signal: VPN -> TLB comparator (30 x 23 bits + 70 = 760) -> AND gate (40) -> buffer driver (180) -> cache tag comparator (30 x 21 bits + 70 = 700) -> AND gate (40) -> OR gate (50) -> AND gate (40) -> cache hit. Total: 1810 ps.
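A quick arithmetic cross-check of the three paths, in plain C, using the delay equations from the table and the path compositions given in the solution above:

```c
#include <stdio.h>

/* Delay equations from the table (picoseconds). */
#define AND_GATE 40
#define OR_GATE  50
#define BUF      180
#define COMPARATOR(tag_bits) (30 * (tag_bits) + 70)
#define DECODER(index_bits)  (30 * (index_bits) + 80)
/* 32 sets -> ceil(log2) = 5; the solution uses ceil(log2(bits per set)) = 7 */
#define MEM_ARRAY (30 * 5 + 30 * 7 + 100)

int main(void)
{
    int tlb_cmp = COMPARATOR(23);   /* VPN tag:   bits [31:9]  -> 23 bits */
    int cache_cmp = COMPARATOR(21); /* cache tag: bits [31:11] -> 21 bits */

    int path_a = DECODER(5) + MEM_ARRAY + cache_cmp + AND_GATE + 2 * BUF;
    int path_b = tlb_cmp + AND_GATE + BUF + cache_cmp + AND_GATE + 2 * BUF;
    int path_c = tlb_cmp + AND_GATE + BUF + cache_cmp + AND_GATE + OR_GATE + AND_GATE;

    printf("(a) index -> data: %d ps\n", path_a); /* 1790 */
    printf("(b) VPN   -> data: %d ps\n", path_b); /* 2080 */
    printf("(c) VPN   -> hit : %d ps\n", path_c); /* 1810 */
    return 0;
}
```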
---

## Problem `H`

Translate the following loop into vector assembly code, using the RISC-V vector instruction set architecture (ISA):
* Arrays `A` and `B` consist of 64-bit integers and are non-overlapping.
* The value of `N` is provided in the register `a0`.
* The base addresses for arrays `A` and `B` are supplied in registers `a1` and `a2`, respectively.

- [ ] C source
```c
for (i = 0; i < N; i++)
    A[i * 3 + 2] = A[i * 3] + (A[i * 3 + 1] * B[i]);
```

- [ ] RISC-V Assembly with vector extension
```c
    addi t1, x0, 24          # Set the stride length in bytes (3 elements x 8 bytes)
loop:
    vsetvli t0, a0, e64, m8  # Set VL, SEW=64, LMUL=8 for vector operation
    addi a3, a1, 8           # Calculate address for &A[3*i+1]
    addi a4, a1, 16          # Calculate address for &A[3*i+2]
    vlse.v v0, (a1), t1      # Strided load of A[3*i]
    vlse.v v8, (a3), t1      # Strided load of A[3*i+1]
    __ H01 __                # Load values from B[i]
    vmadd.vv v8, v16, v0     # Compute v8 = (v8 * v16) + v0
    __ H02 __                # Store results to A[3*i+2]
    sub a0, a0, t0           # Decrease the remaining element count by VL
    slli t2, t0, 4           # t2 = VL * 16 bytes (two of the three 8-byte elements per i)
    __ H03 __                # t0 = VL * 8 bytes
    add t2, t2, t0           # t2 = VL * 24 bytes, the offset for 3*VL elements of A
    add a2, a2, t0           # Update pointer B to the next set of elements
    __ H04 __                # Update pointer A to the next set of elements
    bnez a0, loop            # If there are more elements, repeat the loop
```

Fill in H01 to H04 to make the above work as expected.

> * H01 = ?
> ==`vle.v v16, (a2)`==
> * H02 = ?
> ==`vsse.v v8, (a4), t1`== (a strided *store* is required here: after `vmadd.vv`, v8 holds the results destined for `A[3*i+2]`)
> * H03 = ?
> ==`slli t0, t0, 3`==
> * H04 = ?
> ==`add a1, a1, t2`==

---

## Problem `I`

Consider the following code, which performs an in-place slide operation, moving non-zero elements of array `A` forward by a distance `M`. For parallel execution on a multi-threaded processor, the loop is divided so that each thread handles the iterations where (`i % T`) equals TID. Here, `T` is the total number of threads and TID is a unique identifier for each thread, ranging from 0 to $T-1$.

```c
for (i = 0; i < N; i++) {
    if (A[i + M])
        A[i] = A[i + M];
}
```

N and M are arbitrary integers, with N greater than both M and T. The code runs on a multi-threaded in-order core that has no data cache and features perfect branch prediction with no penalty for either taken or not-taken branches. There is also no overhead associated with threading.
* The latency of a main memory access is 60 cycles.
* Once the processor issues a load, it can continue with other instructions until it encounters an instruction that depends on the loaded value.
* The latency of integer arithmetic operations is 1 cycle.

Create the assembly code to be executed by each thread, treating the elements of array A as 32-bit integers.
You are allowed to use any available registers, such as t1-t6. The parameter registers are:
* Array (A) address: passed in a0
* Displacement (M): passed in a1
* Total number of threads (T): passed in a2
* Thread ID (TID): passed in a3

```c
    # Initialize the upper boundary for the loop (t0 = A + N)
    addi t0, a0, N*4
    # Convert M, T, and TID from element counts to byte offsets
    slli a1, a1, 2           # Scale M to bytes
    slli a2, a2, 2           # Scale T to bytes
    slli a3, a3, 2           # Scale TID to bytes
    # Adjust the starting pointer of array A based on the thread ID
    add a0, a0, a3
loop:
    # Calculate the address of the element to load and load it into t1
    add t1, a0, a1
    __ I01 __
    # If the loaded value is zero, skip the store operation
    __ I02 __
    # Store the non-zero value back into the array at the adjusted position
    sw t1, 0(a0)
    # Label for skipping the store operation
skip:
    # Move to the next element for this thread
    add a0, a0, a2
    # If the end of the array has not been reached, continue looping
    __ I03 __
```

This code ensures that each thread works on distinct elements of the array, determined by its thread ID, while moving non-zero elements forward by the displacement `M`.

> * I01 = ?
> ==`lw t1, 0(t1)`==
> * I02 = ?
> ==`beqz t1, skip`==
> * I03 = ?
> ==`bltu a0, t0, loop`==

Now imagine that thread switching occurs every cycle, following a fixed round-robin schedule. Whenever a thread is not ready to execute during its turn, the pipeline inserts a bubble (idle cycle). Assume M is 90 and N is substantially large, and note that keeping the threads scheduled in ascending TID order avoids any memory dependency issues. What is the minimum number of threads needed to always fully utilize the pipeline while maintaining correct execution?

> * I04 = ?
> ==60==
> The instruction that consumes the loaded value (`beqz`) immediately follows the load. With round-robin scheduling, a thread's next instruction issues T cycles after its previous one, so at least 60 threads are required to hide the 60-cycle memory latency.
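For reference, the per-thread partitioning that this assembly implements corresponds to the following C sketch (illustrative only; the function name is hypothetical):

```c
#include <stdint.h>

/* One thread's share of the slide loop: iterations i = TID, TID + T,
 * TID + 2*T, ...  This mirrors the assembly above, where a0 starts at
 * A + 4*TID and advances by 4*T bytes each iteration. */
static void slide_worker(int32_t *A, long N, long M, long T, long TID)
{
    for (long i = TID; i < N; i += T) {
        int32_t v = A[i + M];   /* the lw whose 60-cycle latency must be hidden */
        if (v)                  /* beqz t1, skip */
            A[i] = v;           /* sw t1, 0(a0)  */
    }
}
```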
---

## Problem `J`

A Virtual Machine Monitor (VMM, or hypervisor) operates multiple guest operating systems on a single host machine. These guest OSs function in user (unprivileged) mode, while the VMM operates in supervisor (privileged) mode. Each guest virtual machine's OS handles its own page tables, representing the link between its guest virtual address space and guest physical address space (termed the "virtual-to-real" mapping). Subsequently, these guest physical addresses need to be mapped onto the host physical addresses. To efficiently utilize the existing hardware Translation Lookaside Buffer (TLB), the VMM creates and manages shadow page tables. These shadow tables directly map the guest virtual address space to the host physical address space (known as the "virtual-to-physical" mapping). While the guest OS runs in user mode, the VMM adjusts the hardware's page table base pointer to point at the shadow page table. This setup allows the TLB to function seamlessly, as though there were no virtualization involved.

![vmm](https://hackmd.io/_uploads/S11UCLC86.png)

1. Assuming both the guest and host machines employ three-level page tables and the host features a hardware-refilled TLB, consider the scenario when the guest OS is operational. In this case, if a TLB access requires 1 cycle and each memory access incurs a latency of 50 cycles, what would be the latency experienced during a TLB miss? __ J01 __ cycles
> * J01 = ?
> ==151==
> 1 + 50 + 50 + 50 = 151 cycles
> A hardware TLB refill involves a page table walk of the shadow page tables in memory, the same as if there were no virtualization in effect. Note that the guest page tables are not directly involved.
> The TLB miss latency must also include the cost of the initial TLB lookup (1 cycle).

2. To facilitate a guest virtual machine with an 8 KiB page size on a host machine that uses a 4 KiB page size, especially for adequate support of I/O operations like DMA requests, what is the required number of host pages that must be physically contiguous?
> * J02 = ?
> ==2==
> Each guest page is mapped to two host pages; a leaf entry in the guest page table corresponds to two consecutive entries in the shadow page table with the same permissions. To properly support I/O (i.e., DMA requests), the two host pages should be physically contiguous.
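A small arithmetic sketch of both answers in plain C (illustrative only; the walk depth and latencies come from the problem statement):

```c
#include <stdio.h>

int main(void)
{
    /* J01: a hardware refill walks the 3-level shadow page table. */
    int tlb_lookup = 1, mem_latency = 50, walk_levels = 3;
    int tlb_miss_latency = tlb_lookup + walk_levels * mem_latency;
    printf("TLB miss latency: %d cycles\n", tlb_miss_latency);   /* 151 */

    /* J02: one 8 KiB guest page must be backed by contiguous 4 KiB host pages. */
    int guest_page = 8 * 1024, host_page = 4 * 1024;
    printf("Contiguous host pages per guest page: %d\n",
           guest_page / host_page);                              /* 2 */
    return 0;
}
```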