Solutions
A
Recall the tasks you completed in Homework 3 and answer the following questions based on that work.
Please provide a summary of your work, including the accomplishments you achieved, in English.
- A01 = ?
(Credit is given for any description; answers must be in English.)
Describe the main challenge you encountered while working on this homework and explain how you managed to overcome it, in English.
- A02 = ?
(Credit is given for any description; answers must be in English.)
B
We aim to enhance the efficiency of a program designed to calculate the total of all elements in a 64x64 element matrix M. This matrix is arranged in memory following the row-major order. It is presumed that the variable sum
is located in a shared memory area. We are evaluating two different code segments for this purpose:
Both programs operate on a direct-mapped cache that has a capacity of 64 words and a block size of 8 words. Additionally, the data cache is distinct and separate from the instruction cache.
To reduce the overall number of data cache misses, which program is preferable? Program II is preferable: a cache line contains array elements that share the same value of i and have sequential values of j. Because Program II sequentially accesses the elements in a row of M with incrementing values of j, it incurs only one cache miss for every 8 accesses, as a new line is brought in after every eighth access.
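The two listings are not reproduced here, so the following C sketch only illustrates the access-order difference they are assumed to embody: Program I is assumed to walk M column by column, while Program II (as described above) walks it row by row.

```c
/* Minimal sketch of the two traversal orders being compared. Program I's column-major
 * order is an assumption; Program II's row-major order follows the description above.
 * M is stored in row-major order, one element per word. */
#define N 64
static int M[N][N];

long sum_program1(void)            /* assumed Program I: column-major traversal */
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += M[i][j];        /* consecutive accesses are 64 elements apart, so almost every access misses */
    return sum;
}

long sum_program2(void)            /* Program II: row-major traversal */
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += M[i][j];        /* 8 consecutive accesses fall in the same 8-word block */
    return sum;
}
```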
How many total data cache misses occur during the execution of Program II?
- B01 = ?
512 (64 x 64 = 4096 element accesses, and one miss brings in a block of 8 elements, so 4096 / 8 = 512 misses)
Consider running Program II on a multi-threaded processor equipped with 8 threads. What resources are shared among these threads?
- B02 = ?
sum (the name of the shared variable)
The variable sum is shared by all the threads.
(Continuing from the previous question) Do these shared resources lead to any issues? If so, show the proposed solutions.
- B03 = ?
Credit is given for any answer that mentions a semaphore or mutex.
Unless a semaphore (or mutex) is used to control updates to the sum variable, the threads' concurrent read-modify-write sequences can interleave, so the value of sum may be updated incorrectly.
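A minimal sketch of one such fix, assuming a POSIX-threads environment; the helper name add_partial and the partial-sum structure are illustrative, not part of the original homework code.

```c
#include <pthread.h>

/* The shared accumulator named in the question. */
static long sum;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each of the 8 threads calls this with its partial result; the mutex makes the
 * read-modify-write of sum atomic with respect to the other threads. */
void add_partial(long partial)
{
    pthread_mutex_lock(&sum_lock);
    sum += partial;
    pthread_mutex_unlock(&sum_lock);
}
```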
C
Assume that we are going to build a minimal operating system kernel on RISC-V with the "A" (atomic) extension. Critical defines and routines are listed below.
Both __atomic_fetch_or and __atomic_fetch_and are provided by the GCC atomics extension. __builtin_clz is a GCC builtin as well; see "Other Built-in Functions Provided by GCC".
- C01 = ?
amoswap.w.aq
(Credit is given as long as the answer contains amoswap and includes aq.)
- C02 = ?
amoswap.w.aqrl
(Credit is given as long as the answer contains amoswap and includes rl.)
The code was taken from https://github.com/coreboot/chrome-ec/blob/main/core/riscv-rv32i/task.c
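For context, here is a minimal spinlock sketch built on the same GCC atomics extension; it is illustrative and not the chrome-ec code itself. On RV32 with the A extension, GCC typically lowers the acquire exchange below to amoswap.w.aq (matching C01), while the release store is typically emitted as a fence followed by an ordinary store; the exact lowering depends on the compiler and the memory-order arguments.

```c
typedef volatile unsigned int spinlock_t;

static inline void spin_lock(spinlock_t *lock)
{
    /* Atomically swap in 1 with acquire ordering; keep retrying while the old value was 1. */
    while (__atomic_exchange_n(lock, 1u, __ATOMIC_ACQUIRE))
        ;
}

static inline void spin_unlock(spinlock_t *lock)
{
    /* Publish the unlock with release ordering so earlier critical-section writes are visible first. */
    __atomic_store_n(lock, 0u, __ATOMIC_RELEASE);
}
```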
D
This set of problems centers on cache coherence. Crucial assumptions for these questions are:
For each subpart listed below, identify the initial and subsequent cache states that best represent the transition described in the provided scenario. The state options are as follows: M (Modified), O (Owned), E (Exclusive), S (Shared), I (Invalid).
Note: Each subpart should be treated as a separate scenario. Assume that at the start of each subpart, location X is not initially in the cache of either processor A or B. Your answers should be one of the characters from MOESI.
Processor B executes a write operation to location X, where initially, A's cache is in StateA1 and B's cache is in StateB1. Subsequently, when A performs a read operation from location X, A's cache changes to StateA2, while B's cache is then in StateB2.
- D01 = ?
I
- D02 = ?
S
- D03 = ?
M
- D04 = ?
O
Processors A, B, and C each perform a read operation from location X. Following this, A executes a write operation to location X. Then, B reads from location X, after which C writes to it. Subsequently, A reads from location X again. At this juncture, the cache state of A is StateA1 (D05), B's cache is in StateB1 (D06), and C's cache is in StateC1 (D07).
- D05 = ?
S
- D06 = ?
I
- D07 = ?
O
After A's final read, A refills the line from C's modified copy, so A holds it in S, C moves from M to O (supplying the data while retaining ownership), and B is still I from C's earlier write.
Processor B initiates a read from location X, whereupon A's cache is in StateA1 and B's cache is in StateB1. Following this, when A reads from location X, its cache transitions to StateA2 while B's cache changes to StateB2.
- D08 = ?
I
- D09 = ?
S
- D10 = ?
E
- D11 = ?
S
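As a cross-check, here is a small C sketch of the MOESI transitions assumed by these answers; the transition rules encoded below (a reader of a Modified line moves the owner to O, a lone reader fills in E, a writer invalidates all other copies) are the standard textbook ones, and the driver replays the D05-D07 scenario.

```c
#include <stdio.h>

typedef enum { M_ST, O_ST, E_ST, S_ST, I_ST } state_t;
static const char state_name[] = "MOESI";

#define NCPU 3                       /* processors A, B, C */
static state_t st[NCPU];

static int other_has_copy(int me)
{
    for (int i = 0; i < NCPU; i++)
        if (i != me && st[i] != I_ST)
            return 1;
    return 0;
}

static void read_op(int me)
{
    if (st[me] != I_ST)              /* read hit: no bus transaction, nothing changes */
        return;
    st[me] = other_has_copy(me) ? S_ST : E_ST;
    for (int i = 0; i < NCPU; i++) {
        if (i == me) continue;
        if (st[i] == M_ST) st[i] = O_ST;       /* owner supplies the data, keeps ownership */
        else if (st[i] == E_ST) st[i] = S_ST;  /* no longer exclusive */
    }
}

static void write_op(int me)
{
    st[me] = M_ST;                   /* the writer ends up with the only valid copy */
    for (int i = 0; i < NCPU; i++)
        if (i != me) st[i] = I_ST;   /* every other copy is invalidated */
}

int main(void)
{
    for (int i = 0; i < NCPU; i++)   /* location X starts out Invalid everywhere */
        st[i] = I_ST;

    /* D05-D07 scenario: A, B, C read X; A writes; B reads; C writes; A reads. */
    read_op(0); read_op(1); read_op(2);
    write_op(0);
    read_op(1);
    write_op(2);
    read_op(0);

    printf("A=%c B=%c C=%c\n", state_name[st[0]], state_name[st[1]], state_name[st[2]]);
    /* prints: A=S B=I C=O */
    return 0;
}
```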
E
Examine these two RISC-V atomic instructions, which we are utilizing to implement a locking mechanism for critical sections within our new interactive application. These instructions employ specific register(s) designated for holding the reservation flag, address, and the result of the store-conditional operation.
Consider the following RISC-V atomic instructions:
- lr.w rd, rs1: Sets R[rd] to the memory content at M[R[rs1]] and places a reservation on M[R[rs1]].
- sc.w rd, rs1, rs2: If M[R[rs1]] still holds a reservation, then R[rd] is set to 0 and the content of M[R[rs1]] is updated with the value from R[rs2]; otherwise, R[rd] is set to 1.

The initial task involves creating the Exchange function. This function leverages the load-reserved and store-conditional synchronization primitives to atomically exchange the values of Mem[a0] and a1. Complete the process by inserting the appropriate instruction into the first blank, labeled E01, and then fill the second blank, labeled E02, with the corresponding instruction required there.
- E01 = ?
sc.w t1, a0, a1
- E02 = ?
bnez t1, Exchange
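A hedged sketch of the complete Exchange routine written as GNU C inline assembly (the original is a bare assembly listing, not reproduced here). Note that the sketch uses the standard assembler operand order sc.w rd, rs2, (rs1), whereas the problem statement writes the operands as rd, rs1, rs2; the function name exchange and its C signature are illustrative.

```c
/* Atomically swap *addr and val, returning the old value of *addr (RV32A assumed). */
static inline int exchange(volatile int *addr, int val)
{
    int old, fail;
    __asm__ volatile (
        "1: lr.w  %0, (%2)\n"      /* old = Mem[addr], place a reservation           */
        "   sc.w  %1, %3, (%2)\n"  /* try Mem[addr] = val; %1 == 0 only on success   */
        "   bnez  %1, 1b\n"        /* reservation was lost: retry the whole exchange */
        : "=&r"(old), "=&r"(fail)
        : "r"(addr), "r"(val)
        : "memory");
    return old;
}
```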
Then, consider a lock function for critical sections using the newly implemented atomic exchange synchronization function. Here is how it is structured:
Upon observing performance degradation in parts of the code where numerous threads vie for the same lock, we realize that this might be linked to our system's MSI (Modified, Shared, Invalid) cache coherence protocol. In the context of the MSI cache coherence protocol, every processor initiates write requests while executing the Exchange
function, leading to mutual invalidation in the spin loop. What is the impact?
- E03 = ?
degrade performance (Credit is given for any answer mentioning that system performance degrades.)
This results in increased bus traffic and, consequently, a decline in overall performance.
To ensure our new application's compatibility with platforms using the MSI coherence protocol, we revise the Lock
function to circumvent the previously identified issue. This new structure aims to reduce cache invalidations and bus traffic while trying to acquire the lock, enhancing performance on MSI coherence platforms. Here is how we can structure the updated function, with placeholders for the necessary adjustments:
- E04 = ?
bnez t0, Spin2
- E05 = ?
jal ra, Exchange
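The revised Lock is essentially a test-and-test-and-set: Spin2 spins on an ordinary load, which keeps the line in the Shared state without generating invalidations, and only falls through to the atomic Exchange once the lock looks free. A minimal C sketch of the same idea, reusing the hypothetical exchange helper sketched above:

```c
static volatile int lock_var;    /* 0 = free, 1 = held; name is illustrative */

void lock(void)
{
    for (;;) {
        /* Spin on a plain load: the line stays Shared, so waiting cores generate
         * no invalidations or extra bus traffic while the lock is held. */
        while (lock_var != 0)
            ;
        /* Lock looks free: now attempt the atomic exchange (a write). Another core
         * may win the race, in which case we go back to the read-only spin. */
        if (exchange(&lock_var, 1) == 0)
            return;
    }
}

void unlock(void)
{
    lock_var = 0;                /* real code would add release ordering here */
}
```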
F
For optimizing Load-Reserved/Store-Conditional (LR/SC) operations in scenarios with numerous readers and fewer writers of a shared variable, we suggest certain alterations to the cache-based approach, within an MSI coherence framework:
In contrast, in the original implementation the reservation is cleared whenever the line leaves the Modified state, which can happen when another sharer performs a regular load on the line. The proposed modification lets readers leave the reservation intact until the SC executes, increasing the likelihood that the SC succeeds.
In a scenario where the LR/SC instruction sequence is restricted such that there are no other load or store operations between the LR and SC, and the shared variable is isolated in its own cache line without co-locating with other variables, does this modification guarantee prevention of livelock? Answer in Yes or No.
- F01 = ?
Yes
If a Store-Conditional (SC) operation fails because of invalidation, it implies that another SC or store operation has been successful. The limitations imposed on the Load-Reserved/Store-Conditional (LR/SC) sequence and the exclusive cache line allocation for the variable preclude voluntary eviction and false sharing as causes for line invalidation and reservation clearance. It's important to note that this query focuses on the absence of livelock, not on ensuring that every individual thread will always succeed in its SC operation due to possible contention. What matters is that the system, in its entirety, continues to make progress, even if some threads may not complete their SC operations successfully.
We aim to eliminate the additional coherence traffic required to transition the cache line from the shared state to the modified state, especially when there are no other readers accessing the line in the interval between the LR and SC operations. Is it feasible to implement a similar optimization in a system using the MESI coherence protocol? Answer in Yes or No.
- F02 = ?
Yes
In this modification, the LR operation secures the cache line in an exclusive state. The reservation is not canceled when the line is downgraded from exclusive to shared. The SC is successful only under two conditions: if the line has remained continuously in the exclusive state, which then gets seamlessly transitioned to modified, or if it transitions from shared to modified without any disruption caused by invalidation. While this approach does reintroduce the potential for livelock, it generally enhances forward progress compared to the standard MESI-based approach discussed in lectures. Given the context, where there are relatively fewer writers (utilizing LR) and more readers (using regular loads), the impact is less significant because readers no longer disrupt the reservations.
G
Assume that we have a VIPT (Virtually-Indexed, Physically-Tagged) cache implementation, which is a 4-way set associative cache. This cache uses the following address partitioning scheme for both its cache input and virtual address:
For the virtual address, the partitioning is as follows:
Here is a description of a basic VIPT cache implementation along with its associated TLB, both of which are accessed simultaneously. It's important to note that in a VIPT cache, the verification of the tag is conducted using the Physical Page Number (PPN), which is acquired either from the TLB or through a page table walk.
4-way Fully Set-Associative TLB Implementation
4-way Set-Associative VIPT Cache Implementation
For ease of understanding, let's assume that the TLB consistently hits and there are no page faults, meaning the Physical Page Number (PPN) is always valid in a Page Table Entry (PTE). As depicted in the provided diagrams, the TLB access and the cache lookup occur concurrently. The outcome, indicated by the "TLB Hit" signal, is then utilized to ascertain the validity of the cache entry. Considering the block delay expressions for each component, as detailed below, and referring to the diagrams of the VIPT cache and TLB, please proceed with the following questions. Bear in mind that the ceil(X) function yields the smallest integer greater than or equal to X.
| Component | Delay equation (ps) |
|---|---|
| AND Gate | 40 |
| OR Gate | 50 |
| Output-data/buffer driver | 180 |
| N-to-1 MUX | 50 x ceil(log2(N)) + 100 |
| Comparator | 30 x (number of tag bits) + 70 |
| Memory array | 30 x ceil(log2(number of rows)) + 30 x ceil(log2(number of bits in a row)) + 100 |
| Decoder | 30 x (number of index bits) + 80 |
Which of the following paths has the longest delay (i.e., forms the critical path)?
(a) Cache index -> Output data
(b) VPN tag check -> Output data
(c) VPN tag check -> Cache hit
- G01 = ?
b
For the longest path from the VPN tag check to the output data in the data cache: VPN -> Comparator (30 x 23 bits + 70 = 760) -> AND Gate (40) -> Buffer Driver (180) -> Comparator (30 x 21 bits + 70 = 700) -> AND Gate (40) + 2 x Buffer Driver (2 x 180) -> Data Output (2080 ps)
For the longest path from the Input Address index to the output data in the data cache: Index -> Tag Decoder (30 x 5 bits + 80 = 230) -> Memory Array (30 x 5 + 30 x 7 + 100) -> Comparator (30 x 21 + 70) -> AND Gate (40) + 2 x Buffer Driver (2 x 180) -> Data Output (1790 ps)
For the longest path from the VPN tag check to the cache hit signal in the data cache: VPN -> Comparator (30 x 23 bits + 70 = 760) -> AND Gate (40) -> Buffer Driver (180) -> Comparator (30 x 21 bits + 70 = 700) -> AND Gate (40) -> OR Gate (50 ps) -> AND Gate (40 ps) -> Cache Hit (1810 ps)
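A small C sketch that recomputes the three path delays from the component equations; the bit widths (23-bit VPN comparator, 21-bit cache-tag comparator, 5 index bits, 7 as ceil(log2) of the bits in a memory-array row) are taken directly from the worked answers above.

```c
#include <stdio.h>

static int comparator(int tag_bits)  { return 30 * tag_bits + 70; }
static int decoder(int index_bits)   { return 30 * index_bits + 80; }
static int memory_array(int log_rows, int log_row_bits)
                                     { return 30 * log_rows + 30 * log_row_bits + 100; }

int main(void)
{
    const int and_gate = 40, or_gate = 50, buffer = 180;

    /* (a) cache index -> output data */
    int path_a = decoder(5) + memory_array(5, 7) + comparator(21) + and_gate + 2 * buffer;
    /* (b) VPN tag check -> output data */
    int path_b = comparator(23) + and_gate + buffer + comparator(21) + and_gate + 2 * buffer;
    /* (c) VPN tag check -> cache hit */
    int path_c = comparator(23) + and_gate + buffer + comparator(21) + and_gate + or_gate + and_gate;

    printf("a = %d ps, b = %d ps, c = %d ps\n", path_a, path_b, path_c);
    /* prints: a = 1790 ps, b = 2080 ps, c = 1810 ps, so (b) is the longest */
    return 0;
}
```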
H
Translate the following loop into vector assembly code, utilizing the RISC-V vector instruction set architecture (ISA):
- A and B consist of 64-bit integers and are non-overlapping.
- N is provided in register a0.
- The base addresses of A and B are supplied in registers a1 and a2, respectively.

Fill in H01 to H04 to make the above work as expected.
- H01 = ?
vle.v v16, (a2)
- H02 = ?
vlse.v v8, (a4), t1
- H03 = ?
slli t0, t0, 3
- H04 = ?
add a1, a1, t2
I
Reflect on the given code that executes an in-place slide operation, shifting non-zero elements in array A forward by a distance M. For parallel execution on a multi-threaded processor, imagine dividing the loop so that each thread handles iterations where (i % T) equals TID. Here, T represents the total number of threads and TID is a unique identifier for each thread, ranging from 0 to T-1.
N and M are arbitrary integers with N being greater than both M and T. The code runs on a multi-threaded in-order core which lacks a data cache and features flawless branch prediction without penalties for either taken or not-taken branches. Additionally, there's no overhead associated with threading.
Create the assembly code to be executed by each thread, treating the elements of array A as 32-bit integers. You are allowed to use any available registers, such as t1-t6. Here are the parameter registers:
This code ensures that each thread works on distinct elements of the array, determined by its thread ID, while moving non-zero elements forward by the displacement M.
- I01 = ?
lw t1, 0(t1)
- I02 = ?
beqz t1, skip
- I03 = ?
bltu a0, t0, loop
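A hedged C sketch of the per-thread decomposition only; the original loop body is not reproduced here, so the exact slide semantics below (copying a non-zero element from index i + M down to index i) are an assumption. What the sketch does illustrate is the interleaved partitioning in which thread TID handles exactly the iterations with i % T == TID.

```c
/* One thread's share of the (assumed) slide loop; A holds 32-bit integers. */
void slide_thread(int *A, int N, int M, int T, int TID)
{
    for (int i = TID; i < N - M; i += T) {   /* iterations with i % T == TID */
        int v = A[i + M];                    /* candidate element M positions away */
        if (v != 0)                          /* only non-zero elements are moved */
            A[i] = v;                        /* slide it forward by a distance M */
    }
}
```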
Imagine a scenario where thread switching occurs every cycle, adhering to a fixed round-robin scheduling sequence. In instances where a thread is not prepared to execute during its scheduled turn, the pipeline introduces a bubble (idle cycle). Considering this, determine the minimum number of threads required to ensure the pipeline is always fully utilized without compromising the correctness of the execution. Take into account that M is set to 90 and N is substantially large. It is also observed that maintaining a consistent ascending order of thread scheduling by TID would prevent any memory dependency issues. What is the minimum number of threads needed to always fully utilize the pipeline while maintaining correct execution?
- I04 = ?
60
At least 60 threads are required to hide a 60-cycle latency.
J
A Virtual Machine Monitor (VMM, or hypervisor) operates multiple guest operating systems on a single host machine. These guest OSs function in user (unprivileged) mode, while the VMM operates in supervisor (privileged) mode. Each guest virtual machine's OS handles its own page tables, representing the link between its guest virtual address space and guest physical address space (termed "virtual-to-real" mapping). Subsequently, these guest physical addresses need to be mapped onto the host physical addresses.
To efficiently utilize the existing hardware Translation Lookaside Buffer (TLB), the VMM creates and manages shadow page tables. These shadow tables directly map the guest virtual address space to the host physical address space (known as "virtual-to-physical" mapping). While the guest OS runs in user mode, the VMM adjusts the hardware's page table base pointer to align with the shadow page table. This setup allows the TLB to function seamlessly, as though there was no virtualization involved.
Assuming both the guest and host machines employ three-level page tables and the host features a hardware-refilled TLB, consider the scenario when the guest OS is operational. In this case, if a TLB access requires 1 cycle and each memory access incurs a latency of 50 cycles, what would be the latency experienced during a TLB miss? __ J01 __ cycles
- J01 = ?
151
1 + 50 + 50 + 50 = 151 cycles
A hardware TLB refill involves a page table walk of the shadow page tables in memory, same as if there were no virtualization in effect. Note that the guest page tables are not directly involved.
The TLB miss latency must also include the cost of the initial TLB lookup (1 cycle).
To facilitate a guest virtual machine with an 8 KiB page size on a host machine that uses a 4 KiB page size, especially for adequate support of I/O operations like DMA requests, what is the required number of host pages that must be physically contiguous?
- J02 = ?
2
Each guest page is mapped to two host pages; a leaf entry in the guest page table corresponds to two consecutive entries in the shadow page table with the same permissions. To properly support I/O (i.e., DMA requests), the two host pages should be physically contiguous.
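A minimal sketch of the expansion described above, under assumed field layouts; the structure name pte, the helper shadow_map_guest_page, and its parameters are illustrative, not a real hypervisor API.

```c
#define HOST_PAGE_SIZE   4096u   /* host uses 4 KiB pages  */
#define GUEST_PAGE_SIZE  8192u   /* guest uses 8 KiB pages */

struct pte {
    unsigned long ppn;           /* host physical page number */
    unsigned int  perms;         /* permission bits           */
};

/* One guest leaf mapping expands into GUEST_PAGE_SIZE / HOST_PAGE_SIZE (= 2) consecutive
 * shadow entries with the same permissions. host_base is the host-physical address of the
 * first of the two physically contiguous host pages backing the guest page. */
void shadow_map_guest_page(struct pte shadow[2], unsigned long host_base, unsigned int perms)
{
    for (unsigned int i = 0; i < GUEST_PAGE_SIZE / HOST_PAGE_SIZE; i++) {
        shadow[i].ppn   = (host_base + i * HOST_PAGE_SIZE) / HOST_PAGE_SIZE;
        shadow[i].perms = perms;
    }
}
```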