---
tags: computer-arch
---

# Quiz6 of Computer Architecture (2023 Fall)

> Solutions

## Problem `A`

Recall the tasks you completed in [Homework 3](https://hackmd.io/@sysprog/2023-arch-homework3), and answer the following questions based on that work.

1. Provide a summary of your work, including the accomplishments you achieved, in English.
> * A01 = ?
> (Any reasonable description receives credit; answer in English only.)

2. Describe the main challenge you encountered while working on this homework and explain how you managed to overcome it, in English.
> * A02 = ?
> (Any reasonable description receives credit; answer in English only.)

---

## Problem `B`

We aim to improve the efficiency of a program that computes the sum of all elements in a 64x64 matrix M. The matrix is laid out in memory in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order). Assume the variable `sum` resides in a shared memory area. We are evaluating two code segments for this purpose:

- [ ] Program I
```c
int sum = 0;
for (int j = 0; j < 8 * 8; j++) {
    for (int i = 0; i < 8 * 8; i++) {
        sum += M[i][j];
    }
}
```

- [ ] Program II
```c
int sum = 0;
for (int i = 0; i < 8 * 8; i++) {
    for (int j = 0; j < 8 * 8; j++) {
        sum += M[i][j];
    }
}
```

Both programs run on a direct-mapped data cache with a capacity of 64 words and a block size of 8 words. The data cache is separate from the instruction cache. To reduce the total number of data cache misses, Program II is preferable: a cache line holds array elements that share the same value of `i` and have consecutive values of `j`. Because Program II walks across a row of M with incrementing `j`, it incurs only one cache miss per 8 accesses, since a new line is brought in after every eighth access.

1. How many total data cache misses occur during the execution of Program II?
> * B01 = ?
> ==512==
> $64 \times 64 / 8 = 512$

2. Consider running Program II on a multi-threaded processor equipped with 8 threads. What resources are shared among these threads?
> * B02 = ?
> ==sum== (this is the variable name)
> The variable `sum` is shared by all the threads.

3. (Continuing from the previous question) Do these shared resources lead to any issues? If so, show the proposed solutions.
> * B03 = ?
> ==Any answer mentioning a semaphore or mutex receives credit.==
> Yes. Updating `sum` is a read-modify-write sequence, so concurrent updates from different threads can interleave and lose increments. Unless a semaphore (or mutex) is used to serialize updates to `sum`, the final value may be incorrect.

---

## Problem `C`

Assume that we are going to build a minimal operating system kernel on RISC-V with the "A" (atomic) extension. Critical defines and routines are listed below.

```c
typedef int atomic_t;
typedef int atomic_val_t;

struct mutex {
    uint32_t lock;
    atomic_t waiters;
};

/* find the most significant bit. Not defined for n == 0. */
#define __fls(n) (31 - __builtin_clz(n))

static inline atomic_val_t atomic_or(atomic_t *addr, atomic_val_t bits)
{
    return __atomic_fetch_or(addr, bits, __ATOMIC_SEQ_CST);
}

static inline atomic_val_t atomic_clear_bits(atomic_t *addr, atomic_val_t bits)
{
    return __atomic_fetch_and(addr, ~bits, __ATOMIC_SEQ_CST);
}
```

Both `__atomic_fetch_or` and `__atomic_fetch_and` are provided by the [GCC atomics extension](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). `__builtin_clz` is a GCC builtin as well. See [Other Built-in Functions Provided by GCC](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html).
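As a side note, on RISC-V with the "A" extension these builtins are typically lowered to single AMO instructions (`amoor.w` / `amoand.w`), so no LR/SC retry loop is needed for simple bit updates. The following stand-alone check of the bit-set/bit-clear behavior is an illustrative sketch (the waiter bitmap and task ID are made up; it uses the raw builtins so it also compiles on a host):

```c
#include <assert.h>

int main(void)
{
    int waiters = 0;      /* plays the role of mtx->waiters */
    int id = 1 << 3;      /* hypothetical task ID 3 */

    /* atomic_or(): typically a single amoor.w on RV32IMA; returns old value */
    int old = __atomic_fetch_or(&waiters, id, __ATOMIC_SEQ_CST);
    assert(old == 0 && waiters == (1 << 3));

    /* atomic_clear_bits(): typically a single amoand.w with the inverted mask */
    __atomic_fetch_and(&waiters, ~id, __ATOMIC_SEQ_CST);
    assert(waiters == 0);
    return 0;
}
```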
1. Let's consider the following code snippet that attempts to implement the mutex lock routine. Fill in C1 at Line 13 to make it function correctly.

```c=
void mutex_lock(struct mutex *mtx)
{
    uint32_t locked;
    uint32_t id = 1 << task_get_current();

    atomic_or(&mtx->waiters, id);

    while (1) {
        asm volatile(
            /* set lock value */
            "li %0, 2\n\t"
            /* attempt to acquire lock */
            "__C1__ %0, %0, %1\n\t"
            : "=&r"(locked), "+A"(mtx->lock));
        /* we got it ! */
        if (!locked)
            break;
        /* Contention on the mutex */
        /* Sleep waiting for our turn */
        task_wait_event_mask(TASK_EVENT_MUTEX, 0);
    }

    atomic_clear_bits(&mtx->waiters, id);
}
```

> * C01 = ?
> ==`amoswap.w.aq`== (Any answer containing `amoswap` with the `aq` suffix receives credit.)

2. Let's consider the following code snippet that attempts to implement the mutex unlock routine. Fill in C2 at Line 7 to make it function correctly.

```c=
void mutex_unlock(struct mutex *mtx)
{
    uint32_t waiters;
    task_ *tsk = current_task;

    /* give back the lock */
    asm volatile("__C2__ zero, zero, %0\n\t"
                 : "+A"(mtx->lock));

    waiters = mtx->waiters;
    while (waiters) {
        task_id_t id = __fls(waiters);
        waiters &= ~BIT(id);
        /* Somebody is waiting on the mutex */
        task_set_event(id, TASK_EVENT_MUTEX);
    }

    /* Ensure no event is remaining from mutex wake-up */
    atomic_clear_bits(&tsk->events, TASK_EVENT_MUTEX);
}
```

> * C2 = ?
> ==`amoswap.w.aqrl`== (Any answer containing `amoswap` with the `rl` suffix receives credit.)
> The code was taken from https://github.com/coreboot/chrome-ec/blob/main/core/riscv-rv32i/task.c

---

## Problem `D`

This set of problems centers on cache coherence. Crucial assumptions for these questions are:
* Processors A, B, and C each have their respective private caches.
* These private caches operate under the MOESI protocol and can snoop on other caches via a shared bus.
* The [MOESI protocol](https://en.wikipedia.org/wiki/MOESI_protocol), relevant to these questions, is depicted in the diagram below, showing the possible state transitions.

![MOESI](https://hackmd.io/_uploads/rkMCNQAUT.png)

For each subpart listed below, identify the initial and subsequent cache states that best represent the transition described in the provided scenario. The state options are: M (Modified), O (Owned), E (Exclusive), S (Shared), I (Invalid).

Note: Each subpart should be treated as a separate scenario. Assume that at the start of each subpart, location X is not initially in the cache of any of the processors. Your answers should be one of the characters from `MOESI`.
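Before working through the subparts, the following C sketch summarizes the MOESI transitions that these scenarios rely on. It is an illustrative encoding of the usual conventions (a line in M supplies data and moves to O on a remote read; any remote write invalidates the line), not code from the quiz, and it covers only the events used here:

```c
typedef enum { ST_M, ST_O, ST_E, ST_S, ST_I } moesi_t;

/* State of THIS cache when the local processor accesses the line. */
static moesi_t local_read(moesi_t cur, int other_copies_exist)
{
    if (cur == ST_I)                           /* read miss */
        return other_copies_exist ? ST_S : ST_E;
    return cur;                                /* M, O, E, S: hit, unchanged */
}

static moesi_t local_write(moesi_t cur)
{
    (void)cur;                                 /* I/S/O upgrade, E/M already writable */
    return ST_M;
}

/* State of THIS cache when it snoops another cache's transaction. */
static moesi_t snoop_read(moesi_t cur)
{
    if (cur == ST_M) return ST_O;              /* supply data, keep ownership */
    if (cur == ST_E) return ST_S;
    return cur;                                /* O, S, I unchanged */
}

static moesi_t snoop_write(moesi_t cur)
{
    (void)cur;
    return ST_I;                               /* a remote writer invalidates us */
}
```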
1. Processor B executes a write operation to location X, where initially, A's cache is in StateA1 and B's cache is in StateB1. Subsequently, when A performs a read operation from location X, A's cache changes to StateA2, while B's cache is then in StateB2.
    * This results in a transition for A's cache from StateA1 (D01) to StateA2 (D02).
    * The cache of processor B undergoes a transition from StateB1 (D03) to StateB2 (D04).
> * D01 = ?
> ==I==
> * D02 = ?
> ==S==
> * D03 = ?
> ==M==
> * D04 = ?
> ==O==

2. Processors A, B, and C each perform a read operation from location X. Following this, A executes a write operation to location X. Then, B reads from location X, after which C writes to it. Subsequently, A reads from location X again. At this juncture, the cache state of A is StateA1 (D05), B's cache is in StateB1 (D06), and C's cache is in StateC1 (D07).
> * D05 = ?
> ==S==
> * D06 = ?
> ==I==
> * D07 = ?
> ==O==

3. Processor B initiates a read from location X, whereupon A's cache is in StateA1 and B's cache is in StateB1. Following this, when A reads from location X, its cache transitions to StateA2 while B's cache changes to StateB2.
    * This marks a shift for A's cache from StateA1 (D08) to StateA2 (D09).
    * Concurrently, B's cache evolves from StateB1 (D10) to StateB2 (D11).
> * D08 = ?
> ==I==
> * D09 = ?
> ==S==
> * D10 = ?
> ==E==
> * D11 = ?
> ==S==

---

## Problem `E`

Examine these two RISC-V atomic instructions, which we are using to implement a locking mechanism for critical sections in our new interactive application. These instructions employ specific register(s) designated for holding the reservation flag, address, and the result of the store-conditional operation.

Consider the following RISC-V atomic instructions:
* `lr.w rd, rs1`
  - This instruction sets the value of register `R[rd]` to the memory content at `M[R[rs1]]`.
  - It also places a reservation on the memory location `M[R[rs1]]`.
* `sc.w rd, rs1, rs2`
  - If the memory location `M[R[rs1]]` holds a reservation, then `R[rd]` is set to 0, and the content of `M[R[rs1]]` is updated with the value from `R[rs2]`.
  - If there is no reservation, `R[rd]` is set to 1.

The first task is to create the `Exchange` function. This function uses the load-reserved and store-conditional synchronization primitives to atomically exchange the values of `Mem[a0]` and `a1`. Complete the routine by inserting the appropriate instruction into the first blank, labeled `E01`, and then fill the second blank, labeled `E02`, with the instruction required there.

```c
# Input Parameters:
#   a0: Address in memory where the atomic exchange is to occur
#   a1: The value to be atomically stored at Mem[a0]
#
# Output:
#   a0: The value originally held at Mem[a0] before the atomic operation
Exchange:
    lr.w t0, a0
    __ E01 __
    __ E02 __
    mv a0, t0
    ret
```

> * E01 = ?
> ==`sc.w t1, a0, a1`==
> * E02 = ?
> ==`bnez t1, Exchange`==

Then, consider a lock function for critical sections using the newly implemented atomic exchange synchronization function. Here is how it is structured:

```c
# Input Parameters:
#   a0: Address in memory where the lock is located
Lock:
    addi sp, sp, -8
    sw ra, 0(sp)
    sw s0, 4(sp)
    mv s0, a0
Spin:
    mv a0, s0
    li a1, 1
    jal ra, Exchange
    bnez a0, Spin
    lw ra, 0(sp)
    lw s0, 4(sp)
    addi sp, sp, 8
    ret
```

Upon observing performance degradation in parts of the code where numerous threads vie for the same lock, we realize that this might be linked to our system's MSI (Modified, Shared, Invalid) cache coherence protocol. Under the MSI protocol, every processor issues write requests while executing the `Exchange` function, leading to mutual invalidation in the spin loop. What is the impact?

> * E03 = ?
> ==It degrades performance.== (Any answer noting that overall system performance suffers receives credit.)
> This results in increased bus traffic and, consequently, a decline in overall performance.

To keep our new application performing well on platforms that use the MSI coherence protocol, we revise the `Lock` function to avoid the issue identified above. The new structure aims to reduce cache invalidations and bus traffic while trying to acquire the lock, enhancing performance on MSI coherence platforms. Here is the updated function, with placeholders for the necessary adjustments:

```c
# Arguments:
#   a0: The memory address of the lock
Lock:
    addi sp, sp, -8
    sw ra, 0(sp)
    sw s0, 4(sp)
    mv s0, a0
Spin1:
    mv a0, s0
    li a1, 1
Spin2:
    lw t0, 0(s0)
    __ E04 __
    __ E05 __
    bnez a0, Spin1
    lw ra, 0(sp)
    lw s0, 4(sp)
    addi sp, sp, 8
    ret
```

> * E04 = ?
> ==`bnez t0, Spin2`==
> * E05 = ?
> ==`jal ra, Exchange`==
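The revised routine is the classic test-and-test-and-set pattern: spin on ordinary loads, which keep the lock's cache line in the Shared state, and attempt the atomic exchange only once the lock appears free. A rough C equivalent using the GCC `__atomic` builtins, offered as an illustrative sketch rather than the quiz's reference code:

```c
#include <stdint.h>

/* Test-and-set: every acquisition attempt is an atomic write, so
 * contending cores keep invalidating each other's copy of the line
 * (this corresponds to the first Lock/Spin loop). */
static void lock_tas(volatile uint32_t *lock)
{
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
        ;
}

/* Test-and-test-and-set: spin with plain loads (read-only sharing) and
 * issue the atomic exchange only when the lock looks free
 * (this corresponds to the revised Spin1/Spin2 structure). */
static void lock_ttas(volatile uint32_t *lock)
{
    for (;;) {
        while (__atomic_load_n(lock, __ATOMIC_RELAXED))   /* Spin2 */
            ;
        if (!__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE)) /* Exchange */
            return;                                        /* acquired */
    }
}

static void unlock(volatile uint32_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```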
---

## Problem `F`

To optimize Load-Reserved/Store-Conditional (LR/SC) operations in scenarios with many readers and few writers of a shared variable, we suggest the following changes to the cache-based approach, within an MSI coherence framework:
* Initially, the LR operation brings the cache line into the shared state rather than the modified state.
* The reservation set by LR is canceled if the line is invalidated, or if the same address is modified by another store from the local processor before the SC executes.
* If the reservation remains intact, the SC operation tries to upgrade the line from shared to modified. The SC succeeds if it completes the coherence transaction without any intervening invalidation.

By contrast, in the original implementation the reservation is canceled whenever the line leaves the modified state, which can happen when another sharer performs a regular load on the line. The proposed modification lets readers leave the reservation intact until the SC executes, thereby increasing the likelihood that the SC succeeds.

1. In a scenario where the LR/SC instruction sequence is restricted such that there are no other load or store operations between the LR and the SC, and the shared variable occupies its own cache line without co-locating with other variables, does this modification guarantee prevention of [livelock](https://en.wikipedia.org/wiki/Deadlock#Livelock)? Answer Yes or No.
> * F01 = ?
> ==Yes==
> If a Store-Conditional (SC) operation fails because of invalidation, it implies that another SC or store operation has succeeded. The restrictions on the Load-Reserved/Store-Conditional (LR/SC) sequence and the exclusive cache line allocation for the variable rule out voluntary eviction and false sharing as causes of line invalidation and reservation clearance. Note that this question concerns the absence of livelock, not a guarantee that every individual thread will always succeed in its SC operation; contention may still delay some threads. What matters is that the system as a whole keeps making progress, even if some threads do not complete their SC operations successfully.

2. We aim to eliminate the additional coherence traffic required to upgrade the cache line from the shared state to the modified state, especially when no other readers access the line in the interval between the LR and SC operations. Is it feasible to implement a similar optimization in a system using the MESI coherence protocol? Answer Yes or No.
> * F02 = ?
> ==Yes==
> In this variant, the LR operation brings the cache line in the exclusive state. The reservation is not canceled when the line is downgraded from exclusive to shared. The SC succeeds under two conditions: either the line has remained continuously in the exclusive state, in which case it transitions silently to modified, or it upgrades from shared to modified without any intervening invalidation. While this approach reintroduces the potential for livelock, it generally improves forward progress compared to the standard MESI-based approach discussed in lectures. In the given context, with relatively few writers (using LR) and many readers (using regular loads), the impact is smaller because readers no longer clear the reservations.
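For context, the code that depends on these reservation rules is the usual LR/SC retry loop. In C it is typically written as a compare-and-swap loop, which GCC generally lowers to an `lr.w`/`sc.w` sequence on RISC-V when no single AMO instruction fits. The sketch below is illustrative only (the function name and the plain counter are not from the quiz):

```c
#include <stdint.h>

/* Atomically add 'delta' to *counter and return the old value.
 * The CAS below is normally compiled to an lr.w / sc.w retry loop,
 * which is exactly the pattern whose reservation behavior Problem F
 * modifies: nothing else is loaded or stored between the "LR" and "SC". */
static uint32_t fetch_add_casloop(uint32_t *counter, uint32_t delta)
{
    uint32_t old = __atomic_load_n(counter, __ATOMIC_RELAXED);
    uint32_t desired;

    do {
        desired = old + delta;
    } while (!__atomic_compare_exchange_n(counter, &old, desired,
                                          /* weak = */ 1,
                                          __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
    return old;
}
```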
---

## Problem `G`

Assume that we have a VIPT (Virtually-Indexed, Physically-Tagged) cache implementation, which is a 4-way set-associative cache. The cache input address is partitioned as follows:
* Tag bits: [31:11]
* Index bits: [10:6]
* Line offset: [5:0]

For the virtual address, the partitioning is as follows:
* VPN (Virtual Page Number) bits: [31:9]
* Page offset: [8:0]

Below is a description of a basic VIPT cache implementation along with its associated TLB, both of which are accessed simultaneously. Note that in a VIPT cache, the tag check is performed against the Physical Page Number (PPN), which is obtained either from the TLB or through a page table walk.

![4-way Fully Set-Associative TLB Implementation](https://hackmd.io/_uploads/r16XGICLT.png)
> $\uparrow$ 4-way Fully Set-Associative TLB Implementation

![4-way Set-Associative VIPT Cache Implementation](https://hackmd.io/_uploads/HJCPz8R8p.png)
> $\uparrow$ 4-way Set-Associative VIPT Cache Implementation

For simplicity, assume that the TLB always hits and there are no page faults, i.e., the Physical Page Number (PPN) is always valid in a Page Table Entry (PTE). As depicted in the diagrams, the TLB access and the cache lookup occur concurrently. The "TLB Hit" signal is then used to determine the validity of the cache entry. Using the block delay expressions for each component given below, and referring to the diagrams of the VIPT cache and TLB, answer the following questions. ceil(X) yields the smallest integer greater than or equal to X.

| Component | Delay equation (ps) |
|:----------|:-------------------:|
| AND Gate | 40 |
| OR Gate | 50 |
| Output-data/buffer driver | 180 |
| N-to-1 MUX | 50 $\times$ ceil($\log_2{N}$) + 100 |
| Comparator | 30 $\times$ (num tag bits) + 70 |
| Memory array | 30 $\times$ ceil($\log_2{number\_of\_sets}$) + 30 $\times$ ceil($\log_2{number\_bits\_per\_set}$) + 100 |
| Decoder | 30 $\times$ (number\_index\_bits) + 80 |

1. Among the three specified paths, identify the critical path. It is not necessary to consider other paths for determining the critical path in this context, even though the ones listed here do not encompass all possible options. The paths are: (select one)
    * `(a)` Cache index -> Output data
    * `(b)` VPN tag check -> Output data
    * `(c)` VPN tag check -> Cache hit

> * G01 = ?
> ==b==
> Longest path from the VPN tag check to the output data of the data cache: VPN -> TLB comparator (30 x 23 bits + 70 = 760) -> AND gate (40) -> buffer driver (180) -> cache tag comparator (30 x 21 bits + 70 = 700) -> AND gate (40) -> 2 x buffer driver (2 x 180) -> data output. Total: 2080 ps.
> Longest path from the input-address index to the output data of the data cache: index -> tag decoder (30 x 5 bits + 80 = 230) -> memory array (30 x 5 + 30 x 7 + 100 = 460) -> cache tag comparator (30 x 21 + 70 = 700) -> AND gate (40) -> 2 x buffer driver (2 x 180) -> data output. Total: 1790 ps.
> Longest path from the VPN tag check to the cache-hit signal: VPN -> TLB comparator (30 x 23 bits + 70 = 760) -> AND gate (40) -> buffer driver (180) -> cache tag comparator (30 x 21 bits + 70 = 700) -> AND gate (40) -> OR gate (50) -> AND gate (40) -> cache hit. Total: 1810 ps.
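A quick arithmetic cross-check of the three paths, in plain C, using the delay equations from the table and the path compositions given in the solution above:

```c
#include <stdio.h>

/* Delay equations from the table (picoseconds). */
#define AND_GATE 40
#define OR_GATE  50
#define BUF      180
#define COMPARATOR(tag_bits) (30 * (tag_bits) + 70)
#define DECODER(index_bits)  (30 * (index_bits) + 80)
/* 32 sets -> ceil(log2) = 5; the solution uses ceil(log2(bits per set)) = 7 */
#define MEM_ARRAY (30 * 5 + 30 * 7 + 100)

int main(void)
{
    int tlb_cmp = COMPARATOR(23);   /* VPN tag:   bits [31:9]  -> 23 bits */
    int cache_cmp = COMPARATOR(21); /* cache tag: bits [31:11] -> 21 bits */

    int path_a = DECODER(5) + MEM_ARRAY + cache_cmp + AND_GATE + 2 * BUF;
    int path_b = tlb_cmp + AND_GATE + BUF + cache_cmp + AND_GATE + 2 * BUF;
    int path_c = tlb_cmp + AND_GATE + BUF + cache_cmp + AND_GATE + OR_GATE + AND_GATE;

    printf("(a) index -> data: %d ps\n", path_a); /* 1790 */
    printf("(b) VPN   -> data: %d ps\n", path_b); /* 2080 */
    printf("(c) VPN   -> hit : %d ps\n", path_c); /* 1810 */
    return 0;
}
```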
---

## Problem `H`

Translate the following loop into vector assembly code, using the RISC-V vector instruction set architecture (ISA):
* Arrays `A` and `B` consist of 64-bit integers and are non-overlapping.
* The value of `N` is provided in the register `a0`.
* The base addresses for arrays `A` and `B` are supplied in registers `a1` and `a2`, respectively.

- [ ] C source
```c
for (i = 0; i < N; i++)
    A[i * 3 + 2] = A[i * 3] + (A[i * 3 + 1] * B[i]);
```

- [ ] RISC-V Assembly with vector extension
```c
    addi t1, x0, 24          # Set the stride length in bytes (3 elements x 8 bytes)
loop:
    vsetvli t0, a0, e64, m8  # Set VL, SEW=64, LMUL=8 for vector operation
    addi a3, a1, 8           # Calculate address for &A[3*i+1]
    addi a4, a1, 16          # Calculate address for &A[3*i+2]
    vlse.v v0, (a1), t1      # Strided load of A[3*i]
    vlse.v v8, (a3), t1      # Strided load of A[3*i+1]
    __ H01 __                # Load values from B[i]
    vmadd.vv v8, v16, v0     # Compute v8 = (v8 * v16) + v0
    __ H02 __                # Store results to A[3*i+2]
    sub a0, a0, t0           # Decrease the remaining element count by VL
    slli t2, t0, 4           # t2 = VL * 16 bytes (two of the three 8-byte elements per i)
    __ H03 __                # t0 = VL * 8 bytes
    add t2, t2, t0           # t2 = VL * 24 bytes, the offset for 3*VL elements of A
    add a2, a2, t0           # Update pointer B to the next set of elements
    __ H04 __                # Update pointer A to the next set of elements
    bnez a0, loop            # If there are more elements, repeat the loop
```

Fill in H01 to H04 to make the above work as expected.

> * H01 = ?
> ==`vle.v v16, (a2)`==
> * H02 = ?
> ==`vsse.v v8, (a4), t1`== (a strided *store* is required here: after `vmadd.vv`, v8 holds the results destined for `A[3*i+2]`)
> * H03 = ?
> ==`slli t0, t0, 3`==
> * H04 = ?
> ==`add a1, a1, t2`==

---

## Problem `I`

Consider the following code, which performs an in-place slide operation, moving non-zero elements of array `A` forward by a distance `M`. For parallel execution on a multi-threaded processor, the loop is divided so that each thread handles the iterations where (`i % T`) equals TID. Here, `T` is the total number of threads and TID is a unique identifier for each thread, ranging from 0 to $T-1$.

```c
for (i = 0; i < N; i++) {
    if (A[i + M])
        A[i] = A[i + M];
}
```

N and M are arbitrary integers, with N greater than both M and T. The code runs on a multi-threaded in-order core that has no data cache and features perfect branch prediction with no penalty for either taken or not-taken branches. There is also no overhead associated with threading.
* The latency of a main memory access is 60 cycles.
* Once the processor issues a load, it can continue with other instructions until it encounters an instruction that depends on the loaded value.
* The latency of integer arithmetic operations is 1 cycle.

Create the assembly code to be executed by each thread, treating the elements of array A as 32-bit integers.
You are allowed to use any available registers, such as t1-t6. The parameter registers are:
* Array (A) address: passed in a0
* Displacement (M): passed in a1
* Total number of threads (T): passed in a2
* Thread ID (TID): passed in a3

```c
    # Initialize the upper boundary for the loop (t0 = A + N)
    addi t0, a0, N*4
    # Convert M, T, and TID from element counts to byte offsets
    slli a1, a1, 2           # Scale M to bytes
    slli a2, a2, 2           # Scale T to bytes
    slli a3, a3, 2           # Scale TID to bytes
    # Adjust the starting pointer of array A based on the thread ID
    add a0, a0, a3
loop:
    # Calculate the address of the element to load and load it into t1
    add t1, a0, a1
    __ I01 __
    # If the loaded value is zero, skip the store operation
    __ I02 __
    # Store the non-zero value back into the array at the adjusted position
    sw t1, 0(a0)
    # Label for skipping the store operation
skip:
    # Move to the next element for this thread
    add a0, a0, a2
    # If the end of the array has not been reached, continue looping
    __ I03 __
```

This code ensures that each thread works on distinct elements of the array, determined by its thread ID, while moving non-zero elements forward by the displacement `M`.

> * I01 = ?
> ==`lw t1, 0(t1)`==
> * I02 = ?
> ==`beqz t1, skip`==
> * I03 = ?
> ==`bltu a0, t0, loop`==

Now imagine that thread switching occurs every cycle, following a fixed round-robin schedule. Whenever a thread is not ready to execute during its turn, the pipeline inserts a bubble (idle cycle). Assume M is 90 and N is substantially large, and note that keeping the threads scheduled in ascending TID order avoids any memory dependency issues. What is the minimum number of threads needed to always fully utilize the pipeline while maintaining correct execution?

> * I04 = ?
> ==60==
> The instruction that consumes the loaded value (`beqz`) immediately follows the load. With round-robin scheduling, a thread's next instruction issues T cycles after its previous one, so at least 60 threads are required to hide the 60-cycle memory latency.
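For reference, the per-thread partitioning that this assembly implements corresponds to the following C sketch (illustrative only; the function name is hypothetical):

```c
#include <stdint.h>

/* One thread's share of the slide loop: iterations i = TID, TID + T,
 * TID + 2*T, ...  This mirrors the assembly above, where a0 starts at
 * A + 4*TID and advances by 4*T bytes each iteration. */
static void slide_worker(int32_t *A, long N, long M, long T, long TID)
{
    for (long i = TID; i < N; i += T) {
        int32_t v = A[i + M];   /* the lw whose 60-cycle latency must be hidden */
        if (v)                  /* beqz t1, skip */
            A[i] = v;           /* sw t1, 0(a0)  */
    }
}
```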
---

## Problem `J`

A Virtual Machine Monitor (VMM, or hypervisor) operates multiple guest operating systems on a single host machine. These guest OSs function in user (unprivileged) mode, while the VMM operates in supervisor (privileged) mode. Each guest virtual machine's OS handles its own page tables, representing the link between its guest virtual address space and guest physical address space (termed the "virtual-to-real" mapping). Subsequently, these guest physical addresses need to be mapped onto the host physical addresses. To efficiently utilize the existing hardware Translation Lookaside Buffer (TLB), the VMM creates and manages shadow page tables. These shadow tables directly map the guest virtual address space to the host physical address space (known as the "virtual-to-physical" mapping). While the guest OS runs in user mode, the VMM adjusts the hardware's page table base pointer to point at the shadow page table. This setup allows the TLB to function seamlessly, as though there were no virtualization involved.

![vmm](https://hackmd.io/_uploads/S11UCLC86.png)

1. Assuming both the guest and host machines employ three-level page tables and the host features a hardware-refilled TLB, consider the scenario when the guest OS is operational. In this case, if a TLB access requires 1 cycle and each memory access incurs a latency of 50 cycles, what would be the latency experienced during a TLB miss? __ J01 __ cycles
> * J01 = ?
> ==151==
> 1 + 50 + 50 + 50 = 151 cycles
> A hardware TLB refill involves a page table walk of the shadow page tables in memory, the same as if there were no virtualization in effect. Note that the guest page tables are not directly involved.
> The TLB miss latency must also include the cost of the initial TLB lookup (1 cycle).

2. To facilitate a guest virtual machine with an 8 KiB page size on a host machine that uses a 4 KiB page size, especially for adequate support of I/O operations like DMA requests, what is the required number of host pages that must be physically contiguous?
> * J02 = ?
> ==2==
> Each guest page is mapped to two host pages; a leaf entry in the guest page table corresponds to two consecutive entries in the shadow page table with the same permissions. To properly support I/O (i.e., DMA requests), the two host pages should be physically contiguous.
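A small arithmetic sketch of both answers in plain C (illustrative only; the walk depth and latencies come from the problem statement):

```c
#include <stdio.h>

int main(void)
{
    /* J01: a hardware refill walks the 3-level shadow page table. */
    int tlb_lookup = 1, mem_latency = 50, walk_levels = 3;
    int tlb_miss_latency = tlb_lookup + walk_levels * mem_latency;
    printf("TLB miss latency: %d cycles\n", tlb_miss_latency);   /* 151 */

    /* J02: one 8 KiB guest page must be backed by contiguous 4 KiB host pages. */
    int guest_page = 8 * 1024, host_page = 4 * 1024;
    printf("Contiguous host pages per guest page: %d\n",
           guest_page / host_page);                              /* 2 */
    return 0;
}
```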