# Annotate and explain Quiz6/7 with additional challenge problems: Quiz6
## Problem B
We aim to enhance the efficiency of a program designed to calculate the total of all elements in a 64x64 element matrix M. This matrix is arranged in memory following the row-major order. It is presumed that the variable sum is located in a shared memory area. We are evaluating two different code segments for this purpose:
Program I
```cpp
int sum = 0;
for (int j = 0; j < 8 * 8; j++) {
for (int i = 0; i < 8 * 8; i++) {
sum += M[i][j];
}
}
```
Program II
```cpp
int sum = 0;
for (int i = 0; i < 8 * 8; i++) {
for (int j = 0; j < 8 * 8; j++) {
sum += M[i][j];
}
}
```
Both programs operate on a direct-mapped cache that has a capacity of 64 words and a block size of 8 words. Additionally, the data cache is distinct and separate from the instruction cache.
To reduce the overall number of data cache misses, which program is preferable? Program II. Because M is stored in row-major order, a cache line holds elements that share the same value of i and have consecutive values of j. Program II accesses the elements of each row of M with incrementing j, so a new line is brought in only on every eighth access, i.e., one cache miss per 8 accesses, whereas Program I walks down a column with a 64-word stride and misses on every access.
1. How many total data cache misses occur during the execution of Program II?
>B01 = ?
512
64 x 64/8 = 512
2. Consider running Program II on a multi-threaded processor equipped with 8 threads. What resources are shared among these threads?
>B02 = ?
sum (i.e., the variable name)
The variable sum is shared by all the threads.
3. (Continuing from the previous question) Do these shared resources lead to any issues? If so, show the proposed solutions.
>B03 = ?
Credit is given for mentioning a semaphore or mutex.
Yes. Concurrent updates to the shared sum can race with one another and be lost. A semaphore or mutex should be used to serialize updates to sum so that its value is updated accurately.
### 📝 Annotate
:::info
B01:
Since the matrix is 64x64, there are a total of 4096 elements. Given that each cache line can store 8 words, there are 4096 / 8 = 512 cache misses in total.
B02-B03:
Shared resources can lead to race conditions, where multiple threads attempt to update the same variable simultaneously. To address this issue, semaphores or mutexes can be used to control updates to the sum variable, ensuring that only one thread can write to sum at a time.
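As a concrete illustration of B02-B03, here is a minimal sketch (not part of the quiz) of Program II split across 8 POSIX threads, with a mutex serializing updates to the shared sum; the row-band partitioning and thread count are illustrative assumptions.
```cpp
/* Hedged sketch: Program II parallelized over 8 threads, with a mutex
 * protecting the shared accumulator `sum`. */
#include <pthread.h>
#include <stdio.h>

#define N 64
#define NTHREADS 8

static int M[N][N];
static long sum;                                 /* shared accumulator */
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int tid = (int)(long)arg;
    long local = 0;

    /* Each thread sums a contiguous band of rows (row-major friendly). */
    for (int i = tid * (N / NTHREADS); i < (tid + 1) * (N / NTHREADS); i++)
        for (int j = 0; j < N; j++)
            local += M[i][j];

    pthread_mutex_lock(&sum_lock);               /* avoid the race on sum */
    sum += local;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);

    printf("sum = %ld\n", sum);
    return 0;
}
```
Accumulating into a per-thread local and taking the lock once per thread also keeps contention on sum_lock low.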
:::
## Problem C
Assume that we are going to build a minimal operating system kernel on RISC-V with the “A” (atomic) extension. Critical defines and routines are listed below.
```cpp
typedef int atomic_t;
struct mutex {
uint32_t lock;
atomic_t waiters;
};
/* find the most significant bit. Not defined in n == 0. */
#define __fls(n) (31 - __builtin_clz(n))
static inline atomic_val_t atomic_or(atomic_t *addr, atomic_val_t bits)
{
return __atomic_fetch_or(addr, bits, __ATOMIC_SEQ_CST);
}
static inline atomic_val_t atomic_clear_bits(atomic_t *addr, atomic_val_t bits)
{
return __atomic_fetch_and(addr, ~bits, __ATOMIC_SEQ_CST);
}
```
Both __atomic_fetch_or and __atomic_fetch_and are provided by the GCC atomics extension, and __builtin_clz is a GCC builtin as well; see "Other Built-in Functions Provided by GCC" in the GCC documentation.
1. Let’s consider the following code snippet that attempts to implement the mutex lock routine. Fill in C1 at Line 13 to make it function correctly.
```cpp=
void mutex_lock(struct mutex *mtx)
{
uint32_t locked;
uint32_t id = 1 << task_get_current();
atomic_or(&mtx->waiters, id);
while (1) {
asm volatile(
/* set lock value */
"li %0, 2\n\t"
/* attempt to acquire lock */
"__C1__ %0, %0, %1\n\t"
: "=&r"(locked), "+A"(mtx->lock));
/* we got it ! */
if (!locked)
break;
/* Contention on the mutex */
/* Sleep waiting for our turn */
task_wait_event_mask(TASK_EVENT_MUTEX, 0);
}
atomic_clear_bits(&mtx->waiters, id);
}
```
>* C01 = ?
amoswap.w.aq (credit is given as long as amoswap is used with the aq bit set)
2. Let’s consider the following code snippet that attempts to implement the mutex unlock routine. Fill in C2 at Line 7 to make it function correctly.
```cpp=
void mutex_unlock(struct mutex *mtx)
{
uint32_t waiters;
task_ *tsk = current_task;
/* give back the lock */
asm volatile("__C2__ zero, zero, %0\n\t" : "+A"(mtx->lock));
waiters = mtx->waiters;
while (waiters) {
task_id_t id = __fls(waiters);
waiters &= ~BIT(id);
/* Somebody is waiting on the mutex */
task_set_event(id, TASK_EVENT_MUTEX);
}
/* Ensure no event is remaining from mutex wake-up */
atomic_clear_bits(&tsk->events, TASK_EVENT_MUTEX);
}
```
>* C02 = ?
amoswap.w.aqrl (credit is given as long as amoswap is used with the rl bit set)
The code was taken from https://github.com/coreboot/chrome-ec/blob/main/core/riscv-rv32i/task.c
### 📝 Annotate
:::info
C01:
`amoswap.w.aq` atomically swaps the lock value, so the current task and other tasks cannot change the lock's state at the same time, guaranteeing exclusive access to the mutex. The `aq` (acquire) bit orders the swap before all subsequent memory operations, so accesses inside the critical section cannot be observed before the lock has been acquired.
C02:
`amoswap.w.aqrl` is used for synchronization in multi-core or multi-threaded environments: it makes the release of the mutex a single atomic operation that is ordered from the perspective of other harts. The `rl` (release) bit ensures that all memory operations inside the critical section complete before the lock value is cleared, and the `aq` bit additionally orders the swap before later accesses, which avoids race conditions and inconsistent behavior when the mutex is handed off.
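For readers more comfortable with C than with inline assembly, the following hedged sketch shows a C-level analogue of the two atomic swaps using the GCC builtins introduced in the problem statement. It is not the chrome-ec implementation: the real mutex_lock() sleeps on TASK_EVENT_MUTEX instead of spinning, and the waiter bookkeeping is omitted; the function names are made up.
```cpp
/* Hedged sketch: C-level analogue of the amoswap.w.aq / amoswap.w.aqrl pair. */
#include <stdint.h>

typedef int atomic_t;                       /* as in the problem statement */
struct mutex {
    uint32_t lock;
    atomic_t waiters;
};

static inline void lock_sketch(struct mutex *mtx)
{
    /* amoswap.w.aq: atomically write 2 and read back the old value with
     * acquire ordering; a zero result means the lock was free. */
    while (__atomic_exchange_n(&mtx->lock, 2, __ATOMIC_ACQUIRE) != 0)
        ;                                   /* contended: real code blocks the task */
}

static inline void unlock_sketch(struct mutex *mtx)
{
    /* amoswap.w.aqrl with zero: atomically clear the lock with
     * acquire+release ordering. */
    __atomic_exchange_n(&mtx->lock, 0, __ATOMIC_ACQ_REL);
}
```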
#### RV32A
I've implemented a possible solution, but QEMU is not running it properly. I'm currently investigating the root cause of the issue.
```cpp
#define BIT(n) (1U << (n))
typedef unsigned int task_id_t;
typedef unsigned int atomic_val_t;
// A minimal task descriptor is declared here only so that `tsk->events` compiles;
// the real chrome-ec `task_` structure contains more fields.
typedef struct { atomic_t events; } task_;
// Assume `task_get_current`, `task_wait_event_mask`, `task_set_event`,
// `current_task`, and `TASK_EVENT_MUTEX` are defined elsewhere.
extern task_id_t task_get_current(void);
extern void task_wait_event_mask(int, int);
extern void task_set_event(task_id_t, int);
extern task_ *current_task;
// The mutex structure and atomic operations are already defined.
void mutex_lock(struct mutex *mtx)
{
uint32_t locked;
uint32_t id = BIT(task_get_current());
atomic_or(&mtx->waiters, id);
while (1) {
asm volatile(
/* set lock value */
"li %0, 2\n\t"
/* attempt to acquire lock */
"amoswap.w.aq %0, %0, %1\n\t"
: "=&r"(locked), "+A"(mtx->lock));
/* we got it! */
if (!locked)
break;
/* Contention on the mutex */
/* Sleep waiting for our turn */
task_wait_event_mask(TASK_EVENT_MUTEX, 0);
}
atomic_clear_bits(&mtx->waiters, id);
}
void mutex_unlock(struct mutex *mtx)
{
uint32_t waiters;
task_ *tsk = current_task;
/* give back the lock */
asm volatile("amoswap.w.aqrl zero, zero, %0\n\t" : "+A"(mtx->lock));
waiters = mtx->waiters;
while (waiters) {
task_id_t id = __fls(waiters);
waiters &= ~BIT(id);
/* Somebody is waiting on the mutex */
task_set_event(id, TASK_EVENT_MUTEX);
}
/* Ensure no event is remaining from mutex wake-up */
atomic_clear_bits(&tsk->events, TASK_EVENT_MUTEX);
}
```
:::
## Problem D
This set of problems centers on cache coherence. Crucial assumptions for these questions are:
- Processors A, B, and C each have their respective private caches.
- These private caches operate under the MOESI protocol and can snoop on other caches via a shared bus.
- The MOESI protocol, relevant to these questions, is depicted in the diagram below, showing the possible state transitions.

For each subpart listed below, identify the initial and subsequent cache states that best represent the transition described in the provided scenario. The state options are as follows: M (Modified), O (Owned), E (Exclusive), S (Shared), I (Invalid).
Note: Each subpart should be treated as a separate scenario. Assume that at the start of each subpart, location X is not initially in the cache of either processor A or B. Your answers should be one of the characters from MOESI.
1. Processor B executes a write operation to location X, where initially, A’s cache is in StateA1 and B’s cache is in StateB1. Subsequently, when A performs a read operation from location X, A’s cache changes to StateA2, while 𝐵’s cache is then in StateB2.
* This results in a transition for A’s cache from StateA1 (D01) to StateA2 (D02).
* The cache of processor 𝐵 undergoes a transition from StateB1 (D03) to StateB2 (D04).
>- D01 = ?
I
>- D02 = ?
S
>- D03 = ?
M
>- D04 = ?
O
### 📝 Annotate
:::info
D01-D04:
At the start, location X is in neither A's nor B's cache, so both lines begin in Invalid (I). When B writes to X, B's line is brought in and immediately becomes Modified (M), while A's line remains Invalid; hence StateA1 (D01) = I and StateB1 (D03) = M.
When A subsequently reads X, B must supply its up-to-date copy: B's line transitions from Modified (M) to Owned (O), keeping responsibility for the dirty data, and A's line transitions from Invalid (I) to Shared (S). Hence StateA2 (D02) = S and StateB2 (D04) = O.
:::
2. Processors A, B, and C each perform a read operation from location X. Following this, A executes a write operation to location X. Then, B reads from location X, after which C writes to it. Subsequently, A reads from location X again. At this juncture, the cache state of A is StateA1 (D05), B’s cache is in StateB1 (D06), and C’s cache is in StateC1 (D07).
>- D05 = ?
S
>- D06 = ?
I
>- D07 = ?
O
### 📝 Annotate
:::info
D05-D07:
All three processors A, B, and C read location X. The first reader loads the line in the Exclusive (E) state; as the other two read it, the line is downgraded, so all three caches end up holding X in the Shared (S) state.
Processor A then performs a write operation to location X. This action changes A's cache line to a Modified (M) state because A has the latest version of the data. The cache lines for B and C will transition to Invalid (I), as their data is now outdated.
Processor B reads from location X. To do this, B must re-fetch the data, which will cause A's cache to change from Modified (M) to Owned (O), because A still has the latest version of the data but now shares it with B. After the read, B's cache line will be in a Shared (S) state.
Processor C performs a write operation to location X. This action will change C's cache line to Modified (M) and will invalidate the cache lines for A and B, turning them into Invalid (I).
Finally, when A reads from location X again, A's cache line transitions from Invalid (I) to Shared (S), and C's cache, which supplies the up-to-date data, transitions from Modified (M) to Owned (O). This yields the final states A = S, B = I, C = O.
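To double-check the chain of transitions above, here is a small, hedged MOESI sketch (not part of the quiz) that replays the D05-D07 access sequence for a single cache line shared by three snooping caches. The transition rules follow the standard MOESI diagram; bus arbitration and write-back timing are ignored, and all names are illustrative.
```cpp
/* Hedged sketch: tiny MOESI model for one cache line in three caches. */
#include <stdio.h>

enum state { M, O, E, S, I };
static const char *name[] = { "M", "O", "E", "S", "I" };

#define NCACHES 3
static enum state st[NCACHES];

static int other_copy_exists(int who)
{
    for (int c = 0; c < NCACHES; c++)
        if (c != who && st[c] != I)
            return 1;
    return 0;
}

static void read_x(int who)   /* processor `who` reads X */
{
    /* Snooping caches react to the bus read. */
    for (int c = 0; c < NCACHES; c++) {
        if (c == who)
            continue;
        if (st[c] == M)
            st[c] = O;        /* owner supplies data, keeps it dirty */
        else if (st[c] == E)
            st[c] = S;        /* exclusive copy is now shared */
    }
    if (st[who] == I)         /* miss: fill as Exclusive or Shared */
        st[who] = other_copy_exists(who) ? S : E;
    /* M, O, E and S are read hits and stay unchanged. */
}

static void write_x(int who)  /* processor `who` writes X */
{
    for (int c = 0; c < NCACHES; c++)
        if (c != who)
            st[c] = I;        /* invalidate every other copy */
    st[who] = M;              /* any local state goes to Modified on a write */
}

int main(void)
{
    for (int c = 0; c < NCACHES; c++)
        st[c] = I;

    read_x(0); read_x(1); read_x(2);  /* A, B and C read X */
    write_x(0);                       /* A writes X        */
    read_x(1);                        /* B reads X         */
    write_x(2);                       /* C writes X        */
    read_x(0);                        /* A reads X again   */

    /* Prints A=S B=I C=O, matching D05-D07. */
    printf("A=%s B=%s C=%s\n", name[st[0]], name[st[1]], name[st[2]]);
    return 0;
}
```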
:::
3. Processor B initiates a read from location X, whereupon A’s cache is in StateA1 and 𝐵’s cache is in StateB1. Following this, when A reads from location X, its cache transitions to StateA2 while B’s cache changes to StateB2.
- This marks a shift for A’s cache from StateA1 (D08) to StateA2 (D09).
- Concurrently, 𝐵’s cache evolves from StateB1 (D10) to StateB2 (D11).
>- D08 = ?
I
>- D09 = ?
S
>- D10 = ?
E
>- D11 = ?
S
### 📝 Annotate
:::info
D08-D11:
When processor B reads location X, which is initially in neither cache, B's line is filled in the Exclusive (E) state because no other cache holds a copy, while A's line remains Invalid (I); hence StateA1 (D08) = I and StateB1 (D10) = E.
When A then reads location X, it fetches the data and its line transitions from Invalid (I) to Shared (S). B's line is downgraded from Exclusive (E) to Shared (S), since the unmodified data is now shared between A and B.
:::
## Problem E
Examine these two RISC-V atomic instructions, which we are utilizing to implement a locking mechanism for critical sections within our new interactive application. These instructions employ specific register(s) designated for holding the reservation flag, address, and the result of the store-conditional operation.
Consider the following RISC-V atomic instructions:
- lr.w rd, rs1
- This instruction sets the value of register R[rd] to the memory content at M[R[rs1]].
- It also places a reservation on the memory location M[R[rs1]].
- sc.w rd, rs1, rs2
- If the memory location M[R[rs1]] holds a reservation, then R[rd] is set to 0, and the content of M[R[rs1]] is updated with the value from R[rs2].
- If there is no reservation, R[rd] is set to 1.
The initial task involves creating the Exchange function. This function leverages the load-reserved and store-conditional synchronization primitives for the atomic exchange of values between Mem[a0] and a1. Complete the process by inserting the appropriate instruction into the first blank labeled E01, and then fill the second blank, labeled E02, with the corresponding instruction required there.
```cpp
# Input Parameters:
# a0: Address in memory where the atomic exchange is to occur
# a1: The value to be atomically stored at Mem[a0]
#
# Output:
# a0: The value originally held at Mem[a0] before the atomic operation
Exchange:
lr.w t0, a0
__ E01 __
__ E02 __
mv a0, t0
ret
```
>* E01 = ?
sc.w t1, a0, a1
>* E02 = ?
bnez t1, Exchange
Then, consider a lock function for critical sections using the newly implemented atomic exchange synchronization function. Here is how it is structured:
```cpp
# Input Parameters:
# a0: Address in memory where the lock is located
Lock:
addi sp, sp, -8
sw ra, 0(sp)
sw s0, 4(sp)
mv s0, a0
Spin:
mv a0, s0
li a1, 1
jal ra, Exchange
bnez a0, Spin
lw ra, 0(sp)
lw s0, 4(sp)
addi sp, sp, 8
ret
```
Upon observing performance degradation in parts of the code where numerous threads vie for the same lock, we realize that this might be linked to our system’s MSI (Modified, Shared, Invalid) cache coherence protocol. In the context of the MSI cache coherence protocol, every processor initiates write requests while executing the Exchange function, leading to mutual invalidation in the spin loop. What is the impact?
>- E03 = ?
==Not required to answer==
degrade performance (credit is given for any answer mentioning that system performance is degraded)
This results in increased bus traffic and consequently, a decline in overall performance.
To ensure our new application’s compatibility with platforms using the MSI coherence protocol, we revise the Lock function to circumvent the previously identified issue. This new structure aims to reduce cache invalidations and bus traffic while trying to acquire the lock, enhancing performance on MSI coherence platforms. Here is how we can structure the updated function, with placeholders for the necessary adjustments:
```cpp
# Arguments:
# a0: The memory address of the lock
Lock:
addi sp, sp, -8
sw ra, 0(sp)
sw s0, 4(sp)
mv s0, a0
Spin1:
mv a0, s0
li a1, 1
Spin2:
lw t0, 0(s0)
__ E04 __
__ E05 __
bnez a0, Spin1
lw ra, 0(sp)
lw s0, 4(sp)
addi sp, sp, 8
ret
```
>- E04 = ?
bnez t0, Spin2
>- E05 = ?
jal ra, Exchange
### 📝 Annotate
:::info
#### RV32A
I've implemented a possible solution, but QEMU is not running it properly. I'm currently investigating the root cause of the issue.
```
Exchange:
lr.w t0, (a0) # Load-Reserved from address in a0 to t0
sc.w t1, a1, (a0) # Store-Conditional: write a1 to Mem[a0]; t1 = 0 on success
bnez t1, Exchange # Branch if Store-Conditional failed
mv a0, t0 # Move the original value to a0
ret # Return from the function
Lock:
addi sp, sp, -8 # Allocate space on the stack
sw ra, 0(sp) # Save return address
sw s0, 4(sp) # Save s0
mv s0, a0 # Move the lock address to s0
Spin1:
mv a0, s0 # Set up arguments for Exchange function
li a1, 1 # Set a1 to 1 (lock value)
Spin2:
lw t0, 0(s0) # Load the lock value
bnez t0, Spin2 # Spin if lock is not free
jal ra, Exchange # Call Exchange function
bnez a0, Spin1 # Spin again if lock was not acquired
lw ra, 0(sp) # Restore return address
lw s0, 4(sp) # Restore s0
addi sp, sp, 8 # Deallocate stack space
ret # Return from the function
```
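The revised Lock routine is the classic test-and-test-and-set pattern: spin on a plain load (which keeps the line in the Shared state and generates no invalidations) and only attempt the atomic exchange once the lock looks free. Below is a hedged C-level sketch of the same idea using the GCC atomic builtins; the function names are illustrative, not part of the quiz.
```cpp
/* Hedged sketch: test-and-test-and-set lock, the C analogue of Spin1/Spin2. */
#include <stdint.h>

static void lock_ttas(volatile uint32_t *lock)
{
    for (;;) {
        while (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0)
            ;                                       /* Spin2: read-only wait */
        if (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) == 0)
            return;                                 /* Exchange succeeded */
        /* Someone else grabbed the lock first: back to Spin1. */
    }
}

static void unlock_ttas(volatile uint32_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);    /* release the lock */
}
```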
:::
## Problem F
For optimizing Load-Reserved/Store-Conditional (LR/SC) operations in scenarios with numerous readers and fewer writers of a shared variable, we suggest certain alterations to the cache-based approach, within an MSI coherence framework:
- Initially, the LR operation brings the cache line into a shared state rather than a modified state.
- The reservation set by LR gets canceled if the line is invalidated or if the same address is modified by another store operation from the local processor before executing SC.
- If the reservation remains intact, the SC operation tries to elevate the line’s state from shared to modified. The SC is successful if it completes the coherence transaction without any intermediate invalidation.
By contrast, in the original implementation the reservation is cleared whenever the line leaves the modified state, which can happen when another sharer performs a regular load on the line. The proposed modification lets readers proceed without disturbing the reservation until the SC executes, thereby increasing the likelihood that the SC succeeds.
1. In a scenario where the LR/SC instruction sequence is restricted such that there are no other load or store operations between the LR and SC, and the shared variable is isolated in its own cache line without co-locating with other variables, does this modification guarantee prevention of livelock? Answer in Yes or No.
>- F01 = ?
Yes
If a Store-Conditional (SC) operation fails because of invalidation, it implies that another SC or store operation has been successful. The limitations imposed on the Load-Reserved/Store-Conditional (LR/SC) sequence and the exclusive cache line allocation for the variable preclude voluntary eviction and false sharing as causes for line invalidation and reservation clearance. It’s important to note that this query focuses on the absence of livelock, not on ensuring that every individual thread will always succeed in its SC operation due to possible contention. What matters is that the system, in its entirety, continues to make progress, even if some threads may not complete their SC operations successfully.
2. We aim to eliminate the additional coherence traffic required to transition the cache line from the shared state to the modified state, especially when there are no other readers accessing the line in the interval between the LR and SC operations. Is it feasible to implement a similar optimization in a system using the MESI coherence protocol? Answer in Yes or No.
>- F02 = ?
Yes
In this modification, the LR operation secures the cache line in an exclusive state. The reservation is not canceled when the line is downgraded from exclusive to shared. The SC is successful only under two conditions: if the line has remained continuously in the exclusive state, which then gets seamlessly transitioned to modified, or if it transitions from shared to modified without any disruption caused by invalidation. While this approach does reintroduce the potential for livelock, it generally enhances forward progress compared to the standard MESI-based approach discussed in lectures. Given the context, where there are relatively fewer writers (utilizing LR) and more readers (using regular loads), the impact is less significant because readers no longer disrupt the reservations.
## Problem G
Assume that we have a VIPT (Virtually-Indexed, Physically-Tagged) cache implementation, which is a 4-way set associative cache. This cache uses the following address partitioning scheme for both its cache input and virtual address:
- Tag bits: [31:11]
- Index bits: [10:6]
- Line offset: [5:0]
For the virtual address, the partitioning is as follows:
- VPN (Virtual Page Number) bits: [31:9]
- Page offset: [8:0]
Here is a description of a basic VIPT cache implementation along with its associated TLB, both of which are accessed simultaneously. It’s important to note that in a VIPT cache, the verification of the tag is conducted using the Physical Page Number (PPN), which is acquired either from the TLB or through a page table walk.


For ease of understanding, let’s assume that the TLB consistently hits and there are no page faults, meaning the Physical Page Number (PPN) is always valid in a Page Table Entry (PTE). As depicted in the provided diagrams, the TLB access and the cache lookup occur concurrently. The outcome, indicated by the “TLB Hit” signal, is then utilized to ascertain the validity of the cache entry. Considering the block delay expressions for each component, as detailed below, and referring to the diagrams of the VIPT cache and TLB, please proceed with the following questions. Bear in mind that the ceil(X) function yields the smallest integer greater than or equal to X.

1. Among the three specified paths, identify the critical path. It is not necessary to consider other paths for determining the critical path in this context, even though the ones listed here do not encompass all possible options. The paths are: (select one)
- (a) Cache index -> Output data
- (b) VPN tag check -> Output data
- (c) VPN tag check -> Cache hit
> - G01 = ?
b
- Path (b), VPN tag check -> output data: VPN -> comparator (30 x 23 bits + 70 = 760) -> AND gate (40) -> buffer driver (180) -> comparator (30 x 21 bits + 70 = 700) -> AND gate (40) + 2 x buffer driver (2 x 180) -> data output, for a total of 2080 ps.
- Path (a), cache index -> output data: index -> tag decoder (30 x 5 bits + 80 = 230) -> memory array (30 x 5 + 30 x 7 + 100 = 460) -> comparator (30 x 21 bits + 70 = 700) -> AND gate (40) + 2 x buffer driver (2 x 180) -> data output, for a total of 1790 ps.
- Path (c), VPN tag check -> cache hit: VPN -> comparator (30 x 23 bits + 70 = 760) -> AND gate (40) -> buffer driver (180) -> comparator (30 x 21 bits + 70 = 700) -> AND gate (40) -> OR gate (50) -> AND gate (40) -> cache hit, for a total of 1810 ps.
Path (b) is the longest of the three, so it is the critical path.
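As a quick sanity check of the arithmetic, the snippet below simply re-adds the per-stage delays quoted above (the individual stage numbers come from the quiz figure) and confirms the three path totals.
```cpp
/* Hedged sketch: re-adds the quoted per-stage delays for paths (a), (b), (c). */
#include <stdio.h>

int main(void)
{
    /* (b) VPN tag check -> output data */
    int path_b = 760 + 40 + 180 + 700 + 40 + 2 * 180;               /* 2080 ps */
    /* (a) cache index -> output data */
    int path_a = 230 + (30 * 5 + 30 * 7 + 100) + 700 + 40 + 2 * 180; /* 1790 ps */
    /* (c) VPN tag check -> cache hit */
    int path_c = 760 + 40 + 180 + 700 + 40 + 50 + 40;               /* 1810 ps */

    printf("a = %d ps, b = %d ps, c = %d ps\n", path_a, path_b, path_c);
    return 0;
}
```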
## Problem H
Translate the following loop into vector assembly code, utilizing the RISC-V vector instruction set architecture (ISA):
- Arrays A and B consist of 64-bit integers and are non-overlapping.
- The value of N is provided in the register a0.
- The base addresses for arrays A and B are supplied in registers a1 and a2, respectively.
C source
```cpp
for (i = 0; i < N; i++)
A[i * 3 + 2] = A[i * 3] + (A[i * 3 + 1] * B[i]);
```
RISC-V Assembly with vector extension
```cpp
addi t1, x0, 24 # Set the stride length in bytes
loop: vsetvli t0, a0, e64, m8 # Set VL, SEW=64, LMUL=8 for vector operation
addi a3, a1, 8 # Calculate address for &A[3*i+1]
addi a4, a1, 16 # Calculate address for &A[3*i+2]
vlse.v v0, (a1), t1 # Load values from A[3*i]
vlse.v v8, (a3), t1 # Load values from A[3*i+1]
__ H01 __ # Load values from B[i]
vmadd.vv v8, v16, v0 # Compute v8 = (v8 * v16) + v0
__ H02 __ # Store results to A[3*i+2]
sub a0, a0, t0 # Decrease the remaining element count by VL
slli t2, t0, 4 # Multiply VL by 16 to get the byte count for 2*VL elements
__ H03 __ # Multiply VL by 8 to get the byte count for VL elements
add t2, t2, t0 # Calculate the total byte offset for 3*VL elements
add a2, a2, t0 # Update pointer B to the next set of elements
__ H04 __ # Update pointer A to the next set of elements
bnez a0, loop # If there are more elements, repeat the loop
```
Fill in H01 to H04 to make the above work as expected.
>- H01 = ?
vle.v v16, (a2)
>- H02 = ?
vsse.v v8, (a4), t1
>- H03 = ?
slli t0, t0, 3
>- H04 = ?
add a1, a1, t2
### 📝 Annotate
:::info
H01 : This line loads values from the current index of array B (pointer a2) into the vector register v16. The vle.v instruction is used to load 64-bit integers into the vector register.
H02: This line stores the computed results held in v8 to A[3*i+2] (address in a4). The vsse.v instruction performs a strided store, writing the elements to array A spaced 24 bytes apart (with a stride of t1).
H03 : This instruction multiplies the vector length VL by 8 (since each element is 64 bits, or 8 bytes), to calculate the total number of bytes processed in each iteration of the loop. slli is a logical left shift instruction that multiplies the value of VL by 8.
H04 : This instruction adds the base address of array A (a1) to the total byte offset (t2) calculated in the previous steps, updating the pointer of array A to the next set of elements to be processed.
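The pointer arithmetic behind H03/H04 may be easier to see in scalar C. The hedged sketch below is a strip-mined equivalent of the loop; the function name and the cap of 8 elements per strip (standing in for whatever vsetvli grants) are illustrative assumptions.
```cpp
/* Hedged sketch: strip-mined scalar equivalent of the vectorized loop. */
#include <stddef.h>
#include <stdint.h>

void strip_mined_madd(int64_t *A, const int64_t *B, size_t N)
{
    while (N > 0) {
        size_t vl = N < 8 ? N : 8;        /* stand-in for vsetvli's result */
        for (size_t i = 0; i < vl; i++)   /* the vector body               */
            A[3 * i + 2] = A[3 * i] + A[3 * i + 1] * B[i];
        A += 3 * vl;                      /* H04: advance A by 24*vl bytes */
        B += vl;                          /* advance B by 8*vl bytes       */
        N -= vl;                          /* sub a0, a0, t0                */
    }
}
```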
#### RV32 V extension
I've implemented a possible solution, but QEMU is not running it properly. I'm currently investigating the root cause of the issue.
```
addi t1, x0, 12 # Set the stride length in bytes for 32-bit integers
loop: vsetvli t0, a0, e32, m8 # Set VL, SEW=32, LMUL=8 for vector operation
addi a3, a1, 4 # Calculate address for &A[3*i+1], considering 32-bit integers
addi a4, a1, 8 # Calculate address for &A[3*i+2], considering 32-bit integers
vlse.v v0, (a1), t1 # Load values from A[3*i]
vlse.v v8, (a3), t1 # Load values from A[3*i+1]
vle.v v16, (a2) # Load values from B[i]
vmadd.vv v8, v16, v0 # Compute v8 = (v8 * v16) + v0
vsse.v v8, (a4), t1 # Store results to A[3*i+2]
sub a0, a0, t0 # Decrease VL
slli t2, t0, 3 # Multiply VL by 8 to get the byte count for 2*VL 32-bit elements
slli t0, t0, 2 # Multiply VL by 4 to get the byte count for VL 32-bit elements
add t2, t2, t0 # Calculate total byte offset for 3*VL
add a2, a2, t0 # Update pointer B to next set of elements
add a1, a1, t2 # Update pointer A to next set of elements
bnez a0, loop # If there are more elements, repeat the loop
```
:::
## Problem I
Reflect on the given code that executes an in-place slide operation, shifting non-zero elements in array A forward by a distance M. For parallel execution on a multi-threaded processor, imagine dividing the loop so that each thread handles iterations where (i % T) equals TID. Here, T represents the total number of threads and TID is a unique identifier for each thread, ranging from 0 to T-1.
```cpp
for (i = 0; i < N; i++) {
if (A[i + M])
A[i] = A[i + M];
}
```
N and M are arbitrary integers with N being greater than both M and T. The code runs on a multi-threaded in-order core which lacks a data cache and features flawless branch prediction without penalties for either taken or not-taken branches. Additionally, there’s no overhead associated with threading.
- The latency for main memory access is 60 cycles.
- Once a load instruction is issued by the processor, it can proceed with other instructions until it encounters an instruction that depends on the loaded value.
- The latency for integer arithmetic operations is set at 1 cycle.
Create the assembly code to be executed by each thread, treating the elements of array A as 32-bit integers. You are allowed to use any available registers, such as t1-t6. Here are the parameter registers:
- Array (A) address: passed in a0
- Displacement (M): passed in a1
- Total number of threads (T): passed in a2
- Thread ID (TID): passed in a3
```cpp
# Initialize the upper boundary for the loop (t0 = A + N)
addi t0, a0, N*4
# Convert M, T, and TID from elements to byte offsets
slli a1, a1, 2 # Scale M to represent bytes
slli a2, a2, 2 # Scale T to represent bytes
slli a3, a3, 2 # Scale TID to represent bytes
# Adjust the starting pointer of array A based on the thread ID
add a0, a0, a3
# Start of the loop
loop:
# Calculate address for the element to load and load it into t1
add t1, a0, a1
__ I01 __
# Check if the loaded value is zero; if yes, skip the store operation
__ I02 __
# Store the non-zero value back into the array at the adjusted position
sw t1, 0(a0)
# Label for skipping the store operation
skip:
# Move to the next element for this thread
add a0, a0, a2
# Check if the end of the array is reached; if not, continue looping
__ I03 __
```
This code ensures that each thread works on distinct elements of the array, determined by its thread ID, while moving non-zero elements forward by the displacement M.
>- I01 = ?
lw t1, 0(t1)
>- I02 = ?
beqz t1, skip
>- I03 = ?
bltu a0, t0, loop
Imagine a scenario where thread switching occurs every cycle, adhering to a fixed round-robin scheduling sequence. In instances where a thread is not prepared to execute during its scheduled turn, the pipeline introduces a bubble (idle cycle). Considering this, determine the minimum number of threads required to ensure the pipeline is always fully utilized without compromising the correctness of the execution. Take into account that M is set to 90 and N is substantially large. It is also observed that maintaining a consistent ascending order of thread scheduling by TID would prevent any memory dependency issues. What is the minimum number of threads needed to always fully utilize the pipeline while maintaining correct execution?
>- I04 = ?
60
At least 60 threads are required to hide a 60-cycle latency.
### 📝 Annotate
:::info
I01:
The instruction add t1, a0, a1 calculates the target address from which data needs to be loaded. The subsequent instruction lw t1, 0(t1) then loads the data from this address into register t1. We then need to determine whether the loaded value is non-zero; if it is, it is stored back into the corresponding position of array A (pointed to by a0).
I02:
The instruction beqz t1, skip is used to check whether the value just loaded into register t1 is zero.
If t1 is equal to zero, it indicates that the loaded array element is zero, so the program jumps to the skip label, bypassing the subsequent store operation.
If t1 is not equal to zero, the program does not jump, and instead continues to execute the next instruction, storing the non-zero value back into the array.
I03:
The instruction bltu a0, t0, loop controls the continuation of the loop.
When a0 is less than t0, this means there are more array elements to be processed. Therefore, the program jumps back to the loop label and continues the loop.
When a0 is greater than or equal to t0, it indicates that all elements have been processed, and the loop ends.
I04:
60. With a fixed round-robin schedule that issues one instruction per thread per cycle, consecutive instructions from the same thread are T cycles apart. The beqz that consumes t1 immediately follows the lw, so at least T = 60 threads are needed to hide the 60-cycle memory latency and keep the pipeline fully utilized.
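For reference, the per-thread C loop that the assembly above implements is simply the original loop with the iteration space interleaved by thread ID; this is an illustrative sketch and the function name is made up.
```cpp
/* Hedged sketch: each thread handles the iterations where (i % T) == TID. */
#include <stdint.h>

void slide_thread(int32_t *A, int N, int M, int T, int TID)
{
    for (int i = TID; i < N; i += T) {
        if (A[i + M])
            A[i] = A[i + M];
    }
}
```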
:::
## Problem J
A Virtual Machine Monitor (VMM, or hypervisor) operates multiple guest operating systems on a single host machine. These guest OSs function in user (unprivileged) mode, while the VMM operates in supervisor (privileged) mode. Each guest virtual machine’s OS handles its own page tables, representing the link between its guest virtual address space and guest physical address space (termed “virtual-to-real” mapping). Subsequently, these guest physical addresses need to be mapped onto the host physical addresses.
To efficiently utilize the existing hardware Translation Lookaside Buffer (TLB), the VMM creates and manages shadow page tables. These shadow tables directly map the guest virtual address space to the host physical address space (known as “virtual-to-physical” mapping). While the guest OS runs in user mode, the VMM adjusts the hardware’s page table base pointer to align with the shadow page table. This setup allows the TLB to function seamlessly, as though there was no virtualization involved.

1. Assuming both the guest and host machines employ three-level page tables and the host features a hardware-refilled TLB, consider the scenario when the guest OS is operational. In this case, if a TLB access requires 1 cycle and each memory access incurs a latency of 50 cycles, what would be the latency experienced during a TLB miss? __ J01 __ cycles
>- J01 = ?
151
1 + 50 + 50 + 50 = 151 cycles
A hardware TLB refill involves a page table walk of the shadow page tables in memory, same as if there were no virtualization in effect. Note that the guest page tables are not directly involved.
The TLB miss latency must also include the cost of the initial TLB lookup (1 cycle).
2. To facilitate a guest virtual machine with an 8 KiB page size on a host machine that uses a 4 KiB page size, especially for adequate support of I/O operations like DMA requests, what is the required number of host pages that must be physically contiguous?
>- J02 = ?
2
Each guest page is mapped to two host pages; a leaf entry in the guest page table corresponds to two consecutive entries in the shadow page table with the same permissions. To properly support I/O (i.e., DMA requests), the two host pages should be physically contiguous.
### 📝 Annotate
:::info
J01:
The processor attempts to find the mapping of the memory address in the TLB, which takes 1 cycle.
If not found in the TLB, the first-level page table is traversed, which takes 50 cycles.
Then the second-level page table is traversed, which also takes 50 cycles.
Finally, the third-level page table is traversed, taking another 50 cycles.
1 + 50 + 50 + 50 = 151 cycles
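A trivial sketch of the J01 arithmetic, assuming a 1-cycle TLB lookup, a three-level shadow page table walk, and 50 cycles per memory access:
```cpp
/* Hedged sketch: TLB-miss latency = TLB lookup + (levels x memory latency). */
#include <stdio.h>

int main(void)
{
    int tlb_lookup = 1, levels = 3, mem_latency = 50;
    printf("TLB miss latency = %d cycles\n", tlb_lookup + levels * mem_latency);
    return 0;  /* prints 151 */
}
```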
:::