# CA-2019Fall: Quiz 6

## `A`: Cache Coherence

Consider a multicore system where processors P1 and P2 each have a private, snooping-based, write-back cache that uses the [MESI coherence protocol](https://en.wikipedia.org/wiki/MESI_protocol). Suppose the processors make a sequence of accesses to the shared variable A in the order listed below, i.e., P1 completes four accesses, followed by two accesses by P2:

```
P1: LD A
P1: ST A
P1: LD A
P1: ST A
P2: LD A
P2: ST A
```

Please fill in the table below showing the shared-bus transactions (if any) required to complete each access (one of BusRd, BusRdX, or BusWB). Give the state of the cache line for location A in each of P1's and P2's caches after the access is complete (one of `M`, `E`, `S`, or `I`).

| Access | Shared bus transaction(s) | P1's cache state for A | P2's cache state for A |
|---|---|---|---|
| Initial state | --- | I | I |
| After P1: LD A | --- | E | I |
| After P1: ST A | --- | ? | ? |
| After P1: LD A | --- | ? | ? |
| After P1: ST A | --- | ? | ? |
| After P2: LD A | P2: BusRd(A), P1: BusWB(A,…) | ? | ? |
| After P2: ST A | P2: BusRdX(A) | ? | ? |

Here’s the state transition diagram for the MESI cache coherence protocol:

![](https://i.imgur.com/AG8CfNd.png)

---

## `B`: Cache Coherence

Processors P1, P2, and P3 all have private caches that are kept coherent with the snoopy, write-back MSI protocol that we saw in lecture. The state transition diagram for MSI is shown below.

![](https://i.imgur.com/D4QbVmJ.png)

All three processors access a shared memory location X in the following order:

```
P1: ST X
P2: ST X
P3: LD X
P1: LD X
P2: LD X
P3: ST X
```

There is an error in the implementation of the MSI protocol that causes all of the private caches to issue a BusWB request whenever downgrading (i.e., transitioning from a more privileged to a less privileged state, such as from M to S or from S to I). Will this implementation still maintain cache coherence? Briefly explain why or why not.

---

## `C`: Lock

While analyzing some code, you find that a big performance bottleneck involves many threads trying to acquire a single lock. Conceptually, the code is as follows:

```cpp
int mutex = 0;

while (true) {
    noncritical_code();
    lock(&mutex);
    critical_code();
    unlock(&mutex);
}
```

Assume for all questions that our processor uses a [directory protocol](https://en.wikipedia.org/wiki/Directory-based_coherence). We will use the atomic instruction [test_and_set](https://en.wikipedia.org/wiki/Test-and-set) to implement the lock(mutex) and unlock(mutex) functions. In C, the instruction has the following function prototype:

```cpp
int return_value = test_and_set(int *maddr);
```

Recall that test&set atomically reads the memory address maddr, writes a 1 to that location, and returns the original value. Using `test_and_set`, we arrive at the following first-draft implementation of the lock() and unlock() functions:

```cpp
inline void lock(int *mutex_ptr) {
    // Spin until the previous value is 0, i.e., the lock was free.
    while (test_and_set(mutex_ptr) == 1)
        /* wait */ ;
}

inline void unlock(int *mutex_ptr) {
    *mutex_ptr = 0;
}
```
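The quiz treats `test_and_set` as a primitive. For concreteness, here is a minimal sketch of one possible implementation, assuming a GCC/Clang toolchain and its `__atomic_exchange_n` builtin:

```cpp
// Minimal sketch, assuming GCC/Clang __atomic builtins: atomically
// store 1 to *maddr and return the value that was there, matching
// the test&set semantics described above.
int test_and_set(int *maddr) {
    return __atomic_exchange_n(maddr, 1, __ATOMIC_ACQUIRE);
}
```

Because the exchange is unconditionally a write, every call needs exclusive ownership of the lock's cache line, which is the source of the invalidation traffic analyzed below.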
Let us analyze the behavior of test&set while running 1,000 threads on 1,000 cores.

- [ ] (1) Consider the following scenario: at the start of the program, the lock is invalid in all caches. Then, every thread executes test_and_set once. The first thread wins the lock, while the other threads find that the lock is taken. How many invalidation messages must be sent when all 1,000 threads execute test_and_set once?
- [ ] (2) While the first thread (the "winning thread") is in the critical section, the remaining threads continue to execute test_and_set, attempting to acquire the lock. Each waiting thread is able to execute test_and_set 5 times before the winning thread frees the lock. How many invalidation messages must be sent while the winning thread is executing the critical section? (Assume that the threads are fully interleaved.)
- [ ] (3) How many invalidation messages must be sent when the winning thread frees the lock? Assume the critical section is very long, and all 999 other threads have been waiting to acquire the lock.

---

==Answer==

A => Following the MESI diagram above: P1's first store finds the line in E and silently upgrades it to M, so P1's three accesses after the initial load require no bus transactions and leave P1 in M, P2 in I. P2's load (P2: BusRd(A)) forces P1 to write the dirty line back (P1: BusWB(A,…)), after which both caches hold A in S. P2's store (P2: BusRdX(A)) invalidates P1's copy, leaving P1 in I and P2 in M.

B => Yes, cache coherence is still maintained. There will be more BusWB requests issued than needed, but the extra BusWB requests from S->I transitions write unmodified data back to memory, so memory contents are unaffected and every read still returns the most recently written value; the bug only wastes bus bandwidth.

C =>

(1) 1,000 test_and_sets are performed in this scenario. Test&set is an atomic read-modify-write operation and requires exclusive access to the lock's address, so each test_and_set invalidates the copy held by the previous core that performed one. The first core, however, had no one to invalidate, because the lock was initially uncached. Invalidations: 999

(2) 999 cores are spinning, and each executes test_and_set five times, for a total of 999 × 5 = 4,995 test_and_sets. Each test_and_set invalidates the copy held by the previous core that performed one, so 4,995 invalidation messages are sent. (This assumes the threads are fully interleaved.) Invalidations: 4,995

(3) Freeing the lock involves writing to the lock's address, which requires invalidating every other cache that holds the address. Because all of the other cores are spinning on test_and_set, and only one core holds the lock's cache line at a time, the winning thread invalidates only the last core to perform a test_and_set. Invalidations: 1
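The counts above show why a bare test_and_set spinlock scales poorly: every spin attempt is a write, so the lock's cache line ping-pongs between cores. One common refinement is test-and-test-and-set (TTAS): spin on an ordinary load, which each waiter can satisfy from its own cached copy, and attempt the atomic operation only when the lock appears free. A minimal sketch, reusing the `test_and_set` primitive assumed above:

```cpp
// Sketch only: test-and-test-and-set (TTAS) locking.
inline void ttas_lock(int *mutex_ptr) {
    while (true) {
        // Spin on an ordinary read first: it hits in the local cache,
        // so waiting generates no invalidation traffic. (Real code
        // would use a volatile or atomic load so the compiler actually
        // re-reads memory on every iteration.)
        while (*mutex_ptr == 1)
            /* wait */ ;
        // The lock looks free: try to grab it with one atomic write.
        if (test_and_set(mutex_ptr) == 0)
            return;  // previous value was 0, so we now hold the lock
    }
}
```

The trade-off is that releasing the lock now invalidates every spinning sharer at once, but while the lock is held the spin loop generates no coherence traffic, avoiding the 4,995 invalidations counted in (2).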