
Annotate and explain Quiz 6 and 7 with additional challenge problems

曾謙文

Quiz 6

Problem A

Given that we are developing an operating system kernel for RISC-V, implement mutex routines to handle locking and unlocking.

#include <stdbool.h>

typedef enum { MUTEX_UNLOCKED, MUTEX_LOCKED } mutex_t;

/* Attempt to acquire the mutex lock. Returns true if the lock was
 * successfully acquired, and false otherwise.
 */
bool mutex_trylock(mutex_t *mutex)
{
    int old;
    asm volatile (
        "__ A01 __ %0, %1, (%2)" /* Fill in your answer */
        : "=r"(old)
        : "r"(MUTEX_LOCKED), "r"(mutex)
        : "memory"
    );
    /* If the previous value of the mutex was MUTEX_LOCKED, it means
     * the lock was already held. Return false to indicate the lock
     * was not acquired.
     */
    return __ A02 __ ; /* Fill in your answer */
}

/* Spin until the mutex lock is acquired. */
void mutex_lock(mutex_t *mutex)
{
    while (!mutex_trylock(mutex)) {
        /* Busy wait */
    }
}

/* Release the mutex lock. */
void mutex_unlock(mutex_t *mutex)
{
    asm volatile (
        "__ A03 __ zero, zero, (%0)" /* Fill in your answer */
        :
        : "r"(mutex)
        : "memory"
    );
}

Complete the above implementation to ensure the mutex functions correctly using the RISC-V Atomic extension.

A01 = ?

amoswap.w.aq

A02 = ?

old != MUTEX_LOCKED (or, equivalently, old != 1)

A03 = ?

amoswap.w.rl

📝 Annotate

A01 amoswap.w.aq
The amoswap.w.aq instruction performs an atomic swap of a word in memory with a register value. The .aq (acquire) suffix ensures memory ordering semantics such that no subsequent memory accesses are reordered before this atomic operation. In the context of mutex_trylock, this ensures that the attempt to lock the mutex is visible to all threads/processes and enforces proper memory synchronization.

This instruction atomically swaps the value of the mutex with MUTEX_LOCKED, which attempts to acquire the lock. The previous value (old) is stored to determine if the lock was already held.


A02 (old != MUTEX_LOCKED)
The return value of mutex_trylock depends on whether the lock was successfully acquired. The variable old holds the mutex's previous value:

  • If old == MUTEX_UNLOCKED (or 0), it indicates that the lock was not held, and the function can acquire the lock by setting it to MUTEX_LOCKED.
  • If old == MUTEX_LOCKED (or 1), it indicates the lock is already held, and the function returns false.

Thus, the return condition old != MUTEX_LOCKED (or equivalently old != 1) reflects whether the lock acquisition was successful.


A03 amoswap.w.rl

The amoswap.w.rl instruction atomically swaps a word in memory with a register value and enforces release semantics. The .rl (release) suffix ensures memory ordering semantics such that no previous memory accesses are reordered after this atomic operation. In the context of mutex_unlock, this ensures that all updates made while holding the lock are visible before the mutex is unlocked.
This instruction sets the mutex back to MUTEX_UNLOCKED (or 0) atomically, ensuring that no thread observes the lock in an inconsistent state.
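Putting A01–A03 together, the filled-in routines look like this (a sketch assuming RV32 with the A extension; everything except the three blanks is unchanged from the listing above):

#include <stdbool.h>

typedef enum { MUTEX_UNLOCKED, MUTEX_LOCKED } mutex_t;

/* A01: amoswap.w.aq atomically swaps MUTEX_LOCKED into *mutex and
 * returns the previous value with acquire ordering. */
bool mutex_trylock(mutex_t *mutex)
{
    int old;
    asm volatile("amoswap.w.aq %0, %1, (%2)"
                 : "=r"(old)
                 : "r"(MUTEX_LOCKED), "r"(mutex)
                 : "memory");
    /* A02: the lock was acquired only if it was not already held. */
    return old != MUTEX_LOCKED;
}

/* Spin until the mutex lock is acquired. */
void mutex_lock(mutex_t *mutex)
{
    while (!mutex_trylock(mutex))
        ; /* Busy wait */
}

/* A03: amoswap.w.rl writes MUTEX_UNLOCKED (zero) with release ordering. */
void mutex_unlock(mutex_t *mutex)
{
    asm volatile("amoswap.w.rl zero, zero, (%0)" : : "r"(mutex) : "memory");
}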

Next, we will implement a condition variable based on the mutex implementation provided earlier.

/*
 * RISC-V lightweight condition variable implementation using mutexes.
 */
typedef struct {
    volatile unsigned int status;
    volatile unsigned int waiting;
    /* Mutex for protecting the condition variable state */
    mutex_t mutex;
} cond_t;

/* Initialize the condition variable */
#define cond_init(cond)               \
    do {                              \
        (cond)->status = 0;           \
        (cond)->waiting = 0;          \
        mutex_unlock(&(cond)->mutex); \
    } while (0)

/* Broadcast to all threads waiting on the condition variable */
#define cond_broadcast(cond)          \
    do {                              \
        mutex_lock(&(cond)->mutex);   \
        (cond)->status = 1;           \
        mutex_unlock(&(cond)->mutex); \
    } while (0)

/* Wait on the condition variable */
#define cond_wait(cond, mtx)                        \
    do {                                            \
        mutex_lock(&(cond)->mutex);                 \
        __ A04 __ ;  /* Fill in your answer */      \
        mutex_unlock(&(cond)->mutex);               \
                                                    \
        /* Release the external mutex */            \
        mutex_unlock(mtx);                          \
                                                    \
        /* Wait for broadcast signal */             \
        while (1) {                                 \
            __ A05 __ ;  /* Your answer */          \
            if ((cond)->status == 1) {              \
                (cond)->waiting--;                  \
                if ((cond)->waiting == 0) {         \
                    __ A06 __ ;  /* Your answer */  \
                }                                   \
                __ A07 __ ;  /* Your answer */      \
                break;                              \
            }                                       \
            __ A08 __ ;  /* Your answer */          \
        }                                           \
                                                    \
        /* Reacquire the external mutex */          \
        mutex_lock(mtx);                            \
    } while (0)

Complete the above implementation to ensure the condition variable function-like macros work correctly.

A04 = ?

(cond)->waiting++

A05 = ?

mutex_lock(&(cond)->mutex)

A06 = ?

(cond)->status = 0

A07 = ?

mutex_unlock(&(cond)->mutex)

A08 = ?

mutex_unlock(&(cond)->mutex)

📝 Annotate

A04 (cond)->waiting++

When a thread waits on a condition variable, it must indicate that it is waiting by incrementing the waiting counter. This ensures that the condition variable tracks how many threads are currently blocked and waiting for a broadcast signal.

By incrementing (cond)->waiting, the thread signals that it has joined the queue of waiting threads. This step is essential for proper synchronization and allows the cond_broadcast function to determine when all waiting threads have been notified.


A05 mutex_lock(&(cond)->mutex)

To check or modify the status of the condition variable, the thread must acquire the condition variable's internal mutex. This ensures that only one thread can access or modify the condition variable's state at a time.

This line locks the internal mutex of the condition variable, ensuring thread-safe access to its status field during the waiting loop.


A06 (cond)->status = 0
After all waiting threads have been signaled, the status of the condition variable should be reset to 0 to indicate that no further notifications are pending. This allows the condition variable to be reused in subsequent synchronization operations.
Resetting the status ensures that the condition variable does not incorrectly retain a "signaled" state after all waiting threads have been handled.


A07 mutex_unlock(&(cond)->mutex)
After modifying the condition variable's state, the thread releases the condition variable's internal mutex. This allows other threads to access the condition variable and ensures proper synchronization.
This line unlocks the internal mutex, ensuring that the condition variable remains available for other threads to use or modify.


A08 mutex_unlock(&(cond)->mutex)
If the thread finds that the condition variable's status is not set to 1, it releases the internal mutex and waits for the next opportunity to check the condition. This prevents a deadlock and allows other threads to operate on the condition variable.
This ensures that the thread does not block the condition variable's internal mutex while it waits for a broadcast signal.
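A minimal usage sketch of these macros (the task functions and the done flag are ours, not part of the quiz; mutex_t/cond_t come from the listings above, and static zero-initialization is equivalent to cond_init here):

static mutex_t m;            /* external mutex handed to cond_wait() */
static cond_t cv;            /* zero-initialized: status = 0, waiting = 0 */
static volatile int done;    /* shared state protected by m */

void consumer_task(void)
{
    mutex_lock(&m);
    while (!done)
        cond_wait(&cv, &m);  /* drops m while waiting, reacquires it before returning */
    /* ... consume the produced data ... */
    mutex_unlock(&m);
}

void producer_task(void)
{
    mutex_lock(&m);
    done = 1;                /* publish the state change under the external mutex */
    mutex_unlock(&m);
    cond_broadcast(&cv);     /* wake every waiter */
}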

Lastly, we aim to implement pthread_spin_lock using the RISC-V Atomic extension. Below is the C implementation:

typedef int pthread_spinlock_t; int pthread_spin_trylock(pthread_spinlock_t *lock); #define EBUSY 16 /* Device or resource busy */

And the RISC-V implementation:

/* Attempt to acquire a spinlock without blocking.
 * Return 0 on success or EBUSY if the lock is already held. */
pthread_spin_trylock:
    /* Save the input arguments */
    /* a0: lock pointer */

    /* Load the value of the lock into t0 */
    lw   t0, 0(a0)      /* t0 = *lock */

    /* Set t1 to the value of EBUSY */
    li   t1, EBUSY      /* t1 = EBUSY */

    /* Check if the lock is already held */
    __ A09 __           # Your answer!

    /* Attempt to acquire the lock using an atomic swap */
    __ A10 __           # Your answer!  /* t0 = atomic swap(*lock, EBUSY) */

lock_held:
    /* Return the result */
    mv   a0, t0         /* Move the result into a0 (return value) */
    ret                 /* Return from the function */

A09 = ?

bnez t0, lock_held

A10 = ?

amoswap.w.aq t0, t1, (a0)

📝 Annotate

A09 bnez t0, lock_held
The instruction bnez t0, lock_held checks whether the lock is already held by testing if the value in t0 is non-zero. If t0 is non-zero (indicating that the lock is held), the function jumps to the lock_held label to return the value of EBUSY. This avoids attempting to acquire the lock when it is already held.

  • bnez (branch if not equal to zero) checks whether the lock's current value (t0) is non-zero.
  • If the lock is held, it branches to the lock_held label to handle the case where the lock cannot be acquired.

A10 amoswap.w.aq t0, t1, (a0)
The amoswap.w.aq instruction performs an atomic swap between a memory location (*lock) and a register value (t1, which holds EBUSY). This atomic operation ensures that the attempt to acquire the lock is thread-safe. The .aq (acquire) suffix ensures proper memory ordering so that no subsequent memory operations are reordered before the lock is acquired.
amoswap.w.aq t0, t1, (a0) swaps the value of the lock (*lock) with t1 (which holds EBUSY).

  • The old value of the lock is stored in t0, which is later returned to indicate whether the lock was successfully acquired (t0 == 0) or already held (t0 != 0).
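For comparison with the C mutex in Problem A, the same trylock can be sketched as C with inline assembly (our restatement of the reference answer, assuming RV32 with the A extension):

#define EBUSY 16 /* Device or resource busy */

typedef int pthread_spinlock_t;

/* Mirrors A09/A10: bail out early if the lock looks held, otherwise
 * atomically swap EBUSY into *lock and return the previous value
 * (0 on success, EBUSY if another hart won the race). */
int pthread_spin_trylock(pthread_spinlock_t *lock)
{
    int old = *lock;
    if (old != 0)                              /* A09: bnez t0, lock_held */
        return EBUSY;
    asm volatile("amoswap.w.aq %0, %1, (%2)"   /* A10 */
                 : "=r"(old)
                 : "r"(EBUSY), "r"(lock)
                 : "memory");
    return old;
}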

Rewrite Problem A from Quiz 6 to operate in a bare-metal environment

Rewrite Problem A from Quiz 6: Operating in a bare-metal environment (using QEMU/RV32). The program should implement a mutex lock and condition variable, and include a test program.

  • Implementation:
    • In the implementation, the producer sequentially generates items 1 through 10. When the circular buffer reaches its maximum capacity of 5 items, the producer enters a waiting state until space becomes available. The condition variable not_full controls this waiting mechanism.
    • Meanwhile, consumers retrieve items in the same order they were produced (1-10), maintaining FIFO behavior. When no items are available (buffer empty), consumers wait on the not_empty condition variable until the producer adds new items.
  • Implementation Flowchart:
    (implementation flowchart and toolchain setup images)

  • Result:
    (screenshot of the test result)

  • Code
    (code flowchart image)


Problem B

Let's assume we are developing an operating system kernel, and now it's time to implement the context switching mechanism for RISC-V.

The generic code:

typedef struct {
	/*
	 * Note that sp must be the first element in the task struct
	 * for __switchto() to work.
	 */
	uint32_t sp; /* Saved stack pointer for context switch */
	atomic_t events; /* Bitmaps of received events */
	uint64_t runtime; /* Time spent in task */
	uint32_t *stack; /* Start of stack */
} task_;

/* Contexts for all tasks */
static task_ tasks[TASK_ID_COUNT] __attribute__((section(".bss.tasks")));

/* Reserve space to discard context on first context switch. */
uint32_t scratchpad[TASK_SCRATCHPAD_SIZE]
	__attribute__((section(".bss.task_scratchpad")));

task_ *current_task = (task_ *) scratchpad;

/*
 * Determines whether IRQs should chain to svc_handler(). This flag should be set if:
 *
 * 1) Task scheduling is active.
 *
 * 2) An interrupt sets an event that could unblock a higher-priority task. 
 *    svc_handler() will check for a potential task switch and clear this flag 
 *    afterward.
 */
int need_resched;

static inline task_ *__task_id_to_ptr(task_id_t id)
{
    return tasks + id;
}

task_ *next_sched_task(void)
{
    task_ *new_task = __task_id_to_ptr(__fls(tasks_ready & tasks_enabled));
    return new_task;
}

/* in interrupt context */
volatile bool in_interrupt;

void end_irq_handler(void)
{
}

IRQ code:

__irq_exit:
    jal end_irq_handler    /* Jump to the end of the interrupt handler */
    la t0, in_interrupt    /* Clear the in_interrupt flag to signal the end of the interrupt */
    sb zero, 0(t0)
    csrr sp, mscratch      /* Restore the stack pointer (sp) from the mscratch register */
/**
 * Task context switching
 *
 * Change the task scheduled after returning from an interruption.
 *
 * This function must be called in interrupt context.
 *
 * Save the registers of the current task below the interrupt context on
 * its task, then restore the live registers of the next task and set the
 * process stack pointer to the new stack.
 *
 * The structure of the saved context on the stack is:
 *     ra, a0-a7, t0-t6 (caller-saved), s0-s11 (callee-saved), mepc
 */
.global __switch_task
__switch_task:
    /* Get the (new) highest priority task pointer in a0 */
    jal next_sched_task

    /* Pointer to the current task (which we are switching from) */
    la t1, current_task
    lw t0, 0(t1)

    /* Reset the rescheduling request */
    la t2, need_resched
    sw zero, 0(t2)

    /* If no task switch is needed, return to the current task */
    beq a0, t0, __irq_exit

    /* Save the new scheduled task */
    sw a0, 0(t1)

    /* Save the current system stack pointer */
    add t3, sp, zero

    /* Restore the current process stack pointer */
    csrr sp, mscratch

    /* Save the task program counter */
    csrr t5, mepc

    /* Save callee-saved registers (s0-s11) on the current process stack */
    sw s11, -12*4(sp)
    sw s10, -11*4(sp)
    sw s9,  -10*4(sp)
    sw s8,  -9*4(sp)
    sw s7,  -8*4(sp)
    sw s6,  -7*4(sp)
    sw s5,  -6*4(sp)
    sw s4,  -5*4(sp)
    sw s3,  -4*4(sp)
    sw s2,  -3*4(sp)
    sw s1,  -2*4(sp)
    sw s0,  -1*4(sp)

    /* Save the program counter on the current process stack */
    __ B01 __                # Your answer here

    /* Adjust the stack pointer */
    addi sp, sp, __ B02 __   # Your answer here

    /* Save the task stack pointer in its context */
    sw sp, 0(t0)

    /* Get the new scheduled task stack pointer */
    lw sp, 0(a0)

    /* Adjust the stack pointer */
    addi sp, sp, __ B03 __   # Your answer here

    /* Restore the program counter from the next task's stack context */
    lw t0, __ B06 __
    csrw mepc, t0

    /* Restore callee-saved registers (s0-s11) from the next task's stack */
    lw s11, -12*4(sp)
    lw s10, -11*4(sp)
    lw s9,  -10*4(sp)
    lw s8,  -9*4(sp)
    lw s7,  -8*4(sp)
    lw s6,  -7*4(sp)
    lw s5,  -6*4(sp)
    lw s4,  -5*4(sp)
    lw s3,  -4*4(sp)
    lw s2,  -3*4(sp)
    lw s1,  -2*4(sp)
    lw s0,  -1*4(sp)

    /* Save sp to the scratch register and switch to the system stack */
    csrw __ B04 __ , __ B05 __

    /* Restore the system stack */
    add sp, t3, zero

    j __irq_exit

B01 = ?

sw t5, -13*4(sp)

B02 = ?

-13*4

B03 = ?

13*4

B04 = ?

mscratch

B05 = ?

sp

B06 = ?

-13*4(sp)

📝 Annotate

B01 sw t5, -13*4(sp)
The program counter (PC) of the current task is saved to the stack using sw t5, -13*4(sp). Since t5 holds the value of the program counter (mepc), this step ensures the task's execution state is stored and can be restored later when the task is rescheduled.
This saves the program counter at the -13*4 offset from the current stack pointer, ensuring proper ordering in the task's saved context.


B02 -13*4
After saving the registers and program counter to the stack, the stack pointer (sp) needs to be adjusted to account for the 13 saved values (12 callee-saved registers plus the program counter). The adjustment is negative because the stack grows downwards.
The value -13*4 reduces the stack pointer to reserve space for the saved context.


B03 13*4
When restoring the stack pointer for the next task, it must be incremented by the same amount that was subtracted during the save process (13*4). This adjustment ensures the stack pointer is restored to the correct position for the new task's context.
The value 13*4 accounts for the saved registers and program counter when restoring the new task's stack pointer.


B04 mscratch
The mscratch register is used as a temporary storage location for saving the system stack pointer (sp). Writing the current stack pointer (sp) to mscratch ensures that it can be restored later when switching back to the system stack.
The mscratch register serves as a scratchpad for saving/restoring the system stack pointer during the context switch.


B05 sp
The current system stack pointer (sp) is written to the mscratch register. This ensures that the system stack pointer can be restored when exiting the interrupt or performing subsequent context switches.
Writing sp to mscratch ensures the system stack pointer is preserved during the context switch.


B06 -13*4(sp)
To restore the program counter for the next task, the value is loaded from the stack at the offset -13*4 relative to the stack pointer (sp). This corresponds to the position where the program counter was saved during the context save process.
This retrieves the program counter of the new task from the saved context on its stack, enabling the resumption of its execution.
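For reference, the offsets used in B01–B03 and B06 correspond to the following saved block on the task's process stack (a sketch derived from the stores above; offsets are relative to the task sp before the addi sp, sp, -13*4, which is also the sp value after B03 restores it):

/*
 *   sp -  1*4 : s0
 *   sp -  2*4 : s1
 *   ...         (s2 ... s10)
 *   sp - 12*4 : s11
 *   sp - 13*4 : mepc   <- written by B01, reloaded through B06
 *
 * B02 (-13*4) reserves this block; B03 (+13*4) releases it on restore.
 */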


Rewrite Problem B from Quiz 6 to operate in a bare-metal environment

Rewrite Problem B from Quiz 6: Operating in a bare-metal environment (using QEMU/RV32). The program should implement task context switching with a simple scheduler, and include a test program.

  • Implementation:

    1. System Initialization

      • Stack pointer and privilege setup
      • Interrupt vector config
      • UART init
    2. Task Management

      • Two tasks implementation
      • Stack/context setup
      • Context switching
    3. Interrupt Handling

      • Timer interrupt
      • ECALL handling
      • Context save/restore
    4. Scheduling

      • Round-robin scheduling
      • Time slice rotation
      • Task yielding
    5. Debug Features

      • UART debug output
      • Task switch monitoring
  • Implementation Flowchart:
    (context-switch implementation flowchart image)

  • Code
    Using GDB, I set a breakpoint and found that execution jumps directly to handle_trap and loops there.
    Possible causes:

    • UART initialization failed
    • Interrupt handling error
    • Program counter (PC) set incorrectly

    (GDB session screenshots)

Problem C

Consider a complex task divided into four segments: PartA, PartB, PartC, and PartD. Each segment is assigned to a core in a quad-core processor system, where each core has its own dedicated cache. Cache coherence is maintained using a snoopy-based, write-invalidate MSI (Modified, Shared, Invalid) protocol, as depicted. The objective is to analyze the performance of this system and explore potential optimizations for improvement.

Each core begins by accessing and updating its local copy of the population data, stored under a designated label (e.g., pop_A for the population of PartA). This process is illustrated in the pseudocode below:

call pop_count      # Invoke a function to calculate the additional population in the area.
lw a1, <pop_label>  # Load the previous population count.
add a1, a1, a0      # Compute the new total by adding the additional population.
sw a1, <pop_label>  # Store the updated population count back into memory.

The sequence of memory accesses for this process is as follows:

(1) lw a1, pop_A (2) lw a1, pop_B (3) lw a1, pop_C (4) lw a1, pop_D
(5) sw a1, pop_A (6) sw a1, pop_B (7) sw a1, pop_C (8) sw a1, pop_D

Our focus is to closely analyze the activities occurring on Core A. Since the memory accesses are independent (i.e., pop_A, pop_B, pop_C, and pop_D correspond to distinct memory locations), we can concentrate solely on Core A's memory interactions to understand its behavior. Please complete the table below by filling in columns C01 and C02. For each entry, describe the bus transaction(s) triggered by each access and the resulting state of the cache line for pop_A after the access.

| Access | Shared bus transactions | Cache A |
|---|---|---|
| Initial state | - | pop_A: I |
| lw a1, pop_A | BusRd | pop_A: C01 |
| sw a1, pop_A | BusRdX | pop_A: C02 |

C01 = ?

S

C02 = ?

M

📝 Annotate

The MSI protocol defines three cache states:

  • M (Modified): The block has been changed in the cache, making the cache data inconsistent with the memory. A cache holding a block in the "M" state must write the block back to memory when it is evicted.
  • S (Shared): The block is unmodified and can exist in read-only mode in one or more caches. This block can be evicted from the cache without writing back to memory.
  • I (Invalid): The block is either not present in the cache or has been invalidated by a bus request. To store it in the cache, the block must be fetched from memory or another cache.

Actions and State Transitions:

  • When Core A reads data (lw a1, pop_A), it issues a BusRd request to load the data from memory, moving the cache line from the I state to the S state.
  • When Core A writes to pop_A (sw a1, pop_A), it issues a BusRdX request to gain exclusive access to the data, transitioning the cache line to the M state.

We are interested in exploring whether adopting the MESI protocol, as illustrated, could improve the performance of this processor. Assuming the four cores utilize a snoopy-based, write-invalidate MESI protocol, what would be the sequence of bus transactions and the cache line states for pop_A on Core A? Please complete the table below by filling in columns C03 and C04.

| Access | Shared bus transactions | Cache A |
|---|---|---|
| Initial state | - | pop_A: I |
| lw a1, pop_A | BusRd | pop_A: C03 |
| sw a1, pop_A | - | pop_A: C04 |

C03 = ?

E

C04 = ?

M

📝 Annotate

The MESI protocol improves multi-core processor performance by efficiently managing cache line states across cores. Here's how Core A interacts with its cache line pop_A under the MESI protocol:

Initial State:

  • pop_A starts in the Invalid (I) state, meaning Core A's cache does not have the data or the data is outdated.

After Load Operation (lw a1, pop_A):

  1. A BusRd transaction is triggered to fetch the data from memory because pop_A is not in the Modified (M) or Exclusive (E) state.
  2. After the operation, pop_A transitions to the Exclusive (E) state in Core A's cache. This means Core A now has the data and can read it without intending to modify it, assuming no other core has the same data cached.

After Store Operation (sw a1, pop_A):

  1. Since pop_A is already in the Exclusive (E) state, no additional bus transaction is required.
  2. The state of pop_A changes to Modified (M), indicating Core A has updated the data, and the changes are not yet written back to memory.

Each core then updates the overall census count, stored at a shared location labeled sum. This process is illustrated in the following pseudocode:

lw a2, sum       # Load the current global census count.
add a2, a2, a0   # Add the local population count to the global count.
sw a2, sum       # Store the updated global census count.

The resulting sequence of memory accesses is as follows:

(1) lw a2, sum (3) lw a2, sum (4) lw a2, sum (6) lw a2, sum
(2) sw a2, sum (7) sw a2, sum (5) sw a2, sum (8) sw a2, sum

We observe a critical issue in the current census counting implementation: there is a possibility that two processors may simultaneously read sum, each add their local population count to it independently, and then write back to sum, resulting in inaccuracies.

To ensure the final value of sum accurately reflects the combined a0 values from all processors, semaphores should be integrated strategically to enforce mutual exclusion. These semaphores should minimize instruction blocking while preserving data correctness. Define and initialize any necessary semaphores, then insert WAIT and SIGNAL commands into the pseudocode below to ensure accurate updates to sum. It is not required to include ecall instructions for this modification.

Semaphore Initialization:

sum_sem = 1

Pseudocode:

call pop_count      # L1: counts the new population in the area
lw a1, <pop_label>  # L2: loads the previous population
add a1, a1, a0      # L3: calculates the new total
sw a1, <pop_label>  # L4: stores the updated population
lw a2, sum          # L5: retrieves the current global count
add a2, a2, a0      # L6: updates the global population count
sw a2, sum          # L7: writes the updated population count

Specify the exact instruction, labeled as L1 to L7, after which the WAIT(sum_sem) command should be inserted to ensure exclusive access to sum during critical updates, thus maintaining data accuracy.

C05 = ?

L4 or sw a1, <pop_label>

Identify the exact instruction, labeled as L1 to L7, after which the SIGNAL(sum_sem) command should be placed. This ensures exclusive access to sum during critical updates, maintaining data accuracy.

C06 = ?

L7 or sw a2, sum

📝 Annotate

To ensure mutual exclusion when updating the global variable sum, the WAIT and SIGNAL semaphore operations should enclose the critical section where sum is accessed. This guarantees that only one processor at a time can read and update sum, preventing race conditions.

C05 L4
Reason: After completing updates to the local population (L4), but before retrieving the global sum value at L5, we must ensure mutual exclusion. The WAIT(sum_sem) ensures that no other processor can access sum until the current processor finishes its update.

C06 L7
Reason: After writing the updated global population count (L7), the critical section is complete, and the semaphore can be released. The SIGNAL(sum_sem) allows other processors to access sum.
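Because sum_sem is initialized to 1, it acts as a binary semaphore, so on this system WAIT and SIGNAL can be realized directly with the mutex from Problem A. A minimal sketch (the macro definitions are ours; only the pseudocode's WAIT/SIGNAL names are kept):

/* Binary semaphore (initial value 1) built on the Problem A mutex.
 * A statically allocated mutex_t starts as MUTEX_UNLOCKED, i.e. "available". */
static mutex_t sum_sem;

#define WAIT(sem)   mutex_lock(&(sem))    /* P(): acquire exclusive access to sum */
#define SIGNAL(sem) mutex_unlock(&(sem))  /* V(): release it */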

  • Modified Pseudocode with Semaphore Integration

    call pop_count      # L1: counts the new population in the area
    lw a1, <pop_label>  # L2: loads the previous population
    add a1, a1, a0      # L3: calculates the new total
    sw a1, <pop_label>  # L4: stores the updated population
    WAIT(sum_sem)       # C05: Acquire semaphore before accessing sum
    lw a2, sum          # L5: retrieves the current global count
    add a2, a2, a0      # L6: updates the global population count
    sw a2, sum          # L7: writes the updated population count
    SIGNAL(sum_sem)     # C06: Release semaphore after updating sum

We are writing code to compute the greatest common divisor (GCD) using popcount.

/* Count the number of 1 bits in the binary representation of x */
int popcount(long long x)
{
    x = (x & 0x5555555555555555LL) + ((x >> 1) & 0x5555555555555555LL);
    x = (x & 0x3333333333333333LL) + ((x >> 2) & 0x3333333333333333LL);
    x = (x & 0x0f0f0f0f0f0f0f0fLL) + ((x >> 4) & 0x0f0f0f0f0f0f0f0fLL);
    x = (x & 0x00ff00ff00ff00ffLL) + ((x >> 8) & 0x00ff00ff00ff00ffLL);
    x = (x & 0x0000ffff0000ffffLL) + ((x >> 16) & 0x0000ffff0000ffffLL);
    return (x & 0x00000000ffffffffLL) + (x >> 32 & 0x00000000ffffffffLL);
}

/* Count the number of trailing zeros in x */
int ctz(long long x)
{
    return popcount( /* C07: Your answer! */ );
}

/* binary GCD algorithm */
long long gcd(long long a, long long b)
{
    if (a == 0) return b; // GCD(0, b) = b
    if (b == 0) return a; // GCD(a, 0) = a

    // Count the common factors of 2 in a and b
    int k = ctz( /* C08: Your answer! */ );
    a >>= ctz(a); // Remove factors of 2 from a
    b >>= ctz(b); // Remove factors of 2 from b

    while (a != b) {
        if (a > b) {
            a -= b; // Reduce a
            a >>= ctz(a); // Remove factors of 2
        } else {
            b -= a; // Reduce b
            b >>= ctz(b); // Remove factors of 2
        }
    }

    /* Restore the common factors of 2 */
    return /* C09: Your answer! */;
}

Complete the above GCD implementation to make it work as expected.

Reference: Binary GCD algorithm

C07 = ?

(x & (-x)) - 1

C08 = ?

a | b

C09 = ?

a << k
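The reference gives no annotation for C07–C09, so here is the reasoning: x & (-x) isolates the lowest set bit of x, subtracting 1 turns it into a run of exactly ctz(x) one-bits, and popcount of that run equals ctz(x). For C08, ctz(a | b) is the number of factors of 2 common to a and b, which C09 restores at the end with a << k (equivalently b << k, since a == b by then). A small self-check sketch, using the GCC/Clang builtins only as an independent reference (the test range is ours):

#include <stdio.h>

int main(void)
{
    /* Verify the C07 trick: popcount((x & -x) - 1) == ctz(x). */
    for (long long x = 1; x <= 64; x++) {
        int via_trick = __builtin_popcountll((x & (-x)) - 1);
        if (via_trick != __builtin_ctzll(x)) {
            printf("mismatch at %lld\n", x);
            return 1;
        }
    }
    printf("C07 trick matches ctz for x = 1..64\n");
    return 0;
}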

Problem D

The diagram below illustrates a classic fully-bypassed 5-stage pipeline, enhanced with an unpipelined divider operating in parallel with the ALU. While bypass paths are not depicted, the iterative divider generates 2 bits per cycle until producing a complete 32-bit result.

In this pipeline design, asynchronous interrupts are handled during the MEM stage and trigger a jump to a dedicated interrupt trap handler address. The interrupt latency is defined as the number of cycles from when an interrupt request is raised in the MEM stage to when the first instruction of the interrupt handler reaches the MEM stage.

Consider the following code execution scenario. An interrupt is raised during cycle 8, causing a jump to isr. The label isr refers to an Interrupt Service Routine, which increments a counter at a fixed memory address before resuming the original execution context. The architectural guarantee of precise interrupts must be preserved. Assume all memory accesses take one cycle in the MEM stage.

lw   x2, 0(x1)
div  x1, x2, x3
slli x3, x2, 1
lui  x4, 0x100
addi x4, x4, 0xf
xor  x5, x3, x4
sub  x3, x5, x2
...

isr:
    sw   x1, 0(x0)    # Save register to a known location
    lw   x1, 4(x0)    # Load counter value
    addi x1, x1, 1
    sw   x1, 4(x0)    # Increment counter and store it
    lw   x1, 0(x0)    # Restore register before returning
    mret              # Return from interrupt handler

Notably, the div instruction in RISC-V does not raise data-dependent exceptions. To prevent pipeline stalls during a multi-cycle divide operation, the pipeline control logic permits subsequent instructions that are independent of the divide result to proceed and complete before the divide finishes execution.

The RISC-V instructions MRET, HRET, SRET, and URET are used to return from traps in M-mode, H-mode, S-mode, and U-mode, respectively. When executing an xRET instruction:

  • The privilege level changes to the value stored in xPP.
  • The interrupt-enable bit (yIE) is updated with the value in xPIE.
  • The xPIE field is set to 1.
  • The xPP field is reset to U.

Trap handlers typically save the current system state:

PC -> MEPC
Privileged mode -> MPP
MIE -> MPIE

And restore the system state upon executing the MRET instruction:

MEPC -> PC
MPP -> Privileged mode
MPIE -> MIE

Reference: The RISC-V Instruction Set Manual: Volume II - Privileged Architecture

What is the interrupt latency for the code above? __ D01 __ cycles

14
cycles 8 to 21, inclusive.

The interrupt is raised in cycle 4. The earliest uncommitted instruction when the interrupt is taken is labeled with "EPC" (exception PC), while "MTVEC" denotes the first instruction of the interrupt trap handler.

📝 Annotate

According to the problem statement, div does not raise data-dependent exceptions, and while a multi-cycle division is in flight the pipeline control logic lets younger instructions that do not depend on its result proceed and complete.
The divider produces 2 bits per cycle, so the 32-bit quotient takes 16 cycles; if it has run exactly 6 cycles at the moment the interrupt is detected, 10 cycles of the division remain.
The design assumes that the multi-cycle div cannot be safely rolled back midway, so it must be allowed to drain to the WB stage.
Those remaining 10 cycles are spent waiting between detecting the interrupt (in MEM) and actually being able to flush the pipeline and redirect fetch to isr, which is why the latency stretches to 14 cycles (cycles 8 through 21).

To ensure the program continues executing correctly, which instruction should the isr return to? Answer in one of the instructions shown above.
D02 = ?

lui x4, 0x100
The EPC (exception PC) should point to the earliest uncommitted instruction.

📝 Annotate

When the external interrupt request arrives, it is detected during the MEM stage of the instruction lw x2, 0(x1). Because RISC‑V mandates precise interrupts, this lw is allowed to complete and commit in the following cycle, ensuring that it has written its result correctly. Meanwhile, the processor must identify which subsequent instructions have not yet safely committed by the time the interrupt is taken. Any such instructions must be either flushed or rolled back so that the system can restart from a well‐defined architectural state.

A closer inspection of the pipeline timing reveals that the earliest instruction still in a partially executed (i.e., uncommitted) state is lui x4, 0x100. Although instructions such as div x1, x2, x3 or slli x3, x2, 1 may have entered the pipeline already, by the time the interrupt is recognized and the pipeline is flushed, those instructions are deemed safely retired or do not need re‐execution. Therefore, the privilege architecture sets the EPC (Exception Program Counter) to the first instruction that has not fully committed—namely lui x4, 0x100. When the ISR finishes and executes mret, control flow resumes at that instruction

To reduce interrupt latency, we propose constraining the destination register for a div instruction. The ABI could be modified to reserve a specific x register to always store the divide result, allowing the interrupt handler to avoid using it. In a more generalized exception handler, reading this register could be deferred until the end of the context save, potentially hiding the divide latency.

Does this approach require any hardware modifications? __ D03 __ (Answer in Yes or No)
What is the corresponding limitation? __ D04 __ (Explain!)

No
It relies on code voluntarily complying with the ABI. (Answers with a similar meaning receive full credit.)

📝 Annotate

The main limitation is that a general‐purpose register must be “sacrificed” to hold the div result. This reduces flexibility: other code cannot freely use that register, and compilers must insert additional move instructions if the result is needed in a different register.


Problem E

Suppose you have been tasked with evaluating the performance of two processor designs:

  • Processor1
  • Processor2

You plan to benchmark both processors using the following assembly sequence:

L1: slli x11, x10, 2     # x11 = x10 << 2 (calculate the byte offset)
    lw   x12, 0x80(x11)  # Load word from memory at address (0x80 + x11) into x12
    add  x13, x13, x12   # x13 = x13 + x12
    addi x10, x10, 1     # Increment x10 by 1
    blt  x10, x14, L1    # Branch to L1 if x10 < x14 (loop condition)
    sub  x14, x14, x15   # x14 = x14 - x15
    xori x15, x14, 0x1   # x15 = x14 ^ 0x1

Processor1 is a 4-stage pipelined processor with full bypassing, designed to reduce the number of cycles required to execute each instruction. It combines the EXE and MEM stages, with load requests sent in the EXE stage and received in the WB stage one cycle later. The IF stage speculatively fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a branch misprediction. Branches are resolved in the EXE stage.

Processor2 is a 6-stage pipelined processor with full bypassing, designed to improve performance by increasing the clock speed. The EXE stage is divided into two stages (EXE1 and EXE2) to reduce the critical path through the ALU. The IF stage always fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a taken branch. ALU and Branch ALU results become available in the EXE2 stage.

How many cycles does it take Processor1 to execute one iteration of the loop? __ E01 __ cycle(s)

8

📝 Annotate

The loop body:

slli x11, x10, 2
lw   x12, 0x80(x11)
add  x13, x13, x12
addi x10, x10, 1
blt  x10, x14, L1

In this 4-stage pipeline (IF, DEC, EXE, WB), lw computes its address in EXE and receives the data in WB one cycle later. The following add uses the load result immediately, so one bubble must be inserted to avoid reading a stale register value. In addition, the conditional branch blt x10, x14, L1 is resolved only in EXE, so when the branch is taken the two speculatively fetched sequential instructions must be annulled, a 2-cycle penalty. Counting from the fetch of slli x11, x10, 2 until the next iteration fetches slli again: 5 instructions + 1 load-use bubble + 2 branch-penalty cycles = 8 cycles per iteration.
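One possible cycle-by-cycle reconstruction (ours, not the official diagram), assuming load data can be bypassed to EXE one cycle after the lw reaches WB and that wrong-path fetches are annulled when the taken blt resolves in EXE:

  • slli: IF 1, DEC 2, EXE 3, WB 4
  • lw: IF 2, DEC 3, EXE 4, WB 5 (data returns in WB)
  • add: IF 3, DEC 4, stall 5, EXE 6, WB 7 (the 1-cycle load-use bubble)
  • addi: IF 4, held 5, DEC 6, EXE 7, WB 8
  • blt: IF 6, DEC 7, EXE 8 (taken; the two sequentially fetched wrong-path instructions are annulled)
  • slli of the next iteration: IF 9

Iteration time = 9 − 1 = 8 cycles, i.e. 5 instructions + 1 load-use bubble + 2 branch-penalty cycles.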

How many cycles does it take Processor2 to execute one iteration of the loop? __ E02 __ cycle(s)

12

📝 Annotate

Two bubbles are needed for the lw → add dependency in this deeper pipeline:

  • Cycle 3 NOP: inserted to delay the add instruction because the result of lw is not available yet (the lw is still on its way to MEM).
  • Cycle 5 NOP: inserted because the add still cannot execute in EXE2 without the result from lw, which only becomes available in WB.

These NOPs resolve the data hazards caused by the pipeline's inability to forward the load data earlier in this particular design; together with the penalty for the taken branch (resolved in EXE2, while IF keeps fetching PC+4), one loop iteration takes 12 cycles.

Given its clock period of 6 ns, what is the loop execution time for Processor1? __ E03 __ ns per loop iteration

48
8 cycles × 6 ns/cycle = 48 ns per loop iteration

Loop time = cycles per loop × clock period per cycle


Don't do this! You don't have to copy and paste from the original reference solution. Instead, annotate along with your thoughts and experiments.

Quiz 7

Problem B

General Matrix Multiply (GEMM) is a widely used algorithm in linear algebra, machine learning, statistics, and various other fields. It offers a more complex trade-off space compared to the previous tutorial, as there are multiple ways to structure the computation. These include techniques such as blocking, inner products, outer products, and systolic arrays. Below is the reference implementation:

void sgemm(int m, int n, int k,
           float *XA, int lda,
           float *XB, int ldb,
           float *XC, int ldc)
{
    /* Local variables */
    int i, j, p;
    float alpha = 1.0, beta = 1.0;

    /* Early exit for invalid input */
    if (m == 0 || n == 0 || k == 0)
        return;

    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            for (p = 0; p < k; p++) {
                XC[j * ldc + i] += XA[p * lda + i] * XB[j * ldb + p];
            }
        }
    }
}

Take note of the variables i, j, and p used in the nested loops, as they play a key role in iterating over the computation.

Adjusting the order of i, j, and p does not affect the result, but it can impact performance. Considering the cache, what is the optimal order for best performance? Answer in the form "i-j-p".

  • B01 = ?

    j-p-i
    The j-p-i order is well-suited for cache prefetching, resulting in a higher L1 cache hit rate compared to the i-p-j order.

📝 Annotate

The matrices are stored in column-major order:

  • i is the fastest-varying (row) index in memory.
  • j is effectively the column index for C and B.
  • p is the "summation" index over the shared dimension k.

With i innermost, the accesses XC[j * ldc + i] and XA[p * lda + i] are unit-stride, while XB[j * ldb + p] is loop-invariant in the inner loop. In other words, from outer to inner the optimal order is j → p → i, as the code sketch below shows.
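Restated as code, the best ordering just interchanges the loops of the reference implementation; the arithmetic is unchanged (a sketch using the same names as the listing above):

/* j-p-i ordering: the innermost i loop walks XA and XC with unit stride,
 * while XB[j * ldb + p] stays fixed inside it. */
for (int j = 0; j < n; j++)
    for (int p = 0; p < k; p++)
        for (int i = 0; i < m; i++)
            XC[j * ldc + i] += XA[p * lda + i] * XB[j * ldb + p];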

What order results in the worst performance?

  • B02 = ?

    i-p-j

📝 Annotate

The worst performance occurs when the loop order disrupts data locality, causing frequent cache misses due to non-contiguous memory access. For column-major storage (as used in the reference implementation), the worst order is:

i → p → j

Why is "i-p-j" the worst?

  • Access to C[j * ldc + i]
    In this order, j is in the innermost loop. Since j corresponds to the column index, changing j results in large strides of ldc in memory. This results in poor locality when writing to C.

  • Access to A[p * lda + i]
    With p in the middle loop, i is fixed while iterating over p. Since p varies, the memory access to A[p * lda + i] skips rows in memory, causing poor cache utilization.

  • Access to B[j * ldb + p]
    With j in the innermost loop, p is fixed while iterating over j. This results in large memory strides for B[j * ldb + p] due to ldb.

Comparison to other orders

Every other loop order maintains some degree of locality for at least one array. For instance:

  • j → i → p keeps access to C contiguous in the innermost loop.
  • p → i → j keeps access to A contiguous for the innermost loop.

But in i → p → j, none of the arrays benefit from good locality, leading to the worst cache performance.

Test the performance of a specific order across different data sizes. From the graph, it can be observed that as the data size increases, overall performance decreases (the fluctuations in the middle may be related to cache evictions and reloads).

It can be observed that as the data size increases, FLOPS (Floating-point operations per second) will decline. Why does FLOPS decrease rapidly as the matrix size grows larger?

  • B03 = ?

    When matrices A and B are smaller than the L2 cache, GEMM only needs to read memory equivalent to the size of A and B from DDR. However, when A and B exceed the size of the L2 cache, due to the row-major layout of B or the column-major layout of A being non-contiguous in memory, GEMM reads more memory from DDR than the size of A and B. This increases cache misses, leading to degraded performance.

📝 Annotate

As the matrix size grows larger, the rapid decline in FLOPS (floating-point operations per second) can be explained by memory-hierarchy effects, specifically cache inefficiencies. Key reasons for the performance decrease:

  • Cache misses: small matrices often fit entirely within the CPU's L1 or L2 cache, allowing data reuse with minimal memory-access overhead. As the matrix size grows, the working set (data required for computation) exceeds cache capacity, leading to increased cache evictions, more frequent accesses to slower levels of the memory hierarchy (e.g., L3 cache or main memory), and poor locality due to non-contiguous access patterns, especially for column-major storage.

  • Memory bandwidth limitation: once the data no longer fits in the cache, the CPU becomes memory-bound rather than compute-bound. Performance is limited by the bandwidth of the memory subsystem, which cannot supply data to the CPU fast enough to sustain peak floating-point throughput.

  • Cache associativity conflicts: for large matrices, access patterns may cause conflicts in cache lines due to limited associativity, resulting in frequent cache-line invalidations or evictions.

  • DRAM latency: accessing main memory (DRAM) introduces significant latency compared to accessing data in the cache; the rapid performance decline is often due to the processor idling while waiting for memory accesses to complete.

To make GEMM faster, let's divide a large matrix into smaller submatrices so that each submatrix can fit entirely in the cache.

First, partition the matrix along the m- and n-dimensions without blocking the k-dimension, and then expand:

As shown in the diagram:

  • The A matrix (the middle one) is divided along the m-dimension, with each submatrix sized (MR, k).
  • The B matrix (the right one) is divided along the n-dimension, with each submatrix sized (k, NR).

The multiplication of an (MR, k) submatrix from A and a (k, NR) submatrix from B produces an (MR, NR) submatrix of C (the left one).

Use optimization techniques such as loop unrolling and register caching (reusing register results). Next, divide the large matrix into smaller submatrices to ensure they fit entirely in the cache, thereby improving access performance. The process involves splitting the large matrix into smaller submatrices that can fully reside in the cache for efficient access.

The parameters must meet certain constraints:

MC×KC must be less than half the size of the L2 cache.


Consider the corresponding code below:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Define block sizes */
#define dgemm_mc 72   /* Block size for the m dimension */
#define dgemm_nc 4080 /* Block size for the n dimension */
#define dgemm_kc 256  /* Block size for the k dimension */

/* Helper macros */
#define min(a, b) ((a) < (b) ? (a) : (b))

/* Function to allocate aligned memory */
static inline void *malloc_aligned(size_t alignment, size_t size, size_t type_size)
{
    return aligned_alloc(alignment, size * type_size);
}

/* Pack matrix A */
inline void packa_mcxkc(int m, int k, float *xa, int ldxa, float *packa)
{
    for (int p = 0; p < k; p++) {
        for (int i = 0; i < m; i++) {
            *packa++ = *(xa + p * ldxa + i);
        }
    }
}

/* Pack matrix B */
inline void packb_kcxnc(int n, int k, float *xb, int ldxb, float *packb)
{
    for (int j = 0; j < n; j++) {
        for (int p = 0; p < k; p++) {
            *packb++ = *(xb + j * ldxb + p);
        }
    }
}

/* Macro kernel */
void macro_kernel(int m, int n, int k, float *packa, float *packb, float *c, int ldc)
{
    int i, p, j;
    for (j = 0; j < n; j++) {          /* Start 2-nd loop */
        for (p = 0; p < k; p++) {      /* Start 1-st loop */
            float elem_b = packb[j * k + p];
            float *p_elem_a = &packa[p * m];
            float *p_elem_c = &c[j * ldc];
            for (i = 0; i < m; i++) {  /* Start 0-th loop */
                *p_elem_c++ += *p_elem_a++ * elem_b;
            }                          /* End 0-th loop */
        }                              /* End 1-st loop */
    }                                  /* End 2-nd loop */
}

/* SGEMM implementation */
void sgemm(int m, int n, int k,
           float *xa, int lda,
           float *xb, int ldb,
           float *c,  /* must be aligned */
           int ldc)   /* ldc must also be aligned */
{
    int ic, ib, jc, jb, pc, pb;
    float *packa, *packb;

    /* Early return if possible */
    if (m == 0 || n == 0 || k == 0) {
        printf("sgemm(): early return\n");
        return;
    }

    /* Allocate packing buffers */
    packa = malloc_aligned(64, (dgemm_kc * (dgemm_mc + 1)), sizeof(float));
    packb = malloc_aligned(64, (dgemm_kc * (dgemm_nc + 1)), sizeof(float));

    for (jc = 0; jc < n; jc += dgemm_nc) {       /* 5-th loop around micro-kernel */
        jb = min(n - jc, dgemm_nc);
        for (pc = 0; pc < k; pc += pb) {         /* 4-th loop around micro-kernel */
            pb = min(k - pc, dgemm_kc);
            packb_kcxnc(jb, pb, &xb[/* B04: Fill your code here */], ldb, packb);
            for (ic = 0; ic < m; ic += ib) {     /* 3-rd loop around micro-kernel */
                ib = min(m - ic, dgemm_mc);
                packa_mcxkc(ib, pb, &xa[/* B05: Fill your code here */], lda, packa);
                macro_kernel(ib, jb, pb, packa, packb,
                             &c[/* B06: Fill your code here */], ldc);
            }                                    /* End 3-rd loop around micro-kernel */
        }                                        /* End 4-th loop around micro-kernel */
    }                                            /* End 5-th loop around micro-kernel */

    free(packa);
    free(packb);
}
  • B04 = ? (Line 94)

    jc * ldb + pc

  • B05 = ? (Line 99)

    pc * lda + ic

  • B06 = ? (Line 101)

    jc * ldc + ic

📝 Annotate

B04 (Line 94)

This corresponds to the starting address of the submatrix of B that is being packed. The starting position is determined by the current block column (jc) and the current position in the k-dimension (pc):

&xb[jc * ldb + pc]
  • jc * ldb: shifts to the jc-th column block of B.
  • pc: shifts to the current position in the k-dimension.

B05 (Line 99)

This corresponds to the starting address of the submatrix of A that is being packed. The starting position is determined by the current block row (ic) and the current position in the k-dimension (pc):

&xa[pc * lda + ic]
  • pc * lda: shifts to the pc-th row block of A.
  • ic: shifts to the current position in the m-dimension.

B06 (Line 101)

This corresponds to the starting address of the submatrix of C that is being updated. The starting position is determined by the current block row (ic) and the current block column (jc):

&c[jc * ldc + ic]
  • jc * ldc: shifts to the jc-th column block of C.
  • ic: shifts to the current position in the m-dimension.

Final Code with Missing Parts Filled:

packb_kcxnc(jb, pb, &xb[jc * ldb + pc], ldb, packb);

packa_mcxkc(ib, pb, &xa[pc * lda + ic], lda, packa);

macro_kernel(ib, jb, pb, packa, packb, &c[jc * ldc + ic], ldc);

  1. Matrix B:

    • B is packed into blocks to improve cache locality for the inner loops.
    • The starting address takes into account the current column offset (jc) and row offset (pc).
  2. Matrix A:

    • A is packed row-wise to improve cache locality for the micro-kernel computation.
    • The starting address takes into account the current row offset (ic) and k-dimension offset (pc).
  3. Matrix C:

    • C is the output matrix, updated block by block.
    • The starting address is determined by the offsets ic (row) and jc (column).

These adjustments ensure the correct blocks of data are passed into the packing functions and the macro kernel.


Problem C

Determining the center of a peak in discrete data, such as a series of bins, can be useful for identifying the "center" of a signal, particularly in contexts like discrete Fourier transform (DFT) bins, touch sensors, or detecting impulses. This method works especially well with normal distributions but can also be effective for any symmetric or nearly symmetric distribution.

To locate the center, identify the highest peak in the dataset and examine the values of the bins immediately to its left and right. Using these values, you can calculate the center's position relative to the peak using the formula below, which estimates the offset d from the highest cell:

int n = index of the highest bin in a local data set;
float lowercell = data[n-1];
float center = data[n];
float uppercell = data[n+1];
float d = (uppercell - lowercell) * 0.5 / min(uppercell - center, lowercell - center);


This approach does not handle cases involving multiple peaks or multipath effects. In such scenarios, it is necessary to identify other peaks, account for their contributions, and iteratively refine the data to isolate and analyze additional peaks. Additionally, this method only applies to sources with a distinct hill-shaped distribution, since only for such distributions can a center be detected mathematically.

For computational efficiency, consider using fixed-point arithmetic instead of floating-point. Fixed-point math provides higher accuracy with the same bit-width and is generally faster, particularly in embedded systems, even when a floating-point unit (FPU) is available. This makes it a practical choice for resource-constrained environments where performance and precision are critical.

Consider the code below:

void sum_fixed()
{
    /* Assumes a fixed-point representation with 16 fractional bits */
    int a = 132345; /* Represents 2.019424438 in floating-point (a / 65536) */
    int b = 7491;   /* Represents 0.114303589 in floating-point (b / 65536) */

    int sum = a + b; /* Fixed-point addition */
    printf("Sum: %f\n", sum / 65536.0); /* Output as floating-point: 2.133728 */
}

This example illustrates the use of fixed-point arithmetic with 16 fractional bits, a common technique in embedded systems for efficient numerical computation. The variables a and b are represented in a fixed-point format, where the fractional part is scaled by 2^16 = 65536. For instance, the value a = 132345 corresponds to 132345 / 65536 = 2.019424438, and b = 7491 corresponds to 7491 / 65536 = 0.114303589.

The addition operation a + b combines the fixed-point representations directly without any conversion, keeping the result in the same fixed-point format. To display the result in a human-readable floating-point format, the program divides the sum by 65536.0, which is the scaling factor for 16 fractional bits.

The program outputs the sum as 2.133728, demonstrating how fixed-point arithmetic preserves computational efficiency while allowing precise representation of fractional numbers. This approach is particularly valuable in resource-constrained environments where performance and accuracy are critical.
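For readability, the Q16.16 conversions used above can be wrapped in small helper macros. A minimal sketch (the names TO_FIX and TO_FLOAT are ours, not part of the quiz):

#include <stdio.h>

#define FRAC_BITS 16
#define TO_FIX(f)   ((int) ((f) * (1 << FRAC_BITS)))    /* float -> Q16.16 */
#define TO_FLOAT(x) ((x) / (double) (1 << FRAC_BITS))   /* Q16.16 -> float */

int main(void)
{
    int a = TO_FIX(2.019424438); /* ~132345 */
    int b = TO_FIX(0.114303589); /* ~7491 */
    printf("Sum: %f\n", TO_FLOAT(a + b)); /* ~2.133728 */
    return 0;
}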

The code below shows a method for computing the product while avoiding overflows.

void product_fixed()
{
    int a = 132345;      /* Represents 2.019424438 in floating-point (a / 65536) */
    int big_b = 4442414; /* Represents 67.785858154 in floating-point (big_b / 65536) */

    /* Adjusted multiplication to prevent overflow */
    int product = (/* C01: Fill your code */) * (/* C02: Fill your code */); /* Avoids overflow with slight precision loss */
    printf("Product (correct): %f\n", product / 65536.0); /* Correct result: 136.629456 */
}

Complete the code to ensure it works as expected. The variable a must be defined before the variable big_b. You should use shifts.

  • C01 = ?

    a >> 8 or equivalent

  • C02 = ?

    big_b >> 8 or equivalent

📝 Annotate

In fixed-point arithmetic, multiplication can overflow, especially when both operands are large: multiplying two Q16.16 values produces a result scaled by 2^32, which does not fit in a 32-bit int. The magnitude therefore has to be reduced by shifting before the multiply. The specific steps are:

  • Pre-shift the operands: shift one or both operands right before multiplication to reduce their values.
  • Restore the scale: choose the shifts (or shift the result afterwards) so that the product ends up back on the original fixed-point scale.

To prevent overflow:

  1. Scale both a and big_b down by shifting each of them right by 8 bits (dividing by 2^8 = 256).
  2. Perform the multiplication.
  3. Because each operand gave up 8 fractional bits, the product of the two pre-shifted values is already back on the 2^16 scale (2^8 × 2^8 = 2^16), so it can be printed directly after dividing by 65536.0.

  1. C01: a >> 8
    Scale down a by dividing it by (2^8 = 256) to reduce its magnitude.

  2. C02: big_b >> 8
    Similarly, scale down big_b by dividing it by (2^8 = 256).


By scaling both operands equally:

  • The likelihood of overflow during multiplication is further reduced.
  • The scaling is distributed evenly, minimizing the loss of precision compared to scaling only one operand.

A 64-bit alternative that avoids the precision loss entirely is sketched below.
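On a target with 64-bit integer support, an alternative (not the quiz answer, which requires shifts) is to widen the intermediate product and shift once afterwards:

#include <stdint.h>

/* Alternative sketch: widen to 64 bits, multiply, then shift back to Q16.16.
 * For a = 132345 and big_b = 4442414 this yields the full-precision product
 * (about 136.89), slightly different from the pre-shifted result above. */
static int product_fixed64(int a, int b)
{
    return (int) (((int64_t) a * b) >> 16);
}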

Let's perform division in fixed-point arithmetic, reversing the approach of multiplication.

void division_fixed()
{
    int a = 132345; /* Represents 2.019424438 in floating-point (a / 65536) */
    int b = 7491;   /* Represents 0.114303589 in floating-point (b / 65536) */

    /* Adjust numerator and denominator to maximize precision and avoid overflows */
    int division = (/* C03: Fill your code */) / (/* C04: Fill your code */);
    printf("Division: %f\n", division / 65536.0); /* Result: 17.674072 (Close to expected) */
}

The use of shifts achieves three goals:

  1. It avoids overflows by reducing the size of the numerator and denominator appropriately.
  2. It maximizes the precision of the numerator by applying a larger shift to it.
  3. It maintains the correct scaling for the fixed-point arithmetic.

Complete the code to ensure it works as expected. You should use shifts.

  • C03 = ?

    a << 12 or equivalent

  • C04 = ?

    b >> 4 or equivalent
    The result, when converted back to floating-point by dividing by 65536.0, is approximately 17.674072, which is close to the expected value. This approach demonstrates how to manage precision and avoid overflow in fixed-point division while adhering to the constraints of the representation.

📝 Annotate

In fixed-point arithmetic, it's common to adjust the scaling of operands to maintain precision and prevent overflow during operations. This often involves shifting operations, which are efficient in binary systems. For instance, left-shifting a number by 12 bits (a << 12) effectively multiplies it by 2^12, and right-shifting by 4 bits (b >> 4) divides it by 2^4. These adjustments align the operands to a common scaling factor so that the quotient comes out on the Q16.16 scale while avoiding overflow, facilitating accurate division.
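A quick scale check of the chosen shifts, writing each fixed-point code as value · 2^16:

(a << 12) / (b >> 4) = (value_a · 2^28) / (value_b · 2^12) = (value_a / value_b) · 2^16

so the quotient is already in Q16.16 format. Numerically, 2.019424438 / 0.114303589 ≈ 17.667, and the printed 17.674072 differs only by the precision lost when b is truncated by the right shift.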


Problem D

You are expected to calculate an approximate square root by determining the next power of 2 using the __builtin_clz function provided by GCC/Clang.

int mysqrt(int i)
{
    if (i == 0) return 0;
    int x = 1 << (/* D01: Fill your code */);
    x = (x + i / x) / 2;
    return x;
}


Complete the code to ensure it works as expected. You should invoke the __builtin_clz function.

  • D01 = ?

    (32 - __builtin_clz(i)) / 2 or equivalent

📝 Annotate

Using 32 - __builtin_clz(i) ensures better precision and logical consistency when calculating the initial approximation of the square root in a 32-bit integer system. Here's the reasoning:

32-bit Integer Length:
__builtin_clz(i) calculates the leading zero count in the binary representation of i. Subtracting this from 32 gives the number of significant bits (or bit length) of i, which accurately reflects its magnitude in binary.

Purpose of the Approximation:
The formula 1 << ((32 - __builtin_clz(i)) / 2) estimates the square root of i by picking a power of 2 near the actual square root. For example, if i = 16 (binary 10000), its bit length is 5 (32 - 27 leading zeros), so the initial approximation is 1 << (5 / 2) = 1 << 2 = 4. The single Newton-Raphson step x = (x + i / x) / 2 then refines (or here, confirms) the estimate: (4 + 16 / 4) / 2 = 4, the exact square root.
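A minimal sketch with D01 filled in, plus a couple of spot checks (the test values are ours; results are approximate because only one Newton step is taken):

#include <stdio.h>

/* D01 filled in: the initial guess is 2^(bit_length(i) / 2),
 * refined by a single Newton-Raphson step. */
static int mysqrt(int i)
{
    if (i == 0) return 0;
    int x = 1 << ((32 - __builtin_clz(i)) / 2);
    x = (x + i / x) / 2;
    return x;
}

int main(void)
{
    /* Prints: 4 10 31 (exact, exact, approximate -- sqrt(1000) is ~31.6) */
    printf("%d %d %d\n", mysqrt(16), mysqrt(100), mysqrt(1000));
    return 0;
}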


The Fast Fourier Transform (FFT) is a computational method used to efficiently compute the discrete Fourier transform of a sequence. This implementation also includes a function for convolution, defined as:

conv(a, b) = c, where c[x] = Σ_i a[i] · b[x − i]

This convolution function calculates the resulting sequence c by summing the products of elements from the input sequences a and b, with one sequence reversed and shifted. It is particularly useful in signal processing and applications requiring fast computation of polynomial or data convolutions. Consider the corresponding code below:

#include <math.h>
#include <complex.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#define PI 3.14159265358979323846

void fft(complex double *x, complex double *roots, int N)
{
    if (N <= 1) return;

    /* Allocate memory for even and odd parts */
    complex double *even = malloc(N / 2 * sizeof(complex double));
    complex double *odd = malloc(N / 2 * sizeof(complex double));
    complex double *rs = malloc(N / 2 * sizeof(complex double));

    for (int i = 0; i < N / 2; i++) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
        rs[i] = roots[2 * i];
    }

    fft(even, rs, N / 2);
    fft(odd, rs, N / 2);

    for (int k = 0; k < N / 2; k++) {
        complex double t = roots[k] * odd[k];
        x[k] = even[k] + t;
        x[k + N / 2] = even[k] - t;
    }

    free(even);
    free(odd);
    free(rs);
}

void conv(const double *a, const double *b, int na, int nb, double **result, int *nc)
{
    int s = na + nb - 1;
    int L = /* D02: Fill your code */;
    int n = 1 << L;
    *nc = s;

    complex double *av = calloc(n, sizeof(complex double));
    complex double *bv = calloc(n, sizeof(complex double));
    complex double *roots = malloc(n * sizeof(complex double));

    for (int i = 0; i < n; i++)
        roots[i] = cexp(-2 * PI * I * i / n);
    for (int i = 0; i < na; i++)
        av[i] = a[i];
    for (int i = 0; i < nb; i++)
        bv[i] = b[i];

    fft(av, roots, n);
    fft(bv, roots, n);

    /* Conjugate roots */
    for (int i = 0; i < n; i++)
        roots[i] = conj(roots[i]);

    complex double *cv = malloc(n * sizeof(complex double));
    for (int i = 0; i < n; i++)
        /* D03: Fill your code */;

    fft(cv, roots, n);

    *result = malloc(s * sizeof(double));
    for (int i = 0; i < s; i++)
        (*result)[i] = creal(cv[i]) / n;

    free(av);
    free(bv);
    free(roots);
    free(cv);
}

int main()
{
    double a[] = {1.0, 2.0, 3.0};
    double b[] = {4.0, 5.0};
    int na = sizeof(a) / sizeof(a[0]), nb = sizeof(b) / sizeof(b[0]);
    double *c;
    int nc;

    conv(a, b, na, nb, &c, &nc);

    printf("Convolution result:\n");
    for (int i = 0; i < nc; i++)
        printf("%f\n", c[i]);

    free(c);
    return 0;
}

The reference output:

Convolution result:
4.000000
13.000000
22.000000
15.000000

Complete the code to ensure it works as expected. You should invoke the __builtin_clz function.

  • D02 = ? (Line 38)

    32 - __builtin_clz(s)

  • D03 = ? (Line 61)

    av[i] * bv[i] or equivalent

📝 Annotate

D02

s Represents the Convolution Size:

  • s = na + nb - 1 is the total number of elements in the convolution result.
  • The goal is to find the smallest power of 2 greater than or equal to s.

Using __builtin_clz:

  • __builtin_clz(s) calculates the number of leading zeros in the binary representation of s. For a 32-bit integer, subtracting this count from 32 gives the number of bits required to represent s.
  • Because s is used directly (without subtracting 1 first), 1 << (32 - __builtin_clz(s)) is always a power of 2 at least as large as s; when s is already a power of 2 the result is twice as large as strictly needed, which is harmless for the FFT.

Updated Calculation:

  • L = 32 - __builtin_clz(s) calculates the number of bits required.
  • n = 1 << L ensures n is a power of 2 that accommodates s.

D03

D03 should perform the point-wise multiplication of the two frequency-domain arrays av and bv to compute the convolution in the frequency domain.

  • The convolution theorem states that convolution in the time domain is equivalent to element-wise multiplication in the frequency domain.
  • D03 is responsible for multiplying each corresponding element of the arrays av and bv.
  • av and bv contain the FFT-transformed values of a and b, respectively.
  • Multiplying these in the frequency domain is necessary before performing the inverse FFT to get the convolution result in the time domain.
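A minimal sketch of the two fills in isolation, assuming a 32-bit int and a nonzero s (the helper names fft_levels and pointwise_multiply are ours, not part of the quiz listing). Incidentally, the quiz listing also needs <stdio.h> and <stdlib.h> for printf and malloc/calloc/free.

#include <complex.h>

/* D02: number of bits needed so that n = 1 << L is a power of two >= s.
 * Because s is not decremented first, n may be up to twice as large as
 * strictly necessary when s is already a power of two, which is harmless. */
static inline int fft_levels(int s)
{
    return 32 - __builtin_clz(s);
}

/* D03: convolution theorem - multiply the spectra element-wise before
 * running the inverse transform. */
static inline void pointwise_multiply(complex double *cv,
                                      const complex double *av,
                                      const complex double *bv, int n)
{
    for (int i = 0; i < n; i++)
        cv[i] = av[i] * bv[i];
}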

Problem E

The core concept of data-level parallelism is vectorized computation, which involves performing operations on multiple elements of a single vector simultaneously.
[Figure: vectorized computation across array elements]

Consider the following function, which computes the product of all elements in an array:

int product_naive(int n, int *a) { int product = 1; for (int i = 0; i < n; i++) product *= a[i]; return product; }

Your task is to vectorize this function using the RISC-V Vector extension to leverage parallel processing.

int product_vectorized(int n, int *a) { int result[4] = {1, 1, 1, 1}; for (int i = 0; i < n / 4 * 4; i += 4) { asm volatile ( "__ E01 __ t0, zero, e32, m1\n" "__ E02 __ v0, (%1)\n" "vmul.vv v1, v1, v0\n" "__ E03 __ v1, (%0)\n" : "=r"(result) : "r"(a + i), "r"(result) : "t0", "v0", "v1" ); } /* Handle tail case */ for (int i = n / 4 * 4; i < n; i++) result[0] *= a[i]; return result[0] * result[1] * result[2] * result[3]; }

Complete the code to ensure it works as expected.

  • E01 = ?

    vsetvli

  • E02 = ?

    vlw.v

  • E03 = ?

    vsw.v

📝 Annotate

  1. vsetvli: Configures the vector operation based on the element width (e32 for 32-bit integers) and the register-group multiplier (m1, i.e., LMUL = 1, a group of one vector register). The register t0 receives the resulting vector length.
  2. vlw.v: Loads 4 consecutive 32-bit elements from memory into vector register v0 (in the ratified RVV 1.0 specification the equivalent mnemonic is vle32.v).
  3. vsw.v: Stores the running products in v1 back to the memory location pointed to by result (vse32.v in RVV 1.0).
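For comparison, here is one way the completed loop body could look when written against the ratified RVV 1.0 mnemonics (vsetivli, vle32.v, vse32.v) instead of the draft-0.7 names used in the quiz; it also reloads the accumulator from result each iteration so v1 always holds the running partial products. This is a sketch (assuming VLEN ≥ 128 so four 32-bit elements fit in one register), not the quiz's reference answer.

/* Vectorized product using RVV 1.0 mnemonics. */
int product_vectorized(int n, int *a)
{
    int result[4] = {1, 1, 1, 1};
    for (int i = 0; i < n / 4 * 4; i += 4) {
        asm volatile(
            "vsetivli t0, 4, e32, m1, ta, ma\n" /* 4 x 32-bit elements   */
            "vle32.v  v1, (%0)\n"               /* load running products */
            "vle32.v  v0, (%1)\n"               /* load a[i..i+3]        */
            "vmul.vv  v1, v1, v0\n"             /* element-wise multiply */
            "vse32.v  v1, (%0)\n"               /* store partial products*/
            :
            : "r"(result), "r"(a + i)
            : "t0", "v0", "v1", "memory");
    }
    for (int i = n / 4 * 4; i < n; i++)         /* handle the tail       */
        result[0] *= a[i];
    return result[0] * result[1] * result[2] * result[3];
}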

Problem F

Your task is to implement a simulator for the RISC-V instruction set. Below is a partial C source code snippet for instruction decoding and execution.

#define SEXT(x, len) ({ struct { int64_t n : len; } __x = { .n = x }; (uint64_t)__x.n; }) #define R(i) gpr(i) #define CSR(i) csr(i) #define Mr vaddr_read #define Mw vaddr_write #define PC _info->pc #define NPC _info->dnpc #define rd decode_rd(inst) #define rs1 decode_rs1(inst) #define rs2 decode_rs2(inst) typedef struct { void (*handler)(); } funct7_item; typedef struct { funct7_item funct7_pool[0x80]; void (*handler)(); } funct3_item; typedef struct { funct3_item funct3_pool[0x8]; void (*handler)(); } opcode_item; static opcode_item pool[0x80]; static uint32_t inst; static exec_t *_info = NULL; #define IMM(TYPE) decode_imm ## TYPE(inst) #define INST_EXEC(NAME, TYPE, ...) \ static void __ ## NAME() { \ __VA_ARGS__; \ } #define _____ -1 #define _________ -1 #define RULE(NAME, TYPE, FUNCT7, FUNCT3, OPCODE) \ do { \ if (FUNCT3 == _____) { \ pool[OPCODE].handler = __ ## NAME; \ break; \ } \ if (FUNCT7 == _________) { \ pool[OPCODE].funct3_pool[FUNCT3].handler = __ ## NAME; \ break; \ } \ pool[OPCODE].funct3_pool[FUNCT3].funct7_pool[FUNCT7].handler = __ ## NAME; \ } while (0) // special handler for ones that can be matched with raw inst... static inline void __special_handler() { extern word_t isa_raise_intr(word_t no, vaddr_t epc); switch (inst) { case 0b00000000000100000000000001110011: // ebreak SET_STATE(gpr(10) ? ABORT : END); break; case 0b00000000000000000000000001110011: // ecall NPC = isa_raise_intr(0xb, PC); break; case 0b00110000001000000000000001110011: // mret NPC = CSR(MEPC); set_csr_field(MSTATUS, MSTATUS_MIE, MSTATUS_MIE_SHIFT, get_csr_field(MSTATUS, MSTATUS_MPIE, MSTATUS_MPIE_SHIFT)); set_csr_field(MSTATUS, MSTATUS_MPIE, MSTATUS_MPIE_SHIFT, 1); break; default: assert(0); } } INST_EXEC(add, R, R(rd) = R(rs1) + R(rs2)) INST_EXEC(and, R, R(rd) = R(rs1) & R(rs2)) INST_EXEC(div, R, R(rd) = (sword_t)R(rs1) / (sword_t)R(rs2)) INST_EXEC(divu, R, R(rd) = R(rs1) / R(rs2)) INST_EXEC(mret, R, __special_handler()) INST_EXEC(mul, R, R(rd) = R(rs1) * R(rs2)) INST_EXEC(mulh, R, R(rd) = (int64_t)(sword_t)R(rs1) * (int64_t)(sword_t)R(rs2) >> 32LL) INST_EXEC(mulhu, R, R(rd) = (uint64_t)R(rs1) * (uint64_t)R(rs2) >> 32ULL) INST_EXEC(or, R, R(rd) = R(rs1) | R(rs2)) INST_EXEC(rem, R, R(rd) = (sword_t)R(rs1) % (sword_t)R(rs2)) INST_EXEC(remu, R, R(rd) = R(rs1) % R(rs2)) INST_EXEC(sll, R, R(rd) = R(rs1) << R(rs2)) INST_EXEC(slt, R, R(rd) = (sword_t)R(rs1) < (sword_t)R(rs2)) INST_EXEC(sltu, R, R(rd) = R(rs1) < R(rs2)) INST_EXEC(sra, R, R(rd) = (sword_t)R(rs1) >> (R(rs2) & 0x1f)) INST_EXEC(srl, R, R(rd) = R(rs1) >> (R(rs2) & 0x1f)) INST_EXEC(sub, R, R(rd) = R(rs1) - R(rs2)) INST_EXEC(xor, R, R(rd) = R(rs1) ^ R(rs2)) INST_EXEC(addi, I, R(rd) = R(rs1) + IMM(I)) INST_EXEC(andi, I, R(rd) = R(rs1) & IMM(I)) INST_EXEC(csrrs, I, word_t imm = IMM(I); word_t t = CSR(imm); __ F01 __; R(rd) = t) INST_EXEC(csrrw, I, word_t imm = IMM(I); word_t t = CSR(imm); __ F02 __; R(rd) = t) INST_EXEC(ebreak, I, __special_handler()) INST_EXEC(ecall, I, __special_handler()) INST_EXEC(jalr, I, word_t t = PC + 4; NPC = (__ F03 __) & ~1; R(rd) = t) INST_EXEC(lb, I, R(rd) = SEXT(Mr(R(rs1) + IMM(I), 1), 8)) INST_EXEC(lbu, I, R(rd) = Mr(R(rs1) + IMM(I), 1)) INST_EXEC(lh, I, R(rd) = SEXT(Mr(R(rs1) + IMM(I), 2), __ F04 __)) INST_EXEC(lhu, I, R(rd) = Mr(R(rs1) + IMM(I), __ F05 __)) INST_EXEC(lw, I, R(rd) = Mr(R(rs1) + IMM(I), __ F06 __)) INST_EXEC(ori, I, R(rd) = R(rs1) | IMM(I)) INST_EXEC(slli, I, R(rd) = R(rs1) << (IMM(I) & __ F07 __)) INST_EXEC(slti, I, R(rd) = (sword_t)R(rs1) < (sword_t)IMM(I)) INST_EXEC(sltiu, I, R(rd) = R(rs1) < 
IMM(I)) INST_EXEC(srai, I, R(rd) = (sword_t)R(rs1) >> (IMM(I) & 0x3f)) INST_EXEC(srli, I, R(rd) = R(rs1) >> (IMM(I) & __ F08 __)) INST_EXEC(xori, I, R(rd) = R(rs1) ^ IMM(I)) INST_EXEC(sb, S, Mw(R(rs1) + IMM(S), 1, R(rs2))) INST_EXEC(sh, S, Mw(R(rs1) + IMM(S), 2, R(rs2))) INST_EXEC(sw, S, Mw(R(rs1) + IMM(S), 4, R(rs2))) INST_EXEC(beq, B, if (R(rs1) == R(rs2)) NPC = PC + IMM(B)) INST_EXEC(bge, B, if ((sword_t)R(rs1) >= (sword_t)R(rs2)) NPC = PC + IMM(B)) INST_EXEC(bgeu, B, if (R(rs1) >= R(rs2)) NPC = PC + IMM(B)) INST_EXEC(blt, B, if ((sword_t)R(rs1) < (sword_t)R(rs2)) NPC = PC + IMM(B)) INST_EXEC(bltu, B, if (R(rs1) < R(rs2)) NPC = PC + IMM(B)) INST_EXEC(bne, B, if (R(rs1) != R(rs2)) NPC = PC + IMM(B)) INST_EXEC(auipc, U, R(rd) = PC + IMM(U)) INST_EXEC(lui, U, R(rd) = IMM(U)) INST_EXEC(jal, J, R(rd) = PC + 4, __ F09 __) void init_inst_pool() { memset(pool, 0, sizeof(pool)); RULE(add, R, 0b0000000, 0b000, 0b0110011); RULE(and, R, 0b0000000, __ F10 __, 0b0110011); RULE(div, R, 0b0000001, 0b100, 0b0110011); RULE(divu, R, 0b0000001, 0b101, 0b0110011); RULE(mret, R, 0b0011000, 0b000, 0b1110011); RULE(mul, R, 0b0000001, 0b000, 0b0110011); RULE(mulh, R, 0b0000001, 0b001, 0b0110011); RULE(mulhu, R, 0b0000001, 0b011, 0b0110011); RULE(or, R, 0b0000000, __ F11 __, 0b0110011); RULE(rem, R, 0b0000001, 0b110, 0b0110011); RULE(remu, R, 0b0000001, 0b111, 0b0110011); RULE(sll, R, 0b0000000, 0b001, 0b0110011); RULE(slt, R, 0b0000000, 0b010, 0b0110011); RULE(sltu, R, 0b0000000, 0b011, 0b0110011); RULE(sra, R, 0b0100000, 0b101, 0b0110011); RULE(srl, R, 0b0000000, 0b101, 0b0110011); RULE(sub, R, 0b0100000, 0b000, 0b0110011); RULE(xor, R, 0b0000000, 0b100, 0b0110011); RULE(addi, I, _________, __ F12 __, 0b0010011); RULE(andi, I, _________, 0b111, 0b0010011); RULE(csrrs, I, _________, 0b010, 0b1110011); RULE(csrrw, I, _________, 0b001, 0b1110011); RULE(ebreak, I, 0b0000000, 0b000, __ F13 __); RULE(ecall, I, 0b0000000, 0b000, __ F14 __); RULE(jalr, I, _________, 0b000, __ F17 __); RULE(lb, I, _________, 0b000, 0b0000011); RULE(lbu, I, _________, 0b100, 0b0000011); RULE(lh, I, _________, 0b001, 0b0000011); RULE(lhu, I, _________, 0b101, 0b0000011); RULE(lw, I, _________, 0b010, 0b0000011); RULE(ori, I, _________, 0b110, 0b0010011); RULE(slli, I, 0b000000, 0b001, 0b0010011); RULE(slti, I, _________, 0b010, 0b0010011); RULE(sltiu, I, _________, 0b011, 0b0010011); RULE(srai, I, __ F15 __, 0b101, 0b0010011); RULE(srli, I, __ F16 __, 0b101, 0b0010011); RULE(xori, I, _________, 0b100, 0b0010011); RULE(sb, S, _________, 0b000, 0b0100011); RULE(sh, S, _________, 0b001, 0b0100011); RULE(sw, S, _________, 0b010, 0b0100011); RULE(beq, B, _________, 0b000, 0b1100011); RULE(bge, B, _________, 0b101, 0b1100011); RULE(bgeu, B, _________, 0b111, 0b1100011); RULE(blt, B, _________, 0b100, 0b1100011); RULE(bltu, B, _________, 0b110, 0b1100011); RULE(bne, B, _________, 0b001, 0b1100011); RULE(auipc, U, _________, _____, __ F18 __); RULE(lui, U, _________, _____, __ F19 __); RULE(jal, J, _________, _____, __ F20 __); } static inline void insn_exec() { void (*handler)() = NULL; uint32_t opcode = decode_opcode(inst); uint32_t funct3 = decode_funct3(inst); uint32_t funct7 = decode_funct7(inst); if ((handler = pool[opcode].funct3_pool[funct3].handler)) { handler(); } else if ((handler = pool[opcode].funct3_pool[funct3].funct7_pool[funct7].handler)) { handler(); } else if ((handler = pool[opcode].handler)) { handler(); } else if ((handler = pool[opcode].funct3_pool[funct3].funct7_pool[funct7 >> 1].handler)) { handler(); } else { 
Error("Failed to execute inst 0x%" PRIx32 " at PC 0x%" PRIaddr "", inst, cpu.pc); SET_STATE(ABORT); } __ F21 __; /* reset $zero */ } void inst_exec_once(exec_t *info) { extern void itrace(exec_t *info, uint32_t inst); /* Instruction fetch */ inst = vaddr_ifetch(info->snpc, sizeof(inst)); itrace(info, inst); info->snpc += (vaddr_t)sizeof(uint32_t); _info = info; _info->dnpc = _info->snpc; /* Instruction execution */ insn_exec(); }
  • F01 = ? (Line 99)

    CSR(imm) = t | R(rs1)

  • F02 = ? (Line 100)

    CSR(imm) = R(rs1)

  • F03 = ? (Line 103)

    R(rs1) + IMM(I)

  • F04 = ? (Line 106)

    16

  • F05 = ?

    2

  • F06 = ?

    4

  • F07 = ?

    0x3f or equivalent

  • F08 = ?

    0x3f or equivalent

  • F09 = ?

    NPC = PC + IMM(J) or equivalent

  • F10 = ?

    0b111 or equivalent

  • F11 = ?

    0b110 or equivalent

  • F12 = ?

    0b000 or equivalent

  • F13 = ?

    0b1110011 or equivalent

  • F14 = ?

    0b1110011 or equivalent

  • F15 = ?

    0b010000 or equivalent

  • F16 = ?

    0b000000 or equivalent

  • F17 = ?

    0b1100111 or equivalent

  • F18 = ?

    0b0010111 or equivalent

  • F19 = ?

    0b0110111 or equivalent

  • F20 = ?

    0b1101111 or equivalent

  • F21 = ?

    R(0) = 0 or equivalent

📝 Annotate

F01 = CSR(imm) = t | R(rs1)
This instruction (csrrs) reads a value from a special register (CSR), does an OR operation with rs1, saves the result to rd, and updates the CSR with the new value.


F02 = CSR(imm) = R(rs1)
For csrrw, it writes the value of rs1 into a special register (CSR) and saves the old value of the CSR into rd.


F03 = R(rs1) + IMM(I)
This is for jalr, which calculates the target address by adding the value in rs1 and an immediate value. It ensures the target address is even (aligned).


F04 = 16
The lh instruction reads 16 bits (2 bytes) of data from memory and extends it to 32 or 64 bits with the same sign.


F05 = 2
The lhu instruction reads 16 bits (2 bytes) from memory, but it treats the data as positive (unsigned).


F06 = 4
The lw instruction reads 32 bits (4 bytes) of data from memory.


F07 = 0x3f
This ensures the shift for slli does not exceed the maximum allowed (63 for 64-bit registers).


F08 = 0x3f
Similar to slli, this ensures srli shifts within a valid range (up to 63 for 64-bit).


F09 = NPC = PC + IMM(J)
The jal instruction jumps to a new program counter (PC) calculated by adding the immediate value to the current PC.


F10 = 0b111
This is the funct3 value for the and instruction, which does a bitwise AND operation.


F11 = 0b110
This is the funct3 value for the or instruction, which does a bitwise OR operation.


F12 = 0b000
This funct3 value is for addi, which adds an immediate value to rs1 and saves the result to rd.


F13 = 0b1110011
This is the opcode for the ebreak instruction, which triggers a break (for debugging).


F14 = 0b1110011
This is the opcode for the ecall instruction, which makes a system call or software interrupt.


F15 = 0b010000
This is the funct7 value for srai, which does an arithmetic right shift (keeps the sign).


F16 = 0b000000
This is the funct7 value for srli, which does a logical right shift (fills with zeros).


F17 = 0b1100111
This is the opcode for jalr, a jump instruction based on a register value.


F18 = 0b0010111
This is the opcode for auipc, which calculates an address by adding an immediate value to the PC.


F19 = 0b0110111
This is the opcode for lui, which loads an immediate value into the upper bits of a register.


F20 = 0b1101111
This is the opcode for jal, which performs an unconditional jump to a new PC.


F21 = R(0) = 0
The x0 register in RISC-V is always zero. This makes sure its value does not change.
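As a self-contained illustration of the semantics behind F01–F03 and F21 (written against hypothetical reg[], csr[], pc, and npc stand-ins rather than the simulator's macros):

#include <stdint.h>

static uint32_t reg[32];    /* x0..x31, with reg[0] forced back to 0 (F21) */
static uint32_t csr[4096];  /* CSR file                                    */
static uint32_t pc, npc;

/* csrrs rd, csr, rs1  (F01: CSR(imm) = t | R(rs1)) */
static void do_csrrs(int rd, int imm, int rs1)
{
    uint32_t t = csr[imm];
    csr[imm] = t | reg[rs1];   /* set the bits selected by rs1   */
    reg[rd] = t;               /* rd receives the old CSR value  */
    reg[0] = 0;
}

/* csrrw rd, csr, rs1  (F02: CSR(imm) = R(rs1)) */
static void do_csrrw(int rd, int imm, int rs1)
{
    uint32_t t = csr[imm];
    csr[imm] = reg[rs1];       /* overwrite the CSR              */
    reg[rd] = t;
    reg[0] = 0;
}

/* jalr rd, imm(rs1)  (F03: R(rs1) + IMM(I), with bit 0 cleared) */
static void do_jalr(int rd, int rs1, int32_t imm)
{
    uint32_t t = pc + 4;
    npc = (reg[rs1] + imm) & ~1u;  /* target is forced to be even */
    reg[rd] = t;
    reg[0] = 0;
}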


Problem G

Our processor features an 8-entry direct-mapped TLB, indexed using the lowest-order bits of the virtual page number. The program below operates on a matrix A, which consists of 4 rows and 256 columns. Each element in A is a 32-bit integer, laid out in row-major order (i.e., consecutive elements in the same row occupy contiguous memory locations). The program computes the sum of the entries in a top-right submatrix B of matrix A. Assume A starts at virtual address 0x0000, and the result of the sum is stored in a register. Instruction fetches can be ignored.

int sum = 0; for (int i = 0; i < H; i++) for (int j = 0; j < W; j++) sum += A[i][j];

[Figure: matrix A with its top-right submatrix B]

How many TLB misses will this program encounter when H = 4 and W = 256 (i.e., when B = A)?

  • G01 = ?

    32
    Matrix A spans 32 contiguous 128 B pages. Since all elements of A are accessed sequentially in the order they are laid out in memory, there will be a total of 32 TLB misses.

📝 Annotate

  1. The matrix A has 4 rows and 256 columns.
    • Each element is 4 bytes (32-bit integer).
    • The matrix is stored in row-major order, meaning:
      • A[0][0] to A[0][255] represent the first row.
      • A[1][0] to A[1][255] represent the second row, and so on.
  2. The program calculates the sum of all elements in matrix A.
    • The outer loop iterates over H = 4 rows.
    • The inner loop iterates over W = 256 columns.
  3. Virtual Memory and TLB (Translation Lookaside Buffer):
    • The matrix A is stored in virtual memory.
    • Each memory access requires the CPU to check the TLB (a small cache) to find the corresponding physical memory location.
    • Direct-mapped TLB:
      • The TLB has 8 entries (slots).
      • Each entry holds the mapping for one virtual page.
      • If two virtual pages map to the same entry, the older page is replaced, causing a TLB miss.

When the program accesses matrix A, which spans multiple virtual pages, 32 TLB misses occur in total:

  1. A has 4 rows × 256 columns = 1024 elements.
  2. Each element is 4 bytes, so the total size of A is 1024 × 4 = 4096 bytes (4 KB).
  3. Each page is 128 bytes, so the matrix A is divided into 4096 / 128 = 32 pages.
  4. Each row occupies 256 × 4 = 1024 bytes, which spans 1024 / 128 = 8 pages per row; the matrix A as a whole spans 32 pages.
  5. Each time a new page is accessed for the first time, it causes a TLB miss. Since there are 32 pages in total and the program accesses each one sequentially, there are 32 TLB misses (the sketch below replays this count).
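A minimal, self-contained sketch of the miss counting (assuming a direct-mapped TLB indexed by VPN mod 8, 128 B pages, and A at address 0): for H = 4, W = 256 it reports 32 misses and a 31/32 hit rate, and swapping the two loops reproduces the column-major behavior discussed later.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 8
#define PAGE_SIZE   128          /* bytes */
#define ROWS        4
#define COLS        256

int main(void)
{
    /* tag[] holds the virtual page number cached in each TLB slot. */
    uint32_t tag[TLB_ENTRIES];
    bool valid[TLB_ENTRIES] = {false};
    long misses = 0, accesses = 0;

    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            uint32_t addr = (uint32_t)(i * COLS + j) * sizeof(int); /* A at 0x0 */
            uint32_t vpn  = addr / PAGE_SIZE;
            uint32_t idx  = vpn % TLB_ENTRIES;   /* direct-mapped index */
            accesses++;
            if (!valid[idx] || tag[idx] != vpn) {
                misses++;                        /* TLB miss: refill entry */
                valid[idx] = true;
                tag[idx] = vpn;
            }
        }
    }
    printf("accesses=%ld misses=%ld hit rate=%.4f\n",
           accesses, misses, 1.0 - (double)misses / accesses);
    return 0;
}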

What is the highest possible TLB hit rate that this program can achieve?

  • G02 = ?

    31/32
    Keep in mind that each page contains 32 elements of matrix A. The maximum TLB hit rate occurs when there is 1 miss to load a page into the TLB, followed by 31 hits, resulting in a hit rate of 31/32.
    To achieve this, W must be 32 × N, where 0 < N ≤ 8, and H can be any value.

When matrix A is accessed, the program processes its elements row by row in memory order. Each page contains 32 elements, and due to the direct-mapped nature of the TLB, entering a new page causes 1 TLB miss. The remaining 31 accesses within that page result in TLB hits because the page is already loaded into the TLB. This access pattern ensures that the hit rate for a single page is 31/32.

To achieve the highest TLB hit rate, the matrix's width W must be 32 × N with 0 < N ≤ 8, i.e., a multiple of 32 (up to 256). This ensures that the accesses align with page boundaries, avoiding unnecessary TLB conflicts. The number of rows H does not affect the hit rate, as long as W meets the alignment condition. With this optimized access pattern, the program can achieve the maximum TLB hit rate of 31/32.

It has been suggested that increasing the page size could significantly reduce TLB misses in the direct-mapped TLB for this program. For which values of H (G03, range) and W (G04, range) would doubling the page size fail to improve the TLB hit rate?

  • G03 = ?

    32 or equivalent

  • G04 = ?

    any
    With the larger page size, each page now contains 64 elements, and a single row of A spans 4 pages. For W ≤ 32 and any H, the number of hits remains unchanged compared to the smaller pages, as increasing the page size does not reduce compulsory misses.

📝 Annotate

With the doubled page size, each page holds 64 elements, so one row of A spans 4 pages instead of 8. Doubling the page size only helps when it merges pages that the access pattern previously touched separately. Because the submatrix B is right-aligned within each row and a 128 B page already holds 32 elements, any width W ≤ 32 keeps each row's accesses inside a single page. There is then still exactly one compulsory miss per row, so the hit rate is unchanged for any value of H: bigger pages cannot reduce compulsory misses that are already at their minimum.

What is the smallest page size, as a power of 2, that ensures our program incurs at least one TLB hit for all values of H and W where H × W > 1?

  • G05 = ?

    2048B or equivalent
    To determine the minimum page size, consider the access pattern with the lowest hit rate (0%): W = 1, where the program touches a single element per row, so consecutive accesses are one full row (1024 B) apart and nothing is reused. To guarantee at least one TLB hit, a page must be able to cover two such accesses, so the page size must be greater than 1024 B. Since page sizes must be powers of two, the smallest such size is 2048 B.

📝 Annotate

To find the smallest page size, think about the situation where the TLB hit rate is the lowest (0%). This happens when W = 1 and H > 1: each access moves down one row, i.e., 1024 bytes forward in memory, so with pages of 1024 B or less every access lands on a new page and nothing is reused. Once a page is at least 2048 B, two vertically adjacent elements share a page, guaranteeing at least one hit. Since page sizes must be powers of two, the smallest size that works is 2048 bytes.

The program is now modified by swapping the order of the two nested loops, as shown below:

int sum = 0; for (int j = 0; j < W; j++) for (int i = 0; i < H; i++) sum += A[i][j];

Assume W = 8. For which values of H will doubling the page size enhance the TLB hit rate of the above program?

  • G06 = ?

    2 or equivalent
    For H > 2, doubling the page size still leaves TLB-entry conflicts between the first and third rows, as well as between the second and fourth rows.

📝 Annotate

When the loop order is swapped, the program accesses elements column by column instead of row by row, so consecutive accesses are one row (1024 B) apart. With W = 8 and the original 128 B pages, the pages holding A[0][j] through A[3][j] all map to the same TLB entry, so every access conflicts. Doubling the page size to 256 B spreads the rows over two TLB entries: the first and third rows share one entry, and the second and fourth rows share another. For H = 2 the two rows no longer conflict, so the hit rate improves; for larger H the remaining conflicts between those row pairs keep evicting entries, and the hit rate does not improve.


Problem H

This question involves analyzing an out-of-order machine. A program consisting of 50% integer and 50% floating-point instructions is profiled. The average latencies for these instructions, assuming infinite reservation station and commit queue sizes, are as follows:

| Instruction Type | Decode to Issue | Decode to Commit | FU Latency |
|---|---|---|---|
| Integer | 3 cycles | 8 cycles | 2 cycles |
| Floating-Point | 6 cycles | 20 cycles | 12 cycles |

If the processor commits an average of 0.5 instructions per cycle for this program, how many entries are occupied on average in the reservation stations and the commit queue?

Integer Reservation Station:

  • H01 = ?

    0.75

Floating-Point Reservation Station:

  • H02 = ?

    1.5

Commit Queue:

  • H03 = ?

    7

We use Little's Law to calculate the occupancy $N$ of each structure:

$N = T \times L$

where $T$ is the throughput and $L$ is the latency.

First, the throughput for integer ($T_{\text{int}}$) and floating-point ($T_{\text{fp}}$) instructions is 0.25 instructions per cycle each, as the total throughput is 0.5 instructions per cycle and the program consists of 50% integer and 50% floating-point instructions.

For the integer reservation station, the latency is 3 cycles. Using Little's Law, the occupancy is:
$N_{\text{int}} = 0.25 \times 3 = 0.75$

For the floating-point reservation station, the latency is 6 cycles. The occupancy is:
$N_{\text{fp}} = 0.25 \times 6 = 1.5$

To compute the occupancy of the commit queue, we calculate the occupancy for each type of instruction separately and then sum them.

For integer instructions, the latency is 8 cycles, so the occupancy is:
$N_{\text{commit-int}} = 0.25 \times 8 = 2$

For floating-point instructions, the latency is 20 cycles, so the occupancy is:
$N_{\text{commit-fp}} = 0.25 \times 20 = 5$

The total commit queue occupancy is:
$N_{\text{commit-total}} = 2 + 5 = 7$

The core of this problem lies in combining Little's Law, CPU pipeline operations, and instruction type differences to calculate the average occupancy of the Reservation Station and Commit Queue using a weighted average method. It demonstrates how the instruction mix ratio and latency characteristics influence the operational efficiency of CPU resources.
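A quick numeric check of the occupancies above, using the throughput split and the latencies from the table:

#include <stdio.h>

int main(void)
{
    double t_int = 0.25, t_fp = 0.25;   /* instructions per cycle each        */

    /* Little's Law: occupancy N = throughput T x latency L                   */
    double rs_int     = t_int * 3;      /* integer reservation station        */
    double rs_fp      = t_fp  * 6;      /* floating-point reservation station */
    double commit_int = t_int * 8;      /* integer entries in commit queue    */
    double commit_fp  = t_fp  * 20;     /* FP entries in commit queue         */

    printf("integer RS occupancy : %.2f\n", rs_int);                 /* 0.75 */
    printf("FP RS occupancy      : %.2f\n", rs_fp);                  /* 1.50 */
    printf("commit queue         : %.2f\n", commit_int + commit_fp); /* 7.00 */
    return 0;
}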


Problem I

Suppose you are designing a branch predictor for a pipelined in-order processor with the following stages.
[Figure: processor pipeline stages]

The processor operates with the following characteristics:

  • It can issue at most one instruction per cycle.
  • Branch addresses are determined at the end of the B stage.
  • Branch conditions (taken or not taken) are resolved at the end of the E stage.

Within this pipeline, control flow instructions are handled as follows:

  • During the A stage, the instruction at PC + 4 is fetched by default.
  • In the B stage (Branch Address Calculation/Begin Decode), conditional branch instructions (e.g., BLE/BNE) consult the branch predictor. If a branch is predicted to be taken, all subsequent instructions in the pipeline are flushed, and the PC is redirected to the calculated branch target address.
  • JAL instructions execute the jump to the target address in the B stage, flushing all subsequent instructions in the pipeline. The implementation of JALR does not need to be considered.

For the following questions, consider the given C code and its corresponding RISC-V assembly equivalent. Note that there are no branches other than those explicitly included in the assembly code.

  • C Code
int X[1000000]; int a, b; ... for (int i = 1000000; i > 0; i--) { a = X[i]; ... /* Some work independent of a or b */ ... if (a >= 0) { b += b + b; } else { b += b * b; } }
  • RISC-V Assembly
# x1, x2, and x3 contain the values of &X, a, and b respectively. _loop: LW x2, 0(x1) # Load word: x2 = X[i] ADDI x1, x1, -1 # Decrement x1 to point to the next element ... # Some work independent of a or b ... B1: BLT x2, x0, _else # Branch to _else if a < 0 ADD x3, x3, x3 # b = b + b JAL x0, B2 # Jump to B2 _else: MUL x3, x3, x3 # b = b * b B2: BNE x1, x0, _loop # Continue the loop if x1 != 0

This setup forms the basis for analyzing the control flow and behavior of the program.

Assume that branch B1 is always Not Taken, while the branch predictor predicts all branches as Taken. On average, how many instructions will the processor flush per iteration of the loop?

  • I01 = ?

    13

Each iteration encounters two conditional branches and one unconditional jump. B1 is incorrectly predicted as taken and only resolved in the E stage, so 7 instructions are flushed. The unconditional JAL redirects the PC in the B stage, flushing 3 more instructions, and B2 (predicted taken and actually taken, redirected in the B stage) flushes another 3. In total, the number of flushed instructions per iteration is 7 + 3 + 3 = 13.

The branch predictor uses a single branch history table (BHT) indexed by the lower 10 bits of the PC. Each entry in the table is a 1-bit counter that predicts "taken" if the value is 1 and "not taken" if the value is 0. The table is updated with the actual branch outcome at the end of the execute stage (1 for "taken," 0 for "not taken"). Given that the elements of array X are uniformly distributed over [-101, 100], will the predictor achieve high or low accuracy for branches B1 and B2?

Branch B1 will have a prediction accuracy of approximately __ I02 __ (rate) since the branch direction is completely random.

  • I02 = ?

    50%

B2 is __ I03 __ (condition).

  • I03 = ?

    always taken or equivalent

📝 Annotate

B2 is always taken. B1, however, will only be predicted correctly about 50% of the time since its branch direction is completely random.


Problem J

To design a directory-based MSI coherence protocol, the directory uses a bit vector to represent the sharer set for each cache line. In an N-core system, the directory maintains an N-bit wide bit vector for each line it tracks. If the i-th bit of the bit vector is set, it indicates that core i's cache holds a copy of the line in either the Shared (S) or Modified (M) state.

Assume the processor has N cores, each with a 1KB private cache using 32B cache lines. How many entries must the directory contain to track all the cache lines in the private caches?

  • J01 = ?

    32 × N or equivalent
    Each core has 32 private cache lines, and with N cores, the total number of cache lines is 32 × N.

📝 Annotate

Annotated Explanation:

This problem asks how many entries the directory must have to track all cache lines in a system using the MSI coherence protocol.

  1. Cache Line Calculation:
    Each core has a private 1 KB cache, and each cache line is 32 bytes. So, the number of cache lines per core is:

    1024 / 32 = 32 cache lines

  2. Total Cache Lines:
    With N cores, each having 32 cache lines, the total number of cache lines across all cores is:

    32 × N

  3. Directory Entries:
    The directory needs to track each cache line in the system. So, the number of directory entries is equal to the total number of cache lines:

    32 × N

The design of the sharer sets is modified to replace the bit vector with a set of sharer pointers. Each pointer stores the core ID of one sharer. For instance, if cores 10 and 12 hold a cache line in the Shared (S) state, the sharer set in the directory will consist of two pointers, "10" and "12." Each sharer set can hold a maximum of 16 sharer pointers.

In an N-core system, where N is a power of 2, how many total bits are required to represent all sharer pointers in a sharer set?

  • J02 = ?

    16 × log₂ N or equivalent
    With 16 pointers and each pointer requiring log₂ N bits to represent the core ID, the total number of bits needed is 16 × log₂ N.

📝 Annotate

The bit vector is replaced with a set of sharer pointers.
Each pointer represents the core ID of a sharer holding a cache line in the Shared (S) state.
The maximum number of sharer pointers in the sharer set is 16.

N is the number of cores in the system, and
N
is a power of 2.

Bits Required to Represent a Single Pointer

  • A core ID ranges from
    0
    to
    N1
  • Since
    N
    is a power of 2, the number of bits required to represent a single core ID is:
    Bits per pointer=logN

Total Bits for 16 Sharer Pointers

  • Each sharer set can hold a maximum of 16 sharer pointers.

  • The total bits required to represent all sharer pointers in a set is:

    Total bits=Bits per pointer×Number of pointers

    Total bits=(logN)×16
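To make the trade-off concrete, here is a small sketch (the helper names are ours) that compares the per-line sharer-set cost of a full bit vector with that of 16 sharer pointers for a few core counts:

#include <stdio.h>

/* Bits needed for one core ID when n is a power of two. */
static unsigned log2_u(unsigned n)
{
    unsigned bits = 0;
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

int main(void)
{
    const unsigned n_pointers = 16; /* max sharer pointers per set */

    printf("%8s %16s %20s\n", "cores", "bit vector (b)", "16 pointers (b)");
    for (unsigned n = 16; n <= 1024; n *= 4) {
        unsigned vector_bits  = n;                       /* 1 bit per core */
        unsigned pointer_bits = n_pointers * log2_u(n);  /* 16 * log2(N)   */
        printf("%8u %16u %20u\n", n, vector_bits, pointer_bits);
    }
    return 0;
}

For the core counts printed above, the crossover lies between 64 and 256 cores: the bit vector is cheaper for small N, while the 16-pointer scheme is cheaper for large N at the cost of tracking at most 16 sharers per line.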


Problem K

The following questions examine memory accesses from multiple cores in a cache-coherent shared memory system. Each question evaluates the possible outcomes under the following memory consistency models:

  • Sequential Consistency (SC): Ensures that memory operations appear to execute in a single global order, consistent with the program order of each core.
  • Total Store Order (TSO), x86-style: Allows stores to be reordered after later loads. Stores from the same core are visible in the same order to other cores. Store-to-load forwarding within the same core is permitted.
  • Relaxed Memory Order (RMO): Permits both loads and stores to be reordered after later loads and stores, with store-to-load forwarding allowed.

Assume all registers (e.g., r1, r2, …) and memory locations (e.g., a, b, …) are initialized to 0.

Consider a cache-coherent shared-memory system executing the following two threads on two different cores. Assume that memory locations a, b, and c are initialized to 0.

  • Thread Execution 1
| Thread 1 (T1) | Thread 2 (T2) |
|---|---|
| ST (a) <- 1 | LD r1 <- (a) |
| LD r2 <- (b) | ST (b) <- 1 |
| ST (a) <- 2 | |

Identify the memory consistency models where the final values r1 = 1, r2 = 0, and (a) = 1 are possible. (Answer in SC, TSO, and RMO)

  • K01 = ?

    TSO, RMO

📝 Annotate

  • Multiple cores in a cache-coherent shared memory system

    • In a cache-coherent shared memory system, a cache coherence protocol is usually used to ensure that multiple cores maintain consistent access to the shared memory.
    • These protocols ensure that when one core modifies a memory location, other cores are immediately aware of the change, avoiding data inconsistencies.
  • Sequential Consistency (SC)

    • For each independent processing unit, the program order is maintained during execution.
    • The entire program is executed in a certain order across all processors.
  • Total Store Order (TSO)

    • TSO uses write buffers to delay writes to memory, allowing other instructions to run without waiting. This increases processing efficiency.
    • TSO provides tools like memory barriers (LFENCE, SFENCE, MFENCE) to control the order of memory operations, ensuring consistent program behavior.
  • Relaxed Memory Order (RMO)

    • RMO permits memory operations to be reordered, enabling hardware and compilers to optimize performance without strictly following the program order.
    • RMO relies on tools like memory fences to ensure correct ordering when needed, leaving it up to the programmer to manage memory consistency.

This trade-off provides flexibility and higher efficiency.

  • A wide range of memory models
    [Figure: spectrum of memory consistency models]
    • Modern processors use weaker memory models (such as TSO) to enable hardware optimizations and provide synchronization primitives (such as memory barriers) to help programmers manage consistency issues.
    • Ensuring program correctness while optimizing performance requires close cooperation between hardware and software developers. Processor designers and compiler developers work together to ensure that the instruction sequences generated by compilers adhere to the specified memory models. This collaboration provides programmers with the necessary tools to manage memory coherence effectively.

  • Thread Execution 1
| Thread 1 (T1) | Thread 2 (T2) |
|---|---|
| ST (a) <- 1 | LD r1 <- (a) |
| LD r2 <- (b) | ST (b) <- 1 |
| ST (a) <- 2 | |
  1. T1 Operation Sequence:

    • ST (a) <- 1: Store the value 1 into a.
    • LD r2 <- (b): Load the value from b into r2.
    • ST (a) <- 2: Store the value 2 into a.

  2. T2 Operation Sequence:

    • LD r1 <- (a): Load the value from a into r1.
    • ST (b) <- 1: Store the value 1 into b.

  3. Final values to check: r1 = 1, r2 = 0, and (a) = 1.


Analysis Sequential Consistency (SC):

  • In SC, Thread 1's ST (a) <- 2 must happen after ST (a) <- 1 in program order. All threads must see this order, so a = 2 should be the final value.
  • If a = 1 is observed, it means ST (a) <- 2 was not applied globally, which violates SC.
  • SC cannot satisfy the given conditions

Analysis Total Store Order (TSO):

  • r1 = 1: Thread 2 reads a after ST (a) <- 1 has become visible but before ST (a) <- 2 has left Thread 1's write buffer.
  • r2 = 0: when Thread 1 reads b, Thread 2's ST (b) <- 1 has not yet become visible.
  • a = 1: Thread 1's ST (a) <- 2 is still sitting in the write buffer and has not yet been written to shared memory at the observed point.
  • TSO can satisfy the given conditions.

Analysis Relaxed Memory Order (RMO)

  • RMO permits the maximum degree of reordering for both load and store operations.
  • RMO can satisfy the given conditions.

Consider the following four threads executing on a cache-coherent shared-memory system:

  • Thread Execution 2
| Thread 1 (T1) | Thread 2 (T2) | Thread 3 (T3) | Thread 4 (T4) |
|---|---|---|---|
| ST (a) <- 1 | ST (a) <- 2 | LD r1 <- (a) | LD r2 <- (a) |
| | | LD r3 <- (a) | LD r4 <- (a) |

Determine the memory consistency models where the final values r1 = 1, r2 = 2, r3 = 2, and r4 = 1 are possible. (Answer in SC, TSO, and RMO)

  • K02 = ?

    RMO

📝 Annotate

Analysis Sequential Consistency (SC)

  • The required outcome means T3 observed the two stores to a in the order a = 1 then a = 2 (r1 = 1, then r3 = 2), while T4 observed them in the opposite order a = 2 then a = 1 (r2 = 2, then r4 = 1).
  • Under SC, all memory operations appear in a single global order, so every thread must observe the two stores to a in the same order; r4 would have to be 2.
  • SC cannot satisfy the given conditions.

Analysis Total Store Order (TSO)

  • Under TSO, loads from the same core are not reordered with later loads, and all cores observe stores to the same location in a single (coherence) order.
  • As under SC, T3 seeing a = 1 before a = 2 while T4 sees a = 2 before a = 1 would require the two stores to a to be observed in opposite orders by different cores; r4 would again have to be 2.
  • TSO cannot satisfy the given conditions.

Analysis Relaxed Memory Order (RMO)

  • RMO permits the maximum degree of reordering for both load and store operations.
  • RMO can satisfy the given conditions.

Consider the following three threads executing on a cache-coherent shared-memory system:

  • Thread Execution 3
| Thread 1 (T1) | Thread 2 (T2) | Thread 3 (T3) |
|---|---|---|
| ST (a) <- 1 | LD r1 <- (a) | LD r3 <- (b) |
| | LD r2 <- (b) | LD r4 <- (a) |
| ST (b) <- 1 | | |

Determine the memory consistency models where the final values r1 = 1, r2 = 0, r3 = 1, and r4 = 0 are possible. (Answer in SC, TSO, and RMO)

  • K03 = ?

    TSO, RMO

📝 Annotate

Analysis Sequential Consistency (SC)

  • Under SC, r3 = 1 requires T1's ST (b) <- 1 to be globally visible before T3's LD r3, and T1's program order places ST (a) <- 1 before that store.
  • T3's LD r4 follows LD r3 in program order, so by the time T3 loads a, the store a = 1 is already visible, forcing r4 = 1 rather than 0.
  • SC cannot satisfy the given conditions.

Analysis Total Store Order (TSO)

  • If the final values are r1 = 1, r2 = 0, r3 = 1, and r4 = 0, a possible execution order under TSO is as follows:

    1. T1: ST (a) <- 1
      • Thread 1 writes 1 to a. This value is placed in the write buffer of T1 but has not yet been propagated to shared memory.
    2. T2: LD r1 <- (a)
      • Thread 2 reads a and sees the value 1 because TSO allows store-to-load forwarding. T2 can see the value in T1's write buffer.
    3. T3: LD r3 <- (b)
      • Thread 3 reads b and sees the value 1, assuming ST (b) <- 1 from T1 has been propagated to shared memory.
    4. T2: LD r2 <- (b)
      • Thread 2 reads b and sees the value 0, as ST (b) <- 1 might still be in T1's write buffer and not yet visible in shared memory.
    5. T1: ST (b) <- 1
      • Thread 1 writes 1 to b and eventually propagates it to shared memory.
    6. T3: LD r4 <- (a)
      • Thread 3 reads a and sees the value 0, as ST (a) <- 1 may still be in T1's write buffer and not yet propagated to shared memory.
  • TSO can satisfy the given conditions

Analysis Relaxed Memory Order (RMO)

  • RMO permits the maximum degree of reordering for both load and store operations.
  • RMO can satisfy the given conditions.

A sequence lock is a synchronization primitive used to allow a single writer thread to "publish" data that can be read by multiple reader threads. Its key property is that the writer is never blocked, though readers may be temporarily blocked. The code provided below is correct when executed on a system with sequential memory consistency but is not guaranteed to be correct under weaker memory consistency models.

typedef struct { volatile uint64_t seq; volatile object_t object; } seqlock_t; /* Writer function, called by a single thread */ void publish(seqlock_t* seqlock, object_t* src) { atomic_add(&seqlock->seq, 1); // Increment sequence number (odd implies write in progress) memcpy(&seqlock->object, src, sizeof(object_t)); // Copy data to the shared object atomic_add(&seqlock->seq, 1); // Increment sequence number (even implies write completed) } /* Reader function, can be called by any thread */ void get_obj(seqlock_t* seqlock, object_t* dest) { while (1) { uint64_t seq = seqlock->seq; // Capture current sequence number if (seq % 2 == 1) continue; // Skip if a write is in progress memcpy(dest, &seqlock->object, sizeof(object_t)); // Copy the object if (seqlock->seq == seq) break; // Verify no writes occurred during the read } }

The writer thread increments the sequence number twice—once before writing and once after. An odd sequence number indicates that a write operation is ongoing. The writer ensures that readers see either the previous consistent state or the fully updated state.

Readers check the sequence number before and after reading the object. If the sequence number changes during the read or is odd at the start, the reader retries. This ensures that the reader observes a consistent view of the data.

Notes

  1. Non-Atomic memcpy: Since memcpy is not atomic, readers can observe partial updates if the sequence lock's guarantees are violated.
  2. No Memory Barriers: On this system, atomic operations (e.g., atomic_add) do not act as memory barriers, which may lead to reordering of instructions under weaker memory consistency models.
  3. Object Size: The type object_t is assumed to be an opaque structure larger than 16 bytes, making it unsuitable for atomic memory operations. The sequence lock ensures that each reader observes a complete and consistent version of the object.

If this sequence lock is used on a multicore system with TSO (Total Store Order) consistency, would the code remain correct? If yes, briefly explain why. If no, clearly specify the memory barriers that need to be added to ensure correctness.

  • K04 = ? (yes or no)

    Yes

  • K05 = ? (Explain)

    (Credit is given for any answer with a similar meaning.)
    The code would still be correct under TSO. TSO only allows stores to be re-ordered after loads. get_obj() does not have any stores to shared data, so it is clearly correct.
    publish() is also correct, but it requires more justification.

📝 Annotate

The code remains correct under Total Store Order (TSO). TSO only permits a store to be reordered after a later load from the same core. get_obj() performs no stores to shared data, and its loads are not reordered with one another, so the two reads of seq genuinely bracket the copy of the object.

publish() is also correct, but it requires more justification. The key observation is that TSO never reorders stores with other stores: the first seq increment, the stores performed by memcpy() into the object, and the second seq increment become visible to other cores in program order. A reader therefore observes either an odd seq (write in progress), a seq that changes across its copy (retry), or a fully published object. The loads inside publish() (the read halves of the atomic adds) cannot move past their dependent stores because of the Read-After-Write (RAW) dependency on seq.

If this sequence lock is used on a multicore system with RMO (Relaxed Memory Order) consistency, would the code remain correct? If yes, briefly explain why. If no, clearly specify the memory barriers that must be added to ensure correctness.

  • K06 = ? (yes or no)

    No

  • K07 = ? (Explain)

    (Credit is given for any answer with a similar meaning.)
    With RMO, we need to add fences in both publish() and get_obj(). Specifically, we need RR fences both before and after the memcpy() in get_obj() and WW fences before and after the memcpy() in publish().

📝 Annotate

Under Relaxed Memory Order (RMO), fences are required in both publish() and get_obj(). Specifically:

  • In get_obj(): Read-Read (RR) fences must be added both before and after the memcpy() operation.
  • In publish(): Write-Write (WW) fences must be added both before and after the memcpy() operation.

Writer needs:

atomic_add(&seqlock->seq, 1);
MEMBAR_STORE_STORE();  // Prevent reordering of seq increment and data write
memcpy(&seqlock->object, src, sizeof(object_t));
MEMBAR_STORE_STORE();  // Prevent reordering of data write and final seq increment
atomic_add(&seqlock->seq, 1);

Reader needs:

while (1) {
    uint64_t seq = seqlock->seq;
    if (seq % 2 == 1) continue;
    MEMBAR_LOAD_LOAD();  // Prevent reordering of seq read and data read
    memcpy(dest, &seqlock->object, sizeof(object_t));
    MEMBAR_LOAD_LOAD();  // Prevent reordering of data read and final seq read
    if (seqlock->seq == seq) break;
}

These barriers ensure proper ordering of sequence number operations relative to data access.
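For comparison only (this is not part of the quiz answer), the same idea can be written portably with C11 <stdatomic.h>, where acquire/release fences play the role of the LOAD-LOAD and STORE-STORE barriers above. The payload is modeled here as an array of relaxed atomic words (OBJ_WORDS is an arbitrary illustrative size) so that the concurrent copy is not a data race under the C memory model:

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define OBJ_WORDS 8
typedef struct {
    _Atomic uint64_t seq;
    _Atomic uint32_t object[OBJ_WORDS];
} seqlock_t;

/* Single writer: odd seq = update in progress, even seq = stable. */
static void publish(seqlock_t *sl, const uint32_t src[OBJ_WORDS])
{
    uint64_t s = atomic_load_explicit(&sl->seq, memory_order_relaxed);
    atomic_store_explicit(&sl->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);        /* "ST-ST" fence #1 */
    for (size_t i = 0; i < OBJ_WORDS; i++)
        atomic_store_explicit(&sl->object[i], src[i], memory_order_relaxed);
    atomic_store_explicit(&sl->seq, s + 2, memory_order_release); /* #2 */
}

/* Any reader: retry until the same even seq is seen before and after. */
static void get_obj(seqlock_t *sl, uint32_t dst[OBJ_WORDS])
{
    for (;;) {
        uint64_t s = atomic_load_explicit(&sl->seq, memory_order_acquire);
        if (s & 1)
            continue;                                  /* write in progress */
        for (size_t i = 0; i < OBJ_WORDS; i++)
            dst[i] = atomic_load_explicit(&sl->object[i], memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);     /* "LD-LD" fence #2 */
        if (atomic_load_explicit(&sl->seq, memory_order_relaxed) == s)
            return;                                    /* consistent copy  */
    }
}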

The code above uses two atomic additions in the publish function. Assuming a correct implementation on either TSO or RMO (with the necessary barriers in place), would replacing the atomic increments with ordinary non-atomic increments be sufficient? Why or why not?

  • K08 = ? (yes or no)

    Yes

  • K09 = ? (Explain)

    Non-atomic instructions are fine.

    The problem states that only a single thread will call publish(). Therefore, there will never be data races on &seqlock->seq. We can simply regard the store to &seqlock->seq as the serialization point.

📝 Annotate

Non-atomic increments are sufficient because:

  1. Only a single thread performs write operations (publish()), eliminating write-write race conditions on seqlock->seq.
  2. The memory barriers (under RMO) or TSO's store ordering already ensure the required ordering between the sequence-number updates and the data writes.

Problem L

This problem evaluates a virtual memory system with the following characteristics:

  1. Two page sizes are supported: 4KB and 1MB (instead of the traditional 4MB).
  2. Each Page Table Entry (PTE) is 16 bytes (instead of 8 bytes).

The page table structure is summarized below, including the sizes of the page tables and data pages (not drawn to scale).
[Figure: page table structure]

The processor includes a data TLB with 64 entries, and each entry can map either a 4KB page or a 1MB page. After a TLB miss, a hardware engine walks the page table to reload the TLB. The TLB operates using a first-in/first-out (FIFO) replacement policy.

The following program adds the elements from two 1MB arrays and stores the results in a third 1MB array. Assume that 1 MB = 1,048,576 bytes, and the starting addresses of the arrays are as follows:

byte A[1048576]; // 1MB array at 0x00001000000 byte B[1048576]; // 1MB array at 0x00001100000 byte C[1048576]; // 1MB array at 0x00001200000 for (int i = 0; i < 1048576; i++) C[i] = A[i] + B[i];

Assume this program is the only process running on the system, and ignore instruction memory or operating system overheads. The data TLB is initially empty. The system has no cache, and each memory lookup has a latency of 100 cycles.

If all data pages are 4KB, compute the ratio of cycles for address translation to cycles for data access.
Address translation cycles

  • L01 = ?

    300
    100 + 100 + 100 (for the L1, L2, and L3 PTEs)

Data access cycles

  • L02 = ?

    400K or equivalent
    4K * 100
    (there is no cache, this assumes that memory access is byte-wise)

If all data pages are 1MB, compute the ratio of cycles for address translation to cycles for data access.
Address translation cycles

  • L03 = ?

    200
    100 + 100 (for L1, L2 PTE)

Data access cycles

  • L04 = ?

    100M or equivalent
    1M * 100
    (there is no cache, this assumes that memory access is byte-wise)

Assume the system includes a PTE cache with a latency of one cycle. The PTE cache stores page table entries and has unlimited capacity. Calculate the ratio of cycles spent on address translation to cycles spent on data access for the case where data pages are 4KB.

Address translation cycles

  • L05 = ?

    77200
    (256*3 + 3 + 1) * 100
    (Note that the arrays are contiguous and share some PTE entries. 256 L3 PTEs per array * 3 arrays, 1 L2 PTE per array * 3 arrays, 1 L1 PTE)

Data access cycles

  • L06 = ?

    300M or equivalent
    3M*100

What is the minimum number of entries required in the PTE cache to achieve the same performance as an unlimited PTE cache? Assume the PTE cache does not store L3 PTE entries and all data pages are 4KB.

  • L07 = ?

    4
    (1 for L1 and 3 for L2)

📝 Annotate

  1. Page Table Hierarchy:

    • Page tables map virtual memory to physical memory. Multi-level page tables (e.g., L1, L2, L3) are common.
    • Each level of the page table lookup incurs additional memory access latency.
  2. Translation Lookaside Buffer (TLB):

    • A cache for page table entries (PTEs) that reduces address translation overhead.
    • When TLB is full, replacement strategies like FIFO are used.
  3. Memory Access Latency:

    • Each memory access incurs a fixed latency (100 cycles in this case).
    • Address translation involves multiple PTE lookups, while data access directly fetches the data.
  4. Page Sizes:

    • Smaller pages (4KB): Require more page table entries and cause higher address translation overhead.
    • Larger pages (1MB): Reduce the number of PTE lookups, minimizing translation costs.
  5. PTE Cache:

    • A cache that stores PTEs to reduce the cost of page table lookups.
    • The performance depends on the cache size and what levels of PTEs it stores.

  • Address Translation Cycles (L01):
    Each address translation requires 3 memory accesses (L1, L2, and L3 PTE lookups), each taking 100 cycles:

    L01 = 100 + 100 + 100 = 300 cycles

  • Data Access Cycles (L02):
    With no cache, every byte access goes to memory and costs 100 cycles. A 4KB page holds 4,096 bytes, so the data accesses within one page cost:

    L02 = 4K × 100 = 400K cycles


  • Address Translation Cycles (L03):
    Each address translation requires 2 memory accesses (L1 and L2 PTE lookups), each taking 100 cycles:

    L03 = 100 + 100 = 200 cycles

  • Data Access Cycles (L04):
    A 1MB page holds 1,048,576 bytes, and each byte access costs 100 cycles, so the data accesses within one page cost:

    L04 = 1M × 100 = 100M cycles


  • Address Translation Cycles (L05):

    • 256 × 3: L3 PTEs (256 per array for 3 arrays).
    • 1 × 3: L2 PTEs (1 per array for 3 arrays).
    • 1: L1 PTE.

    Total PTE lookups:
    Total Lookups = 256 × 3 + 3 + 1 = 772

    Each lookup takes 100 cycles:
    L05 = 772 × 100 = 77,200 cycles

  • Data Access Cycles (L06):
    Each data access takes 100 cycles. For 3MB (3 arrays of 1MB each):

    L06 = 3M × 100 = 300M cycles


  • Assumptions:

    • L3 PTEs are not cached.
    • The cache must keep the single L1 PTE and one L2 PTE for each array.
  • Calculation:

    • 1 entry for the L1 PTE.
    • 3 entries for the L2 PTEs (1 per array for 3 arrays).

    Minimum Cache Entries = 1 + 3 = 4
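A quick arithmetic check of the figures above, under the stated assumptions (100-cycle memory latency, byte-wise accesses, no data cache); the totals for the no-PTE-cache case are simply the per-page walk cost scaled by the number of pages touched:

#include <stdio.h>

#define MEM_LATENCY 100          /* cycles per memory access */
#define ARRAYS      3            /* A, B, C                  */
#define ARRAY_BYTES (1 << 20)    /* 1 MB                     */

int main(void)
{
    /* 4KB pages, no PTE cache: a 3-level walk for every page touched.   */
    long pages_4k     = (long)ARRAYS * ARRAY_BYTES / 4096;          /* 768   */
    long xlate_no_pte = pages_4k * 3 * MEM_LATENCY;
    /* 4KB pages, ideal PTE cache: each distinct PTE is fetched once.    */
    long pte_lookups  = 256 * ARRAYS + ARRAYS + 1;                  /* 772   */
    long xlate_pte    = pte_lookups * MEM_LATENCY;                  /* 77200 */
    long data_cycles  = (long)ARRAYS * ARRAY_BYTES * MEM_LATENCY;   /* ~300M */

    printf("translation (no PTE cache): %ld cycles\n", xlate_no_pte);
    printf("translation (PTE cache)   : %ld cycles\n", xlate_pte);
    printf("data access               : %ld cycles\n", data_cycles);
    printf("translation/data ratio    : %.6f\n",
           (double)xlate_pte / data_cycles);
    return 0;
}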

Problem M

Suppose we have the following two functions, each running in its own thread:

#define SIZE (1 << 30) char array[SIZE]; void thread1() { while (true) { for (int i = 0; i < SIZE; i += 64) array[i] += 1; } } void thread2() { int i = 1; while (true) { // Easy to predict: loops forever int x = i++; while (x != 1) { // Medium difficulty to predict if ((x & 0x1) == 0) // Hard to predict x >>= 1; else x = 3 * x + 1; } } }
  • Assembly Code for thread1:
thread1(): # Assume a1 holds the address of `array` and a5 holds SIZE .outer: mv a2, a1 # Set a2 to point to `array` li a3, 1 # Initialize loop counter .inner: lbu a4, 0(a2) # Load byte from `array[i]` addi a2, a2, 64 # Increment pointer by 64 addi a3, a3, 1 # Increment loop counter addi a4, a4, 1 # Increment value in `array[i]` sb a4, -64(a2) # Store updated value back to `array[i]` bne a3, a5, .inner # Repeat loop if counter < SIZE j .outer # Restart the outer loop
  • Assembly Code for thread2:
thread2(): li a4, 0 # Initialize i li a6, 1 # Set constant for x != 1 check li a3, 3 # Set multiplier for 3 * x + 1 .outer: addi a4, a4, 1 # Increment i mv a1, a4 # Copy i to x .inner: beq a1, a6, .outer # If x == 1, exit inner loop andi a2, a1, 1 # Check if x is odd bne a2, zero, .odd # If x is odd, go to .odd srai a1, a1, 1 # If x is even, divide x by 2 j .inner # Repeat inner loop .odd: mul a1, a1, a3 # Multiply x by 3 addi a1, a1, 1 # Add 1 to x j .inner # Repeat inner loop

Processor Design
[Figure: two-way multithreaded processor pipeline]

The program runs on an in-order, two-way coarse-grained multithreaded processor with the following characteristics:

  • Instruction and data cache hits take 1 cycle.
  • All code fits in the instruction cache, so there are no instruction cache misses.
  • Data cache misses have a latency of 100 cycles.
  • The data cache is 32KB with 64B cache lines.
  • The ALU takes 1 cycle.
  • The pipeline has full bypassing.
  • Branch misprediction penalty is 5 cycles, meaning the time from fetching the mispredicted branch's following instruction to fetching the correct target is 5 cycles.

If each of these threads were executed individually on this processor, without multithreading, which thread would achieve a higher IPC?

  • M01 = ?

    thread2

Explain your reasoning.

  • M02 = ?

    (Credit is given for any answer with a similar meaning.)
    Cache misses (100 cycle penalty) are much more expensive than branch mispredictions. So, even though thread2 can mispredict ~2 branches per iteration, the penalty is much smaller, resulting in higher IPC.

Suppose the core switches between threads upon encountering a cache miss. When a data cache miss occurs, the processor flushes the pipeline, switches to the other thread, and continues executing that thread until it encounters a miss. The pipeline flush discards the load instruction and all subsequent instructions, up to 6 instructions in total (including the load).

Would this design enable both threads to execute close to their peak single-threaded IPC in steady state?
No. thread2 has no memory instructions that could miss in cache. Therefore, once we switch to thread2, we would never have another cache miss again, starving thread1 entirely.

To fix this, we could switch on:

  • both branch mispredicts and cache misses
  • both cache misses and responses from memory
  • cache misses and then switch back __ M03 __ cycles later
  • M03 = ?

    100

Now, consider running multiple instances of thread1 on this CPU. To achieve maximum IPC, more than 2-way multithreading will be required. What is the minimum number of thread1 instances that must run concurrently on this pipeline to maximize IPC?

  • M04 = ?

    9
    Each time a cache miss occurs, 6 instructions are flushed from the pipeline. These instructions must be re-executed when the thread is switched back, assuming the load will now hit in the cache.

    In each inner-loop iteration, there are 6 instructions to execute, plus an additional 6 instructions that were flushed, resulting in a total of 12 instructions per loop iteration. To maximize IPC, at least ceil(100 / 12) = 9 threads are required.
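The same sizing rule, written out as a tiny helper (the constants are the ones given above; the formula is the ceiling computation from the answer):

#include <stdio.h>

/* Threads needed so that round-robin bursts of `insns_per_burst`
 * instructions hide a miss of `miss_latency` cycles, i.e. the
 * ceil(latency / burst) rule used in the answer above. */
static int min_threads(int miss_latency, int insns_per_burst)
{
    return (miss_latency + insns_per_burst - 1) / insns_per_burst;
}

int main(void)
{
    int miss_latency  = 100;   /* cycles per data-cache miss             */
    int loop_insns    = 6;     /* instructions per inner-loop pass       */
    int flushed_insns = 6;     /* re-executed after each pipeline flush  */
    int burst         = loop_insns + flushed_insns;   /* 12 instructions */

    printf("threads needed: %d\n", min_threads(miss_latency, burst)); /* 9 */
    return 0;
}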

Suppose we implement a policy of switching on branch instructions. In a coarse-grained multithreaded implementation, generating a "switch" signal for the multithreading control logic can occur at various stages of the pipeline. If the switch occurs at stage N, stages 0 through N − 1 must be flushed.

Now, consider running two instances of thread2. Which of the following switching points would result in the highest overall IPC? Select one of them.
a. Switch on every branch (as predicted by the BTB)
b. Switch on every branch misprediction (detected in the ALU)
c. Switch on every branch (as determined during decode)
d. Switch on every branch predicted taken (as determined by the branch predictor)

  • M05 = ?

    a

📝 Annotate

Knowledge Requirements and Supplementary Information

  1. Instructions Per Cycle (IPC):

    • IPC measures the performance of a multithreaded processor, indicating the number of instructions executed per cycle.
    • The impact of memory latency and branch misprediction on IPC is essential.
  2. Memory Hierarchy (Cache) and Latency:

    • Cache hits and misses significantly affect execution efficiency.
    • High-latency memory misses (e.g., 100 cycles) can severely slow down performance.
  3. Branch Prediction and Mispredictions:

    • Branch mispredictions introduce pipeline delays.
    • The accuracy of branch prediction influences processor efficiency.
  4. Multithreading Switching Policies:

    • Understanding how thread switching during cache misses or branch mispredictions can improve performance.
  5. Basic Math for IPC Calculation:

    • Estimating the impact of loop instructions, miss penalties, and flush delays on performance.

  • M01,M02: Thread2 would achieve higher IPC.
    • Thread1 experiences frequent cache misses with a penalty of 100 cycles, while Thread2 only incurs branch misprediction penalties of 5 cycles. Therefore, Thread2 is more efficient overall.

  • Switching only on cache misses: No, this is not sufficient.
  • Thread2 has no memory instructions that could result in cache misses. Once the processor switches to Thread2, it would not switch back, effectively starving Thread1.
  • The solutions:
    1. Switch on both cache misses and branch mispredictions.
    2. Switch on both cache misses and memory responses.
    3. Switch on cache misses and return to the previous thread after 100 cycles (hence M03 = 100).

  • M04: At least 9 copies.
  • Each cache miss causes a 6-instruction flush, and re-execution also requires 6 instructions, totaling 12 instructions per iteration. To fully hide the 100-cycle miss penalty, at least ⌈100 / 12⌉ = 9 threads are required.

  • M05: Switching at the branch predicted by the Branch Target Buffer (BTB).
    • The earlier the switch occurs, the fewer pipeline stages are flushed. Switching at the BTB prediction minimizes the amount of wasted work and maximizes IPC.

Problem N

Multitasking is a fundamental concept in computer systems, allowing multiple tasks or processes to run concurrently, improving resource utilization and system responsiveness. Traditional multitasking in operating systems relies on threads and processes managed by the kernel. These kernel-level threads are preemptively scheduled, meaning the operating system decides when to switch between tasks. While effective, this approach introduces significant overhead due to context switching, where the CPU saves and restores registers, memory mappings, and other state information.

In scenarios where lightweight concurrency is needed, kernel-level multitasking often proves excessive. For instance, tasks such as cooperative multitasking, event-driven programming, or managing independent components within a simulation do not require preemption. Here, user-level constructs like coroutines or userspace threads become powerful tools.

Coroutines provide a mechanism for cooperative multitasking, where tasks explicitly yield control to other tasks at well-defined points. Unlike kernel-level threads, coroutines operate entirely in user space, requiring minimal overhead. This makes them an excellent choice for applications like asynchronous programming, where tasks need to pause while waiting for external events such as I/O completion, without blocking other operations. Coroutines maintain their execution state, including the program counter and local variables, allowing them to resume from where they left off seamlessly. This feature makes them particularly useful in game development, where behaviors like AI or animations require frequent context switching within a single-threaded environment.

Userspace threads, on the other hand, are an abstraction similar to coroutines but with a broader scope. They simulate traditional threads but are scheduled and managed entirely in user space, independent of the kernel. Userspace threads offer greater control over scheduling policies and avoid the overhead of kernel-mode context switches. However, unlike coroutines, userspace threads can preempt one another, making them closer in behavior to kernel threads but much more efficient for many workloads.

The trade-off between coroutines and userspace threads lies in complexity and flexibility. Coroutines are simpler, focusing on cooperative multitasking, while userspace threads provide additional features like preemption. In both cases, these constructs shine in environments where performance and fine-grained control are critical. They reduce the reliance on the kernel, avoid heavy context-switching overhead, and enable developers to implement concurrency with precision tailored to the application's needs.

For example, a high-performance web server might use coroutines to handle thousands of simultaneous connections without creating thousands of threads, which would strain the system's resources. Similarly, scientific simulations with many interacting entities might leverage userspace threads to model their behavior efficiently. Both approaches showcase how user-level multitasking constructs can outperform kernel-level alternatives in specific domains, enabling scalable and efficient multitasking solutions.

Implementing general-purpose coroutines typically requires a second call stack, a feature not natively supported by the C language. A practical, though platform-specific, approach involves using inline assembly to manipulate the stack pointer during the coroutine's initial setup. After obtaining a second call stack, the setjmp and longjmp functions from the standard C library can be used to switch between coroutines.

These functions work by saving and restoring the stack pointer, program counter, callee-saved registers, and other internal state required by the ABI. This ensures that returning to a coroutine after yielding restores its full state, as if resuming from a paused function call. Alternatively, minimalist implementations can achieve similar results by using inline assembly to swap only the stack pointer and program counter, ignoring other registers. While this approach clobbers all other registers, it can be much faster than relying on setjmp and longjmp, which must conservatively save all registers specified by the ABI. By contrast, the minimalist method allows the compiler to manage register spilling, saving only what is actively in use.
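
The approach described above is platform specific. As a portable point of comparison, the same "second call stack plus explicit switches" idea can be expressed with POSIX <ucontext.h> instead of inline assembly or setjmp/longjmp; the sketch below is illustrative only and is not how the quiz code works.

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;

static void coroutine_body(void)
{
    for (int i = 0; i < 3; ++i) {
        printf("coroutine: step %d\n", i);
        swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
    }
}

int main(void)
{
    enum { STACK_SIZE = 64 * 1024 };
    char *stack = malloc(STACK_SIZE);      /* the coroutine's second call stack */

    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = STACK_SIZE;
    co_ctx.uc_link = &main_ctx;            /* where to resume if the body returns */
    makecontext(&co_ctx, coroutine_body, 0);

    for (int i = 0; i < 3; ++i) {
        printf("main: resume %d\n", i);
        swapcontext(&main_ctx, &co_ctx);   /* switch into the coroutine */
    }

    free(stack);
    return 0;
}

Each swapcontext call saves the caller's registers and stack pointer and restores the other context's, which is exactly what the RV32I switch_context routine below does by hand.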

We now aim to implement a minimalist coroutine system on the RV32I architecture, assuming a 32-bit system without floating-point operations.

  • main.c
#include <stddef.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> /* The context_t structure holds the saved state of a thread, * including the return address (ra), stack pointer (sp), * callee-saved registers (s0-s11), the entry function pointer, * and a user-defined data pointer. */ typedef struct { uintptr_t ra; uintptr_t sp; uintptr_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11; void (*entry)(void *); void *data; } context_t[1]; extern void switch_context(context_t from, context_t to); extern void helper(void); /* Initialize the context_t structure for a thread. Sets up * the stack pointer, assigns the helper function as the return address, * and stores the thread entry function and associated data. */ static void initialize_context(context_t ctx, void *stack_base, size_t stack_size, void (*entry)(void *data), void *data) { #define RED_ZONE 128 uintptr_t stack_end = (uintptr_t) stack_base + stack_size - RED_ZONE; stack_end &= /* N01. Fill your code here */; /* Ensure the stack is 16-byte aligned */ memset(ctx, 0, sizeof *ctx); ctx->sp = stack_end; ctx->ra = (uintptr_t) helper; ctx->data = data; ctx->entry = entry; } #define ITERATIONS (1000 * 1000) #define STACKSIZE 32768 static context_t thread1, thread2, thread3; /* * Entry function for thread1. It performs a fixed number of context * switches between thread1 and thread2, and finally prints the total * number of context switches. */ static void thread1_func(void *data) { for (int i = 0; i < ITERATIONS; ++i) switch_context(thread1, thread2); printf("%d context switches\n", 3 * ITERATIONS); } /* Entry function for thread2. It alternates between thread1 and thread3 * for a series of context switches, then loops indefinitely switching * to thread3. */ static void thread2_func(void *data) { switch_context(thread2, thread1); // Switch to thread1 switch_context(thread2, thread1); // Switch to thread1 switch_context(thread2, thread3); // Switch to thread3 switch_context(thread2, thread1); // Switch to thread1 for (int i = 0;; ++i) switch_context(thread2, thread3); // Loop switching to thread3 } /* * Entry function for thread3. It alternates between thread2 and thread1, * then loops indefinitely switching to thread1. */ static void thread3_func(void *data) { switch_context(thread3, thread2); // Switch to thread2 switch_context(thread3, thread1); // Switch to thread1 for (int i = 0;; ++i) switch_context(thread3, thread1); // Loop switching to thread1 } #define ALIGN 64 // Cache line alignment /* * Allocates memory aligned to the cache line size to ensure proper alignment. */ static void *aligned_malloc(int size) { void *mem = malloc(size + ALIGN + sizeof(void *)); void **ptr = (void **) (((uintptr_t) mem + ALIGN + sizeof(void *)) & /* N02. Fill your code here */); ptr[-1] = mem; return ptr; } #define RED_ZONE 128 /* * Sets up the stack for a new thread and initializes its context. */ static void init_context(context_t ctx, void (*entry)(void *data), void *data) { void *stack = aligned_malloc(STACKSIZE + RED_ZONE + 8) - 8; initialize_context(ctx, stack, STACKSIZE, entry, data); } /* * Main function initializes thread2 and thread3, then starts thread1. */ int main() { init_context(thread2, thread2_func, (void *) 0xDEEEECAF); init_context(thread3, thread3_func, (void *) 0xF000000D); thread1_func((void *) 0xBABEBABE); return 0; }
  • context.S
.text .align 4 .globl switch_context .type switch_context, @function /* * Saves the callee-saved registers from the current thread context * into the memory pointed to by a0, then loads the saved state * for the next thread context from the memory pointed to by a1. */ switch_context: sw ra, 0(a0) // Save return address sw sp, 4(a0) // Save stack pointer sw s0, 8(a0) // Save callee-saved registers sw s1, 12(a0) sw s2, 16(a0) sw s3, 20(a0) sw s4, 24(a0) sw s5, 28(a0) sw s6, 32(a0) sw s7, 36(a0) sw s8, 40(a0) sw s9, 44(a0) sw s10, 48(a0) sw s11, 52(a0) lw ra, 0(a1) // Load return address lw sp, 4(a1) // Load stack pointer lw s0, 8(a1) // Load callee-saved registers lw s1, 12(a1) lw s2, 16(a1) lw s3, 20(a1) lw s4, 24(a1) lw s5, 28(a1) lw s6, 32(a1) lw s7, 36(a1) lw s8, 40(a1) lw s9, 44(a1) lw s10, 48(a1) lw s11, 52(a1) ret // Return to the next context .size switch_context, .-switch_context /* * The helper function is the entry point for a new thread. * It retrieves the entry function and its argument from the stack, * then transfers control to the entry function. */ .align 4 .globl helper .type helper, @function helper: lw a0, __ N03 __ (a1) // Load the argument (data) from the stack lw t0, __ N04 __ (a1) // Load the entry function pointer jr t0 // Jump to the entry function .size helper, .-helper

Build the above using the command:

$ $(CROSS_COMPILE)gcc -march=rv32i -mabi=ilp32 -o coroutine.elf context.S main.c

Complete the code to ensure it works as expected.

  • N01 = ?
    -16 or equivalent
  • N02 = ?
    ~(ALIGN - 1)
  • N03 = ?
    56
  • N04 = ?
    60

📝 Annotate

  • Coroutines are a simpler way to achieve cooperative multitasking.
  • Userspace threads provide a more flexible and efficient alternative to kernel-level threads by implementing thread management entirely in the user space. Unlike coroutines, userspace threads can support preemptive multitasking, allowing threads to interrupt and switch execution without explicit yielding from the running thread.

N01 = -16 (or equivalent)
main.c

uintptr_t stack_end = (uintptr_t) stack_base + stack_size - RED_ZONE;
stack_end &= /* N01. Fill your code here */; /* Ensure the stack is 16-byte aligned */

To make sure that stack_end is 16-byte aligned (the RISC-V calling convention requires a 16-byte-aligned stack pointer), the address is rounded down with a bitwise AND.
In 32-bit two's complement, -16 is 0xFFFFFFF0, i.e. 1111...1111 0000: a mask whose lowest 4 bits are clear.
Performing stack_end &= -16; therefore clears the lowest 4 bits of stack_end, making it a multiple of 16 without ever increasing the address.
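
A tiny standalone sketch (the address value is mine, purely for illustration) showing the effect of the mask:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uintptr_t stack_end = 0x10007;                /* arbitrary misaligned address */
    stack_end &= (uintptr_t) -16;                 /* clear the low 4 bits */
    printf("%#lx\n", (unsigned long) stack_end);  /* prints 0x10000 */
    return 0;
}

Rounding down (rather than up) is the right direction here: the stack grows downward, so the aligned sp must stay inside the allocated region.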

N02 = ~(ALIGN - 1)
main.c

static void *aligned_malloc(int size)
{
    void *mem = malloc(size + ALIGN + sizeof(void *));
    void **ptr = (void **) (((uintptr_t) mem + ALIGN + sizeof(void *)) & /* N02. Fill your code here */);
    ptr[-1] = mem;
    return ptr;
}

To align an address to ALIGN bytes, we clear its lower N bits, where ALIGN = 2^N (here ALIGN = 64, so N = 6). The mask ~(ALIGN - 1) = ~0x3F has exactly those bits clear, so the AND rounds the bumped address down to a multiple of 64; adding ALIGN + sizeof(void *) beforehand guarantees the result still lies inside the allocated block, at least sizeof(void *) bytes above mem.
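
The ptr[-1] = mem line exists so the block can be released later. A matching helper (my own sketch; main.c in the quiz never frees its stacks, so no such function appears there) would be:

static void aligned_free(void *ptr)
{
    /* The original malloc pointer was stashed just below the aligned address. */
    free(((void **) ptr)[-1]);
}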

N03 = 56
N04 = 60
These two answers appear to be swapped: the first load's comment says it fetches the data argument, which sits at offset 60, while the second load fetches the entry function pointer, which sits at offset 56. For helper to jump to entry with data in a0, N03 should be 60 and N04 should be 56.
context.S

.align 4
.globl helper
.type helper, @function
helper:
    lw a0, __ N03 __ (a1)   // Load the data argument from the context pointed to by a1
    lw t0, __ N04 __ (a1)   // Load the entry function pointer from the context
    jr t0                   // Jump to the entry function
.size helper, .-helper

In the context_t structure, the members are laid out at the following offsets (4-byte members under ilp32):

typedef struct {
    uintptr_t ra;              // Offset: 0
    uintptr_t sp;              // Offset: 4
    uintptr_t s0;              // Offset: 8
    uintptr_t s1;              // Offset: 12
    uintptr_t s2;              // Offset: 16
    uintptr_t s3;              // Offset: 20
    uintptr_t s4;              // Offset: 24
    uintptr_t s5;              // Offset: 28
    uintptr_t s6;              // Offset: 32
    uintptr_t s7;              // Offset: 36
    uintptr_t s8;              // Offset: 40
    uintptr_t s9;              // Offset: 44
    uintptr_t s10;             // Offset: 48
    uintptr_t s11;             // Offset: 52
    void (*entry)(void *);     // Offset: 56
    void *data;                // Offset: 60
} context_t[1];

These offsets are what helper in context.S must use when a new thread runs for the first time: 56 to fetch entry and 60 to fetch data.
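
As a sanity check (my own sketch; the original context_t has no struct tag, so an equivalently laid-out tagged copy is declared here), the offsets hard-coded in helper can be verified at compile time when building for the rv32/ilp32 target:

#include <stddef.h>
#include <stdint.h>

/* Same member layout as context_t, written with a tag so offsetof can name it. */
struct ctx {
    uintptr_t ra, sp;
    uintptr_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11;
    void (*entry)(void *);
    void *data;
};

/* Every member is 4 bytes under ilp32, giving the offsets listed above. */
_Static_assert(offsetof(struct ctx, entry) == 56, "entry offset expected by helper");
_Static_assert(offsetof(struct ctx, data) == 60, "data offset expected by helper");

These assertions would fail on a 64-bit host, a reminder that the 56/60 constants are tied to the ilp32 ABI used in the build command above.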


Reference