--- tags: computer-arch --- # Quiz6 of Computer Architecture (2025 Fall) > solutions ## Problem A: Atomic Memory Operations and Synchronization Primitives This problem explores the implementation of atomic memory operations in hardware (Chisel HDL) and their application to build synchronization primitives for concurrent programming. Background: The RISC-V "A" (Atomic) extension provides atomic memory operations for multiprocessor synchronization. The `amoswap.w` instruction atomically loads a value from memory, stores a new value, and returns the original value—all as a single indivisible operation. ``` amoswap.w rd, rs2, (rs1) ``` Semantics: ``` temp = M[rs1] // Load word from memory address in rs1 M[rs1] = rs2 // Store value from rs2 to the same address rd = temp // Return original value to rd ``` The operation is atomic: no other processor can access the memory location between the load and store phases. - [ ] Part 1: Chisel HDL Implementation of AMOSWAP We are implementing the `amoswap.w` instruction in a simple RV32I processor using Chisel 3. The processor has a 5-stage pipeline (IF, ID, EX, MEM, WB), and atomic operations are handled in the MEM stage. ``` ┌─────────────────────────────────────────────────────────────────────────┐ │ RV32I 5-Stage Pipeline │ ├─────────┬─────────┬─────────┬─────────────────────────┬─────────────────┤ │ IF │ ID │ EX │ MEM │ WB │ │ (Fetch) │(Decode) │(Execute)│ (Memory Access) │ (Writeback) │ │ │ │ │ ┌───────────────────┐ │ │ │ │ rs1 ───┼────────>│ │ AtomicMemoryUnit │ │ │ │ │ rs2 ───┼────────>│ │ ┌─────────────┐ │──┼──> rd │ │ │ │ │ │ │ FSM │ │ │ │ │ │ │ │ │ └─────────────┘ │ │ │ │ │ │ │ └─────────┬─────────┘ │ │ └─────────┴─────────┴─────────┴────────────┼────────────┴─────────────────┘ │ ┌────────────▼──────────┐ │ Memory Bus │ │ (valid/ready/wen) │ └───────────────────────┘ ``` AMOSWAP Finite State Machine: ``` req_valid && amo_op==1 ┌───────┐ ─────────────> ┌───────┐ mem_ready ┌───────┐ │ sIdle │ │ sRead │ ───────────────> │sWrite │ └───────┘ <───────────── └───────┘ └───────┘ ^ mem_ready && │ │ │ resp_valid │ wen=0, save rdata │ wen=1 │ ▼ ▼ └──────────────────────────────────────────────────┘ AMOSWAP.W Execution Timeline: Cycle 1 (sRead): M[addr] ──────> rdata_reg (wen=0) Cycle 2 (sWrite): wdata_reg ────> M[addr] (wen=1) Result: rdata_reg ────> rd (resp_valid=1) ``` The memory interface uses a simple handshake protocol: - `io.mem.valid`: Request is valid - `io.mem.ready`: Memory is ready to accept request - `io.mem.addr`: Memory address (32-bit) - `io.mem.wdata`: Write data (32-bit) - `io.mem.rdata`: Read data (32-bit) - `io.mem.wen`: Write enable - `io.mem.amo_op`: Atomic operation type (0 = none, 1 = swap) Complete the Chisel module for the atomic memory unit: ```scala import chisel3._ import chisel3.util._ class AtomicMemoryUnit extends Module { val io = IO(new Bundle { // Pipeline interface val req_valid = Input(Bool()) val req_addr = Input(UInt(32.W)) val req_wdata = Input(UInt(32.W)) // Value to swap (from rs2) val req_amo_op = Input(UInt(2.W)) // 0=none, 1=swap val resp_data = Output(UInt(32.W)) // Original value (to rd) val resp_valid = Output(Bool()) val busy = Output(Bool()) // Memory interface val mem_valid = Output(Bool()) val mem_ready = Input(Bool()) val mem_addr = Output(UInt(32.W)) val mem_wdata = Output(UInt(32.W)) val mem_rdata = Input(UInt(32.W)) val mem_wen = Output(Bool()) }) // State machine for atomic operation val sIdle :: sRead :: sWrite :: Nil = Enum(3) val state = RegInit(sIdle) // Registers to hold operation state val 
addr_reg = Reg(UInt(32.W)) val wdata_reg = Reg(UInt(32.W)) val rdata_reg = Reg(UInt(32.W)) // Default outputs io.mem_valid := false.B io.mem_addr := addr_reg io.mem_wdata := wdata_reg io.mem_wen := false.B io.resp_data := rdata_reg io.resp_valid := false.B io.busy := state =/= sIdle switch(state) { is(sIdle) { when(io.req_valid && io.req_amo_op === 1.U) { // Start atomic swap operation addr_reg := io.req_addr wdata_reg := io.req_wdata state := __A01__ } } is(sRead) { // Phase 1: Read original value from memory io.mem_valid := true.B io.mem_wen := __A02__ when(io.mem_ready) { rdata_reg := __A03__ state := sWrite } } is(sWrite) { // Phase 2: Write new value to memory io.mem_valid := true.B io.mem_wen := __A04__ when(io.mem_ready) { io.resp_valid := __A05__ state := sIdle } } } } ``` Fill in * A01 = ? (Hint: Review the FSM diagram—what operation must complete before writing can begin?) * A02 = ? (Hint: Consider what `wen` stands for and whether this phase modifies memory) * A03 = ? (Hint: What signal carries the data read from memory that needs to be saved?) * A04 = ? (Hint: Consider whether memory needs to be updated during this phase) * A05 = ? (Hint: What signal indicates the atomic operation has finished?) - [ ] Part 2: Memory Ordering with Acquire/Release Semantics The RISC-V "A" extension supports memory ordering through `.aq` (acquire) and `.rl` (release) suffixes: - `amoswap.w.aq`: Acquire semantics - no subsequent memory operations can be reordered before this - `amoswap.w.rl`: Release semantics - no prior memory operations can be reordered after this - `amoswap.w.aqrl`: Both acquire and release (full barrier) ``` Memory Ordering Visualization: ACQUIRE (.aq) - "Lock the door behind you" ┌──────────────────────────────────────────────────────────────┐ │ Time ───────────────────────────────────────────────────> │ │ │ │ [prior ops] amoswap.w.aq [subsequent ops] │ │ ↓ ║ ↓ │ │ may float ════╬════▶ BLOCKED from │ │ before aq ║ moving before aq │ │ ║ │ │ Barrier effect: ────╬────────────────────────────────────> │ │ atomic op ops stay after │ └──────────────────────────────────────────────────────────────┘ RELEASE (.rl) - "Finish everything before leaving" ┌──────────────────────────────────────────────────────────────┐ │ Time ───────────────────────────────────────────────────> │ │ │ │ [prior ops] amoswap.w.rl [subsequent ops] │ │ ↓ ║ ↓ │ │ BLOCKED from ════╬════▶ may float │ │ moving after rl ║ after rl │ │ ║ │ │ <───────────────────╬──── Barrier effect: │ │ ops stay before ║ atomic op │ └──────────────────────────────────────────────────────────────┘ Critical Section Pattern: ┌────────────────────────────────────────────────────────────┐ │ │ │ amoswap.w.aq (lock) critical section amoswap.w.rl │ │ ║ ↕ ║ │ │ ║◀── ops cannot ──────────────── escape ──▶║ │ │ ║ escape │ ║ │ │ ╚═══════════════════════╪══════════════════╝ │ │ ops stay inside │ └────────────────────────────────────────────────────────────┘ ``` Extend the Chisel module to support ordering bits. The instruction encoding places: - Bit 26: `aq` (acquire) - Bit 25: `rl` (release) ```scala class AtomicMemoryUnitWithOrdering extends Module { val io = IO(new Bundle { // ... (same as before, plus ordering signals) val req_aq = Input(Bool()) // Acquire bit val req_rl = Input(Bool()) // Release bit // Fence signals to pipeline val fence_acquire = Output(Bool()) // Block younger loads/stores val fence_release = Output(Bool()) // Wait for older stores }) // ... 
(state machine as before) // Registers to latch ordering bits when operation starts val aq_reg = Reg(Bool()) val rl_reg = Reg(Bool()) // In sIdle state, when starting atomic operation: // aq_reg := io.req_aq // rl_reg := io.req_rl // Memory ordering logic // Acquire: Assert fence AFTER the atomic completes // Release: Assert fence BEFORE the atomic starts io.fence_acquire := __A06__ io.fence_release := __A07__ } ``` When should the acquire fence be asserted? __ A06 __ a. `io.req_valid && io.req_aq` b. `io.resp_valid && aq_reg` c. `state === sRead && aq_reg` d. `state === sIdle && io.req_aq` When should the release fence be asserted? __ A07 __ a. `io.resp_valid && rl_reg` b. `state === sIdle && io.req_valid && io.req_rl` c. `state === sWrite && rl_reg` d. `io.req_valid && io.req_rl` Fill in * A06 = ? (Hint: Acquire prevents younger ops from moving before the atomic. When should this fence be active—before or after the atomic completes?) * A07 = ? (Hint: Release ensures older ops complete before the atomic starts. When should this fence be active?) - [ ] Part 3: Binary Semaphore in RV32I Assembly Using the `amoswap.w` instruction, implement a binary semaphore. A binary semaphore has two states: 0 (available) and 1 (acquired). Semaphore memory layout: ``` sem_t: .word 0 # 0 = available, 1 = acquired ``` Memory ordering is achieved using `.aq` (acquire) and `.rl` (release) instruction suffixes: ``` ┌───────────────────────────────────────────────────────────────┐ │ amoswap.w.aq rd, rs2, (rs1) # atomic + acquire barrier │ │ amoswap.w.rl rd, rs2, (rs1) # atomic + release barrier │ │ amoswap.w.aqrl rd, rs2, (rs1) # atomic + full barrier │ └───────────────────────────────────────────────────────────────┘ The .aq/.rl suffixes embed the memory barrier into the atomic instruction. No separate fence instruction is needed. ``` Complete the `sem_wait` (P operation) and `sem_post` (V operation) routines: ```assembly # sem_wait: Acquire the semaphore (blocking) # Input: a0 = pointer to semaphore # Clobbers: t0, t1 sem_wait: li t1, 1 # Value to swap in (acquired state) sem_wait_retry: __A08__ t0, t1, (a0) # Atomically swap with acquire __A09__ t0, zero, sem_wait_retry # If old value was 1, retry ret # .aq suffix provides acquire barrier # sem_post: Release the semaphore # Input: a0 = pointer to semaphore # Clobbers: t0 sem_post: __A10__ zero, zero, (a0) # Atomically store 0 with release ret # .rl suffix provides release barrier ``` Fill in * A08 = ? (Hint: Think about what memory ordering is needed when entering a critical section) * A09 = ? (Hint: We retry if the semaphore was already held (t0 ≠ 0). Which conditional branch instruction implements this test?) * A10 = ? (Hint: What memory ordering ensures all prior operations complete before leaving the critical section?) - [ ] Part 4: Mutex with Ownership Tracking in C A mutex differs from a binary semaphore by tracking ownership—only the thread that acquired the lock can release it. This prevents accidental or malicious unlocking by other threads. Atomic Primitives: This implementation uses RISC-V inline assembly for atomic operations. The `_Atomic` qualifier indicates variables accessed atomically. Unlike C11's `<stdatomic.h>`, we implement primitives directly via `amoswap.w` instructions with acquire/release ordering. 
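For readers familiar with C11, the same acquire/release swaps can be written portably with `<stdatomic.h>`; on RV32IMA a conforming compiler typically lowers these calls to `amoswap.w.aq` / `amoswap.w.rl`. The sketch below is for comparison only and uses the real C11 `_Atomic` keyword, not the `volatile` macro defined in the quiz code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* C11 reference versions of the swap primitives (comparison only; the quiz
 * implements them with inline amoswap.w.aq / amoswap.w.rl instead). */
static inline uint32_t swap_acquire_c11(_Atomic uint32_t *addr, uint32_t newval)
{
    /* Acquire: later loads/stores may not be reordered before this swap. */
    return atomic_exchange_explicit(addr, newval, memory_order_acquire);
}

static inline uint32_t swap_release_c11(_Atomic uint32_t *addr, uint32_t newval)
{
    /* Release: earlier loads/stores may not be reordered after this swap. */
    return atomic_exchange_explicit(addr, newval, memory_order_release);
}
```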
``` Atomic Operation Memory Model: ┌─────────────────────────────────────────────────────────────────┐ │ Thread A Thread B │ │ ───────── ───────── │ │ critical_section(); │ │ atomic_store(&lock, 0); ─────────> atomic_load(&lock); │ │ [release] memory if (lock == 0) │ │ fence atomic_swap(&lock, 1);│ │ [acquire] │ │ critical_section(); │ └─────────────────────────────────────────────────────────────────┘ ``` ```c #include <stdint.h> #include <stdbool.h> /* _Atomic qualifier: Indicates a variable that must be accessed atomically. * Similar to C11 atomics but implemented via RISC-V A extension instructions. * The compiler treats these as volatile with memory ordering constraints. */ #define _Atomic volatile typedef uint32_t tid_t; // Thread ID type typedef struct { _Atomic uint32_t lock; // 0 = unlocked, non-zero = owner's tid tid_t owner; // Current owner (valid only when locked) } mutex_t; #define MUTEX_INITIALIZER { .lock = 0, .owner = 0 } // Get current thread ID (provided by OS/runtime) extern tid_t get_current_tid(void); // Atomic load with acquire semantics static inline uint32_t atomic_load(_Atomic uint32_t *addr) { uint32_t val; __asm__ __volatile__ ( "lw %0, (%1)\n\t" "fence r, rw" // acquire barrier : "=r"(val) : "r"(addr) : "memory" ); return val; } // Atomic store with release semantics static inline void atomic_store(_Atomic uint32_t *addr, uint32_t val) { __asm__ __volatile__ ( "fence rw, w\n\t" // release barrier "sw %0, (%1)" :: "r"(val), "r"(addr) : "memory" ); } // Atomic swap with acquire semantics (for lock acquisition) static inline uint32_t atomic_swap_acquire(_Atomic uint32_t *addr, uint32_t newval) { uint32_t result; __asm__ __volatile__ ( "amoswap.w.aq %0, %1, (%2)" : "=r"(result) : "r"(newval), "r"(addr) : "memory" ); return result; } // Atomic swap with release semantics (for lock release) static inline uint32_t atomic_swap_release(_Atomic uint32_t *addr, uint32_t newval) { uint32_t result; __asm__ __volatile__ ( "amoswap.w.rl %0, %1, (%2)" : "=r"(result) : "r"(newval), "r"(addr) : "memory" ); return result; } // Atomic swap with full barrier (acquire + release) static inline uint32_t atomic_swap(_Atomic uint32_t *addr, uint32_t newval) { uint32_t result; __asm__ __volatile__ ( "amoswap.w.aqrl %0, %1, (%2)" : "=r"(result) : "r"(newval), "r"(addr) : "memory" ); return result; } // Atomic fetch-and-add static inline uint32_t atomic_fetch_add(_Atomic uint32_t *addr, uint32_t val) { uint32_t result; __asm__ __volatile__ ( "amoadd.w.aqrl %0, %1, (%2)" : "=r"(result) : "r"(val), "r"(addr) : "memory" ); return result; } // Atomic fetch-and-sub static inline uint32_t atomic_fetch_sub(_Atomic uint32_t *addr, uint32_t val) { return atomic_fetch_add(addr, -val); } // Initialize mutex void mutex_init(mutex_t *m) { atomic_store(&m->lock, 0); m->owner = 0; } // Try to acquire mutex (non-blocking) // Returns: true if acquired, false if already held bool mutex_trylock(mutex_t *m) { tid_t self = get_current_tid(); // Attempt to swap our tid into the lock uint32_t old = atomic_swap((volatile uint32_t *)&m->lock, self); if (old == __A11__) { // Lock was free, we acquired it m->owner = __A12__; return true; } else if (old == self) { // We already held the lock (recursive attempt) // Restore lock state and return false (not re-entrant) atomic_swap((volatile uint32_t *)&m->lock, __A13__); return false; } // Lock held by another thread, restore their ownership atomic_swap((volatile uint32_t *)&m->lock, old); return __A14__; } // Acquire mutex (blocking) void mutex_lock(mutex_t *m) { while 
(!mutex_trylock(m)) { // Spin wait (could add yield/sleep for efficiency) __asm__ __volatile__ ("" ::: "memory"); // Compiler barrier } } // Release mutex // Returns: true if released, false if not owner bool mutex_unlock(mutex_t *m) { tid_t self = get_current_tid(); // Ownership check if (m->owner != __A15__) { return false; // Not the owner, cannot unlock } m->owner = 0; atomic_store(&m->lock, __A16__); // Release the lock return true; } ``` Fill in * A11 = ? (Hint: What value indicates the lock was free/unlocked?) * A12 = ? (Hint: After acquiring, who should be recorded as the owner?) * A13 = ? (Hint: If we already held the lock, what value should we restore to maintain our ownership?) * A14 = ? (Hint: If another thread holds the lock, did trylock succeed or fail?) * A15 = ? (Hint: Before unlocking, we verify the caller is the owner. What should we compare m->owner against?) * A16 = ? (Hint: What value represents the unlocked state?) - [ ] Part 5: Producer-Consumer Problem with Mutex and Condition Variables The classic producer-consumer problem requires synchronization between threads that produce data and threads that consume it. We use a bounded buffer with mutex protection. Problem setup: - Bounded buffer of size `BUFFER_SIZE` - Multiple producers adding items - Multiple consumers removing items - Must block when buffer is full (producers) or empty (consumers) ```c /* Uses atomic primitives defined in Part 4 */ #define BUFFER_SIZE 8 // Condition variable (simplified spin-wait version) typedef struct { _Atomic uint32_t waiters; // Number of waiting threads _Atomic uint32_t signal; // Signal flag } cond_t; #define COND_INITIALIZER { .waiters = 0, .signal = 0 } // Forward declarations (implementation in Part 6) void cond_wait(cond_t *cv, mutex_t *m); void cond_signal(cond_t *cv); void cond_broadcast(cond_t *cv); // Bounded buffer structure typedef struct { int buffer[BUFFER_SIZE]; int head; // Next position to write int tail; // Next position to read int count; // Number of items in buffer mutex_t lock; // Protects all fields cond_t not_full; // Signaled when buffer has space cond_t not_empty; // Signaled when buffer has items } bounded_buffer_t; void buffer_init(bounded_buffer_t *bb) { bb->head = 0; bb->tail = 0; bb->count = 0; mutex_init(&bb->lock); // Initialize condition variables atomic_store(&bb->not_full.waiters, 0); atomic_store(&bb->not_full.signal, 0); atomic_store(&bb->not_empty.waiters, 0); atomic_store(&bb->not_empty.signal, 0); } // Producer: Add item to buffer void produce(bounded_buffer_t *bb, int item) { mutex_lock(&bb->lock); // Wait while buffer is full while (bb->count == __A17__) { cond_wait(&bb->not_full, &bb->lock); } // Add item to buffer bb->buffer[bb->head] = item; bb->head = (bb->head + 1) % __A18__; bb->count++; // Signal that buffer is not empty cond_signal(&bb->__A19__); mutex_unlock(&bb->lock); } // Consumer: Remove item from buffer int consume(bounded_buffer_t *bb) { mutex_lock(&bb->lock); // Wait while buffer is empty while (bb->count == __A20__) { cond_wait(&bb->not_empty, &bb->lock); } // Remove item from buffer int item = bb->buffer[bb->tail]; bb->tail = (bb->tail + 1) % BUFFER_SIZE; bb->__A21__--; // Signal that buffer is not full cond_signal(&bb->__A22__); mutex_unlock(&bb->lock); return item; } ``` Fill in * A17 = ? (Hint: The buffer is full when count equals what maximum capacity?) * A18 = ? (Hint: Circular buffer wrap-around uses modulo of what value?) * A19 = ? 
(Hint: After producing an item, which condition variable should be signaled to wake waiting consumers?) * A20 = ? (Hint: The buffer is empty when count equals what?) * A21 = ? (Hint: After removing an item, which field should be decremented?) * A22 = ? (Hint: After consuming an item, which condition variable should be signaled to wake waiting producers?) - [ ] Part 6: Condition Variable Implementation Complete the condition variable implementation using the atomic operations defined in Part 4. Note: This is a simplified spin-wait implementation for educational purposes. In practice, `cond_signal` with a single flag may wake multiple waiters (similar to broadcast behavior). A production implementation would use futex or other OS primitives for proper single-waiter signaling. ```c /* Uses atomic primitives defined in Part 4 */ // Wait on condition variable // Precondition: mutex m is held by caller // Postcondition: mutex m is held by caller, condition may have been signaled void cond_wait(cond_t *cv, mutex_t *m) { // Increment waiter count atomic_fetch_add(&cv->waiters, 1); // Release the mutex before sleeping mutex_unlock(__A23__); // Spin until signaled while (atomic_load(&cv->signal) == __A24__) { __asm__ __volatile__ ("" ::: "memory"); } // Decrement waiter count atomic_fetch_sub(&cv->waiters, 1); // Clear signal if we're the last waiter if (atomic_load(&cv->waiters) == 0) { atomic_store(&cv->signal, 0); } // Reacquire the mutex before returning mutex_lock(__A25__); } // Signal one waiting thread void cond_signal(cond_t *cv) { if (atomic_load(&cv->__A26__) > 0) { atomic_store(&cv->signal, __A27__); } } // Wake all waiting threads void cond_broadcast(cond_t *cv) { atomic_store(&cv->signal, __A28__); } ``` Fill in * A23 = ? (Hint: Which mutex pointer should be passed to mutex_unlock before waiting?) * A24 = ? (Hint: What signal value means "no signal sent yet"—keep spinning while signal equals this) * A25 = ? (Hint: Which mutex pointer should be passed to mutex_lock after waking?) * A26 = ? (Hint: Before signaling, we check if any threads are waiting. Which field tracks this count?) * A27 = ? (Hint: What value should signal be set to in order to wake waiters?) * A28 = ? (Hint: Broadcast uses the same signal value as cond_signal since the flag-based approach wakes all waiters) - [ ] Part 7: Deadlock Prevention Analysis Consider this scenario with two buffers and two threads: ```c bounded_buffer_t buffer_A, buffer_B; // Thread 1: Transfer from A to B void transfer_A_to_B(void) { int item = consume(&buffer_A); produce(&buffer_B, item); } // Thread 2: Transfer from B to A void transfer_B_to_A(void) { int item = consume(&buffer_B); produce(&buffer_A, item); } ``` If both buffers are empty and both threads run simultaneously: What happens? __ A29 __ a. Both threads complete successfully b. Deadlock occurs - each thread waits for the other's buffer c. One thread completes, one blocks forever d. Race condition causes data corruption To mitigate this issue, which solution is most appropriate? __ A30 __ a. Use recursive mutexes b. Add timeout to condition wait c. Ensure buffers are never empty simultaneously d. Remove the condition variables Fill in * A29 = ? (Hint: Both threads try to consume from empty buffers. Think about circular wait conditions.) * A30 = ? (Hint: Consider mechanisms that allow threads to break out of indefinite waits) - [ ] Part 8: Verification Programs The following test programs verify correctness of the mutex and producer-consumer implementations. 
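The listings below are written for the bare-metal RV32 target, where thread creation is left to the surrounding environment. As a quick sanity check, a POSIX-threads driver along the following lines (an assumption — e.g., a riscv32 Linux build run under QEMU user mode, not the bare-metal target) can spawn the workers and check the final counter:

```c
#include <pthread.h>
#include <stdio.h>

/* Driver sketch (assumes POSIX threads are available). It relies on
 * NUM_THREADS, ITERATIONS, shared_counter, and test_mutex_increment()
 * from the mutex verification listing that follows. */
static void *increment_worker(void *arg)
{
    (void) arg;
    test_mutex_increment();   /* ITERATIONS locked increments per worker */
    return NULL;
}

int run_increment_harness(void)
{
    pthread_t threads[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, increment_worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    printf("counter = %u (expected %u)\n",
           shared_counter, (unsigned) (NUM_THREADS * ITERATIONS));
    return (shared_counter == NUM_THREADS * ITERATIONS) ? 0 : -1;
}
```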
Mutex Verification (tests mutual exclusion and ownership): ```c /* Mutex verification - compile with: riscv32-unknown-elf-gcc -O2 -march=rv32ima */ #include <stdint.h> #include <stdbool.h> #define _Atomic volatile #define NUM_THREADS 4 #define ITERATIONS 1000 /* Include atomic primitives and mutex from Part 4 */ static mutex_t test_mutex = MUTEX_INITIALIZER; static volatile uint32_t shared_counter = 0; static volatile uint32_t error_count = 0; /* Simulated thread ID (in real system, provided by OS) */ static _Atomic uint32_t next_tid = 1; tid_t get_current_tid(void) { static __thread tid_t my_tid = 0; if (my_tid == 0) my_tid = atomic_fetch_add(&next_tid, 1); return my_tid; } /* Test: Increment counter under mutex protection */ void test_mutex_increment(void) { for (int i = 0; i < ITERATIONS; i++) { mutex_lock(&test_mutex); /* Critical section: read-modify-write must be atomic */ uint32_t old = shared_counter; shared_counter = old + 1; /* Verify no concurrent modification */ if (shared_counter != old + 1) atomic_fetch_add(&error_count, 1); mutex_unlock(&test_mutex); } } /* Test: Ownership enforcement */ void test_ownership(void) { mutex_t m = MUTEX_INITIALIZER; tid_t self = get_current_tid(); mutex_lock(&m); /* Verify we are the owner */ if (m.owner != self) atomic_fetch_add(&error_count, 1); /* Unlock should succeed (we are owner) */ if (!mutex_unlock(&m)) atomic_fetch_add(&error_count, 1); /* Second unlock should fail (already released) */ if (mutex_unlock(&m)) atomic_fetch_add(&error_count, 1); } /* Verification entry point */ int verify_mutex(void) { /* Run ownership test */ test_ownership(); /* In multi-threaded environment, spawn NUM_THREADS calling test_mutex_increment() */ /* After all threads complete: */ /* Expected: shared_counter == NUM_THREADS * ITERATIONS */ /* Expected: error_count == 0 */ return (error_count == 0) ? 0 : -1; } ``` Producer-Consumer Verification (tests bounded buffer correctness): ```c /* Producer-consumer verification */ #include <stdint.h> #include <stdbool.h> #define _Atomic volatile #define BUFFER_SIZE 8 #define NUM_PRODUCERS 2 #define NUM_CONSUMERS 2 #define ITEMS_PER_PROD 100 /* Include all primitives from Parts 4, 5, 6 */ static bounded_buffer_t buffer; static _Atomic uint32_t produced_sum = 0; static _Atomic uint32_t consumed_sum = 0; static _Atomic uint32_t items_produced = 0; static _Atomic uint32_t items_consumed = 0; /* Producer thread function */ void producer_thread(int producer_id) { for (int i = 0; i < ITEMS_PER_PROD; i++) { int item = producer_id * 1000 + i; /* Unique item value */ produce(&buffer, item); atomic_fetch_add(&produced_sum, (uint32_t)item); atomic_fetch_add(&items_produced, 1); } } /* Consumer thread function */ void consumer_thread(void) { while (atomic_load(&items_consumed) < NUM_PRODUCERS * ITEMS_PER_PROD) { int item = consume(&buffer); atomic_fetch_add(&consumed_sum, (uint32_t)item); atomic_fetch_add(&items_consumed, 1); } } /* Verification checks */ int verify_producer_consumer(void) { buffer_init(&buffer); /* In multi-threaded environment: * - Spawn NUM_PRODUCERS threads running producer_thread(id) * - Spawn NUM_CONSUMERS threads running consumer_thread() * - Wait for all to complete */ /* Verification invariants */ int errors = 0; /* 1. All produced items must be consumed */ if (items_produced != items_consumed) { errors++; /* Lost or duplicated items */ } /* 2. Sum of produced must equal sum of consumed */ if (produced_sum != consumed_sum) { errors++; /* Data corruption */ } /* 3. 
Buffer must be empty at end */ if (buffer.count != 0) { errors++; /* Residual items */ } /* 4. No buffer overflow/underflow during execution */ /* (Checked implicitly by bounded buffer logic) */ return (errors == 0) ? 0 : -1; } ``` Verification Summary: | Test | Property Verified | Expected Result | |------|------------------|-----------------| | Mutex increment | Mutual exclusion | counter == N × iterations | | Ownership test | Owner-only unlock | unlock fails for non-owner | | Producer-consumer | No lost items | produced == consumed | | Sum verification | No corruption | sum(produced) == sum(consumed) | | Buffer bounds | No over/underflow | 0 ≤ count ≤ BUFFER_SIZE | ### ✅ Solution to Problem A: A01: ==sRead== Explanation: When a swap request arrives, the FSM transitions to the read state to first load the original value from memory. A02: ==false.B== Explanation: During the read phase, we're loading data from memory, so write enable must be false. A03: ==io.mem_rdata== Explanation: The data read from memory (`io.mem_rdata`) is saved to `rdata_reg` for later return to the destination register. A04: ==true.B== Explanation: During the write phase, we're storing the new value to memory, so write enable must be true. A05: ==true.B== Explanation: When the write completes, we assert `resp_valid` to indicate the atomic operation is done and the result is ready. A06: ==b== Explanation: Acquire semantics require that the fence is asserted AFTER the atomic operation completes (`io.resp_valid`), preventing younger operations from executing before the atomic. The `aq_reg` register preserves the acquire bit from when the operation started, ensuring correct behavior across the multi-cycle FSM. A07: ==b== Explanation: Release semantics require that the fence is asserted BEFORE the atomic operation starts (in `sIdle` when valid request arrives), ensuring all prior stores complete before the atomic operation begins. A08: ==amoswap.w.aq== Explanation: Use atomic swap with acquire semantics. The `.aq` suffix ensures no subsequent memory operations can be reordered before this atomic operation completes. This is the preferred single-instruction method. A09: ==bne== (also accept: ==bnez==) Explanation: If `t0` (old value) is non-zero (semaphore was already acquired), branch back to retry. The `bne` (branch if not equal) instruction compares `t0` with `zero` and branches if they differ. Note: `bnez t0, label` is a pseudo-instruction that assembles to `bne t0, zero, label`. A10: ==amoswap.w.rl== Explanation: Use atomic swap with release semantics. The `.rl` suffix ensures all prior memory operations complete before this atomic operation. Storing zero releases the semaphore. ``` Memory barrier embedded in .aq/.rl suffixes: ┌─────────────────────────────────────────────────────────────────────┐ │ sem_wait: sem_post: │ │ amoswap.w.aq t0, t1, (a0) amoswap.w.rl zero, zero, (a0) │ │ │ │ │ │ ▼ ▼ │ │ ════════════▶ ◀════════════ │ │ barrier here barrier here │ │ (acquire) (release) │ │ │ │ │ │ [critical section] [prior ops flushed] │ │ ops stay AFTER ops stay BEFORE │ └─────────────────────────────────────────────────────────────────────┘ ``` A11: ==0== Explanation: A lock value of 0 indicates the mutex is free (unlocked state). A12: ==self== Explanation: After acquiring the lock, set the owner field to the current thread's ID. A13: ==self== Explanation: If we already held the lock (recursive attempt on non-reentrant mutex), restore our own tid to maintain lock state. 
A14: ==false== Explanation: When the lock is held by another thread, trylock fails and returns false. A15: ==self== Explanation: Compare the stored owner with the current thread's ID to verify ownership before unlocking. A16: ==0== Explanation: Store 0 to release the lock (return to unlocked state). A17: ==BUFFER_SIZE== Explanation: The buffer is full when count equals the maximum capacity (BUFFER_SIZE). A18: ==BUFFER_SIZE== Explanation: Circular buffer arithmetic uses modulo BUFFER_SIZE to wrap around. A19: ==not_empty== Explanation: After producing (adding an item), signal `not_empty` to wake consumers waiting for data. A20: ==0== Explanation: The buffer is empty when count is 0. A21: ==count== Explanation: After consuming an item, decrement the count of items in the buffer. A22: ==not_full== Explanation: After consuming (removing an item), signal `not_full` to wake producers waiting for space. A23: ==m== Explanation: Pass the mutex pointer `m` to `mutex_unlock` to release it before waiting. A24: ==0== Explanation: A signal value of 0 means no signal has been sent; spin while this condition holds. A25: ==m== Explanation: Pass the mutex pointer `m` to `mutex_lock` to reacquire it before returning. A26: ==waiters== Explanation: Check if any threads are waiting (`waiters > 0`) before setting the signal flag. A27: ==1== Explanation: Set signal to 1 to wake waiting threads. A28: ==1== Explanation: Broadcast also sets signal to 1; all waiters checking the flag will wake up. A29: ==b== Explanation: Deadlock occurs because Thread 1 waits for buffer_A to have items (blocked in `cond_wait` on `not_empty`), while Thread 2 waits for buffer_B to have items. Neither can produce because both are blocked waiting to consume first. A30: ==b== Explanation: Adding a timeout to the condition wait allows threads to break out of potential deadlock situations, retry with different logic, or signal an error. Note: This is a mitigation strategy rather than true prevention—it allows recovery from deadlock but doesn't eliminate the circular wait condition. In practice, restructuring the algorithm (e.g., using non-blocking try-consume operations) would be a more robust solution, but timeout is the most practical choice among the given options. 
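As a sketch of the "non-blocking try-consume" restructuring mentioned in the A30 explanation, a hypothetical `try_consume()` (not part of the quiz API; it reuses the Part 5 types and the Part 4/6 primitives) returns immediately instead of blocking in `cond_wait`, so a transfer thread can back off and retry rather than joining a circular wait:

```c
/* Hypothetical non-blocking variant of consume() (illustration only).
 * Returns false instead of waiting when the buffer is empty, so
 * transfer_A_to_B() / transfer_B_to_A() can retry instead of deadlocking. */
bool try_consume(bounded_buffer_t *bb, int *item)
{
    mutex_lock(&bb->lock);
    if (bb->count == 0) {                    /* empty: fail fast, do not wait */
        mutex_unlock(&bb->lock);
        return false;
    }
    *item = bb->buffer[bb->tail];
    bb->tail = (bb->tail + 1) % BUFFER_SIZE;
    bb->count--;
    cond_signal(&bb->not_full);              /* space is now available */
    mutex_unlock(&bb->lock);
    return true;
}
```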
Detailed explanation for memory ordering (A06-A07 and A08-A10): Memory Ordering Semantics: * Acquire (`.aq`): Operations after the atomic cannot be reordered before it - Think of it as "acquiring" a lock - nothing inside the critical section can escape before the door closes - The `.aq` suffix on AMO instructions provides this guarantee in hardware - Prevents subsequent loads/stores from being speculatively executed before the atomic * Release (`.rl`): Operations before the atomic cannot be reordered after it - Think of it as "releasing" a lock - everything inside the critical section must be done first - The `.rl` suffix on AMO instructions provides this guarantee in hardware - Ensures all prior loads/stores are visible before the atomic releases the lock Hardware Implementation: ``` ┌─────────────────────────────────────────────────────────────────────────┐ │ Instruction │ Hardware Behavior │ ├───────────────────────┼─────────────────────────────────────────────────┤ │ amoswap.w │ Atomic only, no ordering │ │ amoswap.w.aq │ Atomic + drain store buffer AFTER completion │ │ amoswap.w.rl │ Atomic + drain store buffer BEFORE start │ │ amoswap.w.aqrl │ Atomic + full memory barrier (both) │ └─────────────────────────────────────────────────────────────────────────┘ ``` --- ## Problem B: Fixed-Point Arithmetic for Software 3D Rendering This problem explores the implementation of fixed-point arithmetic for a software 3D renderer targeting RV32IM (RISC-V 32-bit with integer multiply/divide, but no floating-point unit). ``` Software 3D Rendering Pipeline (Fixed-Point Implementation) ════════════════════════════════════════════════════════════ ┌─────────────────────────────────────────────────────┐ │ 3D Model Data (Vertices, Normals in Q15.16) │ └───────────────────────┬─────────────────────────────┘ ▼ ┌─────────────────────────────────────────────────────┐ │ GEOMETRY STAGE │ │ ─────────────────────────────────────────────── │ │ Transforms (Model → World → Camera → Clip) │ │ Backface culling, Clipping │ │ Key ops: FP_MUL (matrix-vector), vec3_dot │ └───────────────────────┬─────────────────────────────┘ ▼ ┌─────────────────────────────────────────────────────┐ │ RASTERIZATION STAGE │ │ ─────────────────────────────────────────────── │ │ Triangle → Pixels, Depth buffering │ │ Barycentric interpolation │ │ Key ops: FP_DIV (1/z), FP_MUL (interpolation) │ └───────────────────────┬─────────────────────────────┘ ▼ ┌─────────────────────────────────────────────────────┐ │ PIXEL STAGE │ │ ─────────────────────────────────────────────── │ │ Lighting (Phong), Texture sampling │ │ Color computation with saturation │ │ Key ops: FP_MUL, fp_sin/fp_cos, fp_sqrt │ └───────────────────────┬─────────────────────────────┘ ▼ ┌─────────────────────────────────────────────────────┐ │ Framebuffer (Final integer pixel colors) │ └─────────────────────────────────────────────────────┘ Summary: All stages use Q15.16 fixed-point arithmetic, enabling 3D rendering on integer-only hardware (RV32IM). ``` Background: Embedded 3D graphics (game consoles, IoT displays, automotive HUDs) often run on processors without floating-point units (FPU). The RV32IM instruction set provides only integer multiply/divide, making fixed-point arithmetic essential. Fixed-point represents fractional values using integers with an implicit binary point, enabling fast computation while maintaining sufficient precision for real-time graphics. 
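A concrete taste of the format before the definitions in Part 1: because the binary point is implicit and fixed, addition and subtraction of Q15.16 values are ordinary integer operations; only multiplication and division need explicit rescaling (Parts 2-3). A minimal sketch:

```c
#include <stdint.h>

int main(void)
{
    /* Q15.16: a real value v is stored as the integer v * 65536. */
    int32_t a = 1 << 16;     /* 1.0  -> 0x00010000 */
    int32_t b = 3 << 14;     /* 0.75 -> 0x0000C000 */
    int32_t sum = a + b;     /* 1.75 -> 0x0001C000, a plain integer add */
    return (sum == 0x0001C000) ? 0 : 1;
}
```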
Key advantages of fixed-point for embedded 3D rendering: - Deterministic timing (no FPU pipeline stalls or denormal penalties) - Smaller code size (no soft-float library needed) - Lower power consumption (integer ALU vs FPU) - Predictable precision (unlike floating-point's variable precision across magnitude ranges) The Q15.16 format provides ~4.8 decimal digits of precision with a range of ±32768, suitable for most 3D graphics workloads where coordinates are normalized or bounded. We use the Q15.16 fixed-point format: ``` ┌────────────────────────────────────────────────────────────────┐ │ Q15.16 Fixed-Point Format │ ├────────────────────────────────────────────────────────────────┤ │ Bit 31 │ Bits 30-16 (15 bits) │ Bits 15-0 (16 bits) │ │ Sign │ Integer Part │ Fractional Part │ ├────────────────────────────────────────────────────────────────┤ │ Range: -32768.0 to +32767.99998 (approximately ±32768) │ │ Precision: 1/65536 ≈ 0.0000153 (about 4.8 decimal digits) │ │ Storage: int32_t │ └────────────────────────────────────────────────────────────────┘ Example representations: 1.0 = 0x00010000 = 65536 0.5 = 0x00008000 = 32768 -1.0 = 0xFFFF0000 = -65536 π ≈ 0x0003243F = 205887 (actual: 3.14159...) ``` - [ ] Part 1: Fixed-Point Type Definition and Basic Conversions Define the fixed-point type and conversion macros in C99. Note: Use symbolic constants (e.g., `FP_BITS`, `FP_ONE`) rather than literal values where macros are already defined. This ensures code maintainability if the Q-format changes. ```c #include <stdint.h> /* Q15.16 fixed-point type */ typedef int32_t fixed_t; /* Number of fractional bits */ #define FP_BITS __B01__ /* Scaling factor: 2^16 = 65536 */ #define FP_ONE (1LL << FP_BITS) #define FP_HALF (1LL << (FP_BITS - 1)) /* Integer to fixed-point conversion * Note: We use multiplication instead of left-shift to avoid * undefined behavior with negative values in C99. */ #define INT_TO_FP(i) ((fixed_t) ((int64_t) (i) * __B02__)) /* Float to fixed-point conversion */ #define FLOAT_TO_FP(f) ((fixed_t) ((f) * __B03__)) /* Fixed-point to integer (floors toward negative infinity for negatives) */ #define FP_TO_INT(f) ((f) >> __B04__) /* Fixed-point to float */ #define FP_TO_FLOAT(f) ((float) (f) / __B05__) ``` Fill in * B01 = ? (Hint: Examine the bit layout diagram—the fractional part spans which bit range?) * B02 = ? (Hint: Use the symbolic constant for the scaling factor, not a literal number) * B03 = ? (Hint: Same scaling factor applies for float conversion) * B04 = ? (Hint: To discard fractional bits, shift by the symbolic constant representing their count) * B05 = ? (Hint: Divide by the scaling factor to convert back to float) - [ ] Part 2: Fixed-Point Multiplication When multiplying two Q15.16 numbers, we must handle the intermediate result carefully. Note: Use the standard C99 fixed-width type name for 64-bit signed integers. ``` Multiplication Analysis: ┌───────────────────────────────────────────────────────────────┐ │ A = a × 2^16 (Q15.16 representation of value 'a') │ │ B = b × 2^16 (Q15.16 representation of value 'b') │ │ │ │ A × B = (a × 2^16) × (b × 2^16) │ │ = a × b × 2^32 │ │ │ │ But we want: (a × b) × 2^16 (Q15.16 result) │ │ Therefore: Result = (A × B) >> 16 │ │ │ │ Problem: A × B requires 64 bits to avoid overflow! 
│ │ 32-bit × 32-bit = up to 64-bit result │ └───────────────────────────────────────────────────────────────┘ ``` Complete the multiplication macro: ```c /* Fixed-point multiplication * Cast to int64_t to prevent overflow, then shift right to rescale */ #define FP_MUL(a, b) ((fixed_t) (((__B06__) (a) * (b)) >> __B07__)) ``` Example verification: ```c fixed_t x = FLOAT_TO_FP(2.5); // x = 163840 (0x28000) fixed_t y = FLOAT_TO_FP(1.5); // y = 98304 (0x18000) fixed_t z = FP_MUL(x, y); // z = 245760 (0x3C000) = 3.75 in Q15.16 ``` Why is the 64-bit cast necessary? __ B08 __ a. The compiler requires explicit type hints for optimization b. Without it, 32×32 multiplication would overflow before the shift c. It forces the result into a signed type for correct rounding d. It reserves additional stack memory for the operation Fill in * B06 = ? (Hint: What 64-bit signed integer type prevents overflow during 32×32 multiplication?) * B07 = ? (Hint: After 64-bit multiplication, shift right by how many bits to return to Q15.16 format?) * B08 = ? (Hint: Consider what happens to a 64-bit product if only 32 bits are kept before shifting) - [ ] Part 3: Fixed-Point Division Division requires scaling the dividend before the divide operation: ``` Division Analysis: ┌───────────────────────────────────────────────────────────────┐ │ A = a × 2^16 (Q15.16 representation of value 'a') │ │ B = b × 2^16 (Q15.16 representation of value 'b') │ │ │ │ A / B = (a × 2^16) / (b × 2^16) │ │ = a / b (fractional bits cancel!) │ │ │ │ But we want: (a / b) × 2^16 (Q15.16 result) │ │ Therefore: Result = (A × 2^16) / B = (A << 16) / B │ │ │ │ Problem: A << 16 requires 48+ bits to avoid overflow! │ └───────────────────────────────────────────────────────────────┘ ``` Complete the division macro with zero-guard: ```c /* Fixed-point division with divide-by-zero protection */ #define FP_DIV(a, b) \ ((b) == 0 ? 0 : (fixed_t) (((__B09__) (a) * FP_ONE) / (b))) ``` Fill in * B09 = ? (Hint: The dividend needs 48+ bits after scaling. What type accommodates this?) - [ ] Part 4: Precision Loss Analysis Consider a 3D graphics scenario: computing the depth buffer value for perspective-correct interpolation. ```c /* Compute 1/z for depth buffering (z is distance from camera) */ fixed_t compute_inverse_z(fixed_t z) { /* z is in range [near_plane, far_plane], typically [1.0, 1000.0] */ return FP_DIV(FP_ONE, z); // Returns 1/z in Q15.16 } ``` Precision analysis for different z values: ``` ┌─────────────────────────────────────────────────────────────────┐ │ z value │ Exact 1/z │ Fixed-point 1/z │ Relative Error │ ├───────────┼────────────┼──────────────────┼─────────────────────┤ │ 1.0 │ 1.0 │ 65536/65536=1.0 │ 0% │ │ 10.0 │ 0.1 │ 6553/65536 │ 0.01% │ │ 100.0 │ 0.01 │ 655/65536 │ 0.05% │ │ 1000.0 │ 0.001 │ 65/65536 │ 0.82% │ │ 10000.0 │ 0.0001 │ 6/65536 │ 8.45% │ └─────────────────────────────────────────────────────────────────┘ Note: Relative error = |actual - expected| / expected × 100% - z=10: |6553/65536 - 0.1| / 0.1 = |0.09999 - 0.1| / 0.1 ≈ 0.01% - z=1000: |65/65536 - 0.001| / 0.001 = |0.000992 - 0.001| / 0.001 ≈ 0.82% ``` What happens as z increases toward the far plane? __ B10 __ a. Precision improves because larger numbers are more stable b. Precision degrades because fewer fractional bits represent the small result c. Precision remains constant due to the fixed 16-bit fractional part d. The result becomes negative due to sign bit overflow Fill in * B10 = ? 
(Hint: As 1/z gets smaller, how many of the 16 fractional bits can represent meaningful precision?) - [ ] Part 5: Trigonometric Functions for 3D Rotation For rotation matrices, we need sine and cosine. Instead of lookup tables, we use the Bhaskara I approximation (7th century Indian mathematician): ``` Bhaskara I Sine Approximation: ┌───────────────────────────────────────────────────────────────┐ │ For x in [0, π]: │ │ │ │ 16x(π - x) │ │ sin(x) ≈ ───────────────── │ │ 5π² - 4x(π - x) │ │ │ │ Maximum error: ~0.3% (sufficient for real-time graphics) │ └───────────────────────────────────────────────────────────────┘ ``` Complete the fixed-point sine implementation: ```c /* Fixed-point constants */ #define FP_PI 205887 /* π in Q15.16 ≈ 3.14159 */ #define FP_TWO_PI 411775 /* 2π in Q15.16 */ /* Reduce angle to [0, 2π) range */ static inline fixed_t angle_normalize(fixed_t angle) { /* Handle negative angles */ while (angle < 0) angle += FP_TWO_PI; /* Reduce to [0, 2π) */ while (angle >= FP_TWO_PI) angle -= FP_TWO_PI; return angle; } /* Bhaskara I sine approximation for angle in [0, π] */ static fixed_t fp_sin_positive(fixed_t x) { /* x_pi_minus_x = x(π - x) */ fixed_t pi_minus_x = FP_PI - x; fixed_t x_pi_minus_x = FP_MUL(x, pi_minus_x); /* numerator = 16 * x(π - x) */ int64_t numerator = (int64_t) x_pi_minus_x << 4; /* multiply by 16 */ /* denominator = 5π² - 4x(π - x) */ int64_t denom = (int64_t) FP_MUL(FP_PI, FP_PI) * 5 - ((int64_t) x_pi_minus_x << 2); /* multiply by 4 */ if (denom == 0) return 0; return (fixed_t) ((numerator << FP_BITS) / denom); } /* Full sine function for any angle */ fixed_t fp_sin(fixed_t angle) { angle = angle_normalize(angle); if (angle <= FP_PI) { /* First half: [0, π] - use approximation directly */ return __B11__; } else { /* Second half: (π, 2π) - use sin(x) = -sin(x - π) */ return __B12__; } } /* Cosine using phase shift: cos(x) = sin(x + π/2) */ fixed_t fp_cos(fixed_t angle) { /* π/2 in Q15.16 */ #define FP_PI_HALF 102944 return fp_sin(__B13__); } ``` Fill in * B11 = ? (Hint: The approximation is valid for [0, π]. What expression uses it without modification?) * B12 = ? (Hint: Use the symmetry of sine to map angles in (π, 2π) back to the [0, π] range. Consider the sign and offset needed) * B13 = ? (Hint: How are cosine and sine related through a phase shift?) - [ ] Part 6: 3D Vector Operations Complete the fixed-point 3D vector operations used in transformation matrices: ```c typedef struct { fixed_t x, y, z; } vec3_fp_t; /* Vector dot product: a · b = ax*bx + ay*by + az*bz */ fixed_t vec3_dot(const vec3_fp_t *a, const vec3_fp_t *b) { /* Accumulate in 64-bit to prevent overflow */ int64_t sum = 0; sum += (int64_t) a->x * b->x; sum += (int64_t) a->y * b->y; sum += (int64_t) a->z * b->z; return (fixed_t) (sum >> __B14__); } /* Vector length squared: |v|² = x² + y² + z² */ fixed_t vec3_length_sq(const vec3_fp_t *v) { return vec3_dot(__B15__, __B16__); } /* Normalize vector to unit length */ void vec3_normalize(vec3_fp_t *v) { fixed_t len_sq = vec3_length_sq(v); if (len_sq == 0) return; /* Compute 1/length using integer square root and division */ fixed_t len = fp_sqrt(len_sq); /* Assume fp_sqrt is implemented */ v->x = FP_DIV(v->x, __B17__); v->y = FP_DIV(v->y, len); v->z = FP_DIV(v->z, len); } ``` Fill in * B14 = ? (Hint: How does multiplying two Q15.16 values affect the position of the radix point?) * B15 = ? (Hint: Length squared is v·v. What should the first argument be?) * B16 = ? (Hint: For dot product with itself, both arguments are the same vector) * B17 = ? 
(Hint: What scalar makes a vector unit length?) - [ ] Part 7: Matrix-Vector Multiplication for 3D Transformation In 3D graphics, we transform vertices using 4×4 matrices. For efficiency, we implement 3×3 rotation followed by translation: ```c typedef struct { fixed_t m[3][3]; /* 3×3 rotation/scale matrix */ vec3_fp_t t; /* Translation vector */ } transform_t; /* Transform a 3D point: result = M * v + t */ void transform_point(const transform_t *xform, const vec3_fp_t *in, vec3_fp_t *out) { /* Row 0: out->x = m[0][0]*in->x + m[0][1]*in->y + m[0][2]*in->z + t.x */ int64_t rx = (int64_t) xform->m[0][0] * in->x + (int64_t) xform->m[0][1] * in->y + (int64_t) xform->m[0][2] * in->z; out->x = (fixed_t) (rx >> FP_BITS) + xform->t.__B18__; /* Row 1 */ int64_t ry = (int64_t) xform->m[1][0] * in->x + (int64_t) xform->m[1][1] * in->y + (int64_t) xform->m[1][2] * in->z; out->y = (fixed_t) (ry >> FP_BITS) + xform->t.y; /* Row 2 */ int64_t rz = (int64_t) xform->m[2][0] * in->x + (int64_t) xform->m[2][1] * in->y + (int64_t) xform->m[2][2] * in->z; out->z = (fixed_t) (rz >> FP_BITS) + xform->t.__B19__; } ``` Fill in * B18 = ? (Hint: The translation vector component added to out->x should be t.?) * B19 = ? (Hint: The translation vector component added to out->z should be t.?) - [ ] Part 8: Fixed-Point Square Root for Distance Calculations Computing distances requires square root. We use an integer square root algorithm adapted for fixed-point: ```c /* Integer square root using Newton-Raphson method * Note: Must accept 64-bit input since scaled values exceed 32 bits */ static uint32_t isqrt64(uint64_t n) { if (n == 0) return 0; uint64_t x = n; uint64_t y = (x + 1) >> 1; while (y < x) { x = y; y = (x + n / x) >> 1; } return (uint32_t) x; /* Result fits in 32 bits for our input range */ } /* Fixed-point square root * Input: x in Q15.16 (represents value v where x = v * 2^16) * Output: sqrt(v) in Q15.16 * * Analysis: * If x = v * 2^16, then sqrt(v) = sqrt(x / 2^16) = sqrt(x) / 2^8 * But we want result in Q15.16: sqrt(v) * 2^16 = sqrt(x) * 2^8 * * Better approach: Scale input by 2^16 first: * sqrt(x * 2^16) = sqrt(x) * 2^8 * Then result is already in Q15.16 format! * * Note: x << 16 can be up to 48 bits, so we need 64-bit isqrt. */ fixed_t fp_sqrt(fixed_t x) { if (x <= 0) return 0; /* Scale to maintain precision: sqrt(x << 16) */ uint64_t scaled = (uint64_t) x << __B20__; return (fixed_t) isqrt64(scaled); } ``` Why do we shift the input left by 16 bits before taking the square root? __ B21 __ a. To convert from Q15.16 to integer format for the isqrt function b. To double the precision of the result c. To ensure the square root result is properly scaled to Q15.16 format d. To prevent negative numbers from causing undefined behavior Fill in * B20 = ? (Hint: Consider how sqrt affects the scaling factor of a fixed-point value) * B21 = ? (Hint: Consider what format the result needs to be in and how scaling the input affects sqrt output) - [ ] Part 9: Floating-Point to Fixed-Point Migration Validation When converting a floating-point 3D renderer to fixed-point, we must validate accuracy. 
Complete this test harness: ```c #include <math.h> #include <stdio.h> /* Maximum acceptable error for graphics (sub-pixel precision) */ #define MAX_ERROR_THRESHOLD 0.01f /* 1% relative error */ /* Test case structure */ typedef struct { const char *name; float input; float expected; fixed_t (*fp_func)(fixed_t); float (*float_func)(float); } test_case_t; /* Compute relative error */ float relative_error(float expected, float actual) { if (expected == 0.0f) return (actual == 0.0f) ? 0.0f : 1.0f; return fabsf((actual - expected) / __B22__); } /* Validate fixed-point against floating-point */ int validate_accuracy(void) { int errors = 0; /* Test sine at various angles */ float test_angles[] = {0.0f, 0.5f, 1.0f, 1.5707f, 3.14159f}; int n_angles = sizeof(test_angles) / sizeof(test_angles[0]); for (int i = 0; i < n_angles; i++) { float angle = test_angles[i]; /* Compute using both methods */ float fp_result = FP_TO_FLOAT(fp_sin(FLOAT_TO_FP(angle))); float exact = sinf(__B23__); float error = relative_error(exact, fp_result); if (error > MAX_ERROR_THRESHOLD) { printf("FAIL: sin(%.4f) = %.6f, got %.6f, error %.2f%%\n", angle, exact, fp_result, error * __B24__); errors++; } } /* Test multiplication precision */ fixed_t a = FLOAT_TO_FP(123.456f); fixed_t b = FLOAT_TO_FP(0.789f); fixed_t c = FP_MUL(a, b); float expected_mul = 123.456f * 0.789f; float actual_mul = FP_TO_FLOAT(c); float mul_error = relative_error(expected_mul, actual_mul); if (mul_error > MAX_ERROR_THRESHOLD) { printf("FAIL: 123.456 * 0.789 = %.6f, got %.6f\n", expected_mul, actual_mul); errors++; } return errors; } ``` Fill in * B22 = ? (Hint: Relative error = |actual - expected| / |?|) * B23 = ? (Hint: Pass the same angle to sinf() for the reference calculation) * B24 = ? (Hint: To convert a decimal fraction to percentage, multiply by ?) - [ ] Part 10: Overflow Detection and Saturation In safety-critical graphics code, we implement saturating arithmetic to prevent wrap-around: ```c #define FP_MAX ((fixed_t) 0x7FFFFFFF) /* Maximum positive Q15.16 */ #define FP_MIN ((fixed_t) 0x80000000) /* Minimum negative Q15.16 */ /* Saturating addition: clamp result to [FP_MIN, FP_MAX] */ fixed_t fp_add_sat(fixed_t a, fixed_t b) { int64_t result = (int64_t) a + (int64_t) b; if (result > __B25__) return FP_MAX; if (result < __B26__) return FP_MIN; return (fixed_t) result; } /* Saturating multiplication */ fixed_t fp_mul_sat(fixed_t a, fixed_t b) { int64_t product = (int64_t) a * (int64_t) b; int64_t result = product >> FP_BITS; if (result > FP_MAX) return __B27__; if (result < FP_MIN) return FP_MIN; return (fixed_t) result; } ``` What is the primary trade-off of saturating arithmetic compared to wrapping? __ B28 __ a. Saturating is slower but prevents visual artifacts from sign-flips b. Saturating uses more memory but is always mathematically correct c. Saturating requires FPU support but is more precise d. Saturating is faster but loses precision at extreme values Fill in * B25 = ? (Hint: Check overflow against the maximum representable value) * B26 = ? (Hint: Check underflow against the minimum representable value) * B27 = ? (Hint: When overflow occurs, return the maximum value instead of wrapping) * B28 = ? (Hint: Consider speed vs. correctness trade-offs when values exceed representable range) ### ✅ Solution to Problem B: B01: ==16== Explanation: Q15.16 format uses 16 bits for the fractional part (bits 0-15). B02: ==FP_ONE== Explanation: To convert an integer to fixed-point, multiply by the scaling factor `FP_ONE` (2^16 = 65536). 
B03: ==FP_ONE== Explanation: Same scaling factor applies for float-to-fixed conversion. The float is multiplied by 65536. B04: ==FP_BITS== Explanation: Right-shift by 16 bits (FP_BITS) extracts the integer part, discarding the fractional bits. Note: For negative values, arithmetic right-shift floors toward negative infinity (e.g., -1.5 becomes -2), not truncates toward zero. B05: ==FP_ONE== Explanation: Divide by 65536 to convert back to floating-point representation. B06: ==int64_t== Explanation: Cast to 64-bit signed integer to hold the 32×32 multiplication result without overflow. B07: ==FP_BITS== Explanation: After 64-bit multiplication, shift right by 16 bits to rescale back to Q15.16 format. B08: ==b== Explanation: Without the 64-bit cast, a 32-bit × 32-bit multiplication can overflow (produce up to 64 bits), but only 32 bits would be kept, losing the upper bits before the right shift can rescale the result. B09: ==int64_t== Explanation: The dividend must be scaled up by 2^16 before division. This requires 48+ bits (32 + 16), so we use `int64_t`. B10: ==b== Explanation: As z increases, 1/z becomes a very small fraction. With only 16 fractional bits, representing 0.001 would require 65.536, but we can only store 65, introducing ~0.82% quantization error. At z=10000, the error grows to ~8.45%. Larger z values produce smaller results with fewer significant bits. B11: ==fp_sin_positive(angle)== Explanation: For angles in [0, π], the Bhaskara approximation is used directly. B12: ==-fp_sin_positive(angle - FP_PI)== Explanation: For angles in (π, 2π), use the identity sin(x) = -sin(x - π). Subtract π and negate the result. B13: ==angle + FP_PI_HALF== Explanation: The identity cos(x) = sin(x + π/2) converts cosine to sine with a 90° phase shift. B14: ==FP_BITS== Explanation: The dot product accumulates three Q15.16 × Q15.16 products (each Q30.32), then right-shifts by 16 to return to Q15.16 format. B15: ==v== Explanation: Length squared is the dot product of a vector with itself: v · v. B16: ==v== Explanation: Both arguments are the same vector pointer for computing |v|². B17: ==len== Explanation: To normalize, divide each component by the vector's length. B18: ==x== Explanation: The x-component of the translation vector is added to the x-component of the transformed point. B19: ==z== Explanation: The z-component of the translation vector is added to the z-component of the transformed point. B20: ==FP_BITS== Explanation: Shifting left by 16 bits scales the input so that after taking the square root, the result is properly in Q15.16 format. If input x represents value v (x = v × 2^16), then sqrt(x × 2^16) = sqrt(v × 2^32) = sqrt(v) × 2^16, which is the Q15.16 representation of sqrt(v). Note: The scaled value can be up to 48 bits, so `isqrt64` must accept a 64-bit input. B21: ==c== Explanation: The left shift ensures the square root result is automatically in Q15.16 format. Without this scaling, we would get sqrt(x)/2^8 instead of the correctly formatted fixed-point result. B22: ==expected== Explanation: Relative error is |actual - expected| / |expected|, comparing the difference to the expected value's magnitude. B23: ==angle== Explanation: Pass the same angle to `sinf()` for the reference floating-point calculation. B24: ==100== Explanation: Multiply by 100 to convert the fractional error (e.g., 0.01) to a percentage (1%). B25: ==FP_MAX== Explanation: If the 64-bit result exceeds the maximum 32-bit signed value (0x7FFFFFFF), saturate to FP_MAX. 
B26: ==FP_MIN== Explanation: If the result is less than the minimum 32-bit signed value (0x80000000), saturate to FP_MIN. B27: ==FP_MAX== Explanation: When overflow is detected, return the maximum representable value instead of wrapping around. B28: ==a== Explanation: Saturating arithmetic requires extra comparisons (slower), but prevents the catastrophic visual artifacts that occur when large positive values wrap to negative (causing geometry to flip inside-out or colors to invert). Fixed-Point Accuracy Summary: | Operation | Typical Error | Graphics Impact | |-----------|---------------|-----------------| | Basic arithmetic | < 0.002% | Negligible | | Multiplication | < 0.01% | Sub-pixel | | Division | 0.05-0.1% | Acceptable | | Sine/Cosine | ~0.3% | Barely visible | | Square root | < 0.1% | Minimal | | Deep z-buffer (z>1000) | 0.8-8.5% | Use log-depth or higher precision | Key Design Decisions in Fixed-Point 3D Rendering: 1. Q15.16 vs Q16.15: Q15.16 sacrifices one bit of integer range for doubled fractional precision. For normalized coordinates [-1, 1], this is optimal. 2. 64-bit intermediates: Always use `int64_t` for multiplication/division to prevent overflow. The extra computation cost is negligible compared to visual artifacts from overflow. 3. Bhaskara approximation: The ~0.3% sine error is imperceptible in real-time graphics, where frame-to-frame variation masks small inaccuracies. 4. Saturation vs. wrapping: Saturation adds branches but prevents sign-flip artifacts that would otherwise cause geometry to "explode" when coordinates overflow. --- ## Problem C: Cache Optimization and Branch Prediction in Software 3D Rendering This problem explores cache locality optimization and branch prediction behavior in the context of the fixed-point 3D renderer from Problem B. Understanding these microarchitectural interactions is critical for achieving real-time performance on embedded systems without hardware acceleration. Context: You are optimizing a software 3D renderer running on an RV32IM processor. The renderer transforms vertices through model-view-projection matrices and rasterizes triangles to a framebuffer with depth testing. 
System Configuration: * Processor: 5-stage RISC-V pipeline (IF, ID, EX, MEM, WB) * L1 Data Cache: 8 KiB, 2-Way Set Associative, 64-byte block size, LRU replacement - Total Size: 8192 bytes - Block Size: 64 bytes * Memory Latency: Hit = 1 cycle, Miss Penalty = 40 cycles * Fixed-Point Format: Q15.16 (32-bit, from Problem B) ``` Software 3D Rendering Pipeline Memory Access Patterns: ════════════════════════════════════════════════════════════════════════ ┌────────────────────────────────────────────────────────────────────┐ │ VERTEX PROCESSING STAGE │ │ ───────────────────────────────────────────────────────────── │ │ Input: Vertex Buffer (positions, normals, UVs) │ │ Operations: Matrix × Vector (transform_point from Problem B) │ │ Access Pattern: Sequential vertices, strided attributes │ │ Cache Challenge: AoS vs SoA layout affects spatial locality │ └──────────────────────────────┬─────────────────────────────────────┘ ▼ ┌────────────────────────────────────────────────────────────────────┐ │ RASTERIZATION STAGE │ │ ───────────────────────────────────────────────────────────── │ │ Input: Transformed triangles, Framebuffer, Depth buffer │ │ Operations: Scanline iteration, depth comparison, pixel writes │ │ Access Pattern: Row-major framebuffer, random depth reads │ │ Cache Challenge: Working set size vs cache capacity │ └──────────────────────────────┬─────────────────────────────────────┘ ▼ ┌────────────────────────────────────────────────────────────────────┐ │ PIXEL SHADING STAGE │ │ ───────────────────────────────────────────────────────────── │ │ Input: Interpolated attributes, Texture data │ │ Operations: Lighting (vec3_dot), texture sampling │ │ Access Pattern: Dependent on triangle coverage │ │ Cache Challenge: Texture access locality, branch divergence │ └────────────────────────────────────────────────────────────────────┘ ``` - [ ] Part 1: Cache Geometry Analysis Calculate the cache parameters for the L1 Data Cache: * Number of cache sets: $\text{Sets} = \frac{\text{Total Size}}{\text{Associativity} \times \text{Block Size}} = \frac{8192}{2 \times 64}$ > Answer: __ C01 __ sets (Hint: Apply the formula shown above) * Number of index bits (to select which set): > Answer: __ C02 __ bits (Hint: log₂ of the number of sets) * Number of block offset bits (to select byte within block): > Answer: __ C03 __ bits (Hint: log₂ of the block size in bytes) * How many Q15.16 fixed-point values (`fixed_t`, 4 bytes each) fit in one cache block? > Answer: __ C04 __ values (Hint: block size / bytes per value) - [ ] Part 2: Vertex Buffer Layout - AoS vs SoA A 3D model consists of vertices with position (x, y, z), normal (nx, ny, nz), and texture coordinates (u, v). 
Consider two memory layouts: Layout 1: Array of Structures (AoS) ```c /* Each vertex is a contiguous structure */ typedef struct { fixed_t x, y, z; /* Position: 12 bytes */ fixed_t nx, ny, nz; /* Normal: 12 bytes */ fixed_t u, v; /* Texture coords: 8 bytes */ } vertex_aos_t; /* Total: 32 bytes per vertex */ vertex_aos_t vertices_aos[1024]; /* 32 KiB total */ ``` Layout 2: Structure of Arrays (SoA) ```c /* Attributes stored in separate arrays */ typedef struct { fixed_t x[1024]; /* All X positions: 4 KiB */ fixed_t y[1024]; /* All Y positions: 4 KiB */ fixed_t z[1024]; /* All Z positions: 4 KiB */ fixed_t nx[1024]; /* All X normals: 4 KiB */ fixed_t ny[1024]; /* All Y normals: 4 KiB */ fixed_t nz[1024]; /* All Z normals: 4 KiB */ fixed_t u[1024]; /* All U coords: 4 KiB */ fixed_t v[1024]; /* All V coords: 4 KiB */ } vertex_soa_t; /* Total: 32 KiB */ vertex_soa_t vertices_soa; ``` Operation: Transform all vertex positions using `transform_point()` from Problem B. ```c /* AoS version: Transform positions */ void transform_positions_aos(const transform_t *xform, vertex_aos_t *verts, int n) { for (int i = 0; i < n; i++) { vec3_fp_t in = {verts[i].x, verts[i].y, verts[i].z}; vec3_fp_t out; transform_point(xform, &in, &out); verts[i].x = out.x; verts[i].y = out.y; verts[i].z = out.z; } } /* SoA version: Transform positions */ void transform_positions_soa(const transform_t *xform, vertex_soa_t *verts, int n) { for (int i = 0; i < n; i++) { vec3_fp_t in = {verts->x[i], verts->y[i], verts->z[i]}; vec3_fp_t out; transform_point(xform, &in, &out); verts->x[i] = out.x; verts->y[i] = out.y; verts->z[i] = out.z; } } ``` Memory Access Analysis for AoS (vertex_aos_t): ``` Vertex 0: bytes [0:31] → Cache block 0 (bytes 0-63) Vertex 1: bytes [32:63] → Cache block 0 (same block!) Vertex 2: bytes [64:95] → Cache block 1 Vertex 3: bytes [96:127] → Cache block 1 Each cache block (64 bytes) contains exactly 2 complete vertices. For position-only access (x, y, z = 12 bytes per vertex): - Load vertex 0: MISS, loads block with vertices 0-1 - Access position fields: 12 of 64 bytes used (18.75% utilization) - The normal and UV data (20 bytes) are loaded but not used ``` Memory Access Analysis for SoA (vertex_soa_t): ``` x[0:15]: bytes [0:63] → Cache block 0 (16 x-values) x[16:31]: bytes [64:127] → Cache block 1 (16 x-values) y[0:15]: bytes [4096:4159] → Different cache region Each cache block contains 16 consecutive values of ONE attribute. For position-only access: - Load x[0]: MISS, loads x[0:15] - Load y[0]: MISS, loads y[0:15] - Load z[0]: MISS, loads z[0:15] - Access x[1:15], y[1:15], z[1:15]: All HITs (spatial locality) ``` For position-only transformation (x, y, z), which layout has better cache utilization? __ C05 __ (Hint: Consider which layout loads only the data you need vs. loading unused attributes) a. AoS - because each vertex is contiguous b. SoA - because each attribute array is contiguous c. Both are equivalent d. Depends on cache associativity In the AoS layout, how many bytes per vertex are loaded but NOT used during position-only transformation? __ C06 __ (Hint: Each vertex is 32 bytes total. Position uses 12 bytes. What's left?) a. 0 bytes b. 12 bytes c. 20 bytes d. 32 bytes For SoA with 1024 vertices, how many compulsory cache misses occur for reading all x, y, z positions? 
(Assume cold cache) * x array: 1024 values ÷ 16 values/block = 64 misses * y array: 64 misses * z array: 64 misses > Answer: __ C07 __ total misses (Hint: Sum the misses for all three arrays) - [ ] Part 3: Matrix-Vector Multiplication Cache Behavior The `transform_point()` function from Problem B performs: ```c out->x = (rx >> FP_BITS) + xform->t.x; out->y = (ry >> FP_BITS) + xform->t.y; out->z = (rz >> FP_BITS) + xform->t.z; ``` Where `rx`, `ry`, `rz` each require 3 multiplications and 2 additions: ```c int64_t rx = (int64_t) xform->m[0][0] * in->x + (int64_t) xform->m[0][1] * in->y + (int64_t) xform->m[0][2] * in->z; ``` The `transform_t` structure occupies: ```c typedef struct { fixed_t m[3][3]; /* 9 × 4 = 36 bytes */ vec3_fp_t t; /* 3 × 4 = 12 bytes */ } transform_t; /* Total: 48 bytes (fits in one cache block) */ ``` Temporal Locality Analysis: ``` When transforming N vertices with the SAME transform matrix: - First vertex: Load xform (1 cache miss for 48-byte struct) - Vertices 2 to N: xform remains in cache (temporal reuse) Per vertex, transform_point accesses: - 9 matrix elements (m[0][0..2], m[1][0..2], m[2][0..2]) - 3 translation elements (t.x, t.y, t.z) - Total: 12 element accesses per vertex For N=1024 vertices: 12 × 1024 = 12,288 total element accesses But only 1 cache block loaded (first access misses, rest hit). ``` If we transform 1024 vertices with a single transformation matrix, how many vertices benefit from the matrix being cached after the first vertex loads it? __ C08 __ (Hint: The first vertex causes a cache miss. How many subsequent vertices reuse that cached data?) a. 0 vertices (no reuse, each vertex reloads) b. 1023 vertices (all except the first) c. 512 vertices (half due to cache pressure) d. 1024 vertices (including the first) - [ ] Part 4: Depth Buffer and Branch Prediction During rasterization, each pixel undergoes a depth test before being written: ```c /* Depth buffer: stores 1/z for each pixel (Q15.16 format) */ fixed_t depth_buffer[SCREEN_HEIGHT][SCREEN_WIDTH]; /* e.g., 320×240 */ /* Framebuffer: RGB565 format (16 bits per pixel) */ uint16_t framebuffer[SCREEN_HEIGHT][SCREEN_WIDTH]; /* Rasterize a single triangle scanline */ void rasterize_scanline(int y, int x_start, int x_end, fixed_t z_start, fixed_t z_step, uint16_t color) { fixed_t z = z_start; for (int x = x_start; x < x_end; x++) { /* Depth test: closer pixels have LARGER 1/z values */ if (z > depth_buffer[y][x]) { /* Pixel passes depth test - update buffers */ depth_buffer[y][x] = z; framebuffer[y][x] = color; } z += z_step; /* Interpolate depth across scanline */ } } ``` Branch Prediction Analysis: The depth test `if (z > depth_buffer[y][x])` exhibits different branch patterns depending on the scene: Scenario A - First triangle on clear screen (depth_buffer initialized to 0): ``` All pixels pass the depth test (z > 0 always true for visible geometry) Branch pattern: TAKEN, TAKEN, TAKEN, TAKEN, ... (100% taken) ``` Scenario B - Occluded triangle behind existing geometry: ``` All pixels fail the depth test (new z < existing z) Branch pattern: NOT-TAKEN, NOT-TAKEN, NOT-TAKEN, ... (0% taken) ``` Scenario C - Intersecting triangles: ``` Some pixels pass, some fail (mixed pattern) Branch pattern: TAKEN, NOT-TAKEN, TAKEN, TAKEN, NOT-TAKEN, ... ``` Using a 2-bit saturating counter (initial state = Weakly-Not-Taken, state 1): For Scenario A (all taken), processing 8 pixels: Starting state: 1 (Weakly-Not-Taken, predicts NOT-TAKEN) | Pixel | Prediction | Actual | Match? 
| New State | |:-----:|:----------:|:------:|:------:|:---------:| | 1 | NOT-TAKEN | Taken | ❌ | 2 | | 2 | TAKEN | Taken | ✓ | 3 | | 3 | TAKEN | Taken | ✓ | 3 | | 4 | TAKEN | Taken | ✓ | 3 | | 5 | TAKEN | Taken | ✓ | 3 | | 6 | TAKEN | Taken | ✓ | 3 | | 7 | TAKEN | Taken | ✓ | 3 | | 8 | TAKEN | Taken | ✓ | 3 | Mispredictions in Scenario A: __ C09 __ (Hint: Count how many ❌ marks appear in the table above) For Scenario B (all not-taken), processing 8 pixels starting from state 1: | Pixel | Prediction | Actual | Match? | New State | |:-----:|:----------:|:------:|:------:|:---------:| | 1 | NOT-TAKEN | Not-taken | ✓ | 0 | | 2 | NOT-TAKEN | Not-taken | ✓ | 0 | | ... | NOT-TAKEN | Not-taken | ✓ | 0 | Mispredictions in Scenario B: __ C10 __ (Hint: State 1 predicts NOT-TAKEN. Does this ever mismatch?) For Scenario C with pattern [T, N, T, T, N, T, N, N] starting from state 1: (T = Taken, N = Not-taken) > Mispredictions in Scenario C: __ C11 __ (Hint: Trace through the 2-bit saturating counter for each branch outcome) Which scenario causes the most branch mispredictions? __ C12 __ (Hint: Mixed patterns cause frequent state transitions in the 2-bit counter) a. Scenario A (all taken) b. Scenario B (all not-taken) c. Scenario C (mixed pattern) d. All scenarios have equal mispredictions - [ ] Part 5: Cache Blocking for Scanline Rasterization Consider a 320×240 framebuffer with 16-bit pixels (RGB565): ``` Framebuffer size: 320 × 240 × 2 bytes = 153,600 bytes (150 KiB) Depth buffer size: 320 × 240 × 4 bytes = 307,200 bytes (300 KiB) Combined working set: 450 KiB >> 8 KiB L1 cache ``` Problem: The combined framebuffer and depth buffer exceed the L1 cache by 56×. Naive Rasterization Access Pattern: ``` For each triangle: For each scanline y in triangle: For each pixel x in scanline: Read depth_buffer[y][x] ← Likely cache miss Write depth_buffer[y][x] ← Same location (hit after miss) Write framebuffer[y][x] ← Different address (likely miss) ``` Cache Miss Analysis for Full-Screen Triangle: Each scanline of the framebuffer: * 320 pixels × 2 bytes = 640 bytes = 10 cache blocks Each scanline of the depth buffer: * 320 pixels × 4 bytes = 1280 bytes = 20 cache blocks For a single scanline (cold cache): * Depth buffer reads: 20 misses (320 pixels ÷ 16 pixels per block) * Framebuffer writes: 10 misses (320 pixels ÷ 32 pixels per block) Cache Blocking Optimization (Tile-Based Rendering): ```c #define TILE_WIDTH 32 #define TILE_HEIGHT 32 /* Tile-based depth buffer (fits in cache) */ /* 32 × 32 × 4 = 4096 bytes = 64 cache blocks */ void rasterize_tiled(triangle_t *triangles, int n_triangles) { /* Process screen in tiles */ for (int tile_y = 0; tile_y < SCREEN_HEIGHT; tile_y += TILE_HEIGHT) { for (int tile_x = 0; tile_x < SCREEN_WIDTH; tile_x += TILE_WIDTH) { /* For each tile, process ALL triangles that overlap it */ for (int t = 0; t < n_triangles; t++) { if (triangle_overlaps_tile(&triangles[t], tile_x, tile_y)) { rasterize_triangle_to_tile(&triangles[t], tile_x, tile_y, TILE_WIDTH, TILE_HEIGHT); } } } } } ``` Tile Working Set Analysis: ``` Tile framebuffer: 32 × 32 × 2 = 2048 bytes = 32 cache blocks Tile depth buffer: 32 × 32 × 4 = 4096 bytes = 64 cache blocks Total tile working set: 6144 bytes = 96 cache blocks L1 cache capacity: 8192 bytes = 128 blocks Tile working set fits in cache with 32 blocks to spare! ``` How many cache blocks does a 32×32 tile's depth buffer require? 
> Answer: __ C13 __ blocks (Hint: Divide the tile depth buffer size by the cache block size) Does the tile working set (framebuffer + depth buffer) fit in the 8 KiB L1 cache? __ C14 __ (Hint: Total tile working set is 6144 bytes. Compare to cache size of 8192 bytes.) a. Yes, with room to spare b. Yes, exactly fits c. No, slightly exceeds capacity d. No, significantly exceeds capacity What is the primary benefit of tile-based rendering for cache performance? __ C15 __ (Hint: By processing all triangles for one tile before moving on, what type of cache reuse is achieved?) a. Reduces total memory bandwidth b. Enables parallel processing c. Converts capacity misses to hits through temporal locality d. Eliminates compulsory misses - [ ] Part 6: Prefetching and Memory Access Scheduling Modern software renderers use explicit prefetching to hide memory latency. Consider this optimized scanline rasterizer: ```c /* Prefetch hint (compiler intrinsic or inline assembly) */ static inline void prefetch(const void *addr) { /* RISC-V doesn't have standard prefetch, but Zicbop extension adds it */ /* For this analysis, assume prefetch loads a cache block without stalling */ __asm__ __volatile__("prefetch.r (%0)" :: "r"(addr)); } void rasterize_scanline_prefetch(int y, int x_start, int x_end, fixed_t z_start, fixed_t z_step, uint16_t color) { fixed_t z = z_start; int prefetch_distance = 16; /* Prefetch 16 pixels ahead */ for (int x = x_start; x < x_end; x++) { /* Issue prefetch for future depth buffer access */ if (x + prefetch_distance < x_end) { prefetch(&depth_buffer[y][x + prefetch_distance]); } /* Depth test (data should be in cache from earlier prefetch) */ if (z > depth_buffer[y][x]) { depth_buffer[y][x] = z; framebuffer[y][x] = color; } z += z_step; } } ``` Prefetch Distance Analysis: ``` Memory latency: 40 cycles (miss penalty) Loop body: ~10 cycles (assuming cache hits) To hide latency completely: prefetch_distance × cycles_per_iteration ≥ memory_latency prefetch_distance × 10 ≥ 40 prefetch_distance ≥ 4 pixels With prefetch_distance = 16: - Prefetch issued 160 cycles before data needed - Memory latency (40 cycles) fully hidden - Extra margin for timing variations ``` If the loop body takes 10 cycles and memory latency is 40 cycles, what is the minimum prefetch distance to hide latency? __ C16 __ (Hint: prefetch_distance × cycles_per_iteration ≥ memory_latency) a. 2 pixels b. 4 pixels c. 8 pixels d. 16 pixels What happens if the prefetch distance is too small? __ C17 __ (Hint: If the prefetch doesn't have enough lead time, what happens when we need the data?) a. Data arrives too late, causing stalls b. Cache pollution from unused prefetches c. Increased power consumption d. Branch mispredictions What happens if the prefetch distance is too large? __ C18 __ (Hint: Consider what might happen to prefetched data in a limited-size cache before it's used) a. Data arrives too late, causing stalls b. Prefetched data may be evicted before use c. No negative effect d. 
Memory bandwidth is reduced - [ ] Part 7: Lighting Calculation Cache Efficiency The Phong lighting model from Problem B requires multiple dot products: ```c /* Phong lighting calculation */ fixed_t calculate_lighting(const vec3_fp_t *normal, const vec3_fp_t *light_dir, const vec3_fp_t *view_dir, fixed_t ambient, fixed_t diffuse_coef, fixed_t specular_coef, int shininess) { /* Diffuse: N · L */ fixed_t n_dot_l = vec3_dot(normal, light_dir); if (n_dot_l < 0) n_dot_l = 0; /* Clamp negative */ fixed_t diffuse = FP_MUL(diffuse_coef, n_dot_l); /* Specular: (R · V)^shininess where R = 2(N·L)N - L */ vec3_fp_t reflect; fixed_t two_n_dot_l = n_dot_l << 1; reflect.x = FP_MUL(two_n_dot_l, normal->x) - light_dir->x; reflect.y = FP_MUL(two_n_dot_l, normal->y) - light_dir->y; reflect.z = FP_MUL(two_n_dot_l, normal->z) - light_dir->z; fixed_t r_dot_v = vec3_dot(&reflect, view_dir); if (r_dot_v < 0) r_dot_v = 0; /* Approximate power function for specular */ fixed_t specular = r_dot_v; for (int i = 1; i < shininess; i++) { specular = FP_MUL(specular, r_dot_v); /* specular = r_dot_v^shininess */ } specular = FP_MUL(specular_coef, specular); return ambient + diffuse + specular; } ``` For per-pixel lighting across a triangle, the function is called once per visible pixel. Consider a 32×32 tile with 50% pixel coverage: ``` Pixels to light: 32 × 32 × 0.5 = 512 pixels Per-pixel inputs: normal (12 bytes), light_dir (12 bytes), view_dir (12 bytes) Total input data per pixel: 36 bytes ``` If `light_dir` and `view_dir` are constant across the tile (directional light, orthographic view), how many times are they loaded from memory vs reused from cache? * `light_dir`: Loaded once, reused __ C19 __ times (Hint: 512 total pixels - 1 initial load = ?) * `view_dir`: Loaded once, reused __ C19 __ times (same answer) The `normal` vector varies per-pixel (interpolated across triangle). If normals are stored in SoA layout: ```c /* Per-tile interpolated normals (SoA layout) */ fixed_t tile_nx[32][32]; /* 4 KiB */ fixed_t tile_ny[32][32]; /* 4 KiB */ fixed_t tile_nz[32][32]; /* 4 KiB */ ``` For 512 lit pixels with normals in SoA layout, how many cache misses occur for reading normals? (Assume normals are accessed in row-major order within the tile) Each row: 32 values = 128 bytes = 2 cache blocks 16 rows (50% coverage) × 2 blocks × 3 components = __ C20 __ misses (Hint: Multiply: 16 × 2 × 3) - [ ] Part 8: Optimization Trade-offs Summary Complete the following analysis comparing optimization strategies: | Optimization | Cache Benefit | Performance Cost | Best Use Case | |--------------|---------------|------------------|---------------| | SoA layout | Better spatial locality | Requires restructuring data | __ C21 __ | | Tile-based rendering | Working set fits in cache | Triangle list traversal overhead | Large scenes | | Prefetching | Hides memory latency | Instruction overhead | __ C22 __ | | Loop blocking | Temporal reuse | Code complexity | Matrix operations | Fill in the missing cells: C21 = ? (Hint: SoA avoids loading unused attributes. When is this most beneficial?) a. Small models with few vertices b. Large models with many vertices accessing few attributes c. Real-time animation with changing vertex data d. Static scenes with full attribute access C22 = ? (Hint: Prefetching requires knowing where to fetch from. What access pattern allows this?) a. Random access patterns b. Sequential access patterns with predictable stride c. Write-heavy workloads d. 
Computation-bound kernels ### ✅ Solution to Problem C: C01: ==64== Explanation: Sets = 8192 / (2 × 64) = 8192 / 128 = 64 sets. C02: ==6== Explanation: log₂(64) = 6 bits needed to address 64 sets. C03: ==6== Explanation: log₂(64) = 6 bits needed to address 64 bytes within a block. C04: ==16== Explanation: 64 bytes per block ÷ 4 bytes per fixed_t = 16 values. C05: ==b== Explanation: SoA provides better cache utilization for position-only transformation because the x, y, z arrays are contiguous, allowing full spatial reuse of each cache block. In AoS, loading a vertex brings in normal and UV data (20 bytes) that aren't used. C06: ==c== Explanation: In AoS, each 32-byte vertex contains: position (12 bytes, used) + normal (12 bytes, unused) + UV (8 bytes, unused) = 20 bytes unused per vertex loaded. C07: ==192== Explanation: 64 misses for x + 64 misses for y + 64 misses for z = 192 compulsory misses. Each array of 1024 elements spans 4096 bytes = 64 cache blocks. C08: ==b== Explanation: The transformation matrix (48 bytes) fits entirely within one 64-byte cache block. When processing the first vertex, the cache block is loaded (1 compulsory miss). For all subsequent 1023 vertices, the matrix data remains in cache and is accessed via cache hits. This is temporal locality—the same data structure reused across many iterations. C09: ==1== Explanation: Starting from state 1 (predicts NOT-TAKEN), the first Taken branch is a misprediction. After that, state transitions to 2 (predicts TAKEN), and all subsequent Taken branches are hits. C10: ==0== Explanation: Starting from state 1 (predicts NOT-TAKEN), all Not-taken branches match the prediction. The predictor is already in the correct state for this workload. C11: ==6== Explanation: Trace through the pattern [T, N, T, T, N, T, N, N] starting from state 1: | Pixel | State | Prediction | Actual | Match? | New State | |:-----:|:-----:|:----------:|:------:|:------:|:---------:| | 1 | 1 | NOT-TAKEN | T | ❌ | 2 | | 2 | 2 | TAKEN | N | ❌ | 1 | | 3 | 1 | NOT-TAKEN | T | ❌ | 2 | | 4 | 2 | TAKEN | T | ✓ | 3 | | 5 | 3 | TAKEN | N | ❌ | 2 | | 6 | 2 | TAKEN | T | ✓ | 3 | | 7 | 3 | TAKEN | N | ❌ | 2 | | 8 | 2 | TAKEN | N | ❌ | 1 | Total mispredictions: 6 (pixels 1, 2, 3, 5, 7, 8). The mixed pattern causes frequent state oscillation, preventing the 2-bit counter from stabilizing on any prediction. C12: ==c== Explanation: Scenario C (mixed pattern) causes the most mispredictions (6) because the 2-bit saturating counter cannot adapt to rapidly changing branch behavior. Scenario A has only 1 misprediction, Scenario B has 0. C13: ==64== Explanation: 32 × 32 × 4 bytes = 4096 bytes ÷ 64 bytes per block = 64 cache blocks. C14: ==a== Explanation: Tile working set is 6144 bytes (96 blocks), L1 cache is 8192 bytes (128 blocks). 6144 < 8192, so the tile fits with 2048 bytes (32 blocks) to spare for other data like the transformation matrix. C15: ==c== Explanation: Tile-based rendering ensures the working set fits in cache, converting capacity misses (data evicted due to cache size) into hits through temporal locality. The same pixels are accessed multiple times across different triangles within the tile, and they remain in cache throughout tile processing. C16: ==b== Explanation: Minimum prefetch distance = memory latency ÷ cycles per iteration = 40 ÷ 10 = 4 pixels. This ensures the prefetch has time to complete before the data is needed. 
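The misprediction counts in C09-C11 above can be cross-checked with a short standalone simulation of the 2-bit saturating counter (a sketch, independent of the renderer code; states 0-3 follow the encoding in the tables, with 0/1 predicting NOT-TAKEN and 2/3 predicting TAKEN):

```c
#include <stdio.h>

/* Sketch only: replay a branch outcome sequence through a 2-bit
 * saturating counter and count mispredictions. */
static int count_mispredicts(const int *taken, int n)
{
    int state = 1;                       /* Weakly-Not-Taken */
    int miss = 0;
    for (int i = 0; i < n; i++) {
        int predict_taken = (state >= 2);
        if (predict_taken != taken[i])
            miss++;
        /* Saturating update: count up on taken, down on not-taken */
        if (taken[i]) { if (state < 3) state++; }
        else          { if (state > 0) state--; }
    }
    return miss;
}

int main(void)
{
    int a[8] = {1, 1, 1, 1, 1, 1, 1, 1}; /* Scenario A */
    int b[8] = {0, 0, 0, 0, 0, 0, 0, 0}; /* Scenario B */
    int c[8] = {1, 0, 1, 1, 0, 1, 0, 0}; /* Scenario C */
    printf("A=%d B=%d C=%d\n",
           count_mispredicts(a, 8),
           count_mispredicts(b, 8),
           count_mispredicts(c, 8));
    return 0;
}
```

Running this prints 1, 0, and 6 mispredictions for Scenarios A, B, and C, matching C09-C11.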
C17: ==a== Explanation: If the prefetch distance is too small, the prefetch doesn't have enough time to complete before the data is needed, resulting in memory stalls waiting for the data to arrive. C18: ==b== Explanation: If the prefetch distance is too large, the prefetched data may be evicted from cache by intervening accesses before it is actually used, wasting the prefetch bandwidth and potentially causing a miss anyway. C19: ==511== Explanation: `light_dir` and `view_dir` are each loaded once (1 access) and then reused for the remaining 511 pixels (512 total pixels - 1 initial load = 511 reuses). This is temporal locality—the same data accessed repeatedly. C20: ==96== Explanation: Each row of 32 values requires 2 cache blocks (32 × 4 = 128 bytes, 128 ÷ 64 = 2 blocks). For 16 rows at 50% coverage and 3 normal components (nx, ny, nz): 16 rows × 2 blocks × 3 components = 96 cache misses. C21: ==b== Explanation: SoA layout is best for large models where only a subset of attributes are accessed per pass. When transforming positions only, SoA avoids loading unused normal and UV data that AoS would bring into cache. C22: ==b== Explanation: Prefetching is most effective for sequential access patterns with predictable stride, where the next memory location can be accurately predicted. Random access patterns cannot benefit from prefetching because the next address is unpredictable. Cache Optimization Summary for Software 3D Rendering: Key Insights from This Problem: 1. Data Layout Matters (Parts 2-3) - SoA vs AoS can mean 16× difference in useful bytes per cache miss - Choose layout based on access patterns, not convenience 2. Working Set Management (Part 5) - Tile-based rendering keeps working set in L1 cache - 56× reduction in effective memory traffic for framebuffer/depth 3. Branch Prediction Interacts with Workload (Part 4) - Depth testing branches depend on scene geometry - Mixed occlusion patterns cause significant misprediction overhead - Consider branchless alternatives for high-misprediction scenarios 4. Temporal vs Spatial Locality (Parts 3, 7) - Transformation matrices: temporal (same data, many vertices) - Vertex attributes: spatial (sequential data, one access each) - Optimize for the dominant pattern in each stage 5. Prefetching Requires Predictability (Part 6) - Effective for scanline traversal (predictable stride) - Less effective for per-triangle processing (variable coverage) --- ## Problem D: Context Switching and Task Isolation in a Minimal RISC-V Kernel This problem explores the implementation of context switching in a minimal RISC-V operating system kernel, with task isolation using Physical Memory Protection (PMP). Background: A cooperative multitasking kernel manages multiple tasks, each with its own execution context (registers, stack, program counter). Context switching saves the current task's state and restores another task's state, creating the illusion of concurrent execution. On RISC-V, M-mode firmware typically handles interrupts and delegates to S-mode for task scheduling. ``` Context Switch Overview: Task A (Running) Context Switch Task B (Ready) ┌─────────────────┐ ┌─────────────────┐ │ Registers │ │ Registers │ │ ┌─────────────┐ │ ┌─────────────────┐ │ ┌─────────────┐ │ │ │ ra, sp, ... │─┼────────>│ 1. Save A's │ │ │ ra, sp, ... │ │ │ │ s0-s11 │ │ │ registers │ │ │ s0-s11 │ │ │ │ mepc │ │ │ to stack │ │ │ mepc │ │ │ └─────────────┘ │ │ │ │ └─────────────┘ │ │ │ │ 2. 
Save A's SP │ │ │ │ Stack │ │ to TCB │ │ Stack │ │ ┌─────────────┐ │ │ │ │ ┌─────────────┐ │ │ │ saved regs │◀┼─────────│ 3. Load B's SP │──────────┼─│ saved regs │ │ │ │ saved mepc │ │ │ from TCB │ │ │ saved mepc │ │ │ │ local vars │ │ │ │ │ │ local vars │ │ │ └─────────────┘ │ │ 4. Restore B's │ │ └─────────────┘ │ └─────────────────┘ │ registers │ └─────────────────┘ │ from stack │ TCB A └─────────────────┘ TCB B ┌────────┐ ┌────────┐ │ sp ────┼──────> saved stack pointer saved sp <──────┼── sp │ │ events │ │ events │ │ stack │ │ stack │ └────────┘ └────────┘ TCB = Task Control Block (stores task metadata) ``` - [ ] Part 1: Task Control Block Structure The kernel maintains a Task Control Block (TCB) for each task. The TCB stores the minimal state needed for context switching. ```c #include <stdint.h> #include <stdatomic.h> #define TASK_ID_COUNT 8 /* Maximum number of tasks */ #define TASK_STACK_SIZE 1024 /* Stack size per task (words) */ typedef struct { /* * IMPORTANT: sp MUST be the first field in the struct. * The assembly context switch code relies on this layout: * lw sp, 0(a0) loads sp from task->sp when a0 points to task * sw sp, 0(t0) saves sp to task->sp when t0 points to task */ uint32_t sp; /* Saved stack pointer */ atomic_uint events; /* Pending event bitmap */ uint64_t runtime; /* CPU time consumed (cycles) */ uint32_t *stack_base; /* Bottom of allocated stack */ } task_t; /* Array of all task control blocks */ static task_t tasks[TASK_ID_COUNT]; /* Pointer to currently running task */ task_t *current_task; /* Bitmap of tasks ready to run (bit N = task N is ready) */ uint32_t tasks_ready; /* Bitmap of tasks enabled for scheduling */ uint32_t tasks_enabled; ``` Why must `sp` be the first field in `task_t`? __ D01 __ (Hint: Look at the assembly instructions `lw sp, 0(a0)` and `sw sp, 0(t0)`. What offset is used?) a. RISC-V calling convention requires it b. Assembly code uses fixed offset 0 to access the stack pointer directly c. The compiler aligns the first field to cache line boundaries d. It reduces the size of the structure - [ ] Part 2: Stack Frame Layout During a context switch, callee-saved registers (`s0`-`s11`) and the program counter (`mepc`) must be preserved. The kernel saves these below the current stack pointer. ``` Stack Frame Layout After Context Save: Higher addresses ┌─────────────────────────┐ ▲ │ ... (previous frames) │ │ ├─────────────────────────┤ │ │ Local variables │ │ ├─────────────────────────┤ │ sp ───> │ (current stack top) │ ◄── Original SP │ ├─────────────────────────┤ │ │ s0 (frame pointer) │ offset: -1*4 = -4 │ │ s1 │ offset: -2*4 = -8 │ │ s2 │ offset: -3*4 = -12 │ │ s3 │ offset: -4*4 = -16 │ │ s4 │ offset: -5*4 = -20 │ │ s5 │ offset: -6*4 = -24 │ │ s6 │ offset: -7*4 = -28 │ │ s7 │ offset: -8*4 = -32 │ │ s8 │ offset: -9*4 = -36 │ │ s9 │ offset: -10*4 = -40 │ │ s10 │ offset: -11*4 = -44 │ │ s11 │ offset: -12*4 = -48 │ │ mepc (return address) │ offset: -13*4 = -52 │ ├─────────────────────────┤ ▼ sp' ──> │ (new stack top) │ ◄── SP after save Lower addresses │ ... (growth space) │ └─────────────────────────┘ Total saved: 12 callee-saved registers + 1 mepc = ? bytes ``` How many bytes does the stack pointer decrease when saving context? __ D02 __ (Hint: Count: 12 callee-saved registers (s0-s11) + 1 mepc = ? values × 4 bytes each) a. 48 bytes (12 registers × 4) b. 52 bytes (13 values × 4) c. 56 bytes (14 values × 4) d. 
64 bytes (16 values × 4) - [ ] Part 3: Context Switch Implementation The context switch is triggered when the scheduler determines a different task should run. This implementation assumes the switch occurs within an interrupt handler (timer or software interrupt), but the core register save/restore logic is independent of the interrupt source. ``` Context Switch Flow (within interrupt handler): ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ Interrupt │ │ Scheduler │ │ Restore │ │ Entry │────>│ Decision │────>│ & Return │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ ▼ ▼ ▼ Save caller-saved Find highest Restore callee-saved regs to sys stack priority ready regs from new stack Switch to sys stack task (next_task) Load new mepc If same, skip Switch stacks Else: full switch Execute mret ``` Complete the assembly implementation: ```assembly # Scheduler: Find next task to run # Returns: a0 = pointer to highest priority ready task .global next_sched_task next_sched_task: la t0, tasks_ready lw t1, 0(t0) # t1 = tasks_ready bitmap la t0, tasks_enabled lw t2, 0(t0) # t2 = tasks_enabled bitmap and t1, t1, t2 # t1 = ready AND enabled # Find lowest set bit (lower bit = higher priority) # __ffs returns bit index of lowest set bit mv a0, t1 call __ffs # a0 = highest priority task ID # Convert task ID to task_t pointer # tasks array base + (id * sizeof(task_t)) la t0, tasks li t1, 24 # sizeof(task_t) = 24 bytes mul t1, a0, t1 add a0, t0, t1 # a0 = &tasks[id] ret #-------------------------------------------------------------------- # Context Switch: Save current task, restore next task # Called from interrupt handler context # Precondition: mscratch contains current task's user stack pointer #-------------------------------------------------------------------- .global __switch_task __switch_task: # Get next task to run jal next_sched_task # a0 = next task pointer # Get current task pointer la t1, current_task lw t0, 0(t1) # t0 = current_task # Check if switch is needed beq a0, t0, __no_switch # Same task, skip context switch # Update current_task pointer sw a0, 0(t1) # current_task = next task # Save system stack pointer (we're on interrupt stack) mv t3, sp # Switch to current task's user stack (saved in mscratch) csrr sp, mscratch # Read program counter to resume (saved by interrupt entry) csrr t5, mepc #--- Save callee-saved registers to current task's stack --- sw s0, -1*4(sp) sw s1, -2*4(sp) sw s2, -3*4(sp) sw s3, -4*4(sp) sw s4, -5*4(sp) sw s5, -6*4(sp) sw s6, -7*4(sp) sw s7, -8*4(sp) sw s8, -9*4(sp) sw s9, -10*4(sp) sw s10, -11*4(sp) sw s11, -12*4(sp) # Save program counter (mepc) to stack __D03__ # Store t5 (mepc) at offset -13*4 # Adjust stack pointer to account for saved context addi sp, sp, __D04__ # Decrease SP by 13*4 = 52 bytes # Save adjusted stack pointer to current task's TCB sw sp, 0(t0) # t0->sp = sp (offset 0, first field) #--- Load next task's context --- # Load next task's saved stack pointer lw sp, 0(a0) # sp = next_task->sp # Adjust stack pointer back (point to saved registers) addi sp, sp, __D05__ # Increase SP by 13*4 = 52 bytes # Restore program counter from next task's stack lw t0, __D06__ # Load saved mepc csrw mepc, t0 # Set return address #--- Restore callee-saved registers from next task's stack --- lw s0, -1*4(sp) lw s1, -2*4(sp) lw s2, -3*4(sp) lw s3, -4*4(sp) lw s4, -5*4(sp) lw s5, -6*4(sp) lw s6, -7*4(sp) lw s7, -8*4(sp) lw s8, -9*4(sp) lw s9, -10*4(sp) lw s10, -11*4(sp) lw s11, -12*4(sp) # Save new task's user SP to mscratch, restore system stack csrw 
__D07__, __D08__ # mscratch = sp (user stack) mv sp, t3 # Restore system/interrupt stack __no_switch: # Return to interrupt exit handler j __irq_exit ``` Fill in * D03 = ? (Hint: Which instruction saves a register to memory? Refer to the stack frame diagram for the correct offset) * D04 = ? (Hint: Calculate the total bytes needed for the saved context. Stack grows downward, so use a negative value) * D05 = ? (Hint: This reverses D04 to restore the original stack pointer position) * D06 = ? (Hint: Load mepc from the same relative position where it was stored in the stack frame) * D07 = ? (Hint: Which CSR stores the user stack pointer for swap on interrupt?) * D08 = ? (Hint: What register currently holds the new task's user stack pointer?) - [ ] Part 4: Task Initialization Before a task can be scheduled, its stack must be initialized with a valid context frame that the context switch code can restore. ```c typedef void (*task_entry_t)(void); /* Task entry point function type */ /* Initialize a task's stack for first-time scheduling */ void task_init(int task_id, task_entry_t entry, uint32_t *stack_top) { task_t *task = &tasks[task_id]; /* Stack grows downward; start from top */ uint32_t *sp = stack_top; /* * Pre-populate the stack frame as if the task had been * context-switched out. When restored, it will "return" * to the entry point. * * Stack layout (same as context switch save order): * sp[-1] = s0 * sp[-2] = s1 * ... * sp[-12] = s11 * sp[-13] = mepc (entry point address) */ /* Initialize callee-saved registers to zero */ for (int i = 1; i <= 12; i++) { sp[-i] = 0; /* s0-s11 = 0 */ } /* Set initial program counter to task entry point */ sp[__D09__] = (uint32_t)entry; /* Adjust stack pointer to point below saved context */ sp -= 13; /* Save the prepared stack pointer in TCB */ task->sp = (uint32_t)sp; task->stack_base = stack_top - TASK_STACK_SIZE; task->runtime = 0; atomic_store(&task->events, 0); } ``` Fill in * D09 = ? (Hint: The entry point must be placed where mepc would be in the stack frame. Based on the loop above, what index corresponds to the last saved item?) - [ ] Part 5: Understanding the mscratch CSR The `mscratch` CSR plays a critical role in interrupt handling and context switching. It provides a scratch register for M-mode that persists across privilege level changes. ``` mscratch Usage Pattern: User/Supervisor Mode Machine Mode (Running task code) (Interrupt handler) ┌─────────────────┐ ┌─────────────────┐ │ sp = user stack │ │ sp = sys stack │ │ │ Interrupt │ │ │ mscratch = │ ──────────────> │ mscratch = │ │ sys stack │ │ user stack │ └─────────────────┘ └─────────────────┘ │ ┌───────────────────────────────────────┘ │ csrrw sp, mscratch, sp │ (atomically swap sp and mscratch) ▼ Before: sp=user_stack, mscratch=sys_stack After: sp=sys_stack, mscratch=user_stack ``` What would happen if the kernel mistakenly enabled nested interrupts without modifying the handler to save `mscratch`? __ D10 __ (Hint: The second interrupt handler would also use `csrrw sp, mscratch, sp`. What happens to the original user stack pointer?) a. The second interrupt is safely queued and handled after mret b. The second interrupt is lost c. mscratch is corrupted, causing stack corruption when the first handler returns d. The processor resets - [ ] Part 6: Task Memory Isolation with PMP In a secure embedded system, tasks should be isolated from each other. The RISC-V Physical Memory Protection (PMP) mechanism can enforce this at the hardware level. 
When switching tasks, the kernel updates PMP configuration to grant the new task access only to its own memory regions. ``` PMP-Based Task Isolation: Physical Memory Layout: ┌──────────────────────────────────────────────────────────────┐ │ 0x8000_0000 ┌────────────────┐ │ │ │ Kernel Code/ │ ◄── PMP0: RX (all tasks) │ │ │ Data (64 KiB) │ Locked, shared │ │ 0x8001_0000 ├────────────────┤ │ │ │ Task 0 Stack │ ◄── PMP1: RW (task 0 only) │ │ │ (4 KiB) │ Updated on context switch │ │ 0x8002_0000 ├────────────────┤ │ │ │ Task 1 Stack │ ◄── PMP1: RW (task 1 only) │ │ │ (4 KiB) │ Updated on context switch │ │ 0x8003_0000 ├────────────────┤ │ │ │ Task 2 Stack │ ◄── (additional tasks...) │ │ │ ... │ │ │ └────────────────┘ │ └──────────────────────────────────────────────────────────────┘ PMP Configuration Strategy: ┌──────────────────────────────────────────────────────────────┐ │ Entry │ Region │ Config │ Purpose │ ├───────┼─────────────────────┼───────────────┼────────────────┤ │ pmp0 │ Kernel (0x8000_0000 │ L=1, A=NAPOT │ Protect kernel │ │ │ to 0x8000_FFFF) │ R=1,W=0,X=1 │ from tasks │ ├───────┼─────────────────────┼───────────────┼────────────────┤ │ pmp1 │ Current task stack │ L=0, A=NAPOT │ Task's private │ │ │ (4 KiB aligned) │ R=1,W=1,X=0 │ stack region │ ├───────┼─────────────────────┼───────────────┼────────────────┤ │ pmp2 │ Shared data region │ L=0, A=TOR │ IPC buffers │ │ │ │ R=1,W=1,X=0 │ │ └───────┴─────────────────────┴───────────────┴────────────────┘ On Context Switch: 1. Save current task's context (registers, PC) 2. Update pmp1 address to point to new task's stack region 3. Restore new task's context 4. Execute mret (returns to U/S-mode with new PMP config active) ``` Complete the PMP update code for context switch: ```c #include <stdint.h> /* PMP address register encodes top bits of region address * For NAPOT (Naturally Aligned Power-Of-Two): * pmpaddr = (base + (size/2 - 1)) >> 2 * For 4 KiB region: pmpaddr = (base + 0x7FF) >> 2 */ #define PMP_NAPOT_4K(base) (((base) + 0x7FF) >> 2) /* pmpcfg bit fields */ #define PMP_R (1 << 0) /* Read permission */ #define PMP_W (1 << 1) /* Write permission */ #define PMP_X (1 << 2) /* Execute permission */ #define PMP_A_NAPOT (3 << 3) /* NAPOT addressing mode */ #define PMP_L (1 << 7) /* Lock bit */ /* CSR access macros */ #define write_csr(csr, val) \ __asm__ volatile ("csrw " #csr ", %0" :: "r"(val) : "memory") #define read_csr(csr) ({ \ uint32_t __val; \ __asm__ volatile ("csrr %0, " #csr : "=r"(__val)); \ __val; }) /* Update PMP entry 1 to protect new task's stack region */ void pmp_update_task_region(uint32_t stack_base) { /* * Configure pmp1 for the task's 4 KiB stack region: * - NAPOT mode (naturally aligned power-of-two) * - Read and Write permission (no Execute) * - Not locked (so M-mode can update it) */ /* Calculate NAPOT-encoded address for 4 KiB region */ uint32_t pmpaddr = PMP_NAPOT_4K(stack_base); /* Write to pmpaddr1 CSR */ write_csr(pmpaddr1, pmpaddr); /* * pmpcfg0 contains config for pmp0-pmp3 (8 bits each): * Bits [7:0] = pmp0cfg (kernel, locked) * Bits [15:8] = pmp1cfg (task stack, update this) * Bits [23:16] = pmp2cfg (shared region) * Bits [31:24] = pmp3cfg (unused) */ /* Read current pmpcfg0 */ uint32_t cfg = read_csr(pmpcfg0); /* Clear pmp1cfg bits [15:8] and set new config */ cfg &= ~(0xFF << 8); /* Clear bits [15:8] */ cfg |= (__D11__) << 8; /* Set RW + NAPOT mode */ /* Write updated configuration */ write_csr(pmpcfg0, cfg); /* Fence to ensure PMP changes take effect before mret */ __asm__ volatile ("fence rw, rw" 
::: "memory"); } ``` What value should be written to pmp1cfg (D11) to grant read and write access using NAPOT addressing? __ D11 __ (Hint: We need R, W, and NAPOT mode. Should the lock bit be set if M-mode needs to update this entry?) a. `PMP_R | PMP_W` (0x03) b. `PMP_R | PMP_W | PMP_A_NAPOT` (0x1B) c. `PMP_R | PMP_W | PMP_X | PMP_A_NAPOT` (0x1F) d. `PMP_R | PMP_W | PMP_A_NAPOT | PMP_L` (0x9B) - [ ] Part 7: Complete Context Switch with PMP Integrate PMP updates into the context switch routine: ```c /* Extended context switch with PMP update */ void switch_task_with_isolation(task_t *next_task) { task_t *current = current_task; if (current == next_task) return; /* No switch needed */ /* Update PMP to protect next task's stack region */ pmp_update_task_region((uint32_t)next_task->__D12__); /* Perform register context switch */ /* (This is done in assembly, shown in Part 3) */ current_task = next_task; } ``` Fill in * D12 = ? (Hint: The PMP should protect the task's stack region. Which field stores the stack's base address?) - [ ] Part 8: Interrupt Return and Privilege Transition After the context switch completes, the `mret` instruction returns to the new task. This instruction atomically: 1. Sets PC to the value in `mepc` 2. Sets privilege mode to the value in `mstatus.MPP` 3. Restores interrupt enable state from `mstatus.MPIE` ``` mret Execution Flow: Machine Mode (Interrupt Handler) User/Supervisor Mode ┌─────────────────────────────┐ ┌─────────────────────┐ │ mepc = 0x8002_1000 │ │ │ │ mstatus.MPP = 01 (S-mode) │ mret │ PC = 0x8002_1000 │ │ mstatus.MPIE = 1 │ ──────> │ Mode = Supervisor │ │ │ │ Interrupts enabled │ └─────────────────────────────┘ └─────────────────────┘ mstatus bit fields for privilege: ┌────────────────────────────────────────────────────────────┐ │ Bit [12:11] │ MPP │ Previous privilege mode │ │ Bit [7] │ MPIE │ Previous interrupt enable │ │ Bit [3] │ MIE │ Current M-mode interrupt enable │ └────────────────────────────────────────────────────────────┘ MPP encoding: 00=User, 01=Supervisor, 11=Machine ``` If a task was interrupted while running in Supervisor mode, what should `mstatus.MPP` contain when `mret` is executed to resume that task? __ D13 __ (Hint: MPP stores the Previous Privilege mode. What mode was the task running in before the interrupt?) a. 0b00 (User mode) b. 0b01 (Supervisor mode) c. 0b10 (Reserved) d. 0b11 (Machine mode) - [ ] Part 9: Voluntary Context Switch (Yield) Tasks can voluntarily give up the CPU by calling a yield function, which triggers a context switch without waiting for a timer interrupt. 
```c /* System call numbers */ #define SYSCALL_YIELD 0 /* Trigger ecall to request kernel service */ static inline void syscall(int num) { register int a7 __asm__("a7") = num; __asm__ volatile ("ecall" :: "r"(a7) : "memory"); } /* Voluntary yield - give up CPU to other tasks */ void task_yield(void) { syscall(SYSCALL_YIELD); } ``` The corresponding M-mode trap handler: ```assembly # Trap handler for ecall from S/U-mode .global trap_handler trap_handler: # Save minimal context csrrw sp, mscratch, sp # Swap to system stack # Check trap cause csrr t0, mcause li t1, 9 # Environment call from S-mode beq t0, t1, handle_syscall li t1, 8 # Environment call from U-mode beq t0, t1, handle_syscall j handle_other_trap handle_syscall: # Check syscall number (in a7) li t0, SYSCALL_YIELD bne a7, t0, handle_other_syscall # Yield: Advance mepc past ecall instruction, then switch csrr t0, mepc addi t0, t0, __D14__ # Skip past ecall instruction csrw mepc, t0 j __switch_task # Perform context switch handle_other_syscall: # Handle other system calls... j trap_return handle_other_trap: # Handle exceptions, interrupts... j trap_return trap_return: csrrw sp, mscratch, sp # Restore user stack mret ``` Fill in * D14 = ? (Hint: The `ecall` instruction is a standard 32-bit RISC-V instruction. How many bytes?) - [ ] Part 10: Analysis Questions Consider a system with 4 tasks where Task 0 has highest priority and Task 3 has lowest. The scheduler always runs the highest-priority ready task. ``` Task State Diagram: ┌─────────────────────────────────────────┐ │ tasks_ready bitmap │ │ Bit 3 Bit 2 Bit 1 Bit 0 │ │ │ │ │ │ │ │ Task3 Task2 Task1 Task0 (priority) │ │ (low) (high) │ └─────────────────────────────────────────┘ Priority Convention: Lower bit position = Higher priority Task 0 (bit 0) = highest priority Task 3 (bit 3) = lowest priority Example: tasks_ready = 0b1010 Task 3 (bit 3) and Task 1 (bit 1) are ready Scheduler picks Task 1 (lower bit = higher priority) The scheduler uses __ffs (find first set) to get the lowest set bit, which corresponds to the highest priority ready task. ``` Given: `tasks_ready = 0b1101` and `tasks_enabled = 0b1111` Which task will the scheduler select? __ D15 __ (Hint: `__ffs()` finds the lowest set bit. In 0b1101, which bit position is the lowest that's set?) a. Task 0 (lowest bit set, highest priority) b. Task 2 (middle priority ready) c. Task 3 (highest bit set in tasks_ready) d. No task (none ready) If Task 3 is currently running and Task 0 becomes ready (event arrives), what happens at the next scheduling point? __ D16 __ (Hint: This is a priority-based scheduler. When a higher-priority task becomes ready...) a. Task 3 continues (no preemption) b. Task 0 preempts Task 3 immediately c. Tasks 0 and 3 run in round-robin d. The system deadlocks ### ✅ Solution to Problem D: D01: ==b== Explanation: The assembly code accesses the stack pointer using a fixed offset of 0 from the task pointer: `lw sp, 0(a0)` and `sw sp, 0(t0)`. This only works if `sp` is the first field in the structure (offset 0). D02: ==b== Explanation: The context save stores 12 callee-saved registers (`s0`-`s11`) plus the program counter (`mepc`), totaling 13 values × 4 bytes = 52 bytes. D03: ==sw t5, -13*4(sp)== Explanation: The program counter (in `t5` from `csrr t5, mepc`) must be saved at offset -13×4 = -52 from the original stack pointer, following the 12 callee-saved registers. D04: ==-13*4== Explanation: The stack pointer decreases by 52 bytes (13 words × 4 bytes) to account for the saved context. 
Alternative acceptable answers: `-52`. D05: ==13*4== Explanation: When loading the next task's context, the stack pointer increases by 52 bytes to point back to the saved registers. Alternative acceptable answers: `52`. D06: ==-13*4(sp)== Explanation: The saved `mepc` is at offset -52 from the adjusted stack pointer (after adding 13*4), which is the same relative position as where it was stored. D07: ==mscratch== Explanation: The user/task stack pointer must be saved to `mscratch` so it can be retrieved on the next interrupt. D08: ==sp== Explanation: The current stack pointer (pointing to the new task's user stack) is written to `mscratch`. D09: ==-13== Explanation: The entry point (initial `mepc`) is stored at `sp[-13]`, which corresponds to offset -13×4 = -52 bytes, matching the context switch save location. D10: ==c== Explanation: If nested interrupts were enabled without saving `mscratch`, the second interrupt handler would execute `csrrw sp, mscratch, sp`, swapping the system stack pointer with whatever is in `mscratch` (which contains the first handler's user stack). When the first handler returns with `mret`, it would use the corrupted stack, causing undefined behavior. Proper nested interrupt handling requires saving and restoring `mscratch`. D11: ==b== Explanation: `PMP_R | PMP_W | PMP_A_NAPOT` = 0x01 | 0x02 | 0x18 = 0x1B. This enables read and write permissions with NAPOT addressing mode. The lock bit (PMP_L) should NOT be set, as M-mode needs to update this entry on each context switch. D12: ==stack_base== Explanation: The PMP region should protect the task's stack, which starts at `stack_base`. The `stack_base` field in `task_t` stores the bottom address of the allocated stack region. D13: ==b== Explanation: If the task was running in Supervisor mode when interrupted, `mstatus.MPP` should contain 0b01 so that `mret` returns to Supervisor mode. The trap entry hardware automatically saves the previous privilege mode to MPP. D14: ==4== Explanation: The `ecall` instruction is 4 bytes (standard RISC-V instruction width). To prevent the `ecall` from re-executing after returning, `mepc` must be advanced by 4 to point to the next instruction. D15: ==a== Explanation: With `tasks_ready = 0b1101`, bits 0, 2, and 3 are set. The scheduler uses `__ffs()` (find first set) which returns the lowest bit position. In this kernel design, lower bit = higher priority, so `__ffs(0b1101) = 0` selects Task 0 (highest priority). D16: ==b== Explanation: In a priority-based preemptive scheduler, when a higher-priority task (Task 0) becomes ready, it will preempt the lower-priority task (Task 3) at the next scheduling point (typically the next timer interrupt or when Task 3 yields). The scheduler always runs the highest-priority ready task. 
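A minimal C sketch of the selection logic behind D15 and D16, using GCC's `__builtin_ctz` as a stand-in for the kernel's `__ffs` helper (an illustration under those assumptions, not the kernel's actual C code):

```c
#include <stdint.h>

/* Sketch only: C equivalent of next_sched_task from Part 3.
 * __builtin_ctz returns the index of the lowest set bit
 * (undefined for 0, hence the explicit check). */
extern uint32_t tasks_ready;    /* bit N set => task N has pending work */
extern uint32_t tasks_enabled;  /* bit N set => task N may be scheduled */

int next_task_id(void)
{
    uint32_t runnable = tasks_ready & tasks_enabled;
    if (runnable == 0)
        return -1;                        /* nothing ready: idle */
    return (int)__builtin_ctz(runnable);  /* lowest bit = highest priority */
}
```

With `tasks_ready = 0b1101` and `tasks_enabled = 0b1111`, `runnable = 0b1101` and the lowest set bit is bit 0, so Task 0 is selected (D15); as soon as a higher-priority bit becomes set, the next scheduling point picks it over the currently running task (D16).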
Context Switch Timing Analysis: ``` Cycle-level breakdown of context switch: ┌─────────────────────────────────────────────────────────────────┐ │ Operation │ Approximate Cycles │ ├────────────────────────────────────────┼────────────────────────┤ │ Scheduler decision (next_sched_task) │ 10-20 cycles │ │ Save 13 registers to stack │ 13-26 cycles │ │ Save SP to TCB │ 1-2 cycles │ │ Load SP from new TCB │ 1-2 cycles │ │ Restore 13 registers from stack │ 13-26 cycles │ │ PMP update (if used) │ 5-10 cycles │ │ CSR operations (mepc, mscratch) │ 4-8 cycles │ ├────────────────────────────────────────┼────────────────────────┤ │ Total context switch overhead │ ~50-100 cycles │ └─────────────────────────────────────────────────────────────────┘ For a 100 MHz processor: ~0.5-1.0 μs per context switch ``` Key Design Decisions: 1. Callee-saved registers only: Caller-saved registers (`t0`-`t6`, `a0`-`a7`, `ra`) are already saved by the interrupt entry code or are not preserved across function calls, reducing context switch overhead. 2. mscratch for stack swap: Using `mscratch` enables atomic stack switching without corrupting either stack, critical for safe interrupt handling. 3. TCB sp at offset 0: Simplifies assembly code by allowing direct access without offset calculation, reducing instruction count and potential bugs. 4. PMP for isolation: Hardware-enforced memory protection prevents tasks from corrupting each other's stacks or accessing kernel memory, improving system reliability without MMU overhead. ---
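As a closing illustration, a hedged sketch of how the kernel pieces above might be wired together from C. It assumes the declarations from Parts 1, 4, and 9 (`task_init`, `task_yield`, `TASK_STACK_SIZE`, `tasks_ready`, `tasks_enabled`); the names `task_blinky` and `kernel_start` are hypothetical, and the boot path that performs the first `mret` into a task is not shown.

```c
/* Sketch only: wiring task_init (Part 4) and task_yield (Part 9) together. */
static uint32_t task0_stack[TASK_STACK_SIZE];

static void task_blinky(void)          /* hypothetical task body */
{
    for (;;) {
        /* ... perform one slice of work ... */
        task_yield();                  /* cooperative hand-off: ecall -> __switch_task */
    }
}

void kernel_start(void)                /* hypothetical setup routine */
{
    /* Stacks grow downward, so pass the top of the allocated region */
    task_init(0, task_blinky, &task0_stack[TASK_STACK_SIZE]);
    tasks_enabled |= 1u << 0;          /* allow task 0 to be scheduled */
    tasks_ready   |= 1u << 0;          /* mark task 0 as ready to run  */
    /* The first mret into task 0 happens from the boot/trap path (not shown). */
}
```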