# Implement A (atomic) extension for MyCPU
> 江冠緯
> [GitHub](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/tree/A-Extension/4-soc)
> [video](https://youtu.be/QZtxhdd_oi0)

## Goals
* Extend the provided pipelined RISC-V design (`4-soc`) to support the **A** (Atomic) extension. The A extension defines atomic read-modify-write operations for synchronization among multiple harts sharing a memory space. It includes both Load-Reserved/Store-Conditional (LR/SC) instructions and atomic fetch-and-op instructions, each supporting multiple memory-ordering semantics (unordered, acquire, release, and sequentially consistent).
* Use RISCOF to validate the implementation and ensure full compatibility with the official [riscv-arch-test](https://github.com/riscv-non-isa/riscv-arch-test) suite for the A extension.
* Confirm that all added atomic operations function correctly within the existing pipelined microarchitecture. You must prepare multi-threaded or concurrent test programs that exercise LR/SC behavior, contention, and memory-ordering constraints to validate correctness.

See also:
* [Implement A (atomic) extension for srv32](https://hackmd.io/@P31nISeeTvau1gKumIykRA/SJ3LH14aF)

## Background Study

:::danger
Write complete sentences, along with your thoughts.
:::

### Watch class videos on Synchronization/Multi-Harts/Cache Coherence

#### **MIT 6.004 - Synchronization**
* **Source**
    * [MIT 6.004 L21: Synchronization](https://www.youtube.com/watch?v=r8H_af_dCF0)
* **Thread-level Parallelism**
    * **Definition**:
        * **Thread-level Parallelism (TLP)** refers to improving performance by dividing a computation into multiple threads of execution.
        * These threads may be **independent**, in which case they compete for shared resources such as memory and I/O devices,
        * or **cooperating**, where threads must communicate and synchronize with each other to complete a task correctly.
- **Communication Model**
    There are two main communication models for thread-level parallelism:
    - **Shared Memory**
    ![Screenshot 2025-12-22 165647](https://hackmd.io/_uploads/HyaQD6LmZg.png)
        * In the **shared memory model**, all threads operate within a single address space and communicate implicitly through load and store operations to shared variables. This model is commonly used in multicore systems.
    - **Message Passing**
    ![Screenshot 2025-12-22 165635](https://hackmd.io/_uploads/HkOrPTL7Wg.png)
        * The **message passing model** uses separate address spaces for each thread or process. Communication is explicit and achieved by sending and receiving messages, typically over a network, which is common in distributed systems.
- **Scenarios**
    - **Fork and Join**
    ![Screenshot 2025-12-22 171106](https://hackmd.io/_uploads/S1BPDaLQZx.png)
        * A parallel process creates multiple threads and must wait until all of them have completed before joining and continuing execution.
    - **Producer and Consumer**
    ![Screenshot 2025-12-22 171143](https://hackmd.io/_uploads/ry1FvpU7Zg.png)
        * Producers generate data while consumers process it, and consumers must wait until data becomes available.
    - **Mutual Exclusion**
        - The OS needs to make sure only one process or thread accesses a critical hardware or software resource at any given time.
- **Multithreaded-safe programming**
    - **Timesharing**
        * On a uniprocessor system, **timesharing** is used, where the operating system switches between threads, giving each a slice of execution time.
    - **Thread-safe**
        * A program is considered thread-safe if it behaves correctly whether it runs on a single processor or on multiple processors simultaneously.
    - **Precedence Constraints**
    ![Screenshot 2025-12-22 173856](https://hackmd.io/_uploads/Bkp0DaLQZg.png)
        - Describe ordering relationships between operations in parallel programs.
        A constraint such as "**a precedes b**" means that operation `a` must complete before operation `b` is allowed to execute. These constraints are essential for ensuring correctness when multiple threads execute concurrently.
* **FIFO Buffer**
![Screenshot 2025-12-22 174113](https://hackmd.io/_uploads/Syzeu68Xbl.png)
    - A **FIFO buffer** is used to relax the strict precedence constraints between a producer and a consumer. Instead of forcing the producer and consumer to alternate execution, the buffer allows the producer to run up to **N values ahead** of the consumer. This increases concurrency and improves overall throughput.
![Screenshot 2025-12-22 174040](https://hackmd.io/_uploads/Bk5WOp8XZe.png)
    - When the FIFO buffer is **empty**, only the producer is allowed to send data, since there is nothing available for the consumer to receive.
    - Conversely, when the buffer is **full**, only the consumer can receive data, preventing the producer from overwriting existing entries. These conditions enforce correct synchronization while still allowing maximum parallelism.
    - In shared-memory systems, a FIFO buffer is commonly implemented as a **ring buffer** (circular buffer). The ring buffer uses fixed-size storage with head and tail pointers that wrap around, enabling efficient enqueue and dequeue operations without moving data in memory.
* **Semaphore**
    - **Initialize shared data**
    ```c
    semaphore s = k; // Init s to k
    ```
    - A semaphore is initialized as shared data with a non-negative integer value; for example, initializing `s = k` indicates that k identical resources are available.
    - **Operations**
        - `wait(semaphore s)`
            - The `wait(s)` operation blocks execution until the semaphore value is greater than zero, then atomically decrements it by one.
        - `signal(semaphore s)`
            - The `signal(s)` operation atomically increments the semaphore value, potentially waking a waiting thread.
These operations allow semaphores to enforce precedence constraints, ensuring that certain actions occur before others.
- **Precedence Constraint**
![Screenshot 2025-12-22 195706](https://hackmd.io/_uploads/Sky9dp8mWe.png)
    * When all resources are occupied, a user's `wait` can only complete after some other user releases a resource with `signal`; the `signal` must precede the completion of the corresponding `wait`.
- **Use Scenario**
    - Resource allocation in shared memory
    ```c
    In shared memory:
        semaphore s = K; // K resources
    Using resources:
        wait(s);   // Allocate a resource
        ...
        signal(s); // return it to pool
    ```
    * A semaphore initialized to the number of available resources allows threads to safely allocate and release resources by surrounding the usage with `wait` and `signal` operations. This guarantees that no more than the available number of resources are used concurrently.
    - Implement a ring buffer and track the buffer's space with semaphores in the producer and consumer
    ```c
    /* SHARED MEMORY: */
    char buf[N]; /* The buffer */
    int in = 0, out = 0;
    semaphore chars = 0, spaces = N;

    /* PRODUCER: */
    void send(char c) {
        wait(spaces);
        buf[in] = c;
        in = (in + 1) % N;
        signal(chars);
    }

    /* CONSUMER: */
    char rcv() {
        char c;
        wait(chars);
        c = buf[out];
        out = (out + 1) % N;
        signal(spaces);
        return c;
    }
    ```
    * In this design, one semaphore tracks the number of filled slots (`chars`), while another tracks the number of available slots (`spaces`).
    * The producer waits for available space before inserting data and signals when new data is produced.
    * The consumer waits for available data before removing it and signals when space is freed.
    * This approach ensures correct synchronization without busy waiting.
    - Mutual exclusion in a **Critical Section**
    ```c
    semaphore lock = 1;
    void debit(int account, int amount) {
        wait(lock);   // Wait for exclusive access
        t = balance[account];
        balance[account] = t - amount;
        signal(lock); // Finished with lock
    }
    ```
    - Another important application of semaphores is **mutual exclusion** in a critical section.
    - By initializing a semaphore to 1, it can act as a lock that ensures only one thread enters the critical section at a time.
    - The code between `wait(lock)` and `signal(lock)` defines the critical section, preventing overlapping execution across threads.
    - When designing such locks, **lock granularity** must be carefully considered, especially in systems with many concurrent users, to balance correctness and performance.
- **Implementation of `wait` and `signal`**
    - Use **atomic read-modify-write** operations to guarantee correctness under concurrency.
    - Alternatively, implement them using **system calls**, which is **only** effective **on uniprocessor systems** and therefore not suitable for multiprocessor designs such as those in the term project.
* **Dark Side**
    - **Deadlock**
        * A situation in which a set of threads is permanently blocked because each is waiting for resources held by others. Deadlock can only occur when all four of the following conditions are simultaneously satisfied:
            - **Mutual Exclusion** - Only one thread can hold a resource at a time.
            - **Hold-and-wait** - A thread holding at least one resource waits to acquire additional resources.
            - **No preemption** - Resources cannot be forcibly taken away from a thread once they have been allocated.
            - **Circular wait** - A cycle of threads exists such that each thread waits for a resource held by the next thread in the cycle.
    - **Dealing with Deadlock**
        - Assign a shared priority (a global order) to the locks and force threads to acquire lower-priority locks first. A higher-priority lock can then only be taken by a thread that already holds the required lower-priority locks, which breaks the circular-wait condition.

#### **[CS61C SU20] Lecture 23 - Multithreading Issues, Cache Coherency**
* **Source**
    * [CS 61C Departmental Archive - Lecture 23](https://www.youtube.com/playlist?list=PLDoI-XvXO0aqgj_8Og51XoE7iyAu5yEWZ)
* **Recap**
    * Sequential software is slow.
    To speed up, SIMD and MIMD are the paths to better performance.
* Atomic Swap in RISC-V
```risc-v=
amoswap.w.aq rd, rs2, (rs1)
amoswap.w.rl rd, rs2, (rs1)
```
- Swaps the memory value at M[R[rs1]] with the register value in R[rs2].
- Atomic because this is done in one instruction.
* Parallel Section
    1. Shared iteration of a loop
    2. Sections are executed by separate threads
    3. Serializes execution of the **un-parallelizable** section
* Parallel program example in C using **OpenMP** (`omp`)
```c=
#include <stdio.h>
#include <omp.h>

int main() {
    int nthreads, tid;
    /* Fork team of threads with private var tid */
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num(); /* get thread id */
        printf("Hello World from thread = %d\n", tid);
        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master and terminate */
}
```
* **Multithreading Example - Matrix Multiplication**
    * **Naive Implementation**
    ```c=
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
    ```
    * **Drawback**: Blindly marches through memory, so the accesses to matrix B do not exploit the cache's spatial locality.
    * **Using OpenMP**
    ```c=
    #pragma omp parallel for private(tmp, i, j, k)
    for (i = 0; i < Mdim; i++) {
        for (j = 0; j < Ndim; j++) {
            tmp = 0.0;
            for (k = 0; k < Pdim; k++) {
                /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
                tmp += *(A + (i * Pdim + k)) * *(B + (k * Ndim + j));
            }
            *(C + (i * Ndim + j)) = tmp;
        }
    }
    ```
    * The outer loop over `i` is spread across `Mdim` threads, and each thread computes one row of C. With one thread per row, the parallel execution time drops from $O(N^3)$ to $O(N^2)$, although the total work remains $O(N^3)$.
    * Race conditions are avoided since every thread works on a different row.
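The row-partitioning argument above can be checked on the host without OpenMP: give each worker a disjoint row range and run the workers in any order. The sketch below is illustrative only (`matmul_rows`, `N`, and `NTHREADS` are assumed names, not from the lecture); since no two workers write the same `C[i][j]`, the result is the same regardless of execution order.

```c
#include <assert.h>

#define N 4        /* illustrative matrix size */
#define NTHREADS 2 /* illustrative worker count */

/* Worker t owns the disjoint row range [t*N/NTHREADS, (t+1)*N/NTHREADS),
 * so no two workers ever write the same C[i][j] and no data race exists. */
static void matmul_rows(double A[N][N], double B[N][N], double C[N][N], int t)
{
    int lo = t * N / NTHREADS, hi = (t + 1) * N / NTHREADS;
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < N; j++) {
            double tmp = 0.0;
            for (int k = 0; k < N; k++)
                tmp += A[i][k] * B[k][j];
            C[i][j] = tmp; /* each (i, j) is written by exactly one worker */
        }
}
```

Because the row ranges are disjoint, this decomposition stays correct even when the worker calls run concurrently, which is exactly why the OpenMP version needs no synchronization on `C`.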
* **Cache Blocking**
    * Cache blocking addresses the mismatch between the cache capacity and the matrix size by reshaping the computation so that the algorithm's reuse distance fits within the cache blocks, eliminating over-fetch and turning wasted cache blocks into fully utilized ones.

<a id="synchornization-problem"></a>
* **Synchronization Problem**
    * **Data Race between threads**
    ```c=
    double compute_sum(double *a, int a_len) {
        double sum = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < a_len; i++) {
            sum += a[i];
        }
        return sum;
    }
    ```
    * When the loop is annotated with `#pragma omp parallel for`, OpenMP executes different iterations of the loop concurrently using multiple threads. The loop index `i` does not cause a race because OpenMP treats the iteration variable as private to each thread by default, and each thread updates its own `i`. The array access `a[i]` also does not cause a race because all threads only read from `a`, and concurrent reads of the same memory location are safe.
    * The variable `sum` causes a data race because it is shared among all threads by default, and every thread executes `sum += a[i]` at the same time. The expression `sum += a[i]` is not a single atomic operation; it is a read–modify–write sequence that first reads the current value of `sum`, then adds `a[i]`, and then writes the result back to `sum`. If two threads perform this sequence concurrently, they can both read the same old value of `sum` and then each write back an updated value, so one update overwrites the other and the final result becomes nondeterministic and often incorrect.
    * **Solution -- OpenMP Reduction**
    ```c=
    double compute_sum(double *a, int a_len) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < a_len; i++) {
            sum += a[i];
        }
        return sum;
    }
    ```
    * In OpenMP, a `reduction` clause tells the compiler that multiple threads will update the same logical variable using a specific operator, and OpenMP should combine those updates safely.
    When you write something like `reduction(+: nSum)`, OpenMP creates a **private copy of `nSum` for each thread** and initializes that private copy to the **identity value** of the operator (for `+`, the identity is `0`). During the parallel loop, each thread updates only its own private copy, so no other thread can interfere and no data race occurs. After all threads finish their assigned iterations, OpenMP performs a **merge step at the end of the parallel region** where it combines all private copies into the original shared variable using the same operator, so the final `nSum` equals the sum of every thread's partial result.
* **OpenMP Pitfalls**
    * **Data Dependencies**
    ```c=
    a[0] = 1;
    #pragma omp parallel for
    for (i = 1; i < 5000; i++)
        a[i] = i + a[i-1];
    ```
        * Each iteration depends on the previous one, so a deterministic execution order is required and the loop cannot benefit from parallelization as written.
    * **Sharing Issues**
    ```c=
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        temp = 2.0 * a[i];
        a[i] = temp;
        b[i] = c[i] / temp;
    }
    ```
        * Global variables shared across threads cause inconsistent results in the loop.
        * The solution for this pitfall is to privatize the shared variable for each thread.
        ```c=
        #pragma omp parallel for private(temp)
        ```
    * **Updating a Shared Variable Simultaneously**
        * See the [synchronization problem](#synchornization-problem) above.
    * **Parallel Overhead**
        * If we **spawn** and **release** threads too often, the overhead can significantly reduce the speedup of multi-threading, or even cost more time than sequential execution.
* **Multiprocessor Caches**
![image](https://hackmd.io/_uploads/SySPTZTQZx.png)
    * In a shared-memory multiprocessor architecture, each processor still needs a **private cache** to reduce accesses to main memory.
    * Caches communicate through an interconnection network.
    * For a **store instruction**, besides writing the data to its own cache and memory, the processor's cache also checks whether other caches hold the same cache block and invalidates them so they don't keep out-of-date data.
    * For a **load instruction**, if there is a local private cache miss, we can check through the interconnection network whether another cache has the data and read it from there, reducing accesses to memory.
* **MSI Protocol**
    * **Three states**
        * **Modified (M)**
            * The cache holds the sole up-to-date copy of the data and sets the `dirty bit` to 1 (inconsistent with memory).
            * When transitioning to the **Modified** state (writing to a cache block), the cache must invalidate other caches holding the same cache block, converting their state from **Shared** to **Invalid**.
            * When another cache requests to read the data, the cache with modified data **must write the data back to memory** and transition to the **Shared** state.
        * **Shared (S)**
            * The cache shares up-to-date data with other caches or memory.
        * **Invalid (I)**
            * The data in the cache block is invalid. On a read request, the cache must access memory and signal the cache holding **Modified** data to update memory first.
    * **MSI Problem**
        * Writing to a **Shared** cache block is expensive, since we have to check whether other caches also hold the data and invalidate them.
        * **Solution** -- **MESI** protocol
            * A new **Exclusive** state separates the case where only one cache holds the data from the **Shared** state. In this state, when a write occurs, the cache doesn't have to check whether other caches hold the data and invalidate them; it can simply write to its own cache block.
            * In this protocol, a cache in the **Shared** state has at least one other cache containing the same data block.
    * **MESI Problem**
        * Sharing a block in the **Modified** state is expensive, since a cache in the **Modified** state must write back to memory every time another cache requests to read the modified data block.
        * Solution -- **MOESI** protocol
            * A new **Owner** state combines the **Modified** and **Shared** states. If a cache requests to read a data block, it checks whether another cache holds a modified version of it; if so, it fetches the data from that cache through the interconnection network, reducing the chance of accessing memory.
            * The transition to **Owner** happens when a **Modified** cache receives a read request from another cache.
            * In this protocol, a **Shared** state cache still has at least one other cache that definitely holds the same copy of the data, but the copy in memory **may not be up-to-date**.
* **MOESI Cache Coherence – Compatibility Matrix**

| Cache A \ Cache B | M | O | E | S | I |
|-------------------|---|---|---|---|---|
| **M** | Not allowed | Not allowed | Not allowed | Not allowed | Allowed |
| **O** | Not allowed | Not allowed | Not allowed | Allowed | Allowed |
| **E** | Not allowed | Not allowed | Not allowed | Not allowed | Allowed |
| **S** | Not allowed | Allowed | Not allowed | Allowed | Allowed |
| **I** | Allowed | Allowed | Allowed | Allowed | Allowed |

* **Truth Table of MOESI:**

| State | Valid Bit | Dirty Bit | Shared Bit |
|------------|-----------|-----------|------------|
| Modified | 1 | 1 | 0 |
| **Owned** | **1** | **1** | **1** |
| Exclusive | 1 | 0 | 0 |
| Shared | 1 | 0 | 1 |
| Invalid | 0 | X | X |

* **Coherence Miss (The fourth C)**
    * A cache miss that happens because another cache in the multiprocessor system invalidated the cache block. When this bounces between two or more caches, it is called **Ping-Ponging**. It is mostly caused by **False Sharing**, where two unrelated pieces of data used by different processors happen to be located in the same cache block.
    * False sharing can be prevented by reducing the cache block size or by placing frequently used variables/data in different cache blocks.

#### **NIU Multithreaded Application Synchronization**
* **Source**
    * [Multithreaded Application Synchronization pt.
I ](https://www.youtube.com/watch?v=kk8q5rwm1ZI)
    * [Multithreaded Application Synchronization pt. II](https://www.youtube.com/watch?v=dAVZTHxC16w)
* **What is a HART?**
    * A hart (HARdware Thread) is the fundamental execution unit that maintains its own independent state—specifically a Program Counter (PC) and registers—allowing it to fetch and run a stream of instructions.
* **Multi-Hart in Multi-Core**
    * In a multi-core architecture, harts execute in **true parallel**, as each hart is assigned its own private physical core and execution pipeline to run instructions simultaneously.
* **Multi-Hart in Single-Core**
    * In a single-core architecture, harts execute via **hardware interleaving (like SMT)**, where they share the single physical pipeline by taking turns issuing instructions **cycle-by-cycle or during memory delays** to keep the processor busy.
* **HART's unique ID and stack**
    * For a multi-hart application, assigning a unique ID to each hart is important for allocating a unique stack position in the shared memory. A multi-hart system is essentially composed of multiple single-hart units that may execute the same instructions on different data.
* **Initialize `.bss` or global data with one HART**
    * Before a program starts executing on the CPU, the `.bss` section must be cleared to 0. In a multi-hart program, only one hart should clear the `.bss` section, avoiding unnecessary re-entrance and repeated initialization.
    * In the implementation, a `done_flag` is used to indicate whether the hart with ID 0 has initialized the `.bss` section. While the flag is 0, the other harts wait for hart 0 to finish the `.bss` initialization.
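The `done_flag` gate described above can be sketched on the host with C11 atomics (a hedged analogy: `startup`, `fake_bss`, and `FAKE_BSS_WORDS` are assumed names, and `hart_id` stands in for reading `mhartid` on real hardware). Hart 0 clears the data and publishes the flag with release ordering; the other harts spin until they observe it with acquire ordering.

```c
#include <assert.h>
#include <stdatomic.h>

#define FAKE_BSS_WORDS 8 /* stand-in for the real .bss section */

static int fake_bss[FAKE_BSS_WORDS] = {7, 7, 7, 7, 7, 7, 7, 7};
static atomic_int done_flag = 0;

static void startup(int hart_id)
{
    if (hart_id == 0) {
        for (int i = 0; i < FAKE_BSS_WORDS; i++) /* only hart 0 clears */
            fake_bss[i] = 0;
        /* release: the cleared data is visible before the flag is seen */
        atomic_store_explicit(&done_flag, 1, memory_order_release);
    } else {
        /* other harts wait until hart 0 publishes the flag */
        while (atomic_load_explicit(&done_flag, memory_order_acquire) == 0)
            ;
    }
}
```

The release/acquire pairing matters: without it, a waiting hart could observe `done_flag == 1` while still reading stale, uncleared `.bss` contents.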
* **Solution for multi-threaded divide-and-conquer applications**
    * **Bank Server Application**
        * In a multithreaded bank server, the following code updates a shared account balance:
        ```c
        void process_xact(int act, float amt) {
            float bal = get_balance(act);
            bal += amt;
            set_balance(act, bal);
        }
        ```
        * This function performs a **read–modify–write** sequence on shared data. If multiple threads execute it concurrently for the same account, their operations may interleave, causing lost updates and incorrect balances.
        * Therefore, the update must be **protected as a critical section using a lock**. The RISC-V `amoswap` instruction can implement this lock safely because it provides an atomic swap (exchange) operation.
    * **amoswap**
        ```c
        /* Conceptual pseudocode (C++-style reference) for the amoswap semantics */
        int amoswap(volatile int &m, int i) {
            rmw = 1;     // enter atomic RMW region (conceptual)
            int s = m;   // read old value
            m = i;       // write new value
            rmw = 0;     // exit atomic RMW region
            return s;    // return old value
        }
        ```
        * It returns the previous contents of `m` while simultaneously updating `m` to `i`, so no other thread can observe an intermediate state. This lets `lock()` acquire the lock by swapping `m` with `1` and checking whether the old value was `0`. It also supports safe synchronization by ensuring only one thread can change `m` from unlocked to locked at a time.
    * **Atomic Acquire lock**
        ```c
        void lock(volatile int &m) {
            while (amoswap(m, 1) != 0)
                ;
        }
        ```
        * Here, `amoswap(m, 1)` atomically swaps the memory location `m` with the value `1` and returns the old value of `m`.
        * If the old value is 0, the lock was free and the thread successfully acquires it (because `m` becomes `1`).
        * If the old value is nonzero, the lock is already held by another thread, so the thread keeps spinning and retrying until it observes and atomically changes `m` from `0` to `1`.
        * The atomicity requirement is that no two threads can simultaneously see `m == 0` and both set it to `1`.
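On a host machine, the acquire loop above maps to C11 `atomic_exchange`, which—like `amoswap`—returns the old value while storing the new one in a single atomic step. This is a hedged analogy for experimenting on a PC, not the RISC-V instruction itself:

```c
#include <assert.h>
#include <stdatomic.h>

/* C11 analog of the amoswap-based spinlock */
static void lock(atomic_int *m)
{
    /* old value nonzero => lock already held, keep spinning */
    while (atomic_exchange_explicit(m, 1, memory_order_acquire) != 0)
        ;
}

static void unlock(atomic_int *m)
{
    /* mark the lock free; release ordering publishes the critical section */
    atomic_store_explicit(m, 0, memory_order_release);
}
```

The acquire/release orderings here correspond directly to the `.aq` and `.rl` bits of `amoswap.w`.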
    * **Atomic Release lock**
        ```c=
        void unlock(volatile int &m) {
            m = 0;
        }
        ```
        * Releasing the lock means writing `0` to `m`, marking it as free. For correctness, the release must ensure that all writes inside the critical section become visible before `m` is set back to `0`.
        * On real RISC-V hardware, this is typically enforced by using a release-ordered atomic (e.g., `amoswap.w.rl x0, x0, (a0)`) or an `sw` preceded by a suitable fence. Conceptually, the code expresses the release by clearing the lock variable, allowing another thread's `amoswap(m, 1)` to succeed.
    * **After implementing the atomic lock**
        ```c=
        // Global lock for protecting the critical section
        volatile int lock_m = 0;

        void process_xact(int act, float amt) {
            lock(lock_m);
            float bal = get_balance(act);
            bal += amt;
            set_balance(act, bal);
            unlock(lock_m);
        }
        ```
        * Atomicity ensures that only one thread can acquire the lock at a time, preserving correctness in a multithreaded environment.

### Study the specification of the A-Extension in the RISC-V Instruction Set Manual
* **Source**
    * Pages 72–77 in [RISC-V Instruction Set Manual Vol. 1](https://docs.riscv.org/reference/isa/_attachments/riscv-unprivileged.pdf)
* **What "RV32 A-extension" contains**
    * In the current spec text, extension A is composed of two sub-extensions:
        * **Zalrsc**: LR/SC instructions (on RV32: `LR.W`, `SC.W`)
        * **Zaamo**: AMO instructions (on RV32: `.W` forms such as `AMOSWAP.W`, `AMOADD.W`, etc.)
* All atomic instructions include two ordering bits:
    * **aq** (acquire): prevents later memory ops from being observed before the atomic op
    * **rl** (release): prevents earlier memory ops from being observed after the atomic op
    * **aq+rl**: sequentially consistent for that **address domain** (**memory** or **I/O**, depending on which address domain the atomic instruction is accessing)
* **LR/SC semantics (Zalrsc)**
    * **Instruction Format:**

    | Bits | 31–27 | 26 | 25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
    |-------------|-------|----|----|-------|-------|-------|------|-----|
    | Field name | funct5 | aq | rl | rs2 | rs1 | funct3 | rd | opcode |
    | Width (bits)| 5 | 1 | 1 | 5 | 5 | 3 | 5 | 7 |
    | Description | Atomic op selector (LR/SC/AMO type) | Acquire ordering | Release ordering | Source register (value) | Source register (address) | Data width / type | Destination register | AMO opcode |
    | LR.W / LR.D | 00010 | ordering | ordering | 0 | addr | width | dest | 0101111 |
    | SC.W / SC.D | 00011 | ordering | ordering | src | addr | width | dest | 0101111 |

    * `LR.W` loads a 32-bit word from the address in rs1 into rd and creates a **reservation set** covering at least **the addresses of the accessed bytes** (the size of the reservation set is determined by the platform itself).
        * If a reservation set already exists, the most recent `LR` instruction overwrites the previous reservation (effectively invalidating it).
        * To make LR/SC sequentially consistent, set the `aq` bit on `LR` to give acquire semantics.
    * `SC.W` attempts to store rs2 to the address in rs1 **only if the reservation is still valid**; it writes 0 to rd on success, nonzero on failure. Either way, executing SC.W clears the reservation.
    * `SC.W` must fail if another hart's store/AMO overlaps the reservation set between the LR and the SC,
        * or if another `SC` occurred in between (in program order),
        * or if the `SC` address isn't in the reservation set of the most recent `LR`,
        * or if an external device writes to bytes that overlap the reservation set (at minimum, the exact bytes loaded by `LR`).
    * To make LR/SC sequentially consistent, set the `rl` bit on `SC` to give release semantics.
* **Alignment Requirement for LR/SC (Zalrsc)**
    * For `LR` and `SC` instructions, the Zalrsc extension **requires natural alignment** of the address in `rs1`:
        - **Word (`.W`)** operations must be **4-byte aligned**
    * If the address is **not naturally aligned**, the processor will raise an exception:
        - either an **address-misaligned exception**,
        - or an **access-fault exception** (used when the system chooses **not** to emulate misaligned accesses).
    - This means misaligned LR/SC accesses are **not guaranteed to work** and generally **cannot be emulated**, so software must always ensure proper alignment.
- **Constrained LR/SC loops**
    - Constrained LR/SC loops are short, tightly restricted LR→SC retry sequences for which the RISC-V execution environment guarantees that some observable progress will eventually occur, whereas unconstrained LR/SC sequences may fail indefinitely. A constrained LR/SC loop must satisfy **all** of the following:
        - The loop contains **only**:
            - one LR instruction,
            - one matching SC instruction,
            - and minimal retry code.
        - The loop is **at most 16 instructions** long and is placed **contiguously in memory**.
        - Instructions between `LR` and `SC`:
            - must come from the base **I** instruction set,
            - **must not** include loads, stores, backward jumps, taken backward branches, `JALR`, `FENCE`, or `SYSTEM` instructions,
            - compressed (`Zca`, `Zcb`) forms are allowed.
        - Retry code may use backward branches to repeat the LR/SC sequence but follows the same instruction restrictions.
        - `LR` and `SC` must access:
            - the **same effective address**,
            - the **same data size**,
            - and a memory region that supports the **LR/SC eventuality property** (defined by the execution environment).
- **Example code of a constrained LR/SC loop**
```asm=
        # a0 holds address of memory location
        # a1 holds expected value
        # a2 holds desired value
        # a0 holds return value, 0 if successful, !0 otherwise
    cas:
        lr.w t0, (a0)       # Load original value.
        bne t0, a1, fail    # Doesn't match, so fail.
        sc.w t0, a2, (a0)   # Try to update.
        bnez t0, cas        # Retry if store-conditional failed.
        li a0, 0            # Set return to success.
        jr ra               # Return.
    fail:
        li a0, 1            # Set return to failure.
        jr ra               # Return.
```
* **AMO semantics (Zaamo)**
    * **Instruction Format**

    | Bits | 31–27 | 26 | 25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
    |-------------|-------|----|----|-------|-------|-------|------|-----|
    | Field name | funct5 | aq | rl | rs2 | rs1 | funct3 | rd | opcode |
    | Width (bits)| 5 | 1 | 1 | 5 | 5 | 3 | 5 | 7 |
    | Description | AMO function selector | Acquire ordering | Release ordering | Source register (value) | Source register (address) | Data width | Destination register | AMO opcode |
    | AMOSWAP.W / D | 00001 | 0/1| 0/1| src | addr | width | dest | 0101111 |
    | AMOADD.W / D | 00000 | 0/1| 0/1| src | addr | width | dest | 0101111 |
    | AMOAND.W / D | 01100 | 0/1| 0/1| src | addr | width | dest | 0101111 |
    | AMOOR.W / D | 01000 | 0/1| 0/1| src | addr | width | dest | 0101111 |
    | AMOXOR.W / D | 00100 | 0/1| 0/1| src | addr | width | dest | 0101111 |
    | AMOMAX.W / D | 10100 | 0/1| 0/1| src | addr | width | dest | 0101111 |
    | AMOMAXU.W / D | 11100 | 0/1| 0/1| src | addr | width | dest | 0101111 |
    | AMOMIN.W / D | 10000 | 0/1| 0/1| src | addr | width | dest | 0101111 |
    | AMOMINU.W / D | 11000 | 0/1| 0/1| src | addr | width | dest | 0101111 |

    * AMOs perform an atomic **read–modify–write** on the word at the address in rs1:
        * atomically load the old value from the address in `rs1` into `rd`,
        * compute op(old, rs2),
        * store the result back to the memory address in `rs1`.
* **Address Alignment**
    * By default, AMOs require natural alignment (4 bytes for `.W`); otherwise an address-misaligned or access-fault exception is raised.
    * If the platform supports the **misaligned atomicity granule (MAG)** PMA, the requirement is relaxed:
        * A misaligned AMO is allowed and still atomic as long as all accessed bytes fall within a single MAG (a power-of-two-sized region).
        * If the access crosses a granule boundary, the exception behavior applies.
* **Code Example - Critical Section**
```asm=
        li t0, 1                    # Initialize swap value.
    again:
        amoswap.w.aq t1, t0, (a0)   # Attempt to acquire lock.
        bnez t1, again              # Retry if held.
        # ...
        # Critical section.
        # ...
        amoswap.w.rl x0, x0, (a0)   # Release lock by storing 0.
```
* The `.aq` bit gives acquire ordering, so operations in the critical section cannot be observed before the lock is acquired.
* The `.rl` bit gives release ordering, so all critical-section memory updates become visible before other harts can observe the unlock.
* **Advantages**
    * AMOs scale well on highly parallel systems (often better than LR/SC or CAS).
    * They can be implemented efficiently (even at a memory controller).
    * They are practical for microcontroller-class designs without caches, which may not support LR/SC.
    * They map well to C11/C++11 atomics, parallel reductions, and atomic updates to MMIO device registers.

### Connecting Theories from the Lecture Videos with the MyCPU A-Extension
Nowadays, multi-hart or multi-processor architectures are very common ways to increase chip performance and speed up program execution. In MyCPU's `4-soc` project, with the AXI4-Lite bus system already implemented, there is a lot of potential to add multiple processors with a shared-memory feature to MyCPU. To migrate MyCPU to a multi-processor design, however, we need to ensure all CPUs can access the shared memory correctly, and correct execution requires strict handling of concurrent memory accesses.
The RISC-V A-extension provides the necessary hardware primitives, such as LR/SC and AMO instructions with acquire and release semantics, to implement mutual exclusion, producer–consumer synchronization, and ordered critical sections as described in the lecture videos. By integrating the A-extension, MyCPU can safely support multithreaded software on multiple harts while avoiding data races and ensuring correct memory visibility across processors.

### Research existing multi-threaded or concurrent programs

#### **Reference Concurrent Program: [ring-buffer.c](https://github.com/sysprog21/concurrent-programs/blob/master/ringbuffer/ringbuffer.c)**
* **Adapting to a bare-metal program**
    * Eliminating the C runtime / libc
        * `memset`
        ```c=
        static void *ringbuf_memset(void *dst, int value, size_t len)
        {
            unsigned char *p = (unsigned char *) dst;
            while (len--) {
                *p++ = (unsigned char) value;
            }
            return dst;
        }
        ```
        * This function can replace libc `memset` by iterating over the destination buffer byte by byte, writing the given `value` into each of the `len` bytes, then returning the original destination pointer.
        * `printf`
            * In a bare-metal program, the `return -1;` is already sufficient to indicate that ring buffer creation failed, so you can omit `printf` entirely and instead rely on debugging techniques—such as stepping execution and inspecting registers (e.g., `a0`/return value) and relevant memory locations—to confirm the failure state.
        * `assert`
            * In a bare-metal program, `assert` can be eliminated because there is no standard error-reporting or process-termination mechanism; instead, you can explicitly check the condition and return an error code—such as
            ```c
            if (i != (int)(intptr_t)obj) {
                return -1;
            }
            ```
            —which is sufficient to indicate failure and can be verified by inspecting registers and memory while debugging the bare-metal execution.
* `malloc` and `free`
    ```c=
    #define RING_COUNT (1u << 6) // 64 entries (power of 2)

    #define RINGBUF_BYTES(count) \
        (((sizeof(ringbuf_t) + (count) * sizeof(void*) + (CACHE_LINE_SIZE - 1)) / CACHE_LINE_SIZE) * CACHE_LINE_SIZE)

    static unsigned char ring_storage[RINGBUF_BYTES(RING_COUNT)]
        __attribute__((aligned(CACHE_LINE_SIZE)));
    ```
    * `RING_COUNT` fixes the ring capacity at compile time, and `RINGBUF_BYTES(RING_COUNT)` computes the exact byte size (rounded up to a cache-line multiple) needed for `ringbuf_t` plus its `ring[]` pointer table. You can therefore declare a global buffer as shown above and replace `ringbuf_create()`'s `malloc(ring_size)` with `ringbuf_t *r = (ringbuf_t *)ring_storage; ringbuf_init(r, RING_COUNT);`, eliminating the libc `malloc`/`free` dependency.
* `errno.h`
    * On bare metal, skip `<errno.h>` and just define the few error codes used in the program. For example, near the top of the file add:
        * `#define EINVAL 22`
        * `#define ENOBUFS 105`
        * `#define EDQUOT 122`
        * `#define ENOENT 2`

    This avoids any libc dependency while keeping the same error codes.
#### **Fitting with required validations** * **LR/SC behavior** ```c= /* CAS for head reservation (no aq/rl needed for head itself) */ static inline int cas_w(volatile uint32_t *p, uint32_t expect, uint32_t desired) { uint32_t old, sc; asm volatile( "0:\n" " lr.w %0, (%2)\n" " bne %0, %3, 1f\n" " sc.w %1, %4, (%2)\n" " bnez %1, 0b\n" " li %1, 1\n" " j 2f\n" "1:\n" " li %1, 0\n" "2:\n" : "=&r"(old), "=&r"(sc) : "r"(p), "r"(expect), "r"(desired) : "memory" ); return (int)sc; } ``` * This `cas_w()` adds LR/SC behavior by using `lr.w` to take a reservation on `*p`, conditionally attempting `sc.w` to commit `desired` only if the reservation is still valid and `old==expect`, and it satisfies the RISC-V loop constraint by enclosing `lr.w/sc.w` in a tight retry loop (`bnez %1, 0b`) that repeats the LR/SC pair until `sc.w` succeeds or the expected value check fails. * **Contention** ```c= int main(void) { ringbuf_t *r = (ringbuf_t *)ring_storage; uint32_t id = read_mhartid(); if (id == 0) ringbuf_init(r, RING_COUNT); barrier_4(); if (id < 2) { /* 2 producers */ for (uint32_t i = 0; i < ITERS; i++) { void *obj = (void *)(uintptr_t)((id << 28) ^ i); while (ringbuf_mp_enqueue_1(r, obj) != 0) { } } } else { /* 2 consumers */ for (uint32_t i = 0; i < ITERS; i++) { void *obj; while (ringbuf_mc_dequeue_1(r, &obj) != 0) { } } } return 0; } ``` * By adding `ringbuf_mp_enqueue_1` and `ringbuf_mc_dequeue_1` and mapping two producers and two consumers by `mhartid`, multiple harts repeatedly race to update the same shared head/tail indices via LR/SC-based CAS (causing `sc` failures and spin/retry loops), thereby creating contention unlike the original SPSC `ringbuffer.c`. 
* **Memory-Ordering Constraints**
    ```c=
    static volatile uint32_t start_count = 0;
    static volatile uint32_t start_go = 0;

    static inline void barrier_4(void)
    {
        /* atomic increment start_count using CAS */
        while (1) {
            uint32_t v;
            v = start_count;
            if (cas_w(&start_count, v, v + 1))
                break;
        }

        if (read_mhartid() == 0) {
            while (start_count < 4) { }
            store_w_rl(&start_go, 1);           // release: start signal
        } else {
            while (lr_w_aq(&start_go) == 0) { } // acquire: wait start signal
        }
    }
    ```
    * `barrier_4()` enforces a **global happens-before start point** by having hart 0 publish the "start" flag with a **release store** (`store_w_rl(&start_go, 1)`), while other harts spin until they observe it with an **acquire load** (`lr_w_aq(&start_go)`). All memory operations performed before the barrier are therefore ordered before any producer/consumer actions after the barrier, making the memory-ordering constraints explicit in the MP/MC test through the `aq`/`rl` semantics of the A-extension, compared to the SP/SC baseline.

You can see the full implementation of the test program in [e5e9968](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/A-Extension/4-soc/csrc/ring-buffer.c)

#### **Validate correctness of test program with Spike**

* **Setup Spike environment for multi-hart execution**
    * You can follow the instructions in the `README.md` of [riscv-isa-sim](https://github.com/riscv-software-src/riscv-isa-sim) to set up Spike in a local environment. After building Spike, we have to modify the source code to simulate multi-hart behavior:
        * Change multi-hart execution from the default "run ~5000 instructions per hart before switching" to instruction-by-instruction interleaving, which better matches a single-core time-sliced multi-hart baseline.
        * Set `static const size_t INTERLEAVE = 1;` in `riscv/sim.h` (from 5000) and rebuild/reinstall.
    * Run `spike --isa=rv32imac -p4 ring-buffer.elf` to start Spike with the RV32IMAC ISA and execute `ring-buffer.elf` on **4 harts** (multi-hart) as specified by `-p4`.
* **Encountering Problems and Bug Discovery**
    * **Unsafe `.bss` initialization and stack allocation under multi-hart execution**
        * With the original `init.S`, **all 4 harts start at `_start`, set the same `sp = 0x00400000`, and concurrently clear `.sbss`/`.bss`**, so they **corrupt each other's call frames/return addresses via a shared stack** and **randomly re-zero shared globals (ring indices, barrier counters/flags, etc.) while other harts are already using them**, which typically makes `ring-buffer` on Spike nondeterministic (hanging in spin-waits/barriers, trapping, or behaving incorrectly).
        * The new `init-multi-hart.S` solves this by **checking `mhartid` and letting only hart 0 clear `.bss` (then publishing a "done" flag that other harts wait for before entering `main`)**, preventing other harts from wiping shared state mid-execution; since it also gives each hart a distinct stack region, it eliminates the stack corruption as well.
    ```asm=
    ...
    csrrs x4, 0xf14, x0        # load the hart ID into a temp reg

    # Allocate individual stacks for each hart
    slli x4, x4, 12            # temp = hartid * 0x1000
    sub x2, x2, x4             # sp = sp - temp

    bne x4, x0, wait_hart0     # wait for hart 0 to initialize everything

    # hart 0: clear the .sbss and .bss segments
    la t0, __sbss_start
    la t1, __sbss_end
sbss_clear_loop:
    bgeu t0, t1, sbss_clear_done
    sw zero, 0(t0)
    addi t0, t0, 4
    j sbss_clear_loop
sbss_clear_done:

    # Clear .bss section (large uninitialized data)
    la t0, __bss_start
    la t1, __bss_end
bss_clear_loop:
    bgeu t0, t1, bss_clear_done
    sw zero, 0(t0)
    addi t0, t0, 4
    j bss_clear_loop
bss_clear_done:

    # hart 0 will set the init_done flag to 1
    la x5, init_done
    li x6, 1
    sb x6, 0(x5)               # set the init_done flag

    call main
    j spike_terminate

wait_hart0:
    la x5, init_done
wait_hart0_loop:
    lb x6, 0(x5)               # x6 = init_done
    beq x6, x0, wait_hart0_loop # if init_done is zero, keep waiting

    # Call main function
    call main
    ...
```
* **Incompatible linker script setup**
    * With the original `link.ld` placing `.text` at `0x00001000` and not defining `.tohost`/`.fromhost`, Spike's expected RAM/reset mapping (DRAM base `0x80000000`) and the HTIF communication symbols don't line up, so `ring-buffer.elf` may fetch or access invalid addresses, or appear to hang because Spike can't observe the program's exit/host communication.
    * The new `link-spike-mh.ld` fixes this by locating `.text` at `0x80000000` and creating a 4KB-aligned `.tohost` section that defines `tohost`/`fromhost` at the required memory-mapped locations, enabling correct execution and proper termination signaling.
* **Live-lock caused by harts forever retrying CAS**
    * The original CAS livelocks because multiple harts repeatedly execute identical tight LR/SC retry loops that keep breaking each other's reservations without delay. To solve this, the modified version adds a hart-dependent backoff after a failed `sc.w`, desynchronizing retries so one hart can complete its LR/SC sequence and make progress.
* **False Sharing**
    * The original layout causes false sharing because the producer and consumer `head`/`tail` fields share cache lines, so unrelated writes break LR/SC reservations; the padded layout isolates `head` on its own cache line to avoid this.

You can see the whole modified code, after fixing the issues encountered when executing the MPMC test program in Spike's multi-core environment, in commit [b39197e](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/commit/b39197e49e5db7a2a2b804a7bf3bdcc42e605974).

* **Why `sc.w` still fails**
    * Despite the modifications, the Spike log still shows `sc.w` failing when running the `cas_w` function. After researching online, I found that Spike has a known behavior where it calls `yield_load_reservation()` when its instruction-interleaving quantum is reached, which can produce **false SC failures** if the LR and SC get separated across that boundary.
This is discussed directly in Spike’s issue tracker [#278](https://github.com/riscv-software-src/riscv-isa-sim/issues/278), including the concern that it can cause “false failure on SC”. * If running with **very small interleave (e.g., 1)**, LR and SC are almost guaranteed to be separated by scheduling, making `sc.w` fail frequently even without a conflicting store to the same address. * **Solution** * By increasing Spike’s interleave quantum in `sim.h` from 1 to 16, `spike.log` shows `lr.w` and the subsequent `sc.w` are more often executed within the same scheduling slice (avoiding `yield_load_reservation()` between them), which eliminates the spurious reservation loss and stops the repeated `sc.w` failures. ## Inspect `4-soc` project ### Additions in `4-soc` compared to `3-pipeline` #### **System/SoC integration (new in 4-soc)** - **AXI4-Lite bus infrastructure** replaces the direct memory/peripheral hookup used in the simpler pipeline lab. This includes the AXI4-Lite protocol implementation and bundles for master/slave transactions in **`4-soc/src/main/scala/bus/AXI4Lite.scala`**, plus multi-master arbitration infrastructure in **`BusArbiter.scala`**. - **AXI4Lite.scala** - **Role:** Implements the AXI4-Lite bus protocol, defining the interface between the CPU (Master) and peripherals/memory (Slaves). - **Protocol Definition** - The file defines the standard 5 independent channels of AXI4-Lite: 1. **Write Address ( AW ):** Address and protection info. 2. **Write Data ( W ):** Data and byte strobes (`WSTRB`). 3. **Write Response ( B ):** Status response (`BRESP`) from slave to master. 4. **Read Address ( AR ):** Address for reads. 5. **Read Data ( R ):** Data and read response (`RRESP`). - **Master Module Logic** - The `AXI4LiteMaster` manages the handshake signals (`VALID`/`READY`). - **State Machine:** It cycles through `Idle`, `ReadData`, `WriteData`, and `WriteResp`. 
- **Performance Optimization (Pre-assertion):** To save a clock cycle, the master asserts `ARVALID` (read) or `AWVALID`/`WVALID` (write) while still in the `Idle` state if a request is detected. This allows the address phase to complete immediately on the next clock edge. - **Posted-Write Optimization:** The logic includes a `write_data_accepted` signal. This fires when data is accepted (`WREADY && WVALID`) *before* the final `BRESP` arrives. While intended to let the pipeline proceed early, the `MemoryAccess` module currently waits for the full `BRESP` conservatively. - **Slave Module Logic** - The `AXI4LiteSlave` is a generic wrapper that converts AXI signals into a simple "Bundle" (Address, Read, Write, Data) for easy attachment of peripherals. - **It enforces strict ordering**: Address Phase $\to$ Data Phase $\to$ Response Phase. It acknowledges requests by asserting `READY` signals only when in the correct state. - **SoC-style peripherals** are present in 4-soc (VGA, UART, timer, RAM, ROM loader) and wired through a bus, whereas 3-pipeline only carries simple peripherals without a bus fabric. #### CPU microarchitecture enhancements in 4-soc * **Branch prediction hardware** (**BTB** + **RAS** + **IndirectBTB**) - **BranchTargetBuffer.scala (BTB)** - **Role:** Provides dynamic branch prediction in the Instruction Fetch (IF) stage to reduce control hazards. - **Architecture** - The BTB is implemented as a direct-mapped cache with a configurable number of entries (default 16). Each entry contains a **Valid bit**, a **Tag**, a **Target Address**, and a **2-bit Saturating Counter**. - **Prediction Logic (IF Stage)** - The lookup happens combinatorially: 1. **Index & Tag:** The PC is split into an index (to select the entry) and a tag (to verify identity). 2. **Hit Condition:** A hit occurs if `valid(index)` is true and `tags(index)` matches the current PC tag. 3. 
**Direction Decision:** The prediction is `taken` only if there is a Hit **AND** the 2-bit counter is $\ge 2$ (Weakly Taken or Strongly Taken). 4. **Target Selection:** If taken, `predicted_pc` becomes the stored target; otherwise, it is $PC + 4$. - **Update Logic (ID Stage)** - Updates occur sequentially when the actual branch outcome is resolved in the ID stage: - **New Entry:** If a branch is taken but misses the BTB, a new entry is allocated with the counter initialized to **Weakly Taken (2)**. - **Existing Entry Update (Hysteresis):** - **Taken:** The counter increments, saturating at 3 (Strongly Taken). - **Not Taken:** The counter decrements. If the counter is at **Weakly Not Taken (1)** and decrements, the entry is invalidated (`valid := false`) rather than storing a 0. This proactively clears "dead" branches from the buffer. * **Full 5‑stage pipeline integration for the SoC** is expanded in 4-soc with dedicated stage modules (IF/ID/EX/MEM/WB, forwarding, memory access FSM). - **MemoryAccess.scala** - **Role:** The Memory (MEM) stage controller. It acts as a bridge between the CPU pipeline and the AXI4-Lite bus, handling stall generation, data formatting, and hazard forwarding. - **Program Logic & State Machine** - The core addition of this module is a Finite State Machine (FSM) that pauses the pipeline while waiting for asynchronous bus transactions. - **FSM States (`mem_access_state`):** - **Idle:** Default state. Monitors `memory_read_enable` and `memory_write_enable`. If an operation is requested, it asserts the bus request, raises `ctrl_stall_flag` to freeze the pipeline, and transitions to `Read` or `Write`. - **Read:** Keeps `bus.request` and `ctrl_stall_flag` high. It waits for `io.bus.read_valid`. Upon receipt, it processes the raw data (sign-extending bytes/halfwords based on `funct3`) and transitions back to `Idle`. 
- **Write:** Keeps `bus.request` and `ctrl_stall_flag` high until `io.bus.write_valid` (which corresponds to the AXI Write Response B-channel) is received. This ensures the store is fully acknowledged before the pipeline proceeds. - **Critical Timing & Latching Logic** - A significant portion of the logic addresses the timing mismatch between the asynchronous bus completion and the synchronous pipeline registers. - **The "One Cycle Gap" Problem:** When `read_valid` arrives, the stall is released immediately. However, the Writeback (WB) stage captures data at the *next* clock edge. If the control signals (`regs_write_source`, etc.) are not preserved, the WB stage might capture the signals of the *next* instruction entering the MEM stage, leading to data corruption. - **The Solution (`read_just_completed`):** 1. **Latching:** When entering the `Read` state, control signals are saved into registers (e.g., `latched_regs_write_source`). 2. **Extension:** A flag `read_just_completed` is set for exactly one cycle after the bus transaction finishes. 3. **Selection:** The output MUX `wb_effective_regs_write_source` selects the **latched** values if the module is reading or if `read_just_completed` is true. This guarantees the WB stage sees the correct control signals for the load instruction, even as the MEM stage transitions to the next instruction. - **Data Forwarding** - **Immediate Availability:** Unlike the WB path, the `forward_to_ex` output (used to solve hazards in the EX stage) does *not* wait for latching. It routes `io.wb_memory_read_data` directly. Inside the `Read` state, this value is updated combinatorially as soon as `read_valid` hits, allowing the dependent instruction in the EX stage to proceed in the same cycle the load finishes. 
### Conception of Implementation after Inspection

#### **Adapt to Multi-Processor**

* Instantiate multiple CPUs in both `Top.scala` and `TestTopModule.scala`, and wire the CPUs through `BusArbiter.scala` and `BusSwitch.scala` to determine which CPU may use the AXI4-Lite bus to access memory.
* Need to increase `MasterDeviceCount` to `4` in `Parameter.scala`.

#### Integrate LR/SC & AMO semantics with the pipelined CPU

* Add the RISC-V **encoding information** for the LR/SC and AMO operations in `Instruction.scala`.
* Need to add an **extra stall signal** and a **forwarding signal** for AMO operations.
* Add a **reservation set table** in memory to record every hart's reservation set for LR/SC.

### Apply `DebugWriteTest` Scala Test logic

`DebugWriteTest` works by loading `uart.asmbin` into the DUT's memory via `TestTopModule`, stepping the clock for sufficient cycles, and then using the debug memory read interface to assert that the program base address contains non-zero instructions.

The same logic can be mimicked for `ring-buffer.asmbin` in A-extension characteristic testing in `4-soc` by loading the binary into the 4-core top module, running long enough for multi-hart execution, and asserting via debug reads that the expected ring-buffer state (e.g., program area, head/tail, or a done flag) has been written to memory.

You can see the whole program of `DebugWriteTest` [here](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/dev/4-soc/src/test/scala/riscv/DebugWriteTest.scala)

## Implementation

### High-level flow (pipeline + bus)

1. **Decode (ID)** recognizes AMO/LR/SC encodings and emits control signals.
2. **ID2EX/EX2MEM** pipeline registers carry `is_lr`, `is_sc`, `is_amo`, and AMO fields to MEM.
3. **MemoryAccess (MEM)** performs:
    - **LR**: load + reservation set.
    - **SC**: conditional store with success/fail code.
    - **AMO**: read-modify-write using the loaded value and `rs2` data.
4. **AXI4-Lite**: all memory transactions ride on the AXI4-Lite master interface.
5.
**Multi-core snooping**: Top/TestTopModule detect write completion and broadcast reservation invalidation to all CPUs.

### LR/SC Implementation (A-extension)

- **Instruction encoding & decode**
    - The implementation uses the AMO opcode `0101111`, where `funct5` distinguishes `LR` and `SC`, `funct3=010` selects word-sized operations, and the decoder generates `is_lr` and `is_sc` signals with `rs1` as the address source, `rs2` used only by `SC`, and an immediate fixed to zero.
- **Pipeline propagation**
    - The `is_lr`, `is_sc`, and ordering-related bits are propagated from ID through `ID2EX` and `EX2MEM` to the MEM stage, while the normal ALU result is reused as the effective memory address and writeback correctness is preserved across stalls by latched WB signals.
- **MemoryAccess semantics**
    - The MEM stage maintains a per-hart reservation consisting of a **valid bit** and an **address**, sets the reservation on successful LR completion, evaluates SC success by matching the reservation address, writes `rs2` to memory only on success, returns `0` on success or `1` on failure to `rd`, and clears the reservation in all cases.
- **Why I chose address-only reservation granularity**
    - Since the `4-soc` project doesn't implement a cache, adopting a **block-based reservation granularity** would introduce unnecessary hardware complexity and overhead without providing practical benefits.
    - The key difference is that address-only granularity tracks reservations **at a single memory address**, while block-based granularity tracks an **entire cache block** and may cause false reservation invalidations due to unrelated stores.
    - Therefore, address-only reservation granularity better matches the current MyCPU microarchitecture and still correctly supports the **LR/SC** semantics required for synchronization in a shared-memory multi-hart system.
- **Multi-core integration** - In the multi-core system, all harts share an AXI4-Lite bus where write-address snoops generated on write completion are broadcast to invalidate reservations in other harts. --- ### AMO Implementation (A-extension) - **Instruction encoding & decode** - AMO instructions use the same AMO opcode `0101111` with `funct5` selecting the specific AMO operation, `rs1` providing the target address, `rs2` providing the operand, and the immediate forced to zero. - **Pipeline propagation** - AMO control signals, including the selected `funct5` and ordering bits, are carried unchanged through `ID2EX` and `EX2MEM` into the MEM stage, while ALU AMO functions exist only for completeness and are not used for final atomic semantics. - **MemoryAccess semantics** - Each AMO is executed in MEM as a serialized read–modify–write sequence that first reads memory, computes the AMO result from the loaded value and `rs2`, writes the result back to the same address, and returns the original loaded value to `rd` as required by the RISC-V specification. - **Multi-core behavior** - AMO atomicity across multiple harts is achieved by funneling all read–modify–write transactions **through the shared bus arbiter rather than relying on AXI-level atomic support**, and the latched writeback logic ensures correctness under bus stalls. - **Why bus-arbiter-level of serialization is enough for atomicity** - The arbiter/mux ensures **only one master’s AXI channel** is active at a time, so the AMO read and the following AMO write are both guaranteed to be globally serialized on the shared bus. With serialization at the bus, no other hart can slip another read/write between the AMO’s read and write, which is exactly the atomicity requirement for RMW in this architecture. 
- **Timing Diagram of an AMO Instruction spanning multiple cycles**

    In commit [36e887a](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/commit/36e887ac08e8bfa9cb06e33d18fb2041c3f0fbb7), I added VCD waveform dumping to the `make sim` Verilator flow and introduced an `amoadd-test` program to observe AMO memory-stage stalls during simulation.

    ![Screenshot 2026-01-26 120710](https://hackmd.io/_uploads/SybqFDNIbx.png)
    ![Screenshot 2026-01-26 120747](https://hackmd.io/_uploads/rkZ9tD4LZl.png)

    - The waveform shows an `amoadd.w` instruction lingering in the MEM stage while the bus performs a read followed by a write, so the AMO spans multiple cycles. The `io_bus_request`, `io_bus_read_valid`, and later `io_bus_write_valid` signals indicate the two-phase read-modify-write sequence. The pipeline remains stalled during these phases, which is why the same instruction stays visible across several clock edges.

You can see the whole implementation of the LR/SC and AMO instructions in commit [dd71c84](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/commit/dd71c845da591b7f3c8dc8b9661500e4f12212bf)

### `aq` / `rl` Bits Memory Ordering Guarantee

In the current implementation, `.rl` (release) is enforced by stalling the pipeline as soon as the bit is visible in ID and again just before issuing LR/AMO/SC in MEM, ensuring no younger operations pass the release point before the atomic is issued. Conversely, `.aq` (acquire) is applied after the atomic completes by asserting a post-completion acquire fence on the LR/AMO/SC completion paths, which holds the pipeline so later operations cannot be observed before the atomic finishes. This separation is made explicit by tracking the release and acquire fences independently (`release_fence_hold` vs `acquire_fence_hold`) and combining them only into a unified stall signal, so the directionality of ordering is preserved rather than treating `.aq` and `.rl` as identical flags.
This design explicitly guarantees only atomicity with fence-based ordering around AMO/LR/SC completions; however, this is enough because **all harts/cores share a single AXI4-Lite bus to main memory**, which enforces a single global serialization order for memory transactions, thereby guaranteeing memory ordering across harts. A broader memory-model reordering guarantee in a multi-core architecture, beyond this localized intra-core ordering, would require a more complex coherence protocol and a cache implementation.

You can see the whole implementation of the `aq`/`rl` semantics in commit [cb00cbc](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/commit/cb00cbcaec0fdee5f2d360089d7329f2a5986b84)

## Test with RISCOF

### Append RISCOF test setup for 4-soc

#### Environment Setup

The RISCOF setup for the 4-soc design is configured by using `config-4-soc.ini` to select `mycpu` as the DUT plugin, specify the reference model, and point the DUT execution path to the `4-soc` project so that compliance tests run on the SoC-level implementation.

The ISA specification file `mycpu_isa_rv32i_zicsr_a.yaml` declares an RV32I core with the `Zicsr` and `A` extensions enabled, which instructs RISCOF to include CSR-related and atomic-instruction compliance tests in the test plan.

The DUT plugin implementation in `riscof_mycpu.py` defines how RISCOF compiles each compliance test using a RISC-V GCC toolchain, links it with a custom linker script, converts the resulting ELF into a `.asmbin` format, and prepares the test for execution on the 4-soc platform. In the instructor's version, there is no `4-soc` entry in `project_map`, so I added it in commit [f5e131c](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/commit/f5e131c503e30f5371fe1b3b8b268bd9a6979942)

The plugin further integrates with the existing SBT-based verification flow by auto-generating a Scala compliance test file and invoking the `soc` subproject so that all tests are executed against the 4-core SoC in simulation.
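For orientation, a RISCOF configuration in this style looks roughly like the sketch below. The section layout follows RISCOF's documented `config.ini` format, but the exact paths, the reference-plugin name, and the platform YAML name (`mycpu_platform.yaml`) are illustrative assumptions rather than the repository's actual contents:

```ini
[RISCOF]
ReferencePlugin=sail_cSim
ReferencePluginPath=./tests/sail_cSim
DUTPlugin=mycpu
DUTPluginPath=./tests/mycpu_plugin

[mycpu]
pluginpath=./tests/mycpu_plugin
ispec=./tests/mycpu_plugin/mycpu_isa_rv32i_zicsr_a.yaml
pspec=./tests/mycpu_plugin/mycpu_platform.yaml

[sail_cSim]
pluginpath=./tests/sail_cSim
```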
#### Adding A Extension Test Suite

The A-extension test suite is enabled automatically because the ISA YAML ([`tests/mycpu_plugin/mycpu_isa_rv32i_zicsr_a.yaml`](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/dev/tests/mycpu_plugin/mycpu_isa_rv32i_zicsr_a.yaml)) explicitly includes the `A` extension, causing RISCOF to select LR, SC, and AMO tests from the RISC-V architectural test repository.

Each A-extension test is compiled with an ISA string that includes `rv32ia_zicsr`, ensuring that the atomic and CSR instructions used by the test harness are accepted by the compiler and supported by the DUT.

### Encountering Problem

#### NullPointerException while running the compliance tests

* **Root Cause**
    * `ComplianceTestBase.scala` reads the absolute path of the `.asmbin` file under test when loading it into `InstructionRom`. This fails because RISCOF reads the path from `4-soc/src/main/resources/`, causing `InstructionRom` to read null from the absolute path.
* **Solution**
    * To address this, we resolve the `.asmbin` relative to the ELF location and fall back to the resource name if the file is not found:

```scala=
// FIX START: Handle both local files (RISCOF default) and Resources
val elfPath = java.nio.file.Paths.get(elfFile)
val testDir = elfPath.getParent
val resolvedPath = testDir.resolve(asmbinFile)

// Check if the binary exists where RISCOF generated it (next to the ELF).
// If it exists there, use the absolute path.
// If NOT, assume it is a Resource file in 'src/main/resources/' and pass just the filename.
val finalLoadPath = if (java.nio.file.Files.exists(resolvedPath)) {
  resolvedPath.toAbsolutePath.toString
} else {
  asmbinFile // e.g., "test_000.asmbin" - the ClassLoader will find this at the resource root
}
// FIX END

// Instantiate the 4-soc CPU with the corrected path
test(new TestTopModule(finalLoadPath)).withAnnotations(annos) { c =>
```

### RISCOF Result

#### Successfully passing on a single core

```
[info] Run completed in 21 minutes, 59 seconds.
[info] Total number of tests run: 129
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 129, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 1328 s (22:08), completed Jan 20, 2026, 11:54:29 AM
```

* **Correctness Validation**
    * Single-core RISCOF compliance runs complete successfully (129/129 tests) and demonstrate that the RV32I ISA behavior with the Zicsr and A extensions matches the architectural specification for the 4-soc CPU pipeline when executed on a single core. This validates functional correctness of the core and bus integration at the ISA level under the tested configuration.

#### Failure when trying to run on 4 cores

```
[info] MyCPU Compliance
<command-line>: fatal error: VTestTopModule__pch.h.slow: No such file or directory
<command-line>: fatal error: VTestTopModule__pch.h.slow: No such file or directory
compilation terminated.
```

* **Race Condition Issue**
    * The log above is from `tests/riscof_work_4soc/batch_test.log`. When I increased `MasterDeviceCount` to 4 in order to run 4 cores at the same time, it effectively quadrupled the size of the generated C++ code. This increased complexity exposed a **race condition** in the C++ build process, where the compiler tried to use a **precompiled header (PCH)** file before it was fully generated.
## Validate with Test Program ### Expected Behavior of Final ring-buffer Invariants Test Program `ring-buffer.c` implements a fixed-size ring buffer (`RING_COUNT = 64`) and uses the producer and consumer **head** pointers as the sole LR/SC contention points; cache-line padding isolates `head` to reduce spurious reservation loss from unrelated stores. The program runs on **four harts**: hart 0 performs initialization, then all harts synchronize via a start barrier built from AMOs (`amoadd.w.aq` to count arrivals and `amoswap.w.rl` to release `start_go`), creating a single global happens-before point before the test begins. **MPMC contention behavior** is exercised by assigning harts 0–1 as producers and harts 2–3 as consumers; each producer enqueues `ITERS` elements and each consumer dequeues `ITERS` elements, with enqueue/dequeue using `cas_w()` (`lr.w` + `sc.w`) to advance `prod.head` and `cons.head`, forcing heavy contention on shared heads and validating forward progress under retries. **Memory-ordering constraints** are tested using Acquire/Release AMOs: producers acquire on reading `cons.tail`, publish data before releasing `prod.tail`, and consumers acquire on reading `prod.tail`, consume data, then release `cons.tail`, ensuring correct visibility and ordering between data writes and tail updates across harts. A deterministic backoff based on `mhartid` is used in LR/SC retry loops to avoid livelock under high contention. After all operations complete, a final barrier synchronizes all harts, and hart 0 checks that the total number of dequeued elements equals the total number enqueued, confirming correct LR/SC atomicity, contention handling, and memory ordering in an MPMC setting. 
You can see the final invariant of the ring buffer in [ring-buffer.c](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/blob/A-Extension/4-soc/csrc/ring-buffer.c)

### Encountering Problem

Since I had trouble running **multiple cores** on the `4-soc` project, I could not run the test to completion with the test program's expected result. You can, however, see my Chisel Scala test logic for `ring-buffer.asmbin` in [RingBufferProgramTest](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu/commit/e7dab8dd4c11c9b450428fb9fec8f4185e5b7f80), which contains the required test characteristics (**LR/SC behavior**, **Contention**, **Memory-Ordering Constraints**).

### `RingBufferProgramTest.scala` Logic

`RingBufferProgramTest` runs `ring-buffer.asmbin` long enough for all four harts to synchronize and execute the enqueue/dequeue loops. It then reads the shared memory block at `0x00100000` (specified in `link.ld`) to assert `start_count`/`start_go`, the `head`/`tail` counters, and the program-computed result flag that `ring-buffer.c` writes after a completion barrier, confirming that the LR/SC and AMO operations behaved correctly.

## Lesson Learned / Future Work

### What I Learned from Debugging Real Multi-Hart Execution Issues

This project exposed several multi-hart and multi-core pitfalls that are easy to miss when moving from single-core correctness to concurrent execution.

1. First, bare-metal multi-hart bring-up must ensure per-hart stack separation and single-hart `.bss` initialization; otherwise, shared stack frames and repeated zeroing of globals can corrupt control flow and shared state non-deterministically.
2. Second, LR/SC stress tests can fail for reasons unrelated to architectural bugs, such as simulator scheduling artifacts (e.g., interleaving-induced reservation loss on Spike), so concurrency failures must be triaged carefully to distinguish tool behavior from ISA-level semantics.
3.
3. Progress under contention requires practical measures such as backoff and false-sharing avoidance, since tight symmetric retry loops can livelock even when the atomic primitives are implemented correctly.

True multi-core execution on the 4-core MyCPU has not yet been demonstrated because Verilator encounters a precompiled-header (PCH) build race when the design is scaled to four cores, which prevents completing the full `RingBufferProgramTest` in 4-core simulation.

### Finding Out How to Run Multi-Core on MyCPU in the Future

In the future, I will first modify the Verilator/C++ build flow to enforce correct dependency ordering for the precompiled header (PCH), e.g., by generating the PCH in a dedicated single-threaded step and ensuring all compilation units depend on it before parallel compilation starts. I will also investigate build-system mitigations such as disabling PCH for the `VTestTopModule` build, reducing parallelism (e.g., `-j1` for the PCH stage only), or splitting the generated C++ into smaller translation units to avoid the race condition triggered by the larger multi-core design. After stabilizing the multi-core build, I will re-run `RingBufferProgramTest` on 4 cores and compare the observed LR/SC success/failure patterns, contention behavior across producers/consumers, and memory-ordering outcomes against the expected characteristics to validate correctness under true multi-core execution.

### Adapting `aq`/`rl` Reordering Semantics to a Multi-Core Shared-Memory System

My plan for scaling the current intra-core AQ/RL ordering to a shared-memory multi-core system is to propagate AQ/RL metadata into the interconnect, enforce global release-before-issue and acquire-after-completion ordering at the shared memory/arbiter, and broadcast reservation invalidations so that other cores observe the same ordering and LR/SC correctness.
## Conclusion

To conclude, I completed the background study on synchronization, cache coherence, and the RISC-V A extension, and I adapted a multi-hart ring-buffer test on Spike that exercises LR/SC behavior, contention, and acquire/release ordering constraints. After inspecting the `4-soc` microarchitecture, I implemented LR/SC and AMO semantics by executing atomic operations in the MEM stage and enforcing atomic serialization via the AXI4-Lite stall mechanism, with reservation tracking and multi-core invalidation integrated into the SoC.

Architectural correctness is supported by the fact that RISCOF compliance passes completely in the single-core configuration (129/129), indicating that the pipeline decode, MEM-stage atomic semantics, and writeback behavior match the RV32I + Zicsr + A specification under that setup. The remaining gap of multi-core validation is not an architectural failure but a toolchain limitation: enabling stable 4-core compliance runs is currently blocked by a Verilator C++ build race involving precompiled headers when the generated code size increases for multi-core. Once this Verilator PCH race condition is resolved, I will rerun RISCOF in true 4-core mode and execute the ring-buffer test on the 4-core MyCPU simulation to validate multi-hart behavior end-to-end under the actual SoC configuration.

## AI usage declaration

> Course policy: follow [Computer Architecture (2025 Fall) – Guidelines for Student Use of AI Tools](https://hackmd.io/@sysprog/arch2025-ai-guidelines).

I document how references and AI tools were used, what I contributed, and how reproducibility is ensured. This section accompanies all labs and HackMD notes in the repository.

### Purpose and Scope of AI Usage

AI tools were used **as an auxiliary assistant** throughout this project, primarily for comprehension, debugging, and validation. All design decisions, implementations, experiments, and written results were ultimately produced, verified, and integrated by me.
Specifically, AI assistance was used in the following aspects:

1. **Conceptual Understanding**
   - To help understand and clarify content from the **course lecture videos**.
   - To assist in reading and interpreting the **RISC-V ISA Manual, Volume I**, with a focus on the **A-Extension (Atomic Instructions)**, including LR/SC semantics, memory ordering, and architectural constraints.
2. **Software Debugging and Validation**
   - To help identify and reason about bugs encountered when running a **modified MPMC version of `ring-buffer.c` on Spike**, including issues related to LR/SC failures, contention, and scheduling behavior.
   - To validate my understanding of how the **A-Extension should be implemented**, by inspecting and reasoning about the `4-soc` project in the forked repo: [Computer-Architecture_HW3-ca2025-mycpu](https://github.com/jgw0915/Computer-Architecture_HW3-ca2025-mycpu).
   - To assist in debugging issues related to **wire-level signal connections** in the **Chisel/Scala implementation** of the A-Extension.
3. **Verification and Toolchain Issues**
   - To help diagnose problems encountered when running **multi-core RISCOF tests** for the A-Extension, including mismatches between expected atomic behavior and observed test failures.
4. **Program Refactoring**
   - To assist in refactoring `ring-buffer.c` so that its behavior aligns correctly with the test logic defined in `RingBufferProgramTest.scala`, particularly regarding concurrency assumptions and atomic operation usage.

### Human Contribution

All of the following were done independently by me:

- Designing and implementing the A-Extension in the hardware project.
- Writing, modifying, and testing the `ring-buffer.c` program.
- Running experiments on Spike and RISCOF.
- Interpreting logs, test results, and failures.
- Writing the structure for HackMD notes and documentation.
- Making final judgments on correctness, design trade-offs, and conclusions.
AI tools were **not used to generate final code or results without manual review**. Any AI-suggested explanations or fixes were cross-checked against the RISC-V specification, course materials, and experimental outcomes.

### Reproducibility

Reproducibility is ensured by:

- Providing all source code, configuration files, and scripts in the repository.
- Documenting tool versions, command-line invocations, and test environments.
- Ensuring that all conclusions are derived from observable behavior (Spike logs, RISCOF results, and simulation outputs) rather than AI-generated claims.

In summary, AI tools were used responsibly as a learning and debugging aid, while all substantive technical work and conclusions remain my own.