# CA2025: X-Problem at Quiz6 > 徐曉汎

## 11. What pipeline hazard is hardest to eliminate in hardware?

### Ans:

The pipeline hazard that is hardest to eliminate in hardware is the **Control Hazard**.

#### 1. The Core Reason: Unpredictability

Control Hazards occur when the processor encounters a `Branch` or `Jump` instruction.

* **The Problem**: Until the branch is resolved, that is, until the target address is calculated and the taken/not-taken decision is made, the CPU does not know which instruction to fetch next.
* **Difficulty**: To completely eliminate this, we would need 100% accuracy in branch prediction. However, even with modern, complex AI-assisted Branch Predictors, mispredictions still occur. When they do, the pipeline must be flushed, causing a significant performance penalty.

#### 2. Comparison with Other Hazards

* **Structural Hazards**: Two or more instructions in the pipeline need to access the same physical resource. In other words, a single physical resource is needed in multiple stages.
  * **Solution to eliminate in hardware**:
    * Instructions take turns to use the resource.
    * Add more hardware to the machine.
    * Design the instruction set architecture to avoid structural hazards.
* **Data Hazards**: An instruction waits for operands from a previous instruction.
  * **Solution to eliminate in hardware**:
    * Rearrange instructions to avoid data hazards.
    * Forwarding + Stall
      * This is specifically required to solve Load-Use Hazards. Since the data from a Load instruction is not available until the memory stage, forwarding alone is insufficient, and a 1-cycle stall is mandatory.
* **Control Hazards**: The decision of which instruction to fetch next depends on the result of the current instruction.
  * **Solution to try to mitigate them**: Since we cannot strictly "eliminate" Control Hazards, we try to mitigate them.
    * Branch prediction.
      * Try to guess whether the branch will be taken.
      * Only flush the pipeline if we guess wrong.
    * Branch prediction improves performance on average, but there is still a penalty for wrong guesses.

### How to validate your answer with experiments? (Use MyCPU for further discussions)

The goal is to compare the pipeline behavior and performance impact of Data Hazards versus Control Hazards on the pipelined RISC-V processor.

#### Prerequisites

* Target: `pipeline` version of MyCPU.
* Tools: Verilator, Surfer.

#### Experiment 1: Eliminating Data Hazards (Forwarding)

Hypothesis: Data hazards (RAW) can be completely resolved by hardware forwarding logic without stalling the pipeline.

* Test Code: Create an assembly program with immediate data dependencies:
  ```
  li  x1, 10          # x1 = 10
  add x2, x1, x1      # x2 = x1 + x1 (depends on x1 from EX stage)
  add x3, x2, x2      # x3 = x2 + x2 (depends on x2 from EX stage)
  ```
* Simulation & Observation: Run `make sim` in the directory and open `trace.vcd` in Surfer.
  * Observe: Check the instruction flow in the `EX` stage.
* Verification:
  * We should see the instructions executing in consecutive clock cycles without any bubbles inserted.
  * The ForwardingUnit successfully bypasses the data from the `EX/MEM` stages back to the `ID` stage.

#### Experiment 2: Verifying Elimination of Structural Hazards (Resource Conflict)

Structural hazards, specifically the conflict between fetching instructions (`IF` stage) and accessing data (`MEM` stage), are eliminated in the MyCPU design by using separate instruction and data memory interfaces.
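Before checking the waveforms, the following plain-Scala sketch models the idea this experiment tests: with a single shared memory port, a cycle in which the `IF` stage fetches while the `MEM` stage accesses data would have to stall, whereas separate instruction and data interfaces never conflict. All names here are illustrative and do not refer to actual MyCPU signals or modules.

```scala
// Toy single-cycle model of the structural-hazard scenario in Experiment 2.
// Illustrative only: the names below are not MyCPU signal or module names.
object StructuralHazardModel {
  // One pipeline snapshot: does IF need the instruction port, does MEM need the data port?
  case class CycleRequest(ifFetch: Boolean, memAccess: Boolean)

  // Unified memory (one shared port): simultaneous requests force a stall.
  def unifiedPortStall(req: CycleRequest): Boolean =
    req.ifFetch && req.memAccess

  // Split instruction/data interfaces: each stage owns its port, so the same
  // pair of requests can always be serviced in the same cycle.
  def splitPortStall(req: CycleRequest): Boolean = false

  def main(args: Array[String]): Unit = {
    // Cycle N of the test program below: `lw` in MEM while `or` is fetched in IF.
    val cycleN = CycleRequest(ifFetch = true, memAccess = true)
    println(s"unified port stalls: ${unifiedPortStall(cycleN)}") // true  -> structural hazard
    println(s"split ports stall:   ${splitPortStall(cycleN)}")   // false -> hazard eliminated
  }
}
```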
* Test Code: Create a program that forces the CPU to perform a Load/Store operation (`MEM` stage) while simultaneously fetching a new instruction (`IF` stage):
  ```
  .global _start
  _start:
      li  x10, 0x1000     # Base address for data
      lw  x1, 0(x10)      # Cycle N: MEM stage (Data Access)
      add x2, x1, x1      # Cycle N: EX stage
      sub x3, x2, x2      # Cycle N: ID stage
      or  x4, x3, x3      # Cycle N: IF stage (Instruction Fetch)
      sw  x4, 4(x10)      # Store to verify write port
  ```
* Simulation & Observation: Run `make sim` in the directory and open `trace.vcd` in Surfer.
  * Observe:
    * Locate the `Instruction Bus` signals.
    * Locate the `Data Memory Bus` signals.
* Verification:
  * Find the clock cycle where `lw` is active (`Data Memory Read Enable` is HIGH).
  * In the same clock cycle, check if the Instruction Fetch is also active (`Instruction Valid` is HIGH, and `PC` is updating).
  * If both interfaces are active simultaneously and the pipeline does not stall (`PC` keeps incrementing), the structural hazard is successfully eliminated.

#### Experiment 3: The Persistence of Control Hazards (Branch Penalty)

Control hazards inherently cause pipeline flushes and wasted cycles, which cannot be eliminated by simple forwarding logic.

* Test Code: Create a loop that forces the CPU to branch repeatedly:
  ```
  li x1, 10
  loop:
      addi x1, x1, -1
      bne  x1, x0, loop   # Branch Taken: jumps back to loop
      nop                 # Branch Not Taken path
      nop
  ```
* Simulation & Observation: Run `make sim` in the directory and open `trace.vcd` in Surfer.
  * Observe:
    * Find the `io_jump_flag` signal.
    * Look at the instructions in the `IF` and `ID` stages immediately following the `bne`.
* Verification:
  * When `bne` is in the `EX` stage and decides to jump, we will see the instructions currently in `IF` and `ID` being flushed.
  * This results in a visible penalty of 2 clock cycles where no useful work is completed.

##### Reference

1. [Lectures 21-23: RISC-V 5-Stage Pipeline](https://docs.google.com/presentation/d/1v-Squx8lK-oOrflFOwBZh-ue94seVHZudqDOgJmFf5Q/edit?slide=id.g30cd39f6c3e_86_146#slide=id.g30cd39f6c3e_86_146): p40, p43 - p54, p61 - p65, p71 - p78

---

## 16. Why are L1 caches virtually indexed but physically tagged?

### Ans:

L1 caches are designed as Virtually Indexed, Physically Tagged (VIPT) to strike the optimal balance between access latency and correctness.

#### 1. Parallel Access

The primary goal of the L1 cache is low latency. The CPU pipeline operates using Virtual Addresses (VA), but memory is accessed using Physical Addresses (PA).

* **The Problem with PIPT (Physically Indexed, Physically Tagged)**: In a PIPT design, the CPU must first wait for the MMU/TLB to translate the Virtual Address into a Physical Address. Only after this translation is complete can the cache lookup begin. This serialization (`TLB` -> `Cache`) adds significant latency to the critical path.
* **The VIPT Solution**: By using the Virtual Address for the Index, the CPU can perform the TLB Lookup and the Cache Set Selection in parallel.
  * The cache hardware uses the Index bits from the VA to fetch the data array and tag array. At the same time, the TLB translates the VA to get the Physical Page Number.
  * By the time the data is fetched from the cache RAM, the TLB has the physical tag ready for comparison.

This parallelism removes the TLB translation latency from the cache access time, making VIPT as fast as a virtually addressed cache.
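To make this address breakdown concrete, here is a minimal plain-Scala sketch of a VIPT lookup under an assumed geometry (4 KiB pages, 64-byte lines, 64 sets; these numbers are illustrative and are not taken from MyCPU). The set index is computed directly from the VA, while the tag comparison uses the Physical Page Number that the TLB delivers in parallel.

```scala
// Toy address-breakdown model for a VIPT lookup (assumed, non-MyCPU geometry).
object ViptModel {
  val offsetBits = 6                 // log2(64-byte line)
  val indexBits  = 6                 // log2(64 sets) -> index fits inside the 12-bit page offset
  val pageBits   = 12                // log2(4 KiB page)

  // The index comes straight from the virtual address: no translation needed.
  def index(addr: Long): Long = (addr >> offsetBits) & ((1L << indexBits) - 1)

  // The tag is taken from the physical address, i.e. after the TLB supplies the PPN.
  def physicalTag(pa: Long): Long = pa >> (offsetBits + indexBits)

  // A hit requires the stored physical tag to match the translated address's tag.
  def isHit(storedTag: Long, pa: Long): Boolean = storedTag == physicalTag(pa)

  def main(args: Array[String]): Unit = {
    val va = 0x4A40L                 // virtual address used by the program
    val pa = 0x12A40L                // what the TLB says it maps to (same low 12 bits)
    println(f"set index from VA: ${index(va)}%d")
    println(f"set index from PA: ${index(pa)}%d (identical, since the index sits inside the page offset)")
    println(f"physical tag:      0x${physicalTag(pa)}%x")
  }
}
```

Because `offsetBits + indexBits` equals the page-offset width in this sketch, the index computed from the VA is guaranteed to equal the index computed from the PA, which is exactly the constraint discussed in section 3 below.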
#### 2. The "Physical Tag" Reason

If speed were the only goal, one might suggest VIVT (Virtually Indexed, Virtually Tagged). However, VIVT introduces complex software problems that Physically Tagged caches avoid.

* **Solves the Homonym Problem**: Different processes often use the same virtual addresses (e.g., `0x400000`) to point to different physical memory. If tags were virtual, the CPU might return data belonging to Process A while executing Process B.
  * VIPT Solution: Since tags are Physical, the cache can distinguish that Process A's `0x400000` maps to Physical Address `0x1000` while Process B's maps to `0x2000`. This avoids the need to flush the L1 cache on every context switch.
* **Solves the Synonym Problem**: Multiple virtual addresses can map to the same physical address. In a VIVT cache, this data would exist in multiple lines with different virtual tags, making it nearly impossible to keep them consistent.
  * VIPT Solution: Because the tag is physical, the hardware has a unique identifier for the underlying memory location. Combined with a proper index constraint (discussed in section 3), this ensures that a physical address always maps to a predictable location, either preventing inconsistent copies or allowing the hardware to detect them.
* **Supports Hardware Cache Coherence**: In multi-core systems, coherence traffic on the system bus uses Physical Addresses.
  * VIPT Solution: Since the cache stores Physical Tags, the hardware can easily check whether it holds a specific physical address requested by another core. If the tags were virtual, every snoop request would require a slow reverse-translation (PA to VA).

#### 3. Translation Constraint

VIPT works perfectly because of how paging works.

* In a Virtual Address, the lowest bits are the Page Offset. These bits are identical in both the Virtual and the Physical address.
* If the L1 Cache structure is sized such that the Index bits fall entirely within the Page Offset, then:

$$\text{Virtual Index} = \text{Physical Index}$$

This implies that the hardware gets the speed of Virtual Indexing but logically behaves exactly like a Physically Indexed cache (PIPT). This effectively avoids Aliasing issues entirely without requiring extra hardware logic.

### How to validate your answer with experiments?

To validate why L1 caches are Virtually Indexed but Physically Tagged, I focus on two metrics: Latency and Correctness.

#### Experiment 1: Validating Latency Benefits (Parallel vs. Serial Access)

Prove that VIPT reduces the memory-access critical path compared to PIPT.

* Hardware Setup: In the `pipeline` version of MyCPU, the `EXE` stage generates the address, and the `MEM` stage handles the Cache/TLB lookup.
  * Case A (Simulated PIPT): Modify the logic so the Cache Set Selection waits for the TLB output signal. This introduces a structural stall or a longer combinational path.
  * Case B (VIPT Implementation): Ensure the `Index` bits are sent to the Cache SRAM immediately, while the `Tag` comparison logic waits for the TLB's Physical Page Number.
* Experiment Steps:
  * Run a benchmark with back-to-back Load instructions:
    ```
    lw  x1, 0(x10)
    add x2, x1, x1
    ```
  * Use Surfer to monitor the timing between the `EXE` stage address generation and the `WB` stage data availability.
* Verification: Observe that in Case B (VIPT), the Cache RAM access begins at the very start of the `MEM` cycle. In Case A (PIPT), the RAM access is delayed by the TLB latency. VIPT allows the Load instruction to complete in a single `MEM` cycle without stalling the pipeline, whereas PIPT would likely require a pipeline bubble or a slower clock frequency.
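As a rough sanity check on the expected outcome, the sketch below models the two cases with assumed one-cycle TLB and one-cycle SRAM latencies (the real values would have to be read off the Surfer waveforms): serializing translation and indexing adds the latencies, while the VIPT arrangement overlaps them.

```scala
// Back-of-the-envelope timing model for Experiment 1.
// The latencies are illustrative assumptions, not measured MyCPU values.
object LookupLatencyModel {
  val tlbLatency   = 1  // cycles to produce the PPN (assumed)
  val cacheLatency = 1  // cycles to read the tag/data SRAM (assumed)

  // Case A (PIPT): the cache index needs the physical address, so the SRAM read
  // can only start after the TLB finishes -> the latencies add up.
  def piptCycles: Int = tlbLatency + cacheLatency

  // Case B (VIPT): the index comes from the VA, so the TLB and the SRAM run in
  // parallel; the tag compare happens when the slower of the two is done.
  def viptCycles: Int = math.max(tlbLatency, cacheLatency)

  def main(args: Array[String]): Unit = {
    println(s"PIPT load-to-data latency: $piptCycles cycles") // 2
    println(s"VIPT load-to-data latency: $viptCycles cycles") // 1 -> fits in one MEM stage
  }
}
```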
#### Experiment 2: Validating Correctness

Prove that Physical Tags prevent data leakage between different processes using the same Virtual Addresses.

* Software Setup:
  * Create two small programs (Task A and Task B). Both tasks use the same Virtual Address: `0x4000`.
  * Using the `PageTable` logic in MyCPU, map Task A's `0x4000` to Physical Address `0x1000` and Task B's `0x4000` to `0x2000`.
* Experiment Steps:
  * Execute Task A: Write `0xDEADBEEF` to `0x4000`.
  * Perform a context switch but do not flush the L1 Cache.
  * Execute Task B: Read from `0x4000`.
* Verification: Check `io.regs_write_data` in the `WB` stage. Even though the Index from `0x4000` hits the same cache set, the Physical Tag comparison will fail because Task B's PPN (`0x2000`) does not match the stored Tag (`0x1000`). This triggers a Cache Miss, forcing Task B to fetch its own data correctly. If the tags were virtual, Task B would have incorrectly read Task A's data.

#### Experiment 3: Validating the Aliasing Constraint

Demonstrate that VIPT behaves like PIPT only if the Index bits are within the Page Offset.

* Configuration Change:
  * Assume PageSize = `4KB` (Offset = `bits 11:0`).
  * Scenario 1 (Safe): Set the L1 Cache Way size to 4KB. The Index bits are `[11:offset_bits]`.
  * Scenario 2 (Aliasing Risk): Increase the L1 Cache Way size to 8KB. The Index now requires bit 12, which is part of the VPN, not the Offset.
* Experiment Steps:
  * In Scenario 2, map two different Virtual Addresses `0x0000` and `0x1000` to the same Physical Address `0x5000`.
  * Write to `0x0000`, then read from `0x1000`.
* Verification: Check whether the cache reports a Miss for the second access even though the data is physically present in another line. This validates the Translation Constraint mentioned in the report. It proves that for VIPT to be effectively aliasing-free, the hardware must ensure the Virtual Index is identical to the Physical Index by keeping it within the Page Offset.

##### Reference

1. [L16. Virtual Memory](https://computationstructures.org/lectures/vm/vm.html): p1 - p4, p6 - p8, p10
2. [Lecture 34 - Virtual Memory II](https://docs.google.com/presentation/d/1KB2HKnvqsKq65udERzMw9imTMtV8KREMWmKc7w8XOfk/edit?slide=id.g313922e28e2_0_286#slide=id.g313922e28e2_0_286): p21
3. [Virtually Indexed Physically Tagged (VIPT) Cache](https://www.geeksforgeeks.org/computer-organization-architecture/virtually-indexed-physically-tagged-vipt-cache/) (non-curricular teaching material)

---

## 17. Why is L1 usually write-back but write-allocate?

### Ans:

L1 caches are designed with the Write-back and Write-allocate policies to maximize performance by minimizing the penalty of slow memory writes and leveraging spatial locality.

#### 1. Why Write-back?

In a Write-back policy, data is written only to the L1 cache. The corresponding cache line is marked as "Dirty." The data is written to the next level (L2 or Main Memory) only when the line is evicted.

* L1 is the closest level to the CPU and handles a massive volume of store instructions. A Write-through policy would saturate the system bus, causing the pipeline to stall frequently due to bus congestion.
* Store instructions are considered complete as soon as the L1 cache is updated. The CPU does not have to wait for the slower L2 or RAM to acknowledge the write.
* Multiple Writes, One Update: If a loop repeatedly updates the same variable, Write-back performs multiple updates in the fast L1 and only performs a single write-back to L2 when the loop finishes or the line is replaced.
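As a minimal illustration of this policy, the plain-Scala sketch below models a single cache line (the structure and names are illustrative, not MyCPU's): every store updates only the L1 copy and sets the dirty bit, and the single downstream write happens only at eviction time.

```scala
// Minimal write-back model: tracks only whether stores cause downstream
// (L2/DRAM) traffic. Tags, sets and replacement are deliberately left out.
object WriteBackModel {
  final case class Line(var data: Long, var dirty: Boolean)

  // A store under write-back: update the L1 copy and set the dirty bit.
  // Returns the number of downstream writes caused right now (always 0).
  def store(line: Line, value: Long): Int = {
    line.data = value
    line.dirty = true
    0                                // nothing leaves L1 yet
  }

  // Eviction is the only point where the dirty data is written back.
  def evict(line: Line): Int =
    if (line.dirty) { line.dirty = false; 1 } else 0

  def main(args: Array[String]): Unit = {
    val line = Line(data = 0, dirty = false)
    val trafficDuringLoop = (1 to 100).map(i => store(line, i.toLong)).sum
    val trafficOnEvict    = evict(line)
    println(s"bus writes during 100 stores: $trafficDuringLoop") // 0
    println(s"bus writes on eviction:       $trafficOnEvict")    // 1
  }
}
```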
#### 2. Why Write-allocate?

In a Write-allocate policy, if a write miss occurs, the cache line is first fetched from the lower memory level into the cache, and then the write is performed on that line in L1.

* Spatial Locality: Programs often write to contiguous memory locations. With Write-allocate, the first write fetches the whole line; subsequent writes to the same line result in Cache Hits, significantly speeding up the process.
* Write-back is most effective when the latest version of data stays in the cache. If we used Write-no-allocate, every write miss would bypass the L1 and go directly to L2. This contradicts the goal of Write-back, which is to keep frequently modified data in the fastest tier.

### How to validate your answer with experiments?

#### Experiment 1: Validating Write-back

* Logic Check: In `MemoryAccess.scala`, look at the `memory_bundle`.
  * In a Write-back cache, a store instruction only asserts `write_enable` to the L1 Cache controller.
  * Check if the L1 Cache controller has a `dirty` array.
* Experiment Steps:
  * Execute a store instruction: `sw x5, 0(x10)`.
  * Use Surfer to observe the `MEM` and `WB` stages.
* Verification: The `WB` stage should finish the instruction in the very next cycle after `MEM`. Observe the `L1_Cache` internal state; the "Dirty bit" for that set should flip to `1`, but no traffic should appear on the `L2_Bus` or `DRAM` interface yet.

#### Experiment 2: Validating Write-allocate

Observe the Fetch-on-Miss behavior during a store operation.

* Software Setup:
  * Access a memory address that is currently not in the L1 Cache.
* Experiment Steps:
  * Execute `sw x5, 0(x20)` where `0(x20)` is not cached.
* Observation:
  * The L1 Cache controller will detect a Miss.
  * Instead of writing `x5` directly to RAM, observe a Read Request issued by the L1 Cache to the L2/Memory.
  * Once the data returns from memory, the L1 Cache updates the line with `x5` and marks it `Dirty`.
* Verification: The `MEM` stage should stay stalled while the cache line is being fetched. Once the line is allocated, the instruction completes. (A small model of this fetch-on-miss behavior is sketched after the references below.)

#### Experiment 3: Performance Comparison (Write-back vs. Write-through)

Compare the total cycle count for high-frequency writes.

```
li t0, 100
loop:
    sw   x1, 0(x10)
    addi t0, t0, -1
    bnez t0, loop
```

* Observation:
  * In Write-back: Only the first write might be a miss. All subsequent 99 writes will be L1 hits. The total cycle count will be low.
  * In Write-through: Every single `sw` would require a bus transaction. We would see the pipeline stalling for multiple cycles on every single iteration of the loop.

The total execution time in MyCPU will be significantly lower with Write-back for this access pattern.

##### Reference

1. [L14. The Memory Hierarchy](https://computationstructures.org/lectures/caches/caches.html): p37 - p40
2. [Lecture 26: Caches III](https://docs.google.com/presentation/d/1dN876ETDdJATOV0JYQJzvHadBcXAgA03NqG311es7QU/edit?slide=id.p1#slide=id.p1): p28 - p29
3. [Lecture 34 - Virtual Memory II](https://docs.google.com/presentation/d/1KB2HKnvqsKq65udERzMw9imTMtV8KREMWmKc7w8XOfk/edit?slide=id.g313922e28e2_0_286#slide=id.g313922e28e2_0_286): p19
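As referenced in Experiment 2 above, here is a small plain-Scala model of the write-allocate (fetch-on-miss) behavior; the line size and addresses are illustrative assumptions, not MyCPU parameters. Only the first store to a line triggers a read from the next level; later stores to the same line hit in L1.

```scala
// Minimal write-allocate model: a write miss first fetches the line (one read
// from L2), after which writes to the same line hit in L1.
object WriteAllocateModel {
  val lineBytes = 64                              // assumed line size
  private var cachedLines = Set.empty[Long]       // line addresses present in L1
  var l2Reads  = 0                                // fills caused by write-allocate
  var l2Writes = 0                                // stays 0 until a dirty eviction

  def store(addr: Long): String = {
    val line = addr / lineBytes
    if (cachedLines.contains(line)) "hit"
    else {
      l2Reads += 1                                // fetch-on-miss: read the whole line first
      cachedLines += line
      "miss -> allocate"
    }
  }

  def main(args: Array[String]): Unit = {
    // Four stores into the same 64-byte line: only the first one misses.
    val results = Seq(0x2000L, 0x2004L, 0x2008L, 0x200CL).map(store)
    println(results.mkString(", "))               // miss -> allocate, hit, hit, hit
    println(s"L2 reads (line fills): $l2Reads, L2 writes so far: $l2Writes")
  }
}
```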