# CA2025 Quiz6
> 陳昶安

## Why does false sharing hurt multicore performance?

### Reference
- Lecture: Week 15 (Multithreading Issues + Cache Coherency)
- Video: Lecture 23.7 - Multithreading Issues, Cache Coherency: Coherency Misses
  https://www.youtube.com/watch?v=xOZH48uGYuc&list=PLDoI-XvXO0aqgj_8Og51XoE7iyAu5yEWZ&index=8
- ![image](https://hackmd.io/_uploads/SJSJ8xWE-x.png)

### Answer

1. **Why false sharing hurts multicore performance**
   According to the slide “False Sharing” from CS61C Lecture 23.7, false sharing occurs when a cache block ping-pongs between two caches even though the processors are accessing disjoint variables. Because cache coherence is maintained at the granularity of a cache block, a write by one processor invalidates the entire block in the other processor's cache. The block therefore moves back and forth between the caches, generating excessive coherence traffic and stalls that significantly degrade multicore performance. The penalty comes from unnecessary cache coherence actions, not from true data sharing.

2. **How can it be prevented?**
   False sharing can be prevented by placing data in different cache blocks or by reducing the block size, so that independent variables do not reside in the same cache block.

### Experimental Idea: Demonstrating the Slowdown Caused by False Sharing

AI tools: Perplexity was used to search for existing false-sharing implementation examples on the Internet.

Two threads are executed on different CPU cores. Each thread repeatedly increments its own counter for a fixed number of iterations. The experiment compares two memory layouts: one where the counters share the same cache line, and one where they are separated into different cache lines using padding. Since the workload and instruction sequence are identical, any performance difference can be attributed solely to false sharing.

### Experimental Development

#### Data layout causing false sharing

```c
typedef struct {
    volatile uint64_t x;
    volatile uint64_t y;
} shared_bad_t;
```

In this structure, `x` and `y` are likely to reside in the same cache line. When two threads on different cores update these variables concurrently, the cache coherence protocol causes frequent cache-line invalidations, even though the variables are logically independent.

#### Eliminating false sharing using padding

```c
typedef struct {
    volatile uint64_t x;
    char pad[64 - sizeof(uint64_t)];
    volatile uint64_t y;
} shared_good_t;
```

Padding is inserted to force the two counters into different cache lines. This change does not modify the algorithm or workload, only the memory layout, which allows us to isolate the impact of false sharing.
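As a quick sanity check of the two layouts (my own addition, not part of the referenced benchmark; it assumes a 64-byte cache line), `offsetof` can be used to confirm that the padded version places `y` a full cache line away from `x`:

```c
/* Layout sanity check (illustrative only): print the byte offsets of x and y
 * in both layouts. The typedefs repeat the structures shown above. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    volatile uint64_t x;
    volatile uint64_t y;
} shared_bad_t;

typedef struct {
    volatile uint64_t x;
    char pad[64 - sizeof(uint64_t)];
    volatile uint64_t y;
} shared_good_t;

int main(void) {
    printf("bad : x@%zu y@%zu (distance %zu bytes)\n",
           offsetof(shared_bad_t, x), offsetof(shared_bad_t, y),
           offsetof(shared_bad_t, y) - offsetof(shared_bad_t, x));
    printf("good: x@%zu y@%zu (distance %zu bytes)\n",
           offsetof(shared_good_t, x), offsetof(shared_good_t, y),
           offsetof(shared_good_t, y) - offsetof(shared_good_t, x));
    return 0;
}
```

With the padded layout the distance is 64 bytes, so the two counters cannot fall into the same 64-byte cache line.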
#### CPU affinity and synchronization design

Threads are pinned to specific CPU cores to ensure true parallel execution and to avoid thread migration by the OS scheduler. No locks or synchronization primitives are used, so any performance difference is caused by cache coherence effects rather than synchronization overhead.

### Experimental Procedure

```bash
gcc -O2 -pthread false_sharing_bench.c -o fsbench
./fsbench 200000000 0 1
./fsbench 200000000 2 3
```

After compiling the program with GCC, the experiment was run with two different CPU-core combinations, performing 200,000,000 increment operations per thread.

### Code Diff

**Difference 1: Data layout**

```c
typedef struct {
    volatile uint64_t x;
    volatile uint64_t y;
} shared_bad_t;   // x and y are likely to land on the same cache line

typedef struct {
    volatile uint64_t x;
    char pad[64 - sizeof(uint64_t)];
    volatile uint64_t y;
} shared_good_t;  // Use padding to force y onto the next cache line
```

The `GOOD` version only adds cache-line padding to separate the counters into different cache lines; the workload and instruction sequence remain unchanged.

**Difference 2: Choosing BAD or GOOD at run time**

```c
if (!use_good) {
    for (uint64_t i = 0; i < iters; i++) bad.x++;
} else {
    for (uint64_t i = 0; i < iters; i++) good.x++;
}
// Same loop, same iteration count.
// Only the target variable differs (bad vs. good layout).
```

The loop itself is unchanged; only the memory location being written differs (same cache line vs. separate cache lines).

### Experimental Results

```bash
# gcc -O2 -pthread false_sharing_bench.c -o fsbench
./fsbench 200000000 0 1
./fsbench 200000000 2 3

iters=200000000 cpu0=0 cpu1=1
BAD (same cacheline) : 401320700 ns
GOOD (separate lines) : 290422700 ns
speedup = 1.38x

iters=200000000 cpu0=2 cpu1=3
BAD (same cacheline) : 408789900 ns
GOOD (separate lines) : 289770700 ns
speedup = 1.41x
```

The measurements show that when the two counters share a cache line, execution time increases by roughly 38%–41%. Simply changing the memory layout to eliminate false sharing yields a significant performance improvement.

### C Code

:::spoiler C Code
```c=
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

#ifndef CACHELINE
#define CACHELINE 64
#endif

static inline uint64_t nsec_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

typedef struct {
    volatile uint64_t x;
    volatile uint64_t y;
} shared_bad_t;

typedef struct {
    volatile uint64_t x;
    char pad[CACHELINE - sizeof(uint64_t)];
    volatile uint64_t y;
} shared_good_t;

static shared_bad_t bad;
static shared_good_t good;

typedef struct {
    int cpu;
    uint64_t iters;
    int use_good;   // 0: bad, 1: good
} arg_t;

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker0(void *p) {
    arg_t *a = (arg_t *)p;
    pin_to_cpu(a->cpu);
    if (!a->use_good) {
        for (uint64_t i = 0; i < a->iters; i++) bad.x++;
    } else {
        for (uint64_t i = 0; i < a->iters; i++) good.x++;
    }
    return NULL;
}

static void *worker1(void *p) {
    arg_t *a = (arg_t *)p;
    pin_to_cpu(a->cpu);
    if (!a->use_good) {
        for (uint64_t i = 0; i < a->iters; i++) bad.y++;
    } else {
        for (uint64_t i = 0; i < a->iters; i++) good.y++;
    }
    return NULL;
}

static uint64_t run_case(int use_good, uint64_t iters, int cpu0, int cpu1) {
    pthread_t t0, t1;
    arg_t a0 = {.cpu = cpu0, .iters = iters, .use_good = use_good};
    arg_t a1 = {.cpu = cpu1, .iters = iters, .use_good = use_good};

    bad.x = bad.y = 0;
    good.x = good.y = 0;

    uint64_t t_start = nsec_now();
    pthread_create(&t0, NULL, worker0, &a0);
    pthread_create(&t1, NULL, worker1, &a1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    uint64_t t_end = nsec_now();

    return t_end - t_start;
}

int main(int argc, char **argv) {
    uint64_t iters = (argc >= 2) ? (uint64_t)atoll(argv[1]) : 200000000ull;
    int cpu0 = (argc >= 3) ? atoi(argv[2]) : 0;
    int cpu1 = (argc >= 4) ? atoi(argv[3]) : 1;

    // warm-up runs
    run_case(0, iters / 10, cpu0, cpu1);
    run_case(1, iters / 10, cpu0, cpu1);

    uint64_t t_bad  = run_case(0, iters, cpu0, cpu1);
    uint64_t t_good = run_case(1, iters, cpu0, cpu1);

    printf("iters=%llu cpu0=%d cpu1=%d\n", (unsigned long long)iters, cpu0, cpu1);
    printf("BAD (same cacheline) : %llu ns\n", (unsigned long long)t_bad);
    printf("GOOD (separate lines) : %llu ns\n", (unsigned long long)t_good);
    printf("speedup = %.2fx\n", (double)t_bad / (double)t_good);
    return 0;
}
```
:::

### Reference
https://vocus.cc/article/690472e1fd89780001c68510

## Why is cache line size a trade-off?

### Reference
- Lecture: Week 12 (Caches III)
  https://docs.google.com/presentation/d/1dN876ETDdJATOV0JYQJzvHadBcXAgA03NqG311es7QU/edit?slide=id.p1#slide=id.p1
- Slides:
  * Benefits of Larger Block Size — p.29
    ![image](https://hackmd.io/_uploads/BJEhY0xV-e.png)
  * Drawbacks of Larger Block Size — p.29
    ![image](https://hackmd.io/_uploads/H16hYAeNZx.png)
  * Block Size Tradeoff Conclusions — p.31
    ![image](https://hackmd.io/_uploads/Sywbq0gE-e.png)

### Answer

1. **Benefit of a larger cache line size**
   According to Caches III, p.29, a larger cache block size better exploits spatial locality: if a program accesses one word, it is likely to access nearby words soon. By fetching a larger contiguous block from memory, the cache increases the chance that future accesses will hit.

2. **Drawback of a larger cache line size**
   The same slide (Drawbacks of Larger Block Size, p.29) also states that a larger block size leads to a larger miss penalty: on a cache miss, more data must be fetched from the next memory level, increasing the time required to service the miss.

3. **Effect on miss rate when the block size is too large**
   As explained in Caches III, p.29, if the block size is too large relative to the cache size, the cache holds fewer blocks. With fewer blocks available, useful data may be evicted before being reused, which can increase the overall miss rate.

### Experimental idea

From Caches III (p.29), increasing the block (cache line) size has two opposite effects: it can exploit spatial locality (benefit), but it also increases the miss penalty and may increase the miss rate if the block is too large relative to the cache size. Caches III (p.31) further summarizes that a larger block size “exploits spatial locality”, while “fewer blocks compromises temporal locality” and can lead to “increased miss penalty & miss rate.”
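Block size itself cannot be varied from software, but the spatial-locality side of the trade-off can still be made visible. The sketch below is my own illustration (it assumes 64-byte cache lines, and hardware prefetching may narrow the gap): the same buffer is swept once with unit stride, where each fetched block serves many accesses, and once with a 64-byte stride, where nearly every access touches a new block, so the average time per access should rise.

```c
/* Illustrative sketch: per-access cost of a unit-stride sweep (reuses each
 * fetched block) vs. a 64-byte-stride sweep (one access per assumed block). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 64 MiB buffer, larger than a typical LLC */

static uint64_t nsec_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static uint64_t sweep(volatile uint8_t *buf, size_t stride) {
    uint64_t sum = 0;
    for (size_t i = 0; i < N; i += stride)
        sum += buf[i];
    return sum;
}

int main(void) {
    uint8_t *buf = malloc(N);
    if (!buf) return 1;
    for (size_t i = 0; i < N; i++) buf[i] = (uint8_t)i;   /* warm up the pages */

    uint64_t t0 = nsec_now();
    uint64_t s1 = sweep(buf, 1);    /* unit stride: spatial locality helps */
    uint64_t t1 = nsec_now();
    uint64_t s2 = sweep(buf, 64);   /* one access per assumed cache line */
    uint64_t t2 = nsec_now();

    printf("stride 1 : %.3f ns/access (sum=%llu)\n",
           (double)(t1 - t0) / (double)N, (unsigned long long)s1);
    printf("stride 64: %.3f ns/access (sum=%llu)\n",
           (double)(t2 - t1) / (double)(N / 64), (unsigned long long)s2);
    free(buf);
    return 0;
}
```

A larger gap between the two per-access times indicates that more of the unit-stride traffic was served out of already-fetched blocks, which is exactly the benefit that larger blocks are meant to capture.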
## Why is page fault handling slow?

### Reference
- Lecture: Week 13 (Virtual Memory II)
  https://docs.google.com/presentation/d/1KB2HKnvqsKq65udERzMw9imTMtV8KREMWmKc7w8XOfk/edit?slide=id.g313922e28e2_0_286#slide=id.g313922e28e2_0_286
- Slides:
  * Best case: TLB hit — p.10
    ![image](https://hackmd.io/_uploads/rybkqkZE-l.png)
  * TLB miss, page table walk — p.11–12
    ![image](https://hackmd.io/_uploads/H1ugqk-NWx.png)
    ![image](https://hackmd.io/_uploads/HJsZcy-V-e.png)
  * Worst case: TLB miss, page fault, go to disk — p.13
    ![image](https://hackmd.io/_uploads/r1wQ9yW4Zg.png)
  * Handling a page fault — p.14
    ![image](https://hackmd.io/_uploads/HytHc1WEZx.png)

### Answer

1. **A page fault is the worst-case address translation path**
   From “Worst case: TLB miss, page fault, go to disk” (p.13), a page fault occurs when both the TLB and the page table fail to provide a valid translation, making it the slowest possible case.

2. **A page fault requires disk access**
   According to the same slide (p.13), handling a page fault requires accessing the disk to load the missing page, which is orders of magnitude slower than a memory access.

3. **A page fault involves multiple costly operations**
   According to “Handling a page fault” (p.14), the system must load the page from disk, update the page table, and update the TLB, which further increases the handling latency.

### Experimental idea

Design a program that accesses more memory than physical memory to intentionally trigger page faults, and compare its execution time with a version that fits in physical memory, as motivated by the contrast between “Best case: TLB hit” (p.10) and “Worst case: page fault” (p.13).
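A rough Linux-specific sketch of a related measurement (my own illustration; the mapping size is a placeholder and should be adjusted to the machine). It times a first pass over a large anonymous mapping, where every page triggers a fault that the OS must service, against a second pass over the now-resident pages, reading the fault counters with `getrusage`. These are mostly minor faults; major faults that actually go to disk, as in the lecture's worst case, would be slower still.

```c
/* Sketch: compare first-touch cost (page faults) with a second pass over
 * already-resident pages, and report the process fault counters. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <time.h>

#define SIZE (1ull << 30)   /* 1 GiB mapping; placeholder size */
#define PAGE 4096

static uint64_t nsec_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static long faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt + ru.ru_majflt;
}

static uint64_t touch_all(volatile char *p) {
    uint64_t t0 = nsec_now();
    for (uint64_t off = 0; off < SIZE; off += PAGE)
        p[off] = 1;          /* one write per page */
    return nsec_now() - t0;
}

int main(void) {
    char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long f0 = faults();
    uint64_t t_first = touch_all(p);    /* every page faults */
    long f1 = faults();
    uint64_t t_second = touch_all(p);   /* pages already mapped */
    long f2 = faults();

    printf("first touch : %llu ns, %ld faults\n",
           (unsigned long long)t_first, f1 - f0);
    printf("second touch: %llu ns, %ld faults\n",
           (unsigned long long)t_second, f2 - f1);
    munmap(p, SIZE);
    return 0;
}
```

The per-page overhead of the faulting pass already shows how costly the fault-handling path is even without disk involvement, which supports the argument that disk-backed page faults dominate the worst case.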
## Why are interrupts considered control hazards?

### Reference
- Lecture: Week 16 (I/O: Devices, Polling, Interrupts)
  https://drive.google.com/drive/folders/1NRoyYSbAyarMhmPD5LiY6YEqYpLoZx_s
- Slides:
  * Traps/Interrupts/Exceptions: altering the normal flow of control — p.20
    ![image](https://hackmd.io/_uploads/SyFti1WE-e.png)
  * Trap Handling in 5-Stage Pipeline — p.23
    ![image](https://hackmd.io/_uploads/SJ1ookWNWg.png)
  * Trap Pipeline Diagram — p.24–25
    ![image](https://hackmd.io/_uploads/BJLhi1-4-x.png)
    ![image](https://hackmd.io/_uploads/BktTjJWVWx.png)

### Answer

1. **Interrupts alter the normal flow of control**
   According to “Traps/Interrupts/Exceptions: altering the normal flow of control” (p.20), interrupts alter the normal flow of control by transferring execution to an interrupt handler.

2. **Interrupts cause a pipeline flush and redirection**
   According to “Trap Handling in 5-Stage Pipeline” (p.23), when an interrupt occurs, the processor must complete the earlier instructions, flush the instructions currently in the pipeline, and transfer control to the trap handler.

3. **Interrupts behave like control hazards in pipeline diagrams**
   The Trap Pipeline Diagram (p.24–25) shows that instructions following the interrupted instruction are discarded and execution resumes at a different PC, which matches the behavior of control hazards.

### Experimental Idea

Introduce frequent interrupts (e.g., timer interrupts) while running a pipelined program and measure the increase in pipeline stalls or flushed instructions, comparing it with execution without interrupts.

## Describe the purpose of the compressed C extension.

### Reference
- Lecture: Week 13 (RISC-V Vector extension (RVV))
- The RISC-V Instruction Set Manual, Volume I
  https://drive.google.com/file/d/1uviu1nH-tScFfgrovvFCrj7Omv8tFtkp/view
  ![image](https://hackmd.io/_uploads/H1qptx-4Zx.png)

### Answer

According to the RISC-V Instruction Set Manual, Volume I: Unprivileged Architecture, Section 27 “C Extension for Compressed Instructions” (p.160), the purpose of the compressed C extension is to reduce code size by providing 16-bit instruction encodings for commonly used operations, while remaining fully compatible with the base 32-bit RISC-V ISA. Specifically, the C extension allows frequently used instructions (such as loads, stores, arithmetic operations, and control transfers) to be encoded in 16-bit compressed formats, which significantly improves code density, reduces the instruction memory footprint, and can lead to better instruction cache utilization and lower memory bandwidth usage.

### Experimental Idea

Compile the same program with and without the compressed C extension enabled, and compare the resulting binary sizes and instruction fetch counts. A smaller binary and reduced instruction fetch activity with the C extension would confirm that it achieves its intended purpose of improving code density.

## What is the significance of the funct3 and funct7 fields in R-type instructions?

### Reference
- Lecture: Week 4 (RISC-V Instruction Formats (Part I))
  https://docs.google.com/presentation/d/1pXcXcBjmUCXFJNFi4DpFLtpyDKsol5SaKfVce1hKdXE/edit?slide=id.p13#slide=id.p13
- Slides:
  * R-Format Instruction Layout — p.10
    ![image](https://hackmd.io/_uploads/ry6OqgbV-x.png)

### Answer

1. **funct3 and funct7 refine the opcode to identify the exact operation**
   According to the slide “R-Format Instruction Layout” (p.10), the opcode of an R-type instruction only partially specifies which instruction it is. The funct3 and funct7 fields, combined with the opcode, describe what operation to perform, allowing the processor to distinguish between the different arithmetic and logical operations.

2. **Different funct3/funct7 combinations select different ALU operations**
   According to the same slide (p.10), different combinations of funct3 and funct7 select different operations under the same opcode. For example, `add` and `sub` share the same opcode and funct3 but are distinguished by different funct7 values.

### Experimental Idea

Design a simple instruction decoder that takes an R-type instruction as input and varies only the funct3 and funct7 fields while keeping the opcode constant. Observe that the decoded ALU operation changes according to the funct3 and funct7 values, validating their role in determining the exact operation.
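A minimal sketch of such a decoder (my own illustration, covering only a few RV32I register-register instructions; the two example encodings were hand-assembled and can be cross-checked with an assembler):

```c
/* Illustrative R-type decoder sketch: the opcode (0110011) only says
 * "register-register ALU op"; funct3 and funct7 pick the actual operation. */
#include <stdint.h>
#include <stdio.h>

static const char *decode_rtype(uint32_t inst) {
    uint32_t opcode = inst & 0x7f;          /* bits [6:0]   */
    uint32_t funct3 = (inst >> 12) & 0x7;   /* bits [14:12] */
    uint32_t funct7 = (inst >> 25) & 0x7f;  /* bits [31:25] */

    if (opcode != 0x33) return "not an R-type ALU instruction";

    switch (funct3) {
    case 0x0: return (funct7 == 0x20) ? "sub" : "add";
    case 0x1: return "sll";
    case 0x4: return "xor";
    case 0x5: return (funct7 == 0x20) ? "sra" : "srl";
    case 0x6: return "or";
    case 0x7: return "and";
    default:  return "other";
    }
}

int main(void) {
    /* add x3,x1,x2 and sub x3,x1,x2: same opcode and funct3, different funct7 */
    uint32_t add_x3_x1_x2 = 0x002081b3;   /* funct7 = 0000000 */
    uint32_t sub_x3_x1_x2 = 0x402081b3;   /* funct7 = 0100000 */
    printf("0x%08x -> %s\n", add_x3_x1_x2, decode_rtype(add_x3_x1_x2));
    printf("0x%08x -> %s\n", sub_x3_x1_x2, decode_rtype(sub_x3_x1_x2));
    return 0;
}
```

Keeping the opcode fixed while flipping only funct7 bit 30 turns `add` into `sub`, which is exactly the role the slide assigns to these fields.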
## Explain how branch immediates are encoded in B-type instructions.

### Reference
- Lecture: Week 4 (RISC-V Instruction Formats (Part II))
  https://docs.google.com/presentation/d/1rYJeuDUPezDdk01M8ib3XTo15IGLdsMJclzD4Vv9XGQ/edit?slide=id.p15#slide=id.p15
- Slides:
  * RISC-V Instruction Formats, Part II, slides 14–17
    ![image](https://hackmd.io/_uploads/H1Tj6lZN-x.png)
    ![image](https://hackmd.io/_uploads/H1j36xWV-e.png)
    ![image](https://hackmd.io/_uploads/By6paxZE-l.png)
    ![image](https://hackmd.io/_uploads/H1TR6x-4Wl.png)

### Answer

Branch immediates in B-type instructions are encoded as a **PC-relative signed offset** that is **split across non-contiguous bit fields** of the instruction. Specifically:

- The B-type immediate occupies **12 bits** of the instruction, but it represents a **13-bit signed offset** because the least significant bit is implicitly zero due to instruction alignment.
- The immediate bits are stored in different positions of the instruction:
  - `imm[12]` is stored in **instruction bit [31]**
  - `imm[10:5]` is stored in **instruction bits [30:25]**
  - `imm[4:1]` is stored in **instruction bits [11:8]**
  - `imm[11]` is stored in **instruction bit [7]**
- During execution, the processor:
  1. Reconstructs the immediate from these scattered fields
  2. Sign-extends it
  3. Shifts it left by 1 bit (the offset is always a multiple of 2 bytes)
  4. Adds it to the current PC to compute the branch target address

This encoding enables **PC-relative branching** while keeping the register field positions consistent across instruction formats, which simplifies hardware decoding.
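As a small illustration of this reconstruction (my own sketch, not from the lecture), the following C function gathers the scattered fields and sign-extends the result. The `beq` encoding in `main` was hand-assembled and can be cross-checked with `objdump`, as in the experimental idea below.

```c
/* Illustrative sketch: rebuild the signed byte offset of a B-type
 * instruction from its scattered immediate fields. */
#include <stdint.h>
#include <stdio.h>

static int32_t btype_offset(uint32_t inst) {
    uint32_t imm12   = (inst >> 31) & 0x1;   /* imm[12]   <- inst[31]    */
    uint32_t imm10_5 = (inst >> 25) & 0x3f;  /* imm[10:5] <- inst[30:25] */
    uint32_t imm4_1  = (inst >> 8)  & 0xf;   /* imm[4:1]  <- inst[11:8]  */
    uint32_t imm11   = (inst >> 7)  & 0x1;   /* imm[11]   <- inst[7]     */

    uint32_t imm = (imm12 << 12) | (imm11 << 11) |
                   (imm10_5 << 5) | (imm4_1 << 1);   /* imm[0] is always 0 */

    /* sign-extend the 13-bit value to 32 bits */
    if (imm & (1u << 12)) imm |= 0xffffe000u;
    return (int32_t)imm;
}

int main(void) {
    /* beq x1, x2, +16 : encoding 0x00208863 (hand-assembled example) */
    uint32_t beq_fwd = 0x00208863;
    printf("offset = %d bytes\n", btype_offset(beq_fwd));  /* expect 16 */
    return 0;
}
```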
### Experimental Idea

To validate how branch immediates are encoded in B-type instructions, a simple assembly experiment can be used:

1. Write a RISC-V assembly program containing several branch instructions (e.g., `beq`, `bne`) with known forward and backward branch offsets.
2. Assemble the program and use an object-dump tool (e.g., `objdump -d`) to inspect the binary encoding of each branch instruction.
3. Manually extract the immediate-related bits from the instruction encoding and reconstruct the branch offset according to the B-type format.
4. Verify that the reconstructed offset matches the expected PC-relative target address.

This experiment confirms that branch immediates are split across non-contiguous bit fields and reconstructed at runtime as described by the B-type instruction format.

:::danger
Show the details!
:::