# CA2025 Quiz6
> 黃國豪
## 19. Why does false sharing hurt multicore performance?
### Lecture Materials
From [CSCI 463: Computer Architecture and Systems Organization](https://faculty.cs.niu.edu/~winans/CS463/2022-fa/) - Memory
[Performance Analysis and Tuning on Modern CPUs](https://faculty.cs.niu.edu/~winans/notes/patmc.pdf)
Page 158
[Multithreading Issues + Cache Coherency](https://www.youtube.com/watch?v=xOZH48uGYuc&list=PLDoI-XvXO0aqgj_8Og51XoE7iyAu5yEWZ&index=7)
1:06 ~ 2:40

According to the lecture materials, false sharing is explained as a cache block repeatedly ping-ponging between caches. Even though different processors update different variables, the entire cache block is invalidated each time.
This happens because cache coherence works at the cache block/line level rather than the variable level, which causes unnecessary coherence traffic and performance slowdown.
``` c=
// False Sharing Example
// Captured from 'Performance Analysis and Tuning on Modern CPUs'.
struct S {
    int sumA; // sumA and sumB are likely to
    int sumB; // reside in the same cache line
};
struct S s;

{ // section executed by thread A
    for (int i = 0; i < N; i++)
        s.sumA += a[i];
}
{ // section executed by thread B
    for (int i = 0; i < N; i++)
        s.sumB += b[i];
}
```
- False sharing is a performance degradation that occurs in multithreaded programs when threads on different cores modify independent variables that reside in the same cache line. Because cache coherence protocols track ownership at cache-line granularity, modifying one variable forces the other cores to invalidate and reload the entire line, leading to unnecessary memory traffic and pipeline stalls. It is a critical topic in *Performance Analysis and Tuning on Modern CPUs*.
- False sharing hurts multicore performance because processors cache data in fixed-size cache lines rather than individual variables. When different cores update separate variables that happen to reside in the same cache line, the cache coherence protocol treats the line as shared data. As a result, each write invalidates the line in the other cores' caches, forcing them to reload it repeatedly. This causes excessive coherence traffic, pipeline stalls, and wasted memory bandwidth. Even though the program has no logical data sharing, the hardware behaves as if it does, significantly reducing parallel efficiency.
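A common mitigation, shown as a minimal sketch below, is to give each per-thread accumulator its own cache line through explicit alignment; the 64-byte line size here is an assumption about the target CPU (C++17 exposes a portable hint as `std::hardware_destructive_interference_size`, see [5]). An alternative is to accumulate into a thread-local variable and write the shared result only once at the end.
``` c=
// Minimal sketch: pad the per-thread counters so each one occupies its own
// cache line. CACHE_LINE = 64 is an assumed line size, not a guaranteed one.
#include <stdalign.h>

#define CACHE_LINE 64

struct padded_sums {
    alignas(CACHE_LINE) long sumA;   // written only by thread A
    alignas(CACHE_LINE) long sumB;   // written only by thread B, on a separate line
};
```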
### Example
#### Experimental Setup
* This experiment was originally performed on an Apple Silicon machine running macOS. However, macOS does not support Linux CPU-affinity mechanisms such as the `pthread_setaffinity_np` API or the `taskset` command, so it is difficult to guarantee that the threads actually run on different CPU cores. In addition, performance analysis tools such as `perf` are not fully supported on macOS. Because false sharing is closely tied to cache coherence behavior between CPU cores, these limitations make macOS unsuitable for an accurate false sharing experiment.
* Method
__Using a Linux Virtual Machine with UTM__: to overcome these limitations, a Linux ARM64 virtual machine was created using UTM with Apple Virtualization on Apple Silicon. This environment provides better support for:
* CPU affinity control
* Multi-thread scheduling behavior
* Linux performance tools such as perf stat
* Test
``` shell=
$ wget https://raw.githubusercontent.com/Zeosleus/False_Sharing/main/false_sharing.c
$ gcc -o false_sharing false_sharing.c -lpthread
$ ./false_sharing
$ ./false_sharing --nofalse-sharing
```
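To make sure the two worker threads really run on different cores (the CPU affinity control mentioned above), a thread can be pinned explicitly. Below is a minimal sketch using the Linux-only `pthread_setaffinity_np`; the core number 0 is an illustrative assumption and is independent of the downloaded benchmark.
``` c=
// Minimal sketch: pin the calling thread to one CPU core on Linux.
// Build with: gcc -o pin pin.c -lpthread (the file name is arbitrary).
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // Returns 0 on success, an error number otherwise (e.g. the core does not exist).
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void) {
    int err = pin_to_core(0);                     // assumed core id
    if (err != 0)
        fprintf(stderr, "pinning failed: %s\n", strerror(err));
    else
        printf("main thread pinned to core 0\n");
    return 0;
}
```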
#### Result
Performance measurements were collected using `perf stat -d <binary> [arguments]`
__False Sharing Result__
``` txt=
Performance counter stats for './false_sharing':
4,306.59 msec task-clock # 1.000 CPUs utilized
1,449 context-switches # 336.461 /sec
2 cpu-migrations # 0.464 /sec
548 page-faults # 127.247 /sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
<not supported> L1-dcache-loads
<not supported> L1-dcache-load-misses
<not supported> LLC-loads
<not supported> LLC-load-misses
```
__No False Sharing Result__
``` txt=
Performance counter stats for './false_sharing --nofalse-sharing':
4,413.45 msec task-clock # 1.995 CPUs utilized
95 context-switches # 21.525 /sec
2 cpu-migrations # 0.453 /sec
550 page-faults # 124.619 /sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
<not supported> L1-dcache-loads
<not supported> L1-dcache-load-misses
<not supported> LLC-loads
<not supported> LLC-load-misses
2.211789192 seconds time elapsed
```
#### Conclusion
In the false sharing configuration, the program utilized approximately one CPU core and required about 4.3 seconds to complete, while in the no false sharing configuration, the program utilized nearly two CPU cores and completed in about 2.2 seconds.
This result suggests that false sharing significantly reduces parallel efficiency. Although both threads are runnable, frequent cache coherence traffic causes them to stall more often, effectively limiting parallel execution.
In addition, false sharing may indirectly increase the number of context switches. From the experimental results, the false sharing case experienced 1,449 context switches, while the no false sharing case experienced only 95 context switches. This large difference indicates that false sharing introduces additional scheduling overhead, which clearly demonstrates that false sharing hurts multicore performance.
## 20. Why is cache line size a trade-off?
[Lecture 26: Caches III](https://docs.google.com/presentation/d/1dN876ETDdJATOV0JYQJzvHadBcXAgA03NqG311es7QU/edit?slide=id.p29#slide=id.p29) Page 29
### Modern Architectural Standards
#### x86-64 Considerations
* __Standard Block Size :__ Most x86-64 cores use 64 byte cache lines at all levels (L1/L2/L3).
* __Legacy Compatibility:__ Most software data structures (structs, arrays) are padded or aligned to 64 bytes. Changing this would break performance for millions of existing applications. - *[4] Section 3.6.4 Data Alignment & [5] & [9]*
* __Block Transfer Efficiency:__ DRAM burst transfers and internal cache buses are typically optimized around 64-byte chunks, making this size efficient from a hardware pipeline and bus perspective. - *[11]*
* __Multi-Core Coherence Effects:__ Multiple cores share parts of the cache hierarchy, so a block size that is too large increases the chance of false sharing: two threads accessing independent data in the same line unnecessarily trigger coherence traffic, slowing performance. - *[10]*
#### Arm64 Considerations
* __Standard Block Size:__ L1 64 Bytes and L2 128 Bytes.
* __Tag Overhead:__ Apple chips have massive L2 caches. By using 128-byte blocks instead of 64-byte ones, they halve the amount of "Tag" memory needed to track the data. This saves significant physical space on the silicon. - *[7]*
* __Throughput over Latency:__ Apple's Unified Memory Architecture (UMA) provides extremely wide pipes to RAM (up to 400GB/s+). This high bandwidth makes the "Miss Penalty" of a 128-byte block negligible compared to the benefit of having more data ready for the powerful P-cores. - *[8]*
* __Spatial Locality Optimization:__ ARM processors heavily optimize for sequential loads/stores and vectorized data — a 64 B block matches common vector lengths. - *[12]*
* __Prefetching and Memory Ordering:__ ARM implements aggressive prefetching policies that assume regular strides; cache line size interacts with these prefetchers. Too large a block can cause unnecessary memory traffic if prefetching doesn’t match access patterns.
* __Miss Penalty vs. Locality:__ Since DRAM latency is high relative to core frequency, fetching a whole 64 B block from memory can be 10–100× slower than a cache hit. Larger blocks therefore reduce compulsory misses, but they significantly increase the miss penalty. - *[13]*
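The line sizes quoted above can be checked on the machine itself. A minimal sketch for Linux/glibc follows; `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension and may report 0 on some ARM systems, and on macOS the usual query is `sysctl hw.cachelinesize` instead.
``` c=
// Minimal sketch: ask the OS for the L1 data cache line size (Linux/glibc).
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);   // glibc extension
    if (line > 0)
        printf("L1 data cache line size: %ld bytes\n", line);
    else
        printf("cache line size not reported by sysconf\n");
    return 0;
}
```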
#### Benefits
- __Spatial Locality__: If a specific word is accessed, it is highly likely that nearby words will be accessed soon. A larger block captures these neighbors in a single transfer
- __Stored-Program Concept__: Instructions are stored in memory just like data and are fetched largely sequentially, so larger blocks also benefit instruction fetch in modern computing environments where instructions and data share the memory hierarchy
- __Sequential Efficiency__: Larger blocks work exceptionally well for tasks like sequential array accesses, where the CPU reads through memory in a predictable, linear fashion
#### Drawbacks
- __Larger Miss Penalty__: On a cache miss, the CPU must wait longer because it takes more time to load a larger block from the next level of memory
- __Reduced Block Count__: If the block size is too large relative to the total cache size, the cache can hold only a small number of blocks
- __Increased Miss Rate__: When there are fewer blocks available, the cache cannot store a diverse enough set of data, causing the miss rate to go up
The trade-off exists because a designer must find the __balance__: a block large enough to exploit spatial locality, but small enough to keep the miss penalty low and the number of blocks high
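The spatial-locality side of this balance can be made concrete with a small experiment: the sketch below sums the same array once sequentially and once with a stride of one element per (assumed 64-byte) cache line, so the strided run uses only one int out of every line it fetches. The array size, stride, and timing method are illustrative assumptions, not measured values.
``` c=
// Minimal sketch: sequential vs. strided traversal of a large array.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)                    // 16M ints (~64 MB), larger than any cache
#define LINE_INTS (64 / sizeof(int))    // ints per assumed 64-byte line

// Sum all N elements, visiting them with the given stride, so that either every
// int in a fetched line is used (stride 1) or only one int per line is used.
static long sum_with_stride(const int *a, size_t stride) {
    long s = 0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < N; i += stride)
            s += a[i];
    return s;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = 1;

    size_t strides[] = {1, LINE_INTS};  // sequential vs. one element per line
    for (int k = 0; k < 2; k++) {
        clock_t t0 = clock();
        long s = sum_with_stride(a, strides[k]);
        double ms = 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("stride %2zu: sum=%ld, %.1f ms\n", strides[k], s, ms);
    }
    free(a);
    return 0;
}
```
Both runs do the same number of additions; the strided one is slower only because most of every fetched block goes unused, which is exactly the bandwidth cost described in the drawbacks above.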
## 21. Why does TLB miss hurt more than cache miss?
[Lecture 34 - Virtual Memory II](https://docs.google.com/presentation/d/1KB2HKnvqsKq65udERzMw9imTMtV8KREMWmKc7w8XOfk/edit?slide=id.g313922e28e2_0_1026#slide=id.g313922e28e2_0_1026)
Page 9 - Page 15
#### Cache Miss
- On a cache miss, the physical address is already known, so the CPU simply fetches the data from the next level of the hierarchy or from RAM. It is essentially a single additional memory access
#### TLB Miss
- If the address translation is not cached in the TLB, the CPU has to walk the page table. On 64-bit systems with 4- or 5-level page tables, this requires several sequential memory lookups just to find the final physical address, before the requested data can even be fetched
#### Conclusion
- A TLB Miss is inherently more expensive than a Cache Miss because it converts a single data request into a serialized chain of multiple memory accesses. Effectively, a TLB miss multiplies the memory latency by the number of page table levels. Furthermore, if the page table entries themselves are not in the cache, each step of the "search" incurs its own latency, leading to a massive performance penalty.
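To make the serialized chain explicit, the sketch below models a radix page walk as four dependent loads: each level's index comes from a different slice of the virtual address, and the address of the next load is unknown until the previous one returns. The 9-bit indices and 4 KiB pages mirror the common x86-64/AArch64 layout, but the tables built in main() are toy data, not a real page table, and a 64-bit build is assumed.
``` c=
// Minimal sketch: a 4-level page-table walk as a chain of dependent loads.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LEVELS     4
#define IDX_BITS   9
#define ENTRIES    (1u << IDX_BITS)      // 512 entries per table
#define PAGE_SHIFT 12                    // 4 KiB pages

// Each iteration loads an entry whose address depends on the value loaded in
// the previous iteration, so the four accesses cannot be overlapped.
static uintptr_t walk(uintptr_t root, uintptr_t vaddr) {
    uintptr_t table = root;
    for (int level = LEVELS - 1; level >= 0; level--) {
        unsigned idx = (vaddr >> (PAGE_SHIFT + level * IDX_BITS)) & (ENTRIES - 1);
        table = ((uintptr_t *)table)[idx];              // dependent memory access
    }
    return table + (vaddr & ((1u << PAGE_SHIFT) - 1));  // frame base + page offset
}

int main(void) {
    // Build one toy table per level and map a single virtual address.
    uintptr_t *tables[LEVELS];
    for (int i = 0; i < LEVELS; i++)
        tables[i] = calloc(ENTRIES, sizeof(uintptr_t));

    uintptr_t vaddr = (uintptr_t)0x7f1234567890ULL;  // arbitrary example address
    uintptr_t frame = (uintptr_t)0xabcde000ULL;      // pretend physical frame base

    uintptr_t next = frame;
    for (int level = 0; level < LEVELS; level++) {   // fill from the leaf upward
        unsigned idx = (vaddr >> (PAGE_SHIFT + level * IDX_BITS)) & (ENTRIES - 1);
        tables[level][idx] = next;
        next = (uintptr_t)tables[level];
    }

    printf("translated to %#" PRIxPTR "\n", walk((uintptr_t)tables[LEVELS - 1], vaddr));
    for (int i = 0; i < LEVELS; i++) free(tables[i]);
    return 0;
}
```
If any of these four entries is itself absent from the data cache, each level pays a full memory latency, which is the multiplication of latency described in the conclusion above.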
## Reference
* [1] https://github.com/torvalds/linux/blob/master/Documentation/kernel-hacking/false-sharing.rst
* [2] https://hackmd.io/@sysprog/concurrency-atomics
* [3] https://lemire.me/blog/2023/12/12/measuring-the-size-of-the-cache-line-empirically/
* [4] https://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
* [5] https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size.html
* [7] https://acs.pub.ro/~cpop/SMPA/Computer%20Architecture%20A%20Quantitative%20Approach%20(5th%20edition).pdf
* [8] https://www.tomshardware.com/news/apple-m1-pro-max-everything-we-know
* [9] https://stackoverflow.com/questions/68320687/why-are-most-cache-line-sizes-designed-to-be-64-byte-instead-of-32-128byte-now
* [10] https://en.wikipedia.org/wiki/False_sharing
* [11] https://db.in.tum.de/teaching/ss20/dataprocessingonmodernhardware/MH_2.pdf?lang=de
* [12] https://zhuanlan.zhihu.com/p/545918309
* [13] https://people.cs.nycu.edu.tw/~ttyeh/course/2024_Spring/CS10014/slide/lecture-11.pdf