# CA2025 Quiz6
> 陳識博
## Question 19
Why does false sharing hurt multicore performance?
### Definition of False sharing
False sharing occurs when multiple processors access different variables that reside in the same cache block, causing unnecessary cache coherence traffic and performance degradation.
---
### Example in Lecture 23.7

Suppose:
- Block size: 32 bytes
- P0 reads and writes variable `X`
- P1 reads and writes variable `Y`
- `X` is at location 4000, `Y` at 4012
---
At the program (logical) level:
- Variables X and Y are disjoint variables (they are unrelated).
At the hardware level:
- Variables X and Y reside in the same cache block, because all addresses in the range `4000–4031` map to the same 32-byte cache block (see the quick check below).
- As a result, the cache coherence mechanism treats them as shared.
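A quick sanity check of the block-index arithmetic, using the 32-byte block size and the addresses from the lecture example (the mapping is the usual `address / block_size` calculation):
```c
#include <stdio.h>

int main(void) {
    const unsigned block_size = 32;      // block size from the lecture example
    const unsigned x = 4000, y = 4012;   // addresses of X and Y

    // Integer division maps an address to its cache-block index:
    // 4000 / 32 == 125 and 4012 / 32 == 125, so both variables fall in
    // block 125, which covers addresses 4000-4031.
    printf("X -> block %u, Y -> block %u\n", x / block_size, y / block_size);
    return 0;
}
```
Both addresses map to block 125, which is why the hardware treats the two logically independent variables as a single shared unit.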
---
What happens:
1. `P0` writes `X` (address 4000)
   - Under the cache coherence protocol (e.g., `MESI`):
     - The cache block in `P0`’s cache transitions to the `Modified` state.
     - The same cache block in `P1`’s cache is invalidated.
2. Then `P1` writes `Y` (address 4012)
   - `P1` needs to obtain ownership of the cache block.
   - The cache block is transferred from `P0` to `P1`.
   - The cache block in `P0`’s cache is invalidated.
3. This process repeats
   - The cache block ping-pongs between Cache 0 and Cache 1.
   - The lecture refers to this phenomenon as: “Block ping-pongs between two caches.”
Even though `P0` and `P1` access different variables, cache coherence operates at cache-block granularity, so the block repeatedly transfers between the two caches; this is `false sharing`.
`False sharing` is not caused by true data sharing, but by cache coherence being maintained at cache-block granularity rather than variable granularity.
---
How to prevent false sharing:
- Place data on different blocks (see the sketch below)
- Reduce block size
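One concrete way to “place data on different blocks” in software is an explicit alignment specifier instead of manual padding. A minimal C11 sketch, assuming a 64-byte cache line (the variable names are only for illustration):
```c
#include <stdint.h>

// Each counter is forced onto its own 64-byte boundary. Two addresses that
// are at least 64 bytes apart can never fall in the same 64-byte cache line.
static _Alignas(64) volatile uint64_t counter0;  // written only by thread 0
static _Alignas(64) volatile uint64_t counter1;  // written only by thread 1
```
Reducing the block size, by contrast, is a hardware design decision and cannot be applied from software.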
---
### Experiments
In this experiment, I create a program that exhibits false sharing and a version that avoids it, then compare their performance.
#### 1. First, use `ChatGPT` to help generate a false-sharing program.
```c
#include <pthread.h>
#include <stdint.h>

static volatile uint64_t shared[2] = {0, 0};

void* t0(void* arg){
    for (uint64_t i = 0; i < 200000000ULL; i++) shared[0]++;
    return 0;
}

void* t1(void* arg){
    for (uint64_t i = 0; i < 200000000ULL; i++) shared[1]++;
    return 0;
}

int main(){
    pthread_t a, b;
    pthread_create(&a, 0, t0, 0);
    pthread_create(&b, 0, t1, 0);
    pthread_join(a, 0);
    pthread_join(b, 0);
    return 0;
}
```
- The array `shared[2]` contains two `uint64_t` elements, so the entire array is 16 bytes.
- `shared[0]` and `shared[1]` are in the same cache line.
- Thread 0 constantly updates `shared[0]`; Thread 1 constantly updates `shared[1]`.
- To check that `shared[0]` and `shared[1]` really share a cache line, query the cache line size:
```bash
getconf LEVEL1_DCACHE_LINESIZE
64
```
The entire array is only 16 bytes, much smaller than the 64-byte cache line, so `shared[0]` and `shared[1]` reside in the same cache line (assuming the 8-byte-aligned array does not happen to straddle a line boundary).
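The same check can also be done from inside a C program. A small sketch, assuming Linux/glibc, where the `_SC_LEVEL1_DCACHE_LINESIZE` name for `sysconf` is available:
```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Query the L1 data cache line size at run time (glibc extension);
    // sysconf returns -1 or 0 if the value is unavailable on this system.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 dcache line size: %ld bytes\n", line);
    return 0;
}
```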
---
#### 2. Use `ChatGPT` to help generate a program without false sharing.
```c
#include <pthread.h>
#include <stdint.h>

typedef struct {
    volatile uint64_t value;
    char padding[64];  // put values on different cache lines (roughly)
} padded_u64;

static padded_u64 shared[2];

void* t0(void* arg){
    for (uint64_t i = 0; i < 200000000ULL; i++) shared[0].value++;
    return 0;
}

void* t1(void* arg){
    for (uint64_t i = 0; i < 200000000ULL; i++) shared[1].value++;
    return 0;
}

int main(){
    pthread_t a, b;
    pthread_create(&a, 0, t0, 0);
    pthread_create(&b, 0, t1, 0);
    pthread_join(a, 0);
    pthread_join(b, 0);
    return 0;
}
```
This version uses the technique of manual padding: a `char padding[64]` buffer is placed after each `uint64_t` value so that the two values are far enough apart to occupy different 64-byte cache lines.
```ASCII
Original Code (False Sharing) :
Memory Address | Data Content | Cache Line Boundary
-----------------------------------------------------------------------
0x...00 | [ shared[0] ] (8 bytes) | <--- Start Line 1
0x...08 | [ shared[1] ] (8 bytes) | (Both in Line 1)
0x...10 | [ Unused/Other Data ] |
... | ... |
0x...3F | [ Unused/Other Data ] | <--- End Line 1
-----------------------------------------------------------------------
```
```ASCII
Padded Code (False Sharing Prevented):
Memory Address | Data Content | Cache Line Boundary
-----------------------------------------------------------------------
0x...00 | [ shared[0].value ] (8 bytes) | <--- Start Line 1
0x...08 | [ shared[0].padding ] (64 bytes) |
... | [ ... ] |
0x...3F | [ ... ] | <--- End Line 1
-----------------------------------------------------------------------
0x...40 | [ shared[0].padding ] (continued)| <--- Start Line 2
0x...48 | [ shared[1].value ] (8 bytes) |
0x...50 | [ shared[1].padding ] (64 bytes) |
... | [ ... ] |
0x...7F | [ ... ] | <--- End Line 2
-----------------------------------------------------------------------
```
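A quick way to confirm the layout sketched above is to let the compiler report the element size. A small sketch that reuses the padded struct; the 64-byte line size is an assumption carried over from this experiment:
```c
#include <stdint.h>
#include <stdio.h>

// Same layout as the padded version above.
typedef struct {
    volatile uint64_t value;
    char padding[64];
} padded_u64;

// 8 bytes of value + 64 bytes of padding = 72 bytes per element. Two
// addresses that are at least 64 bytes apart can never share a 64-byte
// cache line, so consecutive .value fields always land on different lines.
_Static_assert(sizeof(padded_u64) >= 64,
               "each element must span at least one full cache line");

int main(void) {
    printf("sizeof(padded_u64) = %zu bytes\n", sizeof(padded_u64));
    printf("distance between consecutive .value fields = %zu bytes\n",
           sizeof(padded_u64));
    return 0;
}
```
On a typical 64-bit target this prints 72 bytes for both values, matching the diagram.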
---
#### 3. Compile
```bash
gcc -O0 -g -pthread false_sharing.c -o false_sharing
gcc -O0 -g -pthread no_false_sharing.c -o no_false_sharing
```
---
#### 4. Validation
With false sharing:
```bash
time ./false_sharing
real 0m0.348s
user 0m0.645s
sys 0m0.000s
```
Without false sharing:
```bash
time ./no_false_sharing
real 0m0.069s
user 0m0.117s
sys 0m0.001s
```
- The false-sharing version takes `0.348s`.
- The version without false sharing takes `0.069s`.
This clearly demonstrates that false sharing significantly degrades performance (about a 5× slowdown in wall-clock time).
---
Analyze the performance counters with `perf stat`.
```bash
perf stat -e cycles,instructions,cache-references,cache-misses ./false_sharing
Performance counter stats for './false_sharing':
3533544728 cycles:u
2400132539 instructions:u # 0.68 insn per cycle
20502575 cache-references:u
16897355 cache-misses:u # 82.42% of all cache refs
0.357091997 seconds time elapsed
0.667517000 seconds user
0.000000000 seconds sys
```
```bash
perf stat -e cycles,instructions,cache-references,cache-misses ./no_false_sharing
Performance counter stats for './no_false_sharing':
546250003 cycles:u
2400131856 instructions:u # 4.39 insn per cycle
21094 cache-references:u
6262 cache-misses:u # 29.69% of all cache refs
0.056999830 seconds time elapsed
0.104226000 seconds user
0.000000000 seconds sys
```
Observations:
- The instruction counts are nearly identical (about 2.4 billion in both runs), confirming that both programs perform the same work.
- Eliminating false sharing reduced the cycle count from about `3533M` to `546M`.
- False sharing drives up the cache miss rate (`82.42%` vs. `29.69%` of cache references) because cache coherence operates at cache-line granularity: when one core writes to its variable, the entire cache line is invalidated in the other core’s cache.
---
#### 5. Conclusion
This experiment demonstrates that false sharing alone can cause a significant slowdown (about 5× here), even though the two threads never access the same variable.
---
### Lecture Material Reference
- Week 15 [Multithreading Issues + Cache Coherency](https://www.youtube.com/watch?v=xOZH48uGYuc&list=PLDoI-XvXO0aqgj_8Og51XoE7iyAu5yEWZ&index=7)
- Section : Lecture 23.7 - Multithreading Issues, Cache Coherency: Coherency Misses
- [False Sharing](https://github.com/torvalds/linux/blob/master/Documentation/kernel-hacking/false-sharing.rst)
- [並行程式設計: Atomics 操作](https://hackmd.io/@sysprog/concurrency-atomics)
## Question 24
Why are interrupts considered control hazards?
### Definition of Interrupt (In CS61C)
Caused by an event external to the currently running program.
- E.g., key press, disk I/O
- Asynchronous to the current program
- Can be handled at any convenient instruction
- “Whenever it’s convenient, just don’t wait too long”
---
### Answer
Interrupts are considered control hazards because they alter the normal sequential flow of control by redirecting the PC to an interrupt handler.
As shown in the lecture (`slide 20`), an interrupt causes the processor to save the current PC in `MEPC` and transfer control to a separate `interrupt handler`. This redirection of control flow is not part of the original instruction sequence.

Furthermore, interrupts are asynchronous to the current program, meaning they can occur at unpredictable times. As a result, the processor pipeline cannot anticipate the control flow change in advance.
Therefore, because interrupts cause an unexpected and asynchronous change in the program counter, they are considered control hazards.
---
### Lecture Material Reference
- Week 16 [I/O: Devices, Polling, Interrupts](https://drive.google.com/drive/folders/1NRoyYSbAyarMhmPD5LiY6YEqYpLoZx_s)
- Slides : 19-22
## Question 28
### Question Description
In RISC-V, why is there no SUBI instruction?
---
### Answer
RISC-V provides `add` and `sub` instructions, as well as the immediate instruction `addi`, but doesn't include a corresponding `subi` instruction.
RISC-V doesn't include a `subi` instruction because its instruction set is intentionally designed to be minimal. The goal is to keep the number of instruction types to an absolute minimum while still being functionally complete.
Subtraction with an immediate value can be expressed using the existing `addi` instruction by providing a two's-complement (negative) immediate.
>[!Note] Example
>```assembly
> subi x3, x4, 10
> ||
> ||
> \/
> addi x3, x4, -10
>```
Since subtraction by an immediate value can be decomposed into an addition with a negative immediate, a separate `subi` instruction is unnecessary.
---
### Lecture Material Reference
- Week 2 [RISC-V Instructions](https://docs.google.com/presentation/d/1qNVK8ULddo6luq0Rrjj_bmOShhcx9hlESN36AK91n6I/edit?slide=id.g1202e32ef3e_0_392#slide=id.g1202e32ef3e_0_392)
- Section : Immediate
- Slides : 21
---
### Further Discussion
C code:
```C
int subi(int a, int b)
{
    a = b - 10;
    return a;
}
```
Assembly:
```assembly
    .file   "SUBI.c"
    .option nopic
    .attribute arch, "rv32i2p1_zicsr2p0"
    .attribute unaligned_access, 0
    .attribute stack_align, 16
    .text
    .align  2
    .globl  subi
    .type   subi, @function
subi:
    addi    sp,sp,-32
    sw      s0,28(sp)
    addi    s0,sp,32
    sw      a0,-20(s0)
    sw      a1,-24(s0)
    lw      a5,-24(s0)
    addi    a5,a5,-10
    sw      a5,-20(s0)
    lw      a5,-20(s0)
    mv      a0,a5
    lw      s0,28(sp)
    addi    sp,sp,32
    jr      ra
    .size   subi, .-subi
    .ident  "GCC: (13.2.0-11ubuntu1+12) 13.2.0"
```
There is no `subi` instruction in the generated RISC-V assembly.
Instead, the compiler emits the key instruction:
```assembly
addi a5,a5,-10
```
This instruction performs the subtraction by using `addi` with a negative immediate value.
The output confirms that the compiler never needs a `subi` instruction, which is consistent with RISC-V not defining one.
## Question 29
How does RISC-V support position-independent code?
### Answer
RISC-V supports `position-independent code` (PIC) by using PC-relative branches and jumps (e.g., `beq`, `bne`, `jal`), allowing all control-flow targets to be computed relative to the current PC rather than as fixed absolute addresses.
>[!Note] Example
>```assembly
> j Label
> ||
> ||
> \/
> jal x0, Label
>```
>As shown in the lecture slides, once pseudoinstructions are replaced with real instructions (e.g., `j` replaced by `jal x0`), the assembler determines the offset to encode by counting the number of half-words between the current instruction and the target instruction.

Because the encoded value is an offset from the current PC rather than an absolute address, the instruction works correctly no matter where the code is loaded in memory.
---
### Lecture Material Reference
- Week 6 [Lecture 13: Compiling, Assembling, Linking, and Loading](https://docs.google.com/presentation/d/1uAURy-tL-K9oMtUP1gY9034gsOrxbhhDL9MnoTm1NKM/edit?slide=id.p14#slide=id.p14)
- Video : [[CS61C] Lecture 13: Compiling-Assembling-Linking-Loading (Part I)](https://www.youtube.com/watch?v=oeRTvGGLRZI) start at 17:50
- Section : Producing Machine Code
- Slides : 12