Chun Min Chang
## Goals

1. Quantify the performance differences between Mutex (futex), Spinlock (busy-wait), and Atomic (CAS).
2. Analyze scheduler behavior and context-switch cost under high contention.
3. Padding and data-layout strategies.

## The Cost of Parallelism

In computer-architecture courses, the pipeline unit usually introduces the concept of **hazards**, and parallelization runs into a similar class of problems. Setting aside fully independent workloads, **data hazards** are the obvious case, but in parallel programs the more common problem is contention over hardware. It stems from finite compute resources — from small units such as memory, registers, and ALUs, up to larger ones such as threads, L1/L2/L3 caches, main memory, SMs, and cores.

### A Thread's Resource Inventory

| Resource | Sharing |
| :--- | :--- |
| **Stack** | **Private** |
| **Registers** | **Private** |
| **Heap / Globals** | **Shared** |
| **Files / Sockets** | **Shared** |

OS and architecture intro courses cover many of the resulting failure modes, the best known being **deadlock**. To prevent corruption of and races over shared state, locking was introduced — but locks in turn hurt parallelism.

> In parallel computing this fight over shared resources is called **contention**.
> If your program spends 90% of its time on its own private stack data, your speedup will be nearly ideal; but if it spends 90% of its time fighting over a global `counter`, it doesn't matter how many cores your machine has — everyone is stuck waiting in line.

### Contention in Brief

When thread A holds the lock and thread B wants it:

1. **User space (spin)**: thread B first waits at the door (spinlock) for A to come out.
2. **Kernel space (context switch)**: if A doesn't come out soon, the OS steps in, takes B off the CPU, and puts it to sleep.
3. **Wakeup**: when A releases the lock, the OS wakes B. Rescheduling B back onto a CPU is exactly the overhead we measured earlier.

### OS Mechanisms

* **Serialization**: even with 4 threads, at any instant the LOCKED state appears in only one thread's lane. That is the essence of a mutex: it turns parallel execution back into serial execution.
* **Queueing**
* **Unfairness**: without priorities, a futex-based mutex makes no fairness guarantee.

### Visualization Code

#### visualizer_thread_contention

```c
// ~/pthreads/visualizer/pthread/visualizer_thread_contention.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NUM_THREADS 5
#define ITERATIONS 10

// --- Shared resources ---
int shared_pot = -1;   // everyone races to modify this
int use_lock = 0;      // safety switch: 0 = no lock (free-for-all), 1 = lock (queue up)
pthread_mutex_t lock;

void print_ui(int id, const char* action, const char* color, int val) {
    // each thread owns one row showing its current action and the value it read
    printf("\033[%d;1HThread %d: [%s%-12s\033[0m] | value read: %2d",
           id + 4, id, color, action, val);
    // show the ground truth at the top
    printf("\033[2;1H================================");
    printf("\033[3;1H [ shared variable now ] -> [ %2d ]", shared_pot);
    printf("\033[2;35Hmode: %s", use_lock ? "Mutex ON" : "No Lock");
    fflush(stdout);
}

void* thread_worker(void* arg) {
    int id = *(int*)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        // 1. preparation
        print_ui(id, "preparing", "\033[32m", shared_pot);
        usleep(rand() % 800000 + 200000);

        // 2. enter the fight
        if (use_lock) pthread_mutex_lock(&lock);
        print_ui(id, "modifying!", "\033[31m", shared_pot);
        // simulate the read -> modify -> write-back delay
        int temp = shared_pot;
        usleep(100000);   // deliberate gap so others can cut in
        (void)temp;       // the read exists only to widen the race window
        shared_pot = id;
        if (use_lock) pthread_mutex_unlock(&lock);

        // 3. done modifying
        usleep(500000);
    }
    return NULL;
}

int main(int argc, char** argv) {
    if (argc > 1) use_lock = atoi(argv[1]);  // controllable via command-line argument

    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    pthread_mutex_init(&lock, NULL);
    srand(time(NULL));

    printf("\033[2J\033[?25l");  // clear screen, hide cursor
    printf("\033[1;1H=== shared-variable grab experiment ===");

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, thread_worker, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);

    printf("\033[%d;1H\nExperiment over. Observe the discrepancies.\033[?25h\n",
           NUM_THREADS + 5);
    pthread_mutex_destroy(&lock);
    return 0;
}
```

#### visualizer_thread_static

```c
// ~/pthread/visualizer_thread_static.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define NUM_THREADS 4
#define TOTAL_STEPS 10

pthread_mutex_t lock;
int lock_counts[NUM_THREADS] = {0};  // how many times each thread won the lock

void print_status(int id, const char* status, const char* color) {
    // show id, state, and cumulative lock acquisitions
    printf("\033[%d;1HThread %d: [%s%-12s\033[0m] | Success: %d",
           id + 3, id, color, status, lock_counts[id]);
    fflush(stdout);
}

void* thread_worker(void* arg) {
    int id = *(int*)arg;
    for (int i = 0; i < TOTAL_STEPS; i++) {
        print_status(id, "THINKING", "\033[32m");    // green
        usleep(rand() % 500000);

        print_status(id, "WAITING...", "\033[33m");  // yellow
        pthread_mutex_lock(&lock);
        // --- critical section ---
        lock_counts[id]++;
        print_status(id, "LOCKED!!", "\033[31m");    // red
        usleep(300000);
        // ------------------------
        pthread_mutex_unlock(&lock);
    }
    print_status(id, "FINISHED", "\033[34m");
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    pthread_mutex_init(&lock, NULL);
    srand(time(NULL));

    printf("\033[2J\033[?25l\033[1;1H=== resource-contention counter ===\n");

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, thread_worker, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);

    printf("\033[%d;1H\nConclusion: no matter how many cores you have, the lock turns execution back into a queue.\n\033[?25h",
           NUM_THREADS + 4);
    pthread_mutex_destroy(&lock);
    return 0;
}
```

## Synchronization Primitives

On Linux, synchronization primitives are implemented through cooperation between hardware instructions and the OS scheduler. We compare three core techniques:

* **Mutex (POSIX mutex)** — built on the futex (Fast Userspace Mutex). It stays in user space when uncontended; under contention it falls into the kernel via a system call to suspend the thread.
* **Spinlock** — the thread busy-waits (polls) in user space. No context switch is triggered, but CPU cycles are burned at full rate.

> Context switch: the CPU swapping registers, the program counter, and the stack pointer between schedulable entities — i.e. switching threads.

* **Atomic (C11/C++11 memory model)** — uses the hardware-level CAS (Compare-and-Swap) instruction, completing synchronization directly at the level of the CPU cache-coherence protocol (MESI).

### Thread Contention

```c
// sync_compare.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdatomic.h>

#define NUM_ITERATIONS 10000000
#define NUM_THREADS 16

typedef struct {
    int counter;                  // plain integer used by the mutex and spinlock tests
    pthread_mutex_t mutex;        // OS-level mutex
    pthread_spinlock_t spinlock;  // user-space spinlock
    _Atomic int atomic_counter;   // hardware-level atomic variable
} shared_data_t;

// test function pointer type
typedef void* (*test_func_t)(void*);

// --- the competing implementations ---
void* test_mutex(void* arg) {
    shared_data_t* data = (shared_data_t*)arg;
    for (int i = 0; i < NUM_ITERATIONS / NUM_THREADS; i++) {
        pthread_mutex_lock(&data->mutex);    // acquire the lock
        data->counter++;                     // critical section
        pthread_mutex_unlock(&data->mutex);  // release the lock
    }
    return NULL;
}
// On Linux, pthread_mutex_lock is typically implemented on top of a
// futex (Fast Userspace Mutex). It first tries to take the lock in
// user space; only on failure does it enter the kernel via a system
// call to suspend the thread — precisely to avoid unnecessary
// context switches.

void* test_spinlock(void* arg) {
    shared_data_t* data = (shared_data_t*)arg;
    for (int i = 0; i < NUM_ITERATIONS / NUM_THREADS; i++) {
        pthread_spin_lock(&data->spinlock);  // poll until the lock frees; the CPU never rests
        data->counter++;
        pthread_spin_unlock(&data->spinlock);
    }
    return NULL;
}
// Spinlocks suit critical sections that are extremely short. But if
// a lock holder is preempted (the OS forcibly takes its CPU away),
// every spinning thread wastes its entire time slice for nothing.

void* test_atomic(void* arg) {
    shared_data_t* data = (shared_data_t*)arg;
    for (int i = 0; i < NUM_ITERATIONS / NUM_THREADS; i++) {
        atomic_fetch_add(&data->atomic_counter, 1);
    }
    return NULL;
}
// Maps to a LOCK ADD instruction on x86. It never enters the OS
// kernel; instead the CPU's cache-coherence protocol (e.g. MESI)
// locks the cache line holding the variable, preventing other cores
// from modifying it concurrently.

double run_experiment(test_func_t func, shared_data_t* data) {
    pthread_t threads[NUM_THREADS];
    struct timespec start, end;

    // reset the counters
    data->counter = 0;
    atomic_store(&data->atomic_counter, 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, func, data);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main() {
    shared_data_t data;
    pthread_mutex_init(&data.mutex, NULL);
    pthread_spin_init(&data.spinlock, PTHREAD_PROCESS_PRIVATE);

    printf("Threads: %d, total iterations: %d\n", NUM_THREADS, NUM_ITERATIONS);
    printf("Mutex    time: %f s\n", run_experiment(test_mutex, &data));
    printf("Spinlock time: %f s\n", run_experiment(test_spinlock, &data));
    printf("Atomic   time: %f s\n", run_experiment(test_atomic, &data));

    pthread_mutex_destroy(&data.mutex);
    pthread_spin_destroy(&data.spinlock);
    return 0;
}
```

#### Compile & Eval

```
# 1. Compile
gcc -O3 -pthread src/pthreads/sync_compare.c -o bin/sync_compare

# 2. Measure the mutex run (watch the context switches)
perf stat -e cpu-clock,faults,context-switches,cpu-migrations ./sync

# On WSL2 the kernel is a Microsoft-provided build and sysctl path
# handling sometimes differs from native Ubuntu, so relax
# perf_event_paranoid first:
sudo sh -c 'echo -1 > /proc/sys/kernel/perf_event_paranoid'
```

#### Return: Mutex

```
Threads: 16, total iterations: 10000000
Mutex time: 0.540014 s

 Performance counter stats for './sync':

             89271      context-switches
               586      cpu-migrations
   <not supported>      cycles

       0.541140951 seconds time elapsed
       0.706919000 seconds user
       6.872599000 seconds sys
```

#### Return: Spin

```
Threads: 16, total iterations: 10000000
Spinlock time: 1.273953 s

 Performance counter stats for './sync':

               456      context-switches
                 3      cpu-migrations
   <not supported>      cycles

       1.275071647 seconds time elapsed
      17.841535000 seconds user
       0.007791000 seconds sys
```

#### Return: Atomic

```
Threads: 16, total iterations: 10000000
Atomic time: 0.183654 s

 Performance counter stats for './sync':

                32      context-switches
                 1      cpu-migrations
   <not supported>      cycles

       0.184785650 seconds time elapsed
       2.445988000 seconds user
       0.003851000 seconds sys
```

## No Lock?

What if there is no lock at all?
- Cache Coherence & False Sharing

> Memory partitioning

```c
// nolock.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERATIONS 100000000  // 100 million ops to magnify the gap
#define NUM_THREADS 4         // ideally set to your physical core count

// Case A: packed layout (triggers false sharing)
struct bad_struct {
    volatile long count[NUM_THREADS];
};

// Case B: aligned layout (avoids false sharing)
// 64 bytes of padding ensure each count owns an entire cache line
struct good_struct {
    struct {
        volatile long count;
        long padding[7];  // 1 long (8 B) + 7 longs (56 B) = 64 bytes
    } data[NUM_THREADS];
};

struct bad_struct s1;
struct good_struct s2;

void* bench_bad(void* arg) {
    int id = *(int*)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        s1.count[id]++;
    }
    return NULL;
}

void* bench_good(void* arg) {
    int id = *(int*)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        s2.data[id].count++;
    }
    return NULL;
}

double get_time(void* (*func)(void*)) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, func, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main() {
    printf("--- False sharing study (threads: %d) ---\n", NUM_THREADS);

    double time_bad = get_time(bench_bad);
    printf("Packed  (bad)  time: %.4f s\n", time_bad);

    double time_good = get_time(bench_good);
    printf("Aligned (good) time: %.4f s\n", time_good);

    printf("Speedup: %.2fx\n", time_bad / time_good);
    return 0;
}
```

```
# perf can't read cache counters under WSL, so wall-clock time is the best we get
--- False sharing study (threads: 4) ---
Packed  (bad)  time: 0.2050 s
Aligned (good) time: 0.0257 s
Speedup: 7.98x
```

Interestingly, changing the padding width to 3 longs and rerunning `./nolock` three times gives:

| Run | Packed (bad) | Aligned (good) | Speedup |
| :-: | :--- | :--- | :--- |
| 1 | 0.3086 s | 0.0343 s | 9.01x |
| 2 | 0.4284 s | 0.3608 s | 1.19x |
| 3 | 0.3226 s | 0.0303 s | 10.65x |

The numbers jitter noticeably. Environmental noise (other work on the machine) plays a part, but padding of 3 longs clearly cannot guarantee separation: at 1 + 3 longs (32 bytes) per element, whether two counters land on the same cache line comes down to luck.

### alignas

* What `alignas(64)` does: it guarantees the variable's starting address is a multiple of 64. Placed inside `struct aligned_element` and declared as an array, the compiler automatically inserts exactly enough padding between elements so that each `data[i]` occupies its own cache line.
* Portability: if you hand-write `long padding[6]`, then on a system where `long` is 4 bytes the total shrinks (4 + 24 = 28 bytes) and the alignment breaks. `alignas(64)` always guarantees a 64-byte stride.

```c
// nolock_opt.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdalign.h>  // required for alignas

#define ITERATIONS 100000000
#define NUM_THREADS 4

// Case A: packed layout (bad)
// the array is contiguous: 4 longs = 32 bytes, all crammed into one
// 64-byte cache line
struct bad_struct {
    volatile long count[NUM_THREADS];
};

// Case B: aligned layout (good)
// force each element onto its own 64-byte boundary
struct aligned_element {
    alignas(64) volatile long count;
};

struct good_struct {
    struct aligned_element data[NUM_THREADS];
};

struct bad_struct s1;
struct good_struct s2;

// --- worker and timing logic (unchanged) ---
void* bench_bad(void* arg) {
    int id = *(int*)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        s1.count[id]++;
    }
    return NULL;
}

void* bench_good(void* arg) {
    int id = *(int*)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        s2.data[id].count++;
    }
    return NULL;
}

double get_time(void* (*func)(void*)) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, func, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main() {
    printf("--- False sharing study (threads: %d) ---\n", NUM_THREADS);
    printf("Cache-line alignment set to: 64 bytes\n\n");

    double time_bad = get_time(bench_bad);
    printf("Packed  (bad)  time: %.4f s\n", time_bad);

    double time_good = get_time(bench_good);
    printf("Aligned (good) time: %.4f s\n", time_good);

    printf("Speedup: %.2fx\n", time_bad / time_good);
    return 0;
}
```

Nine consecutive runs of `./nolock` with the `alignas(64)` layout:

| Run | Packed (bad) | Aligned (good) | Speedup |
| :-: | :--- | :--- | :--- |
| 1 | 0.4143 s | 0.0264 s | 15.69x |
| 2 | 0.2063 s | 0.0275 s | 7.49x |
| 3 | 0.1933 s | 0.0260 s | 7.44x |
| 4 | 0.2006 s | 0.0261 s | 7.67x |
| 5 | 0.2008 s | 0.0259 s | 7.76x |
| 6 | 0.3147 s | 0.0261 s | 12.05x |
| 7 | 0.4128 s | 0.0276 s | 14.93x |
| 8 | 0.2148 s | 0.0251 s | 8.55x |
| 9 | 0.1953 s | 0.0291 s | 6.70x |

### Cache Lines and the MESI Protocol

Modern memory is not accessed one byte at a time; it moves in 64-byte units called cache lines. So when thread A modifies data at address 0x100 while thread B modifies data at address 0x104:

1. If the two addresses fall in the same 64-byte cache line,
2. the hardware MESI (Modified, Exclusive, Shared, Invalid) protocol treats the entire cache line as modified, and
3. thread B's core must invalidate its copy and re-fetch the latest data from memory or from thread A's cache.

> This phenomenon is called **false sharing**: no sharing at the logical level, yet forced synchronization at the hardware level.

### Question

Under cache lines and MESI, false sharing happens constantly: when thread A modifies its part of a 64-byte line it shares with thread B, B's copy is treated as polluted and must be refreshed even though B's own data never changed. If thread A modifies the line again while B is still re-fetching, won't thread B loop forever re-fetching?

> This is called **cache ping-pong**. If thread A writes furiously, thread B really does fall into a cycle of "invalidated, re-fetch, invalidated again the moment the data arrives." It won't crash the program, but performance falls into a black hole.
> Still, even though this looks like an infinite loop, the CPU's bus arbitration steps in:
> * Behind MESI, all cores sit on a shared message channel (the bus). When core B wants to re-fetch:
>   1. Request: core B issues a Read-For-Ownership (RFO) message.
>   2. Arbitration: the bus does not favor core A forever; the hardware guarantees every core eventually gets control of the bus.
>   3. Forced stall: once core B wins control and begins its read, core A's write request is stalled until B's fetch completes.

#### Data Layout

Since the hardware manages memory 64 bytes at a time, our code needs a sense of those boundaries.

| Strategy | How | Effect |
| :--- | :--- | :--- |
| **Separation** | `padding` or `alignas(64)` | Different threads' data live in different "rooms". |
| **Privatization** | Thread Local Storage | Each thread accumulates privately and reports only at the end — no fighting at all. |
| **Read-only** | Make shared data `const` where possible | If the line stays in the **S (Shared)** state and nobody writes, everyone reads at full speed. |

In short, with no layout or locking strategy, when parallel **core A changes one byte, the 63 bytes around it on core B are wasted along with it.**

LAB3 https://hackmd.io/@1p0_VOUHTXGcHlBU9A0OEA/SywA3pXiWe
