Chun Min Chang
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.

      Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Explore these features while you wait
      Complete general settings
      Bookmark and like published notes
      Write a few more notes
      Complete general settings
      Write a few more notes
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.

    Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Explore these features while you wait
    Complete general settings
    Bookmark and like published notes
    Write a few more notes
    Complete general settings
    Write a few more notes
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    ## Final Project (Phase 1) Include : **Thread Lifecycle**, **Synchronization**, **False Sharing**, **Affinity** ### 題目 多核心異構環境下的矩陣特徵提取器 **(Parallel Matrix Feature Extractor)** ### Spec 給定一個巨大的二維矩陣 $M$ ($16384 \times 16384$,單精度浮點數),你需要撰寫一個平行程式來計算: 1. **全局總和 (Global Sum)**:矩陣所有元素的累加。 2. **行極值 (Row-wise Maxima)**:每一行最大的數值,存入一個陣列。 3. **效能報告**:量化在不同核心綁定策略下的執行效率。 #### Requirement 1 : Domain Decomposition * **任務**:將矩陣按「行 (Rows)」平均分配給 $N$ 個執行緒。 * **OS 連結**:利用 **Thread Affinity** 將每個執行緒固定在不同的 CPU 核心上,觀察 P-core 與 E-core 的計算速度落差。 #### Requirement 2 : Reduction Optimization * **禁忌**:嚴禁多個執行緒直接對一個全域變數加鎖(會產生嚴重的 Contention)。 * **要求**:每個執行緒必須先計算其分配區域的「局部總和 (Local Sum)」,最後再彙整。 #### Requirement 3 : False Sharing Preventing * **挑戰**:如果你建立一個儲存每個執行緒局部結果的陣列 `float results[NUM_THREADS]`,會觸發 False Sharing。 * **要求**:使用 **Padding** 或 **Thread-local Storage** 確保各核心的 L1 Cache 不會互相打架。 --- ### File Structure ```text ~project/ ├──Project1/ │ ├── src/ │ │ ├── main.c # 負責初始化矩陣、建立執行緒、匯總結果 │ │ ├── worker.c │ │ └── utils.h # 定義對齊用的結構體 │ ├── scripts/ │ │ └── run_bench.sh # 自動測試 1, 2, 4, 8, 16 執行緒的腳本 │ ├── Makefile # 需包含 -O3 與 -pthread │ └── report.md │ ├──Project1_affinity/ ├── src/ │ ├── main.c │ ├── worker.c # 核心計算邏輯 (含 Affinity 設定) │ └── utils.h ├── scripts/ │ └── run_bench.sh ├── Makefile └── report.md ``` --- ### report `report.md` 必須包含以下量化分析: 1. **加速比曲線 (Speedup Curve)**: * 計算 $S = T_1 / T_n$。 * **分析**:當執行緒數量超過實體核心(Physical Cores)進入超執行緒(Hyper-threading)時,加速比是否趨於平緩?為什麼? 2. **硬體異構性分析 (Heterogeneity)**: * 分別測試「只綁定 P-cores」與「只綁定 E-cores」的執行時間。 * **量化**:計算兩者的 **效能功耗比估算**(若有工具可測)或單純的運算速度倍率。 3. **Cache Miss 驗證**: * 使用 `perf stat -e L1-dcache-load-misses` 比較「有 Padding」與「無 Padding」版本的數據差異。 ### 啟動代碼片段 (Data Structure) ```c #define CACHELINE_SIZE 64 typedef struct { double local_sum; // Padding 確保每個 struct 佔用整倍數的 Cache Line char padding[CACHELINE_SIZE - sizeof(double)]; } __attribute__((aligned(CACHELINE_SIZE))) thread_data_t; ``` --- ### Step 1 : Structure 建立一個 $16384 \times 16384$ 的單精度浮點數 (float) 矩陣 $M$。 單精度浮點數佔用 4 Bytes(32 bits)。 * 總大小: $$16384 * 16384 * 4 Bytes = 1,073,741,824 Bytes = 1 GB$$ 1. 絕對不能宣告在 Stack 上 如果你寫 `float matrix[16384][16384]`,程式一執行就會立刻發生 Stack Overflow (堆疊溢位),因為作業系統預設給每個執行緒的 Stack 大小通常只有 8 MB > use `ulimit -s` to check default stack 2. 必須使用 Heap (堆積) 我們必須使用 malloc 或 calloc 來動態配置這 1 GB 的記憶體。 3. 使用一維陣列模擬二維陣列 為了極致的效能,我們不使用 `(float **)`來建立二維陣列,因為那會讓記憶體變得破碎。我們應該宣告一個巨大的一維陣列,這樣記憶體才是連續的 (Contiguous)。連續的記憶體對 CPU Cache 絕對友善,能大幅降低 Cache Miss。 > 這裡請複習Stack / Heap 的差異 #### `(float **)` ##### 1. 指標陣列層次(第一層) 當你宣告一個 `float **ptr` 並為其分配空間時(例如模擬二維陣列),你通常會先分配一個存放「指標」的空間: ```c float **ptr = malloc(rows * sizeof(float *)); ``` 在這個階段,`ptr` 指向的是一個連續的記憶體區塊,裡面存放著多個 `float *`(指標)。**在這個區塊內,指標與指標之間是連續的。** ##### 2. 實際資料層次(第二層) 問題通常出在這裡。當你為每一列分配空間時: ```c for (int i = 0; i < rows; i++) { ptr[i] = malloc(cols * sizeof(float)); } ``` * **指標之間不連續:** 每次呼叫 `malloc` 分配的 `ptr[i]` 區塊,在記憶體中的位置是由作業系統動態決定的。`ptr[0]` 的結尾與 `ptr[1]` 的開頭通常**不相鄰**。 * **區塊內部連續:** 在同一個 `ptr[i]` 所指向的 `cols` 個 `float` 之間,記憶體是連續的。 ##### 比較表:`float **` vs. 二維陣列 `float[R][C]` | 特性 | `float **ptr` (動態分配) | `float arr[R][C]` (靜態陣列) | | :--- | :--- | :--- | | **記憶體佈局** | **非線性**。每一列可能散落在記憶體各處。 | **線性**。所有資料都在一個完整的連續區塊。 | | **存取效率** | 較慢。需要兩次記憶體尋址 (Dereference)。 | 較快。編譯器可直接計算偏移量。 | | **彈性** | 高。每一列的長度可以不同(鋸齒陣列)。 | 低。每一列長度必須相同。 | --- ##### 如何讓 `float **` 的記憶體完全連續? 如果你需要使用 `ptr[i][j]` 的語法,又希望記憶體完全連續(為了效能或傳輸需求),你可以使用「一次性分配」技巧: ```c float **ptr = malloc(rows * sizeof(float *)); float *data = malloc(rows * cols * sizeof(float)); // 分配一整塊連續空間 for (int i = 0; i < rows; i++) { ptr[i] = &data[i * cols]; // 將指標指向連續空間的對應位置 } ``` #### Header : utils.h ```c /* * utils.h * version 1.0 * Created on: 2025/3/27 * Author: Jimmy */ #ifndef UTILS_H #define UTILS_H #define _GNU_SOURCE #include <stdalign.h> // 用於 alignas # define MATRIX_SIZE 16384 # define rows 16384 # define cols 16384 # define CACHE_LINE_SIZE 64 # define EXIT_FAILURE 1 typedef struct { alignas(CACHE_LINE_SIZE) long long final_count; } thread_data_t; // 等價於 // typedef struct { // double local_sum; // char padding[CACHELINE_SIZE - sizeof(double)]; // } __attribute__((aligned(CACHELINE_SIZE))) thread_data_t; // --------------------------------------------------------- // typedef struct { // double local_sum; // char padding[CACHELINE_SIZE - sizeof(double)]; // } alignas(CACHELINE_SIZE) thread_data_t; #endif // UTILS_H ``` #### main.c (ver. matrix default) ```c /* * main.c * version 1.0 * Created on: 2025/3/27 * Author: Jimmy */ #include <stdio.h> #include <stdlib.h> #include <time.h> #include "utils.h" int main () { printf("Default\n"); // size_t : unsigned integer type, auto-adjusted to the size of the platform (32-bit or 64-bit) size_t total_element = (size_t)rows * cols; // number of elements in the matrix = rows * cols printf("Total elements: %zu\n", total_element); // single precision float: 4 bytes // malloc return void* , float* can trasform void* to float* float *matrix = (float *)malloc(total_element * sizeof(float)); // float **ptr = malloc(rows * sizeof(float *)); // float *data = malloc(rows * cols * sizeof(float)); // 分配一整塊連續空間 // for (int i = 0; i < rows; i++) { // ptr[i] = &data[i * cols]; // 將指標指向連續空間的對應位置 // } if (matrix == NULL) { // log error message to a file name "stderr" fprintf(stderr, "Memory allocation failed\n"); return EXIT_FAILURE; } // time_t time(time_t *tloc); // time return the current calendar time as // a value of type time_t. // If tloc is non-NULL, // the return value is also stored // in the memory pointed to by tloc. srand (time(NULL)); // seed the random number generator with current time for (size_t i = 0; i < total_element; i++) { // RAND_MAX is a constant defined in stdlib.h, // which represents the maximum value returned by the rand function matrix[i] = (float)rand() / RAND_MAX; // Initialize with some values, e.g., 0.0, 1.0, ..., 99.0, then repeat } printf ("Matrix initialization completed.\n"); free (matrix); return 0; } ``` ##### Take out * **size_t** : auto-adjusted to the size of the platform (32-bit or 64-bit) * **time(NULL)**: 代表不使用指標接收時間,只取回傳值。 * **RAND_MAX**: 是標準庫定義的固定數字,不是函式。 * **(float)**: 強制轉型很重要,如果沒加,整數除法(例如 10 / 32767)會直接變成 0 * **($float^*$)**: 轉malloc (default return void*) 成float * **fprint**(log's name, message) #### Makefile ``` # Makefile for Project 1: Parallel Matrix Extraction # 1. 定義變數 (Variables) CC = gcc # CFLAGS 是一連串的編譯選項: # -Wall -Wextra: 開啟所有常見的警告。工程師最喜歡警告了,它能幫我們抓出潛在的 Bug。 # -O3: 最高等級的編譯器最佳化。因為我們要測效能,這個參數絕對不能少。 # -pthread: 告訴編譯器我們要連結 POSIX Thread 函式庫,這是多執行緒的靈魂。 # -g: 加入除錯資訊,萬一程式當掉 (Segmentation fault),我們可以用 gdb 追蹤。 CFLAGS = -Wall -Wextra -O3 -pthread -g # exe file name TARGET = extractor # 所有的 .c 原始碼檔案 SRCS = src/main.c src/worker.c # 預設目標 # make = 執行第一個 Target(通常就是 all)。 # make all = 明確指定執行名為 all 的 Target。 all: $(TARGET) # 編譯規則:告訴 Make 如何把 SRCS 編譯成 TARGET $(TARGET): $(SRCS) $(CC) $(CFLAGS) -o $(TARGET) $(SRCS) # 清理編譯出來的檔案 (方便重新編譯) clean: rm -f $(TARGET) ``` ### Step 2 Workload & Threads Lifecycle 依照threads數量分配,如 * Thread 0 負責 Row 0 ~ 4095 * Thread 1 負責 Row 4096 ~ 8191 以達成記憶體連續存取對齊 #### utils.h (modify ver 2) ```c // utils.h // below the thread_data_t typedef struct { int thread_id; // 執行緒的編號 (0, 1, 2...) size_t start_row; // 負責計算的起始行 size_t end_row; // 負責計算的結束行 (不包含此行) float *matrix; // 整個矩陣的指標 thread_data_t *thread_results; // 指向我們防 False Sharing 的結果陣列 } thread_args_t; void* worker_task(void* arg); ``` #### worker.c ```c #include "utils.h" void* worker_task(void* arg) { // 1. transform void* arg to thread struct, pointer to the struct we defined in utils.h thread_args_t *args = (thread_args_t*)arg; // here we can access the fields of the struct, e.g., args->thread_id, args->start_row, etc. int tid = args->thread_id; double sum = 0.0; // 暫存用的局部變數,放在 CPU Register 裡最快! // 2. Start work:只跑自己的 Rows for (size_t i = args->start_row; i < args->end_row; i++) { // Columns for (size_t j = 0; j < MATRIX_SIZE; j++) { // 一維陣列模擬二維陣列的公式: 索引 = (行號 * 總列數) + 列號 sum += args->matrix[i * MATRIX_SIZE + j]; } } // 3. 計算完畢,把結果寫入專屬的記憶體位置 (已 Padding 過,不會有 False Sharing) args->thread_results[tid].local_sum = sum; // print log to console printf("[Thread %d] 完成計算 Rows %zu ~ %zu,局部總和: %f\n", tid, args->start_row, args->end_row - 1, sum); return NULL; // 執行緒結束 } ``` #### main.c version 2 這裡就該宣告的宣告,該設置的設置 > 這裡的NUM_THREADS是先用#define for 在這裡其實可以手動展開如join,但僅限於define下,理想還是直接用openmp自動展開 ```c // ... after random init matrix // ========================================== // Benchmark 開始:記錄起始時間 // ========================================== struct timespec start_time, end_time; clock_gettime(CLOCK_MONOTONIC, &start_time); // ========================================== pthread_t threads[num_threads]; thread_args_t thread_args[num_threads]; thread_data_t thread_results[num_threads]; // check if thread_results is correctly allocated float *row_max_array = malloc(MATRIX_SIZE * sizeof(float)); size_t rows_per_thread = MATRIX_SIZE / num_threads; printf("\n %d 個thread開始作業...\n", num_threads); for (int i = 0; i < num_threads; i++) { thread_args[i].thread_id = i; thread_args[i].start_row = i * rows_per_thread; // 如果是最後一個執行緒,就包辦剩下的所有行數 (避免無法整除的問題) thread_args[i].end_row = (i == num_threads - 1) ? MATRIX_SIZE : (i + 1) * rows_per_thread; thread_args[i].matrix = matrix; thread_args[i].thread_results = thread_results; thread_args[i].row_max_array = row_max_array; // 建立執行緒並執行 worker_task pthread_create(&threads[i], NULL, worker_task, &thread_args[i]); } for (int i = 0; i < num_threads; i++) { pthread_join(threads[i], NULL); } double global_sum = 0.0; for (int i = 0; i < num_threads; i++) { global_sum += thread_results[i].local_sum; } // ========================================== // Benchmark 結束:記錄結束時間並計算差值 // ========================================== clock_gettime(CLOCK_MONOTONIC, &end_time); double elapsed_time = (end_time.tv_sec - start_time.tv_sec) + (end_time.tv_nsec - start_time.tv_nsec) / 1e9; printf("\n>>> 最終全局總和 (Global Sum): %f <<<\n", global_sum); printf(">>> 核心計算耗時: %f 秒 <<<\n", elapsed_time); printf("\n所有執行緒完成計算,矩陣總和: %f\n", global_sum); // ========================================== // 結果驗證階段 (Sequential Verification) // ========================================== printf("\n>>> 啟動結果驗證 (計算單執行緒基準值以供核對)... <<<\n"); double expected_global_sum = 0.0; int validation_passed = 1; for (size_t i = 0; i < MATRIX_SIZE; i++) { // 先預設該行的第一個元素是最大值 float expected_row_max = matrix[i * MATRIX_SIZE]; for (size_t j = 0; j < MATRIX_SIZE; j++) { float val = matrix[i * MATRIX_SIZE + j]; expected_global_sum += val; // 計算單執行緒總和 // 計算單執行緒極值 if (val > expected_row_max) { expected_row_max = val; } } // 1. 驗證行極值:將單執行緒算出的極值,與多執行緒陣列中的結果比對 if (row_max_array[i] != expected_row_max) { printf(" error:Row %zu 的極值不吻合!預期: %f, 實際: %f\n", i, expected_row_max, row_max_array[i]); validation_passed = 0; break; // 錯一個就不用比了,直接中斷 } } // 2. 驗證全局總和 // 浮點數運算會有微小的精度誤差 (Epsilon),尤其是加法順序不同的時候。 // 所以我們不能用 == 來比對浮點數,必須取相減的絕對值,確認誤差小於一個極小值。 double diff = expected_global_sum - global_sum; if (diff < 0) diff = -diff; // 簡單的絕對值處理 if (diff > 1e-2) { // 容許 0.01 的浮點數累加誤差 printf("error:全局總和不吻合!預期: %f, 實際: %f\n", expected_global_sum, global_sum); validation_passed = 0; } // 最終宣判 if (validation_passed) { printf("Pass\n"); } else { printf("Fail\n"); } // ========================================== // release memory free(threads); free(thread_args); free(thread_results); free(row_max_array); free(matrix); printf("Matrix memory freed.\n"); return 0; } ``` ##### return ```bash 所有執行緒完成計算,矩陣總和: 134211673.527968 >>> 啟動結果驗證 (計算單執行緒基準值以供核對)... <<< Pass double free or corruption (out) Aborted (core dumped) ``` 這裡是源自於記憶體對齊 (Memory Alignment)引發了堆積 (Heap) 結構破壞 呼叫` malloc() `或 `calloc() `時,作業系統不僅僅會給你一塊記憶體,它還會在這塊記憶體前塞入一些管理資料 **(Heap Metadata)**,用來記錄這塊記憶體有多大。當你呼叫 free() 時,系統會去讀取這些隱藏的資料來決定要回收多少空間。 ```c // utils.h內,這裡要求了需要對齊64 typedef struct { double local_sum; alignas(CACHE_LINE_SIZE) long long final_count; } thread_data_t; ``` 但` malloc() calloc()`卻不會指定起始位置,只會給你正確的大小,這就導致了汙染,雖然不影響計算數值,但最終的free讀到汙染後會讓OS給出 corruption 並強制 Abortion 這裡可以改用`posix_memalign`取代calloc ```c thread_data_t *results = NULL; if (posix_memalign((void **)&results, CACHELINE_SIZE, num_threads * sizeof(thread_data_t)) != 0) { fprintf(stderr, "錯誤:對齊記憶體配置失敗!\n"); return 1; } ``` 或直接把下方的`free`註解掉,不過這是有風險的,stack空間小,如果今天照原樣,而宣告過多,則有塞滿的風險,故可以考慮改放heap ```c // ========== stack version ========== pthread_t threads[num_threads]; thread_args_t thread_args[num_threads]; thread_data_t thread_results[num_threads]; // check if thread_results is correctly allocated float *row_max_array = malloc(MATRIX_SIZE * sizeof(float)); // ========== heap version ========== // pthread_t *threads = malloc(num_threads * sizeof(pthread_t)); // thread_args_t *thread_args = malloc(num_threads * sizeof(thread_args_t)); // thread_data_t *thread_results = NULL; // posix_memalign((void **)&thread_results, 64, num_threads * sizeof(thread_data_t)); ``` ### Step 3 bash and plot 建立自動測試腳本與繪圖 ```bash # ~/scripts/run_bench.sh #!/bin/bash cd "$(dirname "$0")/.." || exit 1 echo "===============================================" echo " Parallel Matrix Extractor Benchmark Script " echo "===============================================" # 1. 自動重新編譯 echo "Recompiling..." make clean > /dev/null 2>&1 make > /dev/null 2>&1 if [ $? -ne 0 ]; then echo "Compile Error!plz check code。" exit 1 fi OUT_DIR="results" mkdir -p "$OUT_DIR" # Make sure the plots directory exists too mkdir -p "$OUT_DIR/plots" CSV_FILE="$OUT_DIR/benchmark_results.csv" PNG_FILE="$OUT_DIR/plots/speedup_curve.png" # 定義要測試的執行緒數與採樣次數 THREADS=(1 2 4 8 16) RUNS=1 CSV_FILE="$OUT_DIR/benchmark_results.csv" PNG_FILE="$OUT_DIR/plots/speedup_curve.png" # 初始化 CSV 檔案並寫入標題 echo "Threads,AverageTime" > "$CSV_FILE" echo "Compile Success!Start Benchmarking(Each number of threads will run $RUNS times)..." echo "------------------------------------------" # 2. 雙重迴圈:測試不同執行緒數,每次跑 5 遍 for t in "${THREADS[@]}" do echo -n "$t threads... " TOTAL_TIME=0 for ((i=1; i<=RUNS; i++)) do # 執行程式並將結果存入變數 OUTPUT OUTPUT=$(./extractor $t) # 使用 grep 找出含有「耗時」的那一行,並用正則表達式提取出浮點數數字 TIME=$(echo "$OUTPUT" | grep "核心計算耗時" | grep -oE '[0-9]+\.[0-9]+') # 萬一抓不到數字的防呆機制 if [ -z "$TIME" ]; then echo "Error:Fetching time error, plz check main.c format" exit 1 fi # 使用 awk 進行浮點數累加 TOTAL_TIME=$(awk "BEGIN {print $TOTAL_TIME + $TIME}") done # 計算 5 次的平均值 AVG_TIME=$(awk "BEGIN {print $TOTAL_TIME / $RUNS}") echo "Avg Time: $AVG_TIME 秒" # 將結果寫入 CSV 檔案 echo "$t,$AVG_TIME" >> $CSV_FILE done echo "------------------------------------------" echo "Data collected!Save in $CSV_FILE" echo "Ready to plot (Speedup Curve)..." # 3. 自動生成 Python 繪圖腳本 cat << 'EOF' > plot_speedup.py import csv import sys import matplotlib.pyplot as plt input_csv = sys.argv[1] output_png = sys.argv[2] try: threads = [] times = [] # 讀取剛剛 Bash 產生的 CSV with open(input_csv, 'r') as f: reader = csv.DictReader(f) for row in reader: threads.append(int(row['Threads'])) times.append(float(row['AverageTime'])) # 計算加速比 Speedup = T1 / Tn t1_time = times[0] speedup = [t1_time / t for t in times] # 設定圖表樣式 plt.figure(figsize=(8, 5)) # 畫出實際加速比 (藍線) plt.plot(threads, speedup, marker='o', linestyle='-', color='b', label='Actual Speedup') # 畫出理想加速比 (紅虛線:如果 2 個 Thread 快 2 倍,4 個快 4 倍的理論值) plt.plot(threads, threads, linestyle='--', color='r', alpha=0.6, label='Ideal Speedup') plt.title('Parallel Matrix Extractor - Speedup Curve') plt.xlabel('Number of Threads') plt.ylabel('Speedup (T1 / Tn)') plt.xticks(threads) plt.grid(True, linestyle='--', alpha=0.7) plt.legend() # 儲存圖片 plt.savefig(output_png, dpi=300) print(f"Success!Save as {output_png}") except ImportError: print("Error, lack of python module: matplotlib。") print("Solution:Checked if'pip install matplotlib' ,execute 'python3 plot_speedup.py'。") except Exception as e: print(f"plot error: {e}") EOF # 4. 呼叫 Python 執行繪圖 python3 plot_speedup.py "$CSV_FILE" "$PNG_FILE" rm plot_speedup.py ``` #### Result ``` Threads,AverageTime 1,0.294057 2,0.165808 4,0.084245 8,0.037628 16,0.044957 ``` ![speedup_curve](https://hackmd.io/_uploads/SJ_W6RDsbl.png) ### Step 4 Thread Affinity 作業系統 (OS) 的排程器容易將task依據core的忙碌狀態做負載平衡,而task遷移時TLB、L1、L2等資料的報廢與更新將造成巨大的開銷 同時,在Intel的處理器中,有不少型號都是有P/E核的配置,這使得核心間的效能有所差距 #### utils.h ```c typedef struct { int thread_id; // 執行緒的編號 (0, 1, 2...) int core_id; // 執行緒綁定的核心編號 size_t start_row; // 負責計算的起始行 size_t end_row; // 負責計算的結束行 (不包含此行) float *matrix; // 整個矩陣的指標 thread_data_t *thread_results; // 指向結果陣列 float *row_max_array; // 指向每行最大值的陣列 } thread_args_t; ``` #### worker.c ```c /* * worker.c * version 2.0 * Modified ver 2.0 on: 2025/4/1 * Modified Content : affinity * Created on: 2025/3/27 * Author: Jimmy */ // #define _GNU_SOURCE here or in utils.h (the head of the file) #include "utils.h" void* worker_task(void* arg) { // 1. transform void* arg to thread struct, pointer to the struct we defined in utils.h thread_args_t *args = (thread_args_t*)arg; // here we can access the fields of the struct, e.g., args->thread_id, args->start_row, etc. int tid = args->thread_id; int core_id = args->core_id; // 從參數結構中獲取核心 ID // ========================================== // 執行緒綁核 (Thread Affinity) 邏輯 // ========================================== cpu_set_t cpuset; CPU_ZERO(&cpuset); // 1. 清空 CPU 集合 CPU_SET(core_id, &cpuset); // 2. 將我們指定的 core_id 加入集合 // 取得當前執行緒並強制綁定 pthread_t current_thread = pthread_self(); int affinity_status = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset); if (affinity_status != 0) { printf("[Thread %d] 警告:無法綁定到 Core %d\n", tid, core_id); } else { // 印出來確認有綁定成功 printf("[Thread %d] 成功綁在 CPU Core %d\n", tid, core_id); } // ========================================== double sum = 0.0; // 暫存用的局部變數,放在 CPU Register 裡最快! // 2. Start work:只跑自己的 Rows for (size_t i = args->start_row; i < args->end_row; i++) { // Columns float current_row_max = args->matrix[i * MATRIX_SIZE]; for (size_t j = 0; j < MATRIX_SIZE; j++) { // 一維陣列模擬二維陣列的公式: 索引 = (行號 * 總列數) + 列號 sum += args->matrix[i * MATRIX_SIZE + j]; if (args->matrix[i * MATRIX_SIZE + j] > current_row_max) { current_row_max = args->matrix[i * MATRIX_SIZE + j]; } } args->row_max_array[i] = current_row_max; } // 3. 計算完畢,把結果寫入專屬的記憶體位置 (已 Padding 過,不會有 False Sharing) args->thread_results[tid].local_sum = sum; // print log to console printf("[Thread %d] 完成計算 Rows %zu ~ %zu,局部總和: %f\n", tid, args->start_row, args->end_row - 1, sum); return NULL; // 執行緒結束 } ``` #### main.c ```c size_t rows_per_thread = MATRIX_SIZE / num_threads; printf("\n %d 個thread開始作業...\n", num_threads); for (int i = 0; i < num_threads; i++) { thread_args[i].thread_id = i; thread_args[i].core_id = i; // <- 先按順序綁 thread_args[i].start_row = i * rows_per_thread; // 如果是最後一個執行緒,就包辦剩下的所有行數 (避免無法整除的問題) thread_args[i].end_row = (i == num_threads - 1) ? MATRIX_SIZE : (i + 1) * rows_per_thread; thread_args[i].matrix = matrix; thread_args[i].thread_results = thread_results; thread_args[i].row_max_array = row_max_array; // 建立執行緒並執行 worker_task pthread_create(&threads[i], NULL, worker_task, &thread_args[i]); } ``` ``` # 查看CPU狀況 lscpu -e # return CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE 0 0 0 0 0:0:0:0 yes 1 0 0 0 0:0:0:0 yes 2 0 0 1 1:1:1:0 yes 3 0 0 1 1:1:1:0 yes 4 0 0 2 2:2:2:0 yes 5 0 0 2 2:2:2:0 yes 6 0 0 3 3:3:3:0 yes 7 0 0 3 3:3:3:0 yes ``` 依照cache,core0 由CPU0和1組成,而我並沒有P/E核的問題 這個core的切割,依賴於**Hyper-Threading** ### Step 5 exetreme condition * 策略 A:資源獨享 (最佳效能) 把 4 個執行緒分配到完全不同的實體核心上。 * 綁核目標:CPU 0, 2, 4, 6。 * 優勢:每個人都有自己獨立的 L1/L2 Cache,計算單元完全不用跟別人擠。 * 策略 B:資源互卡 (最差效能) * 把threads塞進同一個實體核心裡。 綁核目標:CPU 0, 1, 2, 3。 * 劣勢:Thread 0 和 Thread 1 被迫擠在 CORE 0 裡面,它們必須搶同一個 L1 Cache 和 ALU(算術邏輯單元),效能絕對會互相拖累 ```c size_t rows_per_thread = MATRIX_SIZE / num_threads; printf("\n %d 個thread開始作業...\n", num_threads); // ================= Affinity ========================= // section 2 // int core_map[] = {0, 2, 4, 6, 8, 10, 12, 14}; // section 3 int core_map[] = {0, 1}; int map_size = sizeof(core_map) / sizeof(core_map[0]); // ==================================================== for (int i = 0; i < num_threads; i++) { thread_args[i].thread_id = i; // section 1 // thread_args[i].core_id = i; // section 2 & 3 thread_args[i].core_id = core_map[i % map_size]; // 下方照舊 ``` #### return (2 core, 8 threads) ``` 8 個thread開始作業... [Thread 0] 成功綁在 CPU Core 0 [Thread 1] 成功綁在 CPU Core 1 [Thread 3] 成功綁在 CPU Core 1 [Thread 2] 成功綁在 CPU Core 0 [Thread 5] 成功綁在 CPU Core 1 [Thread 6] 成功綁在 CPU Core 0 [Thread 4] 成功綁在 CPU Core 0 [Thread 7] 成功綁在 CPU Core 1 >>> 核心計算耗時: 0.140819 秒 <<< ``` #### return (8 core, 8 threads) ``` 8 個thread開始作業... [Thread 0] 成功綁在 CPU Core 0 [Thread 2] 成功綁在 CPU Core 4 [Thread 3] 成功綁在 CPU Core 6 [Thread 7] 成功綁在 CPU Core 14 [Thread 6] 成功綁在 CPU Core 12 [Thread 4] 成功綁在 CPU Core 8 [Thread 5] 成功綁在 CPU Core 10 [Thread 1] 成功綁在 CPU Core 2 >>> 核心計算耗時: 0.071867 秒 <<< ``` #### return (16 core, 8 threads) ``` [Thread 1] 成功綁在 CPU Core 1 [Thread 5] 成功綁在 CPU Core 5 [Thread 6] 成功綁在 CPU Core 6 [Thread 3] 成功綁在 CPU Core 3 [Thread 4] 成功綁在 CPU Core 4 [Thread 2] 成功綁在 CPU Core 2 [Thread 0] 成功綁在 CPU Core 0 [Thread 7] 成功綁在 CPU Core 7 >>> 核心計算耗時: 0.059652 秒 <<< ``` ### Step 6 without padding 最後要做 False Sharing Defense 與 Cache Miss 驗證,若無padding * Thread 0 的 local_sum 和 Thread 1 的 local_sum 會被塞進同一個 64 Bytes 的 Cache Line 裡。 * 當 Thread 0 更新自己的總和時,CPU 會為了保證資料一致性,強制把 Thread 1 核心裡的同一個 Cache Line 標記為「失效 (Invalid)」。 * Thread 1 下一次要寫入時,就發生了 Cache Miss,必須重新去主記憶體抓資料。 最後最慘就變成pingpong,OS介入 #### utils.h 增加 ``` # define USE_PADDING 1 // 1: 有防禦,0: 無防禦 # if USE_PADDING // [有防禦] 在 local_sum 後面加上足夠的 padding,確保每個 thread_data_t 都獨占一個 Cache Line,避免 False Sharing typedef struct { double local_sum; alignas(CACHE_LINE_SIZE) long long final_count; } thread_data_t; #else // [無防禦] 乾淨溜溜的結構體,多個 local_sum 會擠在同一個 Cache Line 裡互相傷害 typedef struct { double local_sum; } thread_data_t; #endif ``` #### worker.c ```c /* * worker.c * version 3.0 * Modified ver 3.0 on: 2025/4/1 * Modified Content : affinity, padding * Created on: 2025/3/27 * Author: Jimmy */ // #define _GNU_SOURCE here or in utils.h (the head of the file) #include "utils.h" void* worker_task(void* arg) { // 1. transform void* arg to thread struct, pointer to the struct we defined in utils.h thread_args_t *args = (thread_args_t*)arg; // here we can access the fields of the struct, e.g., args->thread_id, args->start_row, etc. int tid = args->thread_id; int core_id = args->core_id; // 從參數結構中獲取核心 ID // ========================================== // 執行緒綁核 (Thread Affinity) 邏輯 // ========================================== cpu_set_t cpuset; CPU_ZERO(&cpuset); // 1. 清空 CPU 集合 CPU_SET(core_id, &cpuset); // 2. 將我們指定的 core_id 加入集合 // 取得當前執行緒並強制綁定 pthread_t current_thread = pthread_self(); int affinity_status = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset); if (affinity_status != 0) { printf("[Thread %d] 警告:無法綁定到 Core %d\n", tid, core_id); } else { // 印出來確認有綁定成功 printf("[Thread %d] 成功綁在 CPU Core %d\n", tid, core_id); } // ========================================== #if USE_PADDING_WORKER double sum = 0.0; // 暫存用的局部變數,放在 CPU Register 裡最快! #else args->thread_results[tid].local_sum = 0.0; #endif // 2. Start work:只跑自己的 Rows for (size_t i = args->start_row; i < args->end_row; i++) { // Columns float current_row_max = args->matrix[i * MATRIX_SIZE]; for (size_t j = 0; j < MATRIX_SIZE; j++) { // 一維陣列模擬二維陣列的公式: 索引 = (行號 * 總列數) + 列號 #if USE_PADDING_WORKER sum += args->matrix[i * MATRIX_SIZE + j]; # else args->thread_results[tid].local_sum += args->matrix[i * MATRIX_SIZE + j]; #endif if (args->matrix[i * MATRIX_SIZE + j] > current_row_max) { current_row_max = args->matrix[i * MATRIX_SIZE + j]; } } args->row_max_array[i] = current_row_max; } // 3. 計算完畢,把結果寫入專屬的記憶體位置 (已 Padding 過,不會有 False Sharing) #if USE_PADDING_WORKER args->thread_results[tid].local_sum = sum; printf("[Thread %d] 完成計算 Rows %zu ~ %zu,局部總和: %f\n", tid, args->start_row, args->end_row - 1, sum); #else printf("[Thread %d] 完成計算 Rows %zu ~ %zu,局部總和: %f\n", tid, args->start_row, args->end_row - 1, args->thread_results[tid].local_sum); #endif // print log to console return NULL; // 執行緒結束 } ``` 結果其實不會差到太多,Compiler的優化有一定的影響,加上無法看到L1 miss 所以在utils.h內加上`volatile` ```c # if USE_PADDING typedef struct { double local_sum; alignas(CACHE_LINE_SIZE) long long final_count; } thread_data_t; #else typedef struct { volatile double local_sum; } thread_data_t; #endif ``` ``` # padding >>> 核心計算耗時: 0.037044 秒 <<< # without padding >>> 核心計算耗時: 0.193459 秒 <<< ``` END

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password
    or
    Sign in via Facebook Sign in via X(Twitter) Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    By signing in, you agree to our terms of service.

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully