Phase 1 Final Project

## Final Project (Phase 1) Include : **Thread Lifecycle**, **Synchronization**, **False Sharing**, **Affinity** ### 題目多核心異構環境下的矩陣特徵提取器 **(Parallel Matrix Feature Extractor)** ### Spec 給定一個巨大的二維矩陣 $M$ ($16384 \times 16384$，單精度浮點數)，你需要撰寫一個平行程式來計算： 1. **全局總和 (Global Sum)**：矩陣所有元素的累加。 2. **行極值 (Row-wise Maxima)**：每一行最大的數值，存入一個陣列。 3. **效能報告**：量化在不同核心綁定策略下的執行效率。 #### Requirement 1 : Domain Decomposition * **任務**：將矩陣按「行 (Rows)」平均分配給 $N$ 個執行緒。 * **OS 連結**：利用 **Thread Affinity** 將每個執行緒固定在不同的 CPU 核心上，觀察 P-core 與 E-core 的計算速度落差。 #### Requirement 2 : Reduction Optimization * **禁忌**：嚴禁多個執行緒直接對一個全域變數加鎖（會產生嚴重的 Contention）。 * **要求**：每個執行緒必須先計算其分配區域的「局部總和 (Local Sum)」，最後再彙整。 #### Requirement 3 : False Sharing Preventing * **挑戰**：如果你建立一個儲存每個執行緒局部結果的陣列 `float results[NUM_THREADS]`，會觸發 False Sharing。 * **要求**：使用 **Padding** 或 **Thread-local Storage** 確保各核心的 L1 Cache 不會互相打架。 --- ### File Structure ```text ~project/ ├──Project1/ │ ├── src/ │ │ ├── main.c # 負責初始化矩陣、建立執行緒、匯總結果 │ │ ├── worker.c │ │ └── utils.h # 定義對齊用的結構體 │ ├── scripts/ │ │ └── run_bench.sh # 自動測試 1, 2, 4, 8, 16 執行緒的腳本 │ ├── Makefile # 需包含 -O3 與 -pthread │ └── report.md │ ├──Project1_affinity/ ├── src/ │ ├── main.c │ ├── worker.c # 核心計算邏輯 (含 Affinity 設定) │ └── utils.h ├── scripts/ │ └── run_bench.sh ├── Makefile └── report.md ``` --- ### report `report.md` 必須包含以下量化分析： 1. **加速比曲線 (Speedup Curve)**： * 計算 $S = T_1 / T_n$。 * **分析**：當執行緒數量超過實體核心（Physical Cores）進入超執行緒（Hyper-threading）時，加速比是否趨於平緩？為什麼？ 2. **硬體異構性分析 (Heterogeneity)**： * 分別測試「只綁定 P-cores」與「只綁定 E-cores」的執行時間。 * **量化**：計算兩者的 **效能功耗比估算**（若有工具可測）或單純的運算速度倍率。 3. **Cache Miss 驗證**： * 使用 `perf stat -e L1-dcache-load-misses` 比較「有 Padding」與「無 Padding」版本的數據差異。 ### 啟動代碼片段 (Data Structure) ```c #define CACHELINE_SIZE 64 typedef struct { double local_sum; // Padding 確保每個 struct 佔用整倍數的 Cache Line char padding[CACHELINE_SIZE - sizeof(double)]; } __attribute__((aligned(CACHELINE_SIZE))) thread_data_t; ``` --- ### Step 1 : Structure 建立一個 $16384 \times 16384$ 的單精度浮點數 (float) 矩陣 $M$。單精度浮點數佔用 4 Bytes(32 bits)。 * 總大小： $$16384 * 16384 * 4 Bytes = 1,073,741,824 Bytes = 1 GB$$ 1. 絕對不能宣告在 Stack 上如果你寫 `float matrix[16384][16384]`，程式一執行就會立刻發生 Stack Overflow (堆疊溢位)，因為作業系統預設給每個執行緒的 Stack 大小通常只有 8 MB > use `ulimit -s` to check default stack 2. 必須使用 Heap (堆積) 我們必須使用 malloc 或 calloc 來動態配置這 1 GB 的記憶體。 3. 使用一維陣列模擬二維陣列為了極致的效能，我們不使用 `(float **)`來建立二維陣列，因為那會讓記憶體變得破碎。我們應該宣告一個巨大的一維陣列，這樣記憶體才是連續的 (Contiguous)。連續的記憶體對 CPU Cache 絕對友善，能大幅降低 Cache Miss。 > 這裡請複習Stack / Heap 的差異 #### `(float **)` ##### 1. 指標陣列層次（第一層）當你宣告一個 `float **ptr` 並為其分配空間時（例如模擬二維陣列），你通常會先分配一個存放「指標」的空間： ```c float **ptr = malloc(rows * sizeof(float *)); ``` 在這個階段，`ptr` 指向的是一個連續的記憶體區塊，裡面存放著多個 `float *`（指標）。**在這個區塊內，指標與指標之間是連續的。** ##### 2. 實際資料層次（第二層）問題通常出在這裡。當你為每一列分配空間時： ```c for (int i = 0; i < rows; i++) { ptr[i] = malloc(cols * sizeof(float)); } ``` * **指標之間不連續：** 每次呼叫 `malloc` 分配的 `ptr[i]` 區塊，在記憶體中的位置是由作業系統動態決定的。`ptr[0]` 的結尾與 `ptr[1]` 的開頭通常**不相鄰**。 * **區塊內部連續：** 在同一個 `ptr[i]` 所指向的 `cols` 個 `float` 之間，記憶體是連續的。 ##### 比較表：`float **` vs. 二維陣列 `float[R][C]` | 特性 | `float **ptr` (動態分配) | `float arr[R][C]` (靜態陣列) | | :--- | :--- | :--- | | **記憶體佈局** | **非線性**。每一列可能散落在記憶體各處。 | **線性**。所有資料都在一個完整的連續區塊。 | | **存取效率** | 較慢。需要兩次記憶體尋址 (Dereference)。 | 較快。編譯器可直接計算偏移量。 | | **彈性** | 高。每一列的長度可以不同（鋸齒陣列）。 | 低。每一列長度必須相同。 | --- ##### 如何讓 `float **` 的記憶體完全連續？如果你需要使用 `ptr[i][j]` 的語法，又希望記憶體完全連續（為了效能或傳輸需求），你可以使用「一次性分配」技巧： ```c float **ptr = malloc(rows * sizeof(float *)); float *data = malloc(rows * cols * sizeof(float)); // 分配一整塊連續空間 for (int i = 0; i < rows; i++) { ptr[i] = &data[i * cols]; // 將指標指向連續空間的對應位置 } ``` #### Header : utils.h ```c /* * utils.h * version 1.0 * Created on: 2025/3/27 * Author: Jimmy */ #ifndef UTILS_H #define UTILS_H #define _GNU_SOURCE #include <stdalign.h> // 用於 alignas # define MATRIX_SIZE 16384 # define rows 16384 # define cols 16384 # define CACHE_LINE_SIZE 64 # define EXIT_FAILURE 1 typedef struct { alignas(CACHE_LINE_SIZE) long long final_count; } thread_data_t; // 等價於 // typedef struct { // double local_sum; // char padding[CACHELINE_SIZE - sizeof(double)]; // } __attribute__((aligned(CACHELINE_SIZE))) thread_data_t; // --------------------------------------------------------- // typedef struct { // double local_sum; // char padding[CACHELINE_SIZE - sizeof(double)]; // } alignas(CACHELINE_SIZE) thread_data_t; #endif // UTILS_H ``` #### main.c (ver. matrix default) ```c /* * main.c * version 1.0 * Created on: 2025/3/27 * Author: Jimmy */ #include <stdio.h> #include <stdlib.h> #include <time.h> #include "utils.h" int main () { printf("Default\n"); // size_t : unsigned integer type, auto-adjusted to the size of the platform (32-bit or 64-bit) size_t total_element = (size_t)rows * cols; // number of elements in the matrix = rows * cols printf("Total elements: %zu\n", total_element); // single precision float: 4 bytes // malloc return void* , float* can trasform void* to float* float *matrix = (float *)malloc(total_element * sizeof(float)); // float **ptr = malloc(rows * sizeof(float *)); // float *data = malloc(rows * cols * sizeof(float)); // 分配一整塊連續空間 // for (int i = 0; i < rows; i++) { // ptr[i] = &data[i * cols]; // 將指標指向連續空間的對應位置 // } if (matrix == NULL) { // log error message to a file name "stderr" fprintf(stderr, "Memory allocation failed\n"); return EXIT_FAILURE; } // time_t time(time_t *tloc); // time return the current calendar time as // a value of type time_t. // If tloc is non-NULL, // the return value is also stored // in the memory pointed to by tloc. srand (time(NULL)); // seed the random number generator with current time for (size_t i = 0; i < total_element; i++) { // RAND_MAX is a constant defined in stdlib.h, // which represents the maximum value returned by the rand function matrix[i] = (float)rand() / RAND_MAX; // Initialize with some values, e.g., 0.0, 1.0, ..., 99.0, then repeat } printf ("Matrix initialization completed.\n"); free (matrix); return 0; } ``` ##### Take out * **size_t** : auto-adjusted to the size of the platform (32-bit or 64-bit) * **time(NULL)**: 代表不使用指標接收時間，只取回傳值。 * **RAND_MAX**: 是標準庫定義的固定數字，不是函式。 * **(float)**: 強制轉型很重要，如果沒加，整數除法（例如 10 / 32767）會直接變成 0 * **($float^*$)**: 轉malloc (default return void*) 成float * **fprint**(log's name, message) #### Makefile ``` # Makefile for Project 1: Parallel Matrix Extraction # 1. 定義變數 (Variables) CC = gcc # CFLAGS 是一連串的編譯選項： # -Wall -Wextra: 開啟所有常見的警告。工程師最喜歡警告了，它能幫我們抓出潛在的 Bug。 # -O3: 最高等級的編譯器最佳化。因為我們要測效能，這個參數絕對不能少。 # -pthread: 告訴編譯器我們要連結 POSIX Thread 函式庫，這是多執行緒的靈魂。 # -g: 加入除錯資訊，萬一程式當掉 (Segmentation fault)，我們可以用 gdb 追蹤。 CFLAGS = -Wall -Wextra -O3 -pthread -g # exe file name TARGET = extractor # 所有的 .c 原始碼檔案 SRCS = src/main.c src/worker.c # 預設目標 # make = 執行第一個 Target（通常就是 all）。 # make all = 明確指定執行名為 all 的 Target。 all: $(TARGET) # 編譯規則：告訴 Make 如何把 SRCS 編譯成 TARGET $(TARGET): $(SRCS) $(CC) $(CFLAGS) -o $(TARGET) $(SRCS) # 清理編譯出來的檔案 (方便重新編譯) clean: rm -f $(TARGET) ``` ### Step 2 Workload & Threads Lifecycle 依照threads數量分配，如 * Thread 0 負責 Row 0 ~ 4095 * Thread 1 負責 Row 4096 ~ 8191 以達成記憶體連續存取對齊 #### utils.h (modify ver 2) ```c // utils.h // below the thread_data_t typedef struct { int thread_id; // 執行緒的編號 (0, 1, 2...) size_t start_row; // 負責計算的起始行 size_t end_row; // 負責計算的結束行 (不包含此行) float *matrix; // 整個矩陣的指標 thread_data_t *thread_results; // 指向我們防 False Sharing 的結果陣列 } thread_args_t; void* worker_task(void* arg); ``` #### worker.c ```c #include "utils.h" void* worker_task(void* arg) { // 1. transform void* arg to thread struct, pointer to the struct we defined in utils.h thread_args_t *args = (thread_args_t*)arg; // here we can access the fields of the struct, e.g., args->thread_id, args->start_row, etc. int tid = args->thread_id; double sum = 0.0; // 暫存用的局部變數，放在 CPU Register 裡最快！ // 2. Start work：只跑自己的 Rows for (size_t i = args->start_row; i < args->end_row; i++) { // Columns for (size_t j = 0; j < MATRIX_SIZE; j++) { // 一維陣列模擬二維陣列的公式：索引 = (行號 * 總列數) + 列號 sum += args->matrix[i * MATRIX_SIZE + j]; } } // 3. 計算完畢，把結果寫入專屬的記憶體位置 (已 Padding 過，不會有 False Sharing) args->thread_results[tid].local_sum = sum; // print log to console printf("[Thread %d] 完成計算 Rows %zu ~ %zu，局部總和: %f\n", tid, args->start_row, args->end_row - 1, sum); return NULL; // 執行緒結束 } ``` #### main.c version 2 這裡就該宣告的宣告，該設置的設置 > 這裡的NUM_THREADS是先用#define for 在這裡其實可以手動展開如join，但僅限於define下，理想還是直接用openmp自動展開 ```c // ... after random init matrix // ========================================== // Benchmark 開始：記錄起始時間 // ========================================== struct timespec start_time, end_time; clock_gettime(CLOCK_MONOTONIC, &start_time); // ========================================== pthread_t threads[num_threads]; thread_args_t thread_args[num_threads]; thread_data_t thread_results[num_threads]; // check if thread_results is correctly allocated float *row_max_array = malloc(MATRIX_SIZE * sizeof(float)); size_t rows_per_thread = MATRIX_SIZE / num_threads; printf("\n %d 個thread開始作業...\n", num_threads); for (int i = 0; i < num_threads; i++) { thread_args[i].thread_id = i; thread_args[i].start_row = i * rows_per_thread; // 如果是最後一個執行緒，就包辦剩下的所有行數 (避免無法整除的問題) thread_args[i].end_row = (i == num_threads - 1) ? MATRIX_SIZE : (i + 1) * rows_per_thread; thread_args[i].matrix = matrix; thread_args[i].thread_results = thread_results; thread_args[i].row_max_array = row_max_array; // 建立執行緒並執行 worker_task pthread_create(&threads[i], NULL, worker_task, &thread_args[i]); } for (int i = 0; i < num_threads; i++) { pthread_join(threads[i], NULL); } double global_sum = 0.0; for (int i = 0; i < num_threads; i++) { global_sum += thread_results[i].local_sum; } // ========================================== // Benchmark 結束：記錄結束時間並計算差值 // ========================================== clock_gettime(CLOCK_MONOTONIC, &end_time); double elapsed_time = (end_time.tv_sec - start_time.tv_sec) + (end_time.tv_nsec - start_time.tv_nsec) / 1e9; printf("\n>>> 最終全局總和 (Global Sum): %f <<<\n", global_sum); printf(">>> 核心計算耗時: %f 秒 <<<\n", elapsed_time); printf("\n所有執行緒完成計算，矩陣總和: %f\n", global_sum); // ========================================== // 結果驗證階段 (Sequential Verification) // ========================================== printf("\n>>> 啟動結果驗證 (計算單執行緒基準值以供核對)... <<<\n"); double expected_global_sum = 0.0; int validation_passed = 1; for (size_t i = 0; i < MATRIX_SIZE; i++) { // 先預設該行的第一個元素是最大值 float expected_row_max = matrix[i * MATRIX_SIZE]; for (size_t j = 0; j < MATRIX_SIZE; j++) { float val = matrix[i * MATRIX_SIZE + j]; expected_global_sum += val; // 計算單執行緒總和 // 計算單執行緒極值 if (val > expected_row_max) { expected_row_max = val; } } // 1. 驗證行極值：將單執行緒算出的極值，與多執行緒陣列中的結果比對 if (row_max_array[i] != expected_row_max) { printf(" error：Row %zu 的極值不吻合！預期: %f, 實際: %f\n", i, expected_row_max, row_max_array[i]); validation_passed = 0; break; // 錯一個就不用比了，直接中斷 } } // 2. 驗證全局總和 // 浮點數運算會有微小的精度誤差 (Epsilon)，尤其是加法順序不同的時候。 // 所以我們不能用 == 來比對浮點數，必須取相減的絕對值，確認誤差小於一個極小值。 double diff = expected_global_sum - global_sum; if (diff < 0) diff = -diff; // 簡單的絕對值處理 if (diff > 1e-2) { // 容許 0.01 的浮點數累加誤差 printf("error：全局總和不吻合！預期: %f, 實際: %f\n", expected_global_sum, global_sum); validation_passed = 0; } // 最終宣判 if (validation_passed) { printf("Pass\n"); } else { printf("Fail\n"); } // ========================================== // release memory free(threads); free(thread_args); free(thread_results); free(row_max_array); free(matrix); printf("Matrix memory freed.\n"); return 0; } ``` ##### return ```bash 所有執行緒完成計算，矩陣總和: 134211673.527968 >>> 啟動結果驗證 (計算單執行緒基準值以供核對)... <<< Pass double free or corruption (out) Aborted (core dumped) ``` 這裡是源自於記憶體對齊 (Memory Alignment)引發了堆積 (Heap) 結構破壞呼叫` malloc() `或 `calloc() `時，作業系統不僅僅會給你一塊記憶體，它還會在這塊記憶體前塞入一些管理資料 **(Heap Metadata)**，用來記錄這塊記憶體有多大。當你呼叫 free() 時，系統會去讀取這些隱藏的資料來決定要回收多少空間。 ```c // utils.h內，這裡要求了需要對齊64 typedef struct { double local_sum; alignas(CACHE_LINE_SIZE) long long final_count; } thread_data_t; ``` 但` malloc() calloc()`卻不會指定起始位置，只會給你正確的大小，這就導致了汙染，雖然不影響計算數值，但最終的free讀到汙染後會讓OS給出 corruption 並強制 Abortion 這裡可以改用`posix_memalign`取代calloc ```c thread_data_t *results = NULL; if (posix_memalign((void **)&results, CACHELINE_SIZE, num_threads * sizeof(thread_data_t)) != 0) { fprintf(stderr, "錯誤：對齊記憶體配置失敗！\n"); return 1; } ``` 或直接把下方的`free`註解掉，不過這是有風險的，stack空間小，如果今天照原樣，而宣告過多，則有塞滿的風險，故可以考慮改放heap ```c // ========== stack version ========== pthread_t threads[num_threads]; thread_args_t thread_args[num_threads]; thread_data_t thread_results[num_threads]; // check if thread_results is correctly allocated float *row_max_array = malloc(MATRIX_SIZE * sizeof(float)); // ========== heap version ========== // pthread_t *threads = malloc(num_threads * sizeof(pthread_t)); // thread_args_t *thread_args = malloc(num_threads * sizeof(thread_args_t)); // thread_data_t *thread_results = NULL; // posix_memalign((void **)&thread_results, 64, num_threads * sizeof(thread_data_t)); ``` ### Step 3 bash and plot 建立自動測試腳本與繪圖 ```bash # ~/scripts/run_bench.sh #!/bin/bash cd "$(dirname "$0")/.." || exit 1 echo "===============================================" echo " Parallel Matrix Extractor Benchmark Script " echo "===============================================" # 1. 自動重新編譯 echo "Recompiling..." make clean > /dev/null 2>&1 make > /dev/null 2>&1 if [ $? -ne 0 ]; then echo "Compile Error！plz check code。" exit 1 fi OUT_DIR="results" mkdir -p "$OUT_DIR" # Make sure the plots directory exists too mkdir -p "$OUT_DIR/plots" CSV_FILE="$OUT_DIR/benchmark_results.csv" PNG_FILE="$OUT_DIR/plots/speedup_curve.png" # 定義要測試的執行緒數與採樣次數 THREADS=(1 2 4 8 16) RUNS=1 CSV_FILE="$OUT_DIR/benchmark_results.csv" PNG_FILE="$OUT_DIR/plots/speedup_curve.png" # 初始化 CSV 檔案並寫入標題 echo "Threads,AverageTime" > "$CSV_FILE" echo "Compile Success！Start Benchmarking(Each number of threads will run $RUNS times)..." echo "------------------------------------------" # 2. 雙重迴圈：測試不同執行緒數，每次跑 5 遍 for t in "${THREADS[@]}" do echo -n "$t threads... " TOTAL_TIME=0 for ((i=1; i<=RUNS; i++)) do # 執行程式並將結果存入變數 OUTPUT OUTPUT=$(./extractor $t) # 使用 grep 找出含有「耗時」的那一行，並用正則表達式提取出浮點數數字 TIME=$(echo "$OUTPUT" | grep "核心計算耗時" | grep -oE '[0-9]+\.[0-9]+') # 萬一抓不到數字的防呆機制 if [ -z "$TIME" ]; then echo "Error：Fetching time error, plz check main.c format" exit 1 fi # 使用 awk 進行浮點數累加 TOTAL_TIME=$(awk "BEGIN {print $TOTAL_TIME + $TIME}") done # 計算 5 次的平均值 AVG_TIME=$(awk "BEGIN {print $TOTAL_TIME / $RUNS}") echo "Avg Time: $AVG_TIME 秒" # 將結果寫入 CSV 檔案 echo "$t,$AVG_TIME" >> $CSV_FILE done echo "------------------------------------------" echo "Data collected！Save in $CSV_FILE" echo "Ready to plot (Speedup Curve)..." # 3. 自動生成 Python 繪圖腳本 cat << 'EOF' > plot_speedup.py import csv import sys import matplotlib.pyplot as plt input_csv = sys.argv[1] output_png = sys.argv[2] try: threads = [] times = [] # 讀取剛剛 Bash 產生的 CSV with open(input_csv, 'r') as f: reader = csv.DictReader(f) for row in reader: threads.append(int(row['Threads'])) times.append(float(row['AverageTime'])) # 計算加速比 Speedup = T1 / Tn t1_time = times[0] speedup = [t1_time / t for t in times] # 設定圖表樣式 plt.figure(figsize=(8, 5)) # 畫出實際加速比 (藍線) plt.plot(threads, speedup, marker='o', linestyle='-', color='b', label='Actual Speedup') # 畫出理想加速比 (紅虛線：如果 2 個 Thread 快 2 倍，4 個快 4 倍的理論值) plt.plot(threads, threads, linestyle='--', color='r', alpha=0.6, label='Ideal Speedup') plt.title('Parallel Matrix Extractor - Speedup Curve') plt.xlabel('Number of Threads') plt.ylabel('Speedup (T1 / Tn)') plt.xticks(threads) plt.grid(True, linestyle='--', alpha=0.7) plt.legend() # 儲存圖片 plt.savefig(output_png, dpi=300) print(f"Success！Save as {output_png}") except ImportError: print("Error, lack of python module: matplotlib。") print("Solution：Checked if'pip install matplotlib' ,execute 'python3 plot_speedup.py'。") except Exception as e: print(f"plot error: {e}") EOF # 4. 呼叫 Python 執行繪圖 python3 plot_speedup.py "$CSV_FILE" "$PNG_FILE" rm plot_speedup.py ``` #### Result ``` Threads,AverageTime 1,0.294057 2,0.165808 4,0.084245 8,0.037628 16,0.044957 ``` ![speedup_curve](https://hackmd.io/_uploads/SJ_W6RDsbl.png) ### Step 4 Thread Affinity 作業系統 (OS) 的排程器容易將task依據core的忙碌狀態做負載平衡，而task遷移時TLB、L1、L2等資料的報廢與更新將造成巨大的開銷同時，在Intel的處理器中，有不少型號都是有P/E核的配置，這使得核心間的效能有所差距 #### utils.h ```c typedef struct { int thread_id; // 執行緒的編號 (0, 1, 2...) int core_id; // 執行緒綁定的核心編號 size_t start_row; // 負責計算的起始行 size_t end_row; // 負責計算的結束行 (不包含此行) float *matrix; // 整個矩陣的指標 thread_data_t *thread_results; // 指向結果陣列 float *row_max_array; // 指向每行最大值的陣列 } thread_args_t; ``` #### worker.c ```c /* * worker.c * version 2.0 * Modified ver 2.0 on: 2025/4/1 * Modified Content : affinity * Created on: 2025/3/27 * Author: Jimmy */ // #define _GNU_SOURCE here or in utils.h (the head of the file) #include "utils.h" void* worker_task(void* arg) { // 1. transform void* arg to thread struct, pointer to the struct we defined in utils.h thread_args_t *args = (thread_args_t*)arg; // here we can access the fields of the struct, e.g., args->thread_id, args->start_row, etc. int tid = args->thread_id; int core_id = args->core_id; // 從參數結構中獲取核心 ID // ========================================== // 執行緒綁核 (Thread Affinity) 邏輯 // ========================================== cpu_set_t cpuset; CPU_ZERO(&cpuset); // 1. 清空 CPU 集合 CPU_SET(core_id, &cpuset); // 2. 將我們指定的 core_id 加入集合 // 取得當前執行緒並強制綁定 pthread_t current_thread = pthread_self(); int affinity_status = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset); if (affinity_status != 0) { printf("[Thread %d] 警告：無法綁定到 Core %d\n", tid, core_id); } else { // 印出來確認有綁定成功 printf("[Thread %d] 成功綁在 CPU Core %d\n", tid, core_id); } // ========================================== double sum = 0.0; // 暫存用的局部變數，放在 CPU Register 裡最快！ // 2. Start work：只跑自己的 Rows for (size_t i = args->start_row; i < args->end_row; i++) { // Columns float current_row_max = args->matrix[i * MATRIX_SIZE]; for (size_t j = 0; j < MATRIX_SIZE; j++) { // 一維陣列模擬二維陣列的公式：索引 = (行號 * 總列數) + 列號 sum += args->matrix[i * MATRIX_SIZE + j]; if (args->matrix[i * MATRIX_SIZE + j] > current_row_max) { current_row_max = args->matrix[i * MATRIX_SIZE + j]; } } args->row_max_array[i] = current_row_max; } // 3. 計算完畢，把結果寫入專屬的記憶體位置 (已 Padding 過，不會有 False Sharing) args->thread_results[tid].local_sum = sum; // print log to console printf("[Thread %d] 完成計算 Rows %zu ~ %zu，局部總和: %f\n", tid, args->start_row, args->end_row - 1, sum); return NULL; // 執行緒結束 } ``` #### main.c ```c size_t rows_per_thread = MATRIX_SIZE / num_threads; printf("\n %d 個thread開始作業...\n", num_threads); for (int i = 0; i < num_threads; i++) { thread_args[i].thread_id = i; thread_args[i].core_id = i; // <- 先按順序綁 thread_args[i].start_row = i * rows_per_thread; // 如果是最後一個執行緒，就包辦剩下的所有行數 (避免無法整除的問題) thread_args[i].end_row = (i == num_threads - 1) ? MATRIX_SIZE : (i + 1) * rows_per_thread; thread_args[i].matrix = matrix; thread_args[i].thread_results = thread_results; thread_args[i].row_max_array = row_max_array; // 建立執行緒並執行 worker_task pthread_create(&threads[i], NULL, worker_task, &thread_args[i]); } ``` ``` # 查看CPU狀況 lscpu -e # return CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE 0 0 0 0 0:0:0:0 yes 1 0 0 0 0:0:0:0 yes 2 0 0 1 1:1:1:0 yes 3 0 0 1 1:1:1:0 yes 4 0 0 2 2:2:2:0 yes 5 0 0 2 2:2:2:0 yes 6 0 0 3 3:3:3:0 yes 7 0 0 3 3:3:3:0 yes ``` 依照cache，core0 由CPU0和1組成，而我並沒有P/E核的問題這個core的切割，依賴於**Hyper-Threading** ### Step 5 exetreme condition * 策略 A：資源獨享 (最佳效能) 把 4 個執行緒分配到完全不同的實體核心上。 * 綁核目標：CPU 0, 2, 4, 6。 * 優勢：每個人都有自己獨立的 L1/L2 Cache，計算單元完全不用跟別人擠。 * 策略 B：資源互卡 (最差效能) * 把threads塞進同一個實體核心裡。綁核目標：CPU 0, 1, 2, 3。 * 劣勢：Thread 0 和 Thread 1 被迫擠在 CORE 0 裡面，它們必須搶同一個 L1 Cache 和 ALU（算術邏輯單元），效能絕對會互相拖累 ```c size_t rows_per_thread = MATRIX_SIZE / num_threads; printf("\n %d 個thread開始作業...\n", num_threads); // ================= Affinity ========================= // section 2 // int core_map[] = {0, 2, 4, 6, 8, 10, 12, 14}; // section 3 int core_map[] = {0, 1}; int map_size = sizeof(core_map) / sizeof(core_map[0]); // ==================================================== for (int i = 0; i < num_threads; i++) { thread_args[i].thread_id = i; // section 1 // thread_args[i].core_id = i; // section 2 & 3 thread_args[i].core_id = core_map[i % map_size]; // 下方照舊 ``` #### return (2 core, 8 threads) ``` 8 個thread開始作業... [Thread 0] 成功綁在 CPU Core 0 [Thread 1] 成功綁在 CPU Core 1 [Thread 3] 成功綁在 CPU Core 1 [Thread 2] 成功綁在 CPU Core 0 [Thread 5] 成功綁在 CPU Core 1 [Thread 6] 成功綁在 CPU Core 0 [Thread 4] 成功綁在 CPU Core 0 [Thread 7] 成功綁在 CPU Core 1 >>> 核心計算耗時: 0.140819 秒 <<< ``` #### return (8 core, 8 threads) ``` 8 個thread開始作業... [Thread 0] 成功綁在 CPU Core 0 [Thread 2] 成功綁在 CPU Core 4 [Thread 3] 成功綁在 CPU Core 6 [Thread 7] 成功綁在 CPU Core 14 [Thread 6] 成功綁在 CPU Core 12 [Thread 4] 成功綁在 CPU Core 8 [Thread 5] 成功綁在 CPU Core 10 [Thread 1] 成功綁在 CPU Core 2 >>> 核心計算耗時: 0.071867 秒 <<< ``` #### return (16 core, 8 threads) ``` [Thread 1] 成功綁在 CPU Core 1 [Thread 5] 成功綁在 CPU Core 5 [Thread 6] 成功綁在 CPU Core 6 [Thread 3] 成功綁在 CPU Core 3 [Thread 4] 成功綁在 CPU Core 4 [Thread 2] 成功綁在 CPU Core 2 [Thread 0] 成功綁在 CPU Core 0 [Thread 7] 成功綁在 CPU Core 7 >>> 核心計算耗時: 0.059652 秒 <<< ``` ### Step 6 without padding 最後要做 False Sharing Defense 與 Cache Miss 驗證，若無padding * Thread 0 的 local_sum 和 Thread 1 的 local_sum 會被塞進同一個 64 Bytes 的 Cache Line 裡。 * 當 Thread 0 更新自己的總和時，CPU 會為了保證資料一致性，強制把 Thread 1 核心裡的同一個 Cache Line 標記為「失效 (Invalid)」。 * Thread 1 下一次要寫入時，就發生了 Cache Miss，必須重新去主記憶體抓資料。最後最慘就變成pingpong，OS介入 #### utils.h 增加 ``` # define USE_PADDING 1 // 1: 有防禦，0: 無防禦 # if USE_PADDING // [有防禦] 在 local_sum 後面加上足夠的 padding，確保每個 thread_data_t 都獨占一個 Cache Line，避免 False Sharing typedef struct { double local_sum; alignas(CACHE_LINE_SIZE) long long final_count; } thread_data_t; #else // [無防禦] 乾淨溜溜的結構體，多個 local_sum 會擠在同一個 Cache Line 裡互相傷害 typedef struct { double local_sum; } thread_data_t; #endif ``` #### worker.c ```c /* * worker.c * version 3.0 * Modified ver 3.0 on: 2025/4/1 * Modified Content : affinity, padding * Created on: 2025/3/27 * Author: Jimmy */ // #define _GNU_SOURCE here or in utils.h (the head of the file) #include "utils.h" void* worker_task(void* arg) { // 1. transform void* arg to thread struct, pointer to the struct we defined in utils.h thread_args_t *args = (thread_args_t*)arg; // here we can access the fields of the struct, e.g., args->thread_id, args->start_row, etc. int tid = args->thread_id; int core_id = args->core_id; // 從參數結構中獲取核心 ID // ========================================== // 執行緒綁核 (Thread Affinity) 邏輯 // ========================================== cpu_set_t cpuset; CPU_ZERO(&cpuset); // 1. 清空 CPU 集合 CPU_SET(core_id, &cpuset); // 2. 將我們指定的 core_id 加入集合 // 取得當前執行緒並強制綁定 pthread_t current_thread = pthread_self(); int affinity_status = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset); if (affinity_status != 0) { printf("[Thread %d] 警告：無法綁定到 Core %d\n", tid, core_id); } else { // 印出來確認有綁定成功 printf("[Thread %d] 成功綁在 CPU Core %d\n", tid, core_id); } // ========================================== #if USE_PADDING_WORKER double sum = 0.0; // 暫存用的局部變數，放在 CPU Register 裡最快！ #else args->thread_results[tid].local_sum = 0.0; #endif // 2. Start work：只跑自己的 Rows for (size_t i = args->start_row; i < args->end_row; i++) { // Columns float current_row_max = args->matrix[i * MATRIX_SIZE]; for (size_t j = 0; j < MATRIX_SIZE; j++) { // 一維陣列模擬二維陣列的公式：索引 = (行號 * 總列數) + 列號 #if USE_PADDING_WORKER sum += args->matrix[i * MATRIX_SIZE + j]; # else args->thread_results[tid].local_sum += args->matrix[i * MATRIX_SIZE + j]; #endif if (args->matrix[i * MATRIX_SIZE + j] > current_row_max) { current_row_max = args->matrix[i * MATRIX_SIZE + j]; } } args->row_max_array[i] = current_row_max; } // 3. 計算完畢，把結果寫入專屬的記憶體位置 (已 Padding 過，不會有 False Sharing) #if USE_PADDING_WORKER args->thread_results[tid].local_sum = sum; printf("[Thread %d] 完成計算 Rows %zu ~ %zu，局部總和: %f\n", tid, args->start_row, args->end_row - 1, sum); #else printf("[Thread %d] 完成計算 Rows %zu ~ %zu，局部總和: %f\n", tid, args->start_row, args->end_row - 1, args->thread_results[tid].local_sum); #endif // print log to console return NULL; // 執行緒結束 } ``` 結果其實不會差到太多，Compiler的優化有一定的影響，加上無法看到L1 miss 所以在utils.h內加上`volatile` ```c # if USE_PADDING typedef struct { double local_sum; alignas(CACHE_LINE_SIZE) long long final_count; } thread_data_t; #else typedef struct { volatile double local_sum; } thread_data_t; #endif ``` ``` # padding >>> 核心計算耗時: 0.037044 秒 <<< # without padding >>> 核心計算耗時: 0.193459 秒 <<< ``` END

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.