---
# System prepended metadata

title: Phase 1 Final Project
tags: [SP]

---

## Final Project (Phase 1)
Include : **Thread Lifecycle**, **Synchronization**, **False Sharing**, **Affinity**


### 題目
多核心異構環境下的矩陣特徵提取器 **(Parallel Matrix Feature Extractor)**

### Spec
給定一個巨大的二維矩陣 $M$ ($16384 \times 16384$，單精度浮點數)，你需要撰寫一個平行程式來計算：
1.  **全局總和 (Global Sum)**：矩陣所有元素的累加。
2.  **行極值 (Row-wise Maxima)**：每一行最大的數值，存入一個陣列。
3.  **效能報告**：量化在不同核心綁定策略下的執行效率。

#### Requirement 1 : Domain Decomposition
* **任務**：將矩陣按「行 (Rows)」平均分配給 $N$ 個執行緒。
* **OS 連結**：利用 **Thread Affinity** 將每個執行緒固定在不同的 CPU 核心上，觀察 P-core 與 E-core 的計算速度落差。

#### Requirement 2 : Reduction Optimization
* **禁忌**：嚴禁多個執行緒直接對一個全域變數加鎖（會產生嚴重的 Contention）。
* **要求**：每個執行緒必須先計算其分配區域的「局部總和 (Local Sum)」，最後再彙整。

#### Requirement 3 : False Sharing Preventing
* **挑戰**：如果你建立一個儲存每個執行緒局部結果的陣列 `float results[NUM_THREADS]`，會觸發 False Sharing。
* **要求**：使用 **Padding** 或 **Thread-local Storage** 確保各核心的 L1 Cache 不會互相打架。

---

### File Structure

```text
~project/
 ├──Project1/
 │  ├── src/
 │  │   ├── main.c           # 負責初始化矩陣、建立執行緒、匯總結果
 │  │   ├── worker.c         
 │  │   └── utils.h          # 定義對齊用的結構體
 │  ├── scripts/
 │  │   └── run_bench.sh     # 自動測試 1, 2, 4, 8, 16 執行緒的腳本
 │  ├── Makefile             # 需包含 -O3 與 -pthread
 │  └── report.md   
 │  
 ├──Project1_affinity/
    ├── src/
    │   ├── main.c           
    │   ├── worker.c         # 核心計算邏輯 (含 Affinity 設定)
    │   └── utils.h          
    ├── scripts/
    │   └── run_bench.sh     
    ├── Makefile             
    └── report.md
```

---

### report

 `report.md` 必須包含以下量化分析：

1.  **加速比曲線 (Speedup Curve)**：
    * 計算 $S = T_1 / T_n$。
    * **分析**：當執行緒數量超過實體核心（Physical Cores）進入超執行緒（Hyper-threading）時，加速比是否趨於平緩？為什麼？

2.  **硬體異構性分析 (Heterogeneity)**：
    * 分別測試「只綁定 P-cores」與「只綁定 E-cores」的執行時間。
    * **量化**：計算兩者的 **效能功耗比估算**（若有工具可測）或單純的運算速度倍率。

3.  **Cache Miss 驗證**：
    * 使用 `perf stat -e L1-dcache-load-misses` 比較「有 Padding」與「無 Padding」版本的數據差異。


### 啟動代碼片段 (Data Structure)

```c
#define CACHELINE_SIZE 64

typedef struct {
    double local_sum;
    // Padding 確保每個 struct 佔用整倍數的 Cache Line
    char padding[CACHELINE_SIZE - sizeof(double)]; 
} __attribute__((aligned(CACHELINE_SIZE))) thread_data_t;
```

---
### Step 1 : Structure
建立一個 $16384 \times 16384$ 的單精度浮點數 (float) 矩陣 $M$。
單精度浮點數佔用 4 Bytes(32 bits)。
* 總大小：
$$16384 * 16384 * 4 Bytes = 1,073,741,824 Bytes = 1 GB$$

1. 絕對不能宣告在 Stack 上
如果你寫 `float matrix[16384][16384]`，程式一執行就會立刻發生 Stack Overflow (堆疊溢位)，因為作業系統預設給每個執行緒的 Stack 大小通常只有 8 MB 
    > use `ulimit -s` to check default stack

2. 必須使用 Heap (堆積)
我們必須使用 malloc 或 calloc 來動態配置這 1 GB 的記憶體。

3. 使用一維陣列模擬二維陣列
為了極致的效能，我們不使用 `(float **)`來建立二維陣列，因為那會讓記憶體變得破碎。我們應該宣告一個巨大的一維陣列，這樣記憶體才是連續的 (Contiguous)。連續的記憶體對 CPU Cache 絕對友善，能大幅降低 Cache Miss。

> 這裡請複習Stack / Heap 的差異

#### `(float **)`


##### 1. 指標陣列層次（第一層）
當你宣告一個 `float **ptr` 並為其分配空間時（例如模擬二維陣列），你通常會先分配一個存放「指標」的空間：

```c
float **ptr = malloc(rows * sizeof(float *));
```

在這個階段，`ptr` 指向的是一個連續的記憶體區塊，裡面存放著多個 `float *`（指標）。**在這個區塊內，指標與指標之間是連續的。**

##### 2. 實際資料層次（第二層）
問題通常出在這裡。當你為每一列分配空間時：

```c
for (int i = 0; i < rows; i++) {
    ptr[i] = malloc(cols * sizeof(float));
}
```

* **指標之間不連續：** 每次呼叫 `malloc` 分配的 `ptr[i]` 區塊，在記憶體中的位置是由作業系統動態決定的。`ptr[0]` 的結尾與 `ptr[1]` 的開頭通常**不相鄰**。
* **區塊內部連續：** 在同一個 `ptr[i]` 所指向的 `cols` 個 `float` 之間，記憶體是連續的。
##### 比較表：`float **` vs. 二維陣列 `float[R][C]`

| 特性 | `float **ptr` (動態分配) | `float arr[R][C]` (靜態陣列) |
| :--- | :--- | :--- |
| **記憶體佈局** | **非線性**。每一列可能散落在記憶體各處。 | **線性**。所有資料都在一個完整的連續區塊。 |
| **存取效率** | 較慢。需要兩次記憶體尋址 (Dereference)。 | 較快。編譯器可直接計算偏移量。 |
| **彈性** | 高。每一列的長度可以不同（鋸齒陣列）。 | 低。每一列長度必須相同。 |


---

##### 如何讓 `float **` 的記憶體完全連續？
如果你需要使用 `ptr[i][j]` 的語法，又希望記憶體完全連續（為了效能或傳輸需求），你可以使用「一次性分配」技巧：

```c
float **ptr = malloc(rows * sizeof(float *));
float *data = malloc(rows * cols * sizeof(float)); // 分配一整塊連續空間

for (int i = 0; i < rows; i++) {
    ptr[i] = &data[i * cols]; // 將指標指向連續空間的對應位置
}
```


#### Header : utils.h
```c
/*
 *  utils.h
 *  version 1.0
 *  Created on: 2025/3/27
 *  Author: Jimmy
 */
#ifndef UTILS_H
#define UTILS_H
#define _GNU_SOURCE
#include <stdalign.h> // 用於 alignas
# define MATRIX_SIZE 16384
# define rows 16384
# define cols 16384
# define CACHE_LINE_SIZE 64
# define EXIT_FAILURE 1
typedef struct  {
    alignas(CACHE_LINE_SIZE) long long final_count;
} thread_data_t;
// 等價於
// typedef struct {
//     double local_sum;
//     char padding[CACHELINE_SIZE - sizeof(double)]; 
// } __attribute__((aligned(CACHELINE_SIZE))) thread_data_t;
// ---------------------------------------------------------
// typedef struct {
//     double local_sum;
//     char padding[CACHELINE_SIZE - sizeof(double)];
// } alignas(CACHELINE_SIZE) thread_data_t;


#endif // UTILS_H
```
#### main.c (ver. matrix default)
```c
/*
 *  main.c
 *  version 1.0
 *  Created on: 2025/3/27
 *  Author: Jimmy 
*/
#include <stdio.h>
#include <stdlib.h> 
#include <time.h>
#include "utils.h"


int main () {
    printf("Default\n");
    // size_t : unsigned integer type, auto-adjusted to the size of the platform (32-bit or 64-bit)
    size_t total_element = (size_t)rows * cols;
    // number of elements in the matrix = rows * cols
    printf("Total elements: %zu\n", total_element);

    // single precision float: 4 bytes
    // malloc return void* , float* can trasform void* to float*
    float *matrix = (float *)malloc(total_element * sizeof(float));
    // float **ptr = malloc(rows * sizeof(float *));
    // float *data = malloc(rows * cols * sizeof(float)); // 分配一整塊連續空間

    // for (int i = 0; i < rows; i++) {
    //     ptr[i] = &data[i * cols]; // 將指標指向連續空間的對應位置
    // }
    if (matrix == NULL) {
        // log error message to a file name "stderr"
        fprintf(stderr, "Memory allocation failed\n");
        return EXIT_FAILURE;
    }
    // time_t time(time_t *tloc);
    // time return the current calendar time as 
    // a value of type time_t. 
    // If tloc is non-NULL, 
    // the return value is also stored 
    // in the memory pointed to by tloc.
    srand (time(NULL)); // seed the random number generator with current time
    for (size_t i = 0; i < total_element; i++) {
        // RAND_MAX is a constant defined in stdlib.h,
        // which represents the maximum value returned by the rand function
        matrix[i] = (float)rand() / RAND_MAX; // Initialize with some values, e.g., 0.0, 1.0, ..., 99.0, then repeat
    }

    printf ("Matrix initialization completed.\n");
    free (matrix);
 
    return 0;
}
```
##### Take out
* **size_t** : auto-adjusted to the size of the platform (32-bit or 64-bit)
* **time(NULL)**: 代表不使用指標接收時間，只取回傳值。
* **RAND_MAX**: 是標準庫定義的固定數字，不是函式。
* **(float)**: 強制轉型很重要，如果沒加，整數除法（例如 10 / 32767）會直接變成 0
* **($float^*$)**: 轉malloc (default return void*) 成float
* **fprint**(log's name, message)

#### Makefile
```
# Makefile for Project 1: Parallel Matrix Extraction

# 1. 定義變數 (Variables)
CC = gcc
# CFLAGS 是一連串的編譯選項：
#   -Wall -Wextra: 開啟所有常見的警告。工程師最喜歡警告了，它能幫我們抓出潛在的 Bug。
#   -O3: 最高等級的編譯器最佳化。因為我們要測效能，這個參數絕對不能少。
#   -pthread: 告訴編譯器我們要連結 POSIX Thread 函式庫，這是多執行緒的靈魂。
#   -g: 加入除錯資訊，萬一程式當掉 (Segmentation fault)，我們可以用 gdb 追蹤。
CFLAGS = -Wall -Wextra -O3 -pthread -g

# exe file name
TARGET = extractor

# 所有的 .c 原始碼檔案
SRCS = src/main.c src/worker.c

# 預設目標
# make = 執行第一個 Target（通常就是 all）。
# make all = 明確指定執行名為 all 的 Target。
all: $(TARGET)

# 編譯規則：告訴 Make 如何把 SRCS 編譯成 TARGET
$(TARGET): $(SRCS)
	$(CC) $(CFLAGS) -o $(TARGET) $(SRCS)

# 清理編譯出來的檔案 (方便重新編譯)
clean:
	rm -f $(TARGET)
```

### Step 2 Workload & Threads Lifecycle
依照threads數量分配，如
* Thread 0 負責 Row 0 ~ 4095
* Thread 1 負責 Row 4096 ~ 8191

以達成記憶體連續存取對齊
#### utils.h (modify ver 2)
```c
// utils.h
// below the thread_data_t
typedef struct {
    int thread_id;                 // 執行緒的編號 (0, 1, 2...)
    size_t start_row;              // 負責計算的起始行
    size_t end_row;                // 負責計算的結束行 (不包含此行)
    float *matrix;                 // 整個矩陣的指標
    thread_data_t *thread_results; // 指向我們防 False Sharing 的結果陣列
} thread_args_t;

void* worker_task(void* arg);
```

#### worker.c
```c

#include "utils.h"

void* worker_task(void* arg) {
    // 1. transform void* arg to thread struct, pointer to the struct we defined in utils.h 
    thread_args_t *args = (thread_args_t*)arg;
    // here we can access the fields of the struct, e.g., args->thread_id, args->start_row, etc.
    int tid = args->thread_id;
    
    double sum = 0.0; // 暫存用的局部變數，放在 CPU Register 裡最快！

    // 2. Start work：只跑自己的 Rows
    for (size_t i = args->start_row; i < args->end_row; i++) {
        // Columns
        for (size_t j = 0; j < MATRIX_SIZE; j++) {
            // 一維陣列模擬二維陣列的公式： 索引 = (行號 * 總列數) + 列號
            sum += args->matrix[i * MATRIX_SIZE + j];
        }
    }

    // 3. 計算完畢，把結果寫入專屬的記憶體位置 (已 Padding 過，不會有 False Sharing)
    args->thread_results[tid].local_sum = sum;

    // print log to console
    printf("[Thread %d] 完成計算 Rows %zu ~ %zu，局部總和: %f\n", 
           tid, args->start_row, args->end_row - 1, sum);

    return NULL; // 執行緒結束
}
```
#### main.c version 2
這裡就該宣告的宣告，該設置的設置
> 這裡的NUM_THREADS是先用#define

for 在這裡其實可以手動展開如join，但僅限於define下，理想還是直接用openmp自動展開
```c
// ... after random init matrix
    // ==========================================
    // Benchmark 開始：記錄起始時間
    // ==========================================
    struct timespec start_time, end_time;
    clock_gettime(CLOCK_MONOTONIC, &start_time);
    // ==========================================
    pthread_t threads[num_threads];
    thread_args_t thread_args[num_threads];
    thread_data_t thread_results[num_threads]; 
   
    // check if thread_results is correctly allocated
    float *row_max_array = malloc(MATRIX_SIZE * sizeof(float));

    size_t rows_per_thread = MATRIX_SIZE / num_threads;
    printf("\n %d 個thread開始作業...\n", num_threads);
    for (int i = 0; i < num_threads; i++) {
        thread_args[i].thread_id = i;
        thread_args[i].start_row = i * rows_per_thread;
        
        // 如果是最後一個執行緒，就包辦剩下的所有行數 (避免無法整除的問題)
        thread_args[i].end_row = (i == num_threads - 1) ? MATRIX_SIZE : (i + 1) * rows_per_thread;
        thread_args[i].matrix = matrix;
        thread_args[i].thread_results = thread_results;
        thread_args[i].row_max_array = row_max_array;

        // 建立執行緒並執行 worker_task
        pthread_create(&threads[i], NULL, worker_task, &thread_args[i]);
    }

    
    for (int i = 0; i < num_threads; i++) {
        pthread_join(threads[i], NULL); 
    }
    double global_sum = 0.0;
    for (int i = 0; i < num_threads; i++) {
        global_sum += thread_results[i].local_sum;
    }
    // ==========================================
    // Benchmark 結束：記錄結束時間並計算差值
    // ==========================================
    clock_gettime(CLOCK_MONOTONIC, &end_time);
    double elapsed_time = (end_time.tv_sec - start_time.tv_sec) + 
                          (end_time.tv_nsec - start_time.tv_nsec) / 1e9;
    printf("\n>>> 最終全局總和 (Global Sum): %f <<<\n", global_sum);
    printf(">>> 核心計算耗時: %f 秒 <<<\n", elapsed_time);

    printf("\n所有執行緒完成計算，矩陣總和: %f\n", global_sum);
    
    
    // ==========================================
    // 結果驗證階段 (Sequential Verification)
    // ==========================================
    printf("\n>>> 啟動結果驗證 (計算單執行緒基準值以供核對)... <<<\n");
    double expected_global_sum = 0.0;
    int validation_passed = 1;

    for (size_t i = 0; i < MATRIX_SIZE; i++) {
        // 先預設該行的第一個元素是最大值
        float expected_row_max = matrix[i * MATRIX_SIZE]; 
        
        for (size_t j = 0; j < MATRIX_SIZE; j++) {
            float val = matrix[i * MATRIX_SIZE + j];
            expected_global_sum += val; // 計算單執行緒總和
            
            // 計算單執行緒極值
            if (val > expected_row_max) {
                expected_row_max = val;
            }
        }

        // 1. 驗證行極值：將單執行緒算出的極值，與多執行緒陣列中的結果比對
        if (row_max_array[i] != expected_row_max) {
            printf(" error：Row %zu 的極值不吻合！預期: %f, 實際: %f\n", 
                   i, expected_row_max, row_max_array[i]);
            validation_passed = 0;
            break; // 錯一個就不用比了，直接中斷
        }
    }

    // 2. 驗證全局總和
    // 浮點數運算會有微小的精度誤差 (Epsilon)，尤其是加法順序不同的時候。
    // 所以我們不能用 == 來比對浮點數，必須取相減的絕對值，確認誤差小於一個極小值。
    double diff = expected_global_sum - global_sum;
    if (diff < 0) diff = -diff; // 簡單的絕對值處理

    if (diff > 1e-2) { // 容許 0.01 的浮點數累加誤差
        printf("error：全局總和不吻合！預期: %f, 實際: %f\n", 
               expected_global_sum, global_sum);
        validation_passed = 0;
    }

    // 最終宣判
    if (validation_passed) {
        printf("Pass\n");
    } else {
        printf("Fail\n");
    }
    // ==========================================
    // release memory
  

    free(threads);
    free(thread_args);
    free(thread_results);
    free(row_max_array); 
    free(matrix);
    printf("Matrix memory freed.\n");
    return 0;
}


```
##### return
```bash
所有執行緒完成計算，矩陣總和: 134211673.527968

>>> 啟動結果驗證 (計算單執行緒基準值以供核對)... <<<
Pass
double free or corruption (out)
Aborted (core dumped)
```
這裡是源自於記憶體對齊 (Memory Alignment)引發了堆積 (Heap) 結構破壞

呼叫` malloc() `或 `calloc() `時，作業系統不僅僅會給你一塊記憶體，它還會在這塊記憶體前塞入一些管理資料 **(Heap Metadata)**，用來記錄這塊記憶體有多大。當你呼叫 free() 時，系統會去讀取這些隱藏的資料來決定要回收多少空間。

```c
// utils.h內，這裡要求了需要對齊64
typedef struct  {
    double local_sum;
    alignas(CACHE_LINE_SIZE) long long final_count;
} thread_data_t;
```
但` malloc() calloc()`卻不會指定起始位置，只會給你正確的大小，這就導致了汙染，雖然不影響計算數值，但最終的free讀到汙染後會讓OS給出 corruption 並強制 Abortion 

這裡可以改用`posix_memalign`取代calloc
```c
thread_data_t *results = NULL;
    if (posix_memalign((void **)&results, CACHELINE_SIZE, num_threads * sizeof(thread_data_t)) != 0) {
        fprintf(stderr, "錯誤：對齊記憶體配置失敗！\n");
        return 1;
    }
```
或直接把下方的`free`註解掉，不過這是有風險的，stack空間小，如果今天照原樣，而宣告過多，則有塞滿的風險，故可以考慮改放heap
```c
    // ========== stack version ==========
    pthread_t threads[num_threads];
    thread_args_t thread_args[num_threads];
    thread_data_t thread_results[num_threads]; 
    // check if thread_results is correctly allocated
    float *row_max_array = malloc(MATRIX_SIZE * sizeof(float)); 
    
    // ========== heap version ==========
    // pthread_t *threads = malloc(num_threads * sizeof(pthread_t));
    // thread_args_t *thread_args = malloc(num_threads * sizeof(thread_args_t));
    // thread_data_t *thread_results = NULL;
    // posix_memalign((void **)&thread_results, 64, num_threads * sizeof(thread_data_t));


```


### Step 3 bash and plot
建立自動測試腳本與繪圖
```bash
# ~/scripts/run_bench.sh
#!/bin/bash
cd "$(dirname "$0")/.." || exit 1
echo "==============================================="
echo "  Parallel Matrix Extractor Benchmark Script  "
echo "==============================================="

# 1. 自動重新編譯
echo "Recompiling..."
make clean > /dev/null 2>&1
make > /dev/null 2>&1

if [ $? -ne 0 ]; then
    echo "Compile Error！plz check code。"
    exit 1
fi
OUT_DIR="results"
mkdir -p "$OUT_DIR"
# Make sure the plots directory exists too
mkdir -p "$OUT_DIR/plots"

CSV_FILE="$OUT_DIR/benchmark_results.csv"
PNG_FILE="$OUT_DIR/plots/speedup_curve.png"
# 定義要測試的執行緒數與採樣次數
THREADS=(1 2 4 8 16)
RUNS=1
CSV_FILE="$OUT_DIR/benchmark_results.csv"
PNG_FILE="$OUT_DIR/plots/speedup_curve.png"
# 初始化 CSV 檔案並寫入標題
echo "Threads,AverageTime" > "$CSV_FILE"

echo "Compile Success！Start Benchmarking(Each number of threads will run $RUNS times)..."
echo "------------------------------------------"

# 2. 雙重迴圈：測試不同執行緒數，每次跑 5 遍
for t in "${THREADS[@]}"
do
    echo -n "$t threads... "
    TOTAL_TIME=0

    for ((i=1; i<=RUNS; i++))
    do
        # 執行程式並將結果存入變數 OUTPUT
        OUTPUT=$(./extractor $t)
        
        # 使用 grep 找出含有「耗時」的那一行，並用正則表達式提取出浮點數數字
        TIME=$(echo "$OUTPUT" | grep "核心計算耗時" | grep -oE '[0-9]+\.[0-9]+')
        
        # 萬一抓不到數字的防呆機制
        if [ -z "$TIME" ]; then
            echo "Error：Fetching time error, plz check main.c format"
            exit 1
        fi

        # 使用 awk 進行浮點數累加
        TOTAL_TIME=$(awk "BEGIN {print $TOTAL_TIME + $TIME}")
    done

    # 計算 5 次的平均值
    AVG_TIME=$(awk "BEGIN {print $TOTAL_TIME / $RUNS}")
    echo "Avg Time: $AVG_TIME 秒"

    # 將結果寫入 CSV 檔案
    echo "$t,$AVG_TIME" >> $CSV_FILE
done

echo "------------------------------------------"
echo "Data collected！Save in $CSV_FILE"
echo "Ready to plot (Speedup Curve)..."

# 3. 自動生成 Python 繪圖腳本
cat << 'EOF' > plot_speedup.py
import csv
import sys
import matplotlib.pyplot as plt
input_csv = sys.argv[1]
output_png = sys.argv[2]
try:
    threads = []
    times = []
    # 讀取剛剛 Bash 產生的 CSV
    with open(input_csv, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            threads.append(int(row['Threads']))
            times.append(float(row['AverageTime']))

    # 計算加速比 Speedup = T1 / Tn
    t1_time = times[0]
    speedup = [t1_time / t for t in times]

    # 設定圖表樣式
    plt.figure(figsize=(8, 5))
    
    # 畫出實際加速比 (藍線)
    plt.plot(threads, speedup, marker='o', linestyle='-', color='b', label='Actual Speedup')
    # 畫出理想加速比 (紅虛線：如果 2 個 Thread 快 2 倍，4 個快 4 倍的理論值)
    plt.plot(threads, threads, linestyle='--', color='r', alpha=0.6, label='Ideal Speedup')

    plt.title('Parallel Matrix Extractor - Speedup Curve')
    plt.xlabel('Number of Threads')
    plt.ylabel('Speedup (T1 / Tn)')
    plt.xticks(threads)
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend()

    # 儲存圖片
    plt.savefig(output_png, dpi=300)
    print(f"Success！Save as {output_png}")

except ImportError:
    print("Error, lack of python module: matplotlib。")
    print("Solution：Checked if'pip install matplotlib' ,execute 'python3 plot_speedup.py'。")
except Exception as e:
    print(f"plot error: {e}")
EOF

# 4. 呼叫 Python 執行繪圖
python3 plot_speedup.py "$CSV_FILE" "$PNG_FILE"

rm plot_speedup.py
```
#### Result
```
Threads,AverageTime
1,0.294057
2,0.165808
4,0.084245
8,0.037628
16,0.044957
```
![speedup_curve](https://hackmd.io/_uploads/SJ_W6RDsbl.png)

### Step 4 Thread Affinity
作業系統 (OS) 的排程器容易將task依據core的忙碌狀態做負載平衡，而task遷移時TLB、L1、L2等資料的報廢與更新將造成巨大的開銷

同時，在Intel的處理器中，有不少型號都是有P/E核的配置，這使得核心間的效能有所差距
#### utils.h
```c
typedef struct {
    int thread_id;                 // 執行緒的編號 (0, 1, 2...)
    int core_id;                   // 執行緒綁定的核心編號
    size_t start_row;              // 負責計算的起始行
    size_t end_row;                // 負責計算的結束行 (不包含此行)
    float *matrix;                 // 整個矩陣的指標
    thread_data_t *thread_results; // 指向結果陣列
    float *row_max_array;          // 指向每行最大值的陣列
} thread_args_t;
```

#### worker.c
```c
/*
 *  worker.c
 *  version 2.0
 *  Modified ver 2.0 on: 2025/4/1 
 *  Modified Content : affinity
 *  Created on: 2025/3/27
 *  Author: Jimmy 
*/
// #define _GNU_SOURCE here or in utils.h (the head of the file)
#include "utils.h"


void* worker_task(void* arg) {
    // 1. transform void* arg to thread struct, pointer to the struct we defined in utils.h 
    thread_args_t *args = (thread_args_t*)arg;
    // here we can access the fields of the struct, e.g., args->thread_id, args->start_row, etc.
    int tid = args->thread_id;
    
    int core_id = args->core_id; // 從參數結構中獲取核心 ID
    
    // ==========================================
    // 執行緒綁核 (Thread Affinity) 邏輯
    // ==========================================
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);          // 1. 清空 CPU 集合
    CPU_SET(core_id, &cpuset);  // 2. 將我們指定的 core_id 加入集合
    // 取得當前執行緒並強制綁定
    pthread_t current_thread = pthread_self();
    int affinity_status = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
    if (affinity_status != 0) {
        printf("[Thread %d] 警告：無法綁定到 Core %d\n", tid, core_id);
    } else {
        // 印出來確認有綁定成功
        printf("[Thread %d] 成功綁在 CPU Core %d\n", tid, core_id);
    }
    // ==========================================

    double sum = 0.0; // 暫存用的局部變數，放在 CPU Register 裡最快！

    // 2. Start work：只跑自己的 Rows
    for (size_t i = args->start_row; i < args->end_row; i++) {
        // Columns
        float current_row_max = args->matrix[i * MATRIX_SIZE];
        for (size_t j = 0; j < MATRIX_SIZE; j++) {
            // 一維陣列模擬二維陣列的公式： 索引 = (行號 * 總列數) + 列號
            sum += args->matrix[i * MATRIX_SIZE + j];
            if (args->matrix[i * MATRIX_SIZE + j] > current_row_max) {
                current_row_max = args->matrix[i * MATRIX_SIZE + j];
            }
        }
        args->row_max_array[i] = current_row_max;
    }

    // 3. 計算完畢，把結果寫入專屬的記憶體位置 (已 Padding 過，不會有 False Sharing)
    args->thread_results[tid].local_sum = sum;

    // print log to console
    printf("[Thread %d] 完成計算 Rows %zu ~ %zu，局部總和: %f\n", 
           tid, args->start_row, args->end_row - 1, sum);

    return NULL; // 執行緒結束
}
```
#### main.c
```c
 size_t rows_per_thread = MATRIX_SIZE / num_threads;
    printf("\n %d 個thread開始作業...\n", num_threads);
    
    for (int i = 0; i < num_threads; i++) {
        thread_args[i].thread_id = i;
        thread_args[i].core_id = i; // <- 先按順序綁
        
        thread_args[i].start_row = i * rows_per_thread;
        // 如果是最後一個執行緒，就包辦剩下的所有行數 (避免無法整除的問題)
        thread_args[i].end_row = (i == num_threads - 1) ? MATRIX_SIZE : (i + 1) * rows_per_thread;
        thread_args[i].matrix = matrix;
        thread_args[i].thread_results = thread_results;
        thread_args[i].row_max_array = row_max_array;

        // 建立執行緒並執行 worker_task
        pthread_create(&threads[i], NULL, worker_task, &thread_args[i]);
    }

```
```
# 查看CPU狀況
lscpu -e 

# return
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
  0    0      0    0 0:0:0:0          yes
  1    0      0    0 0:0:0:0          yes
  2    0      0    1 1:1:1:0          yes
  3    0      0    1 1:1:1:0          yes
  4    0      0    2 2:2:2:0          yes
  5    0      0    2 2:2:2:0          yes
  6    0      0    3 3:3:3:0          yes
  7    0      0    3 3:3:3:0          yes
```
依照cache，core0 由CPU0和1組成，而我並沒有P/E核的問題

這個core的切割，依賴於**Hyper-Threading**

### Step 5 exetreme condition
* 策略 A：資源獨享 (最佳效能)
把 4 個執行緒分配到完全不同的實體核心上。
    * 綁核目標：CPU 0, 2, 4, 6。
    * 優勢：每個人都有自己獨立的 L1/L2 Cache，計算單元完全不用跟別人擠。

* 策略 B：資源互卡 (最差效能)

* 把threads塞進同一個實體核心裡。
綁核目標：CPU 0, 1, 2, 3。
    * 劣勢：Thread 0 和 Thread 1 被迫擠在 CORE 0 裡面，它們必須搶同一個 L1 Cache 和 ALU（算術邏輯單元），效能絕對會互相拖累
 
 
```c
size_t rows_per_thread = MATRIX_SIZE / num_threads;
    printf("\n %d 個thread開始作業...\n", num_threads);
    // ================= Affinity =========================


    // section 2
    // int core_map[] = {0, 2, 4, 6, 8, 10, 12, 14};
    
    // section 3 
    int core_map[] = {0, 1};
    
    
    int map_size = sizeof(core_map) / sizeof(core_map[0]);
    // ====================================================
        
    for (int i = 0; i < num_threads; i++) {
        thread_args[i].thread_id = i;
        // section 1
        // thread_args[i].core_id = i;
        // section 2 & 3
        thread_args[i].core_id = core_map[i % map_size];
        
// 下方照舊
```
#### return (2 core, 8 threads)
```
 8 個thread開始作業...
[Thread 0] 成功綁在 CPU Core 0
[Thread 1] 成功綁在 CPU Core 1
[Thread 3] 成功綁在 CPU Core 1
[Thread 2] 成功綁在 CPU Core 0
[Thread 5] 成功綁在 CPU Core 1
[Thread 6] 成功綁在 CPU Core 0
[Thread 4] 成功綁在 CPU Core 0
[Thread 7] 成功綁在 CPU Core 1

>>> 核心計算耗時: 0.140819 秒 <<<
```
#### return (8 core, 8 threads)
```
 8 個thread開始作業...
[Thread 0] 成功綁在 CPU Core 0
[Thread 2] 成功綁在 CPU Core 4
[Thread 3] 成功綁在 CPU Core 6
[Thread 7] 成功綁在 CPU Core 14
[Thread 6] 成功綁在 CPU Core 12
[Thread 4] 成功綁在 CPU Core 8
[Thread 5] 成功綁在 CPU Core 10
[Thread 1] 成功綁在 CPU Core 2

>>> 核心計算耗時: 0.071867 秒 <<<
```
#### return (16 core, 8 threads)
```
[Thread 1] 成功綁在 CPU Core 1
[Thread 5] 成功綁在 CPU Core 5
[Thread 6] 成功綁在 CPU Core 6
[Thread 3] 成功綁在 CPU Core 3
[Thread 4] 成功綁在 CPU Core 4
[Thread 2] 成功綁在 CPU Core 2
[Thread 0] 成功綁在 CPU Core 0
[Thread 7] 成功綁在 CPU Core 7

>>> 核心計算耗時: 0.059652 秒 <<<
```

### Step 6 without padding
最後要做 False Sharing Defense 與 Cache Miss 驗證，若無padding
* Thread 0 的 local_sum 和 Thread 1 的 local_sum 會被塞進同一個 64 Bytes 的 Cache Line 裡。
* 當 Thread 0 更新自己的總和時，CPU 會為了保證資料一致性，強制把 Thread 1 核心裡的同一個 Cache Line 標記為「失效 (Invalid)」。
* Thread 1 下一次要寫入時，就發生了 Cache Miss，必須重新去主記憶體抓資料。

最後最慘就變成pingpong，OS介入
#### utils.h 增加
```
# define USE_PADDING 1 // 1: 有防禦，0: 無防禦

# if USE_PADDING
    // [有防禦] 在 local_sum 後面加上足夠的 padding，確保每個 thread_data_t 都獨占一個 Cache Line，避免 False Sharing
typedef struct  {
    double local_sum;
    alignas(CACHE_LINE_SIZE) long long final_count;
} thread_data_t;
#else
    // [無防禦] 乾淨溜溜的結構體，多個 local_sum 會擠在同一個 Cache Line 裡互相傷害
    typedef struct {
        double local_sum;
    } thread_data_t;
#endif
```
#### worker.c
```c
/*
 *  worker.c
 *  version 3.0
 *  Modified ver 3.0 on: 2025/4/1 
 *  Modified Content : affinity, padding
 *  Created on: 2025/3/27
 *  Author: Jimmy 
*/
// #define _GNU_SOURCE here or in utils.h (the head of the file)
#include "utils.h"


void* worker_task(void* arg) {
    // 1. transform void* arg to thread struct, pointer to the struct we defined in utils.h 
    thread_args_t *args = (thread_args_t*)arg;
    // here we can access the fields of the struct, e.g., args->thread_id, args->start_row, etc.
    int tid = args->thread_id;
    
    int core_id = args->core_id; // 從參數結構中獲取核心 ID
    
    // ==========================================
    // 執行緒綁核 (Thread Affinity) 邏輯
    // ==========================================
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);          // 1. 清空 CPU 集合
    CPU_SET(core_id, &cpuset);  // 2. 將我們指定的 core_id 加入集合
    // 取得當前執行緒並強制綁定
    pthread_t current_thread = pthread_self();
    int affinity_status = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
    if (affinity_status != 0) {
        printf("[Thread %d] 警告：無法綁定到 Core %d\n", tid, core_id);
    } else {
        // 印出來確認有綁定成功
        printf("[Thread %d] 成功綁在 CPU Core %d\n", tid, core_id);
    }
    // ==========================================

    #if USE_PADDING_WORKER
    double sum = 0.0; // 暫存用的局部變數，放在 CPU Register 裡最快！
    #else 
    args->thread_results[tid].local_sum = 0.0;
    #endif
    // 2. Start work：只跑自己的 Rows
    for (size_t i = args->start_row; i < args->end_row; i++) {
        // Columns
        float current_row_max = args->matrix[i * MATRIX_SIZE];
        for (size_t j = 0; j < MATRIX_SIZE; j++) {
            // 一維陣列模擬二維陣列的公式： 索引 = (行號 * 總列數) + 列號
            #if USE_PADDING_WORKER
            sum += args->matrix[i * MATRIX_SIZE + j];
            # else 
            args->thread_results[tid].local_sum += args->matrix[i * MATRIX_SIZE + j];
            #endif
            if (args->matrix[i * MATRIX_SIZE + j] > current_row_max) {
                current_row_max = args->matrix[i * MATRIX_SIZE + j];
            }
        }
        args->row_max_array[i] = current_row_max;
    }

    // 3. 計算完畢，把結果寫入專屬的記憶體位置 (已 Padding 過，不會有 False Sharing)
    #if USE_PADDING_WORKER
    args->thread_results[tid].local_sum = sum;
    printf("[Thread %d] 完成計算 Rows %zu ~ %zu，局部總和: %f\n", 
           tid, args->start_row, args->end_row - 1, sum);
    #else
    printf("[Thread %d] 完成計算 Rows %zu ~ %zu，局部總和: %f\n", 
           tid, args->start_row, args->end_row - 1, args->thread_results[tid].local_sum);
    #endif

    // print log to console

    
    return NULL; // 執行緒結束
}
```
結果其實不會差到太多，Compiler的優化有一定的影響，加上無法看到L1 miss
所以在utils.h內加上`volatile`
```c
# if USE_PADDING
typedef struct  {
    double local_sum;
    alignas(CACHE_LINE_SIZE) long long final_count;
} thread_data_t;
#else
    typedef struct {
        volatile  double local_sum;
    } thread_data_t;
#endif
```

```
# padding
>>> 核心計算耗時: 0.037044 秒 <<<

# without padding
>>> 核心計算耗時: 0.193459 秒 <<<
```

END