Parallel Programming and System LAB 1

## Env Setting 1. 安裝 Docker Desktop 2. 進入 Resources > WSL Integration。 3. 確認 "Enable integration with my default WSL distro" 已開啟。 4. reboot 5. vscode(IDE) install Dev Containers 6. 在 VS Code 中 F1 (或 Ctrl+Shift+P)。 7. 輸入並選擇：Dev Containers: Add Dev Container Configuration Files... 8. 右下角 Reopen in Container 9. 選擇環境 "Ubuntu" / "Debian" / "Linux Image (Full System)" 10. 選擇 Features (功能元件) 也能直接修改自動產生的 `.devcontainer/devcontainer.json` ## Check Docker ``` #正在運行 docker ps # 已停止的 docker ps -a # log docker logs <container_name 或 ID> docker logs -f <container> ``` ## Eval ``` usr/bin/time -v # 獲取詳細的 CPU 利用率與記憶體峰值。 perf stat # 獲取 CPU 週期、指令數、分支預測、快取命中等硬體層級數據。 # sudo apt install linux-tools-virtual linux-cloud-tools-virtual # 在wsl先安裝 # 還是不行就暴力破解 find /usr/lib/linux-tools -name perf # 假設你找到的是 /usr/lib/linux-tools/6.5.0-27-generic/perf sudo ln -sf /usr/lib/linux-tools/ <你找到的版本資料夾> /perf /usr/bin/perf numactl # 控制執行緒綁定到特定的 CPU 核心（用於測試 Thread Affinity）。 ``` ## Default Setting Check * Env : WSL2/ RTX 3050 / Core : 16 ```ter # Check how many core we have lscpu | grep "^CPU(s):" lscpu -e htop ``` * ENV : MACOS ```ter # pthread 內建 # openmp 無內建，原gcc 為clang 所以要 brew install gcc # check gcc not clang name ls /opt/homebrew/bin/gcc* # Would be like follow, then you can type gcc-(number) -foopenmp /opt/homebrew/bin/gcc-15 /opt/homebrew/bin/gcc-nm-15 /opt/homebrew/bin/gcc-ar-15 /opt/homebrew/bin/gcc-ranlib-15 # check CPU sysctl -n hw.ncpu # With logic sysctl hw.physicalcpu hw.logicalcpu # instructions, cache sysctl -a | grep machdep.cpu # full system system_profiler SPHardwareDataType | grep "Processors\|Cores\|Threads" ``` * Hello World (openmp) ```ter #include <omp.h> #include <stdio.h> int main() { #pragma omp parallel { printf("Hello from thread %d\n", omp_get_thread_num()); } return 0; } ``` ```ter # Execute gcc-15 -fopenmp parallel_lab.c -o parallel_lab ./parallel_lab ``` * pthread testing ```ㄎ #include <pthread.h> #include <stdio.h> #include <stdlib.h> #define NUM_THREADS 5 // 執行緒要執行的函數 void* print_hello(void* thread_id) { long tid = (long)thread_id; printf("Hello World! 我是執行緒 #%ld\n", tid); pthread_exit(NULL); } int main() { pthread_t threads[NUM_THREADS]; int rc; long t; for (t = 0; t < NUM_THREADS; t++) { printf("在 main 中：正在建立執行緒 %ld\n", t); // 建立執行緒 rc = pthread_create(&threads[t], NULL, print_hello, (void*)t); if (rc) { printf("錯誤：無法建立執行緒，回傳碼：%d\n", rc); exit(-1); } } // 等待所有執行緒執行完畢 (Join) for (t = 0; t < NUM_THREADS; t++) { pthread_join(threads[t], NULL); } printf("所有執行緒已完成，主程式結束。\n"); return 0; } ``` ``` # 編譯指令 (-lpthread 是連結執行緒函式庫) gcc test_pthread.c -o test_pthread -lpthread # 執行 ./test_pthread ``` ## Phase 1 > 引言 : 你在程式碼裡每創一個 pthread，Linux 核心就會幫你創一個對應的「排程單位」。 > > * 平行程式設計視角： > 你手上有 4 個任務，你開 4 個執行緒。 > * 作業系統視角： > 核心看到 4 個可以分配到不同 CPU 核心去跑的「小單位」 > > 所以結合 OS 和 Hardware 的角度來看，這裡有兩種cost > 1. System Cost > OS在建立與銷毀Threads，排程等所花費的時間，這包含了Memory Management(Stack、VirtualMem...)、Threads Management > 2. Executing Cost > 實際運算所花費的時間 > >這使得問題變得複雜，工作量與是否需要那麼多人手平行執行之間的權衡變得重要 ### User vs. Kernel 在現代 Linux (WSL/Ubuntu) 中，每一個 pthread 實際上對應到核心中的一個 task_struct * **User-level (pthread 函式庫)** pthread_t 標識符、執行緒區域儲存 (TLS)。 * **Kernel(system)-level(Linux 核心管的)** 核心堆疊 (Kernel Stack)、排程實體 (Scheduling Entity)、PID/TGID 管理。 #### Linux Kernel * **task_struct** 這是 Linux 核心中最核心的結構。在 Linux 眼中，其實沒有區分「進程」和「執行緒」，它們都叫 task。 * **Kernel Stack** 每個執行緒在執行系統呼叫（System Call）時，需要一個獨立的核心堆疊空間來存放暫存資料。 * **PID/TGID** 在核心裡，每個執行緒都有自己唯一的 PID。但它們共享同一個 TGID (Thread Group ID)，這就是你在 User 空間看到的「Process ID」。 #### TLS (Thread Local Storage) 這是每個執行緒私有的儲存。雖然大家共享大部分記憶體，但 TLS 讓每個執行緒可以擁有自己的全域變數副本（例如 errno） ### **pthread_create 與 clone()** > aka 執行緒的誕生當你呼叫 pthread_create 時，底層發生了什麼事？ 1. **clone() 系統呼叫** 這是 Linux 的特點。它不像 Unix 嚴格區分 fork()（生孩子）和 thread_create。 2. **資源共享（CoW）** clone() 可以透過參數決定要共享什麼。執行緒會共享： * 分頁表 (Page Table) 共享記憶體位址空間。 * 檔案描述符 (File Descriptors) 一個執行緒打開檔案，另一個也能讀。 * 初始化排程器核心把這個新的 task_struct 放進排程佇列，等待 CPU 執行。 ### Parallelism Trade off 如果你開了 8 個執行緒，在 8 核電腦上，OS能使它們真的在「同時」運作。但這是有trade off的 1. **threads's overhead** 雖然比進程（Process）輕量，但因為要進核心（Kernel Space）跑clone()，頻繁地建立與銷毀執行緒會造成效能瓶頸 2. **Kernel Constraint** 因為 1:1，你的執行緒總數受限於作業系統能開多少 task_struct #### Process 一個執行中的程式實例，擁有獨立的虛擬記憶體空間（包含程式碼、全域變數、Heap 堆積）、打開的文件描述符（File Descriptors）、環境變數，像是獨立的工作室。 ### Thread id Code ```c /* 為什麼要用 void*？平行程式設計中， // pthread_create 規定任務函式必須長這樣。 // 你可以傳入任何資料的位址。 // PID vs TID： // getpid() 一樣 # task_struct.tgid (thread group ID) // gettid() 不相同 #True process ID // 它們共用同一個 Process 空間 // 但各自擁有獨立的核心調度實體。 */ #include <pthread.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> // 為了呼叫 SYS_gettid #include <sys/types.h> void* thread_function(void* arg) { // 將傳進來的通用指標(void*) 轉回整數指標並取值 int id = *(int*)arg; // getpid(): 獲取進程 ID // syscall(SYS_gettid): 獲取 Linux 核心分配給這個執行緒的真正 ID printf("執行緒 %d: Process ID (PID) = %d, Thread ID (TID) = %ld\n", id, getpid(), syscall(SYS_gettid)); sleep(10); // 暫停一下，方便我們在終端機觀察 return NULL; } int main() { pthread_t t1, t2; int id1 = 1, id2 = 2; printf("主執行緒: Process ID (PID) = %d, Thread ID (TID) = %ld\n", getpid(), syscall(SYS_gettid)); pthread_create(&t1, NULL, thread_function, &id1); pthread_create(&t2, NULL, thread_function, &id2); pthread_join(t1, NULL); pthread_join(t2, NULL); return 0; } ``` ### Thread overhead code ```c #include <pthread.h> #include <stdio.h> #include <time.h> #define ITERATIONS 50000 void* dummy_func(void* arg) { return NULL; } int main() { struct timespec m_start, m_end; // 給 Monotonic 使用 time_t t_start, t_end; // 給 time() 使用 pthread_t thread; // 1. 記錄開始 t_start = time(NULL); clock_gettime(CLOCK_MONOTONIC, &m_start); for (int i = 0; i < ITERATIONS; i++) { pthread_create(&thread, NULL, dummy_func, NULL); pthread_join(thread, NULL); } // 2. 記錄結束 clock_gettime(CLOCK_MONOTONIC, &m_end); t_end = time(NULL); // 3. 計算並顯示 // .tv_sec 為libc定義的結構成員，表示秒數 // .tv_nsec 為libc定義的結構成員，表示ns double m_diff = (m_end.tv_sec - m_start.tv_sec) + (m_end.tv_nsec - m_start.tv_nsec) / 1e9; double t_diff =t_end - t_start; printf("精確測量 (clock_gettime): %f 秒\n", m_diff); printf("粗略測量 (time): %f 秒\n", t_diff); printf("平均每個 thread 建立耗時: %f us\n", (m_diff / ITERATIONS) * 1e6); return 0; } ``` ### lifecycle Compare Fuc vs thread ```c #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #define ITERATIONS 10000 // 最小任務函式 void* minimal_task(void* arg) { return NULL; } void pure_function_call() { minimal_task(NULL); } double get_time_ns() { struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts); return ts.tv_sec * 1e9 + ts.tv_nsec; } int main() { double start, end; pthread_t thread; // 1. 基準測試：普通函式呼叫 (User-space only) start = get_time_ns(); for (int i = 0; i < ITERATIONS; i++) { pure_function_call(); } end = get_time_ns(); printf("平均純函式呼叫耗時: %.2f ns\n", (end - start) / ITERATIONS); // 2. 執行緒建立與銷毀 (User <-> Kernel switch) start = get_time_ns(); for (int i = 0; i < ITERATIONS; i++) { pthread_create(&thread, NULL, minimal_task, NULL); pthread_join(thread, NULL); } end = get_time_ns(); printf("平均 pthread_create + join 耗時: %.2f ns\n", (end - start) / ITERATIONS); return 0; } ``` #### CMD Command : System Calls Eval ``` strace -c ./lifecycle ``` > 在這裡，clone 與 mmap (分配執行緒堆疊)理論上來說最多，但目前是empty task，可以改用下面用來測/bin/time 的 code ，會比較明顯 ##### return ```ter 平均純函式呼叫耗時: 1.07 ns 平均 pthread_create + join 耗時: 323344.39 ns % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 60.96 0.521357 52 10000 clone 25.32 0.216528 10 20001 rt_sigprocmask 13.65 0.116749 11 10000 10000 clone3 0.06 0.000547 11 47 27 futex 0.00 0.000031 15 2 write 0.00 0.000000 0 1 read 0.00 0.000000 0 2 close 0.00 0.000000 0 9 mmap 0.00 0.000000 0 5 mprotect 0.00 0.000000 0 1 munmap 0.00 0.000000 0 3 brk 0.00 0.000000 0 1 rt_sigaction 0.00 0.000000 0 4 pread64 0.00 0.000000 0 1 1 access 0.00 0.000000 0 1 execve 0.00 0.000000 0 2 1 arch_prctl 0.00 0.000000 0 1 set_tid_address 0.00 0.000000 0 2 openat 0.00 0.000000 0 3 newfstatat 0.00 0.000000 0 1 set_robust_list 0.00 0.000000 0 1 prlimit64 0.00 0.000000 0 1 getrandom 0.00 0.000000 0 1 rseq ------ ----------- ----------- --------- --------- ---------------- 100.00 0.855212 21 40090 10029 total ``` #### CMD Command : System/CPU ``` # Should be run in WSL or other, but no container /usr/bin/time -v ./lifecycle ``` * Voluntary Context Switches Benchmark of **Synchronization** (Waiting)，類似於工作做完，釋放算力，再向OS要task來跑 ##### return ``` 平均純函式呼叫耗時: 1.03 ns 平均 pthread_create + join 耗時: 66025.08 ns Command being timed: "./1" User time (seconds): 0.01 System time (seconds): 0.32 --> Percent of CPU this job got: 50% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.66 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 1664 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 78 Voluntary context switches: 9980 Involuntary context switches: 0 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ``` ##### Sample Testing Code ```c #include <pthread.h> #include <stdio.h> #include <stdlib.h> #define LIMIT 1000000000LL // 十億 #define THREADS 4 // 使用 4 個執行緒 // 傳給執行緒的參數包 typedef struct { long long start; long long end; long long partial_sum; } ThreadData; // 每個員工要做的工作：把分配到的範圍加起來 void* sum_range(void* arg) { ThreadData* data = (ThreadData*)arg; data->partial_sum = 0; for (long long i = data->start; i <= data->end; i++) { data->partial_sum += i; } return NULL; } int main() { pthread_t threads[THREADS]; ThreadData data[THREADS]; long long chunk_size = LIMIT / THREADS; // 1. 分派任務 for (int i = 0; i < THREADS; i++) { data[i].start = i * chunk_size + 1; data[i].end = (i == THREADS - 1) ? LIMIT : (i + 1) * chunk_size; pthread_create(&threads[i], NULL, sum_range, &data[i]); } // 2. 等待所有人收工，並把各組結果加起來 long long total_sum = 0; for (int i = 0; i < THREADS; i++) { pthread_join(threads[i], NULL); total_sum += data[i].partial_sum; } printf("計算完成！1 到 %lld 的總和是: %lld\n", LIMIT, total_sum); return 0; } ``` ### 更多的 Threads 會更好!? 從最開始的測量Threads overhead到最後的System/CPU Eval ，看得出純僱傭一個人手的開銷與中間的溝通成本，在最後的System/CPU中，可以看到core的利用率和user/system time，即便利用率高，但若工作量本身低，則多工的溝通與僱傭成本將拖垮效能。而記憶體層面則根據`ulimit -s`確認stack的預設值，進而簡單推導所被預留的空間佔用。 --- `ulimit -s ` 是 User Limit 的縮寫，而 -s 代表 Stack（堆疊）。 * 定義它規定了作業系統分配給每一個執行緒（或進程）的最大堆疊空間。 * 預設值在大多數 Linux 系統中，這個值通常是 8192 KB (8MB)。 * 平行程式在 1:1 下，每創一個pthread，核心就會在虛擬記憶體中「預留」8MB 的空間給該執行緒存放區域變數與函數呼叫記錄。 > 這很可怕，假設創了1000個threads，他就預留了8MB x 1000 = 8GB，的VM 可以透過CMD指令來知道thread的預設stack 單位空間 ``` ulimit -s ``` ## Question #### 但為什麼分配的是「虛擬記憶體」而不是「實體記憶體」? * 虛擬記憶體（Virtual Memory）當你建立一個執行緒，OS 會給它一個 8MB 的 Stack 空間。這是紙面上預留的，並非真實占用，所以當下Memory是空的，只是被劃下來，直到thread 執行，才分配實體stack > 但如果被劃分到滿了呢? 即便你只有 16GB 的實體 RAM，你依然可以讓程式預留遠超過16GB的VM，如果真的溢滿，則啟動這三種方式 1. **Denied**直接拒絕 2. **Swap**根據記憶體管理策略進行DRAM與RAM間的data swap 通常是LRU 3. **OOM Killer** 直接砍掉占比最大或消耗資源大的程式，解放空間 ## Analysis Code ### Time ```c #include <time.h> struct timespec m_start, m_end; // Begin Eval clock_gettime(CLOCK_MONOTONIC, &start) // your code clock_gettime(CLOCK_MONOTONIC, &start) // end double thread_time = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9; // Current time_t start = time() // code here time_t end = time() int cost = end - start; ``` LAB 2 https://hackmd.io/@1p0_VOUHTXGcHlBU9A0OEA/HkyDmOzsWx