# Week 12 — Team 27
**Members:** @112062124, @111000116, @111006239 (student IDs)
:::info
You can write in Chinese.
:::
## Lecture & Reading Notes
### Limitations of Processes
- **Isolation:**
  The OS gives each process its own virtual address space, so processes cannot interfere with one another.
- **High Communication Cost:**
  Inter-process communication (IPC) requires a system call and a trap into the kernel → expensive.
- **Concurrency vs Parallelism**
  - **Concurrency**: a single core fakes "running at the same time" via context switching.
  - **Parallelism**: multiple cores really do run at the same time, with no context-switching overhead.
### Introducing Threads
- If strong isolation is not required, **threads** can be used instead.
- **Definition**: a single process can contain multiple threads.
- **Advantages:**
  - **Shared Memory**: heap and code are shared → no need to enter the kernel → sharing data is much cheaper.
  - **Low Overhead**: `pthread_create` is far lighter-weight than `fork()`.
---
### Structure of a Thread
Threads are lightweight units of execution that all live inside the same virtual address space.
#### Shared (Process-wise)
- Virtual Address Space
- Global Variables / Heap / Code
- File Descriptors
#### Private (Thread-wise)
- Thread ID
- Program Counter (PC)
- Registers
- **Stack** (the key difference: each thread has its own stack)
---
### Process vs Thread
| Feature | Process | Thread |
|--------|---------|--------|
| Address Space | Isolated | Shared |
| Creation Cost | High (`fork`) | Low (`pthread_create`) |
| Communication | IPC + kernel trap | Shared memory, no trap |
| Crash Scope | Isolated (one process crashing does not affect others) | Any thread crash → the whole process crashes |
| Security | Strong isolation | Weaker isolation |
| Summary | **Security** | **Performance** |
---
### Motivations for Multi-threading
#### 1. Latency-Oriented Concurrency (e.g., a Web Server)
- Goal: quickly switch among a large number of requests that block on I/O
- Typically one thread is assigned to each user request
#### 2. Throughput-Oriented Parallelism (e.g., HPC)
- Goal: split CPU-bound work into small tasks to speed it up
- **Fork-Join Pattern**: the main thread dispatches work → waits for all workers to finish (see the minimal sketch below)
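
A minimal fork-join sketch in C (our own illustration, not code from the lecture; the array size, chunking, and the `sum_chunk` helper are assumptions chosen for demonstration):

```c
// Fork-join sketch: the main thread splits a sum over an array into
// NUM_THREADS chunks, forks one worker per chunk, then joins and combines.
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NUM_THREADS 4

static int data[N];

typedef struct { int start, end; long long partial; } chunk_t;

static void *sum_chunk(void *arg) {
    chunk_t *c = (chunk_t *)arg;
    c->partial = 0;
    for (int i = c->start; i < c->end; i++)
        c->partial += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1;

    pthread_t tid[NUM_THREADS];
    chunk_t chunks[NUM_THREADS];

    // Fork: dispatch one chunk per worker.
    for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t].start = t * (N / NUM_THREADS);
        chunks[t].end   = (t + 1) * (N / NUM_THREADS);
        pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
    }

    // Join: wait for the slowest worker (this is where load imbalance hurts).
    long long total = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].partial;
    }
    printf("total = %lld\n", total);
    return 0;
}
```

The join loop waits for every worker, so a single slow worker delays the combined result; this is the load-imbalance effect discussed below.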
---
### Amdahl's Law: The Upper Bound on Speedup
Adding more cores ≠ linear speedup.
#### Formula
- $P$: the parallelizable fraction of the work
- $1-P$: the serial (non-parallelizable) fraction
- $N$: number of workers / cores

$$S(N) = \frac{1}{(1-P) + \frac{P}{N}}$$

Even if $N \to \infty$, the speedup is capped at $\frac{1}{1-P}$.
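
A quick worked example (the value $P = 0.9$ is our own illustration, not a number from the lecture):

$$S(8) = \frac{1}{0.1 + \frac{0.9}{8}} \approx 4.7, \qquad \lim_{N \to \infty} S(N) = \frac{1}{0.1} = 10$$

So 8 cores only buy about a 4.7× speedup, and no number of cores can push past 10× as long as 10% of the work stays serial.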
---
### Why Real Speedup Is Worse Than Amdahl’s Bound
#### A. Hardware Limits & Oversubscription
- Heavy load on many cores → temperature rises → **clock frequency drops (thermal throttling)**
- More threads than CPU cores → **context-switching overhead**
#### B. Turbo Boost & Load Imbalance
- Turbo Boost lets a few cores run faster → threads progress at uneven speeds
- Fork-Join waits for the slowest thread → the whole job is dragged down
- Sometimes disabling Turbo Boost (all cores fixed at the same frequency, faster on average) is actually quicker overall
#### C. Thread Creation Overhead
- Thread creation cost is too high for short tasks
- **Solution: Thread Pool** (a minimal sketch follows this list)
  - Create a fixed set of threads at startup
  - Hand tasks directly to the existing workers
  - Amortize the creation cost
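
A minimal thread-pool sketch (our own illustration under simplifying assumptions: a fixed-size queue of integer "tasks", all tasks enqueued up front, no dynamic resizing or error handling):

```c
// Thread-pool sketch: workers are created once and then reused for many tasks.
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define NUM_TASKS   16

typedef struct {
    int tasks[NUM_TASKS];
    int head, tail;          // head == tail means the queue is empty
    int done;                // set once no more tasks will be enqueued
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
} task_queue_t;

static task_queue_t q = {
    .head = 0, .tail = 0, .done = 0,
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
};

static void *worker(void *arg) {
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&q.lock);
        while (q.head == q.tail && !q.done)
            pthread_cond_wait(&q.not_empty, &q.lock);
        if (q.head == q.tail && q.done) {        // queue drained, pool shutting down
            pthread_mutex_unlock(&q.lock);
            return NULL;
        }
        int task = q.tasks[q.head++];
        pthread_mutex_unlock(&q.lock);

        printf("[worker %ld] running task %d\n", id, task);   // stand-in for real work
    }
}

int main(void) {
    pthread_t pool[NUM_WORKERS];

    // Pay the creation cost once, at startup.
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&pool[i], NULL, worker, (void *)i);

    // Submitting a task is just an enqueue + signal, not a pthread_create.
    pthread_mutex_lock(&q.lock);
    for (int t = 0; t < NUM_TASKS; t++)
        q.tasks[q.tail++] = t;
    q.done = 1;
    pthread_cond_broadcast(&q.not_empty);
    pthread_mutex_unlock(&q.lock);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(pool[i], NULL);
    return 0;
}
```

Handing out a task is only an enqueue protected by a mutex, so the per-task cost no longer includes `pthread_create`.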
---
### Universal Scalability Law (USL)
With too many workers, performance actually goes down (the curve bends downward).
#### Cause
- **Coherence / Synchronization Cost**
  Every additional worker requires more synchronization → the whole system slows down (the formula below makes this cost explicit).
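
For reference, the USL is commonly written as follows (the exact form is an outside reference, not something shown on the slides); $\alpha$ models contention (the Amdahl-style serial fraction) and $\beta$ models coherence / synchronization cost:

$$S(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}$$

The $\beta N(N-1)$ term grows roughly quadratically with the number of workers, which is why the curve eventually bends downward instead of merely flattening out.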
#### Real Examples
1. **ZSTD (compression)**
   - The compression work itself can be parallelized
   - But **writing the output file** must be sequential → bottleneck
2. **Cloverleaf (HPC)**
   - Performance stops improving beyond 4 threads
   - Cause: **insufficient memory bandwidth**
---
## Shio: Threads and “Private Stack” Experiment
**The lecture states that threads have "their own stack." Does this mean the stack memory of Thread A is strictly protected from Thread B (like process isolation)? Or is it just a logical separation?**
It is only a logical separation, not a physical hardware protection.
In Operating System design, "private stack" means that the Kernel allocates a specific memory region for each thread to manage its own function calls and local variables, and assigns a unique Stack Pointer (%rsp or %sp) to it.
However, because all threads within the same process share the same Virtual Address Space (VAS) and the same Page Table, there is no hardware mechanism (like the MMU) preventing one thread from accessing another thread's stack memory.
---
### Experiment Verification:
To verify this, I wrote a C program (stack_hack.c) where the Main Thread passes the memory address of a local variable (located on the Main Stack) to a Worker Thread. The Worker Thread then attempts to dereference this pointer and modify the value.
```c
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

typedef struct {
    int id;
    int *target_addr;   /* address of a variable on the MAIN thread's stack */
} thread_arg_t;

void* worker(void* arg) {
    thread_arg_t *args = (thread_arg_t*)arg;
    // Log: thread's view before modification
    printf("[TID:%d] READ  @ %p -> VAL: %d (0x%X)\n",
           args->id, (void*)args->target_addr, *args->target_addr, *args->target_addr);
    // Modification: write a magic number into the main thread's stack variable
    *args->target_addr = 0xDEADBEEF;
    // Log: thread's view after modification
    printf("[TID:%d] WRITE @ %p -> VAL: %d (0x%X)\n",
           args->id, (void*)args->target_addr, *args->target_addr, *args->target_addr);
    return NULL;
}

int main() {
    int stack_var = 0xCAFEBABE;          /* lives on the main thread's stack */
    pthread_t t;
    thread_arg_t args;

    printf("[MAIN] INIT  @ %p -> VAL: %d (0x%X)\n", (void*)&stack_var, stack_var, stack_var);

    args.id = 1;
    args.target_addr = &stack_var;

    if (pthread_create(&t, NULL, worker, &args) != 0) {
        perror("pthread_create failed");
        return 1;
    }
    pthread_join(t, NULL);

    printf("[MAIN] CHECK @ %p -> VAL: %d (0x%X)\n", (void*)&stack_var, stack_var, stack_var);
    return 0;
}
```
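To reproduce (assumed build command, not stated in the original note): `gcc -o stack_hack stack_hack.c -pthread`, then run `./stack_hack`.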
---
### Experimental Data (Log Output):

---
### Deep Analysis & Findings:
**1. Proof of Shared Virtual Address Space**
The most critical observation is the address 0x16d8c2cd8.
The Main Thread allocated a variable at this address.
The Worker Thread (TID:1) was able to READ from and WRITE to this exact same address.
In a multi-process environment (like using fork()), the child process would have a copy of the address space (Copy-on-Write). Modifying a variable in the child would not change the value in the parent.
Here, the change was instantaneous and visible to the Main Thread (CHECK step showed 0xDEADBEEF), proving they are operating on the same physical memory pages mapped to the same virtual addresses.
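
For contrast, a minimal `fork()` sketch of the same idea (our own illustration, not part of the original experiment): the child writes to the same virtual address, but the parent never sees the change because the page is duplicated via copy-on-write rather than shared.

```c
// fork() counter-example: the child's write is NOT visible to the parent.
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int stack_var = 0xCAFEBABE;
    pid_t pid = fork();
    if (pid == 0) {
        // Child: same virtual address, but a private (copy-on-write) page.
        stack_var = 0xDEADBEEF;
        printf("[CHILD]  @ %p -> 0x%X\n", (void*)&stack_var, stack_var);
        exit(0);
    }
    wait(NULL);
    // Parent still sees the original value, unlike the thread experiment above.
    printf("[PARENT] @ %p -> 0x%X\n", (void*)&stack_var, stack_var);
    return 0;
}
```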
**2. The "Private Stack" Illusion**
Why do we say threads have private stacks if they aren't protected?
- **Logical Layout:** The OS allocates stack space for Thread 1 at address range $A$, and Thread 2 at address range $B$. As long as Thread 1 only uses its own stack pointer (%rsp), it will naturally stay within range $A$.
- **No Firewall:** However, my experiment shows that if I explicitly pass a pointer (an address from range $A$) to Thread 2, Thread 2 can reach across and modify it. The CPU's Memory Management Unit (MMU) does not block this, because ranges $A$ and $B$ are both mapped by the same page table.
**3. Implication: Performance vs. Safety**
This design choice is the fundamental reason why Threads are faster than Processes but also more dangerous.
Performance: Because the Page Table is shared, switching between threads (Context Switch) does not require flushing the TLB (Translation Lookaside Buffer). This makes thread switching extremely fast.
Safety Risk: The lack of isolation means a bug in one thread (e.g., a buffer overflow or a wild pointer) can corrupt the stack of another thread, causing the entire process to crash or behave unpredictably. This is known as Memory Corruption and is a common source of difficult-to-debug errors in parallel programming.
---
## Shio: Process vs. Thread Creation Overhead Experiment
**We learned that creating a thread is faster than forking a process. But *how much* faster? Is it 2x? 100x? Let's quantify the "overhead" mentioned in the lecture.**
### Threads are approximately 15x to 20x faster to create than processes.
:::info
**Core Concept**
The term "Lightweight" isn't just a buzzword. It refers to the specific kernel operations required to spawn an execution unit. `fork()` requires duplicating the entire memory map (Page Tables), whereas `pthread_create()` allows the new unit to share the existing memory map.
:::
#### Experiment Verification
I wrote a benchmark program `perf_benchmark.c` to perform a stress test:
```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>
#include <time.h>

#define ITERATIONS 10000

void* noop_thread(void* arg) {
    return NULL;
}

double get_time_diff(struct timespec start, struct timespec end) {
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
}

int main() {
    struct timespec start, end;
    double proc_time, thread_time;
    pthread_t t;

    printf("BENCHMARK: Creation & Teardown (N=%d)\n", ITERATIONS);
    printf("------------------------------------------------\n");
    printf("%-10s | %-12s | %-15s\n", "Type", "Total(s)", "Avg/Op(us)");
    printf("------------------------------------------------\n");

    // 1. Process benchmark: fork a child that exits immediately, then reap it.
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = fork();
        if (pid == 0) _exit(0);          // _exit avoids re-flushing duplicated stdio buffers
        else if (pid > 0) wait(NULL);
        else { perror("fork failed"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    proc_time = get_time_diff(start, end);
    printf("%-10s | %-12.4f | %-15.2f\n", "Process", proc_time, (proc_time/ITERATIONS)*1e6);

    // 2. Thread benchmark: create a no-op thread and join it.
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_create(&t, NULL, noop_thread, NULL);
        pthread_join(t, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    thread_time = get_time_diff(start, end);
    printf("%-10s | %-12.4f | %-15.2f\n", "Thread", thread_time, (thread_time/ITERATIONS)*1e6);

    printf("------------------------------------------------\n");
    printf("Ratio (Process/Thread): %.2fx\n", proc_time / thread_time);
    return 0;
}
```
1. **Process Mode:** Sequentially `fork()` and `wait()` for 10,000 child processes.
2. **Thread Mode:** Sequentially `pthread_create()` and `pthread_join()` for 10,000 threads.
3. **Measurement:** Calculated the total wall-clock time and average time per operation.
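Assumed build command (again, not part of the original note): `gcc -O2 -o perf_benchmark perf_benchmark.c -pthread`.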
**Log Output:**

### Deep Analysis & Findings
#### I. The Quantifiable Cost of Isolation
The experiment data shows a massive performance gap:
- **Process Creation:** ~145 microseconds per op.
- **Thread Creation:** ~9 microseconds per op.
- **Speedup:** Threads are **16.31x faster** in this specific environment.
---
#### II. Why is fork() so expensive?
Even with optimizations like Copy-On-Write (COW), fork() is heavy because the kernel must:
- **Duplicate the Page Table:**
The OS must iterate through the parent's page table entries and create a copy for the child. Even if the data isn't copied, the mapping structure must be.
- **Clone Kernel Structures:**
Duplicating file descriptors (struct files_struct), signal handlers, and virtual memory areas (vm_area_struct).
- **TLB Flush:**
Context switching between different processes usually requires flushing the Translation Lookaside Buffer (TLB) (unless PCID is used), adding hidden costs.
---
#### III. Why is pthread_create so cheap?
In Linux, threads are essentially processes that share resources: they are created via the clone() system call with the `CLONE_VM` flag (a simplified sketch follows the list below).
- **Shared Page Table:**
The new thread points to the same Page Directory base address. No page table copying is needed.
- **Shared File Descriptors:**
No need to duplicate the file table.
- **Overhead:**
The main cost is merely allocating a Stack (usually via `mmap`) and a small `task_struct`.
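
A simplified, hedged sketch of what this looks like at the system-call level (our own illustration; real pthread_create additionally passes flags such as CLONE_THREAD and CLONE_SETTLS and does more bookkeeping, all omitted here):

```c
// Illustration of a clone() call that shares the parent's address space.
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg) {
    printf("child runs in the SAME address space, arg = %s\n", (char *)arg);
    return 0;
}

int main(void) {
    // The main per-"thread" cost: allocating a stack with mmap.
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) { perror("mmap"); return 1; }

    // CLONE_VM / CLONE_FS / CLONE_FILES: share, rather than duplicate, the
    // parent's memory map and file tables. No page-table copy happens here.
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD;
    pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, "hello");
    if (pid < 0) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    munmap(stack, STACK_SIZE);
    return 0;
}
```

The expensive parts of fork() (page-table duplication, copying the mm state) simply never happen, because CLONE_VM tells the kernel to reuse the parent's memory descriptor.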
---
#### IV. Architectural Implications
:::success
**Design Takeaway**
This result validates why modern high-concurrency servers (like Nginx, or Go's goroutines) avoid the "Process-per-Request" model (like the old Apache MPM prefork).
:::
- **Scalability:**
If a web server receives 10,000 requests per second, using fork() would consume significantly more CPU cycles just for setup/teardown, limiting throughput.
- **Responsiveness:**
  For latency-sensitive applications (e.g., UI or real-time systems), the ~130 us extra cost per task accumulates quickly and is unacceptable.
- **The Limit:**
  However, even ~9 us per thread is not free.
  This is why **thread pools** are used: threads are created once at startup and reused, so the creation cost is paid only once.
## Hands-on Lab
### Answers to Reflection Questions
## 1. Server A — Single-Threaded Blocking
### Q: Why did the latency for **c = 4** make sense (≈ 4 × 0.02s)?
**Reason: Head-of-Line Blocking**
Server A is **single-threaded** and **blocking**.
Requests are processed *strictly one-by-one*.
For c=4:
- Req 1 → 0.02s
- Req 2 → waits 0.02s + 0.02s = 0.04s
- Req 3 → waits 0.04s + 0.02s = 0.06s
- Req 4 → waits 0.06s + 0.02s = 0.08s
This is exactly linear accumulation.
---
### Q: Why did **c = 16** produce far worse latency (≫ 16 × 0.02s)?
**Reason: Queue Saturation**
At high concurrency:
- Arrival rate > Service rate
- Server utilization = 100%
- The client (“hey”) immediately sends new requests whenever a response arrives
- The server cannot drain the queue fast enough
The result is **persistent queue buildup**: each request waits behind many others, so latency blows up far beyond the ideal 16 × 0.02 s estimate.
---
## 2. Server B — Thread Pool (8 Workers)
### Q: Why was latency for **c = 4** so good (~0.02s)?
**Reason: Resource Oversupply**
- We have **8 worker threads**
- Only **4 incoming requests**
Since **Threads > Requests**, no queue is formed.
Each request is immediately assigned to an idle worker:
→ Pure processing time ≈ 0.02s.
---
### Q: Why did the median latency double to ~0.04s when **c = 16**?
**Reason: Saturation → Two Batches**
We have 16 requests but only 8 workers.
**Batch 1 (Req 1–8):**
- Runs immediately
- Finishes in ~0.02s
**Batch 2 (Req 9–16):**
- Must wait for Batch 1
- Wait 0.02s + run 0.02s
→ **Total = 0.04s**
Half of the requests take 0.02s, the other half 0.04s →
Median shifts to **≈ 0.04s**.
This illustrates the effect of **queue depth** and **limited worker capacity**.
---
## 3. Server C — AsyncIO (Single Thread, Non-Blocking)
### Q: Why did median latency stay ~0.02s even at **c = 16**?
**Reason: Non-blocking Overlap (No Blocking Sleep)**
AsyncIO does:
1. Accept Req 1
2. Schedule timer (0.02s)
3. Yield immediately (microseconds)
4. Accept Req 2
5. Schedule timer
6. Yield
7. … repeats up to Req 16
Because the awaited **sleep(0.02)** does *not* block the thread:
- All requests *start their timers at almost the same moment*
- All wake up ~0.02s later
Thus latency remains **≈ the I/O wait time itself**.
Non-blocking I/O allows perfect overlap.
---
## 4. Scenario Application
### Q: You want to run a Line Bot server on a dorm PC.
It may handle up to **1000 concurrent chat connections**.
Which model should you choose: **Thread Pool** or **AsyncIO**?
## **Choice: AsyncIO (Server C)**
### Why?
#### 1. Memory Efficiency
- **Thread Pool:**
  Every thread needs its own stack (MB-scale, mostly reserved virtual memory).
  1000 threads → **GBs of reserved stack space** plus per-thread kernel structures → risk of OOM or swapping.
- **AsyncIO:**
  1000 connections ≈ 1000 lightweight coroutine/Task objects
Memory usage is tiny.
#### 2. Minimal Context Switching
- **Thread Pool:**
OS must context-switch between 1000 threads → expensive.
- **AsyncIO:**
Switching happens in user space (event loop) → cheap.
#### 3. Perfect for I/O-bound Workloads
A Line Bot mainly waits for:
- incoming messages
- LINE API responses
This is **I/O-bound**, exactly what AsyncIO excels at.
...
### Shio (Explorations & Discoveries)
- Extra question(s) we asked:
- Q: <...> A: <what we found / how we tested>
- Our Discovery(ies):
### What I learned
(each member writes 2–4 sentences, in their own words)
### @112XXXXXX
<your reflection here – ok to write Chinese, **no LLMs**>
### @111000116
What really surprised me was the practical reality beyond the theory: specifically, the "Red Curve" in the Universal Scalability Law. It’s crazy to think that after a certain point, adding more threads actually hurts performance because the cost of coordination and synchronization outweighs the benefit of extra computing power. Also, the bit about Turbo Boost was counter-intuitive; I never thought that a feature designed to make CPUs faster could actually cause load imbalance and thermal throttling in parallel workloads. It made me realize that in OS design, "coordination" is often much more expensive than the actual "computation." We trade the safety of Processes for the speed of Threads, but we pay for it with complexity.
### @112062124
All the threads inside one process are like employees in the same department: they share the same resources (code, data segment, heap), so they can work on the same job together. But once the headcount starts to climb, the question is no longer just "how many people can work in parallel"; extra coordination costs appear, such as how work is distributed (arrangement) and whether tasks depend on one another (coherency). All of this extra synchronization, scheduling, and resource contention lowers overall efficiency, and real hardware adds temperature and power limits on top of that.
I find this similar to a concept from my Introduction to Computer Networks course:
When transferring a large file, the data can be split into many small chunks, which improves queueing fairness and also improves reliability, since a dropped packet only affects a few chunks. But the more you split, the more packets there are, and the headers (addressing and control information) consume more bandwidth, which lowers efficiency again.
## Content Sharing Agreement
**Please uncheck if you disagree!**
- [X] We agree that excerpts of this note may be shared by TA to Discord if our team's note is selected as *shio* highlight.
- [X] We agree that our team ID may be shown on Discord if our team's note is selected as *shio* highlight.
- [X] We agree that the instructor can show excerpts of this note if our team's note is selected as *shio* highlight.
:::warning
* Please don't share this HackMD note directly with non-teammates. But you are encouraged to help others (including non-teammates) on Discord. [Follow our collaboration policy](https://sys-nthu.github.io/os25-fall/admin/policies.html#collaboration-citation)
* Please don't peek at or copy another team's HackMD, or use an LLM to generate your reflection.
:::