# Multiprocessor Scheduling
## Multiprocessor Architecture
Modern processors place a cache next to each CPU core so the core does not have to go to main memory on every access. The cache records frequently used data; it is far smaller and far more expensive per byte than main memory, but main memory sits too far from the core and has too much latency to be used directly every time, which is why the cache is needed.
Caches pay off because programs exhibit two kinds of locality:
- Temporal Locality: recently used data is likely to be used again soon, so keeping it in the cache avoids repeated trips to main memory
- Spatial Locality: when one element of an array is accessed, its neighbors are likely to be accessed too, so the cache pulls in the surrounding data as well
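The simulator used in the homework below captures the temporal side of this: once a job's working set fits in the cache and the cache warms up, the job runs at the warm rate (2x with the default `-r`). A minimal illustrative run, using only flags described in the homework:

```shell
# Working set 50 fits in the default cache of 100: after the warmup
# period the -C column shows 'w' and the remaining time (-T) starts
# dropping by 2 per tick instead of 1.
./multi.py -n 1 -L a:30:50 -c -t -T -C
```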

## Cache Coherence
When a process modifies data, the update lands in its core's cache first, and for efficiency the cache does not write every update back to main memory right away. If a process on another CPU core then needs the data at the same memory address, the copy it fetches from main memory will be wrong, because the fresh update has not been written back yet. The hardware therefore needs a mechanism to mark which data in main memory is out of date, so that a process either reads the data from the other core's cache or forces that core to flush its update back to main memory.
## Cache Affinity
After a process has run on a CPU core for a while, the cache and the TLB are full of its state, so the process runs fast: it rarely has to touch high-latency main memory. A scheduler should therefore try to keep each process on the same CPU core whenever possible.
### SQMS
SQMS (Single Queue Multiprocessor Scheduling) has all CPU cores share one job queue. The upside is that there is only one queue to manage; the downsides are poor scalability (contention on the shared queue grows as cores are added) and almost no cache affinity: a job may be dispatched to any core, and if that is not the core it last ran on, its cache and TLB contents must be rebuilt from scratch.
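The simulator in the homework below models SQMS by default (its traces print a single `Scheduler central queue`), which makes the lost affinity easy to watch. An illustrative run with hypothetical jobs:

```shell
# Three jobs, two CPUs, one shared queue: each quantum the central
# scheduler hands a job to whichever CPU comes up next, so a job
# rarely stays on the core that holds its cached state.
./multi.py -n 2 -L x:60:100,y:60:100,z:60:100 -t -C
```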


### MQMS
MQMS (Multiple Queue Multiprocessor Scheduling) gives each CPU core its own job queue. This policy scales: when a new core is added, just add a queue. The new problem is load balancing: the number of jobs per core drifts apart over time, so the scheduler needs a way for lightly loaded cores to help heavily loaded ones.


The answer is job migration: offload jobs from a very busy core onto a less busy one. An idle core can check at a fixed interval whether a busy core needs help, but note that the interval must not be too short, because communication between cores is itself expensive.
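In the simulator, the `-p` flag enables exactly this per-CPU-queue policy and `-P` sets the interval at which an idle CPU peeks at the other queues; question 8 below sweeps that interval. A single run looks like:

```shell
# MQMS-style scheduling: each CPU gets its own queue, and an idle
# CPU checks the others every 10 ticks for work to migrate over.
./multi.py -p -P 10 -n 2 -L a:100:100,b:100:50,c:100:50 -c
```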

## Homework
In this homework, we’ll use `multi.py` to simulate a multi-processor CPU scheduler, and learn about some of its details. Read the related `README` for more information about the simulator and its options.
1. To start things off, let’s learn how to use the simulator to study how to build an effective multi-processor scheduler. The first simulation will run just one job, which has a run-time of `30`, and a working-set size of `200`. Run this job (called job ’a’ here) on one simulated CPU as follows: `./multi.py -n 1 -L a:30:200`. How long will it take to complete? Turn on the `-c` flag to see a final answer, and the `-t` flag to see a tick-by-tick trace of the job and how it is scheduled.
> It takes `30` time units. The default cache size is `100` and this job's working set is `200`, so the cache can never become warm for it, and the job runs to completion with a cold cache.
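For reference, the run the answer describes (the finish time comes from the `-c` output):

```shell
# Working set 200 does not fit the default cache of 100, so the
# cache never warms and the job needs all 30 ticks.
./multi.py -n 1 -L a:30:200 -c -t
# Finished time 30
```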
2. Now increase the cache size so as to make the job’s working set (size=`200`) fit into the cache (which, by default, is size=`100`); for example, run `./multi.py -n 1 -L a:30:200 -M 300`. Can you predict how fast the job will run once it fits in cache? (hint: remember the key parameter of the warm rate, which is set by the `-r` flag) Check your answer by running with the solve flag (`-c`) enabled.
> Change the command from question 1 to `./multi.py -n 1 -L 1:30:200 -M 200` (an equivalent variant of the suggested command: the cache only needs to be at least the working-set size of `200`). The job now finishes in `20` time units: after the first `10` time units the cache becomes warm, so the job runs at twice the speed.
3. One cool thing about `multi.py` is that you can see more detail about what is going on with different tracing flags. Run the same simulation as above, but this time with time left tracing enabled (`-T`). This flag shows both the job that was scheduled on a CPU at each time step, as well as how much run-time that job has left after each tick has run. What do you notice about how that second column decreases?
> There is a dividing line at tick 10 marking the point where the cache becomes warm; from then on, the job's remaining run-time decreases by 2 per tick, as the trace below shows.
```shell
./multi.py -n 1 -L 1:30:200 -M 200 -T -c -t
Job name:1 run_time:30 working_set_size:200
Scheduler central queue: ['1']
0 1 [ 29]
1 1 [ 28]
2 1 [ 27]
3 1 [ 26]
4 1 [ 25]
5 1 [ 24]
6 1 [ 23]
7 1 [ 22]
8 1 [ 21]
9 1 [ 20]
----------------
10 1 [ 18]
11 1 [ 16]
12 1 [ 14]
13 1 [ 12]
14 1 [ 10]
15 1 [ 8]
16 1 [ 6]
17 1 [ 4]
18 1 [ 2]
19 1 [ 0]
Finished time 20
Per-CPU stats
CPU 0 utilization 100.00 [ warm 50.00 ]
```
4. Now add one more bit of tracing, to show the status of each CPU cache for each job, with the `-C` flag. For each job, each cache will either show a blank space (if the cache is cold for that job) or a ’w’ (if the cache is warm for that job). At what point does the cache become warm for job ’a’ in this simple example? What happens as you change the warmup time parameter (`-w`) to lower or higher values than the default?
> With the default settings the cache becomes warm for job 'a' at tick 10. The larger the warmup time (`-w`), the longer a job must stay on the same CPU before its cache becomes warm; lowering it makes the cache warm sooner.
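A sweep over `-w` makes this visible; all flags here come from the questions, and the tick-10 default matches the trace above:

```shell
# Default warmup: the 'w' in the -C column appears at tick 10.
./multi.py -n 1 -L a:30:200 -M 300 -c -t -C
# Lower warmup: the cache warms earlier and the job finishes sooner.
./multi.py -n 1 -L a:30:200 -M 300 -c -t -C -w 5
# Higher warmup: more of the run happens with a cold cache.
./multi.py -n 1 -L a:30:200 -M 300 -c -t -C -w 20
```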
5. At this point, you should have a good idea of how the simulator works for a single job running on a single CPU. But hey, isn’t this a multi-processor CPU scheduling chapter? Oh yeah! So let’s start working with multiple jobs. Specifically, let’s run the following three jobs on a two-CPU system (i.e., type `./multi.py -n 2 -L a:100:100,b:100:50,c:100:50`) Can you predict how long this will take, given a round-robin centralized scheduler? Use `-c` to see if you were right, and then dive down into details with `-t` to see a step-by-step and then `-C` to see whether caches got warmed effectively for these jobs. What do you notice?
> None of the three jobs ever runs with a warm cache. The default time quantum is `10` and the default policy is SQMS, so during ticks 0-9 CPUs 0 and 1 run jobs a and b, during ticks 10-19 they run jobs c and a, and so on: just as a CPU's cache becomes warm for a job, the job is migrated away. The total finish time is `150` time units; within every `30` time units each job is on a CPU for `20` time units, all of it with a cold cache.
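The run behind this answer; the finish time is the one stated above:

```shell
# SQMS with quantum 10: jobs are migrated away just as their
# caches warm, so all the work runs at the cold rate.
./multi.py -n 2 -L a:100:100,b:100:50,c:100:50 -c -t -C
# Finished time 150
```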
6. Now we’ll apply some explicit controls to study cache affinity, as described in the chapter. To do this, you’ll need the `-A` flag. This flag can be used to limit which CPUs the scheduler can place a particular job upon. In this case, let’s use it to place jobs ’b’ and ’c’ on CPU 1, while restricting ’a’ to CPU 0. This magic is accomplished by typing this `./multi.py -n 2 -L a:100:100,b:100:50,c:100:50 -A a:0,b:1,c:1` ; don’t forget to turn on various tracing options to see what is really happening! Can you predict how fast this version will run? Why does it do better? Will other combinations of ’a’, ’b’, and ’c’ onto the two processors run faster or slower?
> It takes `110` time units in total. CPU 0 runs only job a, whose working set exactly fits the cache, so it reaches the warm-cache state; CPU 1 round-robins between jobs b and c, whose working sets together also just fit in the cache, so once both become warm they too run at `2`x speed.
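The affinity-pinned run; again the finish time is the one stated above:

```shell
# a pinned to CPU 0; b and c pinned to CPU 1. All working sets now
# fit their CPU's cache, so every job eventually runs warm.
./multi.py -n 2 -L a:100:100,b:100:50,c:100:50 -A a:0,b:1,c:1 -c -t -C
# Finished time 110
```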
7. One interesting aspect of caching multiprocessors is the opportunity for better-than-expected speed up of jobs when using multiple CPUs (and their caches) as compared to running jobs on a single processor. Specifically, when you run on $N$ CPUs, sometimes you can speed up by more than a factor of $N$, a situation entitled super-linear speedup. To experiment with this, use the job description here (`-L a:100:100,b:100:100,c:100:100`) with a small cache (`-M 50`) to create three jobs. Run this on systems with 1, 2, and 3 CPUs (`-n 1`, `-n 2`, `-n 3`). Now, do the same, but with a larger per-CPU cache of size `100`. What do you notice about performance as the number of CPUs scales? Use `-c` to confirm your guesses, and other tracing flags to dive even deeper.
> With a cache size of `50` no job's working set fits, so total runtime = total run-time of all jobs / number of CPUs. With a cache size of `100` and `1` CPU it still takes `300` time units, because round-robin keeps destroying cache affinity (the quantum of 10 equals the warmup time, so a job is switched out just as its cache would warm). With `2` CPUs, fairness rotates the currently unscheduled job back in, which again destroys affinity, for a total of `150` time units. But with `3` CPUs the queue empties, since the job count matches the CPU count and every job can be placed at once; each job runs at `2`x speed once warm, and the whole workload takes only `55` time units (10 cold ticks plus 90/2 = 45 warm ticks), a speedup of more than 3x over one CPU, i.e., super-linear.
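The six runs behind this answer, with the finish times derived above:

```shell
# Small cache (-M 50): no working set fits, so runtime simply
# divides by the CPU count: 300, 150, 100.
./multi.py -n 1 -M 50 -L a:100:100,b:100:100,c:100:100 -c
./multi.py -n 2 -M 50 -L a:100:100,b:100:100,c:100:100 -c
./multi.py -n 3 -M 50 -L a:100:100,b:100:100,c:100:100 -c
# Larger cache (-M 100): with 3 CPUs each job stays put, warms up,
# and finishes in 10 + 90/2 = 55 ticks: super-linear speedup
# (300, 150, 55 for 1, 2, 3 CPUs).
./multi.py -n 1 -M 100 -L a:100:100,b:100:100,c:100:100 -c
./multi.py -n 2 -M 100 -L a:100:100,b:100:100,c:100:100 -c
./multi.py -n 3 -M 100 -L a:100:100,b:100:100,c:100:100 -c
```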
8. One other aspect of the simulator worth studying is the per-CPU scheduling option, the `-p` flag. Run with two CPUs again, and this three job configuration (`-L a:100:100,b:100:50,c:100:50`). How does this option do, as opposed to the hand-controlled affinity limits you put in place above? How does performance change as you alter the ’peek interval’ (`-P`) to lower or higher values? How does this per-CPU approach work as the number of CPUs scales?
> With only `2` CPUs, the finish time depends on the peek interval: one CPU is bound to finish its queue first, and the most efficient outcome is for a CPU that goes idle to immediately pull a job over from the other CPU. The runs below show the effect of three peek intervals on total finish time: a very short interval gives the shortest simulated time, because an idle CPU can migrate a job right away. Note, however, that the simulation does not yet account for the cost of the peeking itself.
```shell
./multi.py -p -n 2 -P 1 -L a:100:100,b:100:50,c:100:50 -c
# Finished time 90
./multi.py -p -n 2 -P 10 -L a:100:100,b:100:50,c:100:50 -c
# Finished time 100
./multi.py -p -n 2 -P 40 -L a:100:100,b:100:50,c:100:50 -c
# Finished time 115
```
> Efficiency is best when the number of CPUs exactly matches the number of jobs, because no job ever gets moved around and the time spent with a warm cache is maximized. If there are more CPUs than jobs, you will see jobs get stolen onto idle CPUs, which actually slows things down, because the stolen job goes back to running with a cold cache.
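To watch that last effect, give the per-CPU scheduler more CPUs than jobs and trace it (a sketch; the exact steal pattern depends on `-P`):

```shell
# 4 CPUs, 3 jobs: once a CPU goes idle it peeks and steals a job
# that was running warm elsewhere; the stolen job restarts cold.
./multi.py -p -n 4 -L a:100:100,b:100:50,c:100:50 -c -t -C
```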
9. Finally, feel free to just generate random workloads and see if you can predict their performance on different numbers of processors, cache sizes, and scheduling options. If you do this, you’ll soon be a multi-processor scheduling master, which is a pretty awesome thing to be. Good luck!