# OS 作業系統
## Overview
### Computation Modes
- **Batch** Mode: 獨立於外部世界不受干擾的模式,過程中不會和使用者及外界互動
- **Efficiency**
- Performance Metric: completion time for one process
- Resource allocation policy: 需要在process開始執行前就分配好足夠的resources
- **Online/interactive** Mode: 可以和使用者互動的模式,當系統接收到inputs通常要產生outputs給使用者,需要處理大量inputs和outputs
- **Responsiveness**
- Performance Metric: throughput
- Resource allocation policy: fair share and processes should not occupy CPU/memory for long.
- time constraint: none
- **Real-time** Mode: 和環境互動的模式
- **Predictability**
- Performance Metric: tardiness 延遲性
- Resource allocation policy: 確保每個running process的resource,以讓每個進程的狀態一致
- time constraint: 有時間限制(考量tardiness)
### user view
在使用者角度(user view),作業系統設計考量為易用性(ease of use)及一些性能(performance)和安全性(security),對資源利用(sharing resource)少有關注。
### system view
在系統角度,我們可以視它為資源分配器(resource allocator),它必須管理資源、存取、I/O設備⋯⋯等,因此OS可以說是一個控制程式(control program)。
<p align="center">
<img src="https://hackmd.io/_uploads/BJW7Q6ok0.jpg" alt="components of a computer system" width="75%" height="75%">
</p>
### von Neumann/Princeton Architecture
<p align="center">
<img src="https://hackmd.io/_uploads/HJ3zyypJ0.png" alt="Von_Neumann_Architecture" width="75%" height="75%">
</p>
- stored-program computers (instruction儲存在記憶體中)
- shared system bus
- sequential instruction processing
- reduce memory waste
### Harvard Architecture
<p align="center">
<img src="https://hackmd.io/_uploads/Hy-z-y6kC.png" alt="Harvard_Architecture" width="75%" height="75%">
</p>
- stored-program computers
- separate storage and signal pathways
- parallel instruction and data access
- faster access memory
### Types of OSs
#### Stand alone (Uni-Processor) OS
Handle resource allocation and provide virtual interface

##### Monolithic
用一個很大的kernel處理所有事情,ex: original UNIX
- system program
- kernel

<p align="center">
<img src="https://hackmd.io/_uploads/BJwwObpJR.jpg" alt="Monolithic kernel structure" width="65%" height="65%">
</p>
##### layered design
用很多層分工,第N層可以使用N-1層的services,並實現可以給N+1層的服務
- Ring/layer
- Ring 0: hardware
- Ring N: the user interface
- some processors have ring -1~-3
##### Microkernel Architecture
small kernel, user-level 有很多額外的功能,之間用IPC傳遞資料(time-costly)。
- 優點:容易extend, port OS to new architecture、更可靠、安全
- 缺點: overhead of user to kernel space

##### Loadable Kernel Modules (LKMs)
模組化,可以動態使用的kernel,更新時不需重新開機
- object-oriented approach
- core component separate
- talks to other over known interface
- loadable when needed
##### Hybrid system 混合系統
現今OS多半不只有使用單一model的kernel,而是多種model的組合,這樣稱為 Hybrid system 混合系統
#### Distributed system
多個電腦或processors透過網路連接,增加運算容量的情況
- network OS (NOS)
- distributed OS (DOS)
- middleware

### Computer-System organization
- kernel: 操作系统的最底層核心部分,負責管理計算機硬體資源,提供系統的基本服務和功能,包括進程管理、內存管理、文件系統管理和設備驅動程序
- systems programs: 管理計算機硬體資源、配置系統設置、監控系統性能、以及排除系統錯誤
- middleware: software framework,連接和整合不同軟體應用程式和服務的軟體層,或是為應用程式開發者提供的額外服務
- appliance programs
<p align="center">
<img src="https://hackmd.io/_uploads/B1r6u6skA.gif" alt="structure of the computer" width="75%" height="75%">
</p>
- bus: 元件和共享記憶體的通道
- device driver: 控制和管理硬體設備與作業系統之間通訊的軟體,使得作業系統能夠正確識別、操作和利用各種硬體設備
<p align="center">
<img src="https://hackmd.io/_uploads/HJzu06jyC.jpg" alt="A typical PC computer system" width="75%" height="75%">
</p>
#### interrupts
向CPU發出的停止訊號,使系統可以隨時停下來處理緊急事件或外部設備的請求,另外當I/O操作程序結束時也會發出interrupt訊號。
- trap/exception: software-generated interrupt
- interrupt vector: 儲存中斷處理程序的地址,以快速將控制權轉移到該處理程序的入口點
- interrupt request line,IRQ: 通知處理器發生了中斷事件
- nonmaskable: for unrecoverable memory errors...
- maskable: 在CPU處理不得中斷的指令之前,CPU可以將其關閉,ex: request by the device controllers.
- interrupt handler routine: 也可以叫做interrupt service routine,ISR。當中斷事件發生時,中斷處理程序被執行以處理該事件
<p align="center">
<img src="https://hackmd.io/_uploads/B13uQAjy0.jpg" alt="Interrupt" width="85%" height="85%">
</p>
<p align="center">
<img src="https://hackmd.io/_uploads/rJh6dk3k0.jpg" alt="Interrupt-driven I/O cycle" width="85%" height="85%">
</p>
- interrupt chaining: 將interrupt分成多種類別,此時的interrupt vector會指向首地址,再透過遍歷的方式去找目標對象。另外,此方法也同時實現了**interrupt priority levels**
<p align="center">
<img src="https://hackmd.io/_uploads/BJQ3nknk0.jpg" alt="Interrupt vector table" width="85%" height="85%">
</p>
#### storage structure
- Primary storage(main memory): CPU access directly
- SRAM
- DRAM

## OS-Structure
### Relocation
在編譯到執行之前,我們使用的程式通常都不是 absolute address,而 relocation 就是分配給 program 中數據對應的 address 以確保使用正確的地址

- compile time: generate **absolute code** (如果事前知道memory location)
- load time: generate **relocatable code** (如果是動態分配的memory)
- execution/run time: **binding delayed** ,到這階段 (execution/run time) 才開始做 (如果process可能會被移動到其他memory segment)
#### Linker
將source code 編譯出來的 relocatable object files 連接 (combine) 起來變成 single binary executable file,過程中可能會包含其他 object files 和 libraries
#### Loader
將 binary executable file 載入到 memory,使得CPU可以執行它,同時也將 dynamically linked libraries 綁定到 file 上,使程式可以使用其定義的函數和資源
## Process
在 program 執行時會產生 process,它也可以說是 **program in execution**,我們必須確保 process 滿足三個特性:
1. multiplexing/time-sharing
- 可佔據 (occupy) 或讓出 (yield) 資源
2. isolation
- 失效的 process 不會影響到其他
- 資料必須有 permission 才能存取
3. interaction
<br>
以下為 process state:

<br>
以下是 memory layout of a C program:

<br>
我們會使用PCB (process control block) 來監控 process 狀態
### client-server communication
透過 computer networks (wired or wireless) 去連接 client computers,它通常是有限計算力的 computers 會發出 requests ( from user/application )給 server computers 請它幫忙運算
#### Sockets
從網路的角度來看,socket就是通訊連結的端點;從程式設計者的角度來看,socket提供了一個良好的介面,使程式設計者不需知道下層網路協定運作的細節便可以撰寫網路通訊程式。在TCP/IP架構下,sockets可分為下面兩類:
1. **Datagram sockets(connectionless)**
資料在datagram sockets間是利用**UDP**封包傳送,因此接收端socket可能會收到次序錯誤的資料,且其中部分資料亦可能會遺失。
- 適合講究即時性的,ex:視訊
2. **Stream sockets(connection-oriented)**
資料在stream sockets間是利用**TCP**封包來傳送,因此接收端socket可以收到順序無誤、無重覆、正確的資料。此外TCP傳送時是採資料流的方式,因在傳送時會所有資料會視情況被分割在數個TCP封包中。
- 適合講究穩定、正確性的,ex:email傳輸
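下面用 Python 的 socket API 做一個極簡的 stream socket(TCP)echo 範例(`tcp_echo_demo` 為示意用的假設函式名;datagram socket 只需把 `SOCK_STREAM` 換成 `SOCK_DGRAM`,但就沒有順序與送達保證):

```python
import socket
import threading

def tcp_echo_demo(message: bytes) -> bytes:
    """以 loopback 示範 stream socket(TCP)connection-oriented 的傳輸。"""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))          # port 0:讓 OS 自行挑選空閒 port
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()          # blocking,直到 client 連線
        with conn:
            conn.sendall(conn.recv(1024))  # echo:把收到的資料原樣送回
        server.close()

    t = threading.Thread(target=serve)
    t.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))    # 建立連線後才開始傳資料
    client.sendall(message)
    reply = client.recv(1024)
    client.close()
    t.join()
    return reply
```

TCP 保證 reply 與送出的資料一致且順序無誤;若換成 UDP,封包可能遺失或亂序,就需要 application 自己處理。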
#### remote procedure calls (RPCs)
RPC(Remote Procedure Call)是一種讓程式能夠請求網路另一台主機上的程序執行而不需要了解底層網路技術細節的技術。簡單來說,RPC使得開發分布式系統應用程序變得更加簡單,因為它允許開發人員進行函式或過程調用,就像調用本地系統上的函式一樣,而不用擔心這些函式的實際執行是在遠端系統上。步驟如下:
1. client 啟動RPC (Client Stub)
2. send requests to server stub:RPC框架會在客戶端將函式調用封裝成一個網路請求,然後通過網絡發送給遠端服務器。這個請求包括了要執行的遠程函式的名稱和所需的參數。
3. server handle it:server收到請求後,RPC框架會解析請求,調用本地的函式實現,並將函式的參數傳遞給它。
4. return to client:一旦遠程函式執行完成,其結果會被封裝成一個響應消息,送回給client。
5. client receive result:RPC框架會解析result,提取出函式的返回值,最後將其提供給client。
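上述步驟可以用 Python 標準庫的 XML-RPC 粗略示意(`rpc_demo` 與 `add` 皆為示意用的假設名稱):client 端的 `proxy.add(2, 3)` 看起來就像本地函式呼叫,實際上被 client stub 封裝成網路請求送到 server stub。

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def rpc_demo() -> int:
    # server stub:把本地函式註冊為可遠端呼叫的 procedure "add"
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    port = server.server_address[1]
    server.register_function(lambda a, b: a + b, "add")

    t = threading.Thread(target=server.handle_request)  # 只處理一個 request
    t.start()

    # client stub:proxy.add(...) 被封裝成 HTTP/XML 請求送往 server
    proxy = ServerProxy(f"http://127.0.0.1:{port}")
    result = proxy.add(2, 3)     # 步驟 1-5:封裝、傳送、執行、回傳、解析
    t.join()
    server.server_close()
    return result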
## Main Memory
### Binding
1. 原始碼(Source Code):這是過程的開始。程序員用高階語言編寫源代碼並將其保存在文件中(通常使用如.c為C或.cpp為C++的擴展名)。
2. 預處理器(Preprocessor):在實際編譯開始之前,預處理器會處理源代碼和指令(如#include, #define等)。這些指令經常包括復制包含文件的內容、宏展開和條件編譯。結果是一個預處理過的源文件,通常不會在文件系統中顯示為單獨的文件。
3. 編譯器(Compiler):預處理過的源代碼然後傳遞給編譯器。編譯器的工作是將高階語言轉換為組合語言,這種語言更接近機器碼,但人類仍然可讀。這保存為組合語言代碼文件(擴展名如.s或.asm)。
4. 組譯器(Assembler):組合代碼隨後傳遞給組譯器。組譯器將組合語言轉換為目標代碼(機器代碼,但還不可執行,因為它可能不包含所有必要的信息)。這導致產生一個目標文件(帶有.o或.obj擴展名)。
5. 鏈接器(Linker):目標文件經常包含對還未鏈接的函數和庫的引用。鏈接器接受這些目標文件,並將它們與必要的庫文件鏈接(system library, 靜態鏈接),或包含信息,以便在運行時鏈接庫代碼(動態鏈接)。結果是一個可執行文件(如Windows上的.exe或Unix/Linux上無擴展名)。
6. 載入器(Loader):當可執行文件運行時,載入器將程序載入內存,包括必要時的動態鏈接庫(Dynamic Link Library)。
7. 執行時間(Execution/Run Time):最終,處理器執行內存中的二進位文件(載入的可執行文件 Loaded Executable or In-Memory Executable)。


## Virtual Memory
### Page Replacement Algos
#### Optimal
Replace the page that **will not** be used for the longest period of time.
- look forward in time
- require future knowledge (不太可能實現)
- used by comparison study
#### FIFO
a queue (first in first out)
- face Belady's Anomaly
#### LRU
Replace the page that **has not** been used for the longest period of time.
- looking backward in time
##### counter implement
每個PTE都有counter,每次被使用就紀錄時間,要replace時選擇最久遠的,需要遍歷整個table
##### stack implement
用double-linked list,每次使用PTE時把它改到head,每次6個ptrs要改,但不用search
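stack 版本可以用 `OrderedDict` 做一個簡單的 sketch(`lru_faults` 為假設的名稱),`move_to_end` 對應「把節點移到 head」的動作:

```python
from collections import OrderedDict

def lru_faults(refs, frames):
    """LRU:以 OrderedDict 模擬 stack,回傳 page fault 次數。"""
    mem = OrderedDict()
    faults = 0
    for page in refs:
        if page in mem:
            mem.move_to_end(page)        # 最近用過:移到最新端,不需 search 整個 table
        else:
            faults += 1
            if len(mem) == frames:
                mem.popitem(last=False)  # 淘汰最久未使用(最舊端)的 page
            mem[page] = True
    return faults
```

以教科書常見的 reference string 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 與 3 個 frames 計算,LRU 共 12 次 page faults。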
#### !Belady's Anomaly!
page-fault rate 可能會因為 allocated frames 增加而增加,使用較多的memory卻沒有帶來更好的效率。
- page離開的時間獨立於allocated frame num就不會發生,ex: LRU, LFU (frequency), Optimal
- 會發生問題的 ex:FIFO
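可以用一個小實驗驗證(函式名為自行假設):以經典的 reference string 1,2,3,4,1,2,5,1,2,3,4,5 跑 FIFO,3 個 frames 是 9 次 page faults,增加到 4 個 frames 反而變成 10 次:

```python
from collections import deque

def fifo_faults(refs, frames):
    """FIFO page replacement:回傳 page fault 次數。"""
    mem, queue, faults = set(), deque(), 0
    for page in refs:
        if page not in mem:
            faults += 1
            if len(mem) == frames:
                mem.remove(queue.popleft())  # 淘汰最早進入 memory 的 page
            mem.add(page)
            queue.append(page)
    return faults
```
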
#### LRU-Approximating Page Replacement Algos
##### Additional-reference-bits algo
固定間隔時間紀錄reference bits,其實是在一段時間內我們會去更改最低階的bit (reference bit),時間到了之後統一update,把該bit放到最高位,同時將剩餘的bits往右shift
##### Second-chance algo
用FIFO的作法加上reference bit,當被referenced,reference bit on,獲得第二次機會,當要swap out時遇到一個被標上R的PTE,我們把它的R拆掉,接著去找下一個順位的人直到找到可swap out的對象
- circular queue (clock algo)
- worse case: 全部都有R->退化回一般的FIFO
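clock algo 的一個極簡 sketch(假設:新載入的 page 其 reference bit 設為 1;`second_chance_faults` 為示意名稱):

```python
def second_chance_faults(refs, frames):
    """Second-chance (clock):circular queue + reference bit。"""
    pages = [None] * frames   # circular buffer 中的 pages
    ref = [0] * frames        # 對應的 reference bits
    hand, faults = 0, 0
    for page in refs:
        if page in pages:
            ref[pages.index(page)] = 1    # 被 reference:bit 設為 1
            continue
        faults += 1
        while ref[hand] == 1:             # 有 R:拆掉它,給第二次機會
            ref[hand] = 0
            hand = (hand + 1) % frames
        pages[hand] = page                # 找到 R=0 的受害者,換入新 page
        ref[hand] = 1                     # 假設:載入即視為被 reference
        hand = (hand + 1) % frames
    return faults
```
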
##### Enhanced Second-chance algo
同時考慮reference bit及modify bit,情景有以下四種:
- (0, 0) 最適人選
- (0, 1) 需要額外swap out
- (1, 0) 可能最近還會再使用到
- (1, 1) 可能還會使用到+需要swap out
#### Counting-based Page Replacement Algos
將引用次數計數,可能會很貴、不接近最理想的方法
##### LFU algos
最少使用的被replaced,問題:
- 最常使用的會不會只是一開始的階段需要一直用?
->right shift the counter at regular intervals
##### MFU algos
最常使用的被replaced
#### Page-Buffering Algos
維持一個 pool of free frames,再加上一個modified pages list 以延緩寫回secondary storage的時間,如果引發 page fault,先檢查有沒有在 free-frame pool 裡面(pool 必需要額外紀錄原先的page number),沒有的話再去拿一個新的
#### Application and Page Replacement
有一些 application 可以自己做I/O操作,因此如果這樣的 applications 再結合OS操作可能會是兩倍的消耗
-> 提供 **raw disk** 讓 application 可以繞過OS系統直接進行
### Allocation of frames
#### local V.S. global
local指的是每個process只能用自己被分配到的frames,global 則可以從整個系統的 frames 去使用需要的
| 項目 | local | global |
| ---- | ---------------------------------- | ------------------------------------------- |
| 優點 | 行為較為一致、processes 間較為獨立 | 較高的 memory 使用率、greater throughput |
| 缺點 | 較低的 memory 使用率 | 執行時間變化大、易受其他 processes 行為影響 |
- global replacement:
- 透過 free-frame list 的維護進行
- 在 free memory below **Min threshold** 時進行 page replacement(回收 reclaiming)
- 如果 exceed **Max threshold** 時,停止 reclaiming(回收 pages)

#### Fixed Allocation
##### Equal allocation
留下部分的 frames 到 free frames buffer pool,剩下平分給所有processes
##### Proportional allocation
按照 size of process 等比例去分配 frames
$a_i = \frac{s_i}{\sum_j{s_j}} \times m$,其中 $s_i$ 為 process $i$ 的 size,$m$ 為可分配的 total frames 數
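一個簡單的計算 sketch(假設分數部分直接下取整;數字取自常見教科書例子 $s_1=10, s_2=127, m=62$):

```python
def proportional_allocation(sizes, total_frames):
    """按 process size 等比例分配 frames(整數下取整)。"""
    total = sum(sizes)
    return [s * total_frames // total for s in sizes]
```

下取整後剩下未分配的 frames 可以放回 free-frame buffer pool。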
#### Priority Allocation
### Out-Of-Memory Killer
**Reaper** 又稱 **OOM Killer**,當系統嚴重缺乏memory時,kernel用來 terminate processes 的手段,參考```badness()```的規則如下:
1. kernel 先給自己最少量的 memory
2. 開始回收大量 memory,使用量少的 processes 跳過
3. 盡量 kill 最少量
4. 可能可以用 priority 去決定要先 kill 哪個(像 user 已經可能要終止的 process)
- oom_score: 越大 memory 使用量,得到分數越高越有可能被 killed
- oom_score_adj: 額外使用者可操作的變數,如果不想要特定 process 被 killed 可以增加一個負數在這個參數上面以減少被 killed 的機率
### Non-Uniform Memory Access (NUMA)
每一區塊的 memory access 其實不太一樣,他們實際所在的位置會影響存取時間
- 為NUMA nodes 設定特別獨立的 free frame list,僅從裡面抓 frame 來使用(Linux的作法)
- 創建 ```lgroups```(locality group),groups 與 groups 之間用 latency 決定階層,從較低的階層開始使用(Solaris的作法)

### Thrashing
當 process 沒有足夠的 page,page fault 次數增加且 CPU utilization 很低,此時系統會為了增加使用率又接受更多 process 的 request,最後使用率就變成爆炸低

- occurs reason: $\sum{locality\:size} > total\:memory\:size$
**Locality** 是在一個process或者是thread執行時,在段某時間內集中使用的 set of pages
- Avoid:
1. working set: allocate frames appropriately
2. page fault frequency: reduce the degree of multiprogramming when the **page fault rate** is too high
#### Working-set Strategy
系統監視 process 的 locality,並且據此分配 frames;如果有足夠的 frames 便開始執行新的 process,當 sum of locality 超過 total memory 時,開始 swap out pages,process 會被 suspended
我們將 the locality of each process 稱為 **working set** of a process
<!--這個是final範圍-->
## CPU Scheduling
### Process Scheduling and CPU Scheduling
- **Process Scheduling**: selects which processes to put in ready queue
- more infrequently
- therefore controls the degree of multiprogramming
- **CPU Scheduling**: schedules the execution order of processes in ready queue
- more frequently
### CPU burst and I/O burst
- An I/O bound program usually has many short CPU bursts
- A CPU bound program can have a few long CPU bursts
### CPU Scheduler
- CPU scheduling decisions may take place when a process:
1. running $\to$ waiting
2. running $\to$ ready
3. waiting $\to$ ready
4. terminates
- If only 1. and 4. can occur (i. e. no involuntary context switches), the scheduling is **nonpreemptive**. Otherwise, it is **preemptive**.
- Only preemptive scheduling can suffer from race conditions.

### Preemptive vs. Non-preemptive
- preemptive
- involuntarily release
- 效率較好, 代價:
1. **monitor** the elapsed time of each process
2. sort processes on every **context switch**
3. not all processes can be pre-empted (thread-safe才行)
- Non-preemptive
- voluntarily release (conditions: 1, 4)
### Dispatcher
- **Dispatch latency**: time it takes to stop one process and start another process
### Different mode's standard
- **batch mode**
- MAX CPU使用量
- MAX throughput (單位時間內處理的工作量)
- MIN turnaround time (單一 process 從提交到完成的總時間)
- **online mode**
- MIN waiting time
- MIN response time
- **real-time mode**
### Scheduling Algorithms
- Assumptions:
- **Single** processor
- **Burst times** are known in advance (burst time: total CPU time needed)
- Ready processes are queued
- **prediction of length**
- estimation: **exponential averaging**,看過去歷史
- $\tau_{n+1} = \alpha \cdot t_n + (1 - \alpha) \cdot \tau_{n}$
  $= \alpha \cdot t_n + (1-\alpha) \cdot \alpha \cdot t_{n-1} + \dots + (1-\alpha)^{j} \cdot \alpha \cdot t_{n-j} + \dots + (1-\alpha)^{n+1} \cdot \tau_{0}$
- $\tau_{n}$: 對 n-th CPU burst 的預測值
- $t_n$: n-th CPU burst 的實際長度
- $\alpha$: weight of exponential averaging,$\alpha \in [0, 1]$;$\alpha = 0$ 完全不參考 history,$\alpha = 1$ 只參考上一次的實際長度
- 由 $\lim_{n \to \infty} (1+\frac{x}{n})^n = e^x$,當 $\alpha = \frac{2}{N+1}$ 時 $(1-\alpha)^{N+1} \approx e^{-2} \approx 13.5\%$,因此 $1-(1-\alpha)^{N+1} \approx 86\%$:最近的 $N+1$ 項約佔預測權重的 86%
- exponential averaging 數據圖,越stable越接近:
<figure>
<img src='https://hackmd.io/_uploads/SkQeLvBNA.jpg' alt='prediction of length'>
</figure>
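上面的遞迴式可以直接逐步展開計算(`predict_bursts` 為示意用的假設名稱):

```python
def predict_bursts(t, tau0, alpha):
    """exponential averaging:tau_{n+1} = alpha*t_n + (1-alpha)*tau_n。"""
    taus = [tau0]                 # taus[n] 即對第 n 次 burst 的預測
    for tn in t:
        taus.append(alpha * tn + (1 - alpha) * taus[-1])
    return taus
```

例如 $\alpha = 0.5$、$\tau_0 = 10$、實際 bursts 為 6, 4, 6, 4 時,預測值依序為 10, 8, 6, 6, 5。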
- **First-Come, First-Served (FCFS)**
- execution order = arrive order
- problem: **convoy effect** - all other processes wait for one big process to get off the CPU
- **Shortest-Job-First (SJF)**
- optimal for minimizing average waiting time
- problem: we don't know the length of CPU bursts →
- **Shortest-Remaining-Time-First**
- preemptive version of SJF
- re-schedule whenever a job arrives or finishes
- **Round-Robin (RR)**
- Each process gets a small unit of CPU time called **time quantum** $q$
- after the time has elapsed, the process is preempted
- so $q$ should be significantly larger than context switch time
- ensures that each process gets $\frac{1}{n}$ of CPU time and that no process waits for more than $(n - 1)q$
- higher turnaround time than SJF, but better *responsiveness*
- for lower turnaround time, $q$ should be longer than 80% of CPU bursts
- **Priority Scheduling**
- a priority number is assigned to each process, high priority processes run first
- problem: **starvation** - low priority process may never execute
- solution: **aging** - as time progresses, increase the priority of processes
- can be combined with round-robin: processes with same priority are scheduled by round-robin
- **Multilevel Queue Scheduling**
- has separate queues of different priorities
- should have methods to determine when to upgrade or demote a process, and which queue a process will enter first,每一層的演算法不盡相同
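假設所有 processes 同時抵達、非 preemptive,可以用下面的 sketch 比較 FCFS 與 SJF 的平均 waiting time(bursts 24, 3, 3 是教科書常見例子,`avg_waiting` 為假設名稱):

```python
def avg_waiting(bursts):
    """依給定執行順序計算平均 waiting time(同時抵達、非 preemptive)。"""
    waiting, elapsed = 0, 0
    for b in bursts:
        waiting += elapsed    # 此 process 等了前面所有 bursts 的總和
        elapsed += b
    return waiting / len(bursts)
```

FCFS 依抵達順序 [24, 3, 3] 得 17;SJF 排成 [3, 3, 24] 後只剩 3,正是 convoy effect 的對比。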
### Thread Scheduling
#### Contention Scope
- **Process contention scope (PCS)**
- scheduling competition is within the process
- priority
- **System contention scope (SCS)**
- scheduling competition is among all threads in the system
- one to one model using only SCS
#### Symmetric Multiprocessing
- Each processor is identical and self-scheduling
- 2 possible strategies:
- Each processor has its own private queue
- problem: workload balancing
- All threads are in a common ready queue
- problem: race condition
#### Multithreaded Multicore System
- First level: chooses which *software thread* to run on each *hardware thread*
- Second level: chooses which *hardware thread* to run on each *core*
#### Hardware Thread Granularity
- **Coarse-grained multithreading**: if one thread has a memory stall, switch to another thread,等到有 long latency event時才會switch,switch 的花費非常昂貴
- **Fine-grained multithreading**: multithreading at hardware level,在指令循環(instruction cycle邊界)進行context switch,一次不直接載入所有instructions
#### Load Balancing
- **Push migration**: overloaded CPUs pushes tasks to other CPUs,太高的給低的 (更新queue,overhead高)
- **Pull migration**: idle CPUs pull tasks from other CPUs,低的去找高的工作做 (sharing)
#### Processor Affinity
- Load balancing may affect processor affinity: a thread loses the contents in the cache of a processor when moved.
- **Soft affinity**: OS attempts to keep a thread running on the same processor (no guarantee)
- **Hard affinity**: allows a process to specify a set of processors it may run on
- **NUMA**:會希望較近的CPU跟process綁在一起
#### Proportion Share Scheduling
- Linux 的 CFS與他類似
- a. k. a. **fair-share scheduling**
- Note that *"fair" does not mean "each job gets the same amount of time"*
- Tries to guarantee that each job obtain a *certain percentage* of CPU time
#### Lottery Scheduling
與上方演算法類似,但是是用機率的概念去抽誰要下個執行,shares越多人的被抽到的機率越高
<!--下面都是real-time-->
### Real-Time CPU Scheduling
- **Hard real-time systems**: the system fails if deadline is missed.
- **Soft real-time systems**: the system performance is degraded if deadline is missed.
#### Event Latency
- **Event latency**: from when an event occurs to when it is served
- **Interrupt latency**: from arrival of interrupt to interrupt service routine
- **Dispatch latency**: time for taking process off CPU and switch to another
- $\text{event latency} = \text{interrupt latency} + \text{dispatch latency}$


- 上圖的conflicts包含:
- 清空和移除process時間
- 釋放資源時間(做完 or 之後重做)
#### Priority-Based Scheduling
- real time supports
- preemption
- **Process/ Workload model**:
- Processing time (worst case) = t
- Deadline = d
- Period = p
- $0$ ≤ $t$ ≤ $d$ ≤ $p$
- rate of periodic task is $1/p$

**!注意!** $\text{absolute deadline} = \text{relative deadline} + \text{arrival time}$
- **Rate-monotonic scheduling**: based on $1/p$, shorter period $\to$ higher priority
- **Earliest-deadline-first (EDF) scheduling**: earlier *absolute* deadline $\to$ higher priority
- dynamic priority
- critical instant:系統負載最大的時刻,發生在 task 與所有較高 priority 的 tasks 同時 release 時
**!注意!** CPU利用率 $U = \Sigma_{i=1}^{n}(\frac{t_i}{p_i})$:這是理想化的(不含 context switch 成本);EDF 只要 $U \le 1$ 即可排程
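利用率測試可以直接寫成檢查函式(函式名為假設;EDF 在 deadline = period 時 $U \le 1$ 是充要條件,RM 則用 Liu & Layland 的充分界限 $U \le n(2^{1/n}-1)$,超過界限不代表一定不可排程):

```python
def utilization(tasks):
    """tasks: list of (t_i, p_i),回傳 U = sum(t_i / p_i)。"""
    return sum(t / p for t, p in tasks)

def edf_schedulable(tasks):
    return utilization(tasks) <= 1                        # EDF:U <= 1 即可

def rm_schedulable(tasks):
    n = len(tasks)
    return utilization(tasks) <= n * (2 ** (1 / n) - 1)   # RM:保守的充分界限
```

例如 tasks = [(25, 50), (35, 80)]:$U = 0.9375$,EDF 可排程;但超過 RM 界限 $2(\sqrt{2}-1) \approx 0.828$,RM 的測試不通過。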
### Linux Scheduling
constant time $O(1)$,但 interactive 的效能不好
#### Virtual Runtime
不是真的時間,每個process有自己的vruntime
- $\text{Virtual Clock Rate} = \frac{\text{Virtual Clock Time}}{\text{Physical Clock Time}}$
- When the virtual clock time are the same, the higher the clock rate, the faster the virtual budget is consumed. $\rightarrow$ **lower rate for longer physical time**
- The OS adjust this rate to change the allocated physical time to a task.
#### Linux Completely Fair Scheduler (CFS)
- Algorithm:
1. Choose the task with lowest **virtual runtime**
2. Compute the **dynamic time slice** for current task
3. Set the timer for the computed time slice
4. Execute the current task
- Time slice computation:
- `sched_latency` = constant (e. g. 48ms)
- $\text{Per-Process Virtual Timeslice} = \frac{\text{sched_latency}}{\text{Processes}}$
- But time slice cannot be lower than `min_granularity`
- $\text{timeslice}_i = \frac{\text{Weight}_i}{\sum \text{Weight}} \times \text{sched_latency}$
- Updating virtual runtime (這是一個持續累積的 clock): $\text{VRuntime}_i := \text{VRuntime}_i + \frac{\text{Weight}_0}{\text{Weight}_i} \times \text{Runtime}_i$,其中 weight 取決於 **nice value**,$\frac{\text{Weight}_0}{\text{Weight}_i}$ 即 **virtual clock rate**,$\text{Runtime}_i$ 為實際執行的 physical time 且 $\leq \text{timeslice}_i$
- $\text{virtual clock} = (\text{delta_exec} \cdot \text{weight}) / \text{lw.weight}$, for $\text{lw.weight}$ is runqueue weight
**!注意!** **nice value** 決定了priority,若nice value越低,priority越高,通常如果 $\text{nice value} < 0$ 則 $\text{rate} < 1$
- **Scheduling classes**
- Each class is assigned a specific priority
- The scheduler selects the highest priority task in the highest priority class, task被分類到不同priority的class中
#### Real-time scheduling
- Real-time tasks's **priorities** 0 to 99
- Normal tasks's 100 to 139
- nice value -20 maps to priority 100
- nice value 19 maps to priority 139
- smaller integer means higher priority
#### scheduling domain
每個domain由許多core組成,通常根據共享的內容來分組(ex. cache memory),就可以彼此負載,是用到NUMA的概念
### Windows Scheduling
priority-based and preemptive
#### thread execution and priority
* 0: memory-management thread
* 1~15: the variable class
* 16~31: the real-time class
* 每個priority都有一個 queue
* no ready thread: run **idle thread** for system maintainence

正常來說,即便被降級了priority也不會比NORMAL(base)低,可能需要手動更改
### Solaris Scheduling
- preemptive
- 數字高代表priority越高
- 分成 6 classes, default: time sharing

### Algorithm Evaluation
#### Deterministic Evaluation
- 計算最小 average **waiting** time
- 使用事先決定好的workload直接計算
- simple and fast
#### Queueing model
- 用機率性質來表達抵達時間以及CPU, I/O burst
- 計算 average throughput, utilization, waiting time , etc
##### **!!! Little's Formula !!!**
- $n = \lambda W$
- $n =$ average queue length
- $\lambda =$ average arrival rate
- $W =$ average waiting time
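公式可以直接代入:已知其中兩個量即可解出第三個(`littles_law` 為示意用的假設名稱):

```python
def littles_law(arrival_rate=None, wait=None, queue_len=None):
    """n = lambda * W:給定其中兩個量,解出缺的那一個。"""
    if queue_len is None:
        return arrival_rate * wait       # n = lambda * W
    if wait is None:
        return queue_len / arrival_rate  # W = n / lambda
    return queue_len / wait              # lambda = n / W
```

例如平均每秒抵達 7 個 processes、平均等待 2 秒,則 queue 中平均有 14 個 processes。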
#### Simulations
- 直接進行模擬,模擬一個電腦系統環境
- 正確率更高
- 使用trace tape,參考過去紀錄應用在模擬上
#### Implementation
直接嘗試實作一個系統,但是有環境限制,且較高風險
## Mass-Storage Systems

### Nonvolatile memory (NVM) devices
recorded on transistor 晶體管(半導體), ex: solid-state disks (SSD), USB drives, NVMe(express, fast, connected directly to PCI bus)
- [+] more reliable 耐震
- [+] faster (no seek time / rotation latency)
- [-] more expensive per MB
- [-] 需要先 erase 才能重寫 (no in-place update)
- [-] shorter life span (writes per day DWPD)
- [-] less capacity
#### NAND Flash Controller Algos

- maintain flash translation layer (**FTL**) table:轉換logical addr to physical addr
- **garbage collection**: 要刪的東西先不要erase,等garbage collection時,將有東西的資料移到別的pages再erase,erase都是用**block**作為單位
- **over-provisioning**: 額外保留一部分的容量,如果某些page物理上的壞掉了就可以替代
- **wear leveling**: 確保每塊晶片磨損程度相近
#### NVM scheduling
沒有移動讀寫頭的latency,可以使用比較簡單的演算法
- FCFS: **merge adjacent requests** 這樣可以讓FCFS更快
- write 可能會蠻耗時的,因為需要先erase
### Hard disk drive (HDDs)
recorded on magnetic platters 金屬盤,讀寫頭同進同出
- 常用單位 **RPM** 一分鐘幾圈
- **Transfer rate**: the rate at which data flow between drive and computer
- **Seek time**: time to move disk arm to desired cylinder
- **Rotational latency**: time for desired sector to rotate under the disk head (depends on spindle speed)
- $\text{Worst Case Latency} = \frac{1}{\text{RPS}} = \frac{60}{\text{RPM}}$
- $\text{Average Latency} = \frac{1}{2} \text{Worst Case Latency}$
- **Positioning time (random-access time)**: $\text{Seek Time} + \text{Rotational Latency}$
- **Access Latency (Average access time)**: $\text{average seek time} + \text{average rotational latency}$
- $\textbf{Average I/O Time} = \text{Average Positioning Time} + \frac{\text{Amount to Transfer}}{\text{Transfer Rate}} + \text{Controller Overhead}$
- **head crash**: 讀寫頭碰到金屬盤產生刮痕

- addressed as large 1-dimensional arrays of **logical blocks**
- sector 0 是最外圈的第一個sector
#### HDD Scheduling
- **Bandwidth**: 用來顯示傳輸量,$= \frac{\text{total number of bytes}}{\text{completion of last transfer time} - \text{first request time}}$
- **Shortest Seek Time First (SSTF)**: greedy, always find the nearest request,可能要處理**starvation**的問題
- **FCFS**: read in the exact order of requests
- **SCAN** (a. k. a. **elevator**): from one end of the disk to the other, continue in reverse direction,兩端來回跑
- **C-SCAN**: from one end of the disk to the other, continue in same direction,如果是剛加入的request可能就要等很久
- Linux **Deadline Scheduler**:
- 將read, write reqests分開(read 的 request通常較急迫,因為process基本上一定會被block住)
- read, write都有多種queue,使用不同的演算法
- 一開始都是用同樣的處理,等待時間太久的request可以被升級
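以常見教科書的 request queue(head 起點在 cylinder 53)示範 SSTF 的總移動量計算(函式名為自行假設):

```python
def sstf_total_movement(start, requests):
    """SSTF:greedy 地每次選離 head 最近的 cylinder,回傳總移動量。"""
    pos, pending, total = start, list(requests), 0
    while pending:
        nxt = min(pending, key=lambda c: abs(c - pos))  # 最近的 request
        total += abs(nxt - pos)
        pos = nxt
        pending.remove(nxt)
    return total
```

對 queue 98, 183, 37, 122, 14, 124, 65, 67(head = 53),SSTF 的總移動量為 236 cylinders;greedy 的代價就是遠處的 request 可能 starvation。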
### Optical disk 光碟
用光來讀寫
### Magnetic tape 磁帶
用磁性概念來讀寫
### Disk attachment
外部硬體和電腦系統溝通

### Error Detection and Correction
- **Error detection**: can detect errors
- **Error-correction code (ECC)**: can detect errors *and correct some of them*; soft errors are recoverable, hard errors are not (but can be detected)
#### Error Detection
- **Parity Bits**
checksum 的一種:傳送端對資料計算 parity 並附上,接收端重新計算一遍並比對,不一致代表有錯誤
- Add a parity bit to the data to make the number of 1s even or odd (可以自己先預設好要哪一種)
- Can only detect one error;要偵測更多錯誤需要更多的 parity bits
- **Cyclic redundancy check (CRC)**
用hash function去運算,像是多項式除法 (generator polynomial)
- *Operations are performed in $Z_2$*
- $M(x) \cdot x^n = Q(x) \cdot K(x) - R(x)$
- $M$: **polynomial term**
- $K$: **generator**
- The sender sends $M(x) \cdot x^n + R(x)$ to the receiver, binary string要加$k-1$個 $0$,$k$是指$K(x)$的長度,並用**modulo-2 binary division**
- The receiver computes $\frac{M(x) \cdot x^n + R(x)}{K(x)}$ and check if remainder餘式 is $\textbf{0}$
**!!注意!!** **modulo-2 binary division**:長除法過程中的減法改成 **XOR**
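modulo-2 binary division 可以直接實作成逐位 XOR 的長除法(函式名為假設;例子為常見的 message `11010011101100`、generator `1011`,餘式為 `100`):

```python
def crc_remainder(message: str, generator: str) -> str:
    """modulo-2 binary division:長除法的減法改成 XOR,回傳餘式。"""
    pad = len(generator) - 1                    # 先補 k-1 個 0(k 為 generator 長度)
    bits = list(message + "0" * pad)
    for i in range(len(message)):
        if bits[i] == "1":                      # 當前最高位為 1 才需要「除」
            for j, g in enumerate(generator):
                bits[i + j] = str(int(bits[i + j]) ^ int(g))
    return "".join(bits[-pad:])                 # 最後 k-1 位即餘式 R(x)
```

傳送端送出 message 加上這個餘式;接收端對整串再除一次 generator,餘式為 0 即視為無錯誤。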
#### Error-Correction Code: Hamming Code
- $2^p \ge n + p + 1$, $p$: parity bits, $n$: data bits
- $i$-th parity bit $P_i$ is placed at position $2^{i-1}$ and is the even/odd parity over all positions whose $i$-th location bit (in binary) is 1.
- For single-bit error: $P_i$ is wrong iff the $i$-th location bit of the error is 1.
- **SECDED (Single-bit Error Correction and Double-bit Error Detection)**: by adding an overall parity bit, we can both correct single-bit errors and detect double-bit errors.
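由 $2^p \ge n + p + 1$ 求最小 parity bits 數可以寫成一個小函式(名稱為自行假設):

```python
def parity_bits_needed(n: int) -> int:
    """回傳保護 n 個 data bits 所需的最小 parity bits 數 p。"""
    p = 0
    while 2 ** p < n + p + 1:   # 2^p 要能區分 n+p 個位置 + 「無錯誤」
        p += 1
    return p
```

例如 4 個 data bits 需要 3 個 parity bits(即 Hamming(7,4)),8 個 data bits 需要 4 個。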
### Drive
- **Low-level formatting (physical formatting)**: dividing a disk into sectors that the disk controller can read and write
- header info
- data
- ECC
- **Partition** the disk into groups of cylinders, each treated as a logical disk
- root partition: OS (mount掛載 at boot time)
- raw
- **Volumes** are created for management
- **Logical formatting**: making a file system
- To increase efficiency, most file systems group blocks into **clusters**
### Storage Attachment
- Computers access storages in 3 ways: **host-attached**, **network-attached**, **cloud**
- Host-attached: access through local I/O ports (e. g. USB)
- Network-attached:
- **Network-attached storage (NAS)**: storage made available over a network

- Cloud storage: accessed over the Internet or a WAN to remote data center

#### SAN
storage area network,擁有storage arrays(大量儲存容量和裝置),這些裝置之間有自己的獨立網路,常用FC光纖(實體的線路)來實現。server和SAN之間有使用額外的網路來處理儲存要求(像是上圖中的cloud機制)。
#### net
- WAN wide area network, 在許多分散地區可以共用的網路,可以是私人或公用
- LAN local area network, 區域網路
- internet, 一種WAN,全球公用
- ethernet, 乙太網路,一種通信協議,通常使用在LAN,WAN有時也會用到
### Redundant Array of Inexpensive Disks (RAID)
概念是:儲存很多份一樣的內容,這樣其中的壞掉不會找不回資料,還可以趕快去維修壞掉的那塊
- *Provide reliability via redundancy*
- Increase the mean time to data loss
- **Mean time between failures (MTBF)**: mean time for data to be *completely* lost from all disks
- **Mean time to repair (MTTR)**: mean time it takes to replace a failed drive (and restore the data)
- **Mean time to data lost (MTTDL)**: $= \frac{\text{MTBF}^N}{N \cdot \text{MTTR}^{N-1}}$
- Problems:
- Does not always assure that data are available
- fix: **pooled storage**, 動態分配記憶體,不限區域功能
- Lack flexibility 不太能從RAID a(某個數字)換成RAID b(另一個數字)
**!!注意!!** **object storage** 存的資料以object為單位,有分用戶導向和電腦導向的,用戶導向會有很多分成樹狀結構讓用戶去找,電腦導向就直接是一個storage pool去放那些物件,直接用ID去找
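MTTDL 公式可以直接代入計算(數字僅為示意,非任何實際硬碟的規格):

```python
def mttdl(mtbf, mttr, n):
    """MTTDL = MTBF^N / (N * MTTR^(N-1)),單位與 mtbf/mttr 相同。"""
    return mtbf ** n / (n * mttr ** (n - 1))
```

例如 mirrored pair(N = 2)、單碟 MTBF 100,000 小時、MTTR 10 小時:MTTDL 為 5 億小時,redundancy 大幅拉長了資料遺失前的平均時間。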
#### Parallelism
- With **mirroring / shadowing**, the read requests can be handled at doubled rate
- Data striping
- **Bit-level (/Block-level) striping**: splitting the bits (/blocks) of each byte (/file) across multiple drives
#### RAID Levels
- **RAID 0**: non-redundent striping 平行化讀寫
- **RAID 1**: keeps replica of each disk by *mirroring* or *shadowing* 增加容錯性
- **RAID 4(block interleaved parity)**: uses much less redundancy,只放在其中一個block上面
- **RAID 5(block interleaved distributed parity)**: 和RAID 4類似,只是把P分散放到所有其他的drives上(避免唯一那個壞掉)
- **RAID 6**: P+Q(ECC) redundancy (重複放在很多個drives上面),也有多維的實作,每個drives做自己的工作,不過一種功能可以有多個drives擁有
- **RAID 1+0 (striped mirrors) / RAID 0+1 (mirrored stripes)** provide high performance and reliability(數字先的先做,兩者都至少要4個drives)
#### other feature
- **snapshot**: 最後一個更新前的視圖
- **replication**: background處理的備份
- **hot spare**: 未被使用的空間,當有disks failed時,直接替代不需等待替換 (hot swap)
## I/O system
### I/O hardware
#### 種類
- storage
- transmission, ex: network, cloud
- human-interface, ex: touch screen, brain interface
#### signals from I/O devices interface with computer
- **Port**: connection point for device 物理/網路接口
- **Bus**: daisy chain or shared direct access 總線,computer system內部的溝通管線

- **Controller (host adapter)**: operate ports, buses, devices,有下面兩種:
- integrated
- separate circuit board (host adapter)

#### Polling
- direct I/O instructions,不斷檢查是否有工作
- For each byte of I/O, repeat:
1. The host repeatedly reads the **busy bit** until it becomes clear
2. The host sets the **write bit** in the command register and writes a byte into the data-out register
3. The host sets the **command-ready bit**
4. When the controller notices that the command-ready bit is set, it sets the busy bit
5. The controller reads the command register and sees the write command. It reads the data-out register to get the byte and does the I/O to the device
6. The controller clears the command-ready bit, clears the error bit in the status register to indicate that the device I/O succeeded, and clears the busy bit to indicate that it is finished
- Step 1 is a *busy-wait* cycle
- inefficient if the device is slow

#### Interrupt-Driven I/O
- device 完成工作時以 interrupt 通知 CPU,取代 busy-wait;CPU 在等待期間可以執行其他工作
#### Interrupts
- polling happens in 3
- CPU checks the **interrupt-request line** after each instruction
- There are 2 lines: *maskable*(可被 CPU 暫時關閉、延後處理)and *non-maskable*(必須立刻處理)
- The **interrupt handler routines** receive interrupts
- **Interrupt vector**: a table that maps interrupts to handlers
#### DMA (direct memory access)
- avoid programmed I/O (一直仰賴CPU來使用I/O 會太慢)
- DMA controller 負責控制傳輸 (bypass CPU)
- OS會提供資訊給DMA,寫在 DMA command block
- grab bus from CPU(意味著這個期間CPU是不能使用的 **cycle stealing**),結束時會傳給CPU interrupt
- **DVMA**是操作在virtual memory 的版本,更快速

### Application I/O Interface
- Encapsulates device behaviors
- Allows an application to open a file on a disk *without knowing what kind of disk it is*

### I/O characteristics
#### transfer schedule
- **synchronous** 同步的
- **blocking**: process suspended
- **nonblocking**: I/O request 後會立即回傳目前可否進行,依照使用者的使用方式可能會導致busy waiting
- **asynchronous**: process runs while I/O executes,process會忘記自己有執行I/O過直到收到I/O的訊息or事件,才過去做那些跟I/O相關的東西

#### vector I/O
一次可以操作多個I/O operations
- 減少context switch和system call的時間
- 有些會實作atomicity 一次做完,不然可能會race condition
### kernel I/O subsystem

#### scheduling
- per-device queue
- fairness
- quality of service (IPQOS)

#### buffer
先把資料存在memory,因為一般常常出現的問題:
- device speed 不同
- device transfer size 不同(以不同容量作為單位儲存)
- copy semantics 傳資料過去的時候維持資料內容正確和完整性
- **double buffering**: 兩個複製的data
#### cache
用來加快訪問速度
#### spooling
當一個裝置只能一次處理一組資料,就將request排成隊列來管理,裡面的數據也會被放在一個專門存儲的區域
#### device reservation
需要使用allocate, deallocate額外去存取和釋放,要特別小心**deadlock**
#### Error handling
常見錯誤發生:
- disk read failed
- device unavailable
- transient短暫的 write failures
處理:
- retry read or write
- track error frequencies,停止使用一直發生錯誤的device
- return error number or code
- system error logs 系統錯誤日誌
#### I/O protection
有些user process可能會用非法的I/O破壞系統正常運作(可能是故意也可能是不小心的),處理方式:
- All I/O instructions defined to be privileged
- I/O must be performed via system calls: 使用**monitor mode**看I/O是否合法
- Memory-mapped and I/O port memory locations must be protected too

#### Power management
基本上I/O device會耗電,電腦本身當然也是,所以要看怎樣比較省電,特別是移動式裝置,像是筆電、平板、手機。
**!!注意!!** OS可以幫忙管理,cloud的使用上移動虛擬機(virtual machine) → 集中使用常被使用的server、被維修的server改到別的上面跑、平衡負載
Android的例子:
- **Component-level power management**:
- device tree,實際device的topology拓撲結構
- 管理components,如果某個tree branch裡面的devices都沒在用,就把branch關起來
- **Wake locks**: 防止device進入休眠(背景程式)
- **Power collapse**: 讓device進入深度睡眠,只保持一些最基本的功能來因應**external stimuli**(像是按鍵)
**advanced configuration and power interface (ACPI)**
- device discovery
- device management
- error management
- power management
### Transforming I/O Requests to Hardware Operations

**!!注意!!** 第二步驟選擇yes可能是已經有運算結果在buffer了,真正用到I/O設備是第5步
### STREAMS
組成:
- **head interfaces** with the user process
- 發出request之後block → **synchronous**
- **driver end** interfaces with the device
- zero or more **STREAM modules** between them, **STREAM modules**包含read queue和write queue,每一個都會有自己獨特的功能
- 不同modules處理各自功能 → **asynchronous**
**Flow control**: 如果queue太busy就可以叫上游慢一點或暫停

### Performance
I/O對performance影響蠻大的,以下為原因:
- Demands CPU to execute device driver, kernel I/O code
- Context switches
- Data copying
- Network traffic especially stressful

**!!Note!!** A network adapter connects the machine to a network; it implements communication over a LAN/WAN.
Ways to improve performance:
- [-] context switches
- [-] data copying
- [-] interrupts (large transfers, smart controllers, polling)
- [+] DMA
- [+] smarter hardware devices
- [+] balance (highest throughput)
- [+] move user-mode processes / daemons to kernel threads (avoids context switches)

## File system
### File Concept
- *A file is an abstract data type*
- *The information in a file is defined by its creator*
#### File Attributes
- Name
- for humans to identify files
- Identifier
- for the file system to identify files
- Type
- Location
- Size
- Protection
- Timestamps (recording when the data was produced) and user identification
- Extended file attributes
- **checksum**
- **character encoding**: the file may be encoded as a different byte sequence, so it needs decoding
- The information about all files are kept in the **directory** structure
#### File Operations
- Create / Delete
- Open / Close
- Read / Write
- Reposition (seek)
- Truncate: shrink the file's contents (O_TRUNC empties the file)
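The truncate operation can be seen directly from Python's `os` module (Unix semantics assumed; `truncate_demo` is an illustrative name): `os.truncate` shrinks a file to a given size, and opening with `O_TRUNC` discards its contents entirely.

```python
import os
import tempfile

def truncate_demo():
    """Show truncate behaviour; returns (size after truncate, size after O_TRUNC)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "w") as f:
        f.write("hello world")                    # 11 bytes on disk
    os.truncate(path, 5)                          # keep only the first 5 bytes
    size_after_truncate = os.path.getsize(path)
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)  # O_TRUNC empties the file
    os.close(fd)
    size_after_otrunc = os.path.getsize(path)
    os.remove(path)
    return size_after_truncate, size_after_otrunc
```

Running `truncate_demo()` returns `(5, 0)`: truncation to 5 bytes, then complete emptying.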
#### Opened Files
- **Open-file table**: contain information about *all* open files
- Information associated with an open file include:
- **File pointer**: records the last r/w location, *unique to each process*
- **File-open count**: number of processes opening this file
- Location of file
- Access rights
#### Locking Opened File
- **Shared lock**: can be acquired by **multiple** processes (**reader** lock)
- **Exclusive lock**: can be acquired by **only one** process (**writer** lock)
- **Mandatory** file locking: the OS prevents invalid accesses; no process may access a file that holds an exclusive lock
- **Advisory** file locking: developers are responsible for preventing invalid accesses; a locked file can still be accessed, so developers must check for locks themselves
**!!Note!!** **file-level sharing** locks only part of a file's content, so other processes can still operate on the rest of the file
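A sketch of shared vs. exclusive advisory locks using POSIX `flock` (Unix-only). The locks are advisory: they only constrain processes that also call `flock`. Here two descriptors take the shared (reader) lock, and a non-blocking exclusive (writer) lock is then refused; `lock_demo` is an illustrative name.

```python
import fcntl
import os
import tempfile

def lock_demo():
    """Return True iff an exclusive lock is refused while shared locks are held."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    r1, r2, w = open(path), open(path), open(path, "r+")
    fcntl.flock(r1, fcntl.LOCK_SH)       # first reader takes a shared lock
    fcntl.flock(r2, fcntl.LOCK_SH)       # a second shared lock is allowed
    try:
        # a writer asks for an exclusive lock without blocking
        fcntl.flock(w, fcntl.LOCK_EX | fcntl.LOCK_NB)
        refused = False
    except OSError:
        refused = True                   # readers still hold the file
    for f in (r1, r2, w):
        f.close()                        # closing releases flock locks
    os.remove(path)
    return refused
```

Each `open()` creates a separate open file description, so `flock` treats the three descriptors as independent lock holders even within one process.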
### Access Method
- Sequential Access: read the file in order, one record after another

- Direct (Relative) Access: the file is composed of fixed-length **logical records** that can be read directly by record number
- secondary index file: an index for the index file
- ISAM: indexed sequential-access method
- combines index files and relative files
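Direct access amounts to a seek computed from the record number. A small sketch, assuming 16-byte fixed-length records (`RECORD`, `read_record` and `demo` are illustrative names, not a real file-system API):

```python
import os
import tempfile

RECORD = 16  # fixed logical record size in bytes (assumed for illustration)

def read_record(path, n):
    """Direct (relative) access: seek straight to record n and read it."""
    with open(path, "rb") as f:
        f.seek(n * RECORD)          # jump to record n without scanning
        return f.read(RECORD)

def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:
        for i in range(4):          # write records 0..3, padded to RECORD bytes
            f.write(f"record-{i}".encode().ljust(RECORD, b"."))
    rec = read_record(path, 2)      # fetch record 2 directly
    os.remove(path)
    return rec
```

The cost of reaching record $n$ is one seek, independent of $n$, which is what distinguishes direct access from sequential access.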

### Directory Structure
Directories are used to organize files
- **Single-level directory**: directories can only contain normal files
- **Two-level directory**
only two levels: the top level holds one directory per user
- path name: user name + file name
- search path: the sequence of directories searched when opening a file
- **Tree-structured directory**: directories can contain directories
- **Absolute path name**: path name starting from root
- **Relative path name**: path name starting from current directory
- **Acyclic-Graph directory**: directories can share child directories (a file can have multiple parents, but no cycles may form)
- **General-Graph directory**: directories can form cycles
- when a file is no longer used, a cycle can keep its **reference count** above 0 → **garbage collection** deletes such files
#### hard V.S. soft link
| Property | Hard link | Soft link |
| ----------------- | ---------------------------- | ----------------------------------- |
| Definition | points to the same file data | points to a file/directory path |
| Disk space | no extra space needed | uses a small amount of space |
| Cross file system | N | Y |
| Target | files only | files and directories |
| Deleting the file | other links unaffected | link dangles once the original goes |
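The last table row can be observed directly on a Unix system (`link_demo` is an illustrative name): after the original name is removed, the hard link still reaches the data, while the symlink dangles.

```python
import os
import tempfile

def link_demo():
    """Contrast hard and soft links after the original file is deleted."""
    d = tempfile.mkdtemp()
    orig = os.path.join(d, "orig.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")
    with open(orig, "w") as f:
        f.write("data")
    os.link(orig, hard)      # hard link: a second name for the same inode
    os.symlink(orig, soft)   # soft link: a file that stores the path string
    same_inode = os.stat(orig).st_ino == os.stat(hard).st_ino
    os.remove(orig)          # delete the original name
    hard_ok = open(hard).read() == "data"  # data still reachable via hard link
    soft_ok = os.path.exists(soft)         # dangling symlink -> False
    return same_inode, hard_ok, soft_ok
```

The hard link shares the inode (same `st_ino`), which is why the data survives; the symlink only stored a now-invalid path.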
#### find cycle

- **Floyd's Cycle-Finding**: tortoise and hare — one fast pointer, one slow pointer
- Mark without an extra flag: add a temporary file `temp`; after following a link to the next node, repoint the traversed link to `temp`; if the traversal ever reaches `temp`, there is a cycle
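Floyd's tortoise-and-hare can be written against any successor function. A sketch where the directory graph is modelled as a dict (`has_cycle`, `succ` are illustrative names):

```python
def has_cycle(succ, start):
    """Floyd's cycle-finding: succ(n) returns the next node or None.
    Detects a cycle using O(1) extra space."""
    slow = fast = start
    while fast is not None and succ(fast) is not None:
        slow = succ(slow)            # tortoise: one step
        fast = succ(succ(fast))      # hare: two steps
        if slow == fast:             # they can only meet inside a cycle
            return True
    return False

# Link structure modelled as a dict: b -> c -> d -> b forms a cycle.
cyclic = {"a": "b", "b": "c", "c": "d", "d": "b"}
acyclic = {"a": "b", "b": "c", "c": None}
```

If there is a cycle, the hare laps the tortoise and they meet; if not, the hare reaches the end of the chain first.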
### Access-control list (ACL)
specifies user names and the types of access allowed for each user
- read/write/execute: each 3-bit group is written as a single digit
- chmod (owner) (group) (public) (file name)
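Each rwx triple is a 3-bit number read as one digit of the numeric mode. A small helper makes the conversion explicit (`mode_bits` is a hypothetical name, not a real chmod API):

```python
def mode_bits(spec):
    """Turn an 'rwxr-xr--' style string into the numeric mode, e.g. 0o754."""
    if len(spec) != 9:
        raise ValueError("expected 9 permission characters")
    mode = 0
    for ch in spec:
        mode = (mode << 1) | (0 if ch == "-" else 1)  # each character is one bit
    return mode

# rwx = 111b = 7, r-x = 101b = 5, r-- = 100b = 4  ->  chmod 754 <file>
```

So `rwxr-xr--` becomes `0o754` and `rw-r--r--` becomes `0o644`, the digits passed to `chmod`.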
### Memory-Mapped Files
- Simplify r/w operations to memory operations
- By associating a part of the virtual address space with the file
- data are written back when file is closed
- the changes can be lost if the system crashes
#### sharing
- COW (copy-on-write)
- in read-only mode, COW can create a private copy for each writer to modify
- with a shared mapping, everyone sees the modified file (the page the virtual addresses map to is changed)
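Python's `mmap` module shows the core idea: once a file is mapped, a plain memory store becomes a file write, with no explicit `write()` call (`mmap_demo` is an illustrative name; Unix semantics assumed).

```python
import mmap
import os
import tempfile

def mmap_demo():
    """Modify a file through a read/write memory mapping."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"hello world")
    with mmap.mmap(fd, 0) as m:   # map the whole file read/write
        m[0:5] = b"HELLO"         # a memory store == a file write
        m.flush()                 # push dirty pages back to the file
    os.close(fd)
    data = open(path, "rb").read()
    os.remove(path)
    return data
```

Without the explicit `flush()`, the write-back would happen when the mapping is closed, which is exactly the crash-loss window the notes mention.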

### Layered File System

- **File system**
- provides UI to storage
- **Logical file system**
- manages metadata information
- translates file name into inode number
- **File-organization module**
- translates logical block number to physical block number
- free space, disk allocation
- **File control block (FCB)**
- store file info
- hardware controller
- **Basic file system / block I/O subsystem**
- translates physical block number to device driver commands
- **Device drivers**
- translates device driver commands to hardware specific commands
### File-System Operations
- On-storage structures:
- **Boot control block**: contains info needed to boot OS
- boot block (UFS), partition boot sector (NTFS)
- **Volume control block**: contains volume details (blocks, free blocks, free block pointers etc.)
- superblock(UFS), master file table(NTFS)
- Directory structure per file system
- Per-file **file control block (FCB)** contains details about the file
- Typically contains: permissions, times, owner, group, size, data blocks
### In-Memory File System Structures
- We don't want the storage device to be accessed whenever we access file metadata.
- So we need to **mount** file systems, i.e. load the file system into memory.
- **Mount table**: stores file system mounts, mount points, file system types
- **System-wide open-file table**
- **Per-process open-file table**
### Block Allocation on Storage Devices
- **Contiguous allocation**: each file occupy a set of contiguous blocks on the device
- [+] minimal head movement for HDD
- [-] external fragmentation
- [-] dynamic storage-allocation problem
- **Linked allocation**: each file is a linked list of blocks on the device
- [+] no external fragmentation
- [-] slow random access on a file
- [-] requires more space for pointers
- **File-allocation table (FAT)**: a table containing the head of each file
- **Indexed allocation**: maintain a list of used blocks for each file
- [+] no external fragmentation
- [-] requires even more space than linked allocation
- Solution: **multilevel index**
- **Direct blocks**: pointers point directly to blocks
- **Single/Double/Triple indirect blocks**: 1/2/3 more layers than direct blocks
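The reach of a multilevel index is simple arithmetic. A sketch assuming classic UNIX-style figures — 4 KiB blocks, 4-byte block pointers, 12 direct pointers plus one single, one double and one triple indirect pointer (real file systems vary):

```python
BLOCK = 4096               # block size in bytes (assumed)
PTR = 4                    # bytes per block pointer (assumed)
PER_BLOCK = BLOCK // PTR   # pointers per indirect block: 1024

def max_file_size(direct=12):
    """Maximum file size for an inode with `direct` direct pointers plus
    one single, one double and one triple indirect pointer."""
    blocks = direct + PER_BLOCK + PER_BLOCK ** 2 + PER_BLOCK ** 3
    return blocks * BLOCK
```

With these numbers the triple indirect level dominates: $1024^3$ blocks of 4 KiB alone cover 4 TiB, so the total is just over 4 TiB.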
## W14
### Free-Space Management
- **Bit vector (free-space map)**: `bit[i] = 1` iff block $i$ is free
- [+] simple
- [-] the entire vector should be kept in main memory
- **Grouping**: stores the addresses of $n$ free blocks in the first free block
- so it is faster to find a large number of free blocks
- **Linked list**: link together all the free blocks
- [+] no waste of space
- [-] inefficient to traverse the list
- **Counting**: each entry consists of an address of a free block and the number of free contiguous blocks that follows
- [+] shorter list
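Finding a free block in the bit-vector scheme means scanning for the first nonzero word, then extracting its lowest set bit. A sketch assuming 8-bit words and bit 0 as the least-significant bit of the first word (`first_free` is an illustrative name):

```python
def first_free(bitmap):
    """Bit-vector free-space management: bit i is 1 iff block i is free.
    Return the index of the first free block, or -1 if none."""
    for word_idx, word in enumerate(bitmap):          # scan 8-bit words
        if word != 0:                                 # skip all-allocated words
            offset = (word & -word).bit_length() - 1  # lowest set bit
            return word_idx * 8 + offset
    return -1
```

Skipping whole zero words is what makes the scheme simple yet reasonably fast; real implementations scan machine words and use a hardware find-first-set instruction.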
### Volumes and Partitions
- A device can be sliced up into **partitions**
- A **volume** contains a file system and its files
- A volume can span one or more partitions

### File System Mounting
- **Mount point**: the location within the file structure where the FS is to be attached
- Steps:
1. The OS is given the device name and the mount point
2. The OS verifies that the device contains a valid FS
3. The OS notes in its directory structure that a FS is mounted
### Partitions
- Each partition can be either **raw** (having no FS) or **cooked** (having a FS)
- If a partition contains a FS that is bootable, then the partition also needs boot information
### Semantics
- **UNIX semantics**: writes to a file are visible to other users *immediately*
- **Session semantics**: writes to a file are visible to other users *after the session doing the write closes*
- **Immutable-shared-files semantics**: written content cannot be modified
### Stateful and Stateless File Servers
- **Stateful**: operations can depend on one another; the server must maintain information for a certain amount of time
- better efficiency
```
fid = open(fname)
read(fid, <>)
write(fid, <>)
close(fname)
```
- **Stateless**: each operation is independent
- easier to recover from crash
```
read(fname, <>)
write(fname, <>)
```
## W15
### Critical Section Problem
```
do{
entry section
critical section
exit section
remainder section
}while(1)
```
- In the **critical section** the process accesses shared data
- So when a process is in its critical section, no other processes should be in their critical sections
- **Critical section problem** is to design a protocol to solve this issue
### Solution Requirements
1. **Mutual exclusion (safety)**: only one process can be in its critical section at a time
2. **Progress (liveness)**: if some processes wish to enter their critical sections, only processes *not in their remainder sections* can decide which will enter its critical section next, and this decision must be done in finite time
3. **Bounded waiting (fairness)**: a process should get its turn after a finite number of turns
- Assumptions:
- each process executes at a nonzero speed
- but we can make no assumptions concerning the relative speed between processes
### Interrupt-Based Solution
- Disable interrupts during the critical section, so the process cannot be context switched
- Problems:
- The critical section can occupy CPU for a long time
- Some processes can starve
- On multi-CPU systems, need to disable interrupts on all CPUs
### Peterson's Solution
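Peterson's classic two-process algorithm uses a `flag` array (intent to enter) and a `turn` variable (who yields). A didactic sketch with two Python threads — note this relies on CPython's GIL making bytecode interleaving sequentially consistent; on real hardware the algorithm needs memory barriers:

```python
import threading

# Peterson's solution for two processes i and j = 1 - i:
#   flag[i] = True  -> process i wants to enter its critical section
#   turn    = j     -> process i politely lets the other go first
flag = [False, False]
turn = 0
count = 0            # shared data touched only inside the critical section

def worker(i, iters):
    global turn, count
    j = 1 - i
    for _ in range(iters):
        flag[i] = True               # entry section: announce intent
        turn = j                     # yield priority to the other process
        while flag[j] and turn == j:
            pass                     # busy-wait while the other is inside
        count += 1                   # critical section
        flag[i] = False              # exit section

t0 = threading.Thread(target=worker, args=(0, 500))
t1 = threading.Thread(target=worker, args=(1, 500))
t0.start(); t1.start(); t0.join(); t1.join()
# count == 1000: mutual exclusion held, no increment was lost
```

Setting `turn = j` before the wait is what provides both progress and bounded waiting: a process can be overtaken at most once before it gets in.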
## file system
- Comparison of the two staging mechanisms

| buffer | cache |
| ---------------------------------- | ---------------------------------------- |
| server side | client side |
| stores part of the disk's contents | stores the most frequently used contents |
| serves I/O | serves the CPU / processor |
## Reference
1. Silberschatz, A., Galvin, P. B., & Gagne, G. (2018). *Operating System Concepts* (10th ed.). John Wiley & Sons.