# OS Introduction + CPU Management

## Ch. 1 Introduction + I/O

> **The kernel is built around interrupts**

- What is an OS
    - No single definition
    - A program that acts as an intermediary between a user of a computer and the computer hardware
    - A resource allocator
    - A control program
    - *Everything a vendor ships when you order an operating system*
    - The one program running at all times on the computer is the kernel
- UNIX OS
    - POSIX: a standard for operating systems
    - A kernel + some system programs (coreutils + binutils)
    - Examples that follow this standard: UNIX, Linux
- Typical PC Organization
    - North bridge (memory hub): CPU, memory, PCIe devices (GPUs, NVMe SSDs, ...)
    - South bridge (I/O hub): PCI, USB, SCSI, SATA
- I/O operation
    - Polling
        - CPU is involved the whole time
        - Busy-wait cycle to wait for I/O from the device
        - CPU performs the data movement
        - e.g. GPIO
    - Interrupt
        - The CPU comes back to continue only after the I/O is done
        - Interrupt vector to dispatch the interrupt to the correct handler
        - Interrupt handler (ISR) receives interrupts
- Synchronous I/O vs. Asynchronous I/O
    - Sync I/O: blocking call, e.g. `read()`, `fread()`
    - Async I/O: non-blocking call, e.g. `write()`, `fwrite()`
    - e.g.
        - `fsync()`: makes file writes synchronous, guaranteeing the data has actually been written -> sync I/O
        - `aio_read()`: makes `read` asynchronous; used for prefetching — if the application predicts it will need some data soon, it can ask the OS to fetch it ahead of time (see the sketch at the end of this chapter)

### Interrupt

![image](https://hackmd.io/_uploads/rk8bOLB20.png)

- Common sources
    - Hardware: interrupt request (IRQ)
        - I/O completion, timer
        - Asynchronous interrupt
    - Software: trap, exception
        - Divide by 0, access violation, system call
        - Synchronous interrupt
- Interrupt handling
    - Suppose an I/O device raises an interrupt; the signal goes to the PIC (programmable interrupt controller)
    - The PIC signals the CPU
    - The CPU saves the PC and flags
    - The CPU looks up the IVT (interrupt vector table) to find the address of the ISR to jump to
    - The CPU jumps to that address and runs the ISR (interrupt service routine) to handle the event
    - Reaching `IRET` means the ISR is done; the PC and flags are restored and execution returns to the user program
- Some terms
    - ISR (interrupt service routine): the code that handles an interrupt
    - Interrupt vector: maps one interrupt to the address of one ISR
    - IVT (interrupt vector table): the collection of interrupt vectors; after receiving an interrupt, the CPU looks up the ISR address here and jumps to it
    - IRQ (interrupt request): an interrupt generated by hardware
    - PIC (programmable interrupt controller): receives interrupts from I/O devices first and forwards them to the CPU

### Direct Memory Access (DMA)

- Data movement of an I/O operation
    - Option 1: CPU moves data between main memory and I/O local buffers
        - Slow
        - Wastes CPU cycles
    - Option 2: CPU offloads data movement to DMA
- DMA
    - A piece of hardware
    - Used to transfer I/O data to memory quickly
    - ![image](https://hackmd.io/_uploads/B1VPBuphA.png)
    - The CPU tells the DMA controller the source, the destination, and how much to transfer
    - The DMA controller then performs the transfer directly, without CPU involvement
    - When the transfer finishes, the DMA controller sends an interrupt to the CPU so it can do the follow-up processing

### Multiprogramming

- Multiprogramming organizes processes so the CPU always has one to execute
- Timesharing (single-core case)
    - Switch the process running on the CPU frequently (≥ 100 Hz)
    - A process is switched out when
        - It issues a synchronous I/O
        - It has run too long
    - Because switching is so frequent, processes appear to run simultaneously
- Terminology
    - Multiprogramming: multiple processes are loaded into memory for execution
    - Multitasking: multiprogramming with overlapped task execution
    - Timesharing: multitasking + periodic switching among processes (adds the notion of time)

### Booting

- Bootstrap program
    - Typically stored in ROM or EEPROM, generally known as firmware
    - Initializes all aspects of the system
    - Also called the BIOS
- Boot-loading procedure
    - FFFF:0000 is the BIOS entry point; the BIOS then reads sector 0 of the disk (the MBR)
    - MBR (master boot record): loads the OS from disk
    - Boot Manager
    - Pre-OS
    - OS
    - ![image](https://hackmd.io/_uploads/BJ7AUCEl1e.png)
    - BIOS+MBR has been replaced by UEFI+GPT, a much more extensive specification
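Before moving on, here is a minimal sketch of the synchronous vs. asynchronous I/O distinction above, using the POSIX AIO interface. The file name `data.bin`, the buffer size, and the offsets are arbitrary assumptions; on Linux, link with `-lrt`.

```c
/* Minimal sketch: blocking read() vs. POSIX aio_read() prefetch. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Synchronous I/O: read() blocks until the data is available. */
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("sync read returned %zd bytes\n", n);

    /* Asynchronous I/O: aio_read() returns immediately; the kernel
     * fetches the next block while this process keeps computing. */
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = sizeof(buf);           /* prefetch the next block */
    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* ... do other useful work here ... */

    while (aio_error(&cb) == EINPROGRESS)
        ;                                  /* or block with aio_suspend() */
    printf("async read returned %zd bytes\n", aio_return(&cb));
    close(fd);
    return 0;
}
```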
## Ch. 2 System Structures

![image](https://hackmd.io/_uploads/ByyeAOR3A.png)

### System Program

- e.g. `cp`, `mv`, `ls`, ...
- UNIX = system programs + kernel
    - GNU coreutils + binutils
    - ![image](https://hackmd.io/_uploads/BkY1kKC2R.png)
- Programming-language support
    - Compilers, assemblers, debuggers and interpreters are sometimes provided

### User Interface

- Shell: the interface program between users and the kernel
    - CLI, GUI (yes, a GUI also counts as a shell)

### System Call

![image](https://hackmd.io/_uploads/S19oWt0nA.png)

|          | system call   | system API         | system program    |
| -------- | ------------- | ------------------ | ----------------- |
| **e.g.** | some assembly | `fork()`, `read()` | `cat`, `ls`, `cp` |

- The kernel can touch everything (e.g. memory), and a system call is executed via an interrupt, so we rarely invoke system calls directly; `fork()`, for example, is a system API
- Three most common system APIs
    - Win32 API for Windows
    - POSIX API for UNIX
    - Java API for the Java VM
    - ![image](https://hackmd.io/_uploads/rkp3zFChC.png)
        - `int` is an x86 instruction, used here to trigger interrupt number `2e`
- Dual-mode operation
    - Kernel mode & user mode
    - Supported by a hardware mode bit; allows the OS to protect itself against untrusted code
    - Some instructions are designated as privileged
    - An application invokes a system call by using a software interrupt to enter kernel mode
    - ![image](https://hackmd.io/_uploads/ByD1mkSgJe.png)
- System call parameter passing
    - Store them in registers
    - Store them in a block (table) in memory and pass the block's address in a register
    - Push them onto the program's stack
    - e.g.
        - ![image](https://hackmd.io/_uploads/BJlFvKR3R.png)
        - `mov eax, 4`: system call number (4: sys_write)
        - `int 0x80`: trigger the interrupt to enter the kernel
        - `mov ebx, 0`: the return value (exit status) argument
        - The write and the exit are each one system call
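To make the distinction between a system call and a system API concrete, the following minimal sketch (assuming Linux and glibc) writes to standard output once through the raw `syscall()` entry point and once through the `write()` wrapper.

```c
/* Minimal sketch (Linux/glibc assumed): a raw system call vs. its API wrapper. */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    const char *msg = "hello via raw syscall\n";

    /* Raw system call: pass the syscall number and arguments directly. */
    syscall(SYS_write, 1, msg, strlen(msg));

    /* System API: the libc wrapper sets up the same trap for us. */
    const char *msg2 = "hello via write()\n";
    write(1, msg2, strlen(msg2));

    return 0;
}
```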
### OS Design and Implementation

- Mechanism & policy
    - Mechanism: how to do it
    - Policy: what will be done (or how to choose)
    - e.g. disk I/O
        - Mechanism: how to read from and write to the disk?
        - Policy: which disk I/O operation should be performed first?

### OS Structure

- MS-DOS
    - A very early OS
    - Applications can touch memory and control hardware directly -> dangerous
    - Extremely fast
    - Still used today for motherboard firmware upgrades
    - ![image](https://hackmd.io/_uploads/ryWSx9Cn0.png)
- FreeRTOS
    - A real-time, embedded operating system
    - The OS and the program are compiled together, and new programs cannot be downloaded
- UNIX
    - System programs + kernel
    - ![image](https://hackmd.io/_uploads/SJkXbcC2C.png)
    - Modules
        - Originally, new functionality could not be added without recompiling the kernel
        - Loadable kernel modules later made this possible: new features can be added by inserting a kernel module
- Microkernel
    - Moves most things into user space and keeps only the essentials (CPU scheduling, memory management, IPC)
    - Easy to extend: each function is a user process, so adding a user process extends the system
    - Easy to port the operating system to new architectures: the kernel itself is small and easy to port
    - Reliable and secure
    - Very slow: processes must communicate by message passing, plus context switches
    - ![image](https://hackmd.io/_uploads/HyXcVq03R.png)
- Mac OS
    - Mach (μ-kernel): memory management, RPC, IPC, message passing, thread scheduling
    - BSD: networking, file systems, POSIX APIs
    - ![image](https://hackmd.io/_uploads/B1wSLqAh0.png)
- Google Fuchsia
    - Based on the Zircon microkernel
    - Designed for IoT devices

### Virtual Machine

- Provides an interface identical to the underlying bare hardware
- Now popular because of cloud-based computation
- Different levels of virtualization exist
    - Virtualize the hardware, e.g. VMware
    - Virtualize the system calls, e.g. running WIN32 system calls on a Linux kernel
        - Approach: a WIN32 call is translated into POSIX and passed to the kernel
    - Namespaces
        - Virtualize the execution environment
        - e.g. Docker, virtualized file systems
- ![image](https://hackmd.io/_uploads/HJtQLLbAR.png)
- Each virtual machine is isolated from all other virtual machines
- VMware Architecture
    - ![image](https://hackmd.io/_uploads/BJw1PLWRC.png)
- Processor Management
    - ![image](https://hackmd.io/_uploads/SkyeuU-CA.png)
    - Only the VMM runs in kernel mode; everything else runs in user mode
    - When an application issues a system call, a trap is sent to the VMM
    - The VMM inspects it, chooses the appropriate guest OS, and forwards the instruction to it
    - When the guest OS finishes, the `IRET` goes back to the VMM
    - The VMM then returns to the application to let it know the call is done
    - Sensitive / privileged instructions
        - Instructions that can only be executed in kernel mode
        - If an application wants to execute one, it must first go through the VMM
- Java Virtual Machine (JVM)
    - The JVM's ISA differs from the underlying machine and kernel, so an interpreter is used to translate
    - ![image](https://hackmd.io/_uploads/SylH5IW0C.png)
    - e.g. Android

## Ch. 3 Process Concept

- A process uses the following context
    - Text section: executable binaries (e.g. functions)
    - Stack section: function arguments + local variables
    - Data section: global variables + heap
        - A global variable is treated differently depending on whether it has an initial value
        - Uninitialized global variables form the BSS
        - Heap: the source of dynamically allocated memory
    - Program counter and other CPU registers
    - e.g.
```cpp=
int i, j = 2;
int foo(int x) {
    int *y;
    char c;
    static char d = 'x';
    i = 0;
    y = (int*)malloc(100);
}
```
    - stack: `x`, `y`, `c`
    - data: `i`, `j`, `d`
    - text: `foo`
    - `y` points to a block of memory allocated from the heap
- Process state diagram
    - ![image](https://hackmd.io/_uploads/H1OhChPpR.png)
    - running -> ready
        - Case 1: the process runs out of its time quantum
        - Case 2: an I/O interrupt moves a higher-priority process from waiting to ready; it immediately enters running and preempts the process that was running
- Process Control Block (PCB)
    - Every process has one; it stores the process's information
    - Contents
        - PID
        - Process state
        - Saved CPU register values (the data it had while running on the CPU)
        - CPU scheduling info (e.g., priority)
        - Memory-management information (e.g. segment table and page-table base register)
        - I/O status info (e.g., opened files)
- Context Switch
    - The act of switching between ready and running
    - During a context switch the CPU cannot run any process
        - About 2000 ns on an Intel 5150
        - Where the time goes: saving & restoring data to/from the CPU, pipeline stalls, and cache effects
    - Steps
        1. Execute P0
        2. P0 is interrupted and gives up the CPU; the CPU is idle for the moment
        3. Perform the context switch: save P0's state into its PCB and load P1's PCB
        4. Execute P1
    - ![image](https://hackmd.io/_uploads/SJjptTPpA.png)
        - First save the PC (CS + IP) and FLAGS (PSW)
        - Then save the general registers
        - Finally store the stack pointer into the PCB (process control block)

### Process scheduling

- ![image](https://hackmd.io/_uploads/BJIMh6PTR.png)
    - Everything in the lower queues is waiting
- Scheduler
    - Short-term scheduler (or CPU scheduler)
        - Selects which process should be executed next and allocates the CPU
        - Invoked frequently (ms) -> must be fast
    - Long-term scheduler (or job scheduler)
        - Selects which processes should be brought into the ready queue
        - Controls the degree of multiprogramming
        - Invoked in seconds or minutes
        - Process character: where the process spends most of its time
            - I/O-bound: spends more time on I/O
            - CPU-bound: spends more time on computation
        - A long-term scheduler must mix the two kinds of processes effectively to fully utilize resources
        - Most PCs actually have no long-term scheduler, because the user decides which programs run
        - e.g. MapReduce
    - Mid-term scheduler
        - When memory is full, decides which processes' memory should be swapped out
        - ![image](https://hackmd.io/_uploads/rJzSApPpC.png)
        - Nowadays replaced by virtual memory (a process's memory is split into pages, and pages are evicted according to usage)

### Operation on Process

#### Process creation

- `fork` system call creates a new process. The child is an exact copy of the parent
- `exec` system call is used after a fork to replace the process's memory space with a new program
- ![image](https://hackmd.io/_uploads/ryvNk0D60.png)
- ![image](https://hackmd.io/_uploads/r13TyS300.png)
    - kthreadd: parent of the kernel daemons
    - pdflush: `write()` is asynchronous I/O; pdflush is the daemon that flushes this buffered data to disk in the background
- `fork()` -> `exec()` convention (see the sketch at the end of this subsection)
    - `fork` duplicates the current process; `exec` then replaces the memory image with a new program
    - `fork` does not really copy everything; it copies efficiently via memory mapping, handled by the MMU (Memory Management Unit, a piece of hardware)
    - Details: right after the copy, the child only maps to the parent's memory; only when the child modifies some block is a new block allocated for it. This technique is called Copy-On-Write (CoW)
    - If the CPU has no MMU, `vfork()` is used instead of `fork()`
        - It suspends the parent and runs the child directly in the parent's memory (on its stack); if the child then calls `exec`, a fresh memory image is allocated for it, the child's data is removed from the stack, and the stack is handed back to the parent, which resumes
```cpp=
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int x = 0;

int main() {
    pid_t pid;
    /* fork another process */
    pid = vfork();
    if (pid < 0) {          /* error occurred */
        fprintf(stderr, "Fork Failed");
        exit(-1);
    } else if (pid == 0) {  /* child process */
        x++;                /* often, here calls exec() */
        _exit(0);
    } else {                /* parent process */
        /* parent will wait for the child to complete */
        wait(NULL);
        printf("%d", x);    /* Output will be 1 */
        exit(0);
    }
}
```
- Multiprocess architecture example: web browser (Chrome)
    - Each tab is a process, so one tab crashing does not affect the others
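The following is a minimal sketch of the `fork()` -> `exec()` -> `wait()` convention described above: the parent forks, the child replaces itself with a new program via `execvp()`, and the parent collects the exit status. The program being run (`ls -l`) is just an arbitrary example.

```c
/* Minimal sketch of the fork() -> exec() -> wait() convention. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {                       /* fork failed */
        perror("fork");
        exit(1);
    } else if (pid == 0) {               /* child: replace the memory image */
        char *argv[] = {"ls", "-l", NULL};
        execvp("ls", argv);
        perror("execvp");                /* only reached if exec fails */
        _exit(127);
    } else {                             /* parent: reap the child */
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            printf("child exited with %d\n", WEXITSTATUS(status));
    }
    return 0;
}
```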
#### Process termination

- Two ways
    1. The process terminates itself with `exit()`; its return value must be collected by the parent
    2. The parent forcibly terminates the child by sending it a signal (SIGKILL)
- Zombie (defunct) process
    - A process has terminated but nobody has collected its return value
    - All of its data has been cleaned up, but an entry remains in the process table
- Orphan process
    - The parent terminates before the child does
    - On Linux the orphan is adopted by init (pid = 1), which waits for it and collects its return value
    - Application: using a double fork to avoid zombie processes
        - Fork twice and let the middle process exit, so the child is automatically adopted by init

#### Inter-Process Communication

![image](https://hackmd.io/_uploads/HJLdpSnA0.png)

##### Shared Memory

- Linux system calls
    - `shmget()`: create a block of shared memory
    - `shmat()`: attach shared memory to the current process's address space
    - `shmdt()`: detach shared memory
    - `shmctl()`: control shared memory (including deletion)
- Producer-consumer problem
    - The producer keeps adding items and the consumer keeps taking them; how do we use a buffer (shared memory) to synchronize the two?
    - Constraints
        - The buffer size is finite
        - Overwriting and null reading are not allowed
    - Solution: circular (ring) buffer
        - Two indices mark the `in` and `out` positions
        - `in == out`: empty
        - `(in + 1) % BUFFER_SIZE == out`: full
    - Still a problem: CPU cycles are wasted on busy-waiting -> requires proper synchronization (in Ch. 6)

##### Message Passing

- Processes cannot see each other's data
- A channel must be established
- Has a buffer
- Empty or full buffers are handled automatically (built-in synchronization)
- e.g. pipe (see the sketch at the end of this chapter)
    - The pipe is created and configured by the parent process
    - Each process should close the end of the pipe it does not use

##### Signal

- Another way for processes to communicate (but usually used to handle special conditions)
- A signal is to a process what an interrupt is to a CPU
- Sync vs. async
    - Sync
        - Triggered by the process itself
        - Divide overflow, memory-access violations, ...
        - e.g. SIGSEGV (memory protection fault), SIGFPE (arithmetic fault)
    - Async
        - e.g. SIGKILL, SIGSTOP, SIGCHLD (child terminated)
- Signal handler: a function that specifies how to handle a given signal when it arrives
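A minimal sketch of parent-child communication over a pipe, illustrating the convention above that each side closes the end it does not use; the message text is arbitrary.

```c
/* Minimal sketch: parent writes into a pipe, child reads from it. */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd[2];                     /* fd[0]: read end, fd[1]: write end */
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                /* child: reader */
        close(fd[1]);              /* close the unused write end */
        char buf[64];
        ssize_t n = read(fd[0], buf, sizeof(buf) - 1);
        buf[n > 0 ? n : 0] = '\0';
        printf("child got: %s\n", buf);
        close(fd[0]);
        _exit(0);
    } else {                       /* parent: writer */
        close(fd[0]);              /* close the unused read end */
        const char *msg = "hello through the pipe";
        write(fd[1], msg, strlen(msg));
        close(fd[1]);              /* reader sees EOF after this */
        wait(NULL);
    }
    return 0;
}
```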
## Ch. 4 Thread Programming

![image](https://hackmd.io/_uploads/r1k0E9pR0.png)

- A thread is also called a lightweight process: creating a thread is faster, and context switching between threads is also faster

### Multithreading Models

- Thread Library
    - Pthreads (P for POSIX), Win32 threads, Java threads
    - POSIX also standardizes the thread-related APIs of the operating system (a small Pthreads sketch appears at the end of this chapter)
    - However, the Pthreads spec does not define the threading model (for flexibility), meaning Pthreads can be implemented as 1-1, M-1, or M-M
- User thread vs. kernel thread
    - User thread: a thread created through the API
    - Kernel thread: not an object that literally carries this name; the term is only for understanding. It refers to the unit that can receive CPU time from the CPU scheduler
- Mapping of user threads to kernel threads
    - Many-to-One
    - One-to-One
    - Many-to-Many
    - ![image](https://hackmd.io/_uploads/HyWlEayJJe.png)
- Many-to-One
    - Many user threads map to one kernel thread
    - When the CPU scheduler schedules this process, only one kernel thread is available; which user thread runs inside the process is decided by the thread library
    - Pros & cons
        - Portability: works even if the kernel does not support multithreading, and is not limited to POSIX systems
        - If one thread blocks (e.g. calling `read()`), all threads block, because from the kernel's point of view there is only one thread and it is in the middle of the read
            - In practice some wrapper functions can work around this, e.g.
```cpp
ssize_t pth_read(int fd, void *buf, size_t nbytes)
```
        - No parallelism
    - e.g.
        - Solaris Green Threads (Java threads)
        - GNU Portable Threads: an implementation of the Pthreads specification
- One-to-One
    - Pros & cons
        - Requires kernel support; less portable
        - Parallelism is possible
        - A blocking (synchronous) call in one thread does not block the others
    - e.g. Windows NT/XP/2000, Linux Pthreads, Solaris 9 and later
- Many-to-Many
    - Rare nowadays
    - ![image](https://hackmd.io/_uploads/HyhHFayJ1e.png)
- Amdahl's Law (see Computer Organization for details)

### Threading issue

- `fork()` and `exec()` in a thread: undefined
- Signals
    - Synchronous signal: delivered to the thread that caused the signal, e.g. access violation
    - Asynchronous signal: typically delivered to the first thread that does not block the signal, e.g. process termination
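A minimal Pthreads sketch (One-to-One on Linux): the main thread creates two workers with `pthread_create()` and waits for them with `pthread_join()`; the worker function and its return values are arbitrary.

```c
/* Minimal Pthreads sketch: create two threads and join them.
 * Compile with: gcc demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    long id = (long)arg;
    printf("worker %ld running\n", id);
    return (void *)(id * 10);          /* value collected by pthread_join */
}

int main(void) {
    pthread_t tid[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    for (long i = 0; i < 2; i++) {
        void *ret;
        pthread_join(tid[i], &ret);    /* blocks until thread i finishes */
        printf("worker %ld returned %ld\n", i, (long)ret);
    }
    return 0;
}
```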
## Ch. 5 CPU Scheduling

![image](https://hackmd.io/_uploads/Byelm0ykyx.png)

- I/O wait: the fraction of time spent waiting for I/O
    - $p$: I/O wait; $n$: degree of multiprogramming (CPU utilization $= 1 - p^n$)
- CPU scheduling decisions may take place when a process:
    1. Switches from running to waiting state (sync I/O, `sleep()`, `yield()`)
    2. Switches from running to ready state (context switch)
    3. Switches from waiting to ready (when a high-priority process becomes ready, the scheduler runs and dispatches it to the CPU)
    4. Terminates
- Scheduling under only 1 and 4 is non-preemptive or cooperative
    - Easy to implement
    - e.g. Windows 3.1, old versions of Mac OS, sensor OSes (e.g., TinyOS 1.0)
- All other scheduling is preemptive
    - Higher responsiveness
    - Higher context-switch overheads
    - Must deal with race conditions (in Ch. 6)
- Dispatcher
    - The scheduler decides who runs; the dispatcher performs the actual switch, which involves
        - Switching context
        - Switching from kernel mode to user mode
        - After the process changes, loading the new state: page-table pointer, program counter, ...
    - Dispatch latency: the time the dispatcher needs to stop one process and start another
        - Scheduling time
        - Interrupt re-enabling time
        - Context switch time

### Scheduling Algorithms

#### Criteria

- (+) CPU utilization – keep the CPU as busy as possible
- (+) Throughput – # of processes that complete their execution per time unit
- (-) Turnaround time – time to execute a particular process
- (-) Waiting time – time waiting in the ready queue
- (-) Responsiveness (interactivity) – time from the user sending input until the system delivers feedback

#### Algorithms

- **First-Come, First-Served (FCFS)**
    - ![image](https://hackmd.io/_uploads/BJ7mqQeJkx.png)
    - Absolutely fair
    - Convoy effect: short process behind long process
        - A quick job still has to wait a long time
        - e.g. an I/O-bound process cannot finish its short CPU burst quickly and hand the work back to the I/O device
            - Poor I/O utilization
            - Slow I/O responsiveness
- **Shortest-Job-First (SJF)**
    - Two schemes:
        - Non-preemptive: look at the burst time directly
        - Preemptive: look at the remaining time; if a newly arrived process is shorter than the current remaining time, it takes over
        - Starvation: a long process, even if it arrived early, may never complete if short processes keep arriving
    - Optimal average waiting time (compare the small sketch after this subsection)
    - Large waiting-time variation
    - In practice the actual burst time of a process cannot be known -> use prediction
        - e.g. exponential moving average
            - Predict the next burst from the previous burst times
            - $\tau_{n+1}=\alpha t_n + (1-\alpha)\tau_n$
    - The long-standing dilemma: efficiency vs. fairness
        - Solution: aging (raise priority as waiting time grows)
- **Priority Scheduling**
    - Convention: the smaller the number, the higher the priority
    - Problem: starvation
    - Solution: aging (the longer a process waits, the higher its priority)
- **Round Robin (RR)**
    - Time sharing: each process gets one time quantum $q$ (usually 10~100 ms); if it has not finished when the quantum expires, it is forced back to the ready queue
    - $q$ large = FIFO
    - $q$ small: high context-switch overhead
    - ![image](https://hackmd.io/_uploads/ryjnA8GJkg.png)
        - Large time quantum: better average turnaround (the time from a process's arrival to its completion)
        - Small time quantum: better interactivity
- **Multilevel Queue**
    - Split the ready queue into two (or more) queues, each managed with a different policy
        - Foreground (interactive): RR
        - Background (batch): FIFO
    - Scheduling is also needed between the queues, e.g.
        - Fixed priority: serve the foreground queue before the background queue -> starvation
        - Allocate different shares of CPU time to each queue
    - ![image](https://hackmd.io/_uploads/B1Z8ewf1Jx.png)
- **Multilevel Feedback Queue**
    - Promotion / demotion: move processes between queues
    - e.g.
        - ![image](https://hackmd.io/_uploads/HkGwNvzyyl.png)
        - In the top two queues, a process that uses up its quantum is demoted to the next queue; otherwise it is promoted
        - Lets I/O-bound processes get higher priority, improving I/O utilization and interactivity
        - Lets CPU-bound processes get a larger time quantum, lowering turnaround time and improving throughput
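The small sketch below compares the average waiting time of FCFS and non-preemptive SJF on a made-up set of CPU bursts (all arriving at time 0), showing the convoy effect and why SJF minimizes average waiting time.

```c
/* Small sketch: average waiting time of FCFS vs. SJF (non-preemptive),
 * all jobs arriving at time 0. The burst times are made-up example values. */
#include <stdio.h>
#include <stdlib.h>

static double avg_waiting(const int *burst, int n) {
    double wait = 0, total = 0;
    for (int i = 0; i < n; i++) {    /* waiting time of job i = sum of earlier bursts */
        total += wait;
        wait += burst[i];
    }
    return total / n;
}

static int cmp(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }

int main(void) {
    int fcfs[] = {24, 3, 3};         /* long job arrives first: convoy effect */
    int sjf[]  = {24, 3, 3};
    int n = 3;
    qsort(sjf, n, sizeof(int), cmp); /* SJF = run the shortest bursts first */

    printf("FCFS average waiting time: %.2f\n", avg_waiting(fcfs, n)); /* 17.00 */
    printf("SJF  average waiting time: %.2f\n", avg_waiting(sjf, n));  /* 3.00  */
    return 0;
}
```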
### Multiple Processor Scheduling

#### Multiprocessor Architecture

- Homogeneous architecture: all CPUs share global memory (**SMP**, Symmetric MultiProcessing) and contend for memory cycles via the common bus
    - Easy
    - Low process migration cost
    - Bad scalability: with many CPUs, contention for memory becomes severe
    - Cache coherence: every CPU has its own cache; with too many CPUs, keeping the caches consistent becomes expensive
- Heterogeneous architecture: CPUs have local memory, and a memory reference is either local or remote (**NUMA**, Non-Uniform Memory Access)
    - ![image](https://hackmd.io/_uploads/rkJlDPGJyx.png)
    - More common nowadays
    - Good scalability
    - High process migration cost
    - Common topologies: ring, mesh, hypercube*, ad hoc

#### Multiple Processor Scheduling Algorithm

![image](https://hackmd.io/_uploads/SkAoDDMy1g.png)

- Partitioned scheduling
    - Better suited to NUMA: no need to keep moving data around
    - More common
    - e.g. Linux
- Goals:
    - Processor affinity
        - Process migration cost (for NUMA): cache re-population, pipeline re-start, process data transfer
        - Ideally a process stays on one fixed CPU
    - Load balancing
    - The two goals conflict with each other

#### Hardware Multithreading

- A thread represents a logical processor, emulated by an independent set of registers
- Hides memory-stall time: while one thread waits on memory, the core can do other work
- ![image](https://hackmd.io/_uploads/Sk3a5Pz11e.png)
- ![image](https://hackmd.io/_uploads/BJtH3vGk1g.png)
- Two logical CPUs on the same core still share one physical CPU, so with two jobs an assignment of 1 0 1 0 is better than 1 1 0 0
- Linux strategy: distribute processes evenly, working from the bottom of the topology upward

### Real-Time Scheduling

- IEEE definition of real-time systems: *"A real-time system is a system whose correctness includes its response time as well as its functional correctness."*
- It does not have to be fast, but it must finish before the deadline
- Soft real-time: missing the deadline is tolerable, but finish as soon as possible, e.g. audio/video streaming
- Hard real-time: missing the deadline is catastrophic, e.g. avionics computers

#### Real-Time Scheduling Algorithm

- Supports preemption (a minimal Linux sketch appears at the end of this chapter)
- Processes have a new characteristic: they are periodic
    - The process must run once every fixed period
    - Usually deadline = period
- **Rate Monotonic Scheduling**
    - The shorter the period, the higher the priority
    - ![image](https://hackmd.io/_uploads/BJsZy_z1kg.png)
    - May miss deadlines
- **Earliest Deadline First Scheduling (EDF)**
    - The earlier the deadline, the higher the priority
    - ![image](https://hackmd.io/_uploads/HkJDJdzJyx.png)

### Thread Scheduling

- Light-Weight Process (LWP)
    - Abstraction for the mapping of user threads to kernel threads
    - An LWP usually maps 1-1 to a kernel thread
    - ![image](https://hackmd.io/_uploads/Hygvedz1ye.png)
- Two levels of scheduling
    - Local scheduling
        - How the thread library schedules user threads onto LWPs
        - Process contention scope (PCS)
    - Global scheduling
        - How the kernel schedules kernel threads onto CPUs
        - System contention scope (SCS)
- Pthread API
```cpp=
/* set the contention scope to PROCESS or SYSTEM */
pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
/* set the scheduling policy - FIFO, RR, or OTHER */
pthread_attr_setschedpolicy(&attr, SCHED_OTHER);
```
- Linux's native thread model is 1-1, so only `PTHREAD_SCOPE_SYSTEM` can be set

### Operating System Example

- Solaris
    - Priority-based, with RR within the same priority
    - ![image](https://hackmd.io/_uploads/HyUrzuGJJe.png)
        - Each class has its own priorities and algorithm
    - ![image](https://hackmd.io/_uploads/H1scGOf1ke.png)
        - time quantum expired: the priority a process gets after using up its quantum (demotion)
        - return from sleep: the priority a process gets after waking from sleep (promotion)
- Linux CFS Scheduling (2.6.23+)
    - Virtual runtime (vruntime)
        - Each process has its own clock, and the clocks may advance at different speeds
        - Always pick the process with the smallest vruntime to run
        - Also called the virtual clock algorithm
    - Nice value: -20 ~ +19 (priority: high ~ low)
        - ![image](https://hackmd.io/_uploads/BkdiB_zkkg.png)
        - $\text{priority}=120+\text{nice value}$
    - Completely Fair Scheduler (CFS)
        - CFS favors I/O-bound processes, so their virtual clocks advance more slowly
        - CFS adjusts how fast vruntime grows based on the nice value and recent behavior (I/O-bound vs. CPU-bound)
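As a Linux-side illustration of the real-time scheduling classes discussed earlier in this chapter, here is a minimal sketch (assuming Linux; usually needs root or CAP_SYS_NICE) that switches the calling process to the SCHED_FIFO policy with `sched_setscheduler()`; the priority value 50 is an arbitrary choice.

```c
/* Minimal sketch (Linux): put the calling process into the SCHED_FIFO
 * real-time class. Usually requires root privileges. */
#include <sched.h>
#include <stdio.h>

int main(void) {
    struct sched_param param = { .sched_priority = 50 };  /* arbitrary RT priority */

    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("now running under policy %d (SCHED_FIFO = %d)\n",
           sched_getscheduler(0), SCHED_FIFO);
    /* ... real-time work here; a SCHED_FIFO process runs until it blocks,
     * yields, or a higher-priority real-time process preempts it ... */
    return 0;
}
```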
## Ch. 6 Synchronization

- Background: concurrent reads and writes of shared data may lead to data inconsistency, especially with multiple processes. Earlier we used the two indices `in` and `out` to describe the space in the buffer; could we instead use a counter `count` to track how many items the buffer holds?

```cpp=
/* Producer Code */
while (true) {
    /* produce an item and put it in nextProduced */
    while (count == BUFFER_SIZE)
        ;   // do nothing
    buffer[in] = nextProduced;
    in = (in + 1) % BUFFER_SIZE;
    count++;
}

/* Consumer Code */
while (true) {
    while (count == 0)
        ;   // do nothing
    nextConsumed = buffer[out];
    out = (out + 1) % BUFFER_SIZE;
    count--;
    /* consume the item in nextConsumed */
}
```

- However, `count++` and `count--` each actually consist of three assembly instructions; because of context switches, preemptive scheduling, and so on, they may not execute in one step, so the result may not be what we expect (race condition)

```=
mov  ax, count
add  ax, 1
mov  count, ax
```

### Critical Section Problem

- The code segment in which a process accesses shared data is called its critical section
- Requirements for a solution
    - Mutual Exclusion: while one process is executing its critical section, no other process may execute its critical section
    - Progress: if a process wants to enter its critical section and nobody else is inside, it must be allowed to enter
        - This rules out solutions that pre-arrange a fixed entry order
        - e.g. with a fixed order P1 -> P2 -> P3, if P2's own logic rarely enters the critical section, P1 and P3 both get stuck
    - Bounded Waiting: there must be a bound on how long a process waits to enter its critical section; it cannot be left queueing forever

### Synchronization Hardware

#### Interrupt Disabling

- Masking interrupts prevents preemption and resolves the race problem
- Disable interrupts before the critical section and re-enable them afterwards
- Commonly used inside the kernel
- Disabling interrupts is a privileged instruction and cannot be used in user mode
- Cannot handle contention between different CPUs on a multiprocessor

#### Atomic Instructions

- Modern machines provide special atomic hardware instructions, i.e. instructions that cannot be interrupted
- On a multiprocessor, the memory accessed by such an atomic instruction is additionally locked for the duration of the instruction
- e.g.
    - Test a memory word and set its value (TSL)
    - Swap the contents of two memory words (XCHG)
- TestAndSet: lock a particular section or memory region

```cpp=
bool TestAndSet(bool *target) {
    bool rv = *target;
    *target = TRUE;
    return rv;
}

/*** P1 ***/
do {
    while (TestAndSet(&lock))
        ;  /* do nothing */
    // critical section
    lock = FALSE;
    // remainder section
} while (TRUE);

/*** P2 ***/
do {
    while (TestAndSet(&lock))
        ;  /* do nothing */
    // critical section
    lock = FALSE;
    // remainder section
} while (TRUE);
```

- Swap

```cpp=
/* In each process */
do {
    key = TRUE;
    while (key == TRUE)
        Swap(&lock, &key);
    // critical section
    lock = FALSE;
    // remainder section
} while (TRUE);
```

- TAS / SWAP
    - Can be used on a uniprocessor or multiprocessor, in user mode or kernel mode
    - Issues
        - Wastes CPU cycles (busy-waiting for the lock to be released)
        - The contention is stateless, so starvation is possible
        - Satisfies mutual exclusion and progress, but not bounded waiting

### Semaphores

- Semaphore `S`: an integer variable
    - Init: S >= 0
    - `wait()`
        - `S--`
        - running -> waiting (process state)
    - `signal()`
        - `S++`
        - waiting -> ready
    - Associated with a waiting queue

```cpp=
wait(S) {
    S--;
    if (S < 0) {
        // add this process to the waiting queue
        block();
    }
}

signal(S) {
    S++;
    if (S <= 0) {
        // remove a process P from the waiting queue
        wakeup(P);
    }
}
```

- The condition in `signal` is `<= 0`: S was negative before `S++`, meaning some process is blocked; now that there is room, P can be released
- Semaphore operations must be atomic
- The role of a semaphore is determined by its initial value
    - Mutex lock: init = 1
    - Sequencing or event: init = 0
    - Capacity control: init = capacity
- Mutex lock: `Semaphore mutex = 1`

```cpp=
/* P1 */                   /* P2 */
do {                       do {
    wait(mutex);               wait(mutex);
    // Critical section        // Critical section
    signal(mutex);             signal(mutex);
    // Remainder section       // Remainder section
} while (TRUE);            } while (TRUE);
```

- Suppose P1 enters first: `wait(mutex)` turns `mutex` into 0; if P2 then enters, `mutex` becomes -1 and P2 is blocked
- When P1 leaves the critical section and calls `signal()`, `mutex` becomes 0 and P2 is released
- Sequencing or event: `Semaphore synch = 0`

```cpp=
/* P1 */             /* P2 */
S1;                  wait(synch);
signal(synch);       S2;
```

- Capacity control: `Semaphore sem = capacity`

```cpp=
/* Pi, Pj, Pk, ... */
{
    // ...
    wait(sem);
    // ...
    signal(sem);
    // ...
}
```

- Deadlock
    - Let `S` and `Q` be two semaphores initialized to 1

```cpp=
/* P0 */                   /* P1 */
wait(S);                   wait(Q);
wait(Q);                   wait(S);
// Critical section        // Critical section
signal(Q);                 signal(S);
signal(S);                 signal(Q);
```

- P0 finishes `wait(S)`, then a context switch lets P1 execute `wait(Q)`; now each process waits for the semaphore the other holds, and both are stuck
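The pseudo-code above has a direct POSIX counterpart. The following minimal sketch uses an unnamed `sem_t` with initial value 1 as a mutex around a shared counter; the thread count and iteration count are arbitrary.

```c
/* Minimal sketch: a POSIX semaphore (initial value 1) used as a mutex.
 * Compile with: gcc demo.c -lpthread */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t mutex;
static long count = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);     /* wait(): may block if another thread holds it */
        count++;              /* critical section */
        sem_post(&mutex);     /* signal(): release and wake a waiter, if any */
    }
    return NULL;
}

int main(void) {
    sem_init(&mutex, 0, 1);   /* 0 = shared between threads, initial value 1 */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("count = %ld (expected 200000)\n", count);
    sem_destroy(&mutex);
    return 0;
}
```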
### Classic Problems of Synchronization

#### Bounded-Buffer Problem

- Semaphore `mutex` initialized to 1
    - Protects the buffer
    - Allows many consumers and producers

:::warning
How do we know whether an item can be taken / added? A: after successfully entering the buffer, check whether it is empty / full.
Issue: if the buffer is empty and a consumer enters first, it must block and wait for the producer to add an item before continuing -> busy waiting
:::

- Semaphore `full` initialized to 0
    - Counts filled slots; 0 means empty (the consumer waits on it)
- Semaphore `empty` initialized to N
    - N free slots

```cpp=
/*** Producer ***/
do {
    // produce an item
    wait(empty);
    wait(mutex);
    // add the item to the buffer
    signal(mutex);
    signal(full);
} while (true);

/*** Consumer ***/
do {
    wait(full);
    wait(mutex);
    // remove an item from the buffer
    signal(mutex);
    signal(empty);
    // consume the removed item
} while (true);
```

#### Readers and Writers Problem

> Description: a data set is shared among a number of concurrent processes
> - Readers: can only read the shared data
> - Writers: can both read and write
> - Allow multiple readers to read at the same time.
> - Only a single writer can access the shared data at the same time.
> - No reader waits unless a writer has already locked the shared object (once one reader is in, subsequent readers can keep entering)
> - Readers need not synchronize with each other

- Semaphore `wrt` initialized to 1
- Semaphore `mutex` initialized to 1 (to protect `readcount`)
- Integer `readcount` initialized to 0
    - Handles the case of the first reader

```cpp=
/* Writer */
do {
    wait(wrt);
    // writing is performed
    signal(wrt);
} while (true)

/* Reader */
do {
    wait(mutex);
    readcount++;
    if (readcount == 1)
        wait(wrt);
    signal(mutex);
    // reading is performed
    wait(mutex);
    readcount--;
    if (readcount == 0)
        signal(wrt);
    signal(mutex);
} while (true)
```

- Writers may starve (the second readers-writers problem)
    - Writer priority
    - When a writer wants to enter, readers arriving later must not update `readcount`, so an extra lock is placed outside the read mutex: `readTry`
    - `writecount` decides when `readTry` should be locked / unlocked while writers are using the resource
    - `wmutex` protects `writecount`

#### Dining-Philosophers Problem

> Description: five philosophers sit around a circular table, each doing one of two things: eating or thinking. While eating they stop thinking, and while thinking they stop eating. To eat, a philosopher needs both forks (chopsticks) next to them. The problem is **to design a set of rules so that, with no communication at all (no philosopher knows when another may want to eat or think), every philosopher can keep alternating between the two states forever.**

```cpp=
do {
    wait(chopstick[i]);
    wait(chopstick[(i + 1) % 5]);
    // eat
    signal(chopstick[i]);
    signal(chopstick[(i + 1) % 5]);
    // think
} while(true);
```

- May lead to deadlock (every philosopher holds their left chopstick)
- Fixes
    - Allow fewer than 5 philosophers to try to eat at once (just add a semaphore)
    - Have one philosopher pick up the left chopstick first while everyone else picks up the right one first
    - Pick up both chopsticks at the same time (only when both are available)

#### Sleeping Barber Problem

> Description: a barbershop has $n$ chairs and one barber
> - If there are no customers, the barber sleeps
> - If all chairs are taken, a customer leaves immediately
> - If there is a free chair and the barber is busy, the customer sits down and waits
> - If there is a free chair and the barber is sleeping, the customer wakes the barber up

- Semaphore `customers` = 0
    - Customers that are waiting
    - The barber waits if `customers` = 0
- Semaphore `barber` = 0
    - The barber is ready
    - A customer waits if the barber is busy
- Semaphore `accessSeats` = 1 (locks `NumberOfFreeSeats`)
- int `NumberOfFreeSeats` = N

```cpp=
/* Customer */
while(1) {
    wait(accessSeats);            // mutex protecting the number of available seats
    if ( NumberOfFreeSeats > 0 ) {  // if any free seats
        NumberOfFreeSeats--;      // sit down on a chair
        signal(customers);        // notify the barber
        signal(accessSeats);      // release the lock
        wait(barber);             // wait if the barber is busy
        ...                       // cut hair
    } else {                      // there are no free seats
        signal(accessSeats);      // release the lock on the seats
        ...                       // customer leaves without a haircut
    }
}

/* Barber */
while(1) {
    wait(customers);              // wait for a customer; sleep if there is none
    wait(accessSeats);            // mutex protecting the number of available seats
    NumberOfFreeSeats++;          // one chair becomes free
    signal(barber);               // bring in a customer for a haircut
    signal(accessSeats);          // release the mutex on the chairs
    ...                           // cut hair
}
```

### Monitors

#### Mutex Locks

- `pthread_mutex_lock()`, `pthread_mutex_unlock()`
- Similar to a semaphore with initial value 1
- A mutex can only be unlocked by the thread that has locked the mutex

#### Monitor

- A monitor can be thought of as a special class (a high-level language construct). Each monitor has local variables that could suffer race conditions (the shared data in the figure below) and must be protected; these variables can only be manipulated through the monitor's methods.
- ![image](https://hackmd.io/_uploads/SkSqrs2ykg.png)
- Condition variable
    - Two operations
        - `x.wait()`
        - `x.signal()`: if some process is waiting, wake it up; otherwise it has no effect (unlike a semaphore, where `signal()` still increments the count)
    - Inside a monitor, a condition variable is not a counter but a queue; `wait()` and `signal()` correspond to push and pop, so if the queue is empty, `signal()` does nothing
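A minimal Pthreads sketch of the monitor idea above: a mutex guards the shared state, and condition variables play the role of `x.wait()` / `x.signal()`. The one-slot buffer, the variable names, and the item count are arbitrary toy assumptions.

```c
/* Minimal sketch: monitor-style synchronization with a pthread mutex
 * and condition variables guarding a one-slot buffer.
 * Compile with: gcc demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static int slot, has_item = 0;

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= 3; i++) {
        pthread_mutex_lock(&lock);               /* enter the "monitor" */
        while (has_item)                         /* wait until the slot is free */
            pthread_cond_wait(&not_full, &lock);
        slot = i;
        has_item = 1;
        pthread_cond_signal(&not_empty);         /* x.signal(): wake a waiter, if any */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; i++) {
        pthread_mutex_lock(&lock);
        while (!has_item)                        /* re-check the condition after waking */
            pthread_cond_wait(&not_empty, &lock); /* x.wait(): sleep and release the lock */
        printf("consumed %d\n", slot);
        has_item = 0;
        pthread_cond_signal(&not_full);          /* tell the producer the slot is free */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```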
## Ch. 7 Deadlock

### System Model

- Resources $R_1,R_2,\dots,R_m$: CPU cycles, memory objects, I/O devices, ...
- Each resource type $R_i$ has $W_i$ instances, e.g. DMA channels
- Each process utilizes a resource as follows
    1. Request
    2. Use
    3. Release
- Resource Allocation Graph
    - ![image](https://hackmd.io/_uploads/S1vv5o3kyl.png)
    - ![image](https://hackmd.io/_uploads/S1BucshyJe.png)

### Deadlock Characterization

> If a deadlock occurs, all of the following conditions hold

- Mutual exclusion
- Hold and wait: a process holding some resources is waiting for resources held by other processes
- No preemption: a resource cannot be taken away; it is released only when the holding process gives it up voluntarily
- Circular wait: $P_0$ is waiting for a resource held by $P_1$, $P_1$ is waiting for one held by $P_2$, ..., and $P_n$ is waiting for one held by $P_0$
- In the resource allocation graph
    - ![螢幕擷取畫面 2024-10-16 105518](https://hackmd.io/_uploads/BJETji3kkg.png)
    - Resources with multiple instances: Deadlock $\Rightarrow$ Cycle
    - Resources with single instances: Deadlock $\Leftrightarrow$ Cycle
    - If the graph contains no cycle $\Rightarrow$ no deadlock
    - If the graph contains a cycle $\Rightarrow$
        - If the resources have single instances, then deadlock
        - If the resources have multiple instances, then **possible** deadlock

### Handle Deadlock

- Guarantee the system never enters a deadlock state (prevention or avoidance), e.g. RTOS
- Allow the system to enter a deadlock, but be able to recover (detection and recovery)
- Just ignore it... (the approach taken by most OSes)

### Deadlock Prevention

- Break at least one of the four conditions required for deadlock to occur
- No mutual exclusion: impossible
- No hold and wait: low concurrency
- Allow preemption (break the no-preemption condition)
    - When a process holding $R_i$ (called the victim) is waiting for another resource, $R_i$ is taken away whenever another process needs it
    - The victim is restarted when $R_i$ becomes available again
    - Requires a checkpointing mechanism -> computationally expensive
- No circular wait
    - Impose a total ordering or a partial ordering of allocation on resources
    - e.g. if $R_1\rightarrow R_2$ is allowed, then $R_2\rightarrow R_1$ must not exist
    - Implemented by the programmer

### Deadlock Avoidance

- Predict deadlock and never let processes enter an unsafe state

#### Avoidance with single-instance resources

- Relabel the edges of the resource-allocation graph
    - Claim edge: may use the resource at some time
    - Request edge: is requesting the resource
    - Assignment edge: is holding the resource
- Whenever a process issues a request, check whether adding that edge creates a cycle in the graph; if it does, the process must wait
- In commodity OSes, deadlock management has been dropped; it is the programmer's responsibility to write deadlock-free programs
- But in real-time OSes (RTOS), there are still resource-synchronization protocols to avoid deadlocks
    - e.g. the Highest Locker's Protocol in an RTOS
        - When many processes may lock the same resource, the process that successfully acquires it has its priority raised to the highest priority among those processes; once it releases the resource, its priority is restored
        - A mutex can only be unlocked by the one who locked it
        - ![image](https://hackmd.io/_uploads/BkG2x1ElJl.png)
            - The upper half shows a deadlock
            - Lower half: after L acquires the purple resource it raises its priority to H's, so when H arrives it cannot preempt L; when L releases the purple resource it drops back to its original priority, and only then does H start running
            - Avoid creating a cycle: let the process that already holds the resource finish first

#### Avoidance with multiple-instance resources

- Safe/unsafe-state method
    - Safe state: definitely no deadlock
    - Unsafe state: deadlock may occur
    - ![image](https://hackmd.io/_uploads/SJw7PWVx1x.png)
    - ![image](https://hackmd.io/_uploads/HkO4wbNgJl.png)
    - ![image](https://hackmd.io/_uploads/Sy0SwZVlJx.png)
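The safe/unsafe-state check for multiple-instance resources is commonly implemented as the safety algorithm at the core of the Banker's algorithm. The sketch below uses made-up Allocation/Need matrices and an Available vector for 3 processes and 2 resource types; it reports whether the state is safe.

```c
/* Minimal sketch of the safety check used for deadlock avoidance with
 * multiple-instance resources. All matrix values are made-up example data. */
#include <stdbool.h>
#include <stdio.h>

#define P 3   /* processes */
#define R 2   /* resource types */

int main(void) {
    int available[R]     = {1, 1};
    int allocation[P][R] = {{1, 0}, {0, 1}, {1, 1}};
    int need[P][R]       = {{1, 1}, {1, 0}, {0, 1}};

    int work[R];
    bool finish[P] = {false};
    for (int j = 0; j < R; j++) work[j] = available[j];

    /* Repeatedly look for a process whose Need fits in Work; if found,
     * pretend it runs to completion and returns its allocation. */
    bool progress = true;
    while (progress) {
        progress = false;
        for (int i = 0; i < P; i++) {
            if (finish[i]) continue;
            bool fits = true;
            for (int j = 0; j < R; j++)
                if (need[i][j] > work[j]) { fits = false; break; }
            if (fits) {
                for (int j = 0; j < R; j++) work[j] += allocation[i][j];
                finish[i] = true;
                progress  = true;
                printf("P%d can finish\n", i);
            }
        }
    }

    bool safe = true;
    for (int i = 0; i < P; i++)
        if (!finish[i]) safe = false;
    printf("state is %s\n", safe ? "SAFE" : "UNSAFE");
    return 0;
}
```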